Columbia Mountain Institute for Applied Ecology
A short course on regression methods.

These notes are a subset of a more complete set of notes
available at
http://www.stat.sfu.ca/~cschwarz/CourseNotes

C. J. Schwarz
Department of Statistics and Actuarial Science, Simon Fraser University
8888 University Drive
Burnaby, BC V5A 1S6
cschwarz@stat.sfu.ca

November 23, 2012
Contents

1 Correlation and simple linear regression
  1.1 Introduction
  1.2 Graphical displays
    1.2.1 Scatterplots
    1.2.2 Smoothers
  1.3 Correlation
    1.3.1 Scatter-plot matrix
    1.3.2 Correlation coefficient
    1.3.3 Cautions
    1.3.4 Principles of Causation
  1.4 Single-variable regression
    1.4.1 Introduction
    1.4.2 Equation for a line - getting notation straight (no pun intended)
    1.4.3 Populations and samples
    1.4.4 Assumptions
      Linearity
      Correct scale of predictor and response
      Correct sampling scheme
      No outliers or influential points
      Equal variation along the line
      Independence
      Normality of errors
      X measured without error
    1.4.5 Obtaining Estimates
    1.4.6 Obtaining Predictions
    1.4.7 Residual Plots
    1.4.8 Example - Yield and fertilizer
    1.4.9 Example - Mercury pollution
    1.4.10 Example - The Anscombe Data Set
    1.4.11 Transformations
    1.4.12 Example: Monitoring Dioxins - transformation
    1.4.13 Example: Weight-length relationships - transformation
      Using the Fit Special
      Using derived variables
      A non-linear fit
    1.4.14 Power/Sample Size
    1.4.15 The perils of R²
  1.5 A no-intercept model: Fulton's Condition Factor K
  1.6 Frequently Asked Questions - FAQ
    1.6.1 Do I need a random sample; power analysis
2 Detecting trends over time
  2.1 Introduction
  2.2 Simple Linear Regression
    2.2.1 Populations and samples
    2.2.2 Assumptions
      Linearity
      Scale of Y and X
      Correct sampling scheme
      No outliers or influential points
      Equal variation along the line
      Independence
      Normality of errors
      X measured without error
    2.2.3 Obtaining Estimates
    2.2.4 Obtaining Predictions
    2.2.5 Inverse predictions
    2.2.6 Residual Plots
    2.2.7 Example: The Grass is Greener (for longer)
  2.3 Transformations
    2.3.1 Example: Monitoring Dioxins - transformation
    2.3.2 Final Words
  2.4 Power/Sample Size
    2.4.1 Introduction
    2.4.2 Getting the necessary information
    2.4.3 How does power vary as information changes?
    2.4.4 Finally - how many years do I need to monitor?
    2.4.5 Summary of plans
  2.5 Testing for common trend - ANCOVA
    2.5.1 Assumptions
    2.5.2 Statistical model
    2.5.3 Example: Degradation of dioxin - pooling locations
    2.5.4 Change in yearly average temperature with regime shifts
  2.6 Dealing with Autocorrelation
    2.6.1 Example: Mink pelts from Saskatchewan
  2.7 Dealing with seasonality
    2.7.1 Empirical adjustment for seasonality
      General idea
      Example: Total phosphorus from Klamath River
    2.7.2 Using the ANCOVA approach
© 2012 Carl James Schwarz
      General idea
      Example: Total phosphorus levels on the Klamath River - revisited
    2.7.3 Fitting cyclical patterns
      General approach
      Example: Total phosphorus from Klamath River
      Example: Comparing air quality measurements using two different methods
    2.7.4 Further comments
  2.8 Seasonality and Autocorrelation
  2.9 Non-parametric detection of trend
    2.9.1 Cox and Stuart test for trend
    2.9.2 Non-parametric regression - Spearman, Kendall, Theil, Sen estimates
      Non-parametric does NOT mean no assumptions
      Example: The Grass is Greener (for longer) revisited
      Final Remarks
    2.9.3 Dealing with seasonality - Seasonal Kendall's τ
      Basic principles
      Example: Total phosphorus on the Klamath River revisited
      Final notes
    2.9.4 Seasonality with Autocorrelation
      General ideas
  2.10 Summary
3 Estimating power/sample size using Program Monitor
  3.1 Mechanics of MONITOR
  3.2 How does MONITOR work?
  3.3 Incorporating process and sampling error
  3.4 Presence/Absence Data
  3.5 WARNING about using testing for temporal trends
4 Regression - hockey sticks, broken sticks, piecewise, change points
  4.1 Hockey-stick, piecewise, or broken-stick regression
    4.1.1 Example: Nenana River Ice Breakup Dates
  4.2 Searching for the change point
    4.2.1 Change point model for the Nenana River Ice Breakup
  4.3 How NOT to search for a change point!
5 Analysis of Covariance - ANCOVA
  5.1 Introduction
  5.2 Assumptions
  5.3 Comparing individual regression lines
  5.4 Comparing Means after covariate adjustments
  5.5 Power and sample size
  5.6 Example - Degradation of dioxin
  5.7 Change in yearly average temperature with regime shifts
  5.8 Example - More refined analysis of stream-slope example
  5.9 Comparing Fulton's Condition Factor K
  5.10 Final Notes
6 Multiple linear regression
  6.1 Introduction
    6.1.1 Data format and missing values
    6.1.2 The statistical model
    6.1.3 Assumptions
      Linearity
      Correct sampling scheme
      No outliers or influential points
      Equal variation along the line
      Independence
      Normality of errors
      X variables measured without error
    6.1.4 Obtaining Estimates
    6.1.5 Predictions
    6.1.6 Example: blood pressure
  6.2 Regression problems and diagnostics
    6.2.1 Introduction
    6.2.2 Preliminary characteristics
    6.2.3 Residual plots
    6.2.4 Actual vs. Predicted Plot
    6.2.5 Detecting influential observations
      Cook's D
      Hats
      Caution
    6.2.6 Leverage plots
    6.2.7 Collinearity
  6.3 Polynomial, product, and interaction terms
    6.3.1 Introduction
    6.3.2 Example: Tomato growth as a function of water
    6.3.3 Polynomial models with several variables
    6.3.4 Cross-product and interaction terms
  6.4 The general linear test
    6.4.1 Introduction
    6.4.2 Example: Predicting body fat from measurements
    6.4.3 Summary
  6.5 Indicator variables
    6.5.1 Introduction
    6.5.2 Defining indicator variables
    6.5.3 The ANCOVA model
    6.5.4 Assumptions
    6.5.5 Comparing individual regression lines
    6.5.6 Example: Degradation of dioxin
    6.5.7 Example: More refined analysis of stream-slope example
  6.6 Example: Predicting PM10 levels
  6.7 Variable selection methods
    6.7.1 Introduction
    6.7.2 Maximum model
    6.7.3 Selecting a model criterion
      R²
      F_p
      MSE_p
      C_p and AIC
    6.7.4 Which subsets should be examined
      All possible subsets
      Backward elimination
      Forward addition
      Stepwise selection
      Closing words
    6.7.5 Goodness-of-fit
    6.7.6 Example: Calories of candy bars
    6.7.7 Example: Fitness dataset
    6.7.8 Example: Predicting zooplankton biomass
7 Logistic Regression
  7.1 Introduction
    7.1.1 Difference between standard and logistic regression
    7.1.2 The Binomial Distribution
    7.1.3 Odds, risk, odds-ratio, and probability
    7.1.4 Modeling the probability of success
    7.1.5 Logistic regression
  7.2 Data Structures
  7.3 Assumptions made in logistic regression
  7.4 Example: Space Shuttle - Single continuous predictor
  7.5 Example: Predicting Sex from physical measurements - Multiple continuous predictors
  7.6 Examples: Lung Cancer vs. Smoking; Marijuana use of students based on parental usage - Single categorical predictor
    7.6.1 Retrospective and Prospective odds-ratio
    7.6.2 Example: Parental and student usage of recreational drugs
    7.6.3 Example: Effect of selenium on tadpole deformities
  7.7 Example: Pet fish survival as function of covariates - Multiple categorical predictors
  7.8 Example: Horseshoe crabs - Continuous and categorical predictors
  7.9 Assessing goodness of fit
  7.10 Variable selection methods
    7.10.1 Introduction
    7.10.2 Example: Predicting credit worthiness
  7.11 Model comparison using AIC
  7.12 Final Words
    7.12.1 Two common problems
      Zero counts
      Complete separation
    7.12.2 Extensions
      Choice of link function
      More than two response categories
      Exact logistic regression with very small datasets
      More complex experimental designs
    7.12.3 Yet to do
8 Poisson Regression
  8.1 Introduction
  8.2 Experimental design
  8.3 Data structure
  8.4 Single continuous X variable
  8.5 Single continuous X variable - dealing with overdispersion
  8.6 Single Continuous X variable with an OFFSET
  8.7 ANCOVA models
  8.8 Categorical X variables - a designed experiment
  8.9 Log-linear models for multi-dimensional contingency tables
  8.10 Variable selection methods
  8.11 Summary
Chapter 1

Correlation and simple linear regression
1.1 Introduction
A nice book explaining how to use JMP to perform regression analysis is: Freund, R., Littell, R., and Creighton, L. (2003) Regression using JMP. Wiley Interscience.

Much of statistics is concerned with relationships among variables and whether observed relationships are real or simply due to chance. The simplest case deals with the relationship between two variables.

Quantifying the relationship between two variables depends upon the scale of measurement of each of the two variables. The following table summarizes some of the important analyses that are often performed to investigate the relationship between two variables.
7
CHAPTER 1. CORRELATION AND SIMPLE LINEAR REGRESSION<br />
Type of variables

                              X is Interval or Ratio, or        X is Nominal or Ordinal
                              what JMP calls Continuous

  Y is Interval or Ratio,     - Scatterplots                    - Side-by-side dot plot
  or what JMP calls           - Running median/spline fit       - Side-by-side box plot
  Continuous                  - Regression                      - ANOVA or t-tests
                              - Correlation

  Y is Nominal or Ordinal     - Logistic regression             - Mosaic chart
                                                                - Contingency tables
                                                                - Chi-square tests
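The table is essentially a lookup keyed on the measurement scales of Y and X. As a rough sketch of that lookup in code (the suggest_analyses helper and its scale labels are illustrative inventions for this note, not a JMP API):

```python
def suggest_analyses(y_scale, x_scale):
    """Suggest analyses for the relationship between a Y and an X variable.

    Scales are "continuous" (interval/ratio) or "categorical"
    (nominal/ordinal).  The mapping mirrors the two-way table above.
    """
    table = {
        ("continuous", "continuous"): ["scatterplot",
                                       "running median/spline fit",
                                       "regression", "correlation"],
        ("continuous", "categorical"): ["side-by-side dot plot",
                                        "side-by-side box plot",
                                        "ANOVA or t-tests"],
        ("categorical", "continuous"): ["logistic regression"],
        ("categorical", "categorical"): ["mosaic chart",
                                         "contingency tables",
                                         "chi-square tests"],
    }
    if (y_scale, x_scale) not in table:
        raise ValueError("scales must be 'continuous' or 'categorical'")
    return table[(y_scale, x_scale)]

print(suggest_analyses("continuous", "continuous"))
```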
In JMP these combinations of two variables are handled by the Analyze->Fit Y-by-X platform, the Analyze->Correlation-of-Ys platform, or the Analyze->Fit Model platform.
When analyzing two variables, one question becomes important because it determines the type of analysis that will be done: is the purpose to explore the nature of the relationship, or to use one variable to explain variation in another? For example, there is a difference between examining height and weight to see if there is a strong relationship, as opposed to using height to predict weight.

Consequently, you need to distinguish between a correlational analysis, in which only the strength of the relationship is described, and regression, in which one variable is used to predict the values of a second variable.
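The distinction shows up numerically as well: the correlation coefficient is symmetric in the two variables, while a least-squares slope depends on which variable is treated as the response. A small sketch (the height/weight numbers are invented for illustration, not taken from these notes):

```python
def pearson_r(xs, ys):
    """Sample correlation coefficient; symmetric in xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def ls_slope(xs, ys):
    """Least-squares slope when ys is regressed on xs; NOT symmetric."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Invented data: heights (cm) and weights (kg) of five people.
height = [150.0, 160.0, 165.0, 170.0, 180.0]
weight = [52.0, 58.0, 63.0, 68.0, 80.0]

print(pearson_r(height, weight))   # same value either way round
print(ls_slope(height, weight))    # kg per cm: weight regressed on height
print(ls_slope(weight, height))    # cm per kg: a different line entirely
```

One way to see that "predict Y from X" and "predict X from Y" are different questions: the product of the two slopes equals r², so the two fitted lines coincide only when the correlation is perfect.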
The two variables are often called either a response variable or an explanatory variable. A response variable (also known as a dependent or Y variable) measures the outcome of a study. An explanatory variable (also known as an independent or X variable) is the variable that attempts to explain the observed outcomes.
1.2 Graphical displays

1.2.1 Scatterplots
The scatter-plot is the primary graphical tool used when exploring the relati<strong>on</strong>ship between two interval or<br />
ratio scale variables. This is obtained in JMP using the Analyze->Fit Y-by-X plat<str<strong>on</strong>g>for</str<strong>on</strong>g>m – be sure that both<br />
variables have a c<strong>on</strong>tinuous scale.<br />
In graphing the relationship, the response variable is usually plotted along the vertical axis (the Y axis) and the explanatory variable is plotted along the horizontal axis (the X axis). It is not always perfectly clear which is the response and which is the explanatory variable. If there is no distinction between the two variables, then it doesn't matter which variable is plotted on which axis – this usually happens only when finding the correlation between variables is the primary purpose.
For example, look at the relationship between calories/serving and fat from the cereal dataset using JMP. [We will create the graph in class at this point.]
What to look for in a scatter-plot
Overall pattern. What is the direction of association? A positive association occurs when above-average values of one variable tend to be associated with above-average values of another; the plot will have an upward slope. A negative association occurs when above-average values of one variable are associated with below-average values of another variable; the plot will have a downward slope. What happens when there is "no association" between the two variables?
Form of the relationship. Does a straight line seem to fit through the 'middle' of the points? Is the relationship linear (the points seem to cluster around a straight line) or curvilinear (the points seem to form a curve)?
Strength of association. Are the points clustered tightly around the curve? If the points have a lot of scatter above and below the trend line, then the association is not very strong. On the other hand, if the amount of scatter about the trend line is very small, then there is a strong association.
Outliers. Are there any points that seem unusual? Outliers are values that are unusually far from the trend curve, i.e., further away from it than you would expect from the usual level of scatter. There is no formal rule for detecting outliers – use common sense. [If you set the role of a variable to be a label and click on points in a linked graph, the label for the point will be displayed, making it easy to identify such points.]
One's usual initial suspicion about any outlier is that it is a mistake, e.g., a transcription error. Every effort should be made to trace the data back to its original source and correct the value if possible. If the data value appears to be correct, then you have a bit of a quandary: do you keep the data point even though it doesn't follow the trend line, or do you drop it because it appears to be anomalous? Fortunately, with computers it is relatively easy to repeat an analysis with and without an outlier – if there is very little difference in the final outcome, don't worry about it.
In some cases, the outliers are the most interesting part of the data. For example, for many years the ozone hole over the Antarctic was missed because the computers were programmed to ignore readings that were so low that 'they must be in error'!
Lurking variables. A lurking variable is a third variable that is related to both variables and may confound the association.
For example, the amount of chocolate consumed in Canada and the number of automobile accidents are positively related, but most people would agree that this is coincidental: each variable is independently driven by population growth.
Sometimes the lurking variable is a 'grouping' variable of sorts. This is often examined by using a different plotting symbol to distinguish between the values of the third variable. For example, consider the following plot of the relationship between salary and years of experience for nurses. The individual lines show a positive relationship, but the overall pattern, when the data are pooled, shows a negative relationship.
It is easy in JMP to assign different plotting symbols (what JMP calls markers) to different points. From the Row menu, use Where to select rows, then assign markers to those rows using the Rows->Markers menu.
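The advice above about repeating an analysis with and without an outlier is easy to sketch outside JMP as well. A minimal Python/NumPy illustration (the numbers are invented, with the last point playing the suspect outlier):

```python
import numpy as np

# Invented data: a clean linear trend with one suspect final point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.2, 30.0])  # y = 30 looks anomalous

# Compare a summary statistic (here the sample correlation)
# with and without the suspect point.
r_with = np.corrcoef(x, y)[0, 1]
r_without = np.corrcoef(x[:-1], y[:-1])[0, 1]
```

Here dropping the point changes r substantially, so the point matters and its provenance should be checked; had the two answers been close, the advice above says not to worry about it.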
1.2.2 Smoothers
Once the scatter-plot is plotted, it is natural to try to summarize the underlying trend line. For example, consider the following data:
There are several common methods available to fit a line through these data.
By eye The eye has remarkable power for providing a reasonable approximation to an underlying trend, but it needs a little education. A trend curve is a good summary of a scatter-plot if the differences between the individual data points and the underlying trend line (technically called residuals) are small. As well, a good trend curve tries to minimize the total of the residuals, and the trend line should go through the middle of most of the data.
Although the eye often gives a good fit, different people will draw slightly different trend curves. Several automated ways to derive trend curves are in common use – bear in mind that the best ways of estimating trend curves try to mimic what the eye does so well.
Median or mean trace The idea is very simple. We choose a "window" width of size w, say. For each point along the bottom (X) axis, the smoothed value is the median or average of the Y-values for all data points with X-values lying within the window centred on this point. The trend curve is then the trace of these medians or means over the entire plot. The result is not exactly smooth. Generally, the wider the window chosen, the smoother the result; however, wider windows make the smoother react more slowly to changes in trend. Smoothing techniques are too computationally intensive to be performed by hand. Unfortunately, JMP is unable to compute the trace of data, but splines are a very good alternative (see below).
The mean or median trace is too unsophisticated to be a generally useful smoother. For example, the simple averaging causes it to under-estimate the heights of peaks and over-estimate the heights of troughs. (Can you see why this is so? Draw a picture with a peak.) However, it is a useful way of trying to summarize a pattern in a weak relationship for a moderately large data set. In a very weak relationship it can even help you to see the trend.
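Although JMP cannot compute the trace, the idea is easy to express directly. A minimal sketch in Python/NumPy (the function name and the choice of evaluation grid are my own):

```python
import numpy as np

def trace_smoother(x, y, window, grid=None, stat=np.median):
    """Median (or mean) trace: for each grid point, return stat() of the
    Y-values whose X-values lie within the window centred on that point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if grid is None:
        grid = np.linspace(x.min(), x.max(), 50)
    half = window / 2.0
    smoothed = np.array([stat(y[(x >= g - half) & (x <= g + half)])
                         for g in grid])
    return grid, smoothed
```

Widening `window` smooths the trace but, as noted above, makes it react more slowly to changes in trend (and flattens peaks and troughs).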
Box plots for strips The following gives a conceptually simple method which is useful for exploring a weak relationship in a large data set. The X-axis is divided into equal-sized intervals, and separate box plots of the Y-values are found for each strip. The box-plots are plotted side-by-side and the means or medians are joined. Again, we are able to see what is happening to the variability as well as the trend, and the box plots carry even more detailed information about the shape of the Y-distribution. Again, this is too tedious to do by hand. It is possible to make this plot in JMP by creating a new variable that groups the values of the X variable into classes and then using the Analyze->Fit Y-by-X platform with these groupings. This is illustrated below:
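The grouping step can also be sketched outside JMP. A tentative pandas version (the function and column names are my own invention, and the example data are simulated):

```python
import numpy as np
import pandas as pd

def strip_summary(x, y, n_strips=5):
    """Divide the X-axis into equal-width strips and summarise the
    Y-values within each strip (the numbers a box plot would draw)."""
    df = pd.DataFrame({"x": x, "y": y})
    df["strip"] = pd.cut(df["x"], bins=n_strips)
    return df.groupby("strip", observed=True)["y"].describe()

# Simulated example: a weak positive trend with noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 200)
y = x + rng.normal(0.0, 1.0, 200)
summary = strip_summary(x, y, n_strips=4)
```

The `describe()` output per strip (count, quartiles, min, max) is exactly the information a side-by-side box plot displays.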
Spline methods A spline is a series of short smooth curves that are joined together to create a larger smooth curve. The computational details are complex, but can be done in JMP. The stiffness of the spline controls how straight the resulting curve will be. The following shows two spline fits to the same data with different stiffness measures:
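Outside JMP, the same idea can be sketched with SciPy's smoothing splines, where the smoothing factor `s` plays roughly the role of JMP's stiffness (larger `s`, straighter curve); the data here are simulated:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Simulated noisy sine data.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 40))
y = np.sin(x) + rng.normal(0.0, 0.2, 40)

stiff = UnivariateSpline(x, y, s=40.0)    # stiff: close to a straight curve
flexible = UnivariateSpline(x, y, s=0.4)  # flexible: follows the points closely
```

The flexible fit hugs the data (small residuals); the stiff fit trades residual size for straightness, exactly the trade-off shown in the two JMP plots.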
1.3 Correlation
WARNING! Correlation is probably the most abused concept in statistics. Many people use the word 'correlation' to mean any type of association between two variables, but it has a very strict technical meaning: the strength of the apparent linear relationship between two interval- or ratio-scaled variables.
The correlation measure does not distinguish between explanatory and response variables, and it treats the two variables symmetrically. This means that the correlation between Y and X is the same as the correlation between X and Y.
Correlations are computed in JMP using the Analyze->Correlation of Y's platform. If there are several variables, the results are organized into a table; each cell in the table shows the correlation of the two corresponding variables. Because of symmetry (the correlation between variable 1 and variable 2 is the same as between variable 2 and variable 1), only part of the complete matrix may be shown. As well, the correlation between any variable and itself is always 1.
1.3.1 Scatter-plot matrix
To illustrate the ideas of correlation, look at the FITNESS dataset in the DATAMORE directory of JMP. This is a dataset on 31 people at a fitness centre; the following variables were measured on each subject:
• name
• gender
• age
• weight
• oxygen consumption (high values typically indicate more fit people)
• time to run one mile (1.6 km)
• average pulse rate during the run
• resting pulse rate
• maximum pulse rate during the run
We are interested in examining the relationships among the variables. For the moment, ignore the fact that the data contain both genders. [It would be interesting to assign different plotting symbols to the two genders to see if gender is a lurking variable.]
One of the first things to do is to create a scatter-plot matrix of all the variables. Use the Analyze->Correlation of Ys platform to get the following scatter-plot:
Interpreting the scatter plot matrix
The entries in the matrix are scatter-plots for all the pairs of variables. For example, the entry in row 1, column 3 represents the scatter-plot between age and oxygen consumption with age along the vertical axis and oxygen consumption along the horizontal axis, while the entry in row 3, column 1 has age along the horizontal axis and oxygen consumption along the vertical axis.
There is clearly a difference in the 'strength' of the relationships. Compare the scatter plot for average running pulse rate and maximum pulse rate (row 5, column 7) to that of running pulse rate and resting pulse rate (row 5, column 6), and to that of running pulse rate and weight (row 5, column 2).
Similarly, there is a difference in the direction of association. Compare the scatter plot for average running pulse rate and maximum pulse rate (row 5, column 7) with that for oxygen consumption and running time (row 3, column 4).
1.3.2 Correlation coefficient
It is possible to quantify the strength of association between two variables. As with all statistics, the way the data are collected influences the meaning of the statistics.
The population correlation coefficient between two variables is denoted by the Greek letter rho (ρ) and is computed as:

ρ = (1/N) Σ_{i=1..N} [(X_i − μ_X)/σ_X] [(Y_i − μ_Y)/σ_Y]

The corresponding sample correlation coefficient, denoted r, has a similar form:¹

r = (1/(n−1)) Σ_{i=1..n} [(X_i − X̄)/s_x] [(Y_i − Ȳ)/s_y]
If the sampling scheme is a simple random sample from the corresponding population, then r is an estimate of ρ. This is a crucial assumption: if the sampling is not a simple random sample, the above definition of the sample correlation coefficient should not be used! It is possible to find a confidence interval for ρ and to perform statistical tests that ρ is zero. However, for the most part, these are rarely done in ecological research and so will not be pursued further in this course.
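As a concrete sketch of the formula in Python/NumPy (as the footnote warns, this textbook form is for illustration; it is not the numerically best computing formula):

```python
import numpy as np

def sample_r(x, y):
    """Sample correlation via the textbook formula:
    r = (1/(n-1)) * sum of standardized X times standardized Y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)  # standardize with the sample sd
    zy = (y - y.mean()) / y.std(ddof=1)
    return float(np.sum(zx * zy) / (n - 1))
```

In practice one would use np.corrcoef(x, y) (or JMP's platform), which agrees with this definition but computes it more stably.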
The form of the formula does provide some insight into interpreting its value.
• ρ and r (unlike other population parameters and their estimates) are unitless measures.
• The sign of ρ and r is largely determined by where each (X, Y) pair falls relative to the two means: if both X and Y are above their means, or both are below, the pair contributes a positive value towards ρ or r; if one is above its mean and the other below, the pair contributes a negative value.
• ρ and r range from −1 to 1. A value of ρ or r equal to −1 implies a perfect negative correlation; a value of 1 implies a perfect positive correlation; a value of 0 implies no correlation. A perfect correlation (ρ or r equal to 1 or −1) implies that all points lie exactly on a straight line, but the slope of the line has NO effect on the correlation coefficient. This latter point is IMPORTANT and is often wrongly interpreted - give some examples.
¹ Note that this formula SHOULD NOT be used for the actual computation of r; it is numerically unstable and there are better computing formulae available.
• ρ and r are unaffected by linear transformations of the individual variables, e.g. unit changes such as converting from imperial to metric units.
• ρ and r measure only the linear association; they are not affected by the slope of the line, but only by the scatter about the line.
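Two of the points above (invariance under linear changes of units, and the irrelevance of the slope for a perfect line) are easy to check numerically; the height/weight numbers here are invented for illustration:

```python
import numpy as np

# Hypothetical height/weight data, invented purely for illustration.
inches = np.array([60.0, 67.0, 70.0, 74.0, 63.0])
pounds = np.array([110.0, 150.0, 172.0, 200.0, 135.0])

# A unit change (a linear transformation) leaves r untouched.
r_imperial = np.corrcoef(inches, pounds)[0, 1]
r_metric = np.corrcoef(inches * 2.54, pounds * 0.4536)[0, 1]  # cm, kg

# The slope of a perfect line has no effect on r: both give r = 1.
t = np.linspace(0.0, 1.0, 20)
r_shallow = np.corrcoef(t, 0.1 * t)[0, 1]
r_steep = np.corrcoef(t, 100.0 * t)[0, 1]
```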
Because correlation assumes both variables have an interval or ratio scale, it makes no sense to compute the correlation
• between gender and oxygen consumption (gender is nominal scale data);
• between non-linearly related variables (not shown on graph);
• for data collected without a known probability scheme. If a sampling scheme other than simple random sampling is used, it is possible to modify the estimation formula; if a non-probability sampling scheme was used, the patient is dead on arrival, and no amount of statistical wizardry will revive the corpse.
The data collection scheme for the fitness data set is unknown; we will have to assume that some sort of random sample from the relevant population was taken before we can make much sense of the numbers computed.
Before looking at the details of its computation, look at the sample correlation coefficients for each scatter plot above. These can be arranged into a matrix:
Variable    Age   Weight   Oxy   Runtime  RunPulse  RstPulse  MaxPulse
Age        1.00   -0.24   -0.31    0.19     -0.31     -0.15     -0.41
Weight    -0.24    1.00   -0.16    0.14      0.18      0.04      0.24
Oxy       -0.31   -0.16    1.00   -0.86     -0.39     -0.39     -0.23
Runtime    0.19    0.14   -0.86    1.00      0.31      0.45      0.22
RunPulse  -0.31    0.18   -0.39    0.31      1.00      0.35      0.92
RstPulse  -0.15    0.04   -0.39    0.45      0.35      1.00      0.30
MaxPulse  -0.41    0.24   -0.23    0.22      0.92      0.30      1.00
Notice that the sample correlation between any two variables is the same regardless of the ordering of the variables – this explains the symmetry of the matrix between the above- and below-diagonal elements. As well, each variable has a perfect sample correlation with itself – this explains the values of 1 along the main diagonal.
Compare the sample correlations between the average running pulse rate and the other variables with the corresponding scatter-plots above.
1.3.3 Cautions
• Random sampling required. Sample correlation coefficients are only valid under simple random samples. If the data were collected in a haphazard fashion, or if certain data points were oversampled, then the correlation coefficient may be severely biased.
• There are examples of high correlation but no practical use, and of low correlation but great practical use. These will be presented in class. This illustrates why I almost never talk about correlation.
• Correlation measures the 'strength' of a linear relationship; a curvilinear relationship may have a correlation of 0 even though there is still a strong relationship.
• The effect of outliers and high-leverage points will be presented in class.
• Effects of lurking variables. For example, suppose there is a positive association between the wages of male nurses and years of experience, and between the wages of female nurses and years of experience, but males are generally paid more than females. There is a positive correlation within each group, but an overall negative correlation when the data are pooled together.
• Ecological fallacy - the problem of correlation applied to averages. Even if there is a high correlation between the averages of two variables, it does not imply that there is a correlation between the individual data values.
For example, if you look at the average consumption of alcohol and the average consumption of cigarettes, there is a high correlation among the averages when the 12 values from the provinces and territories are plotted on a graph. However, the individual relationships within provinces can be reversed or non-existent, as shown below:
The relationship between cigarette consumption and alcohol consumption shows no relationship within each province, yet there is a strong correlation among the per-capita averages. This is an example of the ecological fallacy.
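A simulated sketch of the fallacy (all numbers invented): within each "province" the two variables are independent, yet the per-province averages are almost perfectly correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
levels = np.arange(1.0, 13.0)  # 12 "provinces" with increasing overall level
groups = [(m + rng.normal(0.0, 0.3, 50),   # individual cigarette consumption
           m + rng.normal(0.0, 0.3, 50))   # individual alcohol, drawn independently
          for m in levels]

# Within any one province the two variables are unrelated by construction...
r_within = np.corrcoef(groups[0][0], groups[0][1])[0, 1]
# ...but the per-province averages are strongly correlated.
r_averages = np.corrcoef([g[0].mean() for g in groups],
                         [g[1].mean() for g in groups])[0, 1]
```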
• Correlation does not imply causation. This is the most frequent mistake people make. There is a set of principles of causal inference that need to be satisfied in order to imply cause and effect.
1.3.4 Principles of Causation
Types of association
An association may be found between two variables for several reasons (show causal modeling figures):
• There may be direct causation, e.g. smoking causes lung cancer.
• There may be a common cause, e.g. ice cream sales and the number of drownings both increase with temperature.
• There may be a confounding factor, e.g. highway fatalities decreased when speed limits were reduced to 55 mph at the same time that the oil crisis caused fuel supplies to be reduced and people drove fewer miles.
• There may be a coincidence, e.g. the population of Canada has increased at the same time as the moon has gotten closer by a few miles.
Establishing cause-and-effect
How do we establish a cause-and-effect relationship? Bradford Hill (Hill, A. B. 1971. Principles of Medical Statistics, 9th ed. New York: Oxford University Press) outlined seven criteria that have been adopted by many epidemiological researchers. It is generally agreed that most or all of the following must be considered before causation can be declared.
Strength of the association. The stronger an observed association appears over a series of different studies, the less likely the association is spurious because of bias.
Dose-response effect. The value of the response variable changes in a meaningful way with the dose (or level) of the suspected causal agent.
Lack of temporal ambiguity. The hypothesized cause precedes the occurrence of the effect. The ability to establish this time pattern will depend upon the study design used.
Consistency of the findings. Most, or all, studies concerned with a given causal hypothesis produce similar findings. Of course, studies dealing with a given question may all have serious bias problems that can diminish the importance of observed associations.
Biological or theoretical plausibility. The hypothesized causal relationship is consistent with current biological or theoretical knowledge. Note that the current state of knowledge may be insufficient to explain certain findings.
Coherence of the evidence. The findings do not seriously conflict with accepted facts about the outcome variable being studied.
Specificity of the association. The observed effect is associated with only the suspected cause (or with few other causes that can be ruled out).
IMPORTANT: NO CAUSATION WITHOUT MANIPULATION!
Examples:
Discuss the above in relation to:
• amount of studying vs. grades in a course.
• amount of clear-cutting and sediments in water.
• fossil-fuel burning and the greenhouse effect.
1.4 Single-variable regression
1.4.1 Introduction
Along with the Analysis of Variance, this is likely the most commonly used statistical methodology in ecological research. In virtually every issue of an ecological journal, you will find papers that use a regression analysis.
There are HUNDREDS of books written on regression analysis. Some of the better ones (IMHO) are:
Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, and Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.
Consequently, this set of notes is VERY brief and makes no pretense of being a thorough review of regression analysis. Please consult the above references for all the gory details.
It turns out that both Analysis of Variance and Regression are special cases of a more general statistical methodology called General Linear Models, which in turn are special cases of Generalized Linear Models (covered in Stat 402/602), which in turn are special cases of Generalized Additive Models, which in turn are special cases of .....
The key difference between a regression analysis and an ANOVA is that the X variable is nominal scaled in ANOVA, while in regression analysis the X variable is continuous scaled. This implies that in ANOVA the shape of the response profile is unspecified (the null hypothesis is that all means are equal, while the alternative is that at least one mean differs), whereas in regression the response profile must be a straight line.
Because both ANOVA and regression belong to the same class of statistical models, many of the assumptions are similar, the fitting methods are similar, and hypothesis testing and inference are similar as well.
1.4.2 Equation for a line - getting notation straight (no pun intended)

In order to use regression analysis effectively, it is important that you understand the concepts of slopes and intercepts and how to determine these from data values. This will be QUICKLY reviewed here in class.

In previous courses at high school or in linear algebra, the equation of a straight line was often written y = mx + b, where m is the slope and b is the intercept. In some popular spreadsheet programs, the authors decided to write the equation of a line as y = a + bx; now a is the intercept and b is the slope. Statisticians, for good reasons, have rationalized this notation and usually write the equation of a line as y = β₀ + β₁x or as Y = b₀ + b₁X (the distinction between β₀ and b₀ will be made clearer in a few minutes). The use of the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then readily extends to more complex cases.

© 2012 Carl James Schwarz, November 23, 2012
CHAPTER 1. CORRELATION AND SIMPLE LINEAR REGRESSION

Recall the definition of the intercept as the value of Y when X = 0, and of the slope as the change in Y per unit change in X.
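As a quick illustration of these two definitions (the numbers below are invented for the example), the slope and intercept of the line through two points can be computed directly:

```python
# Two hypothetical data points assumed to lie on a straight line.
x1, y1 = 2.0, 7.0
x2, y2 = 5.0, 13.0

slope = (y2 - y1) / (x2 - x1)   # change in Y per unit change in X
intercept = y1 - slope * x1     # value of Y when X = 0

print(slope, intercept)  # -> 2.0 3.0
```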
1.4.3 Populations and samples

All of statistics is about detecting signals in the face of noise and about estimating population parameters from samples. Regression is no different.

First consider the population. As in previous chapters, the correct definition of the population is an important part of any study. Conceptually, we can think of the large set of all units of interest. On each unit there are, conceptually, both an X and a Y variable present. We wish to summarize the relationship between Y and X, and furthermore wish to make predictions of the Y value for future X values that may be observed from this population. [This is analogous to having different treatment groups corresponding to different values of X in ANOVA.]
If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or PV = nRT. However, in ecology the relationship between Y and X is much more tenuous. If you could draw a scatter-plot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value. [This is analogous to saying that Y varies randomly around the treatment group mean in ANOVA.]
We denote this relationship as

    Y = β₀ + β₁X + ε

where β₀ and β₁ are now the POPULATION intercept and slope respectively. We say that

    E[Y] = β₀ + β₁X

is the expected or average value of Y at X. [In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fall on a straight line.]

The term ε represents random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line). [This is analogous to the assumption of equal treatment population standard deviations in ANOVA.]
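This population model is easy to simulate. In the sketch below (all parameter values are invented for illustration, and numpy is assumed available), a large "population" is generated from Y = β₀ + β₁X + ε, and the average Y near one X value is checked against the line:

```python
import numpy as np

rng = np.random.default_rng(1)

beta0, beta1, sigma = 10.0, 2.5, 3.0       # hypothetical population values

x = rng.uniform(0, 20, size=100_000)       # X values across the population
eps = rng.normal(0, sigma, size=x.size)    # same SD at every X (key assumption)
y = beta0 + beta1 * x + eps                # Y = beta0 + beta1*X + epsilon

# E[Y] at X = 10 should be beta0 + beta1*10 = 35;
# individual Y values scatter above and below that point on the line.
near_10 = y[(x > 9.9) & (x < 10.1)]
print(near_10.mean())                      # close to 35
```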
Of course, we can never measure all units of the population, so a sample must be taken in order to estimate the population slope, population intercept, and population standard deviation. Unlike a correlation analysis, it is NOT necessary to select a simple random sample from the entire population, and more elaborate schemes can be used. The bare minimum that must be achieved is that for any individual X value found in the sample, the units in the population that share this X value must have been selected at random.
This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select at random from the population as a whole. [This is analogous to the assumptions made in an analytical survey, where we assumed that even though we can't randomly assign a treatment to a unit (e.g. we can't assign sex to an animal), we must ensure that animals are randomly selected from each group.]

Once the data points are selected, the estimation process can proceed, but not before assessing the assumptions!
1.4.4 Assumptions

The assumptions for a regression analysis are very similar to those found in ANOVA.

Linearity
Regression analysis assumes that the relationship between Y and X is linear. Make a scatter-plot of Y against X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs. log(X)); some caution is required with transformations in dealing with the error structure, as you will see in later examples.

There are several checks. First, plot the residuals against the X values: if the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X is not linear. Second, fit a model that includes X and X² and test if the coefficient associated with X² is zero; unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit can be performed where the variation of the responses at the same X value is compared to the variation around the regression line.
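The second check can be sketched as an extra-sum-of-squares F test. The data below are simulated with deliberate curvature (all numbers invented; numpy and scipy assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 1 + 2 * x + 0.3 * x**2 + rng.normal(0, 1, x.size)  # deliberately curved

def sse(degree):
    """Residual sum of squares from a least-squares polynomial fit."""
    fit = np.polyfit(x, y, degree)
    return np.sum((y - np.polyval(fit, x)) ** 2)

sse_linear, sse_quadratic = sse(1), sse(2)

# F test: does adding the X^2 term reduce the residual SS more than chance?
n = x.size
F = (sse_linear - sse_quadratic) / (sse_quadratic / (n - 3))
p = stats.f.sf(F, 1, n - 3)
print(F, p)   # a tiny p-value says the straight line is inadequate
```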
Correct scale of predictor and response

The response and predictor variables must both have interval or ratio scale. In particular, using a numerical value to represent a category and then using this numerical value in a regression is not valid. For example, suppose that you code hair color as (1 = red, 2 = brown, and 3 = black). Then using these values in a regression, either as a predictor variable or as a response variable, is not sensible.
Correct sampling scheme

The Y values must be a random sample from the population of Y values for every X value in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given X, the values from the population must be a simple random sample.
No outliers or influential points

All the points must belong to the relationship; there should be no unusual points. The scatter-plot of Y vs. X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this makes a difference in the fit.

Outliers can have a dramatic effect on the fitted line. For example, in the following graph, the single point is both an outlier and an influential point:
Equal variation along the line

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over the entire line. This is assessed by looking at plots of the residuals against X to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.
Independence

Each value of Y is independent of any other value of Y. The most common case where this fails is time-series data, where X is a time measurement. In these cases, time-series analysis should be used.

This assumption can be assessed by again looking at residual plots against time or other variables.
Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring the X's. The assumption only states that the residuals, the differences between the values of Y and the points on the line, must be normally distributed.

This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.
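One way to check this on a fitted line is to apply a normality test to the residuals rather than to the raw Y values. A sketch with simulated data (numbers invented), using scipy's Shapiro-Wilk test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 4 + 1.5 * x + rng.normal(0, 2, x.size)   # straight line + normal errors

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)                # these, not Y, should look normal

# Shapiro-Wilk test applied to the residuals; a large p-value is
# consistent with (though does not prove) normally distributed errors.
stat, p = stats.shapiro(residuals)
print(stat, p)
```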
X measured without error

This is a new assumption for regression as compared to ANOVA. In ANOVA, the group membership was always "exact", i.e. the treatment applied to an experimental unit was known without ambiguity. However, in regression, it can turn out that the X value may not be known exactly.

This general problem is called the "error in variables" problem and has a long history in statistics.

It turns out that there are two important cases. If the value reported for X is a nominal value and the actual value of X varies randomly around this nominal value, then there is no bias in the estimates. This is called the Berkson case, after Berkson, who first examined this situation. The most common cases are those where the recorded X is a target value (e.g. temperature as set by a thermostat) while the actual X that occurs varies randomly around this target value.
However, if the value used for X is an actual measurement of the true underlying X, then there is uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the true population values! For example, suppose that the yield of a crop is related to the amount of rainfall. A rain gauge may not be located exactly at the plot where the crop is grown; rainfall may instead be recorded at a nearby weather station a fair distance away. The reading at the weather station is NOT a true reflection of the rainfall at the test plot.

This latter case of "error in variables" is very difficult to analyze properly, and there are no universally accepted solutions. Refer to the reference books listed at the start of this chapter for more details.
The problem is set up as follows. Let

    Yᵢ = ηᵢ + εᵢ
    Xᵢ = ξᵢ + δᵢ
with the straight-line relationship between the true (but unobserved) values:

    ηᵢ = β₀ + β₁ξᵢ

Note that the (true, but unknown) regression equation uses ξᵢ rather than the observed (with error) values Xᵢ.

Now if the regression is done on the observed X (i.e. the error-prone measurement), the regression equation reduces to:

    Yᵢ = β₀ + β₁Xᵢ + (εᵢ − β₁δᵢ)

Now this violates the independence assumption of ordinary least squares, because the new "error" term is not independent of the Xᵢ variable.
If an ordinary least squares model is fit, the estimated slope is biased (Draper and Smith, 1998, p. 90) with

    E[β̂₁] = β₁ − β₁ r(ρ + r) / (1 + 2ρr + r²)

where ρ is the correlation between ξ and δ, and r is the ratio of the variance of the error in X to the error in Y.

The bias is negative, i.e. the estimated slope is too small, in most practical cases (ρ + r > 0). This is known as attenuation of the estimate and, in general, pulls the estimate towards zero.
The bias will be small in the following cases:

• the error variance of X is small relative to the error variance in Y. This means that r is small (i.e. close to zero), and so the bias is also small. In the case where X is measured without error, r = 0 and the bias vanishes, as expected.

• if the X are fixed (the Berkson case) and actually used², then ρ + r = 0 and the bias also vanishes.

The proper analysis of the error-in-variables case is quite complex; see Draper and Smith (1998, p. 91) for more details.
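The two cases are easy to see in a simulation. In the sketch below (all numbers invented; numpy assumed available), classical measurement error in X attenuates the estimated slope toward zero, while the Berkson case leaves it essentially unbiased:

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1, n = 1.0, 2.0, 200_000         # true line; values made up

# Classical error-in-variables: regress Y on X = xi + delta.
xi = rng.uniform(0, 10, n)                  # true (unobserved) X values
y = beta0 + beta1 * xi + rng.normal(0, 1, n)
x_obs = xi + rng.normal(0, 2, n)            # error-prone measurement of X
b1_classical = np.polyfit(x_obs, y, 1)[0]

# Berkson case: X is a fixed target; the true value varies around it.
x_target = rng.uniform(0, 10, n)
xi_b = x_target + rng.normal(0, 2, n)
y_b = beta0 + beta1 * xi_b + rng.normal(0, 1, n)
b1_berkson = np.polyfit(x_target, y_b, 1)[0]

# Expected attenuation factor here: Var(xi)/(Var(xi)+Var(delta)) ~ 8.33/12.33
print(b1_classical, b1_berkson)   # roughly 1.35 vs. roughly 2.0
```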
1.4.5 Obtaining Estimates

To distinguish between population parameters and sample estimates, we denote the sample intercept by b₀ and the sample slope by b₁. The equation fitted to a particular sample of points is expressed as Ŷᵢ = b₀ + b₁Xᵢ, where b₀ is the estimated intercept and b₁ is the estimated slope. The symbol Ŷ indicates that we are referring to the estimated line and not to a line in the entire population.
² For example, a thermostat measures (with error) the actual temperature of a room. But if the experiment is based on the thermostat readings rather than the (true) unknown temperature, this corresponds to the Berkson case.
How is the best-fitting line found when the points are scattered? We typically use the principle of least squares. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line, in the vertical direction, as small as possible.

Mathematically, the least-squares line is the line that minimizes

    Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

where Ŷᵢ is the point on the line corresponding to each X value. This is also known as the predicted value of Y for a given value of X. This formal definition of least squares is not that important; the concept as expressed in the previous paragraph is more important, in particular that it is the SQUARED deviation in the VERTICAL direction that is used.

It is possible to write out a formula for the estimated intercept and slope, but who cares; let the computer do the dirty work.
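For the record, the "dirty work" amounts to two short formulas. A sketch on made-up numbers (numpy assumed available); np.polyfit performs the same minimization:

```python
import numpy as np

# Toy data; any paired (x, y) arrays would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1*xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Same answer from the library routine (highest-degree coefficient first).
b1_np, b0_np = np.polyfit(x, y, 1)
print(b0, b1)
```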
The estimated intercept (b₀) is the estimated value of Y when X = 0. In some cases it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs. year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of the intercept, and it merely serves as a placeholder for the line.

The estimated slope (b₁) is the estimated change in Y per unit change in X. For every unit change in the horizontal direction, the fitted line rises by b₁ units. If b₁ is negative, the fitted line points downwards, and the "increase" in the line is negative, i.e. actually a decrease.
As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae but, in this age of computers, these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se.
Formal tests of hypotheses can also be done. Usually these are only done on the slope parameter, as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X (can you draw a scatter-plot showing such a relationship?). More formally, the null hypothesis is

    H: β₁ = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.

The alternate hypothesis is typically chosen as

    A: β₁ ≠ 0

although one-sided tests looking for either a positive or a negative slope are possible.

The test statistic is found as

    T = (b₁ − 0) / se(b₁)

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is usually done automatically by most computer packages. The p-value is interpreted in exactly the same way as in
ANOVA, i.e. it measures the probability of observing these data if the hypothesis of no relationship were true.

As before, the p-value does not tell the whole story; statistical vs. biological (non)significance must be determined and assessed.
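Numerically, the test takes one line once b₁ and its standard error are known. The numbers below are illustrative stand-ins (close to those in the fertilizer example later in this section); scipy supplies the t-distribution:

```python
from scipy import stats

b1, se_b1, n = 1.10, 0.132, 11          # illustrative estimate, se, sample size

T = (b1 - 0) / se_b1                    # test statistic for H: beta1 = 0
p = 2 * stats.t.sf(abs(T), df=n - 2)    # two-sided p-value on n - 2 df

ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)   # approximate 95% CI: estimate +/- 2 se
print(T, p, ci)
```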
1.4.6 Obtaining Predictions

Once the best-fitting line is found, it can be used to make predictions for new values of X.

There are two types of predictions that are commonly made. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X.³ The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope.
In both cases, the estimate is found in the same manner: substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new "dummy" observation in the dataset with the value of Y missing but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.
What differs between the two predictions are the estimates of uncertainty.

In the first case, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that the estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.

In the second case, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses, because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.
Many textbooks give the formulae for the se of the two types of predictions but, again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.

³ There is actually a third interval, the mean of the next "m" individual values, but this is rarely encountered in practice.
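As a sketch of how the two intervals differ (the standard simple-regression formulas are assumed here; your package's documentation remains the authority for what it reports), on simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 30)
y = 3 + 2 * x + rng.normal(0, 1.5, x.size)    # simulated sample

n = x.size
b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # residual SD

x0 = 7.0                                      # the new X of interest
yhat = b0 + b1 * x0                           # same point estimate for both
h = 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
se_mean = s * np.sqrt(h)                      # line uncertainty only
se_pred = s * np.sqrt(1 + h)                  # plus individual scatter

t = stats.t.ppf(0.975, n - 2)
ci = (yhat - t * se_mean, yhat + t * se_mean) # CI for the MEAN at x0
pi = (yhat - t * se_pred, yhat + t * se_pred) # PI for a SINGLE new Y at x0
print(ci, pi)   # the prediction interval is the wider of the two
```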
1.4.7 Residual Plots

After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e. the residual from fitting a straight line is found as: residualᵢ = Yᵢ − (b₀ + b₁Xᵢ) = (Yᵢ − Ŷᵢ).

There are several standard residual plots:

• plot of residuals vs. predicted values (Ŷ);
• plot of residuals vs. X;
• plot of residuals vs. time ordering.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don't plot residuals vs. Y; this will lead to odd-looking plots which are an artifact of the plotting and don't mean anything.
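The residual computation itself is one line; a sketch on simulated data (the plotting calls, e.g. matplotlib's plt.scatter, are left as comments since only the pattern in the plots matters):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 40)
y = 5 + 1.2 * x + rng.normal(0, 1, x.size)   # simulated sample

b1, b0 = np.polyfit(x, y, 1)
predicted = b0 + b1 * x
residuals = y - predicted                     # observed minus predicted

# The three standard plots would be, e.g. with matplotlib:
#   plt.scatter(predicted, residuals)   # residuals vs. predicted
#   plt.scatter(x, residuals)           # residuals vs. X
#   plt.plot(residuals)                 # residuals vs. time ordering
# Each should show patternless scatter around zero.
print(residuals.mean())   # essentially zero by construction of least squares
```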
1.4.8 Example - Yield and fertilizer

We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for tomato plants. An experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount of fertilizer was varied and the yield measured at the end of the season.

The amount of fertilizer (randomly) applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers. At the end of the experiment, the yields were measured and the following data were obtained.

Interest also lies in predicting the yield when 16 kg/ha are assigned.
Fertilizer (kg/ha)   Yield (Liters)
        12                 24
         5                 18
        15                 31
        17                 33
        14                 30
         6                 20
        11                 25
        13                 27
        15                 31
         8                 21
        18                 29
The raw data are also available in a JMP datasheet called fertilizer.jmp, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.
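As a cross-check of the JMP results reported later in this section, the same fit can be sketched with numpy (the data are typed in from the table above):

```python
import numpy as np

fert = np.array([12, 5, 15, 17, 14, 6, 11, 13, 15, 8, 18], dtype=float)
yield_l = np.array([24, 18, 31, 33, 30, 20, 25, 27, 31, 21, 29], dtype=float)

b1, b0 = np.polyfit(fert, yield_l, 1)
print(b0, b1)             # about 12.856 and 1.10137

# Standard error of the slope: s / sqrt(Sxx), with s from the residuals.
n = fert.size
resid = yield_l - (b0 + b1 * fert)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))
se_b1 = s / np.sqrt(np.sum((fert - fert.mean()) ** 2))
print(se_b1)              # about 0.132

print(b0 + b1 * 16)       # predicted yield at 16 kg/ha, about 30.5 L
```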
The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha.

If all of the population could be measured (which it can't), you could find a relationship between the yield and the amount of fertilizer applied. This relationship would have the form

    Y = β₀ + β₁ × (amount of fertilizer) + ε

where β₀ and β₁ represent the true population intercept and slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot were grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).
The population parameters to be estimated are β₀, the true average yield when the amount of fertilizer is 0, and β₁, the true average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β₀ and β₁ are impossible to obtain, as the entire population could never be measured.

Here is the data entered into a JMP data sheet. Note the scale of both variables (continuous) and that an extra row was added to the data table with the value of 16 for the fertilizer and the yield left missing.
The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value and the rows for future predictions are placed at the end of the dataset. Notice how missing values are represented.

Use the Analyze->Fit Y-by-X platform to start the analysis. Specify the Y and X variables as needed.
Notice that JMP "reminds" you of the analysis that you will obtain based on the scale of the X and Y variables, as shown in the bottom left of the menu. In this case, both X and Y have a continuous scale, so JMP will perform a bivariate fitting procedure. It starts by showing the scatter-plot between yield (Y) and fertilizer (X).
The relationship looks approximately linear; there don't appear to be any outlier or influential points; and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

The drop-down menu item (from the red triangle beside the Bivariate Fit...) allows you to fit the least-squares line. This produces much output, but the three important parts of the output are discussed below.

First, the actual fitted line is drawn on the scatter-plot, and the equation of the fitted line is printed below the plot.
The estimated regression line is

    Ŷ = b₀ + b₁(fertilizer) = 12.856 + 1.10137 × (amount of fertilizer)

In terms of estimates, b₀ = 12.856 is the estimated intercept, and b₁ = 1.101 is the estimated slope.

The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the CHANGE in Y when X increases by 1 unit, not the value of Y when X = 1.

The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case the intercept has a meaningful interpretation, but I'd be worried about extrapolating outside the range of the observed X values. If the intercept is 12.85, why does the line intersect the left part of the graph at about 15 rather than closer to 13?
Once again, these are the results from a single experiment. If the experiment were repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution over all possible experiments would describe the variation in b0 and b1. The standard deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and b1.
The formulae for the standard errors of b0 and b1 are messy, and hopeless to compute by hand. Just like inference for a mean or a proportion, the program automatically computes the se of the regression estimates.

The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.

Using exactly the same logic as when we found a confidence interval for the population mean, or for the population proportion, a confidence interval for the population slope (β1) is found (approximately) as

b1 ± 2(estimated se)

In the above example, an approximate confidence interval for β1 is found as

1.101 ± 2 × (.132) = 1.101 ± .264 = (.837 → 1.365) L/kg

of fertilizer applied.
An "exact" confidence interval can be computed by JMP as shown above.⁴ The "exact" confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small. We interpret this interval as 'being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between .837 and 1.365 L/kg.'
Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but is a confidence interval for β1 - the population parameter that is unknown.
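The arithmetic of the approximate interval above can be checked in a few lines, using the estimates reported in the output:

```python
# Approximate 95% CI for the slope: b1 +/- 2*se, using the reported
# estimates b1 = 1.101 L/kg and se = 0.132 L/kg.
b1, se = 1.101, 0.132
lower, upper = b1 - 2 * se, b1 + 2 * se
print(round(lower, 3), round(upper, 3))  # 0.837 1.365
```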
In linear regression problems, one hypothesis of interest is whether the true slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). Again, this is a good time to read the papers by Cherry and Johnson about the dangers of uncritical use of hypothesis testing. In many cases, a confidence interval tells the entire story.
JMP produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced again below:

⁴ If your table doesn't show the confidence interval, use a Control-Click or Right-Click in the table and select the columns to be displayed.
The test of hypothesis about the intercept is not of interest (why?).

Let

• β1 be the true (unknown) slope.
• b1 be the estimated slope. In this case b1 = 1.1014.

The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics.
1. Specify the null and alternate hypotheses:

   H: β1 = 0
   A: β1 ≠ 0.

   Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test as we are interested in detecting differences from zero in either direction.
2. Find the test statistic and the p-value. The test statistic is computed as:

   T = (estimate − hypothesized value) / (estimated se) = (1.1014 − 0) / .132 = 8.36

   In other words, the estimate is over 8 standard errors away from the hypothesized value! This will be compared to a t-distribution with n − 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).
3. Conclusion. There is strong evidence that the true slope is not zero. This is not too surprising given that the 95% confidence interval shows that plausible values for the true slope are from about .8 to about 1.4.
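The steps above can be sketched numerically (assuming scipy is available). Because the rounded values b1 = 1.1014 and se = 0.132 are used, the statistic comes out near, but not exactly at, the 8.36 shown in the JMP output:

```python
from scipy import stats

# Test H: beta1 = 0 against A: beta1 != 0, using the reported estimates.
b1, se, hypothesized, df = 1.1014, 0.132, 0.0, 9  # df = n - 2 with n = 11
T = (b1 - hypothesized) / se
p_value = 2 * stats.t.sf(abs(T), df)  # two-sided p-value
print(T, p_value)
```

The p-value is far below 0.0001, matching the conclusion in the notes.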
It is possible to construct tests of the slope equal to some value other than 0. Most packages can't do this directly; you would compute the T value as shown above, replacing the value 0 with the hypothesized value. It is also possible to construct one-sided tests. Most computer packages only do two-sided tests. Proceed as above, but the one-sided p-value is the two-sided p-value reported by the package divided by 2.
If sufficient evidence is found against the hypothesis, a natural question to ask is 'well, what values of the parameter are plausible given this data?' This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals rather than doing formal hypothesis testing.
What about making predictions for future yields when certain amounts of fertilizer are applied? For example, what would be the future yield when 16 kg/ha of fertilizer is applied?

The predicted value is found by substituting the new X into the estimated regression line:

Ŷ = b0 + b1(fertilizer) = 12.856 + 1.10137(16) = 30.48 L
This can also be found by using the cross-hairs tool on the actual graph (to be demonstrated in class). JMP can compute the predicted value by selecting the appropriate option under the drop-down menu in the Linear Fit item, and then going back to look at the new column in the data table.
As noted earlier, there are two types of estimates of precision associated with predictions using the regression line. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a single FUTURE individual value for a particular X. This would correspond to the predicted yield for a single future plot with 16 kg/ha of fertilizer added.

Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added.

The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
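The distinction shows up directly in the standard simple-regression formulas: the prediction interval carries an extra "+1" under the square root, for the variability of the single future observation. A sketch follows, with purely illustrative values for s, n, x̄, Sxx and the t critical value (these are not the fertilizer results):

```python
import math

def interval_half_widths(s, n, xbar, Sxx, x0, tcrit):
    """Half-widths at x = x0 of the confidence interval for the MEAN
    response and the prediction interval for a SINGLE future response,
    using the standard simple linear regression formulas."""
    se_mean = s * math.sqrt(1.0 / n + (x0 - xbar) ** 2 / Sxx)
    se_pred = s * math.sqrt(1.0 + 1.0 / n + (x0 - xbar) ** 2 / Sxx)
    return tcrit * se_mean, tcrit * se_pred

# Illustrative values only.
ci_half, pi_half = interval_half_widths(s=2.0, n=11, xbar=20.0,
                                        Sxx=250.0, x0=16.0, tcrit=2.262)
print(ci_half, pi_half)
```

Because of the extra "+1", the prediction interval is always wider than the confidence interval for the mean, matching the plots below.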
Both intervals can be computed and plotted by JMP by again using the pop-down menu beside the Linear Fit box:
In this menu, the Confid Curves Fit correspond to confidence intervals for the MEAN response, while the Confid Curves Indiv correspond to prediction intervals for a future single response. Both can be plotted on the graph. Unfortunately, there does not appear to be a way to save the prediction limits into a data table from this platform - the cross-hairs tool must be used, or the Analyze->Fit Model platform should be used.
The innermost set of lines represents the confidence bands for the mean response. The outermost band of lines represents the prediction intervals for a single future response. As noted earlier, the latter must be wider than the former to account for an additional source of variation.

The numerical values from the Analyze->Fit Model platform are shown below:

Here the predicted yield for a single future trial at 16 kg/ha is 30.5 L, but the 95% prediction interval is between 26.1 and 34.9 L. The predicted AVERAGE yield for ALL future plots when 16 kg/ha of fertilizer is applied is also 30.5 L, but the 95% confidence interval for the MEAN yield is between 28.8 and 32.1 L.
Finally, residual plots can be made using the pop-down menu:

The residuals are simply the difference between the actual data point and the corresponding spot on the line, measured in the vertical direction. The residual plot shows no trend in the scatter around the value of zero.
The same items are available from the Analyze->Fit Model platform. Here you would specify Yield as the Y variable and Fertilizer as the X variable, in much the same way as in the Analyze->Fit Y-by-X platform. Much of the same output is produced. Additionally, you can save the actual confidence bounds for predictions into the data table (as shown above). This will be demonstrated in class.
1.4.9 Example - Mercury pollution
Mercury pollution is a serious problem in some waterways. Mercury levels often increase after a lake is flooded due to leaching of naturally occurring mercury by the higher levels of the water. Excessive consumption of mercury is well known to be deleterious to human health. It is difficult and time consuming to measure every person's mercury level. It would be nice to have a quick procedure that could be used to estimate the mercury level of a person based upon the average mercury level found in fish and estimates of the person's consumption of fish in their diet. The following data were collected on the methyl mercury intake of subjects and the actual mercury levels recorded in the blood stream from a random sample of people around recently flooded lakes.

Here are the raw data:
Methyl Mercury Intake    Mercury in whole blood
(ug Hg/day)              (ng/g)
180                       90
200                      120
230                      125
410                      290
600                      310
550                      290
275                      170
580                      375
600                      150
105                       70
250                      105
 60                      205
650                      480
The data are available in a JMP datasheet called mercury.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value and the rows for future predictions are placed at the end of the dataset. Notice how missing values are represented.
The population of interest is the people around recently flooded lakes.

This experiment is an analytical survey, as it is quite impossible to randomly assign people different amounts of mercury in their food intake. Consequently, the key assumption is that the subjects chosen to be measured are random samples from those with similar mercury intakes. Note it is NOT necessary for this to be a random sample from the ENTIRE population (why?).
The explanatory variable is the amount of mercury ingested by a person. The response variable is the amount of mercury in the blood stream.

We start by producing the scatter-plot.
There appear to be two outliers (identified by an X). To illustrate the effects of these outliers upon the estimates and the residual plots, the line was first fit using all of the data.
The residual plot shows the clear presence of the two outliers, but also identifies a third potential outlier not evident from the original scatter-plot (can you find it?).

The data were rechecked, and it appears that there was an error in the blood work used in determining the readings. Consequently, these points were removed for the subsequent fit.
The estimated regression line (after removing outliers) is

Blood = −1.951691 + 0.581218 × Intake

The estimated slope of 0.58 indicates that the mercury level in the blood increases by 0.58 ng/g when the intake level in the food is increased by 1 ug/day. The intercept has no real meaning in the context of this experiment; the negative value is merely a placeholder for the line. Also notice that the estimated intercept is not very precise in any case (how do I know this, and what implications does this have for worrying that it is not zero?).⁵
What would have been the impact upon the estimated slope and intercept if the outliers had been retained?

The estimated slope has been determined relatively well (relative standard error of about 10% - how is the relative standard error computed?). There is clear evidence that the hypothesis of no relationship between blood mercury levels and food mercury levels is not tenable.
The two types of predictions would also be of interest in this study. First, an individual would like to know the impact upon personal health. Second, the average level would be of interest to public health authorities.

JMP was used to plot both intervals on the scatter-plot:
1.4.10 Example - The Anscombe Data Set

Anscombe (1973, American Statistician 27, 17-21) created a set of 4 data sets that were quite remarkable. All four datasets give exactly the same results when a regression line is fit, yet are quite different in their interpretation.
The Anscombe data is available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Fitting of regression lines to this data will be demonstrated in class.

⁵ It is possible to fit a regression line that is constrained to go through Y = 0 when X = 0. These must be fit carefully and are not covered in this course.
1.4.11 Transformations
In some cases, the plot of Y vs. X is obviously non-linear, and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0 × length^β1. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log-scale - pH is a common example. Often a visual inspection of a plot may identify the appropriate transformation.
There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption about the error structure. The model for a fit on transformed data is of the form

trans(Y) = β0 + β1 × trans(X) + error

Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to hold on the transformed scale - in particular that the population standard deviation around the regression line is constant on the transformed scale.
The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm (often called the log10 transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log10 scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log - some papers use it to refer to the ln transformation, while others use it to refer to the log10 transformation.
After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used. Then we have

ln(Y_{t+1}) = b0 + b1 × (t + 1)

and

ln(Y_t) = b0 + b1 × t

Subtracting the two equations gives

ln(Y_{t+1}) − ln(Y_t) = ln(Y_{t+1}/Y_t) = b1 × (t + 1 − t) = b1

and exponentiating both sides gives

exp(ln(Y_{t+1}/Y_t)) = Y_{t+1}/Y_t = exp(b1) = e^b1
Hence a one unit increase in X causes Y to be MULTIPLIED by e^b1. As an example, suppose that on the log-scale the estimated slope was −.07. Then every unit change in X causes Y to change by a multiplicative factor of e^−.07 = .93, i.e. roughly a 7% decline per year.⁶
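The back-transformation is a one-liner; a quick check of the −.07 example:

```python
import math

# A slope of -0.07 on the (natural) log scale corresponds to a
# multiplicative change of exp(-0.07) per unit increase in X.
factor = math.exp(-0.07)
pct_change = 100 * (factor - 1)
print(round(factor, 3), round(pct_change, 1))
```

The exact change is about −6.8% per unit, confirming that for smallish slopes the log-scale slope is close to the percentage change.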
Similarly, predictions on the transformed scale must be back-transformed to the untransformed scale.
In some problems, scientists search for the 'best' transform. This is not an easy task, and using simple statistics such as R² to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.
JMP makes it particularly easy to fit regressions to transformed data, as shown below. SAS and R have an extensive array of functions so that you can create new variables based on the transformation of an existing variable.
1.4.12 Example: Monitoring Dioxins - transformation
An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from a monitoring station. The liver is excised, and the livers from all four crabs are composited together into a single sample.⁷ The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
Here is the raw data.

⁶ It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale, i.e. if the slope is −.07 on the log scale, this implies roughly a 7% decline per year; a slope of +.07 implies roughly a 7% increase per year.

⁷ Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
Site Year TEQ<br />
a 1990 179.05<br />
a 1991 82.39<br />
a 1992 130.18<br />
a 1993 97.06<br />
a 1994 49.34<br />
a 1995 57.05<br />
a 1996 57.41<br />
a 1997 29.94<br />
a 1998 48.48<br />
a 1999 49.67<br />
a 2000 34.25<br />
a 2001 59.28<br />
a 2002 34.92<br />
a 2003 28.16<br />
The data is available in a JMP data file dioxinTEQ.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
As with all analyses, start with a preliminary plot of the data. Use the Analyze->Fit Y-by-X platform.
The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of dioxins degrades per year, e.g. a 10% decline per year. This can be expressed as a non-linear relationship:

TEQ = C × r^t

where C is the initial concentration, r is the rate of reduction per year, and t is the elapsed time. If this is plotted over time, this leads to the non-linear pattern seen above.
If logarithms are taken, this leads to the relationship:

log(TEQ) = log(C) + t × log(r)

which can be expressed as:

log(TEQ) = β0 + β1 × t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).
JMP can easily be used to compute log(TEQ) by using the Formula Editor in the usual fashion. A plot of log(TEQ) vs. year gives the following:
The relationship looks approximately linear; there don't appear to be any outlier or influential points; and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

A line can be fit as before by selecting the Fit Line option from the red triangle in the upper left side of the plot:
This gives the following output:
The fitted line is:

log(TEQ) = 218.9 − .11 × (year)
The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope (−.11) is the estimated log(ratio) from one year to the next. For example, exp(−.11) = .898 would mean that the TEQ in one year is only 89.8% of the TEQ in the previous year, or roughly an 11% decline per year. The standard error of the estimated slope is .02.
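This fit can be reproduced from the TEQ table above. A sketch using numpy and the natural log; the estimates should agree with the reported 218.9 and −.11 up to rounding:

```python
import numpy as np

# TEQ readings for site a, 1990-2003, from the table above.
year = np.arange(1990, 2004)
teq = np.array([179.05, 82.39, 130.18, 97.06, 49.34, 57.05, 57.41,
                29.94, 48.48, 49.67, 34.25, 59.28, 34.92, 28.16])

# Regress ln(TEQ) on year: the slope estimates log(r) and the
# intercept estimates log(C).
b1, b0 = np.polyfit(year, np.log(teq), 1)
print(b0, b1, np.exp(b1))  # intercept ~ 219, slope ~ -0.11, ratio ~ 0.90
```

Back-transforming the slope, exp(b1), gives the estimated fraction of TEQ remaining from one year to the next, as interpreted above.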
A 95% confidence interval for the slope can be obtained by pressing a Right-Click (for Windoze machines) or a Ctrl-Click (for Macintosh machines) in the Parameter Estimates summary table and selecting the confidence intervals to display in the table.
The 95% confidence interval for the slope is (−0.154, −0.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year: between 0.86 and 0.94 of the TEQ in one year remains to the next year.
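These back-transformations are easy to check. A minimal sketch using the rounded estimates quoted above (with the rounded slope, exp(−0.11) comes out near 0.896 rather than the 0.898 computed from the unrounded estimate):

```python
import math

# Rounded estimates quoted in the text
slope = -0.11                      # estimated slope on the log scale
ci_low, ci_high = -0.154, -0.061   # 95% CI for the slope

# exp(slope) is the estimated year-to-year ratio of TEQ
ratio = math.exp(slope)
print(f"yearly ratio: {ratio:.3f}")  # about 0.896, i.e. roughly an 11% decline per year

# Anti-logs of the CI endpoints give a CI for the fraction remaining each year
print(f"95% CI for the ratio: ({math.exp(ci_low):.2f}, {math.exp(ci_high):.2f})")  # (0.86, 0.94)
```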
Several types of predictions can be made. For example, what would be the estimated mean TEQ in 2010? The computations could be done by hand, or by using the cross-hairs on the plot from the Analyze->Fit Y-by-X platform. Confidence intervals for the mean response, or prediction intervals for an individual response, can be added to the plot from the pop-down menu.

However, a more powerful tool is available from the Analyze->Fit Model platform.

Start by adding rows to the original data table corresponding to the years for which a prediction is required. In this case, the additional row would have the value 2010 in the Year column, with the remainder of the row unspecified. Missing values will be automatically inserted for the other variables.
Then invoke the Analyze->Fit Model platform:
This gives much the same output as the Analyze->Fit Y-by-X platform, with a few new (useful) features, some of which we will explore in the remainder of this section.

Next, save the prediction formula, the confidence interval for the mean, and the interval for an individual prediction to the data table (this will take three successive saves):
Now the data table has been augmented with additional columns and, more importantly, predictions for 2010 are now available:
The estimated mean log(TEQ) is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) = 13.46). A 95% confidence interval for the mean log(TEQ) is (1.94, 3.26), corresponding to a 95% confidence interval for the actual MEDIAN TEQ of between 6.96 and 26.05. 8 Note that the confidence interval after taking anti-logs is no longer symmetrical.
Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as means and standard errors cannot simply be anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, the sampling distribution about the estimate is assumed to be symmetrical, which makes the mean and median take the same value. So what really happens is that the median on the transformed scale is back-transformed to the median on the untransformed scale.
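A quick numerical check of the back-transformation, using the estimates quoted above:

```python
import math

mean_log_teq = 2.60     # estimated mean log(TEQ) for 2010
ci_log = (1.94, 3.26)   # 95% CI for the mean log(TEQ)

# The anti-log of the mean log estimates the MEDIAN TEQ, not the mean
median_teq = math.exp(mean_log_teq)
ci_median = tuple(math.exp(v) for v in ci_log)

print(f"estimated median TEQ: {median_teq:.2f}")                           # 13.46
print(f"95% CI for the median: ({ci_median[0]:.2f}, {ci_median[1]:.2f})")  # (6.96, 26.05)

# The interval is not symmetric about the estimate after back-transforming
print(median_teq - ci_median[0], ci_median[1] - median_teq)
```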
Similarly, a 95% prediction interval for the log(TEQ) of an INDIVIDUAL composite sample can be found. Be sure to understand the difference between the two intervals.

Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units.
The Analyze->Fit Model platform has an inverse prediction function:
8 A minor correction can be applied to estimate the mean if required.
Specify the required value for Y – in this case log(10) = 2.302 – and then press the RUN button to get the following output:
The predicted year is found by solving

2.302 = 218.9 − 0.11 × year

which gives an estimated year of 2012.7. A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!
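The inverse prediction is just the fitted line solved for year. Note that the calculation is very sensitive to rounding of the coefficients: the rounded values 218.9 and −0.11 quoted above do not reproduce 2012.7, so the sketch below also uses less-rounded coefficients (hypothetical values, chosen only for illustration) to show how the reported answer arises:

```python
import math

def inverse_prediction(y_target, intercept, slope):
    """Solve y = intercept + slope * x for x."""
    return (y_target - intercept) / slope

y_target = math.log(10)  # 2.302...

# With the heavily rounded coefficients from the text, the answer is far off:
year_rounded = inverse_prediction(y_target, 218.9, -0.11)
print(year_rounded)  # about 1969 -- rounding the slope badly distorts the result

# Hypothetical less-rounded coefficients (illustration only) recover
# something close to the reported 2012.7:
year_full = inverse_prediction(y_target, 218.67, -0.1075)
print(year_full)     # about 2012.7
```

In practice the full-precision estimates from the software should be used for any inverse prediction.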
The residual plot looks fine with no apparent problems, but the dip in the middle years could require further exploration if this pattern were apparent at other sites as well.

The application of regression to non-linear problems is fairly straightforward once the transformation is made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED scale and how these relate to the untransformed scale.
1.4.13 Example: Weight-length relationships - transformation
A common technique in fisheries management is to investigate the relationship between the weights and lengths of fish.

This is expected to be a non-linear relationship, because as fish get longer they also get wider and thicker. If a fish grew "equally" in all directions, then the weight of a fish should be proportional to length³ (why?). However, fish do not grow equally in all directions, i.e. a doubling of length is not necessarily associated with a doubling of width or thickness. The pattern of association of weight with length may reveal information on how fish grow.
The traditional model relating weight to length is often postulated to be of the form:

weight = a × length^b

where a and b are unknown constants to be estimated from data.

If the estimated value of b is much less than 3, this indicates that as fish get longer, they do not get wider and thicker at the same rates.
How are such models fit? If logarithms are taken on each side, the above equation is transformed to:

log(weight) = log(a) + b × log(length)

or

log(weight) = β0 + β1 × log(length)

where the usual linear relationship on the log-scale is now apparent.
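The linearization can be sketched numerically. The data below are simulated (a = 0.01, b = 3, and the noise level are all made-up values, not the fish data of this example), with numpy standing in for the regression platform:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data following weight = a * length^b with multiplicative error;
# a = 0.01, b = 3, and the noise level are all made-up values
length = rng.uniform(29, 46, size=50)
weight = 0.01 * length**3 * np.exp(rng.normal(0.0, 0.05, size=50))

# Ordinary least squares on the log-log scale recovers the parameters:
b1, log_b0 = np.polyfit(np.log(length), np.log(weight), deg=1)
print(f"estimated b = {b1:.2f}, estimated a = {np.exp(log_b0):.4f}")
```

The fitted slope estimates b directly; the intercept must be anti-logged to recover a.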
The following example was provided by Randy Zemlak of the British Columbia Ministry of Water, Land, and Air Protection.
Length (mm)   Weight (g)
34            585
46            1941
33            462
36            511
32            428
33            396
34            527
34            485
33            453
44            1426
35            488
34            511
32            403
31            379
30            319
33            483
36            600
35            532
29            326
34            507
32            414
33            432
33            462
35            566
34            454
35            600
29            336
31            451
33            474
32            480
35            474
30            330
30            376
34            523
31            353
32            412
32            407
A sample of fish was measured at a lake in British Columbia. The data are shown above and are available in a JMP datasheet called wtlen.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The following is an initial plot with a spline fit (lambda = 10) to the data.

The fit appears to be non-linear, but this may simply be an artifact of the influence of the two largest fish. The plot appears to be linear in the range of 30–35 mm in length. If you look at the plot carefully, the variance appears to increase with length, with the spread noticeably wider at 35 mm than at 30 mm.
There are several (equivalent) ways to fit the growth model to such data in JMP:
• Use Analyze->Fit Y-by-X directly with the Fit Special feature.
• Create two new variables, log(weight) and log(length), and then use Analyze->Fit Y-by-X on these derived variables.
• Use Analyze->Fit Model on these derived variables.
We will fit a model on the log-log scale. Note that there is some confusion in scientific papers about a "log" transform. In general, a log-transformation refers to taking natural logarithms (base e), and NOT the base-10 logarithm. This mathematical convention is often broken in scientific papers where authors use ln to represent natural logarithms, etc. Which transformation is used does not affect the analysis in any way, other than that values on the natural-log scale are approximately 2.3 times larger than values on the log10 scale. Of course, the appropriate back-transformation is required.
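The factor of 2.3 is just the change-of-base identity ln x = ln(10) × log10(x), with ln 10 ≈ 2.3026:

```python
import math

x = 500.0
print(math.log(x))                    # natural log of x
print(math.log(10) * math.log10(x))   # identical value via base-10 logs
print(math.log(10))                   # the conversion factor, 2.3026...
```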
Using the Fit Special

The Fit Special option is available from the drop-down menu:

It presents a dialogue box where a transformation on both the Y and X axes may be specified:
The following output is obtained:
The fit is not very satisfactory. The curve doesn't seem to fit the two "outlier" points very well. At smaller lengths, the curve seems to be under-fitting the weight. The residual plot appears to show the two definite outliers, and also shows some evidence of a poor fit, with positive residuals at lengths near 30 mm and negative residuals near 35 mm.
The fit was repeated dropping the two largest fish, with the following output:
Now the fit appears to be much better. The relationship (on the log-scale) is linear, and the residual plot looks OK.
The estimated power coefficient is 2.76 (SE 0.21). We find the 95% confidence interval for the slope (the power coefficient):

The 95% confidence interval for the power coefficient is (2.33, 3.20), which includes the value of 3 – hence the growth could be isometric, i.e. a fish that is twice the length is also twice the width and twice the thickness. Of course, with this small sample size, it is difficult to say much more.
The actual model in the population is:

log(weight) = β0 + β1 × log(length) + ε

This implies that the "errors" in growth act on the LOG-scale. This seems reasonable.
For example, a regression on the original scale would make the assumption that a 20 g error in predicting weight is equally severe for a fish that (on average) weighs 200 or 400 grams, even though the "error" is 20/200 = 10% of the predicted value in the first case but only 5% of the predicted value in the second case. On the log-scale, it is implicitly assumed that the "errors" operate multiplicatively, i.e. a 10% error in a 200 g fish is equally severe as a 10% error in a 400 g fish, even though the absolute errors of 20 g and 40 g are quite different.
Another assumption of regression analysis is that the population error variance is constant over the entire regression line, but the original plot shows that the standard deviation is increasing with length. On the log-scale, the standard deviation is roughly constant over the entire regression line.
Using derived variables

The same analysis was repeated using the derived variables log(weight) and log(length), again using the Analyze->Fit Y-by-X platform, but this time without Fit Special. [Fit Special is not needed because the derived variables have already been transformed.]

The following are the outputs using the derived variables, again with and without the two largest fish.
Because derived variables are used, the fitting plot uses the derived variables and is on the log-scale. This has the advantage that the fit at the lower lengths is easier to see, but the lack of fit for the two largest fish is not as clear. However, it is now easier to see on the residual plot the apparent lack of fit, with the downward-sloping portion of the residual plot between 3.4 and 3.6 in log(length).
The two largest fish were removed and the fit repeated using the derived variables:
The results are identical to those of the previous section.
A non-linear fit

It is also possible to do a direct non-linear least-squares fit. Here the objective is to find values of β0 and β1 that minimize

Σ (weight − β0 × length^β1)²

directly.
This can also be done in JMP using the Fit NonLinear platform and won't be explored in much detail here.

First, here are the results from using all of the fish:
Note that the fit is apparently better than the fit on the log-scale, as the fitted curve goes through the middle of the points for the two largest fish. Note, however, that there still appear to be problems with the fit at the lower lengths.
The same fit, dropping the two largest fish, gives the following output:
The estimated power coefficient from the non-linear fit is 2.73, with a standard error of 0.24. The estimated intercept is 0.0323, with an estimated standard error of 0.027. Both estimates are similar to the previous fit.

Which is the better method to fit these data? The non-linear fit assumes that errors are additive on the original scale. The consequences of this were discussed earlier, i.e. a 20 g error is equally serious for a 200 g fish as for a 400 g fish.
For this problem, both the non-linear fit and the fit on the log-scale gave the same results, but this will not always be true. In particular, look at the large difference in estimates when the models were fit to all of the fish. The non-linear fit was more influenced by the two large fish – this is a consequence of minimizing the square of the absolute deviation (as opposed to the relative deviation) between the observed and predicted weights.
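The difference between the two fitting criteria can be sketched with simulated data (again made-up parameter values, not the fish data of this example; scipy's curve_fit plays the role of JMP's Fit NonLinear platform):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)

# Simulated weight-length data (made-up parameters, not the real fish data)
length = rng.uniform(29, 46, size=50)
weight = 0.01 * length**3 * np.exp(rng.normal(0.0, 0.05, size=50))

def power_law(L, b0, b1):
    return b0 * L**b1

# Direct non-linear least squares: minimizes sum((weight - b0 * length**b1)**2)
(b0_nls, b1_nls), _ = curve_fit(power_law, length, weight, p0=(0.01, 3.0))

# Log-log linear fit: minimizes squared errors on the log scale instead
b1_log, log_b0 = np.polyfit(np.log(length), np.log(weight), deg=1)

# The two criteria weight large fish differently, so estimates agree only roughly
print(b1_nls, b1_log)
```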
1.4.14 Power/Sample Size
A power analysis and sample size determination can also be done for regression problems, but this is (unfortunately) rarely done. There are a number of reasons:
• The power depends not only on the total number of points collected, but also on the actual distribution of the X values. For example, a regression analysis is most powerful for detecting a trend if half the observations are collected at a small X value and half at a large X value. However, this type of data gives no information on the linearity (or lack thereof) between the two X values and is not recommended in practice. A less powerful design would collect a range of X values, but this is often of more interest because lack-of-fit and non-linearity can be detected.
• Data collected for regression analysis are often opportunistic, with little chance of choosing the X values. Unless you have some prior information on the distribution of the X values, it is difficult to determine the power.
• The formulae are clumsy to compute by hand, and most power packages tend not to have modules for power analysis of regression.
For a power analysis, the information required is similar to that requested for ANOVA designs:
• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.
• Effect size. In ANOVA, power deals with the detection of differences among means. In regression analysis, power deals with the detection of slopes that are different from zero. Hence, the effect size is measured by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X.
• Sample size. Recall that in ANOVA with more than two groups, the power depended not only on the sample size per group, but also on how the means were separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at each of the two extremes of the X space – but at the cost of not being able to detect non-linearity.
• Standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.
This problem of power and sample size for regression is beyond what we can cover in this chapter. JMP and R currently do not include a power computation module for regression analysis; however, SAS (Version 9+) includes a power analysis module (GLMPOWER). Please consult suitable help for details.
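For the simple special case of a straight-line trend with one observation per year and a known residual standard deviation, the power of the slope test can be computed from the noncentral t distribution. The sketch below is an illustration under those stated assumptions, not a substitute for a package such as GLMPOWER:

```python
import numpy as np
from scipy import stats

def slope_power(n_years, slope, sd_resid, alpha=0.05):
    """Approximate power of the two-sided t-test of zero slope, assuming one
    observation per year at x = 0, 1, ..., n_years - 1."""
    x = np.arange(n_years)
    sxx = np.sum((x - x.mean()) ** 2)
    se_slope = sd_resid / np.sqrt(sxx)  # standard error of the estimated slope
    ncp = slope / se_slope              # noncentrality parameter
    df = n_years - 2
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

# More years of monitoring give more power to detect the same trend
print(slope_power(10, slope=0.05, sd_resid=0.2))
print(slope_power(20, slope=0.05, sd_resid=0.2))
```

Extending the design (more years) increases the spread of the X values, which is why power grows quickly with the length of the monitoring series.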
However, the problem simplifies considerably when the X variable is time and interest lies in detecting a trend (increasing or decreasing) over time. A linear regression of the quantity of interest against time is commonly used to evaluate such a trend. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required.

The analysis of trend data and power/sample size computations is treated in a following chapter.
1.4.15 The perils of R²
R² is a "popular" measure of the fit of a regression model and is often quoted in research papers as evidence of a good fit, etc. However, there are several fundamental problems with R² which, in my opinion, make it less desirable. A nice summary of these issues is presented in Draper and Smith (1998, Applied Regression Analysis, pp. 245–246).

Before exploring this, how is R² computed, and how is it interpreted?

While I haven't discussed the decomposition of the Error SS into Lack-of-Fit and Pure Error, this can be done when there are replicated X values. A prototype ANOVA table would look something like:
Source            df            SS
Regression        p − 1         A
Lack-of-fit       n − p − ne    B
Pure error        ne            C
Corrected Total   n − 1         D

where there are n observations and a regression model is fit with p − 1 X variables over and above the intercept.
R² is computed as

R² = SS(regression) / SS(total) = A / D = 1 − (B + C) / D

where SS(·) represents the sum of squares for that term in the ANOVA table. At this point, rerun the three examples presented earlier to find the value of R².
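As a small sketch of the computation, using the letters from the prototype table with hypothetical sums of squares:

```python
def r_squared(ss_reg, ss_lack_of_fit, ss_pure_error):
    """R^2 = A / D = 1 - (B + C) / D from the prototype ANOVA table."""
    ss_total = ss_reg + ss_lack_of_fit + ss_pure_error  # D = A + B + C
    return ss_reg / ss_total

# Hypothetical sums of squares: A = 80, B = 12, C = 8, so D = 100
print(r_squared(80.0, 12.0, 8.0))  # 0.8
```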
For example, in the fertilizer example, the ANOVA table is:

Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio   p-value
Model    1    225.18035        225.180       69.8800   <.0001
… informative. In particular, the estimate of the slope and the SE of the slope are much more informative.

Here are some reasons why I decline to use R² very much:
• Overfitting. If there are no replicate X points, then ne = 0, C = 0, and R² = 1 − B/D. B has n − p degrees of freedom. As more and more X variables are added to the model, n − p and B become smaller, and R² must increase even if the additional variables are useless.
• Outliers distort. Outliers produce Y values that are extreme relative to the fit. This can inflate the value of C (if the outlier occurs among a set of replicate X values) or B (if the outlier occurs at a singleton X value). In either case, they reduce R², so R² is not resistant to outliers.
• People misinterpret a high R² as implying the regression line is useful. It is tempting to believe that a higher value of R² implies that a regression line is more useful. But consider the pair of plots below: the graph on the left has a very high R², but the change in Y as X varies is negligible. The graph on the right has a lower R², but the average change in Y per unit change in X is considerable. R² measures the "tightness" of the points about the line – the higher value of R² on the left indicates that the points fit the line very well. The value of R² does NOT measure how much actual change occurs.
• The upper bound is not always 1. People often assume that a low R² implies a poorly fitting line. But if you have replicate X values, then C > 0, and the maximum value of R² for the problem can be much less than 100% – it is mathematically impossible for R² to reach 100% with replicated X values. In the extreme case where the model "fits perfectly" (i.e. the lack-of-fit term is zero), R² can never exceed 1 − C/D.
• No-intercept models. If there is no intercept, then D = Σ(Yi − Ȳ)² does not exist, and R² is not really defined.
• R² gives no additional information. In actual fact, R² is a 1-1 transformation of the slope and its standard error, as is the p-value. So there is no new information in R².
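For simple linear regression the 1-1 relationship is explicit: R² = t²/(t² + n − 2), where t is the slope's t statistic. A quick numerical check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)  # simulated data

slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)
sse = np.sum(resid**2)
sst = np.sum((y - y.mean())**2)

# t statistic for the slope
se_slope = np.sqrt(sse / (n - 2) / np.sum((x - x.mean())**2))
t = slope / se_slope

r2_direct = 1.0 - sse / sst
r2_from_t = t**2 / (t**2 + n - 2)
print(r2_direct, r2_from_t)  # identical up to floating-point rounding
```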
• R² is not useful for non-linear fits. R² is really only useful for linear fits in which the estimated regression line is free to have a non-zero intercept. The reason is that R² is really a comparison between two models. For example, refer back to the length-weight relationship examined earlier.
In the linear-fit case, the two models being compared are

log(weight) = log(b0) + error

vs.

log(weight) = log(b0) + b1 × log(length) + error

and so R² is a measure of the improvement with the regression line. [In actual fact, it is a 1-1 transform of the test that β1 = 0, so why not use that statistic directly?] In the non-linear-fit case, the two models being compared are:

weight = 0 + error

vs.

weight = b0 × length^b1 + error

The model weight = 0 is silly, and so R² is silly.

Hence, the R² values reported are really all for linear fits – it is just that sometimes the actual linear fit is hidden.
• Not defined in generalized least squares. There are more complex fits that don't assume equal variance around the regression line. In these cases, R² is again not defined.
• Cannot be used with different transformations of Y. R² cannot be used to compare models fit to different transformations of the Y variable. For example, many people try fitting a model to Y and to log(Y) and choose the model with the higher R². This is not appropriate, as the D terms are no longer comparable between the two models.
• Cannot be used for non-nested models. R² cannot be used to compare models with different sets of X variables unless one model is nested within the other (i.e. all of the X variables in the smaller model also appear in the larger model). So using R² to compare a model with X1, X3, and X5 to a model with X1, X2, and X4 is not appropriate, as these two models are not nested. In these cases, AIC should be used to select among models.
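As a rough illustration of using AIC rather than R² to compare non-nested models, the sketch below (Python, with simulated data; the helper `aic_linear` is invented here for illustration and is not part of any package) computes AIC = n·log(RSS/n) + 2k for two non-nested straight-line models:

```python
import numpy as np

def aic_linear(y, X):
    """AIC for an ordinary least-squares fit of y on the columns of X
    (the caller must include an intercept column in X)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1          # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(1)
n = 50
x1, x2 = rng.normal(size=(2, n))
y = 2 + 3 * x1 + rng.normal(scale=0.5, size=n)   # truth depends on x1 only

ones = np.ones(n)
aic_m1 = aic_linear(y, np.column_stack([ones, x1]))  # model using X1
aic_m2 = aic_linear(y, np.column_stack([ones, x2]))  # non-nested model using X2
print(aic_m1 < aic_m2)   # the X1 model should have the smaller AIC
```

The model with the smaller AIC is preferred; unlike R², this comparison is legitimate even though neither model is nested in the other.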
1.5 A no-intercept model: Fulton's Condition Factor K

It is possible to fit a regression line that has an intercept of 0, i.e., goes through the origin. Most computer packages have an option to suppress the fitting of the intercept.

The biggest 'problem' lies in interpreting some of the output - some of the statistics produced are misleading for these models. As this varies from package to package, please seek advice when fitting such models.

The following is an example of where such a model may be sensible.

Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of fish within a lake?
©2012 Carl James Schwarz. November 23, 2012
In general, the relationship between fish weight and length follows a power law:

   W = a L^b

where W is the observed weight; L is the observed length; and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.

There are at least eight different measures of condition which can be found by a simple literature search. Cone (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship.
One common measure is Fulton's⁹ K:

   K = Weight / (Length/100)³
This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions and specific gravity do not change.

How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?
The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded. The data are available in the rainbow-condition.jmp data file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A portion of the raw data appears below:

9 There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor - Setting the Record Straight. Fisheries, 31, 236-238.
K was computed for each individual fish, and the resulting histogram is displayed below:

There is a range of condition numbers among the individual fish, with an average (among the fish caught) K of about 13.6.
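The calculation of K is just the formula above applied fish by fish. A minimal sketch in Python (the weights and lengths below are made-up numbers for illustration, not the Ministry's rainbow trout data):

```python
import numpy as np

# Hypothetical weight (g) and length (mm) measurements -- NOT the
# actual rainbow-condition.jmp data.
weight = np.array([ 510.,  820., 1010.,  300., 1450.])
length = np.array([ 340.,  390.,  420.,  280.,  480.])

# Fulton's condition factor for each fish: K = W / (L/100)^3
K = weight / (length / 100.0) ** 3

print(np.round(K, 2))           # per-fish condition factors
print(round(K.mean(), 2))       # average condition among the fish caught
```

The average of the per-fish K values plays the role of the "average among the fish caught" reported above; how well it represents the whole lake depends on the sampling design, as discussed next.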
Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish.

Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, its selectivity curve makes it more likely to capture fish of a certain size. In this experiment, several different mesh sizes were used to try to ensure that fish of all sizes had an equal chance of being selected.
As well, regression methods have an advantage in that a simple random sample from the population is no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between the yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with a range of fertilities, or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.
Fulton's index is often re-expressed for regression purposes as:

   W = K (L/100)³

This looks like a simple regression between W and (L/100)³ but with no intercept.
A plot of these two variables:
shows a tight relationship among fish, but with possibly increasing variance with length.

There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the "error" in the regression is in the vertical direction, i.e. the fit conditions on the observed lengths. However, the structural relationship between weight and length likely has error in both variables. This leads to the error-in-variables problem in regression, which has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.
JMP can be used to fit the regression line constraining the intercept to be zero by using the Fit Special option under the red-triangle:
This gives rise to the fitted line and statistics about the fit:
Note that R² really doesn't make sense in cases where the regression is forced through the origin, because the null model to which it is being compared is the line Y = 0, which is silly. 10 For this reason, JMP does not report a value of R².

The estimated value of K is 13.72 (SE 0.099).
The residual plot:

shows clear evidence of increasing variation with length. This usually implies that a weighted regression is needed, with weights proportional to 1/length². In this case, such a regression gives essentially the same estimate of the condition factor (K̂ = 13.67, SE = 0.11).
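A through-the-origin fit has the closed form K̂ = Σxy/Σx² (and Σwxy/Σwx² in the weighted case, with weights w). A sketch with simulated fish, assuming a true K of 13.7 and noise that grows with length, as in the residual plot:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated fish: lengths in mm; weights generated from W = K*(L/100)^3
# with K = 13.7 and error standard deviation growing with length.
K_true = 13.7
length = rng.uniform(250, 500, size=200)
x = (length / 100.0) ** 3
weight = K_true * x + rng.normal(scale=0.02 * length, size=200)

# Through-the-origin least squares: K_hat = sum(x*y) / sum(x^2)
K_ols = np.sum(x * weight) / np.sum(x * x)

# Weighted version with weights proportional to 1/length^2
w = 1.0 / length ** 2
K_wls = np.sum(w * x * weight) / np.sum(w * x * w ** 0 * x) if False else \
        np.sum(w * x * weight) / np.sum(w * x * x)

print(round(K_ols, 2), round(K_wls, 2))
```

As in the notes, the weighted and unweighted estimates agree closely when the relationship is tight; the weighting mainly affects the standard error.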
Comparing condition factors

This dataset has a number of sub-groups - do all of the subgroups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. This is covered in more detail in the chapter on the Analysis of Covariance (ANCOVA).
1.6 Frequently Asked Questions - FAQ

1.6.1 Do I need a random sample; power analysis
A student wrote:

I am studying the hydraulic geometry of small, steep streams in Southwest BC (abstract attached). I would like to define a regional hydraulic geometry for a fairly small hydrologically/geologically homogeneous area in the coast mountains close to SFU. Hydraulic geometry is the study of how the primary flow variables (width, depth, and velocity) change with discharge in a stream. Typically, a straight regression line is fitted to data plotted on a log-log plot. The equation is of the form w = aQ^b, where a is the intercept, b is the slope, w is the water surface width, and Q is the stream discharge.

10 Consult any of the standard references on regression, such as Draper and Smith, for more details.
I am struggling with the last part of my research proposal, which is how do I select (randomly) my field sites and how many sites are required. My supervisor suggests that I select stream segments for study based on a priori knowledge of my field area and select streams from across it. My argument is that to define a regionally applicable relationship (not just one that characterizes my chosen sites) I must randomly select the sites.

I think that GIS will help me select my sites, but I have the usual questions of how many sites are required to give me a certain level of confidence and whether or not I'm on the right track. As well, the primary controlling variables that I am looking at are discharge and stream slope. I will be plotting the flow variables against discharge directly, but will deal with slope by breaking my stream segments into slope classes - I guess that the null hypothesis would be that there is no difference in the exponents and intercepts between slope classes.
You are both correct!

If you were doing a simple survey, then you are correct in that a random sample from the entire population must be selected - you can't deliberately choose streams.

However, because you are interested in a regression approach, the assumption can be relaxed a bit. You can deliberately choose the values of the X variables, but must randomly select from streams with similar X values.
As an analogy, suppose you wanted to estimate the average length of male adult arms. You would need a random sample from the entire population. However, suppose that you were interested in the relationship between body height (X) and arm length (Y). You could deliberately choose which X values to measure - indeed, it would be a good idea to get a good contrast among the X values, i.e. find people who are 4 ft, 5 ft, 6 ft, and 7 ft tall, measure their height and arm length, and then fit the regression curve. However, at each height level, you must now choose randomly among those people that meet that criterion. Hence you could deliberately choose to have 1/4 of people who are 4 ft tall, 1/4 who are 5 ft tall, 1/4 who are 6 ft tall, and 1/4 who are 7 ft tall - quite different from the proportions in the population - but at each height level you must choose people randomly, i.e. don't always choose skinny 4 ft people and over-weight 7 ft people.
Now sample size is a bit more difficult, as the required sample size depends both on the number of streams selected and on how they are scattered along the X axis. For example, the highest power occurs when observations are evenly divided between the very smallest and the very largest X values. However, without intermediate points, you can't assess linearity very well. So you will want points scattered across the range of X values.
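The effect of the design on power can be seen from the standard-error formula SE(b̂1) = σ/√Σ(x − x̄)²: pushing points toward the extremes inflates Σ(x − x̄)² and shrinks the SE. A small sketch comparing two hypothetical ten-site designs (σ is an assumed residual standard deviation):

```python
import numpy as np

def slope_se(x, sigma=1.0):
    """Standard error of the estimated slope for a given design of X values,
    assuming residual standard deviation sigma: SE = sigma / sqrt(sum((x - xbar)^2))."""
    x = np.asarray(x, dtype=float)
    return sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

# Ten sites spread evenly vs. ten sites split between the two extremes.
even     = np.linspace(0, 10, 10)
extremes = np.array([0.0] * 5 + [10.0] * 5)

print(slope_se(even), slope_se(extremes))
# The extreme design gives the smaller SE (higher power), but it leaves no
# intermediate points with which to check linearity.
```

This is why a compromise design - most points near the extremes, a few in the middle - is often recommended.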
If you have some preliminary data, a power/sample-size analysis can be done using JMP, SAS, and other packages. If you do a Google search for "power analysis regression", there are several direct links to examples. Refer to the earlier section of the notes.
Chapter 2

Detecting trends over time

2.1 Introduction

As the following graphs show, tests for trend are one of the most common statistical tools used. 1

1 The astute reader may note the discrepancy between the headline and the apparent trend in the graph. Why?
Trend analysis is often used as the endpoint for many monitoring designs, i.e. is the monitored variable increasing or decreasing? Some nice references for planning monitoring studies are:

• USGS Patuxent Wildlife Research Centre's Manager's Monitoring Manual, available at http://www.pwrc.usgs.gov/monmanual/.

• US National Parks Service guidelines on designing a monitoring study, available at http://science.nature.nps.gov/im/monitor/index.htm.

• Elzinga, C.L. et al. (2001). Monitoring Plant and Animal Populations. Blackwell Science, Inc.
There are many types of trends that can exist. For example, a simple step function is an example of a trend where the measured quantity Y increases after some intervention. These types of trends are commonly analyzed using the t-test or Analysis of Variance (ANOVA) methods covered in other parts of these notes.
The trend may be a gradual linear increase over time:
For example, as the amount of trees cleared increases over time, the turbidity of water in a stream may increase. In many cases, a regression analysis is used to test for trends in time. In these cases, the X variable is time and the Y variable is some response variable of interest. This is the main focus of this chapter.

In some cases the trend is monotonic but non-linear:
In the case of non-linear trends, a transformation is often used to try to linearize the trend (e.g. a log transform). This is often successful, in which case methods for linear regression can be used, but in some cases there is no obvious transformation. The trend can then be modeled by a function of arbitrary shape. A very general methodology called Generalized Additive Models can be used to fit very general functions. These are beyond the scope of this course.
Sometimes the linear trend changes at some point (called the break point):
If the break point is known in advance, the model is easily fit using multiple regression methods, but this is beyond the scope of these notes. If the break point is unknown, this is a difficult statistical problem; refer to Toms and Lesperance (2003) 2 for help.
Helsel and Hirsch (2002) 3 summarize a number of methods used to detect trends. The following table is adapted from their manual:
2 Toms, J.D. and Lesperance, M.L. (2003). Piecewise regression: a tool for identifying ecological thresholds. Ecology, 84, 2034-2041.

3 Helsel, D.R. and Hirsch, R.M. (2002). Statistical Methods in Water Resources. Chapter 12. Available at http://pubs.usgs.gov/twri/twri4a3/
Trends with NO seasonality

• Nonparametric.
Not adjusted for X: Kendall trend test on Y vs. T.
Adjusted for X: Kendall trend test on residuals R from a smoothing fit (e.g. LOWESS) of Y on X.

• Mixed.
Not adjusted for X: none.
Adjusted for X: Kendall trend test on residuals R from a regression of Y on X. 4

• Parametric.
Not adjusted for X: Regression of Y on T.
Adjusted for X: Regression of Y on X and T.

Trends with seasonality

• Nonparametric.
Not adjusted for X: Seasonal Kendall test of Y on T.
Adjusted for X: Seasonal Kendall test on residuals R from a smoothing fit (e.g. LOWESS) of Y on X.

• Mixed.
Not adjusted for X: Regression of deseasonalized Y on T, e.g. after subtracting the seasonal means.
Adjusted for X: Seasonal Kendall trend test on residuals from a regression of Y on X.

• Parametric.
Not adjusted for X: Regression of Y on T and seasonal terms, e.g. ANCOVA or sin/cos regression.
Adjusted for X: Regression of Y on X, T, and seasonal terms.

Notation: Y response variable; T time variable; X exogenous variable; R residuals.
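For the non-seasonal, unadjusted case, the Kendall trend test amounts to counting concordant minus discordant pairs in the series (the S statistic) and comparing it to its null distribution. A bare-bones sketch, with no tie or autocorrelation corrections (production work should use a vetted implementation such as `scipy.stats.kendalltau`):

```python
import math
import numpy as np

def mann_kendall(y):
    """Mann-Kendall test for a monotonic trend in the series y.
    Returns (S, z, two-sided p); ignores ties and autocorrelation."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # S = number of increasing pairs minus number of decreasing pairs
    s = sum(np.sign(y[j] - y[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity-corrected normal approximation
    z = (s - np.sign(s)) / math.sqrt(var_s) if s != 0 else 0.0
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return s, z, p

# A short, made-up yearly series with an upward drift
s, z, p = mann_kendall([2.1, 2.4, 2.3, 2.9, 3.2, 3.1, 3.6, 3.8])
print(s, round(z, 2), round(p, 4))
```

Being based only on the signs of pairwise differences, the test is insensitive to outliers and to monotone transformations of Y, which is why it occupies the nonparametric cells of the table.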
In these notes we will look at linear trends fit using regression models and non-parametric methods. We will also look at how to pool two or more sites to see if they have a common linear trend. In trends over time, there are often problems of autocorrelation or seasonality; methods to deal with these problems will be discussed.

At this time, however, adjusting for other exogenous variables (X) will not be discussed. Methods to deal with step-trends are covered in other chapters.
4 Alley (1988) shows that increased power is obtained by doing the Kendall test on the residuals of Y vs. X and of T vs. X. This removes any drift in X over time as well.
2.2 Simple Linear Regression

We will begin by using the methods of linear regression (covered in an earlier part of these notes) as applied to trends over time.

Trend analysis is a special case of linear regression analysis, but it also has some features, fairly common when dealing with trends, that don't have exact counterparts in regular regression:

• Testing for a common trend (a special case of ANCOVA)

• Dealing with process vs. sampling variation

• Dealing with autocorrelation of residuals

For most of this chapter, we will assume that X is measured in years (e.g. calendar year).

The same sampling model, assumptions, estimation, and hypothesis testing methods are used as in the regular regression case, with appropriate modifications to deal with X as time. These will be reviewed again below.
2.2.1 Populations and samples

The population of interest is the set of Y variables as measured over time (X). In most cases in trend analysis, random sampling from some larger population of time points really doesn't make sense. Rather, the time values (the X values) are pre-specified. For example, measurements could be taken every year, or every two years, etc.

We wish to summarize the relationship between Y and time (X), and furthermore wish to make predictions of the Y value for future time (X) values that may be observed from this population. We may also wish to do inverse regression, i.e. predict at what time Y will reach a certain value.
If this were physics, we might conceive of a physical law between Y and time (e.g. distance = velocity × time). However, in ecology, the relationship between Y and time is much more tenuous. If you draw a scatter-plot of Y against time, the points will NOT fall exactly on a straight line. Rather, the value of Y will fluctuate above or below a straight line at any given time value.

We denote this relationship as

   Y = β0 + β1 X + ε

where we remember that X is now time rather than some other predictor variable. Now β0 and β1 are the POPULATION intercept and slope, respectively. We say that

   E[Y] = β0 + β1 X
is the expected or average value of Y at X. 5

The term ε represents the random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points is constant over time).

Of course, we can never measure all units of the population. So a sample must be taken in order to estimate the population slope, population intercept, and standard deviation. In most trend analyses, the values of X are chosen to be equally spaced in time, e.g. measurements taken every year.

Once the data points are selected, the estimation process can proceed, but not before assessing the assumptions!
2.2.2 Assumptions

The assumptions for a trend analysis are virtually the same as for a standard regression analysis. This is not surprising, as trend analysis is really a special case of regression analysis.

Linearity

Regression analysis assumes that the relationship between Y and X is linear, i.e. a constant rate of change over time. This can be assessed quite simply by plotting Y vs. time. Perhaps a transformation is required (e.g. log(Y) vs. log(X)). Some caution is required when transformations are done, as it is the error structure on the transformed scale that is most important. As well, you need to be a little careful about the back-transformation after doing regression on transformed values.

You should also plot the residuals vs. the X (time) values. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X (time) is not linear. Alternatively, you can fit a model that includes X and X² and test if the coefficient associated with X² is zero. Unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit can be performed, where the variation of the responses at the same X value is compared to the variation around the regression line.
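The "add X² and test its coefficient" check can be sketched as follows (simulated yearly data; the helper `quad_term_t` is invented here for illustration, with the time origin shifted to the start of monitoring to keep the design matrix well-conditioned):

```python
import numpy as np

def quad_term_t(x, y):
    """t statistic for the X^2 coefficient in the fit Y = b0 + b1*X + b2*X^2.
    A large |t| suggests the straight-line model is inadequate."""
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    mse = np.sum(resid ** 2) / (n - k)          # residual variance estimate
    cov = mse * np.linalg.inv(X.T @ X)          # covariance of the coefficients
    return beta[2] / np.sqrt(cov[2, 2])

rng = np.random.default_rng(3)
t_yrs = np.arange(20, dtype=float)              # years since start of monitoring
linear = 5 + 0.30 * t_yrs + rng.normal(scale=0.5, size=20)
curved = 5 + 0.05 * t_yrs ** 2 + rng.normal(scale=0.5, size=20)

print(round(quad_term_t(t_yrs, linear), 2), round(quad_term_t(t_yrs, curved), 2))
```

The curved series produces a much larger |t| for the X² term than the genuinely linear one; as noted above, though, a small |t| does not rule out other non-linear shapes.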
Scale of Y and X

As X is time, it has an interval or ratio scale. It is further assumed that Y has an interval or ratio scale as well. This can be violated in a number of ways. For example, a numerical value is often used to represent a

5 In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fall on a straight line. In some cases, even in the absence of sampling error, the true value of Y does NOT lie on the straight line. This is known as process variation, and will be discussed later.
category, and this numerical value is used in a regression. This is not valid. Suppose that you code hair color as (1=red, 2=brown, and 3=black). Then using these values as the response variable (Y) is not sensible.
Correct sampling scheme

The Y values must be a random sample from the population of Y values at every time point.

No outliers or influential points

All the points must belong to the relationship - there should be no unusual points. The scatter-plot of Y vs. X should be examined. If in doubt, fit the model with the outlying points in and out of the model and see if this makes a difference in the fit.

Outliers can have a dramatic effect on the fitted line, as you saw in a previous chapter.
Equal variation along the line

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over time. This is assessed by looking at plots of the residuals against X to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.
Independence

Each value of Y is independent of any other value of Y. This is a common failing in trend analysis, where the measurement in a particular year influences the measurement in subsequent years.

This assumption can be assessed by again looking at residual plots against time or other variables.
Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring the X's. The assumption only states that the residuals, the differences between the values of Y and the corresponding points on the line, must be normally distributed.
This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.
X measured without error<br />
This is a new assumpti<strong>on</strong> <str<strong>on</strong>g>for</str<strong>on</strong>g> regressi<strong>on</strong> as compared to ANOVA. In ANOVA, the group membership was<br />
always “exact”, i.e. the treatment applied to an experimental unit was known without ambiguity. However,<br />
in regressi<strong>on</strong>, it can turn out that that the X value may not be known exactly.<br />
This may seem a bit puzzling in a trend analysis – after all, how can the calendar year not be known exactly? An example of the problem is when Y is an estimate of the population size which is measured over time. This is often obtained from a mark-recapture study when animals are marked in one month, and recaptured in the next month. In this case, does the population size refer to the population size at the start of the study, in the middle of the study, or the end of the study? If the same protocol was performed in all years, then it really doesn't matter, but the start and end of sampling likely varies over years (e.g. in some years sampling starts in March, in other years it starts in April) so that the interval between sampling occasions is not constant.
This general problem is called the “error in variables” problem and has a long history in statistics. More details are available in the chapter on regression analysis.
2.2.3 Obtaining Estimates
As before, we distinguish between population parameters and sample estimates. We denote the sample intercept by b0 and the sample slope by b1. The equation fit to a particular sample of points is expressed as Ŷi = b0 + b1 Xi, where b0 is the estimated intercept, and b1 is the estimated slope. The symbol Ŷ indicates that we are referring to the estimated line and not to the population line.
As in regression analysis, the best fitting line is typically found using least squares. However, in more complex situations (e.g. when accounting for autocorrelation over time), maximum likelihood methods can also be used. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line in the vertical direction as small as possible.
The estimated intercept (b0) is the estimated value of Y when X = 0. In many cases of trend analysis, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs. year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of the intercept, and it merely serves as a place holder for the line.
The estimated slope (b1) is the estimated change in Y per unit change in X. In many cases X is measured in years, so this would be the change in Y per year.
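The least-squares computation itself is straightforward. The following sketch (with made-up data, not from these notes) shows the textbook formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄, cross-checked against numpy's built-in fit:

```python
import numpy as np

# Hypothetical trend data (not from these notes), purely to show the formulas.
x = np.array([2000.0, 2001, 2002, 2003, 2004, 2005])
y = np.array([10.0, 12, 11, 14, 15, 17])

# Least squares minimizes the sum of squared vertical deviations:
#   b1 = S_xy / S_xx,   b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Fitted values on the estimated line
yhat = b0 + b1 * x

# Cross-check against numpy's built-in least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, 1)
```

Note that the fitted line always passes through the point (x̄, ȳ).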
As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Confidence intervals for the true slope and intercept can also be found.
Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X, i.e. no trend over time. More formally, the null hypothesis is:

H: β1 = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.
The alternate hypothesis is typically chosen as:

A: β1 ≠ 0

although one-sided tests looking for either a positive or negative trend are possible.
The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing these data if the hypothesis of no relationship were true.
As before, the p-value does not tell the whole story, i.e. statistical vs. biological (non)significance must be determined and assessed.
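The test itself uses the statistic t = b1 / se(b1) with n − 2 degrees of freedom. A minimal sketch with hypothetical data (the particular numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data used only to illustrate the mechanics of the test.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.0])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
rmse = np.sqrt(np.sum(resid ** 2) / (n - 2))          # residual std deviation
se_b1 = rmse / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of slope

t_stat = b1 / se_b1                                   # test of H: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)       # two-sided p-value
```

For these strongly linear data the p-value is tiny, i.e. there is strong evidence against a zero slope.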
2.2.4 Obtaining Predictions
Once the best fitting line is found it can be used to make predictions for new values of X, e.g. what is the predicted value of Y for new time points?

There are two types of predictions that are commonly made. It is important to distinguish between them as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X. 6 The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope.
In both cases, the estimate is found in the same manner – substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new “dummy” observation in the dataset with the value of Y missing, but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.

6 There is actually a third interval, the mean of the next “m” individual values, but this is rarely encountered in practice.
What differs between the two predictions is the estimate of uncertainty.
In the first case, where predictions for INDIVIDUALS are wanted, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that this estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.
In the second case, where predictions for the mean response are wanted, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.
The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.
Many textbooks have the formulae for the standard errors for the two types of predictions, but again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.
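The difference between the two intervals can be seen directly in a sketch. The data below are hypothetical, and the formulas are the standard textbook ones (the prediction interval carries an extra “1” under the square root for the individual scatter about the line):

```python
import numpy as np
from scipy import stats

# Hypothetical data; the point is the comparison of the two interval widths.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([5.2, 6.1, 6.8, 8.4, 8.9, 10.2, 10.8, 12.1, 12.7, 14.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
rmse = np.sqrt(np.sum(resid ** 2) / (n - 2))
sxx = np.sum((x - x.mean()) ** 2)

x0 = 11.0                        # new X at which to predict
yhat0 = b0 + b1 * x0
tcrit = stats.t.ppf(0.975, df=n - 2)

# Confidence interval for the MEAN response at x0
se_mean = rmse * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
ci = (yhat0 - tcrit * se_mean, yhat0 + tcrit * se_mean)

# Prediction interval for a SINGLE future response at x0: the extra "1"
# under the square root is the individual scatter about the line.
se_pred = rmse * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
pi = (yhat0 - tcrit * se_pred, yhat0 + tcrit * se_pred)
```

The prediction interval `pi` is always wider than the confidence interval `ci` at the same X.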
2.2.5 Inverse predictions
A related question is “how long before E[Y] reaches a certain point?” These inverse predictions are obtained by drawing a line across from the Y axis until it reaches the fitted line, and then following the line down until it reaches the X (time) axis. Confidence intervals for the inverse prediction are found by following the same procedure but now following the line horizontally across until it reaches one of the confidence intervals (either for the mean response or the individual response). 7

7 It is possible that the confidence intervals are one-sided (i.e. one side is either plus or minus infinity), or even that the confidence interval comes in two sections. Please consult a reference such as Draper and Smith for details.
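The point estimate for an inverse prediction simply inverts the fitted line. The intercept, slope, and target below are illustrative values only; the interval for this X would come from intersecting the target with the confidence bands, as described above:

```python
# Inverse prediction point estimate: invert yhat = b0 + b1 * x for x.
# The intercept, slope, and target below are illustrative values only.
b0, b1 = -2702.0, 1.46      # illustrative fitted line (days vs. calendar year)
y_target = 230.0            # e.g. when does the mean response reach 230 days?

x_at_target = (y_target - b0) / b1
```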
2.2.6 Residual Plots
After the curve is fit, it is important to examine if the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e. the residual from fitting a straight line is found as: residuali = Yi − (b0 + b1 Xi) = Yi − Ŷi.
There are several standard residual plots:

• plot of residuals vs. predicted (Ŷ);
• plot of residuals vs. X.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don't plot residuals vs. Y – this will lead to odd-looking plots which are an artifact of the plot and don't mean anything.
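The residuals themselves are easy to compute by hand (hypothetical data below). For a least-squares fit with an intercept they always sum to zero and are uncorrelated with X, so a good fit shows only random scatter about the zero line:

```python
import numpy as np

# Hypothetical data; residual_i = Y_i - (b0 + b1 * X_i) = Y_i - Yhat_i.
x = np.array([1.0, 2, 3, 4, 5, 6])
y = np.array([3.1, 4.8, 7.2, 8.9, 11.1, 12.9])

b1, b0 = np.polyfit(x, y, 1)
yhat = b0 + b1 * x
residuals = y - yhat

# For a least-squares fit with an intercept the residuals sum to zero and
# are uncorrelated with X, so the standard plots should show only random
# scatter about zero.  (Plot residuals vs. yhat and vs. x to inspect.)
```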
2.2.7 Example: The Grass is Greener (for longer)
D. G. Grisenthwaite, a pensioner who has spent 20 years keeping detailed records of how often he cuts his grass, has been included in a climate change study. David Grisenthwaite, 77, and a self-confessed “creature of habit”, has kept a note of cutting grass in his Kirkcaldy garden since 1984. The grandfather's data was so valuable it was used by the Royal Meteorological Society in a paper on global warming.

The retired paper-maker, who moved to Scotland from Cockermouth in West Cumbria in 1960, said he began making a note of the time and date of every occasion he cut the grass simply “for the fun of it”.
The data are presented in:

Sparks, T.H., Croxton, J.P.J., Collinson, N., and Grisenthwaite, D.A. (2005). The Grass is Greener (for longer). Weather 60, 121-123.

from which the data on the duration of the cutting season was extracted:
Year  Duration (days)
1984  200
1985  215
1986  195
1987  212
1988  225
1989  240
1990  203
1991  208
1992  203
1993  202
1994  210
1995  225
1996  204
1997  245
1998  238
1999  226
2000  227
2001  236
2002  215
2003  242
The question of interest is whether there is evidence that the lawn cutting season has increased over time.
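Before turning to JMP, the straight-line fit can be reproduced with a few lines of numpy. This is only a sketch, but its results agree with the JMP output quoted later in this example (slope about 1.46 days/year with standard error about 0.52, RMSE about 13.5 days):

```python
import numpy as np

# The cutting-season data from the table above.
year = np.arange(1984, 2004)
duration = np.array([200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
                     210, 225, 204, 245, 238, 226, 227, 236, 215, 242],
                    dtype=float)
n = len(year)

b1, b0 = np.polyfit(year, duration, 1)        # slope: about 1.46 days/year
resid = duration - (b0 + b1 * year)
rmse = np.sqrt(np.sum(resid ** 2) / (n - 2))  # about 13.5 days
se_b1 = rmse / np.sqrt(np.sum((year - year.mean()) ** 2))  # about 0.52
```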
JMP analysis

The data and JMP scripts are available in the grass.jmp file in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are entered into JMP in the usual way. Notice that two extra lines were added to the end of the data representing two years for which predictions will be made. Both variables should be continuous scale.
Use the Analyze->Fit Y-by-X platform to create a preliminary plot of the number of days between the first and last cut (Y) versus year (X):
The plot shows some evidence that the duration of the cutting season has increased over time.

We can check some of the assumptions:

• the Y and X variables are both on the proper scale.
• the relationship appears to be approximately linear.
• there are no obvious outliers.
• the variance (scatter) of points around the line appears to be approximately equal. We will check this again from the residual plot.
• there may be some evidence of autocorrelation as the line joining the raw data points seems to dip above and below the line for several years in a row. This could correspond to slowly changing effects such as a multi-year dry or wet spell. However, with only 20 data points, it is difficult to tell. We will check more formally for non-independence by looking at the residual plot and the Durbin-Watson test statistic later.
Use the red-triangle drop-down menu on the plot to select the Fit Line option. This gives:
The estimated intercept (-2702) would represent the estimated duration of the growing season in year 0 – clearly a nonsensical result. It really doesn't matter, as the intercept is just a place holder for the equation of the line. What really is of interest is the estimated slope.
The estimated slope is 1.46 (se 0.52) days/year. This means that the duration of the growing season is estimated to have increased by 1.46 days per year over the span of this study. The 95% confidence interval for the slope 8 (0.36 to 2.56) does not include the value of 0, so there is evidence against the slope actually being 0 (i.e. no change over the years).

Finally, the p-value for testing if the true slope is zero is 0.012 which again provides evidence against the hypothesis of no change in mean duration over the span of the experiment.

The estimated value of RMSE (not shown here but available in the Summary of Fit section of the output) is 13.52 days which is the estimated standard deviation of the data points around the regression line.

The confidence intervals for the mean response and the prediction intervals for the individual response are available from the red-triangle on the linear-fit box. Selecting both confidence intervals gives:

8 If the 95% confidence interval doesn't show in your output, do a right-click (Windoze) or ctrl-click (Macintosh) in the table of parameter estimates and select the 95% lower and upper limits to be displayed.
Notice how much wider the prediction intervals for individual responses are compared to the confidence interval for the mean response. You can use the cross hairs tool to select points on each of the lines to read off the values.

Unfortunately, the Analyze->Fit Y-by-X platform doesn't allow you to save the confidence intervals directly to the data table. In order to do this you need to use the Analyze->Fit Model platform:
The Y variable is the duration of the lawn cutting season, while the only effect to be entered into the effect box is that of year. After the model is fit (with identical results to what we had earlier), the predicted values, confidence intervals, and prediction intervals along with residuals and other good stuff can be saved by clicking on the drop-down red-triangle near the upper plot:
The data table will also include predictions for 2004 and 2005.
Notice the difference in width for the confidence interval for the mean response and the prediction interval for the individual response. These two intervals are often confused and it is important to keep their two uses in mind.
The residual plot is automatically given by the Analyze->Fit Model platform, but is also easily obtained from the Analyze->Fit Y-by-X platform:
It does not show any evidence of problems.
Finally, the Durbin-Watson statistic for testing the presence of autocorrelation is found in the Analyze->Fit Model platform, to give: 9

9 You may have to use the pop-down menu from the red-triangle to get the p-value.
The DW statistic should be close to 2 if there is no autocorrelation present in the data. The p-value does not indicate any evidence of a problem with autocorrelation. The estimated autocorrelation is very small (−.004) so that it is essentially zero.
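The Durbin-Watson statistic can also be computed by hand from the time-ordered residuals (a sketch only; JMP reports the statistic and its p-value directly):

```python
import numpy as np

# The cutting-season data again, with residuals kept in time order.
year = np.arange(1984, 2004)
duration = np.array([200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
                     210, 225, 204, 245, 238, 226, 227, 236, 215, 242],
                    dtype=float)

b1, b0 = np.polyfit(year, duration, 1)
e = duration - (b0 + b1 * year)               # time-ordered residuals

# DW = sum of squared successive differences / sum of squared residuals;
# approximately 2 * (1 - r), where r is the lag-1 autocorrelation.
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```

For these data DW is close to 2, consistent with the essentially zero autocorrelation reported above.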
Postscript

A more formal analysis of the data presented in the article looked at the date of first cutting, the date of last cutting, and the number of cuts as well. The authors conclude:
Despite having a relatively short span of 20 years, the data from Kirkcaldy provide biological evidence of an increase in the length of the growing season and some suggestions of what meteorological factors affect lawn growth. Strictly, we are dealing with the cutting season which is likely to underestimate the growing season.

This was quite an interesting analysis of an unusual data set!
2.3 Transformations
In some cases, the plot of Y vs. X is obviously non-linear and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0 length^β1. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log scale – pH is a common example. Often a visual inspection of a plot may identify the appropriate transformation.

There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption about the error structure. The model for a fit on transformed data is of the form

trans(Y) = β0 + β1 × trans(X) + error
Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to act on the transformed scale – in particular that the standard deviation around the regression line is constant on the transformed scale.
The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm (often called the log10 transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log10 scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log – some papers use this to refer to the ln transformation, while others use this to refer to the log10 transformation.
After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used. Then we have

ln(Y_{t+1}) = b0 + b1 × (t + 1)

and

ln(Y_t) = b0 + b1 × t.

Subtracting gives

ln(Y_{t+1}) − ln(Y_t) = ln(Y_{t+1}/Y_t) = b1 × (t + 1 − t) = b1

so that

exp(ln(Y_{t+1}/Y_t)) = Y_{t+1}/Y_t = exp(b1) = e^{b1}
Hence a one unit increase in X causes Y to be MULTIPLIED by e^{b1}. As an example, suppose that on the log scale the estimated slope was −.07. Then every unit change in X causes Y to change by a multiplicative factor of e^{−.07} = .93, i.e. roughly a 7% decline per year. 10
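A tiny numeric check of this interpretation (the slope −.07 is the illustrative value from the text):

```python
import math

# The slope -0.07 is the illustrative value from the text.
b1 = -0.07
factor = math.exp(b1)            # multiplicative change in Y per unit X
pct_change = (factor - 1) * 100  # roughly a 7% decline per year
```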
Predictions on the transformed scale must be back-transformed to the untransformed scale.
In some problems, scientists search for the ‘best’ transform. This is not an easy task and using simple statistics such as R² to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.
2.3.1 Example: Monitoring Dioxins - transformation
An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

10 It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale, i.e. if the slope is −.07 on the log scale, this implies roughly a 7% decline per year; a slope of .07 implies roughly a 7% increase per year.
Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from a monitoring station. The liver is excised and the livers from all four crabs are composited together into a single sample. 11 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
Here are the raw data:
Site  Year  TEQ
a     1990  179.05
a     1991   82.39
a     1992  130.18
a     1993   97.06
a     1994   49.34
a     1995   57.05
a     1996   57.41
a     1997   29.94
a     1998   48.48
a     1999   49.67
a     2000   34.25
a     2001   59.28
a     2002   34.92
a     2003   28.16
JMP analysis

The data is available in a JMP data file dioxinTEQ.jmp in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

As with all analyses, start with a preliminary plot of the data obtained using the Analyze->Fit Y-by-X platform.

11 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of dioxins degrades per year, e.g. a 10% decline per year. This can be expressed in a non-linear relationship:

TEQ = C r^t

where C is the initial concentration, r is the fraction remaining from one year to the next (e.g. r = .90 for a 10% decline per year), and t is the elapsed time. If this is plotted over time, this leads to the non-linear pattern seen above.
If logarithms are taken, this leads to the relationship:

log(TEQ) = log(C) + t × log(r)

which can be expressed as:

log(TEQ) = β0 + β1 × t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).
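This log-linear fit is easy to reproduce with numpy as a cross-check of the JMP analysis. Natural logs are used here, matching the fitted line reported later in this example (intercept about 218.9, slope about −.11 per year):

```python
import numpy as np

# The TEQ data from the table above; natural logs are used here.
year = np.arange(1990, 2004)
teq = np.array([179.05, 82.39, 130.18, 97.06, 49.34, 57.05, 57.41,
                29.94, 48.48, 49.67, 34.25, 59.28, 34.92, 28.16])

b1, b0 = np.polyfit(year, np.log(teq), 1)  # slope: about -0.11 per year
annual_factor = np.exp(b1)                 # about 0.90: roughly a 10% decline/year
```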
JMP can easily be used to compute log(TEQ) by using the Formula Editor in the usual fashion. A plot of log(TEQ) vs. year using the Analyze->Fit Y-by-X platform gives the following:
This looks linear over time with a steady decline. A line can be fit as before by selecting the Fit Line option from the red triangle in the upper left side of the plot:
This gives the following output:
The residual plot looks fine with no apparent problems, but the dip in the middle years could require
further exploration if this pattern were apparent at other sites as well:
The fitted line is:
log(TEQ) = 218.9 − .11(year)
The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope (−.11) is the estimated log(ratio) from one year to the next. For example, exp(−.11) = .896 would mean that the TEQ in one year is only 89.6% of the TEQ in the previous year, or about an 11% decline per year. 12
The standard error of the estimated slope is .02. A 95% confidence interval for the slope can be obtained by pressing a Right-Click (for Windoze machines) or a Ctrl-Click (for Macintosh machines) in the Parameter Estimates summary table and selecting the confidence intervals to display in the table.
The 95% confidence interval for the slope is (−.154 → −.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year, i.e. between 0.86 and 0.94 of the TEQ in one year remains to the next year.
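The back-transformation of the slope and its confidence interval can be verified directly. The slope and interval below are the rounded values quoted above, so the results are approximate.

```python
import math

slope = -0.11            # estimated slope on the log scale (rounded)
lo, hi = -0.154, -0.061  # 95% confidence interval for the slope

# Anti-log the slope: fraction of TEQ remaining from one year to the next
retained = math.exp(slope)  # about 0.90, i.e. roughly a 10% decline per year

# Anti-log the CI endpoints: interval for the retained fraction
ci_retained = (math.exp(lo), math.exp(hi))  # about (0.86, 0.94)
```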
Several types of predictions can be made. For example, what would be the estimated mean TEQ in 2010?
12 It can be shown that in regressions of log(Y) vs. time, the estimated slope on the logarithmic scale is the approximate fractional decline per time interval. For example, in the above, the estimated slope of −.11 corresponds to an approximate 11% decline per year. This approximation only works well when the slopes are small, i.e. close to zero.
This can be accomplished in several ways.
The computations could be done by hand, or by using the cross-hairs on the plot from the Analyze->Fit Y-by-X platform. Confidence intervals for the mean response, or prediction intervals for an individual response, can be added to the plot from the pop-down menu.
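For readers without JMP, the confidence interval for the mean response and the prediction interval for an individual response at a new year can be computed from the standard simple-regression formulas. The sketch below uses hypothetical log(TEQ) values, not the actual composite data.

```python
import numpy as np
from scipy import stats

# Hypothetical yearly log(TEQ) values (illustrative only)
year = np.array([1990, 1992, 1994, 1996, 1998, 2000, 2002])
logteq = np.array([4.8, 4.6, 4.5, 4.1, 4.0, 3.7, 3.6])

n = len(year)
b1, b0 = np.polyfit(year, logteq, 1)
fitted = b0 + b1 * year
s = np.sqrt(np.sum((logteq - fitted) ** 2) / (n - 2))  # residual std. dev.
sxx = np.sum((year - year.mean()) ** 2)
tcrit = stats.t.ppf(0.975, df=n - 2)

x0 = 2010                     # year for which a prediction is wanted
pred = b0 + b1 * x0           # predicted mean log(TEQ) in 2010
se_mean = s * np.sqrt(1 / n + (x0 - year.mean()) ** 2 / sxx)
se_indiv = s * np.sqrt(1 + 1 / n + (x0 - year.mean()) ** 2 / sxx)

ci_mean = (pred - tcrit * se_mean, pred + tcrit * se_mean)    # CI for the mean
pi_indiv = (pred - tcrit * se_indiv, pred + tcrit * se_indiv) # PI for an individual
```

Note that the prediction interval is always wider than the confidence interval for the mean, because it must also account for the variation of an individual observation about the line.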
However, a more powerful tool is available from the Analyze->Fit Model platform.
Start first by adding rows to the original data table corresponding to the years for which a prediction is required. In this case, the additional row would have the value of 2010 in the Year column with the remainder of the row unspecified. Missing values will be automatically inserted for the other variables.
Then invoke the Analyze->Fit Model platform:
This gives much the same output as the Analyze->Fit Y-by-X platform with a few new (useful) features, a few of which we will explore in the remainder of this section.
Next, save the prediction formula, the confidence interval for the mean, and the interval for an individual prediction to the data table (this will take three successive saves):
Now the data table has been augmented with additional columns and, more importantly, predictions for 2010 are now available:
The estimated mean log(TEQ) is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) = 13.46). A 95% confidence interval for the mean log(TEQ) is (1.94 to 3.26), corresponding to a 95% confidence interval for the actual MEDIAN TEQ of between 6.96 and 26.05. 13 Note that the confidence interval after taking anti-logs is no longer symmetrical.
Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as means and standard errors cannot simply be anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, it is assumed that the sampling distribution about the estimate is symmetrical, which makes the mean and median take the same value. So what really is happening is that the median on the transformed scale is back-transformed to the median on the untransformed scale.
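A small simulation illustrates the point: for right-skewed (here log-normal) data, anti-transforming the mean of the logs recovers the median of the original scale, not the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
# Log-normal responses: log(Y) is normal, so Y is right-skewed
y = rng.lognormal(mean=2.0, sigma=0.8, size=100_000)

# Back-transform the mean of log(Y)
back = np.exp(np.log(y).mean())

# 'back' tracks the median of Y (about exp(2) = 7.4);
# the mean of Y is noticeably larger because of the skew
```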
Similarly, a 95% prediction interval for the log(TEQ) for an INDIVIDUAL composite sample can be found.
Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units.
The Analyze->Fit Model platform has an inverse prediction function:
13 A minor correction can be applied to estimate the mean if required.
Specify the required value for Y - in this case log(10) = 2.302 - and then press the RUN button to get the following output:
The predicted year is found by solving
2.302 = 218.9 − .11(year)
and gives an estimated year of 2012.7 (JMP solves using the full-precision coefficients; the rounded values shown here will not reproduce this exactly). A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!
2.3.2 Final Words
The application of regression to non-linear problems is fairly straightforward after the transformation is made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED scale and how these relate to the untransformed scale.
2.4 Power/Sample Size
2.4.1 Introduction
A common goal in ecological research is to determine if some quantity (e.g. abundance, water quality) is tending to increase or decrease. A linear regression of this quantity against time is commonly used to evaluate such a trend. The methods presented earlier can be used in these situations without much difficulty, except for problems of autocorrelation over time (for example, if the same monitoring plots were measured repeatedly over time), and making sure that the experimental and observational units are not confused (this is similar to the problem of sub-sampling discussed earlier). 14
When designing programs to detect trends, several related questions arise. For how many years does the study have to run? What influence does the precision of the individual yearly measurements have on the length of the monitoring study? What is the power to detect a certain sized trend given a proposed study design?
As in ANOVA, these questions are answered through a power analysis. The information needed to conduct a power analysis for linear regression is similar to that required for a power analysis in ANOVA - however, the computations are more complex.
The information needed is:
• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.
14 An example of such confusion would be an investigation of the fecundity of a bird over time. Several sites covering the range of the bird are measured and many nests within each site are also measured. This study continues for a number of years. The average fecundity (over all sites and nests) is the response variable, i.e. one single number per year rather than the individual nest measurements. The reason for this is that factors that operate on the yearly scale (e.g. environmental variables) affect all nests simultaneously rather than operating on a single nest at a time independently of other nests. For example, a poor summer will depress fecundity for all nests simultaneously.
• effect size. In ANOVA, power deals with detection of differences among means. In regression analysis, power deals with detection of slopes that are different from zero. Hence, the effect size is measured by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X.
• sample size. Recall in ANOVA with more than two groups that the power depended not only on the sample size per group, but also on how the means are separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at each of the two extremes of the X space - but at the cost of not being able to detect non-linearity. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required.
• standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.
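A simulation-based power computation in the spirit of MONITOR can be sketched with these four ingredients. This is a minimal version assuming one observation per year, independent errors, and zero process variation; the slope, standard deviation, and number of years below are arbitrary illustrative values.

```python
import numpy as np
from scipy import stats

def trend_power(slope, sd, years, alpha=0.05, nsim=2000, seed=42):
    """Estimate, by simulation, the power to detect a linear trend
    with one observation per year and independent normal errors."""
    rng = np.random.default_rng(seed)
    t = np.arange(years)
    hits = 0
    for _ in range(nsim):
        y = slope * t + rng.normal(0, sd, size=years)
        res = stats.linregress(t, y)
        if res.pvalue < alpha:   # slope significantly different from 0
            hits += 1
    return hits / nsim

# e.g. a decline of 2 units/year against a residual sd of 5,
# monitored yearly for 10 years
power_10yr = trend_power(-2.0, 5.0, 10)
```

As the text suggests, power rises quickly with the number of years of monitoring, because adding years both adds observations and widens the spread of the X values.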
A very nice series of papers on detecting trends in ecological studies is available:
• Gerrodette, T. 1987. A power analysis for detecting trends. Ecology 68: 1364-1372. http://dx.doi.org/10.2307/1939220.
• Link, W. A. and Hatfield, J. S. 1990. Power calculations and model selection for trend analysis: a comment. Ecology 71: 1217-1220. http://dx.doi.org/10.2307/1937393.
• Gerrodette, T. 1991. Models for power of detecting trends - a reply to Link and Hatfield. Ecology 72: 1889-1892. http://dx.doi.org/10.2307/1940986.
• Gerrodette, T. 1993. TRENDS: software for a power analysis of linear regression. Wildlife Society Bulletin 21: 515-516.
JMP does not include a power computation module for regression analysis. However, SAS v.9+ includes a power analysis module (PROC GLMPOWER) for the power analysis of regression models, but it is a bit complex to use.
Perhaps the most common aspect of a power analysis for linear regression is the planning of a monitoring study to detect trends over time. This considerably simplifies the computations of the power, as usually the time points are equally spaced with the same number of measurements taken at each time point. There are two readily available software packages to help plan such studies. The first, TRENDS, available at http://swfsc.noaa.gov/textblock.aspx?Division=PRD&ParentMenuId=228&id=4740, is a Windoze-based program that does the computations as outlined in the above papers. Because of concerns raised by Link and Hatfield, a second program, MONITOR, available at http://www.mbr-pwrc.usgs.gov/software/monitor.html, was developed that does power computations based on simulation rather than simple formulae. It uses a web-based interface rather than running on individual machines. 15 This second program also has additional flexibility to handle situations where the monitoring points are not equally spaced in time, or there are multiple measurements taken at each time point.
15 The original author of MONITOR, James Gibbs, indicates that a Windoze version will be available in early 2005 at http://www.esf.edu/efb/gibbs/
CAUTION: Power analysis for trend can be very complex. The authors of Program Monitor have some sage advice that is applicable to both TRENDS and MONITOR:
Users should be aware (and wary) of the complexity of power analysis in general, and also acknowledge some specific limitations of MONITOR for many real-world applications. Our chief, immediate concern is that many users of MONITOR may be unaware of these limitations and may be using the program inappropriately. Below are comments from one of our statisticians on some of the aspects of MONITOR that users should be cognizant of: "There are numerous issues with how Program Monitor calculates statistical power and sample size. One issue concerns the default option whereby the user assumes independence of plots or sites from one time period to the next. If you are randomly sampling new sites or plots each time period, then it is correct to assume independence (assuming that the finite population correction factor is not an issue, which depends on how many plots or sites you are sampling relative to the total population size of potential plots or sites). If you are sampling the same plots or sites repeatedly over time, however, then the default option in Program Monitor is unlikely to give a correct calculation of statistical power or sample size. If plots or sites are positively autocorrelated over time, as is usually the case in biological surveys, then Program Monitor will underestimate sample size, or conversely, it will overestimate the statistical power. The correct sample size estimate is likely to be greater, and depending upon the amount of autocorrelation, the correct sample size could be vastly greater to achieve a stated power objective. A more fundamental issue concerns the null model one chooses for the trend in population growth. Program Monitor assumes a relatively simple linear trend in population growth, but this is a controversial issue, because there are potentially an infinite number of models one could use. If pilot data are available, then it may be possible to estimate autocorrelation and try to make some choices concerning the type of model to use as the null model for a power calculation, but regardless of how you decide to proceed, it would be a good idea to consult a statistician to determine an approach that fits your needs and data. No matter what additional flexibility is built into the modeling, however, it will always be possible to posit the existence of further structure which, if overlooked, will produce misleading results. For a pertinent discussion of some of these issues, please see Elzinga et al. (1998). Although this reference deals specifically with plant populations, the fundamental statistical issues are similar whether you are sampling plant or animal populations. Literature Citation: Elzinga, C.L., D.W. Salzer, and J.W. Willoughby. 1998. Measuring and monitoring plant populations. BLM Technical Reference 1730-1, Denver, CO. 477 pages."
Some care must also be taken to distinguish between sampling variation and process variation.
[Figure: Process vs Sampling Variation - a population trajectory over time, with braces marking the two variance components.] Sampling variation refers to the uncertainty of each measurement in each year; this can be reduced by increasing the sampling effort in each year. Process variation refers to the fact that even if the data values were known exactly, the points would not lie on the straight line; process variation is unaffected by the sampling effort in each year.
Sampling variation is the size of the standard error when estimates are made at each sampling occasion. Sampling variation can be reduced by increasing sampling effort (e.g. more measurements per occasion). Process variation refers to the variation around the perfect linear regression even if there were no uncertainty in each individual observation. Process variation cannot be reduced by increasing sampling effort. At the moment, Program Monitor assumes that process variation is 0, i.e. if you knew each data point exactly, they would all fit exactly on the linear trend. There are a number of web pages that discuss this issue in more detail - do a simple search using a search engine.
2.4.2 Getting the necessary information
As noted earlier, the information required to do a power analysis is similar to that for ANOVA. We will concentrate on relevant quantities for a trend analysis over time rather than a general regression situation. I will use population size as my response variable, but any other ecological quantity could be used.
α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.
Effect size. In trend analysis, this is traditionally specified as the rate of change per unit time and denoted by r. For example, a value of r = .02 = 2% corresponds to an (increasing) change of 2% per year. Both TRENDS and MONITOR allow for both linear and exponential trends. In linear trends, the population size changes by the same fixed percentage of the initial population each year. So if the initial population was 1000 animals, a 2% decline per year would correspond to a fixed change of .02 × 1000 = 20 animals per year, i.e. 1000, 980, 960, 940, 920, 900, . . .
In exponential trends, the change is multiplicative each year. So if the initial population was 1000 animals, a 2% (multiplicative) decline corresponds to 1000 × .98 = 980 animals in the next year, 980 × .98 = 1000 × .98² = 960.4 in the next year, followed by 941.2, 922.4, 904, 885, etc. in subsequent years.
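The two kinds of trend can be generated side by side, using the 1000-animal, 2%-decline example above:

```python
import numpy as np

n0, rate = 1000, 0.02
t = np.arange(6)

# Linear trend: lose the same 20 animals (2% of the INITIAL size) each year
linear = n0 - rate * n0 * t

# Exponential trend: lose 2% of the CURRENT size each year
exponential = n0 * (1 - rate) ** t
```

For these settings the linear trajectory is 1000, 980, 960, 940, 920, 900, while the exponential trajectory is 1000, 980, 960.4, 941.2, 922.4, 903.9; the two diverge slowly because the rate is small.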
If the rate is small, then an exponential and a linear trend will be very similar for short time series - they can be quite different if the rate is large and/or the time series is very long.
Individuals monitoring populations often think of long-term trends in populations, such as: how many plots do I need to monitor to detect a 10% reduction in this population over a 10 year period? This overall change must be converted to a rate per unit time. The MONITOR home page has a trend converter, but the computations are relatively simple.
For linear trends, the rate is found as:
r = R / (n − 1)
where R is the overall fractional change in abundance over the n years. For example, a 10% reduction over 10 years has R = −.1 and n = 10, leading to:
r = −.1 / (10 − 1) = −.011
or just over a 1% reduction per year.
For exponential trends, the rate is found as:
r = (R + 1)^(1/(n−1)) − 1
where R is the overall fractional change in abundance over the n years. For example, a 10% reduction over 10 years has R = −.1 and n = 10, leading to:
r = (.9)^(1/9) − 1 = −.0116
or just over a 1% reduction per year. Again note that for small reductions and a small number of years, both a linear and an exponential trend have similar rates.
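Both conversions can be checked numerically for the 10%-over-10-years example:

```python
# Per-year rate implied by a 10% overall reduction (R = -0.10)
# over n = 10 years, using the two conversion formulas above
R, n = -0.10, 10

r_linear = R / (n - 1)                        # about -0.0111 per year
r_exponential = (R + 1) ** (1 / (n - 1)) - 1  # about -0.0116 per year
```

As the text notes, the two rates are nearly identical for a change this small over this few years.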
Sample size. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required. TRENDS requires fixed sampling intervals while MONITOR allows for some flexibility in the timing of the monitoring.
Standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.
In many cases, the standard deviation is not directly available; rather, the variability of the estimates of the individual observations is reported as the relative standard error (cv = stddev/mean). TRENDS uses the cv while MONITOR uses the actual standard deviation.
Gibbs (2000) 16 summarizes typical cvs for measuring a number of types of populations:
16 Gibbs, J. P. (2000). Monitoring Populations. Pages 213-252 in Research Techniques in Animal Ecology, Boitani, L. and Fuller, T. K., eds., Columbia University Press.
Group                     cv
Large mammals             15%
Grasses and sedges        20%
Herbs, compositae         20%
Herbs, non-compositae     20%
Turtles                   35%
Salamanders               35%
Large bodied birds        35%
Lizards                   40%
Fishes, salmonids         50%
Caddis flies              50%
Snakes                    55%
Dragonflies               55%
Small bodied birds        55%
Beetles                   60%
Small mammals             60%
Spiders                   65%
Medium sized mammals      65%
Fishes, non-salmonids     70%
Salamanders (aquatic)     85%
Moths                     90%
Frogs and toads           95%
Bats                      95%
Butterflies              110%
Flies                    130%
If necessary, these can be converted to a standard deviation if the initial density is approximately known, by multiplying the cv by the initial density. For example, if the initial density is 25 mice/hectare, then the approximate standard deviation (for small mammals) would be found as 25 mice/hectare × 60% = 15 mice/hectare.
Finally, even if all else is equal, the variation often changes with the change in abundance over time. Gerrodette (1987) examines three cases:
• the cv is constant over time.
• the cv is proportional to √abundance.
• the cv is proportional to 1/√abundance.
Many sampling methods give cvs that are proportional to 1/√abundance. The TRENDS program allows you to select an appropriate relationship. Again, for small time scales, there isn't much of a difference in results among the different relationships of cv and abundance.
The cv may be improved if multiple, independent samples are taken each year. If m independent samples are taken each year, then the corresponding cv value is:

cv_average = cv_individual / √m

Both programs do this computation automatically if you specify that the effort is increased at each sampling occasion. Note that you are implicitly assuming that process variation is 0 in these cases, i.e. if perfect information were known, the abundance would lie exactly on the trend line. This may not be a suitable assumption, and some care is needed if a large amount of sampling is to be done in each year to try to get the cv of the estimates down to a reasonable level - the payoff may not be as great as expected.
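The effect of averaging m independent samples follows directly from the formula above. A short sketch (the 60% cv is the small-mammal value quoted earlier; the function names are my own):

```python
import math

def cv_of_average(cv_individual, m):
    """cv of the mean of m independent samples taken on one occasion."""
    return cv_individual / math.sqrt(m)

def samples_needed(cv_individual, cv_target):
    """Smallest m with cv_individual / sqrt(m) <= cv_target."""
    return math.ceil((cv_individual / cv_target) ** 2)

# A small-mammal survey whose individual samples have cv = 60%:
print(cv_of_average(0.60, 4))      # 4 samples/year halve the cv to 0.30
print(samples_needed(0.60, 0.20))  # 9 samples/year to reach cv = 20%
```

The square root is the source of the "payoff may not be as great as expected" caveat: halving the cv always costs four times the sampling effort.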
2.4.3 How does power vary as information changes?
A nice discussion of some of the issues in sample size for trend analysis is found at http://www.pwrc.usgs.gov/monmanual/samplesize.htm which is reproduced here for convenience:
Managers' Monitoring Manual | Setting Sample Size
Patuxent Wildlife Research Center
Figuring out how many samples you need
The number of samples you need is affected by the following factors:
• Project goals
• How you plan to analyze your data
• How variable your data are or are likely to be
• How precisely you want to measure change or trend
• The number of years over which you want to detect a trend
• How many times a year you will sample each point
• How much money and manpower you have
Here are some graphs that illustrate some of these trade-offs. These graphs were made using the assumption that you would be analyzing your data using simple linear regression. Each graph isolates one factor and looks at how altering that factor affects sample size. Those factors are explained in greater detail below.
In general, you can lower your sample size requirements by adopting the following approaches:
• Aim to detect only long-term changes
• Set your analytical tests to P
• Must be located randomly or uniformly throughout the study area
• Must detect a constant proportion of the individuals (or estimate the differences)
• Be precise enough to detect the types of changes you want to detect
Issues of bias, sample placement, and choosing your counting technique have been discussed elsewhere in this web site. Here we will help you determine whether your monitoring program has a sufficient number of sampling locations (sample size) to detect the types of changes you have set forth as your goal.
So, what is a sufficient sample size?
To answer that you need to address three things:
1. What is the inherent variability of your counts?
2. What magnitude of trend do you want to detect, and how precisely would you like to measure it?
3. How are you going to statistically test for population change?
Count Variation
Count variation is simply a measure of how your counts fluctuate from year to year. Variation affects your ability to detect trends: obviously, if the data fluctuate greatly you will not have the resolution to find an increasing or decreasing trajectory in the population you are monitoring.
Basic rule of thumb: The more variable your counts, the more samples you will need to detect a change or trend of a given magnitude. Conversely, for any given sample size, the more your counts vary, the lower your ability to detect trends.
Sample size calculations need an estimate of count variation. You can get such an estimate from your own pilot data (the mean and standard deviation) or from estimates taken from other, similar situations. We provide some of those estimates for amphibians and birds (point counts and territory mapping). Note that these are calculations of fluctuation over time, not over space, meaning that you calculate a mean and standard deviation of the counts across several years at one point, and not a mean and standard deviation among several points.
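The distinction between variation over time and variation over space is easy to get wrong when you sit down with a spreadsheet. A minimal sketch (the counts below are invented for illustration):

```python
import statistics as stats

# Hypothetical yearly counts at two points (keys = points, lists = years).
counts = {
    "point A": [12, 15, 9, 14, 11],
    "point B": [31, 26, 35, 28, 30],
}

# RIGHT for these sample-size calculations: cv of the counts ACROSS YEARS
# at a single point.
for point, yearly in counts.items():
    cv = stats.stdev(yearly) / stats.mean(yearly)
    print(f"{point}: cv over time = {100 * cv:.0f}%")

# WRONG for this purpose: cv AMONG POINTS within a single year.
year_one = [yearly[0] for yearly in counts.values()]  # [12, 31]
cv_space = stats.stdev(year_one) / stats.mean(year_one)
print(f"cv over space (year 1) = {100 * cv_space:.0f}%")
```

In this toy example the spatial cv is several times the temporal cv, which shows how badly a sample-size calculation can be thrown off by using the wrong one.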
Be aware, when using counts from other studies, that count variances are specific to the counting technique and to how the original study pooled their samples. Additionally, as you can see from these collections of count variances, even when using the same counting technique, on the same species, in the same region, the degree of variability in the resulting counts usually differs (often greatly) from study to study or even from site to site. The good news is that reviews of long-term studies have shown that, at any individual site, the variability in counts remains about the same. This is another strong, strong reason to review your monitoring program after 5 years, to see if you have been adequately sampling your populations.
Basic rule of thumb: Use the estimates of variability of counts from other studies as a general guide to what you might expect in yours, but it is wise to check the variability of your own counts after your program has been established for 5 years to see if your sampling strategy needs to be revised.
Helpful hints on how to decrease variability
Philosophical Considerations
Trend. Trend can be defined as change over time. More apropos to a monitoring program would be to define trend as some specific rate of change over a specific period of time. Most calculations for determining sample size require that you specify a minimum rate of change you would like to detect and a minimum time period over which you would detect those changes. Those minimums now become the targets our monitoring program will aim to achieve or beat. In other words, by appropriately setting our sample sizes we hope to be able to detect a trend as small as or smaller than those minimums which we have targeted.
Basic rule of thumb: The smaller the population change you would like to detect, the greater the number of samples you will need to detect it.
Another basic rule of thumb: The fewer the number of years over which you would like to detect a trend, the greater the number of samples you will need.
Grand rule of thumb: Any monitoring program whose goal is to detect small population changes over just a few years will be expensive to create.
Precision. Calculators of sample size also need to know how precisely you want to measure these changes. An imprecisely measured trend is a very unsatisfactory trend in that you are unsure of how well it really reflects the REAL changes in the animal populations on your land. On the other hand, a very precisely measured trend can be very costly to obtain, because you will have to spend a great deal of your budget to achieve that level of precision. So, think of your precision goal as your willingness to risk being wrong about the population change you are trying to measure. You need to determine the amount of risk you are willing to take in your monitoring program and understand the consequences of that decision, both as a cost to your budget and in the probability of being wrong.
Basic rule of thumb: The lower the precision, the lower the number of samples you will need. Conversely, the higher the precision, the larger the number of samples needed.
You control precision and risk using two statistical settings: alpha and power. Because most basic statistical books and quite a number of web sites cover these parameters well and are very accessible, we will not cover them here, but we do want to highlight a few considerations relevant to the estimation of sample sizes for monitoring programs.
Alpha Level. Because animal populations and their counts will vary for a number of reasons, the data from your monitoring program just cannot be expected to produce nice straight lines when you finally plot them out. Because of this imprecision we must specify some level of uncertainty in our measures of change that we are willing to tolerate. This level represents our willingness to risk being wrong, for example, to claim that a trend exists when it does not. Traditionally, this is known as setting the alpha level.
Setting your alpha level is a balance between not wanting to 'cry wolf' (saying a trend exists when it really doesn't) and missing an important trend by being too conservative. If you are using your monitoring program as an early warning of negative population change, then you may want to increase your alpha level above the traditional level of 0.05. To do so may mean that you 'cry wolf' more of the time, but because the goal of most monitoring programs is to alert managers to potential problems, a higher alpha is justified in light of the possibility of missing a problem while waiting for it to become "statistically significant" at a lower alpha level.
Basic rule of thumb: The less willing you are to be caught 'crying wolf' (or the smaller you set the alpha level), the more samples you will need to detect a given level of population trend.
Power. Power can be defined as your ability to detect (or the odds of detecting) a trend given that there really is a change going on in your animal populations. In general, a power of 90-95% is reasonable for most monitoring programs.
Statistical Testing
You now know that count variability affects the number of samples you need, as does your requirement for what magnitude of change you want your program to detect. The last issue that needs to be resolved is what statistical model you will use to test your data.
Basic rule of thumb. The specific formula (or simulations) used to calculate sample size is unique to the statistical test or model that you will use.
Now ... some practical guidance on how to calculate the sample sizes for your monitoring program.
Note: Throughout this document we often use the terms variance, variation, and variability as a shorthand expression for the variability of counts. However, understand that the actual mathematical calculation of variability could be any one of several measures (standard deviation, standard error, variance, or coefficient of variation), each of which has a specific statistical meaning.
Basic rules of thumb. You must have determined the following to set sample sizes:
• A mean and standard deviation (i.e., the coefficient of variation or the variation of your counts)
• The smallest number of years over which you would like to detect a change
• The smallest percentage change you would like to detect over those years
• An alpha level (how often you will cry wolf)
• A power level (the proportion of the time you would like to detect a trend if one were occurring)
• A statistical test (your analytical model)
Calculating the mean and standard deviation requires some additional explanation. While the other factors that affect sample sizes are set based on your desired need for precision and the smallest degree of change you want to detect, the mean and variance are factors that are set by the animals being sampled. If you have several years of pilot data you will want to calculate the mean and standard deviation from your own data. If you don't, then you can use someone else's data from as similar a situation as you can find to estimate means and standard deviations. In a pinch, you can estimate some of the variation to be expected in a set of yearly counts by calculating a mean and standard deviation from one year's data if you have several replicates OF THE SAME points or plots. However, this approach fails to account for any between-year variation in the animal populations. Finally, you can use data published in the literature or from one of our databases on count variation (e.g. amphibians, bird point counts, bird territory mapping).
Pilot data is far and away the most preferable source of information for determining count variation. We have found that the variation in the counts of animals is very consistent within a site (figure). However, there are often wide differences in the variation of counts among sites, even those close by that use the same technique.
Ways to avoid problems when you calculate variances:
1. Use data collected over time, not over space.
2. Use data that match the counting technique and sampling units you plan to use (e.g., don't use the variances that come from counts from a 50-stop point count system when you are planning to use only a 20-stop system).
3. Use means that come from the same data you used to calculate the variance.
Basic rule of thumb: If you have no access to pilot data and are not aware of examples from the literature that you trust, a conservative estimate of the amount of variation that you could use in sample size calculations would be a CV of 100%, with a moderately conservative alternative of 75%.
Choosing an analytical technique also requires some further explanation. The specific calculation of sample sizes is different for every statistical test. In complicated analyses, formulas often don't exist and simulation must be used to calculate sample size. Below are listed some text and web resources for setting sample sizes for various simple statistical models. For complicated situations you can either run the simulations yourself, have a statistician do them for you, or use a conservative model to estimate sample sizes.
James Gibbs has created a software program that estimates sample sizes for those who will use linear or exponential regression to analyze their
data. As this type of regression is the most basic, it is also likely to be the most conservative. Currently his software only runs on Windows XP or 2000. It is available by contacting Sam Droege at the address below. A new version is expected out soon that will run on more platforms.
Desperation rule of thumb: If, for whatever reason, you cannot calculate a reasonable estimate for the number of samples to take, then put in 60 plots/points; under many circumstances that may be sufficient. Obviously the more the better, but be sure to review your data after 3 years to re-evaluate this weak choice.
Texts on estimating sample size
Web sites and online calculators for the calculation of sample sizes
U.S. Department of the Interior
U.S. Geological Survey
Patuxent Wildlife Research Center
Laurel, MD USA 20708-4038
http://www.pwrc.usgs.gov/monmanual
Contact Sam Droege, email sam_droege@usgs.gov
Gerrodette (1987) also looks at the effect of various factors upon the number of years of monitoring required.
For example, Figure 1 of his report:
shows the dependence of power upon the rate and type of cv relationship when the initial cv was 20% and α = .05. Note that for n = 5, you have very little power to detect anything but huge changes (large values of r). For example, even with r = .2, corresponding to a 20% change/year in abundance, power barely exceeds 50% even after 5 years. Power is highest (and hence a trend is easier to detect) when the cv is proportional to 1/√abundance (but this is reversed for declining trends).
Figure 2 of his report:
shows the relationship of power to the type of trend (linear or exponential) and whether the trend is increasing or decreasing. Regardless of whether the trend is linear or exponential, decreasing trends are easier to detect than increasing trends. Furthermore, it is easier to detect a declining trend with a constant absolute decline than one with a relative decline, and hardest to detect an increasing trend that changes by an absolute amount each year. This is related to the "compounding" effect in exponential changes.
Finally, Figure 3 of his report:
shows the effect of different amounts of variation upon trend detection. As expected, a trend is easier to detect with lower amounts of variation (smaller cvs).
2.4.4 Finally - how many years do I need to monitor?
Gerrodette (1987) gives a quick-and-dirty approximation that will help guide sample size determination. For α = .05 and power = 80%, the following is an approximate rule:

r²n³ ≥ 94(cv)²

For example, to detect a 5% decline/year in a population whose cv is 20% and constant over time would require:

(−.05)²n³ ≥ 94(.2)²

or n ≥ 11 years of monitoring. 17
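The rule can be rearranged to solve directly for n. A short sketch (the function name is my own, and, as the footnote warns, the rule itself is only an approximation to the exact power computation):

```python
# Solve Gerrodette's approximate rule  r^2 * n^3 >= 94 * cv^2  for n,
# the number of years of monitoring (alpha = .05, power = 80%).

def approx_years(r, cv):
    """Smallest (real-valued) n satisfying r^2 * n^3 >= 94 * cv^2."""
    return (94 * cv ** 2 / r ** 2) ** (1.0 / 3.0)

# 5% decline/year with a constant cv of 20%:
print(approx_years(-0.05, 0.20))  # about 11.5 years
```

Rounding gives the n ≥ 11 of the worked example; the exact computation in TRENDS can give a smaller answer (here, 9 years).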
Suppose we wish to investigate the power of a monitoring design that will run for 5 years. At each survey occasion (i.e. every year), we have 1 monitoring station, and we make 2 estimates of the population at the
17 If you try this actual power computation using TRENDS, you find that actually 9 years may be sufficient. This formula is ONLY an approximation!
monitoring station in each year. The population is expected to start with 1000 animals, and we expect that the measurement error in each estimate is about 200, i.e. the coefficient of variation of each measurement is about 20% and is constant over time. We are interested in detecting increasing or decreasing trends and, to start, a 5% decline per year will be of interest.
The input/output for TRENDS is shown below:
Most of the fields are self-explanatory. The effort multiplier, i.e. 2 surveys/year, is located at the bottom right of the screen. We find that a five-year study only has a 14% chance of detecting a 5% decline per year - hardly worth doing the study!
The input for the MONITOR Program follows:
Most of the fields are self-explanatory, but additional help can be obtained by clicking on the active links behind each term.
The output from this proposed program follows:
Program MONITOR
Tue May 4 00:53:34 2004 p=2705
This is an example of a power analysis to detect a trend
SIMULATION OVERVIEW
Number of plots monitored : 1
Plot Counts :
1000.000
Plot Standard Deviations :
200.000
Plot weights :
1.000
Number counts/plot/survey occasion : 2
CV in trends : 0.000
Total Surveys : 5
Survey occasions:
0.000
1.000
2.000
3.000
4.000
Trend Type = Linear
Counts analyzed as decimals
Projection set = Complete
Significance Level : 0.050
Significance Test : 2-tailed t-test
Iterations : 500
Power to Detect Population Trends:
10% Increase = 0.68200
9% Increase = 0.58800
8% Increase = 0.45000
7% Increase = 0.36200
6% Increase = 0.26400
5% Increase = 0.17600
4% Increase = 0.13800
3% Increase = 0.11800
2% Increase = 0.05400
1% Increase = 0.05600
0% Increase = 0.04200
10% Decrease = 0.35600
9% Decrease = 0.30600
8% Decrease = 0.24800
7% Decrease = 0.21200
6% Decrease = 0.19600
5% Decrease = 0.15600
4% Decrease = 0.09600
3% Decrease = 0.08800
2% Decrease = 0.06600
1% Decrease = 0.05000
0% Decrease = 0.04200
END OF OUTPUT FILE
This design is estimated to have a power of 16% to detect a 5% decrease PER YEAR.
The difference in reported powers is an artifact of the different ways the two programs compute power. TRENDS uses analytical formulae based on normal approximations, while MONITOR conducts a simulation study and reports the proportion of trials (in this case out of 500) that detected the trend. In any event, don't get hung up over these differences - the key point is that this proposed study has virtually no power to detect a 5% decline/year.
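To make the simulation idea concrete, here is a stripped-down sketch of a MONITOR-style power computation under the settings above. This is my own toy re-implementation, not MONITOR's actual code: it assumes the linear 5% decline means a drop of 5% of the initial count each year, and the estimate it prints will wobble a little from run to run, landing near the 14-16% the two programs report.

```python
import math
import random

def sim_power(n_years=5, counts_per_year=2, N0=1000.0, sd=200.0,
              yearly_decline=0.05, n_iter=2000, seed=42):
    """Toy MONITOR-style power estimate: simulate noisy yearly counts
    around a linear trend, fit a least-squares line each time, and count
    how often a 2-tailed t-test on the slope rejects at alpha = .05.
    t_crit is hard-coded for df = n_years * counts_per_year - 2 = 8."""
    random.seed(seed)
    t_crit = 2.306                                 # t(.975, df = 8)
    detected = 0
    for _ in range(n_iter):
        xs, ys = [], []
        for t in range(n_years):
            mu = N0 * (1.0 - yearly_decline * t)   # linear decline
            for _ in range(counts_per_year):
                xs.append(float(t))
                ys.append(random.gauss(mu, sd))    # measurement error only
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxx = sum((x - xbar) ** 2 for x in xs)
        slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
        intercept = ybar - slope * xbar
        sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
        se_slope = math.sqrt(sse / (n - 2) / sxx)
        if abs(slope / se_slope) > t_crit:
            detected += 1
    return detected / n_iter

print(sim_power())   # lands in the same low-power ballpark as the text
```

Note that, like the MONITOR run above, this sketch sets process variation to 0: all the scatter comes from measurement error around the trend line.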
Program MONITOR also reports power for a range of trends; Program TRENDS reports power for a single TREND at a time, but you can quickly vary the sliding window to investigate different design options.
2.4.5 Summary of plans
Here is a summary of some power computations to detect an average decrease over time. In all cases, the cv was assumed to be proportional to 1/√abundance.
The results are sobering. For many animal species, many years of concentrated effort will be needed to detect small effects with decent power.
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 10

N years    Obs/           Average % decrease/year
monitored  year     0      2      4      6      8     10
                  Power  Power  Power  Power  Power  Power
    2       1       .      .      .      .      .      .
            3       5      5      7      9     11     15
            5       5      6      9     13     19     26
    3       1       5      5      6      7      7      9
            3       5      7     13     22     35     48
            5       5      9     20     38     58     75
    4       1       5      6      8     12     16     21
            3       5     10     26     48     70     85
            5       5     15     43     74     92     98
    5       1       5      7     13     22     32     43
            3       5     16     47     77     93     99
            5       5     26     71     95    100    100
    6       1       5     10     22     38     55     69
            3       5     25     69     94     99    100
            5       5     40     90    100    100    100
    7       1       5     13     33     57     76     88
            3       5     37     87     99    100    100
            5       5     58     98    100    100    100
    8       1       5     18     47     75     90     97
            3       5     52     96    100    100    100
            5       5     75    100    100    100    100
    9       1       5     24     62     88     97     99
            3       5     66     99    100    100    100
            5       5     88    100    100    100    100
   10       1       5     31     76     95     99    100
            3       5     79    100    100    100    100
            5       5     95    100    100    100    100
Refer to http://www.mbr-pwrc.usgs.gov/cgi-bin/monitor.pl for a web-based interface.
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 20

N years    Obs/           Average % decrease/year
monitored  year     0      2      4      6      8     10
                  Power  Power  Power  Power  Power  Power
    2       1       .      .      .      .      .      .
            3       5      5      5      6      7      7
            5       5      5      6      7      8     10
    3       1       5      5      5      5      6      6
            3       5      6      7      9     12     16
            5       5      6      9     13     19     26
    4       1       5      5      6      7      8      9
            3       5      6     10     16     24     33
            5       5      7     14     25     39     53
    5       1       5      6      7      9     12     15
            3       5      8     16     27     41     55
            5       5     10     24     44     64     79
    6       1       5      6      9     13     19     24
            3       5     10     23     42     61     76
            5       5     14     37     65     84     94
    7       1       5      7     12     19     28     36
            3       5     13     34     59     78     90
            5       5     19     53     82     95     99
    8       1       5      8     16     27     38     49
            3       5     17     46     74     90     97
            5       5     26     69     93     99    100
    9       1       5     10     21     36     50     62
            3       5     22     60     86     97     99
            5       5     35     82     98    100    100
   10       1       5     11     27     45     62     74
            3       5     28     72     94     99    100
            5       5     44     92    100    100    100
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 30

N years    Obs/           Average % decrease/year
monitored  year     0      2      4      6      8     10
                  Power  Power  Power  Power  Power  Power
    2       1       .      .      .      .      .      .
            3       5      5      5      5      6      6
            5       5      5      5      6      6      7
    3       1       5      5      5      5      5      5
            3       5      5      6      7      8     10
            5       5      5      7      9     11     14
    4       1       5      5      5      6      6      7
            3       5      6      7     10     13     17
            5       5      6      9     14     20     27
    5       1       5      5      6      7      8     10
            3       5      6     10     15     21     28
            5       5      7     13     23     34     46
    6       1       5      6      7      9     11     14
            3       5      7     13     22     32     43
            5       5      9     19     34     51     65
    7       1       5      6      8     11     15     19
            3       5      8     18     31     45     58
            5       5     11     27     49     68     81
    8       1       5      6     10     15     20     25
            3       5     10     24     42     59     72
            5       5     14     37     63     82     92
    9       1       5      7     12     19     26     33
            3       5     12     31     53     71     83
            5       5     18     49     76     91     97
   10       1       5      8     15     23     33     41
            3       5     15     40     65     82     91
            5       5     23     60     86     96     99
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 40

N years    Obs/           Average % decrease/year
monitored  year     0      2      4      6      8     10
                  Power  Power  Power  Power  Power  Power
    2       1       .      .      .      .      .      .
            3       5      5      5      5      5      6
            5       5      5      5      5      6      6
    3       1       5      5      5      5      5      5
            3       5      5      5      6      7      8
            5       5      5      6      7      8     10
    4       1       5      5      5      5      6      6
            3       5      5      6      8     10     12
            5       5      6      7     10     13     17
    5       1       5      5      6      6      7      8
            3       5      6      8     10     14     18
            5       5      6     10     15     21     28
    6       1       5      5      6      7      8     10
            3       5      6      9     14     20     26
            5       5      7     13     21     32     42
    7       1       5      5      7      8     11     13
            3       5      7     12     19     28     37
            5       5      8     18     30     44     57
    8       1       5      6      8     10     13     16
            3       5      8     15     26     37     48
            5       5     10     23     41     57     71
    9       1       5      6      9     13     17     21
            3       5      9     20     33     47     59
            5       5     12     30     52     70     82
   10       1       5      7     10     15     21     26
            3       5     11     25     42     57     70
            5       5     15     39     63     80     90
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 50

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     6     6
    3        1        5     5     5     5     5     5
             3        5     5     5     6     6     7
             5        5     5     6     6     7     8
    4        1        5     5     5     5     5     6
             3        5     5     6     7     8     9
             5        5     5     6     8    10    13
    5        1        5     5     5     6     6     7
             3        5     5     7     8    11    13
             5        5     6     8    11    15    20
    6        1        5     5     6     6     7     8
             3        5     6     8    11    15    19
             5        5     6    10    15    22    29
    7        1        5     5     6     7     9    10
             3        5     6     9    14    20    25
             5        5     7    13    21    31    40
    8        1        5     5     7     8    10    12
             3        5     7    12    18    26    33
             5        5     8    17    28    40    52
    9        1        5     6     7    10    12    15
             3        5     8    14    23    33    42
             5        5    10    21    36    51    63
   10        1        5     6     8    11    15    18
             3        5     9    17    29    40    51
             5        5    11    27    45    61    74
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 60

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     6
    3        1        5     5     5     5     5     5
             3        5     5     5     5     6     6
             5        5     5     5     6     7     7
    4        1        5     5     5     5     5     5
             3        5     5     6     6     7     8
             5        5     5     6     7     9    10
    5        1        5     5     5     5     6     6
             3        5     5     6     7     9    11
             5        5     6     7     9    12    15
    6        1        5     5     5     6     6     7
             3        5     6     7     9    12    14
             5        5     6     8    12    17    22
    7        1        5     5     6     7     7     8
             3        5     6     8    11    15    19
             5        5     7    10    16    23    30
    8        1        5     5     6     7     9    10
             3        5     6    10    14    19    25
             5        5     7    13    21    30    39
    9        1        5     5     7     8    10    12
             3        5     7    11    18    24    31
             5        5     8    16    27    38    48
   10        1        5     6     7     9    12    14
             3        5     7    14    22    30    38
             5        5     9    20    33    47    58
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 70

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     5
    3        1        5     5     5     5     5     5
             3        5     5     5     5     6     6
             5        5     5     5     6     6     7
    4        1        5     5     5     5     5     5
             3        5     5     5     6     6     7
             5        5     5     6     7     8     9
    5        1        5     5     5     5     6     6
             3        5     5     6     7     8     9
             5        5     5     6     8    10    12
    6        1        5     5     5     6     6     7
             3        5     5     6     8    10    12
             5        5     6     8    10    14    17
    7        1        5     5     6     6     7     8
             3        5     6     7    10    12    15
             5        5     6     9    13    18    23
    8        1        5     5     6     7     8     9
             3        5     6     8    12    15    19
             5        5     7    11    17    23    30
    9        1        5     5     6     7     9    10
             3        5     6    10    14    19    24
             5        5     7    13    21    29    38
   10        1        5     6     7     8    10    12
             3        5     7    11    17    23    29
             5        5     8    16    26    36    46
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 80

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     5
    3        1        5     5     5     5     5     5
             3        5     5     5     5     5     6
             5        5     5     5     5     6     6
    4        1        5     5     5     5     5     5
             3        5     5     5     6     6     7
             5        5     5     6     6     7     8
    5        1        5     5     5     5     5     6
             3        5     5     6     6     7     8
             5        5     5     6     7     9    11
    6        1        5     5     5     6     6     6
             3        5     5     6     7     9    10
             5        5     6     7     9    11    14
    7        1        5     5     5     6     6     7
             3        5     5     7     8    11    13
             5        5     6     8    11    15    19
    8        1        5     5     6     6     7     8
             3        5     6     8    10    13    16
             5        5     6     9    14    19    24
    9        1        5     5     6     7     8     9
             3        5     6     9    12    16    20
             5        5     7    11    17    24    30
   10        1        5     5     6     8     9    10
             3        5     6    10    14    19    24
             5        5     7    13    21    29    37
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 90

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     5
    3        1        5     5     5     5     5     5
             3        5     5     5     5     5     6
             5        5     5     5     5     6     6
    4        1        5     5     5     5     5     5
             3        5     5     5     6     6     6
             5        5     5     5     6     7     7
    5        1        5     5     5     5     5     6
             3        5     5     6     6     7     7
             5        5     5     6     7     8     9
    6        1        5     5     5     5     6     6
             3        5     5     6     7     8     9
             5        5     5     7     8    10    12
    7        1        5     5     5     6     6     7
             3        5     5     6     8     9    11
             5        5     6     7    10    13    16
    8        1        5     5     6     6     7     7
             3        5     6     7     9    11    14
             5        5     6     8    12    16    20
    9        1        5     5     6     6     7     8
             3        5     6     8    10    13    16
             5        5     6    10    15    20    25
   10        1        5     5     6     7     8     9
             3        5     6     9    12    16    20
             5        5     7    11    18    24    30
Approximate power to detect (decreasing) linear trend when monitoring x years and n obs/year
CV of initial obs (%): 100

N years    Obs/      Average % decrease/year (entries are power, %)
monitored  year       0     2     4     6     8    10
    2        1        .     .     .     .     .     .
             3        5     5     5     5     5     5
             5        5     5     5     5     5     5
    3        1        5     5     5     5     5     5
             3        5     5     5     5     5     5
             5        5     5     5     5     6     6
    4        1        5     5     5     5     5     5
             3        5     5     5     5     6     6
             5        5     5     5     6     6     7
    5        1        5     5     5     5     5     5
             3        5     5     5     6     6     7
             5        5     5     6     7     7     9
    6        1        5     5     5     5     6     6
             3        5     5     6     6     7     8
             5        5     5     6     8     9    11
    7        1        5     5     5     6     6     6
             3        5     5     6     7     9    10
             5        5     6     7     9    11    14
    8        1        5     5     5     6     6     7
             3        5     5     7     8    10    12
             5        5     6     8    11    14    17
    9        1        5     5     6     6     7     7
             3        5     6     7     9    12    14
             5        5     6     9    13    17    21
   10        1        5     5     6     7     7     8
             3        5     6     8    11    14    17
             5        5     7    10    15    20    25
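Power values of this kind can also be approximated by Monte Carlo simulation: generate observations that decline at a fixed percentage per year with lognormal noise at the stated CV, regress log(observation) on year, and count how often the slope test rejects at α = .05. The sketch below is a Python translation of that idea, not the program behind the web tool; the helper name trend_power, the starting abundance of 1000, and the two-sided test are illustrative assumptions, so its results will be in the right ballpark but need not exactly match the tabled values.

```python
import numpy as np
from scipy import stats

def trend_power(n_years, obs_per_year, cv, pct_decline,
                n_sim=2000, alpha=0.05, seed=1):
    """Monte Carlo power for detecting a log-linear decline.

    cv is the coefficient of variation of the observations (0.30 = 30%);
    pct_decline is the average % decrease per year.
    """
    rng = np.random.default_rng(seed)
    years = np.repeat(np.arange(n_years), obs_per_year).astype(float)
    true_mean = 1000.0 * (1.0 - pct_decline / 100.0) ** years  # arbitrary start
    sigma = np.sqrt(np.log(1.0 + cv**2))  # lognormal sigma giving the stated CV
    reject = 0
    for _ in range(n_sim):
        obs = true_mean * rng.lognormal(mean=0.0, sigma=sigma, size=years.size)
        fit = stats.linregress(years, np.log(obs))
        reject += fit.pvalue < alpha  # two-sided test on the slope
    return reject / n_sim

# e.g. 10 years, 5 obs/year, CV 30%, 10%/year decline: power near 1
print(trend_power(10, 5, 0.30, 10))
```

With no decline the rejection rate should sit near the α level of 5%, mirroring the columns of 5's in the tables above.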
CHAPTER 2. DETECTING TRENDS OVER TIME
2.5 Testing for common trend - ANCOVA
In some cases, it is of interest to test if the same trend is occurring in a number of locations. Or the data from a single site may be so poor that trends cannot be detected, but by pooling the sites, a common trend over sites can be detected because of the increased sample size. This technique can also be used for adjusting for seasonality, as will be seen later.
The Analysis of Covariance (ANCOVA) does both. Groups of data (e.g. from the same location) are identified by a nominal or ordinal scale variable, and time is measured for each group.
Typically, ANCOVA is used to check if the regression lines for the groups are parallel. If there is evidence that the individual regression lines are not parallel, then a separate regression line (trend line) must be fit for each group for prediction purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are coincident, i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident, then a series of parallel lines is fit to the data, with all of the data used to estimate the common slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled together and a single regression line fit to all of the data.
The three possibilities are shown below for the case of two groups; the extension to many groups is obvious:
[Three figures (not reproduced): non-parallel lines; parallel but not coincident lines; coincident lines.]
©2012 Carl James Schwarz. November 23, 2012.
2.5.1 Assumptions
As before, it is important to verify the assumptions underlying the analysis before it is started. As ANCOVA is a combination of ANOVA and regression, the assumptions are similar.
• The response variable Y is continuous (interval or ratio scaled).
• The Y are a random sample from the various time points measured.
• There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that don't appear to follow the straight line.
• The relationship between Y and X must be linear for each group. 18 Check this assumption by looking at the individual plots of Y vs. X for each group.
• The variance must be equal for both groups around their respective regression lines. Check that the spread of the points is equal across the range of X and that the spread is comparable between the two groups. This can be formally checked by looking at the MSE from a separate regression line for each group, as the MSE estimates the variance of the data around the regression line.
• The residuals must be normally distributed around the regression line for each group. This assumption can be checked by examining the residual plots from the fitted model for evidence of non-normality. For large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to detect anything but gross departures.
18 It is possible to relax this assumption as well, but that is again beyond the scope of this course.
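The equal-variance check above can be scripted directly: fit a separate least-squares line to each group and compare the residual standard deviations (the square roots of the MSEs). A minimal numpy sketch (the helper name group_residual_sd and the toy data are illustrative):

```python
import numpy as np

def group_residual_sd(x, y):
    """Residual standard deviation (sqrt of MSE) from a simple linear fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    return float(np.sqrt(np.sum(resid**2) / (len(x) - 2)))  # MSE uses n - 2 df

# Compute this separately for each group's (x, y) data; roughly similar
# values support the equal-variance assumption.
print(group_residual_sd([0, 1, 2, 3], [3.1, 4.9, 7.2, 8.8]))
```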
2.5.2 Statistical model
You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit to a set of data. The model must describe the treatment structure, the experimental unit structure, and the randomization structure. Let Y be the response variable, X be the continuous predictor variable, and Group be the group factor.
As ANCOVA is a combination of ANOVA and regression, it will not be surprising that the models will have terms corresponding to both Group and X. Again, there are three cases.
If the lines for each group are not parallel,
the appropriate model is
Y = Group X Group*X
The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never specified), followed by group effects (different intercepts), a common slope (trend) on X, and an "interaction" between Group and X, which is interpreted as different slopes (different trends) for each group. This model is almost equivalent to fitting a separate regression line for each group. The only advantage to using this joint model for all groups is similar to that enjoyed by using ANOVA: all of the groups contribute to a better estimate of the residual error. If the number of data points per group is small, this can lead to improvements in precision compared to fitting each group individually, and an improved power to detect trends.
If the lines are parallel across groups, but not coincident, the appropriate model is
Y = Group X
The terms can be in any order. The only difference between this and the previous model is that this simpler model lacks the Group*X "interaction" term. It would not be surprising, then, that a statistical test to see if this simpler model is tenable corresponds to examining the p-value of the test on the Group*X term from the complex model. This is exactly analogous to testing for interaction effects between factors in a two-factor ANOVA.
Lastly, if the lines are coincident, the appropriate model is
Y = X
Now the difference between this model and the previous model is the Group term that has been dropped. Again, it would not be surprising that this corresponds to the test of the Group effect in the formal statistical test. The test for coincident lines should only be done if there is insufficient evidence against parallelism. While it is possible to test for a non-zero slope, this is rarely done.
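In a scripting environment the same three-model sequence can be fit with R-style formulas. Here is a sketch using Python's statsmodels rather than JMP (the column names Y, X, and Group and the helper name ancova_sequence are placeholders):

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def ancova_sequence(df):
    """Fit the three ANCOVA models and test each simplification.

    df must have columns Y (response), X (continuous), Group (categorical).
    """
    full     = smf.ols("Y ~ C(Group) * X", data=df).fit()  # separate slopes
    parallel = smf.ols("Y ~ C(Group) + X", data=df).fit()  # parallel lines
    single   = smf.ols("Y ~ X", data=df).fit()             # coincident lines
    # Test for non-parallelism: does dropping the Group*X term hurt the fit?
    p_interaction = anova_lm(parallel, full).iloc[1]["Pr(>F)"]
    # Test for coincidence (only sensible if the lines look parallel)
    p_group = anova_lm(single, parallel).iloc[1]["Pr(>F)"]
    return p_interaction, p_group
```

A large p_interaction supports the parallel-lines model; a large p_group as well would support pooling everything into a single line.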
2.5.3 Example: Degradation of dioxin - pooling locations
An unfortunate byproduct of pulp-and-paper production used to be dioxins, a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.
Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.
Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The livers are excised, and the livers from all four crabs are composited together into a single sample. 19 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
19 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among-individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
As seen earlier, the appropriate response variable is log(TEQ).
Is the rate of decline the same for both sites? Did the sites have the same initial concentration?
Here are the raw data:
Site  Year     TEQ  log(TEQ)
  a   1990  179.05      5.19
  a   1991   82.39      4.41
  a   1992  130.18      4.87
  a   1993   97.06      4.58
  a   1994   49.34      3.90
  a   1995   57.05      4.04
  a   1996   57.41      4.05
  a   1997   29.94      3.40
  a   1998   48.48      3.88
  a   1999   49.67      3.91
  a   2000   34.25      3.53
  a   2001   59.28      4.08
  a   2002   34.92      3.55
  a   2003   28.16      3.34
  b   1990   93.07      4.53
  b   1991  105.23      4.66
  b   1992  188.13      5.24
  b   1993  133.81      4.90
  b   1994   69.17      4.24
  b   1995  150.52      5.01
  b   1996   95.47      4.56
  b   1997  146.80      4.99
  b   1998   85.83      4.45
  b   1999   67.72      4.22
  b   2000   42.44      3.75
  b   2001   53.88      3.99
  b   2002   81.11      4.40
  b   2003   70.88      4.26
JMP analysis
The raw data are available in Dioxin2.JMP from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data can be entered into JMP in the usual fashion. Make sure that Site is a nominal scale variable, and that Year is a continuous variable.
In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say, for site a) and using Rows→Markers to set the plotting symbol for the selected rows.
The final data sheet has two different plotting symbols for the two sites.
Before fitting the various models, begin with an exploratory examination of the data, looking for outliers and checking the assumptions.
Each year's data are independent of other years' data, as a different set of crabs was selected each year. Similarly, the data from one site are independent of the data from the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the sea floor to capture the available crabs in the area.
Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable were the body mass of small fish, then poor growing conditions in a single year could depress the growth of fish at all locations. This would violate the assumption of independence, as the residual at one site in a year would be related to the residual at another site in the same year: you tend to see the residuals "paired", with negative residuals from the fitted line at one site matched (by year) with negative residuals at the other site. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors, and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related. 20
20 If you actually try to fit a process-error term to this model, you find that the estimated process error is zero.
Use the Analyze->Fit Y-by-X platform and specify log(TEQ) as the Y variable and Year as the X variable:
Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit window-title line, and selecting Site as the grouping variable.
Now select Fit Line from the same pop-down menu
to get separate lines fit for each group.
The relationships for each site appear to be linear. The actual estimates are also presented.
The scatter plot doesn't show any obvious outliers. The estimated slope for the a site is −.107 (se .02), while the estimated slope for the b site is −.06 (se .02). The 95% confidence intervals (not shown on the output, but available by right-clicking/ctrl-clicking on the Parameter Estimates table) overlap considerably, so the slopes could be the same for the two groups.
The MSE from site a is .10 and the MSE from site b is .12. These correspond to standard deviations of √.10 = .32 and √.12 = .35, which are very similar, so the assumption of equal standard deviations seems reasonable.
The residual plots (not shown) also look reasonable.
The assumptions appear to be satisfied, so let us now fit the various models.
First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used. The terms can be in any order and correspond to the model described earlier. This gives the following output:
The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the Effect test for the Site*Year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel.
We need to refit the model, dropping the interaction term:
which gives the following regression plot:
This shows the fitted parallel lines. The effect tests now have a small p-value for the Site effect, indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This would mean that the rate of decay of the dioxin appears to be equal at both sites, but the initial concentrations appear to be different.
The estimated (common) slope is found in the Parameter Estimates portion of the output
and has a value of −.083 (se .016). Because the analysis was done on the log scale, this implies that the dioxin levels changed by a factor of exp(−.083) = .92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log scale runs from −.12 to −.05, which corresponds to a factor of between exp(−.12) ≈ .88 and exp(−.05) ≈ .95 per year, i.e. between a 12% and a 5% decline per year. 21
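The back-transformation from a slope on the log scale to a percentage change per year can be checked directly; a small Python helper (the name yearly_change is chosen for illustration; note the notes round the confidence-limit factors):

```python
import math

def yearly_change(slope):
    """Convert a slope on the log scale to a % change per year."""
    return (math.exp(slope) - 1.0) * 100.0

print(yearly_change(-0.083))  # point estimate: about an 8% decline per year
print(yearly_change(-0.12))   # lower confidence limit
print(yearly_change(-0.05))   # upper confidence limit
```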
While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to log(TEQ) at the average value of Year, which is not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure to give:
21 The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.
The estimated difference between the lines (on the log scale) is 0.46 (se .13). Because the analysis was done on the log scale, this corresponds to a ratio of exp(.46) = 1.58 in dioxin levels between the two sites, i.e. site b has 1.58 times the dioxin level of site a. Because the lines are parallel and declining, the dioxin levels are falling at both sites, but the 1.58 ratio remains constant.
Finally, the Actual-by-Predicted plot (not shown here), the leverage plots (not shown here), and the residual plot don't show any evidence of a problem with the fit.
2.5.4 Change in yearly average temperature with regime shifts
The ANCOVA technique can also be used for trends when there are KNOWN regime shifts in the series. The case when the timing of the shift is unknown is more difficult and not covered in this course.
For example, consider a time series of annual average temperatures measured at Tuscaloosa, Alabama from 1901 to 2001. It is well known that shifts in temperature readings can occur whenever the instrument, location, observer, or other characteristics of the station change.
The data are available in the JMP datafile tuscaloosa-avg-temp.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
A portion of the raw data is shown below:
and a time series plot of the data shows a shift in the readings in 1939 (thermometer changed), 1957 (station moved), and possibly in 1987 (location and thermometer changed).
It turns out that the case where the number of epochs tends to increase with the number of data points has some serious technical issues with the properties of the estimators. See
Lu, Q. and Lund, R.B. (2007). Simple linear regression with multiple level shifts. Canadian Journal of Statistics, 35, 447-458.
for details. Basically, if the number of parameters tends to increase with the sample size, this violates one of the assumptions for maximum likelihood estimation. This could lead to estimates which may not even be consistent! For example, suppose that the recording conditions changed every two years. Each pair of data points should still be able to estimate the common slope, but this corresponds to the well-known problem with case-control studies, where the number of pairs increases with the total sample size. Fortunately, Lu and Lund (2007) showed that this violation is not serious.
The analysis proceeds as in the dioxin example with two sites, except that now the series is broken into different epochs corresponding to the sets of years when conditions remained stable at the recording site. In this case, this corresponds to the years 1901-1938 (inclusive); 1940-1956 (inclusive); 1958-1986 (inclusive); and 1989-2000 (inclusive). Note that the years 1939, 1957, and 1987 are NOT used, because the average temperature in each of these years is an amalgam of two different recording conditions 22.
For example, the data file (around the first regime change) may look like:
Note that Year and Avg Temp are both set to have a continuous scale, but Epoch should have a nominal or ordinal scale.
Model filling proceeds as be<str<strong>on</strong>g>for</str<strong>on</strong>g>e by first the model:<br />
AvgT emp = Y ear Epoch Y ear ∗ Epoch<br />
to see if the change in AvgTemp is c<strong>on</strong>sistent am<strong>on</strong>g Epochs and then fitting the model:<br />
AvgT emp = Y ear Epoch<br />
to estimate the comm<strong>on</strong> trend (after adjusting <str<strong>on</strong>g>for</str<strong>on</strong>g> shifts am<strong>on</strong>g the Epochs).<br />
The Analyze->Fit Model plat<str<strong>on</strong>g>for</str<strong>on</strong>g>m is used:<br />
22 If the exact day of the change were known, it is possible to weight the two epochs in these years and include the data points.<br />
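The epoch bookkeeping above is easy to get wrong. A minimal sketch of it in Python (the function name and the integer epoch coding are mine, not part of the notes) that labels each year with its epoch and drops the three transition years might look like:

```python
def epoch_label(year):
    """Assign an epoch to a year of the temperature series.

    The transition years (1939, 1957, 1987) mix two recording conditions,
    so they are excluded from the analysis (returned as None)."""
    if year in (1939, 1957, 1987):
        return None
    if year <= 1938:
        return 1  # 1901-1938: original station and thermometer
    if year <= 1956:
        return 2  # 1940-1956: after thermometer change
    if year <= 1986:
        return 3  # 1958-1986: after station move
    return 4      # 1989-2000: after location and thermometer change
```

In a real analysis the epoch label would then be treated as a nominal factor, exactly as the notes do in JMP.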
There is no strong evidence that the slopes differ among the epochs (p=.10), despite the plot showing
a potentially different slope in the 3rd epoch:
The simpler model with common slopes is then fit:
with fitted (common slope) lines:
No further model simplification is possible, and there is evidence that the common slope is different from zero:
The estimated change in average temperature is:
i.e. an estimated increase of .033 (SE .006) per year. The 95% confidence interval does not cover 0.
The residual plots (against the predicted values and against the order in which the data were collected):
show no obvious problems.
Whenever time series data are used, autocorrelation should be investigated. The Durbin-Watson test is
applied to the residuals:
with no obvious problem detected.
The leverage plot (against year)
also reveals nothing amiss.
A more sophisticated analysis can be fit using SAS, but isn't needed here. The sample program and output are
available in the Sample Program Library.
2.6 Dealing with Autocorrelation
Short time series (10-50 observations) are common in environmental and ecological studies. It is well known
that when data are collected over time, the usual assumption that the errors (deviations above and below the
regression line) are independent may not be true.
This is a key assumption of regression analysis. What it implies is that if the data point for a particular
year happens to be above the line, it has no influence on whether the data point for the next year is also above the
line. In many cases this is not true, because of long-term trends that affect data points for several years in
a row. For example, precipitation often follows multi-year patterns where a drought year is more
often followed by another drought year than by a return to normal rainfall. If the level of precipitation affects
the response, you may see an induced autocorrelation (also known as serial correlation). The uncritical
application of regression to these types of data without accounting for the autocorrelation over time is known
as pseudo-replication over time (Hurlbert, 1984).
This problem and how to deal with it are well known in economics and related disciplines, but less well
known in ecology.
Some articles that discuss the problem and solutions are:
• Bence, J. R. (1995). Analysis of short time series: Correction for autocorrelation. Ecology 76, 628-639.
A nice non-technical review of the subject and how to deal with it in ecology.
• Roy A., Falk B. and Fuller W.A. (2004). Testing for Trend in the Presence of Autoregressive Error.
Journal of the American Statistical Association, 99, 1082-1091. This article is VERY technical, but
the reference list provides a nice summary of relevant papers about this problem.
In some previous examples, we looked at the Durbin-Watson statistic to examine if there was evidence
of autocorrelation. What is the Durbin-Watson test? What is autocorrelation? Why is it a problem? How
do we fit models accounting for autocorrelation?
In order to understand autocorrelation, we need to step back and look at the model for regression analysis
in a little more detail. Recall that we often used a shorthand notation to represent a linear regression
problem:
Y = Time
where Y is the response variable, and Time is the effect of time. Mathematically, the model is written as:
Y_i = β_0 + β_1 t_i + ε_i
where β_0 is the intercept, β_1 is the slope, and ε_i is the deviation of the i-th data point from the actual
underlying line.
The usual assumption made in regression analysis is that the ε_i are independent of each other. In
autocorrelated models, this is not true. Mathematically, the simplest autocorrelation process (known as an AR(1)
process) has:
ε_{i+1} = ρ ε_i + a_i
where the a_i are now independent and ρ is the autocorrelation coefficient.
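The AR(1) process above is easy to simulate. The sketch below (in Python; the function name, default parameter values, and the variance-stabilizing scaling of the innovations are my choices, not part of the notes) generates a straight-line trend with AR(1) deviations of the kind shown in the plots that follow:

```python
import math
import random

def simulate_ar1_trend(n=30, beta0=50.0, beta1=2.0, rho=0.8, sd=10.0, seed=1):
    """Simulate Y_i = beta0 + beta1*t_i + eps_i, where the deviations follow
    the AR(1) process eps_{i+1} = rho*eps_i + a_i with independent normal a_i."""
    rng = random.Random(seed)
    # Scale the innovations so the eps_i keep a constant variance of sd**2.
    innov_sd = sd * math.sqrt(1.0 - rho ** 2)
    eps = rng.gauss(0.0, sd)
    series = []
    for t in range(n):
        series.append(beta0 + beta1 * t + eps)
        eps = rho * eps + rng.gauss(0.0, innov_sd)
    return series
```

Re-running this with rho near +1 produces the long runs above and below the trend line discussed below; rho near -1 produces the alternating pattern.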
In the same way that regular correlation between two variables ranges from -1 to +1, so too does
autocorrelation. An autocorrelation of 0 would indicate no correlation between successive deviations about the
regression line, as ε_i would have no effect on ε_{i+1}; an autocorrelation close to 1 would indicate very
high correlation between successive deviations; an autocorrelation close to -1 (very rare in ecological studies)
would indicate a negative influence, i.e. large positive deviations in one year are typically followed by large
negative deviations in subsequent years. 23
The following plots are some examples of autocorrelated data about the same underlying linear trend,
with the associated residual plots:
23 A negative autocorrelation can be induced if there is a cost to breeding, so that a successful breeding season is followed
by a year of not breeding, etc.
[Figure: pairs of panels showing simulated "baseline" series (left, plotted against time, 0 to 30) and the
corresponding residual plots (right), one pair for each autocorrelation rho = -0.95, -0.90, -0.80, -0.60,
-0.40, -0.20, 0.00, 0.20, 0.40, 0.60, 0.80, 0.90, and 0.95.]
If the autocorrelation is close to -1, then points above the underlying trend are usually followed immediately
by points below the underlying trend. The fitted line will be close to the underlying trend. The residual plot
will show the same pattern.
If the autocorrelation is close to 1, then you will see long runs of points above the underlying trend line
and long runs of points below the underlying line. DANGER! In cases of very high autocorrelation
with short time series, you can be drastically misled by the data! If you examine the plots above, you see that
in the case of high positive autocorrelation, the points tended to stay above or below the underlying trend
line for long periods of time. If the time series is short, you may never see the series dip below the
real trend line, and the fitted line (shown in the above plots) may be completely misleading with no
way to detect this! Ironically, with short time series (e.g. fewer than 30 data points), it will be very difficult
to detect high positive autocorrelation, and this is exactly the time when it can cause the most damage because
the data give misleading results!
If the autocorrelation is close to 0, the points will be randomly scattered about the underlying trend
line, the fitted line will be close to the underlying trend line, and the residuals should appear to be randomly
scattered about 0.
In many cases, if you have fewer than 30 data points, it will be very difficult to observe or detect any
autocorrelation unless it is extreme!
What are the effects of autocorrelation? In most cases in ecology the autocorrelation tends to be positive.
This has the following effects:
• Estimates of the slope and intercept are still unbiased, but they are less efficient (i.e. the true standard
error is larger) than estimates of the same process in the absence of autocorrelation. This may seem to
be contradicted by my statement above that in the presence of high positive autocorrelation and short
time series the data may be very misleading, but that is an artifact of having a very short time
series. With a long time series you will see that the data run over and under the trend line in long
waves, and the fitted line will once again be close to the actual underlying trend.
• The reported variance around the regression line (MSE) may seriously underestimate the true variance.
• Unfortunately, while the estimates of the slope and intercept are usually not affected greatly, the
reported standard errors can be misleading. In the case of positive autocorrelation, the
reported standard errors obtained when a line is fit assuming no autocorrelation are typically too
small, i.e. the estimates look more precise than they really are.
• Reported confidence intervals ignoring autocorrelation tend to be too narrow.
• The p-values from hypothesis testing tend to be too small, i.e. you tend to detect differences that are
not real too often.
The autocorrelation can be estimated from the data in many ways. In one method, a regression line is fit
to the data, the residuals are found, and then the autocorrelation is estimated as:
ρ̂ = ( Σ_{i=2}^{T} e_i e_{i-1} ) / ( Σ_{i=2}^{T} e_{i-1}² )
where e_i is the residual for the i-th observation. Bence (1995) points out that this often underestimates the
autocorrelation and provides some corrected estimates. More modern methods estimate the autocorrelation
using a technique called maximum likelihood, and these often perform better than such two-step methods.
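The two-step estimate above is only a few lines of code. A minimal sketch (pure Python; the function name is mine, and the summation limits follow the formula as reconstructed above):

```python
def lag1_autocorr(residuals):
    """Estimate rho from regression residuals e_1..e_T as
    sum(e_i * e_{i-1}, i = 2..T) / sum(e_{i-1}**2, i = 2..T)."""
    num = sum(e1 * e0 for e0, e1 in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals[:-1])
    return num / den
```

Perfectly alternating residuals give an estimate of -1, while residuals that never change sign or size give +1, matching the interpretation of autocorrelation above.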
As a rule of thumb, the reported standard errors obtained from fitting a regression ignoring autocorrelation
should be inflated by a factor of √((1+ρ)/(1−ρ)). For example, if the actual autocorrelation is 0.6, then the
standard errors (from an analysis ignoring autocorrelation) should be inflated by a factor of
√((1+.6)/(1−.6)) = 2,
i.e. multiply the reported standard errors ignoring autocorrelation by a factor of 2. Consequently, unless the
autocorrelation is very close to 1 or -1, the inflation factor is usually pretty small. 24
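The rule of thumb is a one-liner; a sketch (function name mine) that reproduces the worked example:

```python
import math

def se_inflation_factor(rho):
    """Rule-of-thumb factor by which naive standard errors should be
    multiplied when the residuals have lag-1 autocorrelation rho."""
    return math.sqrt((1.0 + rho) / (1.0 - rho))
```

For rho = 0.6 this returns exactly 2, as in the example above; for rho = 0 it returns 1, i.e. no correction.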
A slightly simpler formula that also seems to work well in practice is that the effective sample size in the
presence of autocorrelation is found as:
n_effective = 1 + (n − 1)(1 − ρ)
This is based on the observation that the first observation counts as a full data point, but each additional data
point only counts as (1 − ρ) of a data point. Then use the fact that for most statistical problems the standard
errors decrease by a factor of √n to estimate the effect upon the precision of the estimates. For example, if
n/n_effective = 2, then the reported standard errors (computed ignoring autocorrelation) should be inflated by a
factor of about √2.
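The effective-sample-size rule can be sketched the same way (function names mine):

```python
import math

def effective_sample_size(n, rho):
    """Approximate effective sample size for n equally spaced observations
    with lag-1 autocorrelation rho: the first point counts fully, each
    additional point counts as (1 - rho) of a point."""
    return 1.0 + (n - 1) * (1.0 - rho)

def se_inflation_from_n(n, rho):
    """Inflate naive standard errors by sqrt(n / n_effective)."""
    return math.sqrt(n / effective_sample_size(n, rho))
```

With rho = 0 the effective sample size equals n and no inflation is needed, consistent with the independent-errors case.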
The Durbin-Watson test statistic is a popular measure of autocorrelation. It is computed as:
d = Σ_{i=2}^{N} (e_i − e_{i-1})² / Σ_{i=1}^{N} e_i²
  ≈ ( 2 Σ_{i=1}^{N} e_i² − 2 Σ_{i=2}^{N} e_i e_{i-1} ) / Σ_{i=1}^{N} e_i²
  ≈ 2 (1 − ρ)
Consequently, if the autocorrelation is close to 0, the Durbin-Watson statistic should be close to 2. The
p-value for the statistic is found from tables, but most modern software can compute it automatically.
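The statistic itself (not its p-value) is straightforward to compute from the residuals; a sketch (function name mine):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: the sum of squared successive differences of
    the residuals divided by the sum of squared residuals; approximately
    2*(1 - rho) for lag-1 autocorrelation rho."""
    num = sum((e1 - e0) ** 2 for e0, e1 in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den
```

Values near 2 suggest no autocorrelation; values near 0 suggest strong positive autocorrelation, and values near 4 suggest strong negative autocorrelation.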
Remedial measures If there is strong evidence for autocorrelation, there are a number of remedial measures
that can be taken:
• Use ordinary regression and inflate the reported standard errors by the inflation factor mentioned
above. This is a very approximate solution and is not often used now that modern software is available.
24 As Bence (1995) points out, the correction factor assumes that you know the value of ρ. Often ρ is difficult to estimate, and
typically the estimates are too close to 0, resulting in a correction factor that is also too small. He provides a bias-adjusted correction
factor.
• A major cause of autocorrelation is the omission of an important explanatory variable. The example
of precipitation that tends to occur in cycles was noted earlier. In this case, a more complex regression
model (multiple regression) that looks at the simultaneous effect of two or more variables would be
appropriate. Unfortunately this is beyond the scope of these notes.
• Transform the variables before using simple regression methods that ignore autocorrelation. There are
two popular transformations, the Cochrane-Orcutt and Hildreth-Lu procedures. Both procedures start
by estimating the autocorrelation ρ by fitting the ordinary regression line, obtaining the residuals, and
then using the residuals to estimate the autocorrelation. Then the data are transformed by subtracting
the estimated portion due to autocorrelation. Finally, the transformed data are refit using ordinary
regression (again ignoring autocorrelation). These approaches are falling out of favor because of the
availability of the integrated procedures below.
• Use a more sophisticated fitting procedure that explicitly estimates the autocorrelation and accounts
for it. This can be done using maximum likelihood or extensions of the previous methods, e.g. the
Yule-Walker methods, which fit generalized least squares. Many statistical packages offer such procedures;
e.g. SAS's PROC AUTOREG is specially designed to deal with autocorrelation and uses the
Yule-Walker methods, while SAS's Proc MIXED uses maximum likelihood methods.
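The transformation step shared by the Cochrane-Orcutt and Hildreth-Lu procedures is simple to sketch (function name mine; this shows only the subtraction step, not the full iterative procedure):

```python
def cochrane_orcutt_transform(y, x, rho):
    """One pass of the Cochrane-Orcutt style transformation: subtract the
    estimated autocorrelated portion of each observation so that ordinary
    regression can be applied to the transformed series (which loses its
    first point)."""
    y_star = [y[i] - rho * y[i - 1] for i in range(1, len(y))]
    x_star = [x[i] - rho * x[i - 1] for i in range(1, len(x))]
    return y_star, x_star
```

Ordinary least squares on (x_star, y_star) then yields a slope estimate whose standard error is approximately valid despite the original AR(1) errors.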
2.6.1 Example: Mink pelts from Saskatchewan
L.B. Keith (1963) collected information on the number of mink pelts from Saskatchewan, Canada over
a 30 year period. This is data series 3707 in the NERC Centre for Population Biology, Imperial College
(1999) The Global Population Dynamics Database available at http://www.sw.ic.ac.uk/cpb/
cpb/gpdd.html.
We are interested in seeing if there is a linear trend in the series.
Here is the raw data:
Year Pelts
1914 15585
1915 9696
1916 6757
1917 6443
1918 6744
1919 10637
1920 11206
1921 8937
1922 13977
1923 11430
1924 13955
1925 6635
1926 7855
1927 5485
1928 5605
1929 5016
1930 6028
1931 6287
1932 11978
1933 15730
1934 14850
1935 9766
1936 6577
1937 3871
1938 4659
1939 6749
1940 12469
1941 8579
1942 6839
1943 9990
1944 6561
1945 5831
1946 8088
1947 9579
1948 10672
1949 16195
1950 12596
1951 12833
1952 18853
1953 11493
1954 14613
1955 18514
The raw data are available in a JMP file called mink.jmp in the Sample Program Library available
at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
It is common, when dealing with population trends, to analyze the data on the log-scale. The reason for
this is that many processes operate multiplicatively on the original scale, and this translates into a straight
line on the log-scale. For example, if the number of pelts harvested increased by x% per year, the forecasted
number of pelts harvested would be fit by the equation:
Pelts = B(1 + x)^(Years from baseline)
where B is the baseline number of pelts. When this is transformed to the log-scale, the resulting equation is:
log(Pelts) = log(B) + (Years from baseline) log(1 + x)
or
Y′ = β_0 + β_1 (Years from baseline)
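This linearization is easy to verify numerically. A quick check (pure Python; the baseline B and growth rate x are illustrative values, not from the mink data):

```python
import math

# Hypothetical baseline of 10000 pelts growing by 4% per year.
B, x = 10000.0, 0.04
pelts = [B * (1.0 + x) ** t for t in range(10)]
log_pelts = [math.log(p) for p in pelts]

# On the log scale the series is exactly linear: every successive
# difference equals the slope log(1 + x).
diffs = [b - a for a, b in zip(log_pelts, log_pelts[1:])]
slope = math.log(1.0 + x)
```

Back-transforming the slope, exp(slope) - 1, recovers the 4% annual growth rate exactly, which motivates the natural-log recommendation below.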
This equation can be further modified by using the raw year as the X variable (rather than years-from-baseline).
All that happens is that the baseline value refers back to year 0 (which is pretty meaningless),
but the value of the slope is still OK.
It is recommended that you take natural logarithms (base e) rather than common logarithms (base 10)
because then the estimated slope has a nice interpretation. For small slopes on the natural log scale, the value
of β̂_1 corresponds closely to the percentage increase per year. For example, if β̂_1 = .04, then the
population is increasing at a rate of
exp(β̂_1) − 1 = exp(.04) − 1 = 1.041 − 1 = .041 ≈ β̂_1 = .04
or 4% per year.
JMP Analysis
JMP deals with autocorrelated data through the Analyze →Modelling →Time Series platform, which is
beyond the scope of this course. This time series platform allows you to fit the Box-Jenkins ARIMA(p,q)
series of models, but does not allow for missing data.
log_mink was constructed using a formula variable in the usual fashion. Here is a portion of the raw
data:
Begin by using the Analyze->Fit Y-by-X platform to fit a simple linear fit and to fit a line joining all of
the points: 25
25 Use the Fit Each Value option under the red-triangle pop-down menu to get the individual points joined up.
There appears to be a generally increasing trend, but the points seem to show an irregular cyclical
pattern where several years of high takes of pelts are followed by several years of low takes of pelts.
This is often a sign of autocorrelated residuals. Indeed, the residual plot shows this pattern: 26
26 This residual plot was obtained by saving the residuals to the data sheet, and then using the Analyze->Fit Y-by-X platform to plot
the saved residuals against year. The joined line was obtained by using the Fit Each Value option from the red-triangle pop-down menu.
The horizontal line at zero was obtained by using the Fit Special option from the red-triangle menu and selecting an intercept of 0 and a slope of 0.
In order to estimate the autocorrelation, the Analyze->Fit Model platform must be used to fit a linear
model to log_mink
and obtain the fitted line and residual plots in the usual way. The Durbin-Watson statistic is obtained from the red-triangle pop-down menu:
The Durbin-Watson statistic indicates that there is strong evidence of autocorrelation, with an estimated autocorrelation of approximately 0.56.
The estimated intercept and slope (without adjusting for autocorrelation) are:
The number of pelts is estimated to increase at about 0.8% per year. As noted before, the estimates are still unbiased, but the reported standard errors are too small. Using the rule-of-thumb, the inflation factor for the standard errors is approximately:

InfFactor = sqrt((1 + ρ̂)/(1 − ρ̂)) = sqrt((1 + .56)/(1 − .56)) = 1.9

Hence a more realistic standard error would be 1.9 × .005 = .009.
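The rule-of-thumb adjustment is easy to check directly; the naive se of .005 and ρ̂ = .56 below are taken from the output above:

```python
import math

def se_inflation(rho):
    """Rule-of-thumb inflation factor for standard errors when the
    residuals have lag-1 autocorrelation rho: sqrt((1+rho)/(1-rho))."""
    return math.sqrt((1 + rho) / (1 - rho))

rho_hat = 0.56      # estimated autocorrelation from the Durbin-Watson output
naive_se = 0.005    # reported (unadjusted) standard error of the slope
adjusted_se = se_inflation(rho_hat) * naive_se   # roughly 1.9 x .005
```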
A more formal analysis would proceed as follows. First launch the Analyze->Modelling->Time Series platform:
and specify the Y variable.
The Time variable is only used for graphing. JMP assumes that the data are equally spaced without any missing values.
This gives the initial output:
The estimated lag-1 autocorrelation is about 0.6 - quite high - but the lag-2 and higher autocorrelations don't appear to be statistically significant as they don't fall outside the blue lines drawn on the graph of the autocorrelations.
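The sample autocorrelations and the reference lines can also be computed outside JMP. A minimal sketch - since the mink series itself is not reproduced in these notes, a simulated AR(1) series with true lag-1 autocorrelation 0.6 stands in for log_mink:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = np.zeros(n)
for t in range(1, n):                  # AR(1): y_t = 0.6*y_{t-1} + noise
    y[t] = 0.6 * y[t - 1] + rng.normal()

def acf(x, k):
    """Sample lag-k autocorrelation."""
    x = x - x.mean()
    return float((x[k:] * x[:-k]).sum() / (x * x).sum())

r1 = acf(y, 1)                         # should be near 0.6
band = 2 / np.sqrt(n)                  # approximate 95% limits (the "blue lines")
```

Lags whose sample autocorrelation falls inside ±band are consistent with no autocorrelation at that lag.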
A simple autoregression model with NO TREND (called the ARIMA(1,0,0) model) is fit using the ARIMA drop-down menu and completing the various boxes:
A key assumption of the Box-Jenkins approach is that the series is stationary, i.e. has a constant mean. If there is a linear trend in the log_mink numbers, this MUST first be removed before a subsequent model is fit. A simple linear trend is removed by differencing. For example, if a simple linear trend model is correct,

Y_t = β_0 + β_1 t

then:

Y_{t+1} = β_0 + β_1 (t + 1)
Y_{t+1} − Y_t = β_1

and the FIRST differences are constant.
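The algebra above is easy to verify numerically: first differences of an exact linear trend are constant and equal to the slope. A quick sketch with an arbitrary intercept and slope:

```python
import numpy as np

t = np.arange(10)
b0, b1 = 2.0, 0.8        # arbitrary intercept and slope
y = b0 + b1 * t          # Y_t = b0 + b1*t, no noise
d = np.diff(y)           # Y_{t+1} - Y_t
# every element of d equals b1, so differencing removes the linear trend
```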
A first difference model is fit by specifying the second term in the ARIMA model specification:
Finally, a model with differencing but no autocorrelation may also be useful:
A comparison of the three models is given by JMP:
The AIC criterion indicates that the model with the lowest value of AIC is preferred; models with an AIC within 2 or 3 of the best-fitting model could also be candidates. According to this output, the AR(1) model is the best-fitting model with an AIC of −94, almost 8 units lower than the next best-fitting model, i.e. it wasn't necessary to remove the trend from the model before fitting it to the data.
Indeed, if you look at the output from the AR(1,1) or AR(0,1) model:
the estimated average difference in the log_mink is only .00042 with a se of .05, clearly not statistically different from zero.
It is interesting to note that there is no evidence of further autocorrelation in the residuals after the first differences were taken, as the AR(1,1) model is a worse fit (but only by about 2 units) when compared on the AIC scale. You can also estimate the average first difference by computing a derived variable using the Formula editor and using the Analyze->Fit Model platform to estimate the overall mean and to see if there is residual autocorrelation.
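That last suggestion - difference the series yourself and test whether the mean difference is zero - takes only a few lines. A sketch using a simulated stand-in series, since the actual log_mink values are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
e = np.zeros(n)
for t in range(1, n):                      # autocorrelated noise, no trend
    e[t] = 0.6 * e[t - 1] + rng.normal(0, 0.2)
log_y = 5.0 + e                            # hypothetical stand-in for log_mink

d = np.diff(log_y)                         # derived first-difference variable
mean_d = d.mean()                          # estimated average yearly change
se_d = d.std(ddof=1) / np.sqrt(d.size)     # its (naive) standard error
# |mean_d| well under 2*se_d: no evidence of a trend in the differences
```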
Final Notes
It is interesting to note that there is no evidence of further autocorrelation in the residuals after the first differences were taken. If you hadn't examined the autocorrelation plots you would not have known this. It is quite common that a first difference will remove much of the autocorrelation in the data, and this is often a good first step.
2.7 Dealing with seasonality
In many cases, the "cause" of autocorrelation over time is some sort of seasonality. For example, stream flow may follow a cyclical pattern with high flows in the winter months (at least in Vancouver) and low flows in the summer months. A way to deal with this type of autocorrelation is either to first adjust the data for seasonal effects and then use the usual regression methods on the adjusted data, or to fit a cyclic pattern over and above the simple trend line.
2.7.1 Empirical adjustment for seasonality
General idea
The intuitive idea behind this method is quite simple. Arrange the data into seasonal groups (e.g. months) and subtract the seasonal group mean or median 27 from every point in the seasonal series. This will subtract the cyclic pattern and leave adjusted data that are "free" of seasonal effects.
The adjustment process can either be done within the computer package or, in many cases, is easily done on a spreadsheet.
This adjustment is a bit ad hoc, but seems to work well in practice. The reported standard errors from the regression line are a bit too small as they have not accounted for the adjustment process.
Example: Total phosphorus from Klamath River
Consider, for example, values of total phosphorus taken from the Klamath River near Klamath, California, as analyzed by Hirsch et al. (1982). 28
27 The median would be preferred to avoid contamination of the mean by outliers.
28 This was monitoring station 11530500 from the NASQAN network in the US. Data are available from http://waterdata.usgs.gov/nwis/qwdata/?site_no=11530500. The data were analyzed by Hirsch, R.M., Slack, J.R., and Smith, R.A. (1982). Techniques of trend analysis for monthly water quality data. Water Resources Research 18, 107-121.
Total phosphorus (mg/L) in Klamath River near Klamath, CA

Month   1972  1973  1974  1975  1976  1977  1978  1979
  1     0.07  0.33  0.70  0.08  0.04  0.05  0.14  0.08
  2     0.11  0.24  0.17   .     .     .    0.11  0.04
  3     0.60  0.12  0.16   .    0.14  0.03  0.02  0.02
  4     0.10  0.08  1.20  0.11  0.05  0.04  0.06  0.01
  5     0.04  0.03  0.12  0.09  0.02  0.04  0.03  0.03
  6     0.05
shows an obvious seasonality to the data, with peak levels occurring in the winter months. There are also some missing values, as seen in the raw data table. Finally, notice the presence of several very large values (above 0.20 mg/L) that would normally be classified as outliers. Consequently, we will use the median from each month for the adjustment. The sorted values for the January readings are:
.04, .05, .07, .08, .08, .14, .33, .70
The median value for the January readings is the average of the 4th and 5th observations 29, or

median_January = (.08 + .08)/2 = .08.
The value of .08 is subtracted from each of the January readings to give

−.01, .25, .62, .00, −.04, −.03, .06, .00
29 If the number of observations is odd, as for February, the median is the middle value.
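The January adjustment can be checked in a few lines using the readings from the raw data table:

```python
from statistics import median

# January total phosphorus readings, 1972-1979, from the table above
january = [0.07, 0.33, 0.70, 0.08, 0.04, 0.05, 0.14, 0.08]

m = median(january)                 # average of the two middle values = .08
adjusted = [round(x - m, 2) for x in january]
# adjusted reproduces the seasonally adjusted January row: -.01, .25, .62, ...
```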
This process is repeated for each month. These computations are illustrated in the Klamath tab in the ALLofDATA.xls workbook available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms to give:
Seasonally Adjusted Total phosphorus (mg/L) in Klamath River near Klamath, CA

Month    1972   1973   1974   1975   1976   1977   1978   1979
  1     -0.01   0.25   0.62   0.00  -0.04  -0.03   0.06   0.00
  2      0.00   0.13   0.06    .      .      .     0.00  -0.07
  3      0.48   0.00   0.04    .     0.02  -0.09  -0.10  -0.10
  4      0.03   0.01   1.13   0.04  -0.02  -0.03  -0.01  -0.06
  5      0.01  -0.01   0.09   0.06  -0.02   0.01  -0.01  -0.01
  6      0.00  -0.04   0.00   0.00    .      .    -0.02    .
  7      0.00   0.00  -0.01  -0.02    .     0.02  -0.02   0.00
  8     -0.01   0.01  -0.03  -0.01   0.02   0.03   0.01  -0.04
  9      0.02   0.01  -0.03   0.02    .     0.00  -0.04    .
 10      0.00   0.00  -0.01    .     0.00  -0.04  -0.03   0.20
 11      0.00   0.28    .    -0.01    .     0.33   0.00    .
 12      0.01   0.03  -0.01  -0.08    .     0.18  -0.06    .
A plot of the seasonally adjusted values:
shows that most of the seasonal effects have been removed, but there may still be evidence of autocorrelation. There are certainly still some outliers.
JMP analysis
The seasonally adjusted values were imported into JMP and stacked in the usual way. A new variable year-month was created using a formula variable, year-month = year + (month − 1)/12, to represent time:
The Analyze->Fit Y-by-X platform was used to draw the scatter plot and fit the preliminary line:
It is a bit worrisome that the outliers seem to be all in the early years. All seasonally adjusted values greater than 0.2 were excluded from the analysis 30 and the line was refit:
30 Use the Rows→Select command to select these rows.
There appears to be evidence of a trend of −.0032 mg/L/year. The p-value and se of the slope are likely too small by some small factor because the seasonal adjustment was not taken into account. The residual plot seems to show some evidence of remaining autocorrelation.
The Analyze->Fit Model platform was reused to fit the data and obtain the Durbin-Watson statistic:
This indicates a low amount of residual autocorrelation (estimated value of .04), but it is statistically significant because the large sample size allows you to detect very small autocorrelations.
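The Durbin-Watson statistic itself is simple to compute from saved residuals: d = Σ(e_t − e_{t−1})² / Σe_t², and d ≈ 2(1 − ρ̂), so a value near 2 indicates little autocorrelation. A sketch on simulated independent residuals (the actual saved residuals are not reproduced here):

```python
import numpy as np

def durbin_watson(resid):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); d is near 2 when the
    residuals are uncorrelated, and d is approximately 2*(1 - rho_hat)."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(4)
e = rng.normal(size=500)        # independent residuals
d = durbin_watson(e)            # close to 2
rho_hat = 1 - d / 2             # implied autocorrelation estimate, close to 0
```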
Further comments It is a bit worrisome that all of the outliers appear to happen early in the time series and that, once these are removed, there is no evidence of a trend. However, one could argue that the disappearance of the outliers is, in fact, the most interesting point of this dataset and that the fact that the outliers disappeared indicates evidence of a downward trend.
It also turns out that the results are VERY sensitive to which outliers are removed. For example, in late 1977 there is a seasonally adjusted value of .17, and in late 1979 there was a seasonally adjusted value of 0.20, that were not excluded. If these points are also removed, the final regression line is not statistically significant with an estimated trend of −.0063 mg/L/year.
As you will see later, a non-parametric analysis that includes these outlier points did detect a downward trend with an estimated slope of about −.006 mg/L/year! The moral of the story is that statistics must be used carefully!
2.7.2 Using the ANCOVA approach
General idea
Rather than relying on an ad hoc approach to doing a seasonal adjustment, the ANCOVA method can also be used. The advantage of the ANCOVA method over the ad hoc approach is that not only can you fit an overall trend line, you can also test whether the trend is the same for all seasons. Outliers will have to be removed in the usual fashion.
The general model will start with the non-parallel slope model of the form:

Y = Season Time Season*Time

Then examine whether the Season*Time interaction term indicates that the slopes may not be parallel over seasons.
If there is insufficient evidence against the hypothesis of parallelism, then fit the final model with a common slope over the seasons, but differences among the seasons:

Y = Season Time
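The parallel-slope model can be sketched without a statistics package by building the design matrix directly: one indicator (dummy) column per season acting as a separate intercept, plus a single shared slope on time. The data below are simulated with a known common trend of −0.005 per year, loosely mimicking the phosphorus setup; they are not the real data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated monthly data, 1972-1979: month-specific levels plus a common trend
years = np.tile(np.arange(1972, 1980), 12).astype(float)
months = np.repeat(np.arange(1, 13), 8)
level = 0.10 + 0.02 * np.cos(2 * np.pi * months / 12)
y = level - 0.005 * (years - 1972) + rng.normal(0, 0.005, size=years.size)

# Design matrix: 12 month indicators (separate intercepts) + shared Year slope
X = np.column_stack([(months == m).astype(float) for m in range(1, 13)]
                    + [years - 1972])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
common_slope = beta[-1]         # overall trend after removing seasonality
```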
Example: Total phosphorus levels on the Klamath River - revisited
JMP Analysis
The raw data are available in the file klamath.jmp in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
An earlier plot shows that there are some outliers. Remove all data points greater than 0.20 mg/L.
The Analyze->Fit Model platform is used to fit the non-parallel slope model. CAUTION: Be sure that month is nominally scaled and that year is continuously scaled!
The graph of the lines by season appears to show that some seasons (months) have a different slope than the other months:
and the effect test for non-parallel slopes:
also shows some evidence of non-parallel slopes. However, we will fit the parallel slope model to continue the demonstration.
The Analyze->Fit Model platform is again used to fit the parallel slope model. CAUTION: Be sure that month is nominally scaled and that year is continuously scaled!
The fitted lines and the model fit graph appear to be acceptable:
The effect tests show a strong effect of year, with estimated coefficients of:
The estimated trend is −.0056 (se .0016) mg/L/year, which is comparable to the previous estimates. Note that the estimates for the month effects are not directly interpretable from this output - the LSMEANS table should be consulted - seek help on this point.
The residual plots (not shown) don't indicate any major problems. The Durbin-Watson test for autocorrelation detects a small autocorrelation, but with this large sample size it is not practically important.
2.7.3 Fitting cyclical patterns
General approach
In some cases, the seasonal pattern is quite regular, with regular peaks during one part of the year and regular lows during another part of the year. Another approach is to try to account for this cyclical pattern, and then see if there is still evidence of a decline over time.
The basic building blocks for the seasonality are sine and cosine functions used to represent the
seasonal patterns. The general model will take the form:

Y_i = β_0 + β_1 t_i + β_2 cos(2π t_i / ν) + β_3 sin(2π t_i / ν) + ε_i
Here the coefficients β_0 and β_1 represent the intercept and linear change over time. The coefficients β_2 and β_3 represent the seasonal components.
The term ν represents the period of the cycle. It is assumed to be known in advance. For example, if the cycles are one year in duration and the time axis is measured in years, then ν = 1. If the cycles are one year in duration but the time axis is measured in months, then ν = 12. This is often coded incorrectly, so be careful!
The reason both a sine and a cosine function are included is that these two functions have the same period but are shifted in phase relative to each other. For example, the cosine function has a maximum at the start of each cycle and a minimum half-way through each cycle, while the sine function has a maximum at the 1/4 point of a cycle and a minimum at the 3/4 point of the cycle. A weighted sum of the two can therefore place the peak of the fitted cycle anywhere within the period.
The analysis starts by creating two new variables in the data table corresponding to the sine and cosine functions. Then multiple regression is used to fit a model incorporating all three explanatory variables. In the short-hand notation for models, the model fit is:

Y = Time Cos Sin

After the model is fit, the coefficient of the Time variable represents the overall trend. The usual tests of hypothesis for no trend, and confidence intervals for the slope, can be found. The slope is interpreted as the change in Y per unit change in X = Time after adjusting for seasonality. The coefficients for the sine and cosine functions are usually not of interest.
The computation should NOT be attempted by hand or in a spreadsheet program. Most statistical packages have facilities for creating the relevant variables and fitting these models.
The usual assumptions still hold, so they should be checked via residual plots, estimation of the autocorrelation that remains, etc.
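A minimal sketch of the whole procedure, using simulated monthly data with a known yearly cycle (ν = 1 since time is in years) and a true trend of −0.005 per year; the series is hypothetical, not the Klamath data:

```python
import numpy as np

rng = np.random.default_rng(2)
t = 1972 + np.arange(96) / 12.0         # time in years, monthly observations
nu = 1.0                                # one-year cycle, time axis in years
y = (0.10 - 0.005 * (t - 1972)
     + 0.03 * np.cos(2 * np.pi * t / nu)
     + 0.01 * np.sin(2 * np.pi * t / nu)
     + rng.normal(0, 0.01, t.size))

# Create the two cyclic regressors, then fit Y = Time Cos Sin by
# ordinary multiple regression.
X = np.column_stack([np.ones_like(t), t - 1972,
                     np.cos(2 * np.pi * t / nu),
                     np.sin(2 * np.pi * t / nu)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
trend = beta[1]                          # slope after adjusting for seasonality
```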
Example: Total phosphorus from Klamath River
JMP Analysis
The data and model fits are available in a JMP file klamath3.jmp available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The data must be stacked in the usual fashion and a variable year-month created to represent the time variable, year-month = year + (month − 1)/12.
As the time variable is measured in years and the preliminary plot shows a yearly cycle, ν = 1, so the formulae for the cosine and sine variables are:
respectively. This gives the final data table, looking somewhat like:
There are no problems with the fact that some of the phosphorus data are missing, as the package will simply ignore any row that is not complete.
The Analyze->Fit Model platform is used to fit the model:
The output is voluminous and a full discussion is beyond the scope of these notes. 31 The key things to look at are the estimated coefficients, the residual plots, and the model fit plots:
31 See Freund, R., Littell, R. and Creighton, L. (2003). Regression Using JMP. Wiley, for more details on the output from this platform.
These all indicate the presence of several outliers.
The model was refit omitting the outliers with phosphorus values greater than 0.20. 32 The residual and model fit plots are much better:
32 Use the Rows→Select command to select the rows and the Rows→Exclude command to remove them from the analysis.
but the residual plot still shows something strange happening about half-way through the time series. It appears that the cycles are shifting, so you get a long wave of residuals.
The estimated coefficients are:
The coefficients for both the cosine and sine terms are statistically significant but not of much interest. The estimated trend is −.0056 mg/L/year (se .0017) with a p-value for the trend line of .0013. The results are statistically significant.
The Durbin-Watson test for autocorrelation shows some residual serial correlation:
which likely reflects the behavior in the tail end of the series.
Example: Comparing air quality measurements using two different methods
The air that we breathe often has many contaminants. One contaminant of interest is Particulate Matter (PM). Particulate matter is the general term used for a mixture of solid particles and liquid droplets in the air. It includes aerosols, smoke, fumes, dust, ash and pollen. The composition of particulate matter varies with place, season and weather conditions. Particulate matter is characterized according to size - mainly because of the different health effects associated with particles of different diameters. Fine particulate matter is particulate matter that is 2.5 microns in diameter or less. [A human hair is approximately 30 times larger
than these particles!] The smaller particles are so small that several thousand of them could fit on the period at the end of this sentence. Fine particulate matter is also known as PM2.5, or respirable particles, because it penetrates the respiratory system further than larger particles.
PM2.5 material is primarily formed from chemical reactions in the atmosphere and through fuel combustion (e.g., motor vehicles, power generation, industrial facilities, residential fireplaces, wood stoves and agricultural burning). Significant amounts of PM2.5 are carried into Ontario from the U.S. During periods of widespread elevated levels of fine particulate matter, it is estimated that more than 50 per cent of Ontario's PM2.5 comes from the U.S.
Adverse health effects from breathing air with a high PM2.5 concentration include premature death, increased respiratory symptoms and disease, chronic bronchitis, and decreased lung function, particularly for individuals with asthma.
Further information about fine particulates is available at many websites, such as http://www.health.state.ny.us/nysdoh/indoor/pmq_a.htm, http://www.airqualityontario.com/science/pollutants/particulates.cfm, and http://www.epa.gov/pmdesignations/faq.htm.
The PM2.5 concentrations in air can be measured in many ways. A well known method is a filter-based method whereby one 24-hour sample is collected every third day. The sampler draws air through a pre-weighed filter for a specified period (usually 24 hours) at a known flowrate. The filter is then removed and sent to a laboratory to determine the gain in filter mass due to particle collection. The ambient PM concentration is calculated as the gain in filter mass divided by the product of the sampling period and the sampling flowrate. Additional analysis can also be performed on the filter to determine the chemical composition of the sample.
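The mass-to-concentration calculation described above is just the mass gain divided by the total volume of air sampled; the numbers in the example call below are hypothetical:

```python
def pm_concentration(mass_gain_ug, hours, flow_m3_per_h):
    """Ambient PM concentration (ug/m^3): filter mass gain divided by the
    total air volume sampled (sampling period x flow rate)."""
    return mass_gain_ug / (hours * flow_m3_per_h)

# e.g. a hypothetical 24-hour sample at 1.0 m^3/h gaining 240 ug of mass
c = pm_concentration(240.0, 24.0, 1.0)   # 10 ug/m^3
```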
In recent years, a program of continuous sampling using automatic samplers has been introduced. An instrument widely adopted for this use is the Tapered Element Oscillating Microbalance (TEOM). The TEOM operates under the following principles. Ambient air is drawn in through a heated inlet. It is then drawn through a filter cartridge on the end of a hollow, tapered tube. The tube is clamped at one end and oscillates freely like a tuning fork. As particulate matter gathers on the filter cartridge, the natural frequency of oscillation of the tube decreases. The mass accumulation of particulate matter is then determined from the corresponding change in frequency.
Because of the different ways in which these instruments work, a calibration experiment was performed. The hourly TEOM readings were accumulated to a daily value and compared to those obtained from the air filter method. Here are the data:
Date TEOM Ref
2003.06.05 8.1 10.6
2003.06.08 6.5 9.0
2003.06.11 3.2 4.6
2003.06.14 2.2 3.7
2003.06.17 5.8 7.9
2003.06.20 1.4 4.4
2003.06.23 1.8 2.8
2003.06.26 4.5 6.5
2003.06.29 4.6 5.8
2003.07.02 3.3 3.6
2003.07.05 1.6 3.7
2003.07.08 7.1 7.2
2003.07.11 7.7 8.6
2003.07.14 4.3 4.4
2003.07.17 4.6 6.4
2003.07.20 7.2 8.5
2003.07.23 8.8 10.5
2003.07.26 8.1 9.0
2003.07.29 11.2 10.4
2003.08.01 19.4 21.0
2003.08.07 5.9 5.2
2003.08.10 11.9 12.6
2003.08.13 7.2 8.4
2003.08.16 48.2 46.2
2003.08.19 49.3 51.2
2003.08.22 53.3 54.5
2003.08.25 56.8 57.2
2003.08.28 4.5 7.4
2003.08.31 27.8 26.1
2003.09.03 34.3 33.0
2003.09.06 41.5 42.1
2003.09.24 5.8 9.5
2003.09.27 5.7 8.0
2003.09.30 9.1 9.8
2003.10.03 10.5 13.9
2003.10.06 10.9 15.6
2003.10.09 3.5 5.6
2003.10.12 4.1 6.3
2003.10.15 5.7 10.1
2003.10.18 15.5 20.2
2003.10.21 5.4 8.9
2003.10.24 11.7 19.0
2003.10.27 14.9 23.3
2003.10.30 3.9 7.5
2003.11.02 12.9 21.2
2003.11.05 18.9 33.4
2003.11.08 23.6 35.9
2003.11.11 19.0 30.2
2003.11.14 18.5 28.2
2003.11.17 11.1 18.4
2003.11.20 11.6 20.1
2003.11.23 9.4 17.9
2003.11.26 25.6 42.8
2003.11.29 6.9 11.2
2003.12.02 13.2 25.6
2003.12.05 10.2 19.9
2003.12.08 17.6 31.6
2003.12.11 6.7 14.1
2003.12.14 16.2 26.5
2003.12.17 8.3 13.5
2004.01.13 6.8 13.8
2004.01.16 9.2 17.3
2004.01.19 16.5 32.6
2004.01.22 4.3 11.6
2004.01.25 6.1 10.0
2004.01.28 10.1 14.4
2004.01.31 14.0 28.1
2004.02.06 19.4 35.0
2004.02.09 15.1 25.2
2004.02.12 16.8 32.9
2004.02.15 15.9 28.5
2004.02.18 9.8 18.5
2004.02.21 9.1 17.2
2004.02.24 17.1 31.9
2004.02.27 12.1 21.7
2004.03.01 8.8 14.1
2004.03.07 3.2 5.6
2004.03.10 10.9 15.3
2004.03.13 7.1 10.8
2004.03.16 7.4 13.8
2004.03.19 10.4 14.0
2004.03.22 10.6 16.1
2004.03.25 5.0 8.4
2004.03.28 6.4 10.3
2004.03.31 5.3 6.6
2004.04.03 6.5 9.7
2004.04.09 6.4 9.7
2004.04.12 7.0 8.8
2004.04.15 2.3 4.6
2004.04.18 4.2 5.7
2004.04.21 4.7 5.7
2004.04.24 3.7 4.1
2004.04.27 4.1 5.0
2004.04.30 7.3 7.3
2004.05.03 3.5 5.0
2004.05.06 2.5 2.8
2004.05.09 2.3 2.7
2004.07.02 6.0 4.3
2004.07.05 3.3 2.4
2004.07.08 1.6 2.0
2004.07.11 1.2 5.7
2004.07.14 5.4 8.3
2004.07.17 8.8 3.5
2004.07.20 2.2 10.0
2004.07.23 8.3 12.5
2004.07.26 10.5 17.0
2004.08.01 25.3 24.7
2004.08.04 14.7 10.5
2004.08.07 2.7 3.1
2004.08.10 6.5 7.2
2004.08.19 20.1 13.6
2004.08.25 4.1 4.2
2004.08.28 2.5 1.5
2004.08.31 4.7 6.3
2004.09.03 3.2 4.0
2004.09.15 1.8 2.6
2004.09.18 2.6 4.7
2004.09.21 4.7 6.2
2004.09.24 5.6 8.0
2004.09.27 7.1 10.0
2004.09.30 4.8 7.7
2004.10.03 9.5 13.3
2004.10.06 10.1 13.0<br />
2004.10.09 3.8 5.0<br />
2004.10.12 5.0 7.3<br />
2004.10.15 2.3 5.4<br />
2004.10.18 7.5 10.1<br />
2004.10.21 8.1 11.0<br />
2004.10.24 6.6 13.6<br />
2004.10.27 14.0 18.2<br />
2004.10.30 15.9 24.8<br />
2004.11.02 8.4 14.1<br />
2004.11.08 10.8 17.6<br />
2004.11.11 1.4 4.7<br />
2004.11.14 6.5 10.0<br />
2004.11.17 11.0 18.8<br />
2004.11.20 7.7 14.4<br />
2004.11.26 15.4 23.4<br />
2004.11.29 8.9 17.1<br />
2004.12.02 18.3 30.8<br />
2004.12.05 6.2 13.5<br />
2004.12.08 8.3 16.5<br />
2004.12.11 9.6 15.9<br />
2004.12.14 9.8 17.6<br />
2004.12.17 11.5 21.5<br />
2004.12.20 14.0 26.1<br />
2004.12.23 9.8 20.0<br />
2004.12.26 4.9 9.4<br />
2004.12.29 3.7 7.6<br />
2005.01.01 10.2 18.5<br />
2005.01.04 18.6 38.3<br />
2005.01.22 11.1 24.7<br />
2005.01.25 11.8 22.7<br />
2005.01.28 13.1 20.9<br />
2005.01.31 5.1 10.9<br />
2005.02.03 6.2 11.1<br />
2005.02.06 6.5 10.0<br />
2005.02.09 10.6 20.8<br />
2005.02.12 11.4 23.3<br />
2005.02.15 12.9 18.8<br />
2005.02.18 14.0 23.4<br />
2005.02.21 21.9 31.7<br />
2005.02.24 17.1 26.4<br />
2005.02.26 8.3 16.3<br />
2005.02.27 11.8 20.1<br />
2005.03.02 16.7 28.9<br />
2005.03.05 12.0 18.9<br />
2005.03.08 5.3 9.8<br />
2005.03.11 10.9 18.8<br />
2005.03.14 11.3 18.1<br />
2005.03.17 8.5 11.0<br />
2005.04.04 12.0 10.9<br />
2005.04.07 7.8 7.1<br />
2005.04.16 2.3 4.8<br />
2005.04.19 5.5 3.9<br />
2005.04.22 8.0 6.7<br />
2005.04.25 7.3 10.0<br />
2005.04.28 3.5 9.0<br />
2005.05.01 4.5 4.5<br />
2005.05.04 5.1 1.8<br />
2005.05.07 2.5 5.4<br />
2005.05.28 6.1 6.7<br />
2005.05.31 9.7 12.0<br />
2005.06.03 5.2 5.0<br />
2005.06.06 0.9 2.1<br />
2005.06.09 4.4 6.2<br />
2005.06.12 2.3 2.7<br />
2005.06.15 2.3 2.2<br />
2005.06.18 1.7 2.6<br />
2005.06.21 6.7 6.9<br />
2005.06.24 3.4 3.8<br />
2005.06.27 4.2 4.6<br />
2005.06.30 4.3 5.5<br />
2005.07.03 2.7 5.2<br />
2005.07.06 3.6 4.2<br />
2005.07.09 1.3 1.9<br />
2005.07.12 2.8 6.3<br />
Do both meters give similar readings over time?

It is quite common when comparing two instruments to make the comparison on the log-ratio scale, i.e. either log(TEOM/reference) or log(reference/TEOM). There are two reasons why this is commonly done. First, the logarithmic scale makes ratios greater than 1 and less than 1 symmetric. For example, the ratios 1/2 and 2 on the regular scale are not symmetric about the value of 1, but log(1/2) = −.693 and log(2) = .693 are symmetric about zero. Second, it is often the case that the variation tends to increase with the base size of the reading. The use of logarithms makes the variances more similar over the spread of the data values.
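The symmetry of the log scale is easy to verify numerically. The sketch below (Python, not part of the original JMP workflow) also computes the log-ratio for the first calibration pair from the table:

```python
import math

# Ratios r and 1/r are symmetric about 0 on the log scale.
print(round(math.log(2), 3), round(math.log(1 / 2), 3))

# log(TEOM/reference) for the first calibration day, 2003.06.05
teom, ref = 8.1, 10.6
print(round(math.log(teom / ref), 3))
```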
JMP Analysis

A JMP data file is available in the teom.jmp file in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Two variables need to be created in the JMP table. First, the log(TEOM/reference) variable as noted above. This is created using the formula editor.

Second, a variable representing the decimal year is required so that plotting and regression happen on the year scale rather than the internal date and time format in JMP. JMP uses the number of seconds since a reference date as the internal value for a date. Consequently, you need to divide by 86,400 seconds/day to convert to days, and then by 365 to convert to years. [This ignores the effect of leap years and leap seconds.] This year variable is also created using the formula editor.
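The same conversion can be sketched outside JMP. The reference date below is assumed to be 1904-01-01 (the classic JMP/Macintosh epoch); as in the notes, leap years and leap seconds are ignored, so the result is only approximate:

```python
from datetime import date

def decimal_year(d, epoch=date(1904, 1, 1)):
    """Convert a date to a decimal year the way the notes describe:
    internal seconds / 86,400 seconds-per-day / 365 days-per-year."""
    seconds = (d - epoch).days * 86_400
    return epoch.year + seconds / 86_400 / 365

print(round(decimal_year(date(2003, 6, 5)), 3))
```

Because the accumulated leap days are divided by 365 rather than 365.25, the decimal part drifts slightly from the true day-of-year fraction, exactly as the bracketed caveat warns.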
Here are the first few lines of data, including the two new derived variables.
A plot of log(TEOM/reference) by the year variable, obtained using the Analyze->Fit Y-by-X platform, shows a clear cyclical pattern. The peaks of the cycles are almost exactly one year apart. Consequently, we then create two new variables to represent the sine and cosine terms for a cyclical fit. Because the time units are in years, the period is also in years and is equal to ν = 1. The following formula variables were created.
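The two formula columns correspond to sin(2πt/ν) and cos(2πt/ν), with t the decimal year and period ν = 1. A sketch (Python; the notes build these in JMP's formula editor):

```python
import math

def cyclic_terms(year, period=1.0):
    """Sine and cosine predictors for a cyclical fit with period nu (in years)."""
    return (math.sin(2 * math.pi * year / period),
            math.cos(2 * math.pi * year / period))

s, c = cyclic_terms(2003.25)  # one quarter of the way into a year
print(round(s, 3))            # the sine term peaks here
```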
The first few lines of the data table now include the sine and cosine columns.
Now use the Analyze->Fit Model platform to fit a multiple regression using the year, sine, and cosine variables.
The effects test indicates the presence of a cyclical pattern (not unexpectedly), but no evidence of a year effect. Save the predicted values and the residuals to the data table using the Red Triangle→Save Columns pop-down menus.

The residual plot, found using the Analyze->Fit Y-by-X platform, shows no severe lack of fit. There are several outliers, and perhaps something unusual is happening in mid-2003. An overlay plot of the actual and predicted values 33 shows a generally good fit, with some outlier points and, again, further investigation required in about mid-2003. 34

33 Use the Graph→Overlay platform; select the observed and predicted values as the Y variables and year as the X variable; click on the legend for the predicted values and join with a line and hide the points.
The log(TEOM/reference) hardly goes above the value of 0 (which is the reference line indicating no difference between the two instruments). In order to estimate the average log-ratio, we refit the model DROPPING the year term (why?) and examine the parameter estimates of this simpler model.

The average log-ratio is −.39 (se .02). This corresponds to a ratio of .68 on the anti-log scale, i.e. the TEOM meter is reading, on average across the entire year, only 68% of the reference meter.
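The back-transform from the log-ratio scale can be checked directly; the ±2 se interval below is a rough large-sample approximation added here for illustration, not a result from the notes:

```python
import math

mean_log_ratio, se = -0.39, 0.02   # estimates from the refitted model
ratio = math.exp(mean_log_ratio)
print(round(ratio, 2))             # TEOM reads about 68% of the reference

lo, hi = math.exp(mean_log_ratio - 2 * se), math.exp(mean_log_ratio + 2 * se)
print(round(lo, 2), round(hi, 2))  # rough 95% interval for the ratio
```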
2.7.4 Further comments

An implicit assumption of this method is that the amplitude of the seasonal trend is constant in time, i.e. the β2 and β3 terms do not depend on time. It could happen that the amplitude is also decreasing in time. In this case, you may consider a log-transform of the Y variable so that the relative ratio between the top and bottom of the cycle may be fixed. Alternatively, more complex non-linear regression models where the amplitude also depends upon time may be fit. This is beyond the scope of these notes.

The key requirement for this method to work well is the regularity of the seasonal effects; the shape of the seasonal effects must be that of a sine or cosine curve. Consequently, a pattern that is relatively flat with a single sharp peak in a consistent month cannot be well fit by these models. In this case, you could create indicator variables for the peak time and then fit a multiple regression model as above – this is again beyond the scope of these notes.
2.8 Seasonality and Autocorrelation

Whew! This is a tough issue to deal with! Fortunately, there have been great advances in software, and in some packages (e.g. SAS) this is fairly easy to deal with. Unfortunately, this is beyond simple packages such as JMP or SYSTAT.

34 It turns out that these points were collected when a large amount of smoke from a nearby forest fire was present.

This section will be brief, with very little explanation of the underlying statistical concepts, and will refer to output from SAS. Please seek further help if you are dealing with this type of data.
Again, refer back to the Klamath River data. It may turn out that even after adjusting for seasonality, there is residual autocorrelation within a year. For example, a particular year may have generally low phosphorus levels for some reason, and so observations in months close together are more highly related than observations in months far apart.

A common model for dealing with this type of autocorrelation is the familiar AR(1) process with a single autocorrelation parameter. In general, the covariance of two observations is modeled as

cov(Y_t1, Y_t2) = σ² ρ^Δt

where Δt is the difference in time between the two observations. For example, observations that are 1 time unit apart will have covariance σ²ρ¹; observations that are two time units apart will have covariance σ²ρ²; etc.

The advantage of using this power notation is that missing values are easily accommodated – it is not necessary to have every observation in time, so interpolation to 'fill in' missing values is not necessary.
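The power form of the covariance is simple to compute. In the sketch below, the values σ² = 1 and ρ = 0.5 are illustrative defaults, not estimates from the Klamath fit:

```python
def ar1_cov(dt, sigma2=1.0, rho=0.5):
    """Covariance of two observations dt time units apart under the
    spatial-power AR(1) model: sigma^2 * rho**dt.  Because dt need not
    be an integer, unequally spaced or missing observations are handled
    without any interpolation."""
    return sigma2 * rho ** dt

print(ar1_cov(1), ar1_cov(2), ar1_cov(2.5))
```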
Let us revisit the Klamath phosphorus data. A model that allows for seasonal variation (by months) and autocorrelation can be fit using Proc Mixed with both the ANCOVA and autocorrelation models. The code fragment looks like:

proc mixed data=klamath maxiter=200 maxfunc=1000;
where phosphorus
The estimated common slope from this model (mg/L/year) is:

                         Standard
Label        Estimate       Error     DF   t Value   Pr > |t|
avg slope    -0.00578    0.002515   10.9     -2.30     0.0430

which is similar to the estimates found earlier.
A model was also fit assuming independence among the observations (see the ANCOVA approach to seasonal adjustment earlier in this chapter). Is there support for the independence model?

The AIC criterion is used to compare these different models. The two AIC values (corrected for small sample sizes) are:

AICC (smaller is better) -216.2 for the spatial power model
AICC (smaller is better) -208.7 for the independence model

A usual rule of thumb is that a difference of more than 2 in AIC indicates that there is evidence for the model with the smaller AIC. In this case, the AIC for the spatial power model is almost 8 units smaller than that of the independence model. There is strong evidence for residual autocorrelation.
The estimated trend (ignoring autocorrelation) is:

                         Standard
Label        Estimate       Error   DF   t Value   Pr > |t|
avg slope    -0.00562    0.001621   58     -3.47     0.0010

As expected, the estimated slopes are similar, but the reported se from the model ignoring autocorrelation was too small by a factor of about sqrt((1+ρ)/(1−ρ)) = sqrt(1.5/.5) = sqrt(3) = 1.7.
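The inflation factor for the naive se is a one-liner; ρ = 0.5 matches the 1.5/.5 ratio used above:

```python
import math

def se_inflation(rho):
    """Factor by which the independence-model se understates the true se
    for an AR(1) series: sqrt((1 + rho) / (1 - rho))."""
    return math.sqrt((1 + rho) / (1 - rho))

print(round(se_inflation(0.5), 1))  # about 1.7, i.e. sqrt(3)
```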
2.9 Non-parametric detection of trend

The methods so far in this chapter all rely on several assumptions that may not be satisfied in all contexts. For example, all the methods (including the methods for autocorrelation) assume that deviations from the regression line are normally distributed with equal variance. In practice, they are fairly robust to non-normality and heterogeneous variances if the sample sizes are fairly large.

But how is it possible to deal with truncated or censored observations? For example, it is quite common for measurement tools to have upper and lower limits of detectability, and you often get measurements that are below or above detection limits. How can a monotonic, but not linear, relationship be examined? 35 For example, cases of asthma seem to increase with the concentration of particulates in the atmosphere, but the relationship is not linear.
A nice review of the basic methods applicable to many situations is given by:

Berryman, D., B. Bobee, D. Cluis, and J. Haemmerli (1988). Non-parametric approaches for trend detection in water quality time series. Water Resources Bulletin 24(3), 545-556.
2.9.1 Cox and Stuart test for trend

This is a very simple test to perform and can be used in many different situations, as illustrated in Conover (1999, Section 3.5). 36 The idea behind the test is to first divide the dataset into two parts. Match the first observation in the first part with the first observation in the second part; match the second observation in the first part with the second observation in the second part; etc. Then, for each pair of values, determine if the value from the second part is greater than the matched value from the first part. If there is a generally upwards trend in the data, then you should see lots of pairs where the data value for the second part is larger than that of the first part. The number of pairs where the data from the second part exceeds its counterpart in the first part has a binomial distribution with p = .5, and this can be used to determine the p-value of the test. This will be illustrated with an example.

In an earlier section, we examined the records of the grass cutting season over time. We will apply the Cox and Stuart procedure to this data as well.

Here is the raw data again:
35 If a transformation will linearize the line, then an ordinary regression can be used on the transformed data.
36 Conover, W.J. (1999). Applied non-parametric statistics, 2nd edition. Wiley.
Year  Duration (days)
1984 200<br />
1985 215<br />
1986 195<br />
1987 212<br />
1988 225<br />
1989 240<br />
1990 203<br />
1991 208<br />
1992 203<br />
1993 202<br />
1994 210<br />
1995 225<br />
1996 204<br />
1997 245<br />
1998 238<br />
1999 226<br />
2000 227<br />
2001 236<br />
2002 215<br />
2003 242<br />
There are exactly 20 observations, so the data is divided into two parts corresponding to the first 10 years and the last 10 years. 37 This gives the pairing:

37 If the number of observations is odd, then the middle observation is discarded.
        Part I            Part II
Year  Duration    Year  Duration    Part II > Part I
1984 200 1994 210 1<br />
1985 215 1995 225 1<br />
1986 195 1996 204 1<br />
1987 212 1997 245 1<br />
1988 225 1998 238 1<br />
1989 240 1999 226 0<br />
1990 203 2000 227 1<br />
1991 208 2001 236 1<br />
1992 203 2002 215 1<br />
1993 202 2003 242 1<br />
If there are any ties in the pairs, these are also discarded. In this case, there were no ties, and the data from the second part was greater than the corresponding data from the first part in 9 of the 10 years.

A two-sided p-value (allowing for either an increasing or a decreasing trend) is found by computing the probability

P(X ≥ 9) + P(X ≤ 1)

where X comes from a Binomial distribution with n = 10 and p = 0.5.
This can be computed or found from tables such as those at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDF/Tables.pdf. A portion of the Binomial table with n = 10 is presented below:
Individual binomial probabilities <str<strong>on</strong>g>for</str<strong>on</strong>g> n=10 and selected values of p<br />
n x 0.1 0.2 0.3 0.4 0.5<br />
------------------------------------------<br />
10 0 0.3487 0.1074 0.0282 0.0060 0.0010<br />
10 1 0.3874 0.2684 0.1211 0.0403 0.0098<br />
10 2 0.1937 0.3020 0.2335 0.1209 0.0439<br />
10 3 0.0574 0.2013 0.2668 0.2150 0.1172<br />
10 4 0.0112 0.0881 0.2001 0.2508 0.2051<br />
10 5 0.0015 0.0264 0.1029 0.2007 0.2461<br />
10 6 0.0001 0.0055 0.0368 0.1115 0.2051<br />
10 7 0.0000 0.0008 0.0090 0.0425 0.1172<br />
10 8 0.0000 0.0001 0.0014 0.0106 0.0439<br />
10 9 0.0000 0.0000 0.0001 0.0016 0.0098<br />
10 10 0.0000 0.0000 0.0000 0.0001 0.0010<br />
From the table above we find that the p-value is

p-value = .0010 + .0098 + .0098 + .0010 = .0216

which is comparable to the value of .012 found from a direct application of linear regression.

Unfortunately, it is not possible to estimate the slope or any confidence interval using this method. The test is available in some computer packages but, because of its simplicity, is often easiest to do by hand.
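Because the test is so simple, it is also easy to script. The sketch below (Python) reproduces the grass-cutting calculation; the exact p-value, .0215, matches the table-based .0216 above up to rounding of the tabulated probabilities:

```python
from math import comb

def cox_stuart(y):
    """Two-sided Cox and Stuart test for trend.

    Splits the series in half (dropping the middle value when n is odd),
    pairs the halves, drops ties, and refers the count of increases to a
    Binomial(n_pairs, 0.5) distribution.
    """
    half = len(y) // 2
    first, second = y[:half], y[-half:]      # middle value dropped if n is odd
    pairs = [(a, b) for a, b in zip(first, second) if a != b]
    n = len(pairs)
    t = sum(b > a for a, b in pairs)         # number of increases
    # exact two-sided p-value: P(X >= max(t, n-t)) + P(X <= min(t, n-t))
    k = max(t, n - t)
    p = (sum(comb(n, x) for x in range(k, n + 1))
         + sum(comb(n, x) for x in range(0, n - k + 1))) / 2 ** n
    return t, n, min(p, 1.0)

durations = [200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
             210, 225, 204, 245, 238, 226, 227, 236, 215, 242]
t, n, p = cox_stuart(durations)
print(t, n, round(p, 4))  # 9 increases out of 10 pairs
```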
Surprisingly, this very simple test does not perform badly when compared to a real regression. For example, the asymptotic relative efficiency of this test, compared to a normal regression situation when all assumptions are satisfied, is almost 80%. This implies that the Cox and Stuart test would give the same power to detect a trend as a regular regression if used with 1/.80 = 1.25 times the sample size.
However, if the data are straightforward, as in this case, there are better non-parametric methods, as will be illustrated in later sections.
2.9.2 Non-parametric regression - Spearman, Kendall, Theil, Sen estimates

Non-parametric does NOT mean no assumptions

While the Cox and Stuart test may indicate that there is evidence of a trend, it cannot provide estimates of the slope etc. Consequently, non-parametric methods have been developed for these situations.

CAUTION: Non-parametric does not mean NO assumptions! Many people view non-parametric methods as a panacea that solves all ills. On the contrary, non-parametric tests also make assumptions about the data that need to be carefully verified in order that the results are sensible. In the context of non-parametric regression, the following assumptions are usually made, and non-parametric tests may relax some of them:

• Linearity. Parametric regression analysis assumes that the relationship between Y and X is linear. Non-parametric regression analysis makes the same assumption.

• Scale of Y and X. Parametric regression analysis assumes that X is time, so that it has an interval or ratio scale. It is further assumed that Y has an interval or ratio scale as well. Non-parametric regression analysis makes the same assumption, except that some methods allow the Y variable to be ordinal. This allows non-parametric methods to be used when values are above detection limits, as they can still often be ordered sensibly.

• Correct sampling scheme. Parametric regression analysis assumes that Y must be a random sample from the population of Y values at every time point. Non-parametric regression analysis makes the same assumption.
• No outliers or influential points. Parametric regression analysis assumes that all the points must belong to the relationship – there should be no unusual points. Non-parametric regression analysis is more robust to failures of this assumption, as the actual distances between the observed point and the fitted line are not used directly. However, many outliers can mask the true relationship. A very nice feature of non-parametric methods is that they are invariant to transforms that preserve order. For example, you will get the same p-value if you use non-parametric analyses on Y or log(Y). But the estimated slope may be different, as it is measured on a different scale.
• Equal variation along the line on some scale. Parametric regression analysis assumes that the variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over time. Surprisingly to many people, non-parametric regression analysis assumes that the distribution of Y at each X is the same on some measuring scale, and therefore must also have the same variation. However, because the assumption is about equal variance on some scale, and because non-parametric methods are invariant to simple transformations, this is often satisfied. For example, if a log-transform would stabilize the variance, then it is not necessary to transform before doing the Kendall test. This is one advantage of the non-parametric tests over parametric tests, which require homogeneous variation about the regression line.
• Independence. Parametric regression assumes that each value of Y is independent of any other value of Y. Non-parametric regression analysis also makes this assumption. Consequently, non-parametric regression analysis does not deal with autocorrelation.

• Normality of errors. Parametric regression assumes that the difference between the value of Y and the expected value of Y is normally distributed. Non-parametric regression analysis assumes that the distribution of Y at each value of X is the same, but does not require that it be normally distributed. Consequently, heavy-tailed distributions such as log-normal distributions can be handled with non-parametric regression.

• X measured without error. Parametric regression analysis assumes that the error in measurement of X is small or non-existent relative to the error variation about the regression line. Non-parametric regression makes the same assumption.
As you can see, data to be used in non-parametric analysis cannot be just arbitrarily collected – thought must be given to assessing the appropriateness of the regression model.

Surprising to many, least-squares regression is actually a non-parametric method! The principle of choosing the regression line to minimize the sum of squared deviations from the regression line makes no distributional assumptions about Y at each X. The assumption of normality comes into play when you compute F- or t-tests to test whether the slope is zero, and construct confidence intervals for the slope or prediction intervals for individual means or predictions.
A simple non-parametric test for zero slope is Spearman's ρ, which is simply a correlation coefficient computed on the RANKS of the data. 38 The standard Pearson correlation coefficient (discussed in earlier sections) is then applied to the ranked data, and the p-value is found by referring to tables or from a large-sample formula. Fortunately, most computer packages compute Spearman's ρ and provide p-values.
38 For each variable, find the smallest value and replace it by the value 1. Find the second-smallest value and replace it by the value 2, etc. If there are tied values, replace the tied ranks by the average of the ranks. This is easily done in Excel by repeatedly sorting the (X, Y) pairs, first by X and then by Y.
©2012 Carl James Schwarz 259 November 23, 2012
CHAPTER 2. DETECTING TRENDS OVER TIME
Example: The Grass is Greener (for longer) revisited
For example, the grass-cutting example data is ranked as follows:
Year   Duration (days)   Year Rank   Duration Rank
1984   200                1           2.0
1985   215                2          10.5
1986   195                3           1.0
1987   212                4           9.0
1988   225                5          12.5
1989   240                6          18.0
1990   203                7           4.5
1991   208                8           7.0
1992   203                9           4.5
1993   202               10           3.0
1994   210               11           8.0
1995   225               12          12.5
1996   204               13           6.0
1997   245               14          20.0
1998   238               15          17.0
1999   226               16          14.0
2000   227               17          15.0
2001   236               18          16.0
2002   215               19          10.5
2003   242               20          19.0
The correlation computed on the ranks is found to be .5766.
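The rank-then-correlate recipe can be sketched directly in code. This is an illustrative sketch (not the package computation used in these notes) that reproduces the value .5766 from the tabulated grass-cutting data; the helper names are mine:

```python
from math import sqrt

# Grass-cutting durations (days) for 1984-2003, from the table above.
durations = [200, 215, 195, 212, 225, 240, 203, 208, 203, 202,
             210, 225, 204, 245, 238, 226, 227, 236, 215, 242]
years = list(range(1984, 2004))

def ranks(values):
    """Rank values 1..n, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Find the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Spearman's rho = Pearson correlation computed on the ranks.
rho = pearson(ranks(years), ranks(durations))
print(round(rho, 4))  # 0.5766
```

Note how the tied durations (203, 215, 225) receive the averaged ranks shown in the table (4.5, 10.5, 12.5).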
JMP analysis Parametric and non-parametric correlations between variables are found using the Analyze->MultiVariateMethods->Multivariate platform:
Specify both the X and Y variables in the dialogue box:
Finally, request non-parametric correlations from the drop-down menu:
which gives the following output:
The Spearman ρ is found to be .5766 with a p-value of .0078. This compares to the p-value from the parametric regression of .012.
Unfortunately, Spearman's ρ does not provide an easy way to estimate the slope or to find confidence intervals for the slope, etc. 39
Because Spearman's ρ does not provide a convenient way to estimate the slope or to find confidence intervals for the slope, variants on Kendall's τ are often used instead. This estimator of the slope has many
39 However, refer to Conover (1995), Section 5.5 for details on using Spearman's ρ to estimate a confidence interval for the slope.
names: Sen's (1968) estimator 40, Theil's (1950) estimator 41, and Kendall's τ 42 estimator are all common names. The idea behind these estimators is to look at concordant and discordant pairs of data points. A pair of data points (X1, Y1) and (X2, Y2) is called concordant if (Y2 − Y1)/(X2 − X1) is greater than zero, discordant if the ratio is less than zero, and both if the ratio is 0. For the grass-cutting duration data, the pair of data points (1985, 215) is concordant with the data point (1988, 225), but discordant with the data point (1986, 195). As you can imagine, it is far easier to let the computer do the computations!
The test for non-zero slope using Kendall's tau can be computed by finding ALL possible pairs of data points (!) and using the rule:
• if (Yj − Yi)/(Xj − Xi) > 0 then add 1 to Nc (concordant);
• if (Yj − Yi)/(Xj − Xi) < 0 then add 1 to Nd (discordant);
• if (Yj − Yi)/(Xj − Xi) = 0 then add 1/2 to both Nc and Nd;
• if Xi = Xj, no comparison is made.
Kendall's τ is found as:
τ = (Nc − Nd) / (Nc + Nd)
The p-value is found from tables or by the computer.
The computation of τ is simplified by sorting the pairs of (X, Y) by the value of X and creating a spreadsheet to help with the computations. Each value of Y needs only to be compared to those "below" it in the sorted list.
Estimation of the slope and confidence intervals for the slope are found by computing all the pairwise slopes:
S_ij = (Yj − Yi) / (Xj − Xi)
The estimate of the slope is simply the median of these values.
A confidence interval for the slope is found by using tables to find the lower and upper quantiles to use as the bounds of the interval. A close approximation to the values to use is found using the following procedure:
• Let n be the number of data points, and N be the number of pairwise slopes from above.
• Compute w = z √( n(n − 1)(2n + 5) / 18 ), where z is the appropriate quantile from a standard normal distribution. For example, for a 95% confidence interval, z = 1.96.
40 Sen, P.K. (1968). Estimates of the regression coefficient based on Kendall's τ. Journal of the American Statistical Association 63, 1379-1389.
41 Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis, 1, 2, and 3. Nederl. Akad. Wetensch. Proc. 53, 386-392, 521-525, and 1397-1412.
42 Kendall, M.G. (1970). Rank Correlation Methods, Fourth Edition. Charles Griffin and Co., London.
• Compute r = .5(N − w).
• Use the r-th and (N − r)-th values of the sorted pairwise slopes as the bounds of the confidence interval.
For the mowing duration data, n = 20 and there are N = 190 possible slopes! The estimated slope is the median value. The approximate value of w is 60, so the 65th and 125th sorted values of the pairwise slopes are the lower and upper bounds of the 95% confidence interval. This gives an estimated slope of 1.389 with a 95% confidence interval of (0.20 → 2.8). This can be compared to the estimated slope of 1.46 and confidence interval for the slope from the ordinary regression analysis of (.4 → 2.6).
This is rarely found in most computer packages, but the computation of the possible slopes can be programmed (sometimes clumsily) and can actually be done in a spreadsheet.
JMP analysis Kendall's τ is also computed using the Analyze->MultiVariateMethods->Multivariate platform, in the same way as Spearman's ρ was found:
This gives:
The p-value is .0123, very similar to that from the ordinary regression.
It is very clumsy to compute the Sen-Theil-Kendall estimate of the slope in JMP, and it is not done here. Refer to the SAS program for more help.
Final Remarks
Berryman (1988) recommends that Kendall's τ or Spearman's ρ be used for non-parametric testing for trend, as these have the greatest efficiency relative to ordinary parametric regression. They also recommend (their Table 4) that a minimum of 9-11 observations be collected before testing for trend using these methods.
It turns out that the asymptotic relative efficiency of both Kendall's τ and Spearman's ρ is very high (90%+), so the planning tools for ordinary regression can be used to estimate the sample sizes required under various scenarios with a fair amount of confidence.
2.9.3 Dealing with seasonality - Seasonal Kendall's τ
Basic principles
In some cases, series of data have an obvious periodicity or seasonal effects.
Consider, for example, values of total phosphorus taken from the Klamath River near Klamath, California, as analyzed by Hirsch et al. (1982). 43
43 This was monitoring station 11530500 from the NASQAN network in the US. Data are available from http://waterdata.usgs.gov/nwis/qwdata/?site_no=11530500. The data were analyzed by Hirsch, R.M., Slack, J.R., and Smith, R.A. (1982). Techniques of trend analysis for monthly water quality data. Water Resources Research 18, 107-121.
Total phosphorus (mg/L) in Klamath River near Klamath, CA

                         Year
Month   1972  1973  1974  1975  1976  1977  1978  1979
  1     0.07  0.33  0.70  0.08  0.04  0.05  0.14  0.08
  2     0.11  0.24  0.17   .     .     .    0.11  0.04
  3     0.60  0.12  0.16   .    0.14  0.03  0.02  0.02
  4     0.10  0.08  1.20  0.11  0.05  0.04  0.06  0.01
  5     0.04  0.03  0.12  0.09  0.02  0.04  0.03  0.03
  6     0.05  ...
A plot of the data shows an obvious seasonality, with peak levels occurring in the winter months. There are also some missing values, as seen in the raw data table. Finally, notice the presence of several very large values (above 0.20 mg/L) that would normally be classified as outliers.
How can a test for trend be fit in the presence of this seasonality?
Hirsch et al. (1982) modified Kendall's τ to deal with seasonality. The method is very simple to describe, but is difficult to implement.
The basic principle is to divide the series into (in this case) 12 separate series, one for each month. These month-based series range from 8 years of data down to 5 years of data. For each month-based series, compute Kendall's τ. Combine the 12 estimates of τ into a single omnibus test to compute the overall p-value. The estimated slope is found by pooling the pairwise slopes from within each month-based series and then taking the overall median of the pooled set. Unfortunately, there are no simple procedures available to compute confidence intervals for the slope.
Example: Total phosphorus on the Klamath River revisited
JMP Analysis
JMP can be used to compute a test statistic, but it is difficult (!) to estimate the slope.
A JMP dataset with scripts is located in klamath.jmp and klamath2.jmp in the Sample Program Library available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The JMP dataset has 12 rows and 9 columns corresponding to the various years:
The years must be stacked using the Tables->Stack command to create three columns: Year, Month, Phosphorus. A portion of the stacked data is illustrated below:
Missing values are indicated by a period. The Analyze->Fit Y-by-X platform can be used to create a data plot to illustrate the seasonal nature of the data (not shown).
To compute Kendall's τ for each month, use the Analyze->MultiVariateMethods->Multivariate platform and specify Month in the BY area:
This will give the estimates of correlation for each month. In order to request Kendall's τ for EVERY plot, hold down the Option key before clicking on the red triangle to request the non-parametric Kendall τ statistic. Unfortunately, there is no way to have JMP automatically save all the Kendall τ's to a new data sheet for subsequent processing. You will have to manually (groan) type in each estimate of τ and the p-value to give the following table:
Unfortunately, JMP does not provide the raw value underlying Kendall's τ (what Hirsch et al. call S), so we can't use the direct method outlined in Hirsch et al. of simply adding the values of S. A somewhat indirect method must be used to combine the reported values of τ and their p-values over the 12 months.
This indirect method converts each p-value back to a z-score. As the z-scores are distributed as Normal distributions (with mean 0 and variance 1) and are assumed to be independent across the months, their sum has a normal distribution with mean 0 and variance equal to the sum of the variances (in this case 12). This resulting sum can then be converted to an actual p-value.
To convert a p-value back to a z-score, use the relationship
z = Φ⁻¹(1 − pvalue/2) × sign(τ_b)
where Φ⁻¹ is the inverse normal probability function; the 1 − pvalue/2 converts the two-sided p-value to the upper tail of the normal curve; and the sign function makes sure that the z value also has the correct sign (i.e. positive or negative). This is done by creating a new column in JMP and creating a formula for this column:
The Normal Quantile function is the inverse normal function, and the IF clause serves as the sign function. The column Var is simply the variance of the z-score.
This gives the table:
We add together the z-scores and the variances; this gives an overall z-score. The Tables->Summary command can be used to get this total:
to give:
Finally, we use a final formula to compute the probability of exceeding this total z-score:
and the final overall p-value:
The overall p-value is .0049. This can be compared to the paper by Hirsch et al. (1982), who obtained an overall z-value of -2.69 with a p-value of .0072.
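The same spreadsheet arithmetic can be sketched in code. The monthly (τ, p-value) pairs below are placeholders (the actual Klamath values appear only in the JMP screenshots), but the mechanics follow the formula z = Φ⁻¹(1 − pvalue/2) × sign(τ_b):

```python
from math import copysign, sqrt
from statistics import NormalDist

nd = NormalDist()   # standard normal: mean 0, sd 1

def z_from_p(pvalue, tau):
    """Convert a two-sided p-value back to a z-score with the sign of tau."""
    return copysign(nd.inv_cdf(1 - pvalue / 2), tau)

# Hypothetical monthly (tau, p-value) pairs -- placeholders, NOT the Klamath values.
monthly = [(-0.40, 0.20), (-0.55, 0.06), (0.10, 0.75), (-0.30, 0.35)]

z_scores = [z_from_p(p, tau) for tau, p in monthly]
z_total = sum(z_scores)
var_total = float(len(z_scores))       # each z-score has variance 1
z_overall = z_total / sqrt(var_total)  # standardize the sum
p_overall = 2 * (1 - nd.cdf(abs(z_overall)))
print(round(z_overall, 2), round(p_overall, 4))
```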
Unfortunately, there is no simple way to estimate the slope using JMP.
Final notes
As pointed out earlier, non-parametric analyses are not assumption-free - they merely have different assumptions than parametric analyses. In this method, the key assumption of independence is still important. Because the data are broken into month-based series, this is likely true - it seems reasonable that the value in January 1971 has no influence on the value in January 1972. However, it is likely not true that January 1971 is independent of February 1971, which would likely invalidate a simple use of Kendall's method on the entire series.
As Hirsch et al. (1982) point out, it is possible that some sub-series exhibit strong evidence of an upward trend and some sub-series exhibit strong evidence of a downward trend, while the overall omnibus test fails to detect evidence of a trend. This is not unexpected, and if one is interested in the individual sub-series, then these should be examined individually.
The original paper by Hirsch et al. (1982) did not allow for multiple observations in each time period. This actually poses no problem with computer implementations, which handle ties appropriately.
Lastly, you may have noticed in the original data some values that were marked as below the detection limit. These censored observations pose no problem for most non-parametric tests. Clearly a value that is below a detection limit (e.g. < .01) is also less than .05. The only problem arises in making sure that, if there are multiple, different detection limits, comparisons are handled appropriately. Usually, this implies using the largest detection limit in place of any lower detection limits.
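A minimal sketch of this "largest detection limit" rule: censor every value below that limit to a common floor, so all such values become ties before ranking. The detection limits and data here are hypothetical:

```python
def censor_to_common_limit(values, detection_limits):
    """Replace any value below the largest detection limit by that limit,
    so all censored (and near-censored) values become tied observations."""
    dl_max = max(detection_limits)
    return [dl_max if v < dl_max else v for v in values]

# Suppose two labs used detection limits of 0.01 and 0.05 mg/L (made-up numbers).
vals = [0.005, 0.02, 0.04, 0.08, 0.30]
print(censor_to_common_limit(vals, [0.01, 0.05]))
# [0.05, 0.05, 0.05, 0.08, 0.3]
```

The tied values then receive averaged ranks in Spearman's ρ, or contribute 1/2 to both Nc and Nd in Kendall's τ.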
Hirsch et al. (1982) did several simulation studies of the seasonal Kendall test, and found that it had high power to detect changes.
The Seasonal Kendall estimator has been implemented in many packages specially designed for environmental studies. Unfortunately, there are no packages that I am aware of that report confidence intervals for the slope.
Berryman (1988) recommends that at least 60 observations spanning at least 5 cycles be obtained before using the Seasonal Kendall method.
2.9.4 Seasonality with Autocorrelation
General ideas
As noted earlier, the Seasonal Kendall method still assumes that observations in different series are independent, i.e. that the January 1972 reading is not related to the February 1972 reading. In some cases this is untrue; for example, in a wet year, the stream flow may be higher than average for all months, leading to positive correlation across series.
Hirsch and Slack (1984) 44 considered this problem. As in the Seasonal Kendall test, the data are first divided into sub-series, e.g. monthly series across several years. The Kendall statistic for trend across years is computed for each sub-series, e.g. for each month. These sub-series statistics are added together to give an omnibus test statistic. The Seasonal Kendall method could simply sum the variances of each test statistic to give the omnibus variance from which a z-score could be computed and a p-value obtained. However, because the sub-series are autocorrelated, the new test must also add together estimates of the covariances among the test statistics from the individual sub-series to get the omnibus variance prior to computing a z-score and p-value.
Unfortunately, this procedure is implemented in only a handful of specialized software packages for the analysis of water quality and hydrologic data. These packages can be located with a quick search on the WWW. It is not feasible to do the computations in JMP, nor in SYSTAT; the computations could likely be done in SAS, but are complex and well beyond the scope of these notes.
Consequently, this method will not be discussed further in these notes; interested readers are referred to Hirsch and Slack (1984).
44 Hirsch, R.M. and Slack, J.R. (1984). A non-parametric trend test for seasonal data with serial dependence. Water Resources Research 20, 727-732.
Note that because parametric methods are now readily available (refer to earlier chapters of these notes), there is less need for these non-parametric procedures.
Berryman (1988) and Hirsch and Slack (1984) recommend that at least 120 observations spanning at least 10 cycles be obtained before using the Seasonal Kendall method adjusted for autocorrelation.
2.10 Summary
This chapter is concerned mainly with detecting monotonic trends over time, i.e. a gradual increase or decrease over time. Some methods were introduced to deal with seasonal effects, but these effects are nuisance effects and should be eliminated prior to analysis.
It is possible for these trends over time to be masked by exogenous variables, i.e. variables other than Y and X. For example, many ground-water variables are influenced by flow, over and above seasonal effects. It was beyond the scope of these notes, but the effects of these exogenous variables should first be removed before the trend analysis is done. This can be done using multiple regression or other curve-fitting techniques such as LOWESS.
Measurements taken in close proximity over time are likely to be related to each other. This is known as serial correlation or autocorrelation. It is often induced by some environmental variable that is slowly changing over time and also affects the monitored variable. Again, these exogenous effects should be removed first. Some residual autocorrelation may still be present. The most common test statistic to detect autocorrelation is the Durbin-Watson statistic, where values near 2 indicate a lack of autocorrelation.
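The Durbin-Watson statistic is simple to compute from the regression residuals as DW = Σ(e_t − e_{t−1})² / Σ e_t². A minimal sketch with made-up residuals:

```python
def durbin_watson(residuals):
    """DW statistic: near 2 suggests no autocorrelation; near 0 strong
    positive autocorrelation; near 4 strong negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals: successive values move in opposite directions,
# i.e. negative autocorrelation, so DW is near 4.
print(durbin_watson([1, -1, 1, -1, 1, -1]))   # 3.333...
# Runs of same-signed residuals: positive autocorrelation, so DW is near 0.
print(durbin_watson([1, 1, 1, -1, -1, -1]))   # 0.666...
```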
Trend analyses can be done using either parametric or non-parametric methods. BOTH types of analyses make certain assumptions about the data - non-parametric methods are NOT assumption-free! It turns out that modern non-parametric methods are relatively powerful for detecting trends even when all the assumptions of the parametric methods are satisfied. Hence there is little loss in power in using these methods. In addition, because they use the relative ranking of observations, they are relatively insensitive to outliers, moderate levels of non-detected values, and missing values.
If so, why not always use non-parametric methods? The basic impediments to the use of non-parametric methods are a lack of suitable computer software, the difficulty in computing point estimates and confidence intervals for the trend line, and the difficulty in making predictions for future observations. However, non-parametric tests are often ideally suited for mass screening. These procedures can be automated, and it is not necessary to examine the possibly hundreds of individual datasets to see which need to be transformed before parametric procedures can be used.
Finally, what to do about outliers? Blindly including outliers in non-parametric methods without investigating their cause can be very dangerous. Trends may be detected that are not real. An outlier, by definition, is a point that doesn't appear to fit the same pattern as the other data values. An assumption of most non-parametric tests is that the distribution of Y values at each X is the same (it need not be normal) - this would also require you to exclude outliers. Even parametric methods can deal with outliers nicely -
a whole area of statistics deals with robust regression methods, where outliers are iteratively reweighted and given a low weight if they appear to be anomalous. For example, SAS provides Proc RobustReg to do robust regression.
A summary table of the various methods considered in this section of the notes appears below: 45
45 This table is based on Trend Analysis of Food Processor Land Application Sites in the LUBGWMA, available at: http://www.deq.state.or.us/wq/groundwa/LUBGroundwater/LUBGTrendAnalysisApp1.pdf
Summary of trend analysis methods

Simple Linear Regression (parametric; does not account for seasonality)
  Advantages: Most powerful if assumptions hold, especially normality, non-seasonality, and independence. Familiar technique to many scientists. Simple to compute the best-fit line. Available in most computer packages.
  Disadvantages: Environmental data rarely conform to the test assumptions. Sensitive to outliers. Difficult to handle non-detect values. Serial correlation gives unbiased but inefficient estimates; consider methods to account for autocorrelation. Does not account for seasonality.
  Recommended sample size: 10. Good power programs are available.

Kendall's τ (non-parametric; does not account for seasonality)
  Advantages: Non-detects and outliers are easily handled. Same p-value regardless of the transform used on Y.
  Disadvantages: Does not account for seasonality. Not robust against autocorrelation. Difficult to make predictions.
  Recommended sample size: 10.

Seasonal Regression (parametric; accounts for seasonality by subtracting the monthly mean or median over years from the original data, then regressing the residuals over time or using ANCOVA methods)
  Advantages: Accounts for seasonality. Produces a description of the seasonality pattern.
  Disadvantages: Assumes normality of the adjusted values about the regression line. Not robust against serial correlation. Requires near-complete records for each set of monthly data; if the pattern of missing years varies among the months, the monthly mean used to adjust for seasonal effects may be misleading. Reported se are too small because the adjustment for seasonality is not incorporated unless the ANCOVA method is used.
  Recommended sample size: 30, with at least 5 cycles.

Sine/Cosine Regression (parametric; accounts for seasonality - deseasonalized values are obtained by fitting a ...)
  Advantages: Accounts for seasonality.
  Disadvantages: With few exceptions, there is little reason to believe that the form of the seasonality ...
  Recommended sample size: 30, with at least 5 cycles.

Regression adjusted for autocorrelation (parametric; does not account for seasonality)
  Advantages: Accounts for autocorrelation in the data. Can also be adjusted for seasonality.

Seasonal Kendall without correction for serial correlation (non-parametric; accounts for seasonality, but only by comparing data from the same season, e.g. months)
  Advantages: Accounts for seasonality. Robust against non-detects and outliers.

Seasonal Kendall adjusted for autocorrelation (non-parametric; accounts for seasonality, as above)
  Advantages: Accounts for seasonality.
· Robust against n<strong>on</strong>-detects<br />
and outliers.<br />
· Robust against serial correlati<strong>on</strong>.<br />
· Requires sophisticated software.<br />
· Extremely high autocorrelati<strong>on</strong><br />
may be invisible.<br />
· When applied to data that is<br />
not seas<strong>on</strong>al, has a slight loss<br />
of power.<br />
· Not robust against serial correlati<strong>on</strong>.<br />
· Difficult to estimate c<strong>on</strong>fidence<br />
intervals.<br />
· Not all computer packages<br />
have this method. May require<br />
further programming.<br />
· Significant loss of power when<br />
applied to data that is not<br />
seas<strong>on</strong>al or lacks autocorrelati<strong>on</strong>.<br />
· Specialized software required.<br />
20<br />
60 with<br />
at least 5<br />
cycles<br />
120<br />
with at<br />
least 10<br />
cycles<br />
CHAPTER 2. DETECTING TRENDS OVER TIME
Chapter 3

Estimating power/sample size using Program Monitor

J. Gibbs has written a Windows program to estimate the power and sample size requirements for many common monitoring programs.

Gibbs, J. P., and Eduard Ene. 2010. Program MONITOR: Estimating the statistical power of ecological monitoring programs. Version 11.0.0. http://www.esf.edu/efb/gibbs/monitor/

CAUTION: Version 11.0 of MONITOR appears to have some "features" that result in incorrect power computations in certain cases. Please contact me in advance of using the results from MONITOR in a critical planning situation to ensure that you have not stumbled on some of the "features".
Program MONITOR uses simulation procedures to evaluate how each component of a monitoring program influences its power to detect a linear (regression) change. The program has been cited in numerous peer-reviewed publications since it first became available in 1995.

Before using Program MONITOR, you will need to gather some basic information about the proposed study:

• What is the initial value of your population? This could be the initial population size, the initial density, etc.

• How precisely can you measure the population at a given sampling occasion? This can be given as the standard error you expect to see at any occasion, the relative standard error (standard error/estimate), etc.

• What is the process variation? Do you really expect that the measurements would fall precisely on the trend line in the absence of measurement error?

• What are the significance level and target power? Traditional values are α = 0.05 with a power of 80%, or α = 0.10 with a target power of 90%.
3.1 Mechanics of MONITOR

Let us first demonstrate the mechanics of MONITOR before looking at some real examples of how to use it for monitoring designs.

Suppose we wish to investigate the power of a monitoring design that will run for 5 years. At each survey occasion (i.e. every year), we have 1 monitoring station, and we make 2 estimates of the population size at the monitoring station in each year. The population is expected to start with 1000 animals, and we expect that the measurement error (standard error) in each estimate is about 200, i.e. the coefficient of variation of each measurement is about 20% and is constant over time. We are interested in detecting increasing or decreasing trends; to start, a 5% decline per year will be of interest. We will assume an UNREALISTIC process error of zero, so that the sampling error is equal to the total variation in measurements over time.

Launch Program MONITOR:
The screen starts with default values. We make some changes:

• Change the sampling occasions to the values 0, 1, 2, 3, 4.

• Change the number of survey plots/year to 2.

• Check that the significance level is set to 0.05.

• Check that the desired power is set to 0.80.

• Check that the range of desired trends encompasses −5%. You might want to increase the number of trend powers computed to 21 to get power computations for every value rather than every second value.

• Check that the two-sided test is selected.
Then click on the Plots tab and enter the initial population size (1000) and a variation (the STANDARD DEVIATION) in measurements of 200 under Total Variation.
Press the Run icon and the following results are shown. [Because the power computations are based on a simulation, your results may vary slightly.]
Notice that the net change over the five-year period with a 5% decline/year is only an 18.5% total decline over the five-year period. This is obtained as:

Year   Mean Abundance          % Total Decline
0      1000                      0.0%
1      950.0 = 1000(.95)        −5.0%
2      902.5 = 1000(.95)^2      −9.7%
3      857.4 = 1000(.95)^3     −14.3%
4      814.5 = 1000(.95)^4     −18.5%
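The multiplicative arithmetic in this table is easy to verify by hand. A minimal sketch in plain Python (my own check, not part of MONITOR):

```python
# Verify the cumulative effect of a 5% decline per year on an
# initial abundance of 1000 animals (multiplicative, not additive).
initial = 1000.0
rate = -0.05  # 5% decline per year

for year in range(5):
    abundance = initial * (1 + rate) ** year
    total_change = (abundance / initial - 1) * 100  # percent change from year 0
    # year 4 ends at about 814.5 animals, an 18.5% total decline
    print(f"Year {year}: abundance = {abundance:7.1f}, total change = {total_change:6.1f}%")
```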
By clicking on the Trend vs. Power Chart tab, you see a graph of the power by the size of the trend:
This design has a power of around 15% for detecting this trend – hardly worthwhile doing the study!

How many years would be needed to detect this trend with an 80% power? Try modifying the number of sampling years until you get the approximate power needed:
So about 10 years of monitoring will be needed to detect a 5% decline PER YEAR with about an 80% power.

The differences in reported powers between the MONITOR and TRENDS programs are artifacts of the different ways the two programs compute power (and potentially because of some 'features' of the MONITOR program). TRENDS uses analytical formulae based on normal approximations, while MONITOR conducts a simulation study and reports the number of trials (in this case out of 500) that detected the trend. In any event, don't get hung up over these differences – the key point is that this proposed study has virtually no power to detect a 5% decline/year.

Program MONITOR also has a handy calculator to convert between the trend per year and the total trend over the course of the experiment.
For example, a 5% decline per year for 5 ADDITIONAL years translates into an overall decline of 22.6% over the six years of the study (the one initial year + 5 ADDITIONAL years). It is not a straight arithmetic conversion because the changes are actually multiplicative rather than additive, as shown earlier.
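The two-way conversion is simple compounding. A plain-Python equivalent of MONITOR's calculator (a sketch under that assumption, not MONITOR's own code):

```python
# Convert between a constant per-year (multiplicative) trend and the
# total trend over a study of n_years.

def total_trend(per_year: float, n_years: int) -> float:
    """Total proportional change after n_years of a constant per-year trend."""
    return (1 + per_year) ** n_years - 1

def per_year_trend(total: float, n_years: int) -> float:
    """Constant per-year trend producing 'total' proportional change over n_years."""
    return (1 + total) ** (1 / n_years) - 1

# A 5% decline/year sustained over 5 additional years:
print(f"{total_trend(-0.05, 5):.1%}")                     # about -22.6%
print(f"{per_year_trend(total_trend(-0.05, 5), 5):.1%}")  # back to -5.0% per year
```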
3.2 How does MONITOR work?

Program MONITOR estimates power using a simulation-based approach, as outlined in the help file. For example, consider the situation outlined in the previous section. Again set up the control parameters in the same way, except change the trend lines to look only at a single value for the decline (−5% per year).
Then press the Step icon. The following display is obtained:
First the underlying deterministic trend is generated (the black line in the middle of the plot). Then, based on the variation expected in the measurements, actual "data" are generated (shown by the circles; note that at time 1, the values are "off the plot") and presented in the Survey count details tab:
Then it gets a bit odd, and the output is potentially misleading. A regression line is fit through the points (the red line in the first graph; estimates at the bottom of the data window). But this curve is not the one used to estimate the power. Rather, a regression line is fit through the log(data), and the results from the regression on the log(data) are used to determine if the trend was detected. The analysis is done on the log-scale because of the multiplicative way in which the deterministic trend is generated. Refer to the analyses from JMP below to see which statistics are used:
In this case, the estimated trend line (on the log-scale) was not statistically different from zero, and the trend was NOT detected.

The simulation is repeated many hundreds of times; the proportion of trials in which a statistically significant trend was detected is then the estimated power for this design.
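The simulation loop just described is easy to sketch outside MONITOR. The following is my own minimal re-implementation in plain Python, not MONITOR's code: it generates data for the 5-year, 2-surveys/year design above, fits a regression of log(count) on year, and counts how often the slope is significantly different from zero.

```python
import math
import random

def simulate_power(n_reps=2000, years=range(5), surveys_per_year=2,
                   n0=1000.0, trend=-0.05, sd=200.0, seed=42):
    """Estimated power of a two-sided 5% test of the log-scale slope."""
    rng = random.Random(seed)
    t_crit = 2.306  # two-sided 5% t critical value, df = 10 - 2
    detected = 0
    for _ in range(n_reps):
        xs, ys = [], []
        for t in years:
            mean = n0 * (1 + trend) ** t                 # deterministic trend
            for _ in range(surveys_per_year):
                count = max(rng.gauss(mean, sd), 1e-6)   # guard against log(<=0)
                xs.append(t)
                ys.append(math.log(count))               # analysis on the log scale
        # ordinary least squares: slope and its standard error
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxx = sum((x - xbar) ** 2 for x in xs)
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        resid_ss = sum((y - ybar - slope * (x - xbar)) ** 2
                       for x, y in zip(xs, ys))
        se_slope = math.sqrt(resid_ss / (n - 2) / sxx)
        if abs(slope / se_slope) > t_crit:
            detected += 1
    return detected / n_reps

print(f"Estimated power: {simulate_power():.2f}")
```

With these settings the estimated power comes out low, on the order of the roughly 15% MONITOR reports for this design (exact values vary by simulation).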
3.3 Incorporating process and sampling error

As noted in the chapter on Trend Analysis, there are often two sources of variation in any monitoring study.

First is sampling variation. This occurs because it is impossible to measure the population parameter exactly in any one year. For example, if we are measuring the mean DDT level in birds, we must take a sample (say of 10 birds), sacrifice them, and find the mean DDT in those 10 birds. If a different sample of 10 birds were selected, then the sample mean DDT would vary in the second sample. This is called sampling error (or the standard error) and can be estimated from the data taken in a single year. Or, the parameter of interest may be the number of smolts leaving a stream, estimated using capture-recapture methods. Again we would have a measure of uncertainty (the standard error) for each measurement in each year. Sampling error (the standard error) can be reduced by increasing the effort in each year.
However, consider what happens when measurements are taken in different years. It is unlikely that the population values would fall exactly on the trend line even if the sampling error were zero. This is known as process error and is caused by random "year" effects (e.g. an El Niño). Process error CANNOT be reduced by increasing the sampling effort in a year.

The two sources of variation are diagrammed below:

Unfortunately, process error is often the limiting factor in a monitoring study!

In order to estimate the process and sampling variation, you will need at least two years of data or some educated guesses from previous years. The Program MONITOR website has a spreadsheet tool to help you in the decomposition of process and sampling error.
For example, consider a study to monitor the density of white-tailed deer obtained by distance sampling on Fire Island National Seashore (Underwood et al., 1998), presented as the example on the spreadsheet to separate process and sampling variation.

The estimated density (and se) are:

Year   Density    SE
1995    79.6     23.47
1996    90.1     11.67
1997   107.1     12.09
1998    74.1     10.45
1999    64.2     13.90
2000    40.8     12.38
2001    41.2      7.40

Consider the plot of density over time (with approximate 95% confidence intervals):
Assuming that the deer density is in steady state over the seven years of the study, you can see that there is considerable process error, as many of the 95% confidence intervals for the deer density do not cover the mean density over the seven years. So even if the sampling error (the se) were driven to zero by adding more effort, the data points would not all lie exactly on the mean line over time.

There are many ways to separate process and sampling variation – the chapter on the analysis of BACI designs presents some additional ways. The following is an approximate analysis that should be sufficient for most planning purposes.
First, examine a plot of the estimated se versus the density estimates:

In many cases, there is a relationship between the se and the estimate, with larger estimates tending to have a higher se than smaller estimates. The previous plot shows that, except for one year, the se is relatively constant. If the se had a positive relationship to the estimate, a weighted procedure could be used (this is the procedure used in Underwood's spreadsheet).

We begin by finding the mean density and the total variation from the mean. [If the preliminary study had an obvious trend, you could fit the trend line and then find the total variation from the trend line in a similar fashion.]

We start by finding the total variation in the density estimates over time:

    VarTotal = var(79.6, 90.1, . . . , 41.2) = 599.6

The total variation is equal to the process + sampling variation. An estimate of the average sampling variation is found by averaging the se^2:

    VarSampling = (23.47^2 + 11.67^2 + · · · + 7.40^2) / 7 = 191.9

Finally, the process variance is found by subtraction:

    VarProcess = VarTotal − VarSampling = 599.6 − 191.9 = 407.7
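The decomposition is easy to reproduce with the deer densities above. A plain-Python sketch (the guard against a negative subtraction is my addition; in small samples the estimated process variance can come out negative):

```python
# Decompose total variation in yearly density estimates into
# process variation and (average) sampling variation.
density = [79.6, 90.1, 107.1, 74.1, 64.2, 40.8, 41.2]
se      = [23.47, 11.67, 12.09, 10.45, 13.90, 12.38, 7.40]

n = len(density)
mean = sum(density) / n
var_total = sum((d - mean) ** 2 for d in density) / (n - 1)  # sample variance
var_sampling = sum(s ** 2 for s in se) / n                   # average se^2
var_process = max(var_total - var_sampling, 0.0)             # cannot be negative

print(f"VarTotal    = {var_total:6.1f}")    # 599.7 (the notes round to 599.6)
print(f"VarSampling = {var_sampling:6.1f}")  # 191.9
print(f"VarProcess  = {var_process:6.1f}")   # 407.7
```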
We now launch Program MONITOR; we are interested in a 10-year study to look at changes in the population density following some management action. Notice that we now specify a partitioning of the variation into process and sampling error:
We use the sqrt() of the two variances estimated above when specifying the two sources of variation:
and then press the Run button as before to get:
The power to detect a 5% decline PER YEAR is not very good.

It is instructive to see what would happen if you believed that there was NO process variation and simply used the average sampling variation as the sole source of variation:
Now the (incorrect) estimated power is much higher.
3.4 Presence/Absence Data

Sometimes, only presence/absence data can be collected on each plot, rather than a measure of density. In cases like this, you may wish to consider occupancy modelling, but that is a topic for another course.

Despite not having an absolute measure of abundance, presence/absence data can be used to monitor the density of species with relatively low abundances. This makes use of the Poisson distribution to predict presence/absence as a function of density.

For example, according to the Poisson distribution, if the average density per plot is µ, then the probability that a sampled plot will be labelled as a presence is 1 − exp(−µ), and the probability that a sampled plot will be labelled as an absence is exp(−µ). So a change in the overall proportion of sites that are occupied corresponds to a change in the overall average density.

Note that we are implicitly assuming that all absences are true absences, i.e. not false negatives. If false negatives are possible, you really should be using an occupancy design rather than a simple presence/absence design.
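The density-to-occupancy conversion, and its inverse, can be sketched in a few lines (a plain-Python illustration of the Poisson relationship above, using the 0.20/visit rate from the least bittern example that follows):

```python
import math

# Poisson link between average density per plot (mu) and the probability
# a plot is recorded as a "presence", assuming perfect detection
# (i.e. no false negatives).

def presence_prob(mu: float) -> float:
    """P(at least one individual on the plot) = 1 - exp(-mu)."""
    return 1.0 - math.exp(-mu)

def density_from_presence(p: float) -> float:
    """Invert the relationship: mu = -log(1 - p)."""
    return -math.log(1.0 - p)

p = presence_prob(0.20)
print(f"P(presence) = {p:.3f}")                         # 0.181, roughly 1 in 5 visits
print(f"density back out = {density_from_presence(p):.2f}")  # 0.20
```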
We will use the example that ships with Program MONITOR. This example focuses on the least bittern (Ixobrychus exilis), a secretive marsh bird. Least bittern populations are hard to monitor given their quirky habits, that is, their unpredictable calling behavior. Calls are the only way to detect the species' presence within the dense vegetation of the marshes where it lives. Consider that baseline surveys of least bitterns between May 15 and June 15 indicate that an average of about 0.20 calling least bitterns were heard on any given visit. A water control structure on the marsh is being altered to generate a more stable water level that should improve the situation for bitterns at the site. How much of a trend can be detected with 10 years of monitoring and 10 visits to the marsh each year?

Here the average of 0.20 calls/visit implies that a "presence" was detected in about 1 in 5 visits to the marsh.

Start by entering the data on the main page and then on the plots page.
With presence/absence data, the plot "mean" should have the approximate base rate of presences, and there is no need for a standard deviation estimator. On the main page, tests for trend in presence/absence data are equivalent to "chi-square tests" (covered in another section of the notes). The Custom/ANOVA area indicates a doubling of the presence frequency in the second through tenth year of monitoring.

Before computing the power, press the Step button to get a feel for the data that are generated (not shown). I think this is where Program MONITOR has a "feature", as the data in the 3rd and subsequent visits never have any non-detects.

Consequently, I won't continue with this example until I understand what MONITOR is doing! I have SAS programs that can help in the planning of presence/absence studies – please contact me for assistance.
3.5 WARNING about testing for temporal trends

The Patuxent Wildlife Research Center has some sage advice about power analysis for temporal trends:

Users should be aware (and wary) of the complexity of power analysis in general, and also acknowledge some specific limitations of MONITOR for many real-world applications. Our chief, immediate concern is that many users of MONITOR may be unaware of these limitations and may be using the program inappropriately. Below are comments from one of our statisticians on some of the aspects of MONITOR that users should be cognizant of: "There are numerous issues with how Program Monitor calculates statistical power and sample size. One issue concerns the default option whereby the user assumes independence of plots or sites from one time period to the next. If you are randomly sampling new sites or plots each time period, then it is correct to assume independence (assuming that the finite population correction factor is not an issue, which depends on how many plots or sites you are sampling relative to the total population size of potential plots or sites). If you are sampling the same plots or sites repeatedly over time, however, then the default option in Program Monitor is unlikely to give a correct calculation of statistical power or sample size. If plots or sites are positively autocorrelated over time, as is usually the case in biological surveys, then Program Monitor will underestimate the sample size or, conversely, overestimate the statistical power. The correct sample size estimate is likely to be greater, and depending upon the amount of autocorrelation, the correct sample size could be vastly greater to achieve a stated power objective."

We deal with some of these issues when we discuss the design and analysis of BACI surveys later in this course.
Chapter 4

Regression - hockey sticks, broken sticks, piecewise, change points

A simple regression analysis assumes that the change in response is the same across the range of X values. In some cases, a model where the slope changes in different parts of the X space may be biologically more realistic.

This chapter examines two cases of fitting regression lines with breaks in the slope. In the first case, the location of the change in slope is known in advance; the second case also estimates the location of the change, also known as the change-point problem.

The examples in this chapter look at cases with a single change point – the extension to multiple change points (both known and unknown) is straightforward. Similarly, the change from linear to quadratic lines is also straightforward.

A related method, a spline fit, where a flexible curve is fit between (evenly) spaced knot points, giving something like a non-parametric curve fit, is explored in a different chapter.
4.1 Hockey-stick, piecewise, or broken-stick regression

In this section, the location of the change point is known. The statistical model is:

    Y = β0 + β1 X + β2 (X − C)+ + ε

where β0 is the intercept, β1 is the slope before the change point C, and β2 is the DIFFERENCE in slope after the change point. The slope after the change point is β1 + β2. The variable (X − C)+ is a derived variable which takes the value 0 for values of X less than C and the value X − C for values of X greater than C. This is usually created using a Formula Editor based on the actual data.

The hypothesis of interest is H: β2 = 0, which indicates no change in slope between X < C and X > C.

Because the value of C is specified in advance, ordinary least-squares can be used to fit the model. Most computer packages can easily fit this model.
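The notes fit this model in JMP via the Formula Editor; as a language-agnostic sketch, the same fit on made-up illustrative data (not from the notes) looks like this:

```python
import numpy as np

# Hockey-stick regression with a KNOWN change point C, fit by ordinary
# least squares. The derived variable (X - C)+ is zero below C and
# (X - C) above it, so beta2 estimates the CHANGE in slope at C.
# Simulated data: intercept 2, slope 1.0 below C = 5, slope 1.0 + 2.0 above.

rng = np.random.default_rng(1)
C = 5.0
x = np.linspace(0, 10, 50)
xplus = np.clip(x - C, 0.0, None)            # the (X - C)+ derived variable
y = 2.0 + 1.0 * x + 2.0 * xplus + rng.normal(0, 0.3, x.size)

# Design matrix [1, X, (X - C)+] and the OLS fit
X = np.column_stack([np.ones_like(x), x, xplus])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta
print(f"intercept = {b0:.2f}, slope before C = {b1:.2f}, "
      f"change in slope = {b2:.2f}, slope after C = {b1 + b2:.2f}")
```

The estimates should land close to the generating values (2, 1, 2); a t-test of β2 = 0 would then test for a change in slope, exactly as described above.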
4.1.1 Example: Nenana River Ice Breakup Dates

The Nenana River in the Interior of Alaska usually freezes over during October and November. The ice continues to grow throughout the winter, accumulating an average maximum thickness of about 110 cm, depending upon winter weather conditions. The Nenana River Ice Classic competition began in 1917 when railroad engineers bet a total of 800 dollars, winner takes all, guessing the exact time (month, day, hour, minute) ice on the Nenana River would break up. Each year since then, Alaska residents have guessed at the timing of the river breakup. A tripod, connected to an on-shore clock with a string, is planted in two feet of river ice during river freeze-up in October or November. The following spring, the clock automatically stops when the tripod moves as the ice breaks up. The time on the clock is used as the river ice breakup time. Many factors influence the river ice breakup, such as air temperature, ice thickness, snow cover, wind, water temperature, and depth of water below the ice. Generally, the Nenana River ice breaks up in late April or early May (historically, April 20 to May 20). The time series of the Nenana River ice breakup dates can be used to investigate the effects of climate change in the region.
In 2010, the jackpot was almost $300,000 and the ice went out at 9:06 on 2010-04-29. In 2012, the jackpot was over $350,000 and the ice went out at 19:39 on 2012-04-23, as reported at http://www.cbc.ca/news/offbeat/story/2012/05/02/alaska-ice-contest.html. The latest winner, Tommy Lee Waters, has also won twice before, but has never been a solo winner. Waters spent time drilling holes in the area to measure the thickness of the ice. Altogether he spent $5,000 on tickets for submitting guesses (he purchased every minute of the afternoon of 23 April) and spent an estimated 1,200 hours working out the math by hand. And it was also his birthday! (What are the odds?) You too can use statistical methods to gain fame and fortune!

More details about the Ice Classic are available at http://www.nenanaakiceclassic.com. The data is available in the nenana.jmp data file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
A simple regression line fit to the time of breakup with year as the predictor shows evidence of a decline over time (i.e., the time of breakup is tending to occur earlier), and there is no evidence of auto-correlation.
©2012 Carl James Schwarz, November 23, 2012
A closer inspection of the top graph gives the impression that until about 1970, the regression line was “flat” and only after 1970 did the time of breakup seem to decrease.
A broken-stick model (separate slopes in the pre-1970 and the post-1970 eras) can be easily fit. We need to create a new variable that is zero for the pre-1970 period and equal to (year − 1970) in the post-1970 period. This is easily created in JMP using the Formula Editor.

This is then fit using the Analyze->Fit Model platform:
which gives the estimates:

The formal statistical model is:

Date = β0 + β1·(year) + β2·(year − 1970)+ + ε

In years prior to 1970, the slope is β1. In years after 1970, the slope is β1 + β2. A test for differential slopes in the two eras is then equivalent to a test of whether β2 = 0.
In this case the p-value for the β2 coefficient (associated with the (year − 1970)+ variable) is just under 0.05, providing some evidence of a different slope in the two eras.

A plot of the fitted line is obtained by saving the predicted values to the data table and then plotting the actual data and the fitted points on the same graph using the Graph->Overlay platform.

Confidence intervals for the MEAN response in a particular year (not likely of interest in this example)
and for the individual responses in a particular year are generated in the usual way.

Note that the estimated slope for the pre-1970 era is not statistically different from 0. If you wanted to fit a model where the line was flat (i.e., the slope was 0) in the pre-1970 era, this is done by using only the (year − 1970)+ variable. Many of the automatically generated plots look odd (e.g., all of the points appear to be replotted at 1970), and the intercept has a different interpretation in the two models because year = 0 has a different definition in the two models, but if the fitted model is plotted against the original year variable everything works out properly. In this particular case, the two latter models give predicted lines that are almost identical. In practice, it is quite RARE that you would fit a line whose slope is known to be zero.
4.2 Searching for the change point

In the previous section on segmented regression (also known as hockey-stick regression or broken-stick regression), the locations of the break are assumed to be known. In many cases, the location of the break is not known, and it is of interest to estimate the break point as well.

The problem of identifying changes at unknown times and of estimating the location of changes is known as “the change-point problem”. Numerous methodological approaches have been implemented for change-point models. Maximum-likelihood estimation, Bayesian estimation, isotonic regression, piecewise regression, quasi-likelihood, and non-parametric regression are among the methods which have been applied to change-point problems. Grid-searching approaches have also been used. A review of the literature, especially as it applies to regression problems (as of 2008), is available at: http://biostats.bepress.com/cgi/viewcontent.cgi?article=1075&context=cobra.
The standard change-point problem in regression models consists of

• testing the null hypothesis that no change in regimes has taken place against the alternative that observations were generated by two (or possibly more) distinct regression equations, and

• estimating the two regimes that gave rise to the data.

There are two common models: models where the regression line is continuous at the break point, and models where the regression line can be discontinuous. In these notes, we only consider the continuous case.
This problem has a long history. A nice summary and treatment of the problem is available in

Toms, J. D. and Lesperance, M. L. (2003). Piecewise regression: A tool for identifying ecological thresholds. Ecology, 84, 2034-2041. http://dx.doi.org/10.1890/02-0472
The change-point model starts with the broken-stick model seen earlier, i.e.

Y = β0 + β1·X + β2·(X − C)+ + ε

where Y is the response variable, X is the covariate, and C is the change point, i.e. where the break occurs. This model is appropriate where there is an abrupt transition at the break point, but a smooth transition may be more realistic for some data. One drawback of this model is that convergence problems can occur in locating C when the data are sparse in the neighborhood of C.
Toms and Lesperance (2003) review the use of models with gentler transitions, e.g. the hyperbolic tangent model or the bent-cable model. The bent-cable regression model was developed by Chiu, Lockhart and Routledge (2006, Bent-cable regression theory and application, Journal of the American Statistical Association, 101, 542-553). The bent-cable regression model fits a smooth transition between the two linear parts of the model. The latter is also applicable to regression models where the X variable is time and auto-correlation may be present.¹

The simple piece-wise linear model can be fit using the Analyze->Modelling->NonLinear platform of JMP.
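With C unknown, the problem becomes non-linear. One simple approach, consistent with the grid-searching methods mentioned above, is to profile out the linear parameters by ordinary least squares at each candidate break point and keep the candidate with the smallest residual sum of squares. A sketch in Python follows; the data are simulated with a true break at 1967, and all names are illustrative:

```python
import numpy as np

def rss_at_break(x, y, C):
    """Residual sum of squares of the broken-stick OLS fit with break at C."""
    X = np.column_stack([np.ones_like(x), x, np.clip(x - C, 0, None)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def grid_search_change_point(x, y, grid):
    """Profile C over a grid of candidates: at each candidate the linear
    parameters are fit by OLS, and the C with the smallest RSS is the
    (non-linear) least-squares estimate of the change point."""
    rss = np.array([rss_at_break(x, y, C) for C in grid])
    return grid[np.argmin(rss)]

# Simulated breakup dates: flat until 1967, then 0.5 days/year earlier.
rng = np.random.default_rng(2012)
year = np.arange(1917, 2013).astype(float)
julian = 125 - 0.5 * np.clip(year - 1967, 0, None) + rng.normal(0, 1, year.size)

grid = np.arange(1930.0, 2000.0, 1.0)
C_hat = grid_search_change_point(year, julian, grid)
```

A grid search like this avoids the convergence problems that derivative-based non-linear least squares can have when data are sparse near C, at the cost of restricting the estimate to the grid.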
4.2.1 Change point model for the Nenana River Ice Breakup

Refer to the previous section for details on the Nenana River Ice Breakup contest. Rather than specifying a break point at 1970, we will fit the change point model to estimate the change point.

The data are available in the Nenana.jmp data table in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The statistical model is:

JulianDate = β0 + β1·(Year) + β2·(Year − C)+ + ε

where JulianDate is the date of breakup and Year is the calendar year. The parameters to be estimated are β0 the intercept, β1 the slope prior to the change point, β2 the change in slope after the change point, and C the change point.

We first need to define the parameters of the model (β0, β1, β2, C) and the predicted value in terms of the parameters of the model. We start by creating a new column in the data table, ChangePointPredictor, and start the Formula Editor.
¹ Chiu, G. S. and Lockhart, R. L. (2010). Bent-cable regression with auto-regressive noise. Canadian Journal of Statistics, 38, 386-407. http://dx.doi.org/10.1002/cjs.10070
New parameters are defined (along with initial starting guesses) by using the drop-down menu in the top left of the Formula Editor:

Click on the New Parameters item and create the four parameters and their initial values (based on the results from the previous example). The choice of initial values is not that crucial. Then create the predicted value in terms of the parameters and the columns in the data table:
Notice the use of the If function to adjust for the break point. You can switch back and forth between the parameters, data table columns, etc. using the drop-down menu in the top right of the Formula Editor. When you are finished, close the Formula Editor, and the data table will be updated with initial predictions based on the initial values specified.

Select the Analyze->Modelling->NonLinear platform:
Specify the predicted value and Y variables appropriately:
Notice that the formula for the predictions is displayed.

This brings up the Analyze->Modelling->NonLinear platform control panel. The initial fit is displayed. Press the Go button to find the non-linear least-squares fit.
The non-linear least-squares algorithm appears to have converged at the estimates listed in the table. The estimated change point of 1967 is close to the value of 1970 “guess-timated” earlier. Approximate standard errors are also presented at the bottom of the output.

These standard errors are based on large-sample theory. In order to compute a 95% confidence interval for the break point, you could use the standard estimate ± 2(se), but in small samples the resulting confidence intervals may not perform well. Toms and Lesperance (2003) recommend that a likelihood-ratio confidence interval be computed. JMP attempts to compute profile-likelihood confidence intervals when you press the
Confidence Interval button, which gives:

In this case, the profile intervals fail to give upper and lower bounds because the slope after the change point is just on the boundary of statistical significance at α = 0.05. If you change the confidence coefficient from 95% to 90%, the procedure is able to find confidence bounds on the C parameter. Consequently, there may or may not be a change point. Notice that the lower boundary of the confidence interval for C is quite far below the point estimate!
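The profile-likelihood interval recommended by Toms and Lesperance (2003) can also be sketched directly. Under normal errors, minus twice the profile log-likelihood equals n·log RSS(C) up to a constant, so an approximate 95% interval for C collects every grid candidate whose profiled deviance lies within 3.84 (the chi-square critical value on 1 df) of the minimum. The simulated data and all names below are illustrative:

```python
import numpy as np

def profile_deviance(x, y, grid):
    """n * log(RSS(C)) for each candidate C, with the linear parameters
    (intercept and two slopes) profiled out by ordinary least squares."""
    n = len(y)
    dev = []
    for C in grid:
        X = np.column_stack([np.ones_like(x), x, np.clip(x - C, 0, None)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        dev.append(n * np.log(np.sum((y - X @ beta) ** 2)))
    return np.array(dev)

def profile_ci(x, y, grid, crit=3.84):
    """Approximate profile-likelihood interval for C: all candidates whose
    profiled deviance is within `crit` of the minimum (3.84 = 95%, 1 df)."""
    dev = profile_deviance(x, y, grid)
    keep = grid[dev - dev.min() <= crit]
    return keep.min(), keep.max()

# Simulated series with a true break at 1967.
rng = np.random.default_rng(7)
year = np.arange(1917, 2013).astype(float)
julian = 125 - 0.5 * np.clip(year - 1967, 0, None) + rng.normal(0, 1, year.size)

grid = np.arange(1930.0, 2000.0, 0.5)
lo, hi = profile_ci(year, julian, grid)
```

When the slope change is weak, the set of candidates within the threshold can run into the edge of the grid, which is the grid-search analogue of JMP failing to find a bound.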
Confidence intervals for the mean response and prediction intervals for a future response are obtained by clicking on the red triangle:

These are interpreted in the same way as in ordinary regression.
The Analyze->Modelling->NonLinear platform also allows you to “play” with the estimates to investigate the sensitivity of the fit to the parameters. The Profiler option under the red triangle is also useful in these cases.
4.3 How NOT to search for a change point!

A fairly common “request” in our Statistical Consulting Service is for help in finding the time at which some treatment gives a difference in response from a control. For example, a group of animals may be fed a control diet and measured over time, while another group of animals is fed an experimental diet and measured over time. At which point do the responses between the two groups start to differ?

Let us assume, for simplicity, that separate animals are measured at each time point so that the problems of longitudinal data can be ignored. For example, suppose that animals must be sacrificed at each time point to measure the response. A naive analysis starts by plotting the means of the two groups over time and searching for the first time point at which the two means are statistically different:
This is NOT A VALID ANALYSIS! The problem is that the estimate of the change point from this analysis will depend on the sample size. If the sample size in each group is small, then the standard error bars are larger, and the estimated change point tends to be larger than if the sample size is large and the standard errors are smaller. The actual change point does NOT depend on sample size! All that should happen is that the estimated precision of the change point should be worse for smaller sample sizes than for larger sample sizes.
The proper way to search for a change point is to compute the DIFFERENCE in means at each time point and then apply the analysis of the previous sections to the differences. A model where the difference in means is forced to be zero prior to the unknown change point may be a suitable alternate model.
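A sketch of the valid approach in Python (all numbers and names are invented): take the difference in group means at each time point, then fit a flat-then-sloped broken-stick model to the differences, estimating the change point by the same kind of grid search used earlier.

```python
import numpy as np

def rss_flat_then_slope(t, d, C):
    """RSS of the model d = b0 + b1*(t - C)+ + error: the difference is
    flat (at b0, ideally near zero) before C and changes linearly after."""
    X = np.column_stack([np.ones_like(t), np.clip(t - C, 0, None)])
    beta, *_ = np.linalg.lstsq(X, d, rcond=None)
    return np.sum((d - X @ beta) ** 2)

# Group means at each time (separate animals, so observations independent);
# the treated group departs from the control after time 12.
rng = np.random.default_rng(11)
t = np.arange(1.0, 25.0)
control = 5.0 + rng.normal(0, 0.3, t.size)
treated = 5.0 + 0.8 * np.clip(t - 12, 0, None) + rng.normal(0, 0.3, t.size)
diff = treated - control            # analyse the DIFFERENCE, not the first
                                    # time point that happens to be "significant"

grid = np.arange(3.0, 22.0, 0.25)
C_hat = grid[np.argmin([rss_flat_then_slope(t, diff, C) for C in grid])]
```

Unlike the naive first-significant-time approach, the estimate here targets the actual change point; larger samples only shrink its standard error rather than shifting the estimate itself.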
Chapter 5

Analysis of Covariance - ANCOVA

5.1 Introduction
In previous chapters, we looked at comparing group means from data collected from a single-factor completely randomized design and analyzed using ANOVA. We also looked at estimating the slope of a straight line between two variables. In both cases the response variable, Y, was continuous (interval or ratio scale). In the case of ANOVA, the X variable was nominal or ordinal in scale and served to identify the treatment groups. In the regression setting, the X variable was also continuous.

The Analysis of Covariance (ANCOVA) is a combination of both analyses. Groups are identified by a nominal or ordinal scale variable, and a continuous covariate is also measured.
There are two uses of ANCOVA which, on the surface, appear to be separate analyses. In fact, both analyses are identical.

The first use is to check if the regression lines for the groups are parallel. If there is evidence that the individual regression lines are not parallel, then a separate regression line must be fit for each group for prediction purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are coincident, i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident, then a series of parallel lines is fit to the data. All of the data are used to estimate the common slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled together and a single regression line fit for all of the data.

The three possibilities are shown below for the case of two groups; the extension to many groups is obvious:
Second, ANCOVA has been used to test for differences in means among the groups when some of the variation in the response variable can be “explained” by a covariate. For example, the effectiveness of two different diets can be compared by randomizing people to the two diets and measuring the weight change during the experiment. However, some of the variation in weight change may be related to initial weight. Perhaps by “standardizing” everyone to some common weight, we can more easily detect differences among the groups.
Insert graphs here

A very nice book on the Analysis of Covariance is Analysis of Messy Data, Volume III: Analysis of Covariance by G. A. Milliken and D. E. Johnson. Details are available at http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869.
5.2 Assumptions

As before, it is important to verify the assumptions underlying the analysis before the analysis is started. As ANCOVA is a combination of ANOVA and Regression, the assumptions are similar. Both goals of ANCOVA have similar assumptions:

• The response variable Y is continuous (interval or ratio scaled).

• The data are collected under a completely randomized design.¹ This implies that the treatment must be randomized completely over the entire set of experimental units in an experimental study, or units must be selected at random from the relevant populations in an observational study.

• There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that don’t appear to follow the straight line.

• The relationship between Y and X must be linear for each group.² Check this assumption by looking at the individual plots of Y vs. X for each group.

• The variance must be equal for both groups around their respective regression lines. Check that the spread of the points is equal across the range of X and that the spread is comparable between the two groups. This can be formally checked by looking at the MSE from a separate regression line for each group, as the MSE estimates the variance of the data around the regression line.

• The residuals must be normally distributed around the regression line for each group. This assumption can be checked by examining the residual plots from the fitted model for evidence of non-normality. For large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to detect anything but gross departures.
5.3 Comparing individual regression lines

You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit to a set of data. The model must describe the treatment structure, the experimental unit structure, and the randomization structure. Let Y be the response variable, X be the continuous X-variable, and Group be the group factor.

In all cases that follow, we assume that a completely randomized design was used for the randomization structure. This implies that there are no explicit terms for the randomization structure in the model.

Similarly, there is a single size of experimental unit, with no blocking or sub-sampling occurring. This also implies there will be no terms in the model for the experimental unit structure. In more advanced courses, the analyses in this chapter can be extended to more complex designs.
¹ It is possible to relax this assumption - this is beyond the scope of this course.
² It is possible to relax this assumption as well, but this is again beyond the scope of this course.
In earlier chapters, we saw that the model for a single-factor completely randomized design is

Y = Group

This is read as saying that variation in Y can be partially explained by an overall grand mean (never specified) with differences in the mean caused by Groups, plus implicit random noise (which is never specified).

Again from an earlier chapter, the model for a regression of Y on X is

Y = X

This is read as saying that the variation in Y can be partially explained by an intercept (never specified) plus changes in X, plus implicit random noise (which is never specified).

As ANCOVA is a combination of the above two analyses, it will not be surprising that the models will have terms corresponding to both Group and X. Again, there are three cases.

If the lines for each group are not parallel:
the appropriate model is

Y = Group X Group*X

The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never specified), followed by group effects (different intercepts), a common slope on X, and an “interaction” between Group and X which is interpreted as different slopes for each group. This model is almost equivalent to fitting a separate regression line for each group. The only advantage to using this joint model for all groups is similar to that enjoyed by using ANOVA - all of the groups contribute to a better estimate of the residual error. If the number of data points per group is small, this can lead to improvements in precision compared to fitting each group individually.
If the lines are parallel across groups, but not coincident:

the appropriate model is

Y = Group X

The terms can be in any order. The only difference between this and the previous model is that this simpler model lacks the Group*X “interaction” term. It would not be surprising, then, that a statistical test to see if
this simpler model is tenable would correspond to examining the p-value of the test on the Group*X term from the complex model. This is exactly analogous to testing for interaction effects between factors in a two-factor ANOVA.

Lastly, if the lines are coincident:

the appropriate model is

Y = X

Now the difference between this model and the previous model is the Group term that has been dropped. Again, it would not be surprising that this corresponds to the test of the Group effect in the formal statistical test. The test for coincident lines should only be done if there is insufficient evidence against the hypothesis of parallelism.

While it is possible to test for a non-zero slope, this is rarely done.
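The whole testing sequence, interaction first and then the group effect, can be sketched with extra-sum-of-squares F-tests. The simulated data below (parallel, non-coincident lines) and all names are illustrative; in practice the same tests are read directly off the JMP output:

```python
import numpy as np

def ols_rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

rng = np.random.default_rng(4)
n = 40
x = np.tile(np.linspace(0.0, 10.0, n // 2), 2)   # covariate, both groups
g = np.repeat([0.0, 1.0], n // 2)                # 0/1 group indicator
# Simulate parallel but non-coincident lines: common slope 2, intercepts 1 and 4.
y = 1.0 + 3.0 * g + 2.0 * x + rng.normal(0, 0.5, n)

one = np.ones(n)
X_sep  = np.column_stack([one, g, x, g * x])  # Y = Group X Group*X (separate lines)
X_par  = np.column_stack([one, g, x])         # Y = Group X        (parallel lines)
X_coin = np.column_stack([one, x])            # Y = X              (coincident lines)

# Extra-sum-of-squares F-test for the Group*X interaction (test of parallelism)
rss_sep, rss_par = ols_rss(X_sep, y), ols_rss(X_par, y)
F_inter = (rss_par - rss_sep) / (rss_sep / (n - 4))
# F-test for the Group term (test of coincidence), done only if parallelism holds
rss_coin = ols_rss(X_coin, y)
F_group = (rss_coin - rss_par) / (rss_par / (n - 3))
```

For these data, F_inter should be small (no evidence against parallelism) while F_group should be large (strong evidence against coincidence), matching the parallel-but-not-coincident case described above.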
5.4 Comparing Means after covariate adjustments

to be added later

5.5 Power and sample size

to be added later

- use the MSE as the estimate of variance for testing MEANS and for testing the slope.
5.6 Example - Degradation of dioxin
An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.
Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.
Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The liver is excised and the livers from all four crabs are composited together into a single sample.³ The dioxin level in this composite sample is measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
As seen in the chapter on regression, the appropriate response variable is log(TEQ).
Is the rate of decline the same for both sites? Did the sites have the same initial concentration?
Here are the raw data, which are also available in the dataset dioxin2.jmp in the Sample Program Library at SampleProgramLibrary.
³ Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
Site Year TEQ    log(TEQ)
a    1990 179.05 5.19
a    1991  82.39 4.41
a    1992 130.18 4.87
a    1993  97.06 4.58
a    1994  49.34 3.90
a    1995  57.05 4.04
a    1996  57.41 4.05
a    1997  29.94 3.40
a    1998  48.48 3.88
a    1999  49.67 3.91
a    2000  34.25 3.53
a    2001  59.28 4.08
a    2002  34.92 3.55
a    2003  28.16 3.34
b    1990  93.07 4.53
b    1991 105.23 4.66
b    1992 188.13 5.24
b    1993 133.81 4.90
b    1994  69.17 4.24
b    1995 150.52 5.01
b    1996  95.47 4.56
b    1997 146.80 4.99
b    1998  85.83 4.45
b    1999  67.72 4.22
b    2000  42.44 3.75
b    2001  53.88 3.99
b    2002  81.11 4.40
b    2003  70.88 4.26
The data can be entered into JMP in the usual fashion. Make sure that Site is a nominal scale variable, and that Year is a continuous variable.
In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say, for site a) and using Rows->Markers to set the plotting symbol for the selected rows:
The final data sheet has two different plotting symbols for the two sites:
Before fitting the various models, begin with an exploratory examination of the data, looking for outliers and checking the assumptions.
Each year's data is independent of other years' data, as a different set of crabs was selected. Similarly, the data from one site are independent of the data from the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the sea floor to capture the available crabs in the area.
Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable were body mass of small fish, then poor growing conditions in a single year could depress the growth of fish in all locations. This would violate the assumption of independence, as the residual at one site in a year would be related to the residual at the other site in the same year. You tend to see the residuals "paired", with negative residuals from the fitted line at one site matched (by year) with negative residuals at the other site. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors, and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related.⁴
Use the Analyze->Fit Y-by-X platform and specify log(TEQ) as the Y variable and Year as the X variable:
⁴ If you actually try to fit a process-error term to this model, you find that the estimated process error is zero.
Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit title line and selecting Site as the grouping variable:
Now select Fit Line from the same pop-down menu:
to get separate lines fit for each group:
The relationship for each site appears to be linear. The actual estimates are also presented:
The scatterplot doesn't show any obvious outliers. The estimated slope for the a site is -.107 (se .02) while the estimated slope for the b site is -.06 (se .02). The 95% confidence intervals (not shown on the output, but available by right-clicking/ctrl-clicking on the parameter estimates table) overlap considerably, so the slopes could be the same for the two groups.
The MSE from site a is .10 and the MSE from site b is .12. These correspond to standard deviations of √.10 = .32 and √.12 = .35, which are very similar, so the assumption of equal standard deviations seems reasonable.
The residual plots (not shown) also look reasonable.
The assumptions appear to be satisfied, so let us now fit the various models.
First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used. The terms can be in any order and correspond to the model described earlier. This gives the following output:
The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the Effect test for the Site*Year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel.
We need to refit the model, dropping the interaction term:
which gives the following regression plot:
This shows the fitted parallel lines. The effect tests now have a small p-value for the Site effect, indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This would mean that the rate of decay of the dioxin appears to be equal at both sites, but the initial concentration appears to be different.
The estimated (common) slope is found in the Parameter Estimates portion of the output:
and has a value of -.083 (se .016). Because the analysis was done on the log-scale, this implies that the dioxin levels changed by a factor of exp(−.083) = .92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log-scale is (−.12, −.05), which corresponds to a potential factor between exp(−.12) = .88 and exp(−.05) = .95 per year, i.e. between a 12% and a 5% decline per year.⁵
While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to the log(TEQ) at the average value of Year - not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure to give:
⁵ The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.
The estimated difference between the lines (on the log-scale) is 0.46 (se .13). Because the analysis was done on the log-scale, this corresponds to a ratio of exp(.46) = 1.58 in dioxin levels between the two sites, i.e. site b has 1.58 times the dioxin level of site a. Because the lines are parallel and declining, the dioxin levels are falling at both sites, but the 1.58 times ratio remains consistent.
Finally, the Actual by Predicted plot (not shown here), the leverage plots (not shown here), and the residual plot don't show any evidence of a problem in the fit.
5.7 Change in yearly average temperature with regime shifts
The ANCOVA technique can also be used for trends when there are KNOWN regime shifts in the series. The case when the timing of the shift is unknown is more difficult and not covered in this course.
For example, consider a time series of annual average temperatures measured at Tuscaloosa, Alabama from 1901 to 2001. It is well known that shifts in temperature readings can occur whenever the instrument, location, observer, or other characteristics of the station change.
The data are available in the JMP datafile tuscaloosa-avg-temp.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
A portion of the raw data is shown below:
and a time series plot of the data shows a shift in the readings in 1939 (thermometer changed), 1957 (station moved), and possibly in 1987 (location and thermometer changed).
It turns out that cases where the number of epochs tends to increase with the number of data points pose some serious technical issues with the properties of the estimators. See
Lu, Q. and Lund, R.B. (2007).
Simple linear regression with multiple level shifts. Canadian Journal of Statistics, 35, 447-458,
for details. Basically, if the number of parameters tends to increase with sample size, this violates one of the assumptions of maximum likelihood estimation. This would lead to estimates which may not even be consistent! For example, suppose that the recording conditions changed every two years. Then each pair of data points should still be able to estimate the common slope, but this corresponds to the well-known problem with case-control studies where the number of pairs increases with total sample size. Fortunately, Lu and Lund (2007) showed that this violation is not serious here.
The analysis proceeds as in the dioxin example with two sites, except that now the series is broken into different epochs corresponding to the sets of years when conditions remained stable at the recording site. In this case, this corresponds to the years 1901-1938 (inclusive); 1940-1956 (inclusive); 1958-1986 (inclusive); and 1989-2000 (inclusive). Note that the years 1939, 1957, and 1987 are NOT used because the average temperature in these years is an amalgam of two different recording conditions.⁶
For example, the data file (around the first regime change) may look like:
Note that Year and Avg Temp are both set to have a continuous scale, but Epoch should have a nominal or ordinal scale.
Model fitting proceeds as before by first fitting the model

AvgTemp = Year Epoch Year*Epoch

to see if the change in AvgTemp is consistent among epochs, and then fitting the model

AvgTemp = Year Epoch

to estimate the common trend (after adjusting for shifts among the epochs).
The Analyze->Fit Model platform is used:
⁶ If the exact day of the change were known, it would be possible to weight the two epochs in these years and include the data points.
There is no strong evidence that the slopes differ among the epochs (p=.10), despite the plot showing a potentially different slope in the 3rd epoch:
The simpler model with common slopes is then fit:
with fitted (common slope) lines:
No further model simplification is possible, and there is evidence that the common slope is different from zero.
The estimated change in average temperature is:
i.e. an estimated increase of .033 (SE .006) per year. The 95% confidence interval does not cover 0.
The residual plots (against predicted values and against the order in which the data were collected):
show no obvious problems.
Whenever time series data are used, autocorrelation should be investigated. The Durbin-Watson test is applied to the residuals, with no obvious problem detected.
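For reference, the Durbin-Watson statistic is easy to compute by hand from the residuals; a sketch with made-up residuals (values near 2 suggest no first-order autocorrelation, values near 0 or 4 suggest positive or negative autocorrelation):

```python
def durbin_watson(resid):
    """d = sum of squared successive differences of the residuals over the residual sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

e = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1]  # illustrative residuals only
print(round(durbin_watson(e), 2))  # prints: 3.11
```

These alternating illustrative residuals give d above 2, hinting at negative autocorrelation; residuals from a well-behaved fit like the one in the text would sit closer to 2.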
The leverage plot (against year)
also reveals nothing amiss.
A more sophisticated analysis can be fit using SAS, but isn't needed here. The sample program and output are available in the Sample Program Library.
5.8 Example - More refined analysis of stream-slope example
In the chapter on paired comparisons, the example of the effect of stream slope was examined based on:
Isaak, D.J. and Hubert, W.A. (2000). Are trout populations affected by reach-scale stream slope? Canadian Journal of Fisheries and Aquatic Sciences, 57, 468-477.
In that paper, stream slope was (roughly) categorized into high- or low-slope classes and a paired analysis was performed. In this section, we will use the actual stream slopes to examine the relationship between fish density and stream slope.
Recall that a stream reach is a portion of a stream, from 10 to several hundred metres in length, that exhibits consistent slope. The slope influences the general speed of the water, which exerts a dominant influence on the structure of physical habitat in streams. If fish populations are influenced by the structure of physical habitat, then the abundance of fish populations may be related to the slope of the stream.
Reach-scale stream slope and the structure of associated physical habitats are thought to affect trout populations, yet previous studies confound the effect of stream slope with other factors that influence trout populations.
Past studies addressing this issue have used sampling designs wherein data were collected either using repeated samples along a single stream or by measuring many streams distributed across space and time. Reaches on the same stream will likely have correlated measurements, making the use of simple statistical tools problematic. [Indeed, if only a single stream is measured at multiple locations, then this is an example of pseudo-replication and inference is limited to that particular stream.]
Inference from streams spread over time and space is made more difficult by inter-stream differences and by temporal variation in trout populations if samples are collected over extended periods of time. This extra variation reduces the power of any survey to detect effects.
For this reason, a paired approach was taken. A total of twenty-three streams were sampled from a large watershed. Within each stream, two reaches were identified and the actual slope gradient was measured.
In each reach, fish abundance was determined using electro-fishing methods and the numbers converted to a density per 100 m² of stream surface.
Table 6.1 presents the (fictitious, but based on the above paper) raw data.

Estimates of fish density from a paired experiment

Stream  slope (%)  slope class  density (per 100 m²)
1   0.7  low   15.0
1   4.0  high  21.0
2   2.4  low   11.0
2   6.0  high   3.1
3   0.7  low    5.9
3   2.6  high   6.4
4   1.3  low   12.2
4   4.0  high  17.6
5   0.6  low    6.2
5   4.4  high   7.0
6   1.3  low   39.8
6   3.2  high  25.0
7   2.0  low    6.5
7   4.2  high  11.2
8   1.3  low    9.6
8   4.2  high  17.5
9   2.0  low    7.3
9   3.6  high  10.0
10  0.7  low   11.3
10  3.5  high  21.0
11  2.3  low   12.1
11  6.0  high  12.1
12  2.5  low   13.2
12  4.2  high  15.0
13  2.3  low    5.0
13  6.0  high   5.0
14  1.2  low   10.2
14  2.9  high   6.0
15  0.7  low    8.5
15  2.9  high   7.0
16  1.1  low    5.8
16  3.0  high   5.0
17  2.2  low    5.1
17  5.0  high   5.0
18  0.7  low   65.4
18  3.2  high  55.0
19  0.7  low   13.2
19  3.0  high  15.0
20  0.3  low    7.1
20  3.2  high  12.0
21  2.3  low   44.8
21  7.0  high  48.0
22  1.8  low   16.0
22  6.0  high  20.0
23  2.2  low    7.2
23  6.0  high  10.1
Notice that the density varies considerably among streams but appears to be fairly consistent within each stream.
The raw data are available in a JMP datafile called paired-stream.jmp in the Sample Programs Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
As noted earlier, this is an example of an Analytical Survey. The treatments (low or high slope) cannot be randomized within a stream - the randomization occurs by selecting streams at random from some larger population of potential streams. As noted in the earlier chapter on Observational Studies, causal inference is limited whenever a randomization of experimental units to treatments cannot be performed.
Unlike the example presented in other chapters, where the slope is divided (arbitrarily) into two classes (low and high slope), we will now use the actual slope. A simple regression CANNOT be used because of the non-independence introduced by measuring two reaches on the same stream. However, an ANCOVA will prove to be useful here.
First, it seems sensible that the response to stream slope will be multiplicative rather than additive, i.e. an increase in the stream slope will change the fish density by a common fraction, rather than simply changing the density by a fixed amount. For example, it may turn out that a 1-unit change in the slope reduces density by 10% - if the density before the change was 100 fish/m², then after the change the new density will be 90 fish/m². Similarly, if the original density was only 10 fish/m², then the final density will be 9 fish/m². In both cases, the reduction is a fixed fraction, and NOT the same fixed amount (a change of 10 vs. 1).
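This multiplicative-versus-additive distinction is exactly what the log transform captures: a fixed percentage change becomes a fixed additive change on the log scale, whatever the starting density. A quick numerical check:

```python
import math

# a 10% reduction shifts log(density) by log(0.9), regardless of the starting density
drop_from_100 = math.log(90) - math.log(100)   # density 100 -> 90
drop_from_10 = math.log(9) - math.log(10)      # density 10 -> 9
print(round(drop_from_100, 6) == round(drop_from_10, 6))  # prints: True
```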
Create the log(density) column in the usual fashion (not illustrated here). In cases like this, the natural logarithm is preferred because the resulting estimates have a very nice, simple interpretation.⁷
An appropriate model will be one where each stream has a separate intercept (corresponding to the different productivities of each stream - acting like a block), with a common slope for all streams. The simplified model syntax would look like

log(density) = stream slope

where the term stream represents a nominal scaled variable and gives the different intercepts, and the term slope is the effect of the stream slope on the log(density).
This is fit using the Analyze->Fit Model platform as:
⁷ The JMP dataset also created a different plotting symbol for each stream using the Rows->Color or Mark by Column menu.
Note that stream must have a nominal scale and that slope must have a continuous scale. The order of the terms in the effects box is not important.
The output from the Analyze->Fit Model platform is voluminous, but a careful reading reveals several interesting features.
First is a plot of the common slope fit to each stream:
This shows a gradual increase as slope increases. This plot is hard to interpret, but a plot of observed vs. predicted values is clearer:
Generally, the observed values are close to the predicted values, except for two potential outliers. By clicking on these points, it is seen that both points belong to stream 2, where it appears that the increase in slope causes a large decrease in density, contrary to the general pattern seen in the other streams.
The effect tests fail to detect any influence of slope. Indeed, the estimated coefficient associated with a change in slope
is estimated to be .025 (se .0299), which is not statistically significant.⁸
Residual plots also show the odd behavior of stream 2:
If this rogue stream is "eliminated" from the analysis, the resulting plots do not show any problems (try it), but now the results are statistically significant (p=.035):
⁸ Because the natural log transform was used, "smallish" slope coefficients have an approximate percent-change interpretation. In this example, a slope of .025 on the (natural) log scale implies that the estimated fish density INCREASES by 2.5% every time the stream slope increases by one percentage point.
The estimated change in log-density per percentage-point change in the stream slope is found to be .05 (se .02), which is interpreted as a percentage-point increase in stream slope increasing fish density by about 5%.⁹
The remaining residual and leverage plots show no problems.
Yet another alternate analysis!
Because the treatment only has two levels, the same answers can also be obtained by estimating the ratio of the change in log(density) to the change in slope.¹⁰ To begin, we need to split the data table so that both the log(density) and the slope are in separate columns:
⁹ This easy interpretation occurs because the natural log transform was used. If the common (base-10) log transform were used, there would no longer be such a simple interpretation.
¹⁰ If the slope class had three or more levels, this analysis could not be done, and the previous analysis would be the preferred route.
This creates a data table with separate columns for the log(density) and the stream slope for both the high- and low-slope categories:
Now create two new variables (create new columns and write a formula for each column) representing the differences in log(density) and slope between the high- and low-slope classes:
Finally, we wish to fit a line through the origin to these data points. We use the Analyze->Fit Y-by-X platform, then Fit Special from the red-triangle drop-down menu:
and then check the Constrain intercept option:
This gives the following output:
We obtain the same estimated effect and se. The outlier from stream 2 is readily evident. When this outlier is excluded and the analysis is repeated, a statistically significant result is again obtained that matches the previous analysis.
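Outside of JMP, the same through-origin fit to the paired differences can be sketched in a few lines. This is an illustration only: the difference values below are hypothetical, not the actual stream values from the data file.

```python
import numpy as np

# Hypothetical per-stream differences between the high- and low-slope classes
# (the real values come from the stream data table; these are made up).
d_slope = np.array([0.9, 1.1, 4.0, 1.3, 0.8])        # change in slope
d_logden = np.array([-0.7, -0.9, -2.5, -1.0, -0.6])  # change in log(density)

# Least-squares slope for a line through the origin: b = sum(x*y) / sum(x^2)
b_hat = np.sum(d_slope * d_logden) / np.sum(d_slope ** 2)

# Standard error of the through-origin slope (n - 1 df: one parameter fitted)
resid = d_logden - b_hat * d_slope
s2 = np.sum(resid ** 2) / (len(d_slope) - 1)
se_b = np.sqrt(s2 / np.sum(d_slope ** 2))
```

Deleting the pair with the large change in slope and refitting shows, as in the JMP analysis, how sensitive a through-origin slope can be to a single influential pair.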
5.9 Comparing Fulton's Condition Factor K
Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of fish within a lake?
In general, the relationship between fish weight and length follows a power law:
W = a L^b
where W is the observed weight; L is the observed length; and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.
There are at least eight different measures of condition which can be found by a simple literature search. Cone (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship.
One common measure is Fulton's[11] K:
K = Weight / (Length/100)^3
This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions and specific gravity do not change.
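With weight in grams and length in millimetres, K follows directly from the formula above. A minimal sketch (the fish below are invented; the real data are in the rainbow-condition.jmp file):

```python
import numpy as np

# Hypothetical fish: weight (g) and length (mm)
weight = np.array([480.0, 510.0, 350.0, 620.0])
length = np.array([325.0, 330.0, 295.0, 355.0])

# Fulton's condition factor: K = weight / (length/100)^3
K = weight / (length / 100.0) ** 3
mean_K = K.mean()
```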
How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?
The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded. The data are available in the rainbow-condition.jmp data file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
A portion of the raw data appears below:
[11] There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor – Setting the Record Straight. Fisheries, 31, 236-238.
K was computed for each individual fish, and the resulting histogram is displayed below:
There is a range of condition numbers among the individual fish, with an average (among the fish caught) K of about 13.6.
Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish.
Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, it has a selectivity curve and is typically more selective for fish of a certain size. In this experiment, several different mesh sizes were used to try to ensure that fish of all sizes have an equal chance of being selected.
As well, regression methods have an advantage in that a simple random sample from the population is
no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with a range of fertilities, or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.
Fulton's index is often re-expressed for regression purposes as:
W = K (L/100)^3
This looks like a simple regression between W and (L/100)^3, but with no intercept.
A plot of these two variables (below) shows a tight relationship among fish, but with possibly increasing variance with length.
There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the "error" in the regression is in the vertical direction, i.e. the analysis conditions on the observed lengths. However, the structural relationship between weight and length likely has error in both variables. This leads to the error-in-variables problem in regression, which
has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.
JMP can be used to fit the regression line constraining the intercept to be zero by using the Fit Special option under the red-triangle menu:
This gives rise to the fitted line and statistics about the fit:
Note that R^2 really doesn't make sense in cases where the regression is forced through the origin, because the null model to which it is being compared is the line Y = 0, which is silly.[12] For this reason, JMP does not report a value of R^2.
The estimated value of K is 13.72 (SE 0.099).
The residual plot shows clear evidence of increasing variation with the length variable. This usually implies that a weighted regression is needed, with weights proportional to 1/length^2. In this case, such a regression gives essentially the same estimate of the condition factor (K̂ = 13.67, SE = 0.11).
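Both the unweighted and the weighted through-origin fits have simple closed forms that can be sketched with numpy. The data here are simulated around K = 13.7 with noise that grows with length, mimicking the fan shape in the residual plot; this is an illustration, not the rainbow-trout analysis itself.

```python
import numpy as np

rng = np.random.default_rng(1)
length = rng.uniform(250.0, 400.0, 200)   # mm, simulated
x = (length / 100.0) ** 3

# Simulate weights (g) with standard deviation proportional to length
weight = 13.7 * x + rng.normal(0.0, 0.08 * length)

# Ordinary least squares through the origin
K_ols = np.sum(x * weight) / np.sum(x ** 2)

# Weighted least squares through the origin, weights proportional to 1/length^2
w = 1.0 / length ** 2
K_wls = np.sum(w * x * weight) / np.sum(w * x ** 2)
```

As in the actual data (13.72 vs. 13.67), the two estimates agree closely; this is the usual outcome when the relationship is tight.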
Comparing condition factors
This dataset has a number of sub-groups – do all of the subgroups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. As noted by Garcia-Berthou (2001)[13], this is best done through a technique called Analysis of Covariance (ANCOVA). Some details on ANCOVA are presented in a separate chapter of these notes.
As outlined in the ANCOVA chapter, we start with a model that has a separate K for each maturity class. The simplified syntax for this model is:
W = (Len/100)^3 (Len/100)^3*Maturity
Note that, unlike traditional ANCOVA models, this model is lacking the simple effect of maturity. The reason for this is that, unlike traditional ANCOVA models, the intermediate model with parallel slopes really
[12] Consult any of the standard references on regression, such as Draper and Smith, for more details.
[13] Garcia-Berthou E. (2001). On the misuse of residuals in ecology: testing regression residuals vs. the analysis of covariance. Journal of Animal Ecology 70, 708-711. http://dx.doi.org/10.1046/j.1365-2656.2001.00524.x
doesn't make sense when the regression lines are forced through the origin. This syntax specifies that variation in weight is attributable to variation in length and an interaction between length and maturity. This latter term represents the differential K between the maturity classes.
Here is where some care must be taken. By default, JMP "centers" (i.e. subtracts the mean of) continuous X variables when they participate in an interaction or similar term:
Hence, if you just try to implement the above model directly in JMP, you will actually fit the model:
W = (Len/100)^3 ((Len/100)^3 - mean[(Len/100)^3])*Maturity
which, when expanded, actually adds an intercept term to the model. Ordinarily, in regression models with intercepts, this would NOT be a problem – it is because the model is being forced through the origin that this causes a problem.
In order to prevent JMP from "centering" the length variable when fitting these ANCOVA models, turn off the centering option (by unchecking the option) when the model is fit using the Analyze->Fit Model platform of JMP:
Note the use of the No Intercept option to again force the line through the origin. JMP will 'complain' about the odd form of the model because it is missing the simple maturity-class effect, but just ignore the complaints. This gives the summary output for the effect test:
The p-value for the last term in the table, 0.027, indicates that there is strong evidence of a different K between the two maturity classes.
The estimates for the separate maturity classes are obtained from the Custom Test option (some knowledge of the design-matrix coding for categorical variables in JMP is needed to know that JMP uses a (1, -1) coding for indicator variables with 2 classes):
which gives the estimated K for each maturity class.
If you fit a separate regression for the two maturity classes (use the By option in the fit model box), you will get the same two estimates. The respective standard errors will be slightly different because the single model is able to pool over all of the data to estimate the standard errors, but separate fits cannot do any pooling.
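The pooled no-intercept model can be written down explicitly: with a (1, -1) coding for the two maturity classes (as JMP uses), the design matrix has one column for (Len/100)^3 and one for its product with the ±1 code. A sketch on simulated fish; the class K values of 14.0 and 13.2 are invented, not the estimates from the real data.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
length = rng.uniform(250.0, 400.0, n)
code = np.where(np.arange(n) < n // 2, 1.0, -1.0)  # (1, -1) maturity coding
x = (length / 100.0) ** 3

K_true = np.where(code == 1.0, 14.0, 13.2)         # invented class values
weight = K_true * x + rng.normal(0.0, 0.08 * length)

# No-intercept design: common term plus length-by-maturity interaction
X = np.column_stack([x, x * code])
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)

K_class1 = beta[0] + beta[1]   # class coded +1
K_class2 = beta[0] - beta[1]   # class coded -1
```

Fitting the two classes separately recovers the same two estimates; only the standard errors change, because the pooled model estimates a single residual variance from all of the data.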
The separate fitted lines are shown below:
Similarly, a comparison of K can be made among the three sex classes (M, F, and U), where immature fish cannot be sexed and are given the code U, while mature fish are further subdivided into the M and F classes (don't forget to uncheck the centering option in the triangle in the upper left corner of the Analyze->Fit Model dialogue). This comparison also shows evidence (p = .025) of a differential K among the three sex classes (this is not unexpected), and a contrast can be done to see if there is further evidence of a difference between the males and females:
As the p-value is .0074, there is also strong evidence of a differential K between the males and females.
A final plot of the three lines is:
Finally, because you have replicate fish at the same body length, it is possible to do a formal lack-of-fit test. The idea behind this test is to compare the variation in data points at the same replicated lengths (pure error) with the deviations around the line from the model (model error). If the model fits well, these two estimates of residual variance should be comparable:
The p-value for the lack-of-fit test is quite large, indicating no evidence of a lack of fit.
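The pure-error/lack-of-fit decomposition can be sketched by hand: pool the within-group variation at each replicated length (pure error), subtract it from the total residual sum of squares, and form an F ratio. The numbers below are invented; JMP's Lack of Fit report does the same arithmetic on the real data.

```python
import numpy as np

# Hypothetical replicate fish at repeated lengths (mm) with weights (g)
length = np.array([300.0, 300.0, 300.0, 320.0, 320.0, 340.0, 340.0, 340.0])
weight = np.array([370.0, 360.0, 380.0, 450.0, 445.0, 540.0, 560.0, 550.0])

x = (length / 100.0) ** 3
K_hat = np.sum(x * weight) / np.sum(x ** 2)     # through-origin fit
sse_model = np.sum((weight - K_hat * x) ** 2)   # total residual SS

# Pure error: variation among replicates sharing the same length
sse_pure, df_pure = 0.0, 0
for lv in np.unique(length):
    grp = weight[length == lv]
    sse_pure += np.sum((grp - grp.mean()) ** 2)
    df_pure += grp.size - 1

df_model = weight.size - 1                      # one parameter fitted
sse_lof = sse_model - sse_pure                  # lack-of-fit SS
df_lof = df_model - df_pure

F = (sse_lof / df_lof) / (sse_pure / df_pure)   # compare to F(df_lof, df_pure)
```

A p-value comes from the upper tail of an F(df_lof, df_pure) distribution (e.g. scipy.stats.f.sf(F, df_lof, df_pure)); a large p-value indicates no evidence of lack of fit.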
This same ANCOVA method can be used to compare the K values across lakes or across time within the same lake. If you have a large number of lakes, each measured multiple times, some very interesting models can be fit that are beyond the scope of these notes – please contact me. Similarly, interest may lie in modeling K as a function of other lake-specific covariates such as lake size, productivity, etc. Again, please contact me, as this is beyond the scope of these notes.
Statistical significance is not the same as biological significance! While there was evidence of a differential K in this data set, this statistical significance does not imply biological importance. I have no idea if the observed differences in K among these three groups have any biological meaning.
5.10 Final Notes
Some sections need to be added here on the following topics:
• danger of ANCOVA if there is no overlap in the covariate
• choice between a paired t-test, a multivariate test, or ANCOVA in the case of two time points
Chapter 6
Multiple linear regression
6.1 Introduction
In previous chapters, the relationship between a single, continuous variable (Y, a.k.a. the response variable) and a single continuous variable (X, a.k.a. the predictor or explanatory variable) was explored using simple linear regression. In this chapter, this will be generalized to the case of more than one explanatory (X) variable.[1]
There are many good books covering this topic – refer to the list in previous chapters.
Fortunately, many of the techniques learned in the previous chapter on simple linear regression carry over directly to the more general multiple regression. There are a few subtle differences in interpretation, and additional problems (such as variable selection) must be solved.
It turns out that multiple regression methods are very general methods covering a wide range of statistical problems under the rubric of general linear models. Surprisingly, multiple regression is a general solution for two-sample t-tests, for ANOVA models, for simple linear regression models, etc. The exact theory is beyond the scope of these notes, but intuitive explanations will be provided as needed.
6.1.1 Data format and missing values
The data are collected and stored in a tabular format with rows representing observations and columns representing different variables. One of the variables will be the response (Y) variable; there can be several predictor (X) variables. Virtually all computer packages require variables to be stored in columns and observations stored in rows.
[1] It is also possible to have more than one Y variable – this is known as multivariate multiple regression but is not covered in this chapter.
The response variable (Y) must be continuous. It is NOT appropriate to do multiple regression when the Y variable represents categories – the appropriate methodology in this case is logistic regression. If the Y variable represents counts, a technique known as Poisson regression may be more appropriate – consult the chapter on generalized linear models for more details. Finally, in some cases, the value of Y may be censored, i.e. the exact value is not known, but it is known to be beyond certain threshold values (e.g. above or below detection limits). The analysis of such data is beyond the scope of these notes – consult the chapter on Tobit analysis for details.
Surprisingly, there is much more flexibility in the type of the X variables. They may be continuous, as seen previously in simple linear regression, or they may be dichotomous variables taking only the values of 0 or 1 (known as indicator variables).[2] These indicator variables are used to represent different groups (e.g. male and female) in the data.
The dataset is assumed to be complete, with NO missing values in any of the X variables. If an observation (row) has some missing X values, most computer packages practice what is known as case-wise deletion, i.e. the entire observation will be dropped from the analysis. Consequently, it is always important to check the computer output to see exactly how many observations have been used in the analysis.
A missing Y value also implies that the observation (row) will be deleted from the analysis. However, if the set of X variables is complete, it is still possible to obtain a prediction of Y for the observed set of X values.
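Case-wise deletion is easy to verify directly. A sketch with pandas, using a hypothetical four-row table in which each of the last three rows is missing one value, so only a single complete case survives:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y":  [10.0, 12.0, np.nan, 15.0],
    "x1": [1.0,  2.0,  3.0,    np.nan],
    "x2": [5.0,  np.nan, 7.0,  8.0],
})

# Case-wise deletion: any row with a missing value is dropped entirely
complete = df.dropna()
n_used = len(complete)   # only the first row is complete
```

This is why checking the reported n in the regression output matters: three-quarters of this table silently disappears.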
As in previous chapters, missing data should be examined to see if they are missing completely at random (MCAR), in which case there is usually no problem in the analysis other than reduced sample size; missing at random (MAR), which is again handled relatively easily; or informatively missing (IM), which poses serious problems in the analysis. Seek help in the latter case.
6.1.2 The statistical model
The statistical model for multiple regression is an extension of that for simple linear regression.
The response variable, denoted by Y, is measured along with a set of predictor variables, denoted by X_1, X_2, ..., X_p, where p is the number of predictor variables.
The formal statistical model is:
Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ... + β_p X_ip + ε_i
where the unknown parameters are the set of β's. The deviation between the observed value of Y and the predicted value from the regression equation, ε_i, is distributed as a Normal distribution with a mean of 0 and an (unknown) variance of σ^2.
[2] In actual fact, any set of two distinct values may be used, but traditional usage is to use 0 and 1.
This is often written using a shorthand notation in many statistical packages as:
Y = X_1 X_2 ... X_p
where the intercept (β_0) and the residual variation (ε) are implicit.
This can also be written using matrices as:
Y = Xβ + ε
where Y is an n × 1 column vector, X is an n × (p + 1) matrix [don't forget the intercept column] of the predictors, β is a (p + 1) × 1 column vector (the intercept β_0, plus the p "slopes" β_1, ..., β_p), and ε is an n × 1 vector of residuals that has a multivariate normal distribution with a mean of 0 and a covariance matrix of Iσ^2, where I is the identity matrix.
Note that this format for multiple regression is very flexible. By appropriate definition of the X variables, many different problems can be cast into a multiple-regression framework. In future courses you will see that ANOVA (a technique to compare means among multiple groups) is actually nothing but regression in disguise!
6.1.3 Assumptions
Not surprisingly, the assumptions for a multiple regression analysis are very similar to those required for a simple linear regression.
Linearity
Because of the multiple X variables, the assumption of linearity is not as straightforward as for simple linear regression.
Multiple regression analysis assumes that the MARGINAL relationship between Y and each X is linear. This means that if all other X variables are held constant, then changes in the particular X variable lead to a linear change in the Y variable. Because this is a MARGINAL relationship, simple plots of Y vs. each X variable may not be linear, since the simple pairwise plots cannot hold the other variables fixed.
To assess this relationship, residuals from the fit should be plotted against each X variable in turn. If the scatter of the residuals is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the marginal relationship between Y and that particular X is not linear. Alternatively, fit a model that includes both X and X^2 and test if the coefficient associated with X^2 is zero. Unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit (what JMP calls the Lack of Fit test) can be performed, where the variation of the responses at the same X value is compared to the variation around the regression line.
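The X-plus-X^2 check can be sketched as follows. The data are simulated with deliberate curvature so the quadratic coefficient shows up clearly; this is an illustration of the mechanics, not a recipe that detects every kind of nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 60)
y = 2.0 + 1.5 * x + 0.4 * x ** 2 + rng.normal(0.0, 1.0, 60)  # curved truth

# Fit Y = b0 + b1*X + b2*X^2 and examine the t-ratio of b2
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
s2 = resid @ resid / (len(y) - X.shape[1])   # residual variance
cov = s2 * np.linalg.inv(X.T @ X)
t_b2 = beta[2] / np.sqrt(cov[2, 2])          # large |t| flags curvature
```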
Correct sampling scheme
The Y values must be a random sample from the population of Y values for every set of X values in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given set of X values, the Y values from the population must be a simple random sample.
This latitude gives considerable freedom in selecting points to investigate the relationship between Y and X. This will be discussed more in class.
No outliers or influential points
All the points must belong to the relationship – there should be no unusual points.
The plot of the residuals against the row number or against the predicted values should be investigated to see if there are unusual points.
The marginal scatterplot of the residuals from the fit vs. each X should be examined. As well, leverage plots (Section 6.2.6) are useful for detecting influential points.
Outliers can have a dramatic effect on the fitted line.
Equal variation along the line
The variability about the regression plane must be similar for all sets of X, i.e. the scatter of the points above and below the fitted surface should be roughly constant over the entire surface. This is assessed by looking at the plots of the residuals against each X variable to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.
Independence
Each value of Y is independent of any other value of Y. The most common cases where this fails involve time-series data.
This assumption can be assessed by again looking at residual plots against time or other variables.
Normality of errors
The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's, ignoring the X's. The assumption of normality only states that the residuals, the differences between the values of Y and the corresponding points on the line, must be normally distributed.
This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.
X variables measured without error
It sometimes turns out that the X variables are not known precisely. For example, if you wish to investigate the relationship of illness to second-hand cigarette smoke, it is surprisingly difficult to get an estimate of the "dose" of cigarettes that a worker has been exposed to.
This general problem is called the "error in variables" problem and has a long history in statistics. A detailed discussion of this issue is beyond the scope of these notes.
The uncertainty in each X variable should be assessed.
6.1.4 Obtaining Estimates<br />
The same principle of least squares as in simple linear regressi<strong>on</strong> is used to obtain estimates. In general, the<br />
sum of deviati<strong>on</strong>s between the predicted and observed values is computed, and the regressi<strong>on</strong> surface that<br />
minimizes this value is the final relati<strong>on</strong>ship.<br />
The estimated intercept and slopes can be compactly expressed using matrix notati<strong>on</strong><br />
̂β = (X ′ X) −1 X ′ Y<br />
but details are bey<strong>on</strong>d the scope of these notes. Hand <str<strong>on</strong>g>for</str<strong>on</strong>g>mulae are all but impossible except <str<strong>on</strong>g>for</str<strong>on</strong>g> trivially<br />
small examples - let the computer do the work. Of <str<strong>on</strong>g>course</str<strong>on</strong>g> this implies that the scientist has the resp<strong>on</strong>sibility<br />
to ensure that the brain in engaged be<str<strong>on</strong>g>for</str<strong>on</strong>g>e putting the package in gear!<br />
As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae but, in this age of computers, these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se. Most packages will compute the 95% confidence intervals for the slopes as well.
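A rough sketch of where the standard errors and the estimate ± 2 × se intervals come from, using the standard least-squares result Var(β̂) = MSE · (X′X)⁻¹ (the data are again illustrative, not the blood-pressure table):

```python
import numpy as np

# Illustrative data: intercept column plus two predictors.
X = np.array([
    [1.0, 50, 55], [1.0, 20, 47], [1.0, 30, 65],
    [1.0, 30, 47], [1.0, 50, 58], [1.0, 60, 46], [1.0, 40, 70],
])
Y = np.array([120.0, 141, 126, 117, 129, 123, 132])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

resid = Y - X @ beta_hat
mse = resid @ resid / (n - p)          # estimate of sigma^2 (the MSE)
se = np.sqrt(mse * np.diag(XtX_inv))   # standard error of each estimate

# Approximate 95% confidence interval: estimate +/- 2 * se
lower, upper = beta_hat - 2 * se, beta_hat + 2 * se
for b, lo, hi in zip(beta_hat, lower, upper):
    print(f"{b:9.3f}  [{lo:9.3f}, {hi:9.3f}]")
```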
©2012 Carl James Schwarz, November 23, 2012
CHAPTER 6. MULTIPLE LINEAR REGRESSION
Once the fit has been obtained, the fit of the model can be assessed in various ways, as outlined below.

The overall fit of the model is assessed using a Whole Model Test that is traditionally placed in an ANOVA table. This test examines if there is at least one X variable that seems to be marginally related to the Y values. Usually, it is of little interest.

The individual marginal contribution of each X variable (how each X variable affects the response holding all the other X variables constant) can be assessed directly either from the reported estimates and standard errors or from an Effect Test – these are exactly equivalent.
Formal tests of hypotheses about the marginal contribution of each variable can also be done. Usually, these are only done on the slope parameters, as these are typically of most interest. The null hypothesis is that the population marginal slope of a particular X variable is 0, i.e. there is no marginal relationship between Y and that particular X. More formally, the null hypothesis for the Xᵢ variable is:

H: βᵢ = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.
The alternate hypothesis is typically chosen as:

A: βᵢ ≠ 0

although one-sided tests looking for either a positive or negative slope are possible.

The test statistic is found as

T = (bᵢ − 0) / se(bᵢ)

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is usually done automatically by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true.
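The test statistic and its two-sided p-value can be sketched as follows. The estimate, its standard error, and the degrees of freedom below are made-up numbers; SciPy's t distribution supplies the tail probability.

```python
import numpy as np
from scipy import stats

# Hypothetical values from a fitted multiple regression:
b_i, se_b_i = 0.45, 0.18   # estimated slope and its standard error
df = 9                     # error degrees of freedom, n - p

T = (b_i - 0) / se_b_i                 # test statistic for H: beta_i = 0
p_value = 2 * stats.t.sf(abs(T), df)   # two-sided p-value

print(T, p_value)
```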
It is also possible to obtain tests for sets of predictors (e.g. can several X variables be simultaneously dropped from the model?), as will be seen later in the notes.

Finally, if there are a large number of X variables, is there an objective way to decide which subset of the X variables is useful in predicting Y? Again, this is deferred until later in this chapter.
6.1.5 Predictions

Once the best-fitting model is found, it can be used to make predictions for new sets of X.

There are two types of predictions that are commonly made. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.
First, the experimenter may be interested in predicting a SINGLE future individual value for a particular set of X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular set of X.³ The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence intervals for the slopes.

In both cases, the estimate is found in the same manner – substitute the new set of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new “dummy” observation in the dataset with the value of Y missing, but the values of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X values allow the package to compute an estimate for this observation.
What differs between the two predictions are the estimates of uncertainty.

In the first case (predicting a single value), there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that the estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.

In the second case (predicting the mean of future responses), only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus the individual variation around the fitted line.
Many textbooks have the formulae for the se for the two types of predictions but, again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.
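A sketch of how the two intervals differ, assuming the usual formulas se(mean) = √(MSE·h₀) and se(individual) = √(MSE·(1+h₀)), where h₀ = x₀′(X′X)⁻¹x₀ is the leverage of the new point. The data and the new point x₀ are illustrative, not the blood-pressure example.

```python
import numpy as np

# Illustrative fit: intercept plus two predictors.
X = np.array([
    [1.0, 50, 55], [1.0, 20, 47], [1.0, 30, 65], [1.0, 30, 47],
    [1.0, 50, 58], [1.0, 60, 46], [1.0, 40, 70], [1.0, 55, 42],
])
Y = np.array([120.0, 141, 126, 117, 129, 123, 132, 123])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
mse = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

x0 = np.array([1.0, 45, 60])      # new set of X (with the intercept term)
y0_hat = x0 @ beta_hat            # same point estimate for both intervals

h0 = x0 @ XtX_inv @ x0            # leverage of the new point
se_mean = np.sqrt(mse * h0)       # for the MEAN of all future responses
se_pred = np.sqrt(mse * (1 + h0)) # for a SINGLE future response

print("CI for mean:      ", y0_hat - 2 * se_mean, y0_hat + 2 * se_mean)
print("PI for individual:", y0_hat - 2 * se_pred, y0_hat + 2 * se_pred)
```

The extra "1 +" inside se_pred is the individual variation around the fitted surface, which is why the prediction interval is always the wider of the two.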
6.1.6 Example: blood pressure

Blood pressure tends to increase with age, body mass, and stress. To investigate the relationship of blood pressure to these variables, a sample of men in a large corporation was selected. For each subject, their age (years), body mass (kg), and a stress index (ranging from 0 to 100) were recorded along with their blood pressure.

The raw data are presented in the following table:

³ There is actually a third interval, the mean of the next “m” individual values, but this is rarely encountered in practice.
Age       Blood Pressure   Body Mass   Stress Index
(years)   (mm Hg)          (kg)        (no units)
 50           120              55           69
 20           141              47           83
 20           124              33           77
 30           126              65           75
 30           117              47           71
 50           129              58           73
 60           123              46           67
 50           125              68           71
 40           132              70           77
 55           123              42           69
 40           132              33           74
 40           155              55           86
 20           147              48           84
 31             .              53           86
 32           146              59            .
JMP Analysis

The raw data are also available in a JMP data sheet called bloodpress.jmp, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data have been entered with rows corresponding to the different subjects and columns corresponding to the different variables:
Notice that the response variable is continuous, as are the other variables.⁴ Also notice that the blood pressure value is missing for one subject – it cannot be used in the analysis, but predictions can be made for this subject as all the X values are present. One subject is missing one of the X variables – this subject cannot be used in the fitting process nor for making predictions. The remaining sample size is only 13 subjects.

As usual, the researcher needs to think about why certain values are missing.
It is also interesting to note that measurement error in the X variables could be a concern. For example, it is highly unlikely that the first subject is exactly 20.000000 years old! People usually truncate their age when asked, e.g. even on the day before their 21st birthday, a person will still respond that their age is 20 years old. Here the error in aging ranges from about 5% of the value (when age is around 20 years old) to about 2% (when the age is around 50 years old). How was weight collected? If the subjects were actually weighed, the actual number may not be in dispute (i.e. it is unlikely that the scale is wrong), but then the weight includes shoes, clothes, and ???? If the weight is a recalled measurement, many people under-report their actual weight, often by quite a margin. And how is stress measured? It is likely an index based on a survey, but it is not even clear how to numerically measure stress – after all, stress can’t simply be measured like temperature.

Begin by plotting the variables against each other – a simple way is a scatter plot matrix, available under the Analyze->MultiVariateMethods->Multivariate platform:

⁴ In actual fact, these variables have been discretized but, as the discretization interval is small relative to typical values, they can be treated as being continuous.
The scatter plot matrix shows no strong simple relationships between pairs of variables. Rather surprisingly, weight seems to decrease with age, and there appears to be a general increase of blood pressure with weight.

These pairwise scatter plots are primarily useful for checking for outliers and other problems in the data – often a multivariate relationship is too complex to be seen in simple pairwise plots.
We will fit the model where the response variable (blood pressure) is modeled as a function of the three predictor variables (age, weight, and stress index). Using the shorthand notation discussed earlier, the model is

BloodPressure = Age Weight Stress

This model is fit using the Analyze->Fit Model platform:
The X variables can be listed in any order.

The output from the Analyze->Fit Model platform is voluminous and cannot be displayed in one panel, so it is necessary to look at several parts in more detail.

Because of the missing values, only 13 subjects could be used in the model fit:
The number of cases actually used in the fit should always be ascertained because, in large datasets, the missing-value pattern may not be easily discerned.

First, assess the overall fit of the model by examining the plot of the actual blood pressure vs. the predicted blood pressure. If the model made exact predictions, then the points on the plot would all lie perfectly on the 45° line. The plot from this fit:

shows that most points lie fairly close to the 45° line. As well, there are no points that appear to have undue leverage on the fit, as there is a general scatter around the 45° line.

The residual plot:
also shows a random scatter of residuals around the value of 0, with no apparent pattern.

The whole model test, i.e. whether any of the X variables provide information on predicting Y, is found in the Analysis of Variance table:

The p-value is very small, and so there is good evidence that at least one X variable appears to predict the blood pressure. Of course, at this point, it is unclear which X variables are good predictors and which X variables may be poor predictors.

The fitted regression equation is found by looking at the Parameter Estimates area:
and is:

Predicted BloodPressure = −61.3 + 0.45(Age) − 0.087(Stress) + 2.37(Weight)

These coefficients are interpreted as the MARGINAL increase in blood pressure when each variable changes by 1 unit AND ALL OTHER VARIABLES REMAIN FIXED. For example, the coefficient of 0.45 for age indicates that the estimated blood pressure increases by 0.45 units for each year of increase in age, assuming that the stress index and weight remain constant. The concept of marginality, i.e. the marginal increase in Y when a single X variable is changed but all other X variables are held fixed, is the crucial concept in multiple regression. In some cases, for example polynomial regression, it is impossible to hold all other X variables fixed, as you will see later in this chapter.

The sign of the coefficient for stress is somewhat surprising but, as you will see in a few minutes, is nothing to worry about.
Are there any X variables that don’t appear to be useful in predicting blood pressure? The Effect Tests or the Parameter Estimates table provide some clues:

The p-values from the Effect Tests table and the Parameter Estimates table are identical; the F-statistic is simply the t-ratio squared. These are MARGINAL tests, i.e. is a particular X variable useful in predicting the blood pressure given that all other variables remain in the model? For example, the test for age examines if blood pressure changes with age after adjusting for stress and weight. The test for stress examines if blood pressure changes with stress after adjusting for age and weight.
In this example, the p-value for stress appears to be not statistically significant. This would imply that blood pressure does not seem to increase with stress after adjusting for age and weight. This would indicate that perhaps stress could be dropped from the model, and a final model using only age and weight may be suitable. Consequently, the negative sign on the coefficient is not really worrisome.
Again, this concept of marginality is crucial for the proper interpretation of the statistical tests. If two X variables are related, it is possible that both of the statistical tests could be non-significant, but this does not imply that both variables can be dropped from the model. Later in this chapter (Section 6.4), it will be shown how to test if multiple variables can be simultaneously dropped from the model.

The leverage plots should also be examined to see that any relationship between the predictor and response variables is not highly dependent upon a single (high-leverage) point:

Leverage plots, in general, examine the new information in each X variable for predicting Y after adjusting for all the other variables in the model. The general theory is presented in Section 6.2.6. Two features of the plot should be examined. The general statistical significance of the X variable is found by considering the slope of the line and whether the confidence curves contain the horizontal line:
We see that the confidence curves in the leverage plots for age and weight both do not contain the horizontal line. However, the confidence curve on the leverage plot for stress includes the horizontal line, indicating that this variable’s contribution to predicting blood pressure is not statistically useful.

The second feature of leverage plots that should be examined is the distribution of points along the X axis of the leverage plot. There should be a fairly even distribution along the bottom axis, and the fitted line in the leverage plot should not be heavily influenced by a few points with high leverage.

By clicking on the red triangle associated with the fit:

it is possible to save various predictions to the data table. For example, save the predicted values and the two types of confidence intervals (for the mean and for individuals):
Notice that for observation 14, only the blood pressure was missing, and so predictions of the blood pressure for that individual can be made. However, for individual 15, at least one of the X variables had a missing value, and so no predictions can be made.

The predictions are found simply by substituting the X values into the prediction equation. As in simple linear regression, there are two different confidence intervals. The confidence interval for the MEAN response would be useful for predicting the average blood pressure over many people with the same values of X as recorded. The confidence interval for the INDIVIDUAL response would be useful for predicting the blood pressure for a single future individual with those particular X values. A common error is to confuse these two types of intervals.
As in simple linear regression, a common way to make predictions is to add rows to the end of the data table with the Y variable deliberately set to missing and the X values set to those of interest. These rows are NOT used in the model fitting but, because the X set is complete, predictions can be made.
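The same dummy-row trick can be sketched with pandas and NumPy. The column names and values below are hypothetical, not the JMP table: rows with Y missing are excluded from the fit but still receive predictions when all X values are present.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row has bp missing (gets a prediction),
# one row has weight missing (gets no prediction).
df = pd.DataFrame({
    "age":    [50, 20, 30, 50, 40, 31, 32],
    "weight": [55, 47, 65, 68, 70, 53, np.nan],
    "bp":     [120, 141, 126, 125, 132, np.nan, 146],
})

fit_rows = df.dropna()  # case-wise deletion: only complete rows are fit
X_fit = np.column_stack([np.ones(len(fit_rows)), fit_rows["age"], fit_rows["weight"]])
beta, *_ = np.linalg.lstsq(X_fit, fit_rows["bp"].to_numpy(), rcond=None)

# Predict wherever all X values are present (Y itself may be missing).
have_x = df[["age", "weight"]].notna().all(axis=1)
X_all = np.column_stack([np.ones(have_x.sum()), df.loc[have_x, "age"], df.loc[have_x, "weight"]])
df.loc[have_x, "predicted_bp"] = X_all @ beta

print(df)  # the row missing bp gets a prediction; the row missing weight does not
```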
If the residuals are saved to the data table, a normal probability plot of the residuals can be made using the Analyze->Distribution platform on the saved residuals.

Similarly, the residuals can be plotted against each X variable in turn to assess if there is a linear marginal relationship between Y and each X variable. Each of these residual plots should show a random scatter around zero.

It is also possible to do inverse predictions, but this is beyond the scope of these notes.
There are lots of other interesting features of the Analyze->Fit Model platform that are beyond the scope of these notes.
6.2 Regression problems and diagnostics

6.2.1 Introduction

“All models are wrong, but some are useful.” – G.E.P. Box, on page 424 of Empirical Model-Building and Response Surfaces (1987), co-authored with Norman R. Draper.

This famous quote implies that no study ever satisfies the assumptions made when modeling the data. However, unless the violations are extreme, perhaps the model can still be useful for making predictions.

In this section, we will take a detailed look at a number of diagnostic measures to assess the fit of our model to the data.
6.2.2 Preliminary characteristics

Before building complex models, the analyst should become familiar with the basic properties of their data. This is accomplished by:

• Examine the RRR’s of experimental and survey design as they relate to this study.
• What is the scale (nominal, ordinal, interval, ratio) of each variable?
• Which are the predictor and which are the response variables?
• What is the type (discrete, continuous, discretized continuous) of each variable?
Then do some basic plots and tabulations to spot potential problems in the data:

• Missing values. Examine the pattern of missing values. Most regression packages practice case-wise deletion, i.e. any observation (row) that is missing any of the X variables or the Y variable is not used in the analysis. If you have a large dataset with many X variables, even a small percentage of missing values can lead to many rows being deleted from the analysis. Think about how the missing values came about – are they MCAR, MAR, or IM? JMP has a nice feature to tabulate the pattern of missing values under the Tables menu.

• Single-variable descriptive statistics. For each variable in the dataset, do some basic descriptive statistics and plots (e.g. histograms, dot-plots, box-plots) to identify potentially extreme observations. Check that all values are plausible, e.g. if one variable records the sex of the subject, only two possible values should be recorded; it is unlikely that a woman has 20 natural children; it is unlikely that a human male is more than 3 m tall; etc.
• Pairwise plots. Create bivariate plots of all the variables. Check for unusual-looking observations. These may be perfectly valid observations, but they should be examined in more detail to make sure. A casement plot (a matrix of pairwise scatter plots) can be created easily in JMP using the Analyze->MultiVariateMethods->Multivariate platform.
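The case-wise deletion point above is easy to underestimate, so here is a small simulation of the effect (purely illustrative; the 5% missing rate and 10 variables are arbitrary). With missing values scattered independently across columns, a row survives only if every one of its values is present.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, k = 200, 10
data = pd.DataFrame(rng.normal(size=(n, k)), columns=[f"x{i}" for i in range(k)])

# Make each cell missing independently with probability 0.05 (5% of values).
miss = rng.random((n, k)) < 0.05
data = data.mask(miss)

# Case-wise deletion: a row is dropped if ANY variable is missing.
complete = data.dropna()
print(f"{len(complete)} of {n} rows survive case-wise deletion")
```

With only 5% of individual values missing, roughly 0.95¹⁰ ≈ 60% of rows survive, so about 40% of the dataset is silently discarded.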
6.2.3 Residual plots

After the model is fit, compute the residuals, which are simply the VERTICAL differences between the observed and predicted values, ε̂ᵢ = Yᵢ − Ŷᵢ. Most computer packages will compute and plot residuals easily.

The basic assumption about the VERTICAL discrepancies was that they have a mean of zero and a CONSTANT variance σ². We estimated the variance by the MSE in the ANOVA table.

There are several different types of residuals that can be computed and plotted:

• Standardized residual. This is simply computed as zᵢ = ε̂ᵢ / √MSE and is an attempt to create residuals with a mean of 0 and a variance of 1, i.e. like a standard normal distribution. Because all the residuals are divided by the same value, the pattern seen in the standardized residuals will be the same as that seen in the ordinary residuals.
• Studentized residual. The precision of the predictions changes at different parts of the regression line. You saw earlier that the confidence band for the mean response gets wider as the prediction point moves further away from the center of the data. The studentized residual (see the book for computational details) attempts to standardize each residual by its approximate precision. Because each residual is adjusted individually, plots of the studentized residuals will look slightly different from those of the regular or standardized residuals, but they will be similar.

• Jackknifed residual. Less commonly computed, jackknifed residuals are computed by fitting a regression line after dropping each point in turn, and then finding the residual. For example, if there were 4 data points, the jackknifed residual for the first point would be the difference between the observed value and the predicted value based on a regression line fit to points 2, 3, and 4 only. The jackknifed residual for the second observation would be the difference between the observed value and the predicted value based on the 1st, 3rd, and 4th observations. Plots based on these residuals will appear similar, but not exactly the same as, plots based on the other residuals.
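The three residual types above can be sketched for a tiny simple-regression dataset (the numbers are illustrative). For the studentized version, the internally studentized form rᵢ = ε̂ᵢ / √(MSE(1 − hᵢᵢ)), with hᵢᵢ taken from the hat matrix, is one common way of "standardizing each residual by its approximate precision".

```python
import numpy as np

# Tiny illustrative dataset: intercept plus one predictor.
X = np.array([[1.0, 1], [1.0, 2], [1.0, 3], [1.0, 4], [1.0, 5], [1.0, 6]])
Y = np.array([1.1, 1.9, 3.2, 3.8, 5.3, 5.7])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T          # hat matrix; h_ii is the leverage of point i
beta = XtX_inv @ X.T @ Y
resid = Y - X @ beta
mse = resid @ resid / (n - p)

standardized = resid / np.sqrt(mse)                    # same pattern as raw residuals
studentized = resid / np.sqrt(mse * (1 - np.diag(H)))  # each scaled by its own precision

# Jackknifed residuals: refit with point i removed, then predict point i.
jackknifed = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    jackknifed[i] = Y[i] - X[i] @ b_i

print(standardized)
print(studentized)
print(jackknifed)
```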
Several plots can be constructed. First, look at the univariate distribution of the residuals. Which observations correspond to the largest negative and positive residuals?
Second, plot the residuals against each predictor variable, against the PREDICTED Y values, and against the order in which the data were collected (this may be, but is not necessarily, the order of the observations in the dataset). Don’t plot the residuals against the observed Y values, because you will see strange patterns that are artifacts of the plot.⁵ A good residual plot will show random scatter around zero; bad residual plots will show a definite pattern. Typical residual plots are illustrated below – with small datasets, the patterns will not be as clear cut.

⁵ Basically, negative residuals will be associated with smaller Y values, and these will increase as Y increases, and then crash and rise and then crash and rise again.
With small datasets, don’t over-analyze the plots – only gross deviations from the ideal plots are of interest.

Modern alternatives to residual plots are to plot the absolute value of the residuals and fit LOWESS curves through them. Consult our Stat 400 course (Data Analysis) for details.

Many books present formal tests for residuals – I find these not particularly useful, and prefer the simple residual plots. However, one useful diagnostic is the Durbin-Watson test for autocorrelation – consult the chapter on trend analysis in this collection for details.
Finally, many books also present what are known a normal probability plots to assess the normality of<br />
the residuals. Again, I have found these to be less than useful.<br />
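The Durbin-Watson statistic mentioned above is easy to compute directly from the residuals. The following sketch uses simulated data (not one of the course datasets) purely to illustrate the calculation:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: the sum of squared successive differences
    of the residuals divided by the residual sum of squares.  Values near 2
    suggest no lag-1 autocorrelation; values near 0 suggest positive
    autocorrelation, and values near 4 negative autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Residuals from a correctly specified model with independent errors
# should give a statistic near 2.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 200)
y = 3.0 + 0.5 * x + rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
print(round(durbin_watson(resid), 2))   # close to 2
```

A formal significance assessment still requires the Durbin-Watson critical values; this sketch only shows where the number comes from.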
6.2.4 Actual vs. Predicted Plot

In multiple regression, it is very difficult to look at plots of Y vs. each X variable and come to anything very useful. In general, you are trying to view a multi-dimensional space in two dimensions.

A plot of the actual Y vs. the predicted Y's is useful to assess how well the model does in predicting each observation. This plot is produced automatically by JMP and many other packages. In some packages, you will have to save the predicted values and do the plot yourself.
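A small numerical aside (simulated data; variable names are illustrative): when the model contains an intercept, the squared correlation between the actual and predicted values equals R², so the actual-vs-predicted plot is a direct visual display of overall fit:

```python
import numpy as np

# Simulated data standing in for a package's saved predicted values.
rng = np.random.default_rng(42)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([10.0, 2.0, -1.5]) + rng.normal(scale=2.0, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta                         # the predicted Y values

# Tight clustering of (y_hat, y) around the 45-degree line means a good
# fit; numerically, corr(y, y_hat)^2 equals R^2 with an intercept present.
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
corr = np.corrcoef(y, y_hat)[0, 1]
print(np.isclose(r2, corr ** 2))         # True
```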
6.2.5 Detecting influential observations

An influential observation is defined as an observation whose deletion greatly changes the results of the regression. There are many techniques available for spotting individual influential points; however, many of these methods will fail to detect pairs of influential points in close proximity to each other.
Cook's D

One popular measure of an observation's influence is Cook's distance. This statistic measures the extent to which the regression coefficients change when each individual observation is deleted. It is a summary measure of the impact of the observation's deletion and is a weighted sum 6 of (β̂_0 − β̂_{0(−i)})², (β̂_1 − β̂_{1(−i)})², ..., (β̂_k − β̂_{k(−i)})², where β̂_{k(−i)} is the regression coefficient for the k-th variable after dropping the i-th observation.

If a point has no effect on the fit, then D_i will be zero. Large values of D_i indicate points that have a large influence on the fit. There is no easy rule for determining which values of D_i are extreme. 7 A general rule of thumb is to look at the distribution of the D's and examine those observations corresponding to extreme values.

6 Refer to the original paper for the exact formula.

7 An often-quoted rule is to look at values of D_i that are greater than 1, but recent work has shown that this rule does not perform effectively.
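The notes defer to Cook's original paper for the exact formula; the standard textbook version (an assumption here) weights the coefficient changes by X′X and scales by p·s², which reduces to a closed form involving only the residual and hat value. The sketch below, on simulated data, checks the closed form against the literal delete-one-and-refit definition:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D via the closed form D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2),
    where h_i is the i-th hat (leverage) value, p the number of estimated
    coefficients, and s^2 the residual mean square."""
    n, p = X.shape
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    s2 = e @ e / (n - p)
    return e ** 2 * h / (p * s2 * (1 - h) ** 2)

def cooks_distance_by_deletion(X, y):
    """The definition: refit without each observation and form the weighted
    sum of squared changes in the regression coefficients."""
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    s2 = e @ e / (n - p)
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
        d = beta - b_i
        D[i] = d @ (X.T @ X) @ d / (p * s2)
    return D

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=30)
print(np.allclose(cooks_distance(X, y), cooks_distance_by_deletion(X, y)))  # True
```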
Hats

An oddly named set of statistics are the hats, or leverage values. These are computed under the idea that if a point has extreme influence, the regression should predict it exactly. Consequently, the hats are computed from what is known (for historical reasons) as the hat matrix, which is defined as X(X′X)⁻¹X′ and should not be attempted by hand! If a hat value is larger than about twice the average hat value, this is usually taken to indicate an influential point. There are more formal rules for checking the hat values, but these are seldom worthwhile.
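A brief sketch (simulated data) of why "twice the average hat value" is a convenient screen: the hat values always sum to p, the number of coefficients, so their average is p/n regardless of the data:

```python
import numpy as np

# Hat (leverage) values are the diagonal of H = X (X'X)^{-1} X'.
# Their sum is always p, so the average is p/n and a common screen
# flags observations with h_i > 2p/n.
rng = np.random.default_rng(7)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
X[0, 1:] += 6.0                      # make one row deliberately extreme

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
p = X.shape[1]
flagged = np.where(h > 2 * p / n)[0]
print(0 in flagged)                  # True: the extreme row is flagged
```

(The explicit matrix inverse is fine for a toy illustration; production code would use a QR decomposition instead.)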
Caution

It is clear that some observations must be the most extreme in every sample, and so it would be silly to automatically delete these extreme observations without careful consideration of the underlying data! The purpose of Cook's D and other similar statistics is to warn the analyst that certain observations require additional scrutiny. Don't data snoop simply to polish the fit!
6.2.6 Leverage plots

These are likely the most useful of the diagnostic tools for spotting influential observations and are produced by many computer packages.

The leverage plots produced by JMP are examples of what are also called partial regression plots or adjusted variable plots. They are constructed for each individual variable. Suppose that we are regressing Y on four predictors X_1, ..., X_4. The leverage plot for X_1 is constructed as follows:

1. Find the residuals when Y is regressed against all the other variables except X_1, i.e. fit the model Y = X_2 X_3 X_4. Denote this residual as ε̂_{Y|X(−1)}, where the −1 indicates that the first variable was dropped from the set of X's.

2. Find the residuals when X_1 is regressed against the other X variables, i.e. fit the model X_1 = X_2 X_3 X_4. Denote this residual as ε̂_{X_1|X(−1)}, where the −1 indicates that the first variable was dropped from the set of X's.

3. Plot the first residual against the second residual for each observation. 8

Now if X_1 has no further information about Y (after accounting for the other X's), then the X_1 variable really isn't needed, and so all the first residuals should be centered around zero with random scatter.

8 JMP actually adds the mean of Y and X_1 to the residuals before plotting, but this does not change the shape of the plot.
But suppose that X_1 is important in predicting Y. Then the residuals from the regression of Y on the other X variables will be missing the contribution of X_1, and the residual plot will show an upward (or downward) trend relative to the other residuals. In fact, if you fit a regression line to the leverage plot, the slope will equal the slope in the full regression model. If the contribution of X_1 is not linear, then the plot will show a non-linear relationship.

Why is X_1 regressed against the other X variables? Recall that the interpretation of a slope in multiple regression is the MARGINAL contribution after adjusting for all other variables in the model. In other words, the slope reflects the NEW information in X_1 after adjusting for the other X's. How is the new information in X_1 found? Yes, by regressing X_1 against the other variables. For example, suppose that X_1 was an exact copy of another variable in the dataset. Then the second residuals would all be zero, indicating no new information (why?). So, if the leverage plot shows a very thin vertical band of points, this may be an indication that a certain variable does NOT have useful marginal information, i.e. is redundant given the other variables. This condition is known as multi-collinearity and is discussed later in this chapter.
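The claim that the slope fitted to a leverage plot equals the corresponding slope in the full regression is the Frisch-Waugh-Lovell result, and it can be verified numerically. A sketch on simulated data (variable names illustrative):

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients (X includes an intercept column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 60
x1, x2, x3 = rng.normal(size=(3, n))
y = 5 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)

ones = np.ones(n)
full = np.column_stack([ones, x1, x2, x3])
b_full = ols(full, y)                  # b_full[1] is the slope for x1

others = np.column_stack([ones, x2, x3])
e_y = y - others @ ols(others, y)      # step 1: residuals of y on the rest
e_x1 = x1 - others @ ols(others, x1)   # step 2: residuals of x1 on the rest

slope = (e_x1 @ e_y) / (e_x1 @ e_x1)   # step 3: slope of e_y on e_x1
print(np.isclose(slope, b_full[1]))    # True: leverage-plot slope = full slope
```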
If a single observation has high leverage, the leverage plot will show the observation as an outlier. The diagram below demonstrates some of the important cases for leverage plots:
In JMP, and many other packages, the points on these plots are hot-linked to the data sheet. By clicking on these points, you can identify the observation in the data sheet.

The concept of leverage plots is sufficiently important and non-obvious that a numerical example will be examined. In JMP, open the Fitness.jmp dataset from the JMP sample dataset library. This dataset consists of measurements taken on subjects: their age, weight, oxygen consumption, time to run a mile, and three measurements of their pulse rate.

The first few lines of the data file are:

Fit a model to predict oxygen consumption as the Y variable with age, weight, runtime, and the three pulse measurements as the X variables. The estimated slopes are:

and the leverage plot for Runtime is:
To reproduce this leverage plot, first fit the model for oxygen consumption, dropping the run-time variable, and save the residuals to the data sheet.

Next, regress run-time against the other X variables and save the residuals to the data sheet:
This will give the data sheet with two new columns added:

Finally, plot the Residual of Oxygen on all but runtime vs. the Residual of runtime on other X variables and fit a line through that plot using the Analyze->Fit Y-by-X platform:
You will see that this plot looks the same as the leverage plot (but the Y and X axes are scaled slightly differently) and that the slope on this plot (-2.639) matches the estimated slope seen earlier.

Leverage plots should be used with some caution. They will show the nature of the functional relationship with the variable, but not its exact form. Because these plots are constructed after adjusting for the other variables, a variety of curvature models should be investigated. Also, if the functional form of the other variables is incorrect (e.g. age² is needed but has not been added to the model), then the true nature of the relationship may be missed.

You can get JMP to save all the leverage pairs under the Save Columns pop-down menu.
6.2.7 Collinearity

It is often the case that many of the X variables are related to each other. For example, if you wanted to predict blood pressure as a function of several variables including height and weight, there is a strong relationship between these two latter variables. When the relationship among the predictor variables is strong, they are said to be collinear. This can lead to problems in fitting the model and in interpreting the results of a model fit. In this example, it is conceivable that you could increase the weight of a subject while holding height constant, but suppose the two variables were total hours of sunshine and total hours of clouds in a year. If one increases, the other must decrease.
Because the regression coefficients are interpreted as the MARGINAL contribution of each predictor, collinearity among the predictors can mask the contribution of a variable. For example, if both height and weight are fit in a model, then the marginal contribution of height (given weight is already in the model) is small; similarly, the marginal contribution of weight (given height is in the model) is also small. However, it would not be valid to say that the marginal contribution of both height and weight (together) is small. In Section 6.4, methods for testing if several variables can be deleted simultaneously from the model are presented.
If the predictor variables were perfectly collinear, the whole model-fitting procedure breaks down. It turns out that a certain matrix used in the model fitting cannot be numerically inverted (similar to trying to divide by zero) and no estimates are possible. If the variables are not perfectly collinear, many different sets of estimates can be found that give very nearly the same predictions!

Not all the story is bad: multicollinearity does not imply that the whole regression model is useless. Even if predictor variables are highly related, good predictions are still possible provided that you make predictions at values of X that are similar to those used in model fitting.
The basic tool for diagnosing potential collinearity is the variance inflation factor (VIF) for each regression coefficient. In JMP this is obtained by right-clicking on the table of parameter estimates after the Analyze->Fit Model platform is run. For example, the VIFs for the fitness dataset are:
The VIF is interpreted as the increase in the variance (se²) of the estimate compared to what would be expected if the variable were completely independent of all other predictor variables. The VIF equals 1 when a predictor is not collinear with the other predictors. VIFs that are very large, typically around 10 or higher, are usually taken as an indication of potential collinearity.

In the fitness dataset, there is evidence of collinearity in the average pulse rate during the run (Run Pulse) and the maximum pulse rate during the run (Max Pulse) variables. This is not unexpected.
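The VIF can also be computed directly from its definition, VIF_j = 1/(1 − R_j²), where R_j² is obtained by regressing the j-th predictor on all the others. A sketch on simulated data, with one predictor deliberately made a near-copy of another:

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of X (no intercept column):
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on
    all the other columns plus an intercept."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        xj = X[:, j]
        fit = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        r2 = 1 - np.sum((xj - fit) ** 2) / np.sum((xj - xj.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(5)
n = 100
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)       # c is nearly a copy of a
X = np.column_stack([a, b, c])
v = vif(X)
print(v[1] < 2 and v[0] > 10 and v[2] > 10)   # True: only the a/c pair is collinear
```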
If collinearity is detected, remedial measures include dropping some of the redundant predictor variables, 9 or more sophisticated fitting methods such as ridge or robust regression (which are beyond the scope of this course).

9 An obvious question is how do you tell which variables are redundant? Common methods are principal component analysis of the X variables, or examining the correlation among the predictors. Seek help if you run into a problem of extreme multicollinearity.
6.3 Polynomial, product, and interaction terms

6.3.1 Introduction

The assumption of a marginal linear relationship between the response variable and the X variable is sometimes not true, and quadratic and (rarely) cubic or higher polynomial terms in X are often fit in order to approximate this non-linear relationship.

The basic way to deal with polynomial regression (i.e. quadratic and higher terms) is to create new predictor variables involving X², X³, .... Although not necessary with modern software, it is often a good idea to center variables that will be used in quadratic and higher relationships to avoid a high degree of collinearity among the terms. For example, replace X and X² by (X − X̄) and (X − X̄)², respectively. While the actual coefficients may change, the p-values for testing the linear and quadratic slopes are unaffected, and predictions are also unaffected; this is exactly analogous to what happens in regression when there is a unit change between imperial and metric units for some variable.
The model fit is

Y_i = β_0 + β_1 X_i1 + β_2 X²_i1 + ε_i

If the square term is called X_2, the model is:

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ε_i

which now looks exactly like an ordinary multiple regression model.
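This reduction to ordinary multiple regression, and the effect of centering, can be sketched numerically (simulated data):

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(1.0, 10.0, 40)
y = 2 + 1.5 * x - 0.1 * x ** 2 + rng.normal(scale=0.5, size=40)

# Quadratic regression is ordinary multiple regression on X and a new X^2
# column; a second parameterization centers X before squaring.
raw = np.column_stack([np.ones_like(x), x, x ** 2])
xc = x - x.mean()
cen = np.column_stack([np.ones_like(x), xc, xc ** 2])

b_raw = np.linalg.lstsq(raw, y, rcond=None)[0]
b_cen = np.linalg.lstsq(cen, y, rcond=None)[0]

# Different coefficients, identical predictions -- and centering removes
# most of the collinearity between the linear and quadratic columns.
print(np.allclose(raw @ b_raw, cen @ b_cen))                    # True
print(abs(np.corrcoef(x, x ** 2)[0, 1]),
      abs(np.corrcoef(xc, xc ** 2)[0, 1]))   # near 1 vs. essentially 0
```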
The rest of the model fitting, testing, etc. proceeds exactly as outlined in previous sections. However, there are two potential problems with polynomial models.

• Models should be hierarchical. This means that if you include a term involving X² in the model, you must include a term involving X. If you include the quadratic but not the linear term, you are restricting the quadratic curve to be a very special shape, which is not usually wanted in practice. This will be outlined in class.

• The interpretation of the estimates must be done with care. Normally, the estimated slopes are the MARGINAL contribution of this variable to the response, i.e. after holding all other variables constant. However, if the regression equation includes both X and X² terms, it is impossible to hold X fixed while changing X² alone.
What degree of polynomial is suitable? This is usually determined by fitting successively higher polynomial terms until the added term is no longer statistically significant, and then using the previous model. While polynomial models allow for some degree of curvature in the response, it is very rare to fit terms involving cubic and higher powers. The reason for this is that such curves seldom have biological plausibility, and they have wide oscillations in their predicted values.
The researcher should also investigate if a transform of the Y or X variable may linearize the relationship. For example, a plot of log(Y) vs. X may show a linear fit. Similarly, 1/X may be a more suitable predictor. 10 It is possible to use least squares to actually fit non-linear models where no transformation or polynomial terms provide a good fit. This is beyond the scope of this course.
6.3.2 Example: Tomato growth as a function of water

An experiment was run to investigate the yield of tomato plants as a function of the amount of water provided over the season. A series of plots were randomized to different watering levels, and at the end of the season the yield of the plants was determined.

The raw data follow:
Water  Yield
    6   49.2
    6   48.1
    6   48.0
    6   49.6
    6   47.0
    8   51.5
    8   51.7
    8   50.4
    8   51.2
    8   48.4
   10   51.1
   10   51.5
   10   50.3
   10   48.9
   10   48.7
   12   48.6
   12   48.0
   12   46.4
   12   46.2
   14   43.2
   14   42.6
   14   42.1
   14   43.9
   14   40.5
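The quadratic fit can also be reproduced outside JMP by ordinary least squares on the rows transcribed above. Two caveats: the water = 12 group shows only four rows in this transcription, and JMP centers the squared term at the sample mean, so coefficients may differ slightly from the JMP output quoted later. The qualitative conclusion (a concave response with a strong fit) is unaffected:

```python
import numpy as np

# Tomato data as transcribed above (water = 12 group has four rows here).
water = np.array([6]*5 + [8]*5 + [10]*5 + [12]*4 + [14]*5, dtype=float)
yld = np.array([49.2, 48.1, 48.0, 49.6, 47.0,
                51.5, 51.7, 50.4, 51.2, 48.4,
                51.1, 51.5, 50.3, 48.9, 48.7,
                48.6, 48.0, 46.4, 46.2,
                43.2, 42.6, 42.1, 43.9, 40.5])

# Quadratic model with the squared term centered at 10, mirroring the
# JMP parameterization: Yield = b0 + b1*Water + b2*(Water - 10)^2
X = np.column_stack([np.ones_like(water), water, (water - 10.0) ** 2])
b0, b1, b2 = np.linalg.lstsq(X, yld, rcond=None)[0]

fitted = X @ np.array([b0, b1, b2])
r2 = 1 - np.sum((yld - fitted) ** 2) / np.sum((yld - yld.mean()) ** 2)
w_best = 10.0 - b1 / (2.0 * b2)       # vertex of the fitted parabola
print(b2 < 0, round(r2, 2))           # concave response, R^2 near 0.9
```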
10 For example, should the fuel economy of a car be measured as miles/gallon (distance/consumption) or L/100 km (consumption/distance)?
JMP Analysis:

The raw data are also available in a JMP data sheet called tomatowater.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are entered into JMP in the usual fashion: columns represent variables and rows represent observations. The scale of both variables should be continuous.

As usual, begin with a plot of the data:
The relationship is clearly non-linear and looks as if a quadratic may be suitable.

Before fitting the model, think about the assumptions required for the fit and assess if these are suitable for the data at hand.

There are two ways to fit simple polynomial models (i.e. those only involving polynomial terms in X) in JMP. If your regression model is a mixture of polynomial and other X variables, then the second method must be used.

In the first method, the Analyze->Fit Y-by-X platform can be used directly. For example, select the platform:
and choose Polynomial Fit:

which gives a plot of the fitted line:
and statistics about the fit:
The fitted curve is:

Yield = 57.726857 − 0.762·Water − 0.2928571·(Water − 10)²

Notice that JMP has automatically centered the quadratic term by subtracting the mean X of 10 from each value prior to squaring. As you will see in a few minutes, this has no effect upon the test of significance of the quadratic term, nor on the actual predicted values.
The ANOVA table can be used to examine if the linear and/or quadratic terms provide any predictive power. The table of estimates shows that the quadratic term is clearly statistically significant. Confidence intervals for the regression coefficients can be found in the usual fashion by right-clicking in the table and requesting the appropriate columns (not shown).

A residual plot is obtained in the usual fashion:
which shows no evidence of a problem.
If a cubic polynomial is fit (in the same fashion as the quadratic polynomial), you will see that the cubic term is not statistically significant, indicating that a quadratic model is sufficient.
Confidence bands for the mean response at each X and for an individual response at each X can also be obtained in the usual way:
Again, the scientist must understand the difference between the confidence bounds for each type of prediction, as outlined in earlier chapters.

The second way to fit polynomial models (and the only way when polynomial terms are intermixed with other variables) is to use the Analyze->Fit Model platform. First, variables corresponding to X² and X³ (if needed) must be created using the formula editor of JMP: 11

11 It is preferable to use JMP's formula editor rather than creating these variables outside of the data sheet because these columns will be hot-linked to the original column. If, for example, a value of X is updated, then the values of the squared and cubic terms will also be updated automatically.
and a portion of the resulting data table is shown below:
Note that the X variable was centered before squaring and cubing.

Now use the Analyze->Fit Model platform to fit using the water and water-squared terms:
The plot of actual vs. predicted shows a good fit:

The ANOVA table (not shown) can be used to assess the overall fit of the model as seen in earlier sections. The estimates match those seen earlier, as do the p-values:

Confidence intervals for the regression coefficients can be found in the usual fashion by right-clicking in the table and requesting the appropriate columns (not shown).
The leverage plot for the X² term shows that this polynomial term is required and is not influenced by any unusual values: 12
Confidence intervals for the mean response or individual responses are saved to the data table in the usual fashion (but are not shown in these notes):

12 Because of the hierarchical restriction, the leverage plot for the linear term is not of interest.
Finally, getting a plot of the actual fitted line takes a bit of work when using the Analyze->Fit Model platform. First, save the predicted values to the data table:
Then use the Overlay Plot under the Graph menu to plot the individual points and the predicted values:
and then join up the predicted values (and remove the fitted points)
to finally give the plot that we saw earlier (whew!). Unfortunately, there does not appear to be any way to draw a smooth curve short of getting predictions at many points between the observed values of X and drawing the curve through these smaller increments.
6.3.3 Polynomial models with several variables

The methods of the previous section can be extended to cases where several variables have quadratic or higher powers. It is also possible to include cross-products of these variables as well.

There are no conceptual difficulties in having multiple polynomial variables. However, the analyst must ensure that models are hierarchical (i.e. if higher powers or cross-products are included, then lower-order terms must also be included). Consequently, leverage plots of the lower-order terms are likely not very useful when higher-order terms are included in the model.

In practice, polynomial models are commonly restricted to quadratic terms or lower. The goal is not so much to elucidate the underlying mechanism of the response, but rather to get a good approximation to the response surface. Indeed, there is a whole suite of techniques (commonly called response surface methodology) used to fit and explore polynomial models in this context. Often, predictions of where the maximum or minimum response is found are important.
There are many excellent books available. JMP also has specialized tools in the Analyze->Fit Model platform to assist in the fitting of response surfaces. These are beyond the scope of these notes.
6.3.4 Cross-product and interaction terms
Recall that the interpretation of the regression coefficient associated with the i-th predictor variable is the
marginal (i.e. after keeping all other variables in the model fixed) increase in Y per unit change in X_i.
This marginal increase is the same regardless of the values of the other X variables.
But sometimes the contribution of the i-th variable depends upon the value of another, the j-th, predictor.
For example, suppose blood pressure tends to increase by .5 units for every kg increase in body mass for
people under 1.5 m in height, but tends to increase by .6 units for every kg increase in body mass for people
over 1.5 m in height. We would say that body mass interacts with the height variable. This concept is
very similar to the analogous interaction of factors in ANOVA models. 13
Consider a model where blood pressure depends upon age and height via the model:

BP = AGE HEIGHT

This corresponds to the formal statistical model of:

Y_i = β_0 + β_1 AGE_i + β_2 HEIGHT_i + ε_i

You can see that if age increases by 1 unit, then the value of Y increases by β_1 units regardless of the value
of height. Similarly, every time height increases by 1 unit, Y increases by β_2 regardless of the value of age.
Now consider the model written as:

BP = AGE HEIGHT AGE*HEIGHT

which corresponds to the formal statistical model of:

Y_i = β_0 + β_1 AGE_i + β_2 HEIGHT_i + β_3 AGE_i × HEIGHT_i + ε_i
The cross-product of age and height enters into the model as a new predictor variable. 14 Now look what happens
when age is increased by 1 unit. The value of Y increases not simply by β_1 but by β_1 + β_3 HEIGHT_i.
When height is small, the increase in Y per unit change in age is smaller than when height is large.
Similarly, an increase of 1 unit in the value of height will lead to an increase of β_2 + β_3 AGE_i. The effect
of height will be less for younger subjects than for older subjects.
The use of product terms in multiple regression can be easily extended to products involving more than
two variables, and, more importantly as discussed in Section 6.5.3, to products with indicator variables.
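As a numerical illustration of the interaction idea (the coefficient values below are invented for illustration, not estimates from any data set):

```python
def bp_with_interaction(age, height, b0, b1, b2, b3):
    """Predicted response for Y = b0 + b1*AGE + b2*HEIGHT + b3*AGE*HEIGHT."""
    return b0 + b1 * age + b2 * height + b3 * age * height

# Hypothetical coefficients, chosen only for illustration.
b = dict(b0=100.0, b1=0.5, b2=2.0, b3=0.1)

# The change in Y per 1-unit increase in AGE is b1 + b3*HEIGHT,
# so it depends on the value of HEIGHT:
slope_short = bp_with_interaction(41, 1.4, **b) - bp_with_interaction(40, 1.4, **b)
slope_tall = bp_with_interaction(41, 1.8, **b) - bp_with_interaction(40, 1.8, **b)
print(slope_short, slope_tall)  # b1 + b3*1.4 versus b1 + b3*1.8
```

Without the b3 cross-product term the two differences would be identical; with it, the marginal effect of age shifts with height, which is exactly the interaction described above.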
13 Indeed, this is not surprising as ANOVA is actually a special case of regression.
14 The actual X matrix would then have four columns. Column 1 would consist of all 1's; column 2 would consist of the values
of age; column 3 would consist of the values of height; and column 4 would contain the actual products of age and height for each
individual.
There is no real problem in fitting these models other than that the model must conform to the hierarchical
principle. This principle states that if terms like X_i X_j are in the model, so must be all lower order terms –
in this case, both X_i and X_j as separate terms must remain in the model. This is the same principle as you
saw for polynomial models.
6.4 The general linear test<br />
6.4.1 Introduction
In previous sections, you saw how to test if a specific regression coefficient in the population was zero using
the t-test provided by most computer packages. It is tempting, then, to try to test if multiple X variables
can be dropped simultaneously when their individual p-values are all not statistically significant.
Unfortunately this strategy often fails. The basic reason for its failure is that very often regression
coefficients are highly interrelated because their corresponding X variables are not orthogonal to each other.
For example, suppose that both height and weight were X variables in a model that was trying to predict
blood pressure. The tests of the hypotheses for the slopes for weight and height are MARGINAL tests, i.e. is
the slope associated with weight in the population zero assuming that all other variables (including height)
are retained in the model. Because of the high interdependency between height and weight, the p-value
for the test of marginal zero slope for weight may not be statistically significant. Similarly, the p-value for
the test of marginal zero slope for height (assuming that weight is in the model) may also be statistically
non-significant. However, both height and weight cannot be simultaneously removed from the model.
In order to test if a set of predictor variables can be simultaneously removed from the model, a General
Linear Test is performed. The mechanics of the test are:
1. Fit the full model, i.e. with all variables present. Find SSE_full from the full model.
2. Fit the reduced model, i.e. dropping the variables of interest. Find SSE_reduced from the reduced
model.
3. If the reduced model is still an adequate fit, then SSE_reduced should be very close to SSE_full – after
all, if the dropped variables were not important, then the increase in prediction error should be small.
Construct a test statistic as:

F_general = [ (SSE_reduced − SSE_full) / (df_SSE_reduced − df_SSE_full) ] / [ SSE_full / df_SSE_full ]
This is compared to an F-distribution with the appropriate degrees of freedom. Large values of the
F-statistic indicate evidence that not all variables can be simultaneously dropped.
Of course, this procedure has been automated in most statistical packages, as will be illustrated by an
example.
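The test statistic translates directly into code. A minimal sketch (the SSE values in the example call are made up for illustration, not taken from the body-fat example that follows):

```python
def general_linear_test(sse_full, df_full, sse_reduced, df_reduced):
    """F statistic for testing whether the variables dropped from
    the full model can all be removed simultaneously:

        F = [(SSE_reduced - SSE_full) / (df_reduced - df_full)]
            / [SSE_full / df_full]

    Large values are evidence against dropping all the variables."""
    num = (sse_reduced - sse_full) / (df_reduced - df_full)
    den = sse_full / df_full
    return num / den

# Hypothetical SSE values, for illustration only:
F = general_linear_test(sse_full=100.0, df_full=16,
                        sse_reduced=500.0, df_reduced=18)
print(F)  # (400/2) / (100/16) = 200 / 6.25 = 32.0
```

The resulting F would be compared to an F-distribution with (df_reduced − df_full, df_full) degrees of freedom.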
6.4.2 Example: Predicting body fat from measurements<br />
The percentage of body fat in humans is a good indicator of future problems with cardiovascular and other<br />
diseases.<br />
The following was taken from Wikipedia: 15<br />
Body fat percentage is the fraction of the total body mass that is adipose tissue. This index
is often used as a means to monitor progress during a diet or as a measure of physical fitness
for certain sports, such as body building. It is more accurate as a measure of health than body
mass index (BMI) since it directly measures body composition and there are separate body fat
guidelines for men and women. However, its popularity is less than BMI because most of the
techniques used to measure body fat percentage require equipment and skills that are not readily
available.
The most accurate method has been to weigh a person underwater in order to obtain the average
density (mass per unit volume). Since fat tissue has a lower density than muscles and bones,
it is possible to estimate the fat content. This estimate is distorted by the fact that muscles and
bones have different densities: for a person with a more-than-average amount of bone tissue, the
estimate will be too low. However, this method gives highly reproducible results for individual
persons (±1%). The body fat percentage is commonly calculated from one of two formulas:
Brozek formula: BF = (4.57/p − 4.142) × 100
Siri formula: BF = (4.95/p − 4.50) × 100

In these formulas, p is the body density in kg/L obtained by weighing the person out of water
and then dividing by the volume obtained by dunking the person underwater.
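The two formulas are easy to check numerically; a minimal sketch (the density value 1.05 kg/L is just an illustrative input):

```python
def brozek_bf(p):
    """Brozek formula: percent body fat from body density p in kg/L."""
    return (4.57 / p - 4.142) * 100

def siri_bf(p):
    """Siri formula: percent body fat from body density p in kg/L."""
    return (4.95 / p - 4.50) * 100

# A body density of 1.05 kg/L gives roughly 21% body fat under
# either formula; the two formulas agree closely in this range.
print(round(brozek_bf(1.05), 1))  # 21.0
print(round(siri_bf(1.05), 1))    # 21.4
```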
BTW, the American Council on Exercise has associated categories with ranges of body fat.
Women generally have less muscle mass than men and therefore they have a higher body fat
percentage range for each category.
Descripti<strong>on</strong> Women Men<br />
Essential fat 10-13% 2-5%<br />
Athletes 14-20% 6-13%<br />
Fitness 21-24% 14-17%<br />
Acceptable 25-31% 18-24%<br />
Obesity 32%+ 25%+<br />
Many studies have been done to see if predictions of body fat can be made based on simple measurements
such as circumferences of various body parts.
A study of middle-aged men measured the percentage of body fat using the difficult methods explained
above and also took measurements of the circumference of their thigh, triceps, and mid-arm.
15 2006-05-15, at http://en.wikipedia.org/wiki/Body_fat_percentage<br />
Here are the raw data:<br />
Triceps Thigh Mid-arm PerBodyFat<br />
19 43 29 11.9<br />
24 49 28 22.8<br />
30 51 37 18.7<br />
29 54 31 20.1<br />
19 42 30 12.9<br />
25 53 23 21.7<br />
31 58 27 27.1<br />
27 52 30 25.4<br />
22 49 23 21.3<br />
25 53 24 19.3<br />
31 56 30 25.4<br />
30 56 28 27.2<br />
18 46 23 11.7<br />
19 44 28 17.8<br />
14 45 21 12.8<br />
29 54 30 23.9<br />
27 55 25 22.6<br />
30 58 24 25.4<br />
22 48 27 14.8<br />
25 51 27 21.1<br />
JMP Analysis<br />
The raw data is also available in a JMP data sheet called bodyfat.jmp available from the Sample Program
Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Fit the multiple-regression model using the Analyze->Fit Model platform:
The resulting estimates all have tests for the marginal population slope statistically non-significant:
But at the same time, the whole model test:
shows that there is predictive ability in these X variables because the overall p-value is statistically significant.
The problem is that the X variables are all highly related. Indeed, a scatter-plot matrix of the X variables
shows a high degree of relationship among them:
A general linear test for dropping, say, both the triceps and thigh X variables is constructed using the
Custom Tests pop-down menu item:
and then specifying which X variables are to be tested together. You need a separate column in the Custom
Test for each variable to be tested – if you specify multiple variables in a single column, you will get a test
for a crazy hypothesis:
The final result:<br />
has a p-value of .000003, which is very strong evidence that both variables cannot be dropped simultaneously.
If you look at the ANOVA table from the full model:
the SSE_full = 100.1 with 16 df.
The reduced model is fit using the Analyze->Fit Model platform with just the Mid-arm variable, and
the reduced model ANOVA table is:
with the SSE_reduced = 487.4 with 18 df.
The general linear test is found as:

F_general = [ (SSE_reduced − SSE_full) / (df_SSE_reduced − df_SSE_full) ] / [ SSE_full / df_SSE_full ]
          = [ (487.4 − 100.1) / (18 − 16) ] / [ 100.1 / 16 ]
          = 193.65 / 6.26
          = 30.94
which is the value reported above.<br />
6.4.3 Summary<br />
The general linear test is often used to test if a “chunk” of X variables can be removed from the model.
Often this chunk will be a set of variables that has something in common.
For example, often all quadratic terms are tested simultaneously, or a variable and all its higher order
terms (e.g. X, X^2, X^3, etc.).
6.5 Indicator variables<br />
6.5.1 Introduction
Indicator variables (also known as dummy variables) are a device to incorporate nominal-scaled variables
into regression contexts. For example, suppose you looked at the relationship between blood pressure and
weight. In general, the blood pressure of an individual increases with weight. But in general, males are larger than
females, so a body weight of 90 kg may have a different effect for males than for females. So how can sex
(a nominally scaled variable) be incorporated into the regression equation?
It turns out that using indicator variables makes ordinary regression a general tool for many more applications
than simple regression. Indeed, it is possible to show that two-sample t-tests, single factor completely
randomized design ANOVAs, and even more complex experimental designs can be analyzed using regression
methods. This is why many computer packages refer to their analysis tools for comparing means and fitting
regressions as variants of general linear models.
6.5.2 Defining indicator variables<br />
Unfortunately, there is no standard way to define an indicator variable in a regression setting, but fortunately,
it turns out that it doesn't matter which formulation is used – it is always possible to get an appropriate
answer.
In general, if a nominally scaled variable has k categories, you will require k − 1 indicator variables. In
many cases, computer packages will generate these automatically if the package knows that the variable is to be
treated as a nominally scaled variable. 16
For example, as sex only has two levels, only one indicator variable is required. It could be coded as:

X_1 = 1 if male, 0 if female

or

X_1 = 1 if male, −1 if female

Many other codings are possible.
For a nominally scaled variable with three levels, two indicator variables will be needed. For example,
suppose that the size of a person is classified as small, medium, or large. Then the indicator variables could
be defined as:

X_1 = 1 if small, 0 if medium or large
X_2 = 1 if medium, 0 if small or large

Now the pair of variables defines the three classes as: (X_1, X_2) = (1, 0) = small, (X_1, X_2) = (0, 1) =
medium, and (X_1, X_2) = (0, 0) = large.
Many packages use what are known as reference coding rules for indicator variables, where the i-th indicator
variable takes the value 1 to indicate the i-th value of the variable for the first k − 1 values of the
variable, and all the indicator variables take the value 0 to refer to the last value of the variable. 17
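Reference coding can be sketched in a few lines; this mirrors the small/medium/large example above (the function name is invented for illustration):

```python
def reference_code(value, levels):
    """Reference coding: for a nominal variable with k levels, return
    k-1 indicator values.  The i-th indicator is 1 when value equals
    the i-th level; the last level is the reference level and is
    coded as all zeros."""
    return [1 if value == lev else 0 for lev in levels[:-1]]

levels = ["small", "medium", "large"]
print(reference_code("small", levels))   # [1, 0]
print(reference_code("medium", levels))  # [0, 1]
print(reference_code("large", levels))   # [0, 0]  <- the reference level
```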
So, how do indicator variables help incorporate the effects of a nominally scaled variable? Consider
the variable sex (taking two levels labeled f and m in that order). A single indicator variable, say Sex,
16 That is why it is good practice to code nominally scaled variables using alphanumeric codes (e.g. m and f for sex), rather than
numeric codes such as 3 or 7.
17 Always check the package documentation carefully to see if the package is using this rule. If it uses a different coding scheme,
you will have to interpret the estimates carefully.
is defined that takes the value of 1 for females and 0 for males. Now consider the following estimated
regression equation:

BloodPressure = 110 − 10 × Sex + 0.10 × Weight
The estimated blood pressure for a female who weighs 100 kg would be:

110 = 110 − 10(1) + 0.10(100)

while the estimated blood pressure for a male who weighs 100 kg would be:

120 = 110 − 10(0) + 0.10(100)

Hence, the coefficient associated with sex (with a value of −10) would be interpreted as the difference in
blood pressure between females and males for all weight classes, i.e. the relationship consists of two parallel
lines (with a slope against weight of 0.10) with a separation of 10 units.
On the other hand, consider the regression equation:

BloodPressure = 110 − 10 × Sex + 0.10 × Weight − 0.05 × Sex × Weight

Notice that two variables (the Sex indicator variable and the weight variable) are multiplied together. Now,
the estimated blood pressure for a female who weighs 100 kg would be:

105 = 110 − 10(1) + 0.10(100) − 0.05(1)(100)

while the estimated blood pressure for a male who weighs 100 kg would be:

120 = 110 − 10(0) + 0.10(100) − 0.05(0)(100)

Hence, the coefficient associated with the product of sex and weight would be interpreted as the differential
response to weight between males and females, i.e. the relationship consists of two non-parallel lines. The
slope for males against weight is 0.10 while the slope for females against weight is 0.10 − 0.05 = 0.05.
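The estimated equations above can be verified directly; a minimal sketch (coding Sex = 1 for females, 0 for males, as in the text):

```python
def bp_parallel(sex, weight):
    """Parallel-lines model: 110 - 10*Sex + 0.10*Weight,
    with Sex = 1 for females and 0 for males."""
    return 110 - 10 * sex + 0.10 * weight

def bp_interaction(sex, weight):
    """Non-parallel model with the Sex*Weight cross-product added."""
    return 110 - 10 * sex + 0.10 * weight - 0.05 * sex * weight

# Predictions for a 100 kg person under each model:
print(round(bp_parallel(1, 100)))     # female, parallel model: 110
print(round(bp_parallel(0, 100)))     # male, parallel model: 120
print(round(bp_interaction(1, 100)))  # female, interaction model: 105
print(round(bp_interaction(0, 100)))  # male, interaction model: 120
```

In the parallel model the two lines differ by a constant 10 units at every weight; in the interaction model the gap between the sexes changes with weight.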
This idea can be extended to nominally scaled variables with more than two levels in a straightforward
way. Fortunately, most packages will do the coding automatically for you and all that is necessary is to
specify the model appropriately and understand what the various model formulations imply.
6.5.3 The ANCOVA model<br />
The use of indicator variables has, for historical reasons, been referred to as the Analysis of Covariance
(ANCOVA) approach. It actually has two separate, but functionally identical, uses.
The first use is to incorporate nominally scaled variables into regression situations. The modeling starts
off with individual regression lines, one for each value of the nominal variable (e.g. a separate line for males
and females). A statistical test is used to see if the lines are parallel. If there is evidence that the individual
regression lines are not parallel, then a separate regression line must be used for each group for prediction
purposes. If there is no evidence of non-parallelism, then the next task is to see if the lines are coincident,
i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident,
then a series of parallel lines is used to make predictions. All of the data are used to estimate the common
slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled
together and a single regression line fit to all of the data.
The three possibilities are shown below for the case of two groups – the extension to many groups is
obvious:
Second, ANCOVA has been used to test for differences in means among the groups when some of the
variation in the response variable can be “explained” by a covariate. For example, the effectiveness of two
different diets can be compared by randomizing people to the two diets and measuring the weight change
during the experiment. However, some of the variation in weight change may be related to initial weight.
Perhaps by “standardizing” everyone to some common weight, we can more easily detect differences among
the groups. This will be discussed in a later chapter.
A very nice book on the Analysis of Covariance is Analysis of Messy Data, Volume III: Analysis of
Covariance by G. A. Milliken and D. E. Johnson. Details are available at
http://www.statsnetbase.com/ejournals/books/book_summary/summary.asp?id=869.
6.5.4 Assumptions
As before, it is important, before the analysis is started, to verify the assumptions underlying the analysis. As
ANCOVA is a combination of ANOVA and Regression, the assumptions are similar. Both goals of ANCOVA
have similar assumptions:
• The response variable Y is continuous (interval or ratio scaled).
• The data are collected under a completely randomized design. 18 This implies that the treatment must
be randomized completely over the entire set of experimental units in an experimental study, or units
must be selected at random from the relevant populations in an observational study.
• There must be no outliers. Plot Y vs. X for each group separately to see if there are any points that
don't appear to follow the straight line.
• The relationship between Y and X must be linear for each group. 19 Check this assumption by looking
at the individual plots of Y vs. X for each group.
• The variance must be equal for both groups around their respective regression lines. Check that the
spread of the points is equal across the range of X and that the spread is comparable between the two
groups. This can be formally checked by looking at the MSE from a separate regression line for each
group, as the MSE estimates the variance of the data around the regression line.
• The residuals must be normally distributed around the regression line for each group. This assumption
can be checked by examining the residual plots from the fitted model for evidence of non-normality. For
large samples, this is not too crucial; for small sample sizes, you will likely have inadequate power to
detect anything but gross departures.
6.5.5 Comparing individual regression lines
You saw in earlier chapters that a statistical model is a powerful shorthand to describe what analysis is fit
to a set of data. The model must describe the treatment structure, the experimental unit structure, and the
randomization structure. Let Y be the response variable; X be the continuous X-variable; and Group be the
nominally scaled group variable with TWO levels, i.e. only one indicator variable will be generated, called
I.
In this and the previous chapter, we use a shorthand model notation. For example, the model notation

Y = X

would refer to a regression of Y on X with the underlying statistical model:

Y = β_0 + β_1 X + ε
18 It is possible to relax this assumption - this is beyond the scope of this course.
19 It is possible to relax this assumption as well, but this is again beyond the scope of this course.
where the subscript corresponding to individual subjects has been dropped for clarity.
We now use an extension of model notation. The model notation:

Y = X Group Group*X

refers to the model:

Y = β_0 + β_1 X + β_2 I + β_3 I × X + ε

Lastly, the model notation:

Y = X Group

refers to the model:

Y = β_0 + β_1 X + β_2 I + ε
These models can be diagrammed in graphs. If the lines for each group are not parallel:
the appropriate model is

Y1 = X Group Group*X

The terms can be in any order. This is read as: variation in Y can be explained by a common intercept (never
specified), group effects (different intercepts), a common slope on X, and an “interaction” between
Group and X, which is interpreted as different slopes for each group. This model is almost equivalent to
fitting a separate regression line for each group. The only advantage to using this joint model compared to
fitting separate slopes is that all of the groups contribute to a better estimate of residual error. If the number
of data points per group is small, this can lead to improvements in precision compared to fitting each group
individually.
If the lines are parallel across groups, but not coincident:
the appropriate model is

Y2 = Group X

The terms can be in any order. The only difference between this and the previous model is that this simpler
model lacks the Group*X “interaction” term. It would not be surprising then that a statistical test to see if
this simpler model is tenable would correspond to examining the p-value of the test on the Group*X term
from the complex model. This is exactly analogous to testing for interaction effects between factors in a
two-factor ANOVA.
Lastly, if the lines are coincident:
the appropriate model is

Y3 = X

The difference between this model and the previous model is the Group term that has been dropped.
Again, it would not be surprising that this corresponds to the test of the Group effect in the formal statistical
test. The test for coincident lines should only be done if there is insufficient evidence against parallelism.
While it is possible to test for a non-zero slope, this is rarely done.
6.5.6 Example: Degradation of dioxin

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from two monitoring stations which are situated quite a distance apart on the same inlet where the pulp mill was located. The liver is excised and the livers from all four crabs are composited together into a single sample. 20 The dioxin level in this composite sample is measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.

As seen in the chapter on regression, the appropriate response variable is log(TEQ).

Is the rate of decline the same for both sites? Did the sites have the same initial concentration?

Here are the raw data, which are also available on the web in the SampleProgramLibrary available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

20 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among-individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.
Site  Year     TEQ  log(TEQ)
a     1990  179.05      5.19
a     1991   82.39      4.41
a     1992  130.18      4.87
a     1993   97.06      4.58
a     1994   49.34      3.90
a     1995   57.05      4.04
a     1996   57.41      4.05
a     1997   29.94      3.40
a     1998   48.48      3.88
a     1999   49.67      3.91
a     2000   34.25      3.53
a     2001   59.28      4.08
a     2002   34.92      3.55
a     2003   28.16      3.34
b     1990   93.07      4.53
b     1991  105.23      4.66
b     1992  188.13      5.24
b     1993  133.81      4.90
b     1994   69.17      4.24
b     1995  150.52      5.01
b     1996   95.47      4.56
b     1997  146.80      4.99
b     1998   85.83      4.45
b     1999   67.72      4.22
b     2000   42.44      3.75
b     2001   53.88      3.99
b     2002   81.11      4.40
b     2003   70.88      4.26
The data is entered into JMP in the usual fashion. Make sure that Site is a nominal scale variable, and that Year is a continuous variable.

In cases with multiple groups, it is often helpful to use a different plotting symbol for each group. This is easily accomplished in JMP by selecting the rows (say for site a) and using the Rows->Markers menu to set the plotting symbol for the selected rows:
The final data sheet has two different plotting symbols for the two sites:
Before fitting the various models, begin with an exploratory examination of the data, looking for outliers and checking the assumptions.

Each year’s data are independent of other years’ data, as a different set of crabs was selected. Similarly, the data from one site are independent of the other site. This is an observational study, so the question arises of how exactly the crabs were selected. In this study, crab pots were placed on the sea floor to capture the available crabs in the area.

Whenever multiple sets of data are collected over time, there is always the worry about common year effects (also known as process error). For example, if the response variable were body mass of small fish, then poor growing conditions in a single year could depress the growth of fish in all locations. This would violate the assumption of independence, as the residual at one site in a year would be related to the residual at the other site in the same year. You tend to see the residuals “paired”, with negative residuals from the fitted
line at one site matched (by year) with negative residuals at the other site. In this case, this is unlikely to have occurred. Degradation of dioxin is relatively independent of external environmental factors, and the variation that we see about the two regression lines is related solely to sampling error based on the particular set of crabs that were sampled. It seems unlikely that the residuals are related. 21

Use the Analyze->Fit Y-by-X platform and specify the log(TEQ) as the Y variable, and Year as the X variable:

Then specify a grouping variable by clicking on the pop-down menu near the Bivariate Fit title line:

21 If you actually try to fit a process error term to this model, you find that the estimated process error is zero.
and selecting Site as the grouping variable:
Now select the Fit Line option from the same pop-down menu:
to get separate lines fit for each group:
The relationships for each site appear to be linear. The actual estimates are also presented:

The scatter plot doesn’t show any obvious outliers. The estimated slope for the a site is −0.107 (se .02) while the estimated slope for the b site is −0.06 (se .02). The 95% confidence intervals (not shown on the output, but available by right-clicking/ctrl-clicking on the parameter estimates table) overlap considerably, so
the slopes could be the same for the two groups.

The MSE from site a is 0.10 and the MSE from site b is 0.12. These correspond to standard deviations of √0.10 = 0.32 and √0.12 = 0.35, which are very similar, so the assumption of equal standard deviations seems reasonable.

The residual plots (not shown) also look reasonable.

The assumptions appear to be satisfied, so let us now fit the various models.

First, fit the model allowing for separate lines for each group. The Analyze->Fit Model platform is used:

The terms can be in any order and correspond to the model described earlier. This gives the following output:
The regression plot is just the same as the plot of the two individual lines seen earlier. What is of interest is the Effect test for the Site*Year interaction. Here the p-value is not very small, so there is no evidence that the lines are not parallel.

We need to refit the model, dropping the interaction term:
which gives the following regression plot:
This shows the fitted parallel lines. The effect tests:

now have a small p-value for the Site effect, indicating that the lines are not coincident, i.e. they are parallel with different intercepts. This would mean that the rate of decay of the dioxin appears to be equal at both sites, but the initial concentrations appear to be different.

The estimated (common) slope is found in the Parameter Estimates portion of the output:
and has a value of −0.083 (se 0.016). Because the analysis was done on the log-scale, this implies that the dioxin levels changed by a factor of exp(−0.083) = 0.92 from year to year, i.e. about an 8% decline each year. The 95% confidence interval for the slope on the log-scale is (−0.12 → −0.05), which corresponds to a factor between exp(−0.12) = 0.89 and exp(−0.05) = 0.95 per year, i.e. between roughly an 11% and a 5% decline per year. 22
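The back-transformation from a log-scale slope to a yearly percent change can be checked directly (the numbers are taken from the output above):

```python
import math

slope, lo, hi = -0.083, -0.12, -0.05   # estimate and 95% CI from the output

yearly_factor = math.exp(slope)        # each year has about 92% of last year's TEQ
decline = 1 - yearly_factor            # about 0.08, i.e. roughly an 8% decline per year
ci_factors = (math.exp(lo), math.exp(hi))   # roughly (0.89, 0.95)
```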
While it is possible to estimate the difference between the parallel lines from the Parameter Estimates table, it is easier to look at the section of the output corresponding to the Site effects. Here the estimated LSMeans correspond to the log(TEQ) at the average value of Year - not really of interest. As in previous chapters, the difference in means is often of more interest than the raw means themselves. This is found by using the pop-down menu and selecting an LSMeans Contrast or Multiple Comparison procedure to give:

22 The confidence intervals are found by right-clicking/ctrl-clicking in the Parameter Estimates table.
The estimated difference between the lines (on the log-scale) is estimated to be 0.46 (se .13). Because the analysis was done on the log-scale, this corresponds to a ratio of exp(.46) = 1.58 in dioxin levels between the two sites, i.e. site b has 1.58 times the dioxin level of site a. Because the slopes are parallel and declining, the dioxin levels are falling at both sites, but the 1.58 ratio remains constant over time.
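A constant difference on the log scale is a constant ratio on the original scale, which is why the 1.58 ratio persists even as both sites decline. A small check using the parallel-lines structure (the intercept here is hypothetical; the slope and difference are from the output above):

```python
import math

slope = -0.083          # common slope from the output
diff = 0.46             # estimated site difference on the log scale
a0 = 5.0                # hypothetical log-scale intercept for site a

for year in range(5):
    teq_a = math.exp(a0 + slope * year)
    teq_b = math.exp(a0 + diff + slope * year)
    # the ratio b/a equals exp(diff), about 1.58, in every year
    assert round(teq_b / teq_a, 2) == 1.58
```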
Finally, the Actual by Predicted plot (not shown here), the leverage plots (not shown here), and the residual plot don’t show any evidence of a problem in the fit.

6.5.7 Example: More refined analysis of stream-slope example

In the chapter on paired comparisons, the example of the effect of stream slope was examined based on:

Isaak, D.J. and Hubert, W.A. (2000). Are trout populations affected by reach-scale stream slope? Canadian Journal of Fisheries and Aquatic Sciences, 57, 468-477.

In that paper, stream slope was (roughly) categorized into high or low slope classes and a paired analysis was performed. In this section, we will use the actual stream slopes to examine the relationship between fish density and stream slope.

Recall that a stream reach is a portion of a stream, from 10 to several hundred meters in length, that exhibits a consistent slope. The slope influences the general speed of the water, which exerts a dominant influence on the structure of physical habitat in streams. If fish populations are influenced by the structure of physical habitat, then the abundance of fish populations may be related to the slope of the stream.
Reach-scale stream slope and the structure of associated physical habitats are thought to affect trout populations, yet previous studies confound the effect of stream slope with other factors that influence trout populations.

Past studies addressing this issue have used sampling designs wherein data were collected either by taking repeated samples along a single stream or by measuring many streams distributed across space and time. Reaches on the same stream will likely have correlated measurements, making the use of simple statistical tools problematic. [Indeed, if only a single stream is measured at multiple locations, then this is an example of pseudo-replication and inference is limited to that particular stream.]

Inference from streams spread over time and space is made more difficult by inter-stream differences and by temporal variation in trout populations if samples are collected over extended periods of time. This extra variation reduces the power of any survey to detect effects.

For this reason, a paired approach was taken. A total of twenty-three streams were sampled from a large watershed. Within each stream, two reaches were identified and the actual slope gradient was measured.

In each reach, fish abundance was determined using electro-fishing methods and the numbers converted to a density per 100 m² of stream surface.

The following table presents the (fictitious, but based on the above paper) raw data.

Estimates of fish density from a paired experiment

Stream  slope (%)  slope class  density (per 100 m²)
 1         0.7        low            15.0
 1         4.0        high           21.0
 2         2.4        low            11.0
 2         6.0        high            3.1
 3         0.7        low             5.9
 3         2.6        high            6.4
 4         1.3        low            12.2
 4         4.0        high           17.6
 5         0.6        low             6.2
 5         4.4        high            7.0
 6         1.3        low            39.8
 6         3.2        high           25.0
 7         2.0        low             6.5
 7         4.2        high           11.2
 8         1.3        low             9.6
 8         4.2        high           17.5
 9         2.0        low             7.3
 9         3.6        high           10.0
10         0.7        low            11.3
10         3.5        high           21.0
11         2.3        low            12.1
11         6.0        high           12.1
12         2.5        low            13.2
12         4.2        high           15.0
13         2.3        low             5.0
13         6.0        high            5.0
14         1.2        low            10.2
14         2.9        high            6.0
15         0.7        low             8.5
15         2.9        high            7.0
16         1.1        low             5.8
16         3.0        high            5.0
17         2.2        low             5.1
17         5.0        high            5.0
18         0.7        low            65.4
18         3.2        high           55.0
19         0.7        low            13.2
19         3.0        high           15.0
20         0.3        low             7.1
20         3.2        high           12.0
21         2.3        low            44.8
21         7.0        high           48.0
22         1.8        low            16.0
22         6.0        high           20.0
23         2.2        low             7.2
23         6.0        high           10.1

Notice that the density varies considerably among streams but appears to be fairly consistent within each stream.
The raw data are available in a JMP datafile called paired-stream.jmp in the Sample Programs Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

As noted earlier, this is an example of an Analytical Survey. The treatments (low or high slope) cannot be randomized within a stream – the randomization occurs by selecting streams at random from some larger population of potential streams. As noted in the earlier chapter on Observational Studies, causal inference is limited whenever a randomization of experimental units to treatments cannot be performed.

Unlike the example presented in other chapters, where the slope was divided (arbitrarily) into two classes (low and high slope), we will now use the actual slope. A simple regression CANNOT be used because of the non-independence introduced by measuring two reaches on the same stream. However, an ANCOVA will prove to be useful here.

First, it seems sensible that the response to stream slope will be multiplicative rather than additive, i.e. an increase in the stream slope will change the fish density by a common fraction, rather than simply changing the density by a fixed amount. For example, it may turn out that a 1 unit change in the slope reduces density by 10% - if the density before the change was 100 fish/m², then after the change the new density will be 90 fish/m². Similarly, if the original density was only 10 fish/m², then the final density will be 9 fish/m². In both cases, the reduction is a fixed fraction, and NOT the same fixed amount (a change of 10 vs. 1).

Create the log(density) column in the usual fashion (not illustrated here). In cases like this, the natural logarithm is preferred because the resulting estimates have a very nice simple interpretation. 23
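On the log scale, that fixed fraction becomes a fixed additive shift, which is exactly what a linear model can represent. A two-line check:

```python
import math

# A 10% reduction moves 100 -> 90 and 10 -> 9: different absolute changes,
# but the same shift of log(0.9) on the natural-log scale.
assert math.isclose(math.log(90) - math.log(100), math.log(0.9))
assert math.isclose(math.log(9) - math.log(10), math.log(0.9))
```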
An appropriate model will be one where each stream has a separate intercept (corresponding to the different productivities of the streams - acting like a block), with a common slope for all streams. The simplified model syntax would look like

log(density) = Stream Slope

where the term Stream represents a nominal scaled variable and gives the different intercepts, and Slope is the effect of the common slope on the log(density).

This is fit using the Analyze->Fit Model platform as:

23 The JMP dataset also created a different plotting symbol for each stream using the Rows -> Color or Mark by Column menu.
Note that Stream must have a nominal scale and that Slope must have a continuous scale. The order of the terms in the effects box is not important.

The output from the Analyze->Fit Model platform is voluminous, but a careful reading reveals several interesting features.

First is a plot of the common slope fit to each stream:
This shows a gradual increase as slope increases. This plot is hard to interpret, but a plot of observed vs. predicted values is clearer:
Generally, the observed values are close to the predicted values, except for two potential outliers. By clicking on these points, it is seen that both points belong to stream 2, where it appears that an increase in the slope causes a large decrease in density, contrary to the general pattern seen in the other streams.

The effect tests:

fail to detect any influence of slope. Indeed, the estimated coefficient associated with a change in slope is found to be:
.025 (se .0299), which is not statistically significant. 24

Residual plots also show the odd behavior of stream 2:

If this rogue stream is “eliminated” from the analysis, the resulting plots do not show any problems (try it), but now the results are statistically significant (p = 0.035):

24 Because the natural log transform was used for the data, “smallish” slope coefficients have an approximate interpretation. In this example, a slope of .025 on the (natural) log scale implies that the estimated fish density INCREASES by 2.5% every time the slope increases by one percentage point.
The estimated change in log-density per percentage point change in the slope is found to be:

i.e. the slope is .05 (se .02), which is interpreted as: a percentage point increase in stream slope increases fish density by about 5%. 25
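For small coefficients on the natural-log scale, exp(b) − 1 ≈ b, which is where the “5% per percentage point” reading comes from:

```python
import math

b = 0.05                   # estimated slope on the natural-log scale
exact = math.exp(b) - 1    # exact multiplicative change, about 0.051, i.e. ~5.1%
# for small b, the approximation exp(b) - 1 ~ b is very close
assert abs(exact - b) < 0.002
```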
The remaining residual plot and leverage plots show no problems.
6.6 Example: Predicting PM10 levels

Small particulates are known to have adverse health effects. Here is some background information from Wikipedia: 26

The effects of inhaling particulate matter have been widely studied in humans and animals and include asthma, lung cancer, cardiovascular issues, and premature death. The size of the particle determines where in the body the particle will come to rest if inhaled. Larger particles are generally filtered by small hairs in the nose and throat and do not cause problems, but particulate matter smaller than about 10 micrometers, referred to as PM10, can settle in the bronchial tubes and lungs and cause health problems. Particles smaller than 2.5 micrometers, PM2.5, can penetrate directly into the lung, whereas particles smaller than 1 micrometer, PM1, can penetrate into the alveolar region of the lung and tend to be the most hazardous when inhaled.

The large number of deaths and other health problems associated with particulate pollution was first demonstrated in the early 1970s (Lave et al., 1973) and has been reproduced many times

25 This easy interpretation occurs because the natural log transform was used. If the common (base 10) log transform had been used, there would no longer be such a simple interpretation.

26 Downloaded from http://en.wikipedia.org/wiki/Particulate on 2006-05-22.
since. PM pollution is estimated to cause 20,000-50,000 deaths per year in the United States (Mokdad et al., 2004) and 200,000 deaths per year in Europe. For this reason, the US Environmental Protection Agency (EPA) sets standards for PM10 and PM2.5 concentrations in urban air. The EPA regulates primary particulate emissions and precursors to secondary emissions (NOx, sulfur, and ammonia). Many urban areas in the US and Europe still frequently violate the particulate standards, though urban air has gotten cleaner, on average, with respect to particulates over the last quarter of the 20th century.

The data are a subsample of 500 observations from a data set, collected by the Norwegian Public Roads Administration, originating in a study in which air pollution at a road is related to traffic volume and meteorological variables.

The response variable consists of hourly values of the logarithm of the concentration (why?) of PM10 (particles), measured at Alnabru in Oslo, Norway, between October 2001 and August 2003. The predictor variables are the logarithm of the number of cars per hour, temperature 2 meters above ground (degrees C), wind speed (meters/second), the temperature difference between 25 and 2 meters above ground (degrees C), wind direction (degrees between 0 and 360), hour of day, and day number from October 1, 2001.

The data were extracted from http://lib.stat.cmu.edu/datasets/ and are available in the file pm10.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Wind direction is an interesting variable, as it ranges from 0 to 360 around a circle and cannot be used directly in a regression setting – after all, directions of 1 degree and 359 degrees are very similar, yet have vastly “different” measured values.

Examine the histogram of the wind directions (obtained from the Analyze->Distribution platform):
This seems to indicate that there are two major wind directions. The “E” winds correspond to wind directions from about 320 to 360 degrees and from 0 to 150 degrees, while the “W” winds correspond to directions between 150 and 320 degrees.

Convert these measurements into a nominal scaled variable using JMP’s formula editor:
This classifies the wind direction into the two categories. A character coding is used to prevent computer packages from interpreting a numeric code as an interval or ratio scaled variable. An indicator variable could be created for this variable as seen in earlier chapters.
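The same recoding can be sketched outside JMP. The function below is a hypothetical equivalent of the formula-editor rule, using the cut points read off the histogram:

```python
def wind_sector(degrees: float) -> str:
    """Collapse a 0-360 wind direction into the two dominant sectors.

    "W" covers 150-320 degrees; everything else (320-360 and 0-150,
    which wrap around through north) is coded "E".
    """
    degrees = degrees % 360          # guard against values like 360 or -10
    return "W" if 150 <= degrees < 320 else "E"
```

Note that wind_sector(1) and wind_sector(359) both return "E" even though the raw values are numerically far apart, which is exactly the problem with using raw degrees in a regression.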
An initial scatterplot matrix of the data is obtained by using the Analyze->MultiVariateMethods->Multivariate platform:
There is no obvious relationship among the variables. The plot of the day variable shows a large gap. Inspection of the data shows that recording was stopped for about 100 days in the middle of the data set – the reasons for this are unknown. The number of cars/hour varies over the hour of the day in a predictable fashion. The wind direction variable shows that most of the data points have wind blowing in the two major directions corresponding to E and W, as broken into categories earlier.

A plot of the log(PM10) concentration by the condensed wind direction:
shows no obvious relationship between the PM10 and the wind direction.

The Analyze->Fit Model platform was used to fit a model to the continuous and indicator variables.
The leverage plots (not shown) don’t reveal any problems in the fit. The actual vs. predicted plot:
appears to show some evidence that the fitted line tends to under-predict at high log(PM10) concentrations and over-predict at lower log(PM10) concentrations, but the visual impression may be an artifact of the density of points. The residual plot:
doesn’t show any problems with the fit. In any case, the R² is not large, indicating plenty of residual variation not explained by the regressor variables.

The estimates table:

doesn’t show any problems with variance inflation, but perhaps some variables can be deleted. Use the Custom Test option:
to see if the day, wind direction, and hour can be removed. [I suspect that any hour effect has been taken up by the log(cars) effect and so is redundant (why?). Similarly, any trend over time (the day effect) may also be included in the log(cars) effect (why?)]:

[Why are three columns needed to test the three variables?] The results of the “chunk” test are:
showing that these variables can be safely deleted. The Analyze->Fit Model platform is again used, but now dropping these apparently redundant variables.

The revised estimates from this reduced model again show no problems in the leverage plots, no problems in the residual plots, and no problems in the VIF. The estimates are:
This time, it appears that both temperature variables are also redundant. This is somewhat surprising but, on sober second thought, perhaps not. The temperature wouldn't affect the creation of particles; after all, if the cars are the driving force behind the levels, the cars will produce the same particulate levels regardless of temperature. Perhaps temperature only affects how the PM10 levels affect human health, i.e. on hot days, perhaps people feel more affected by pollution.
A “chunk” test using the Custom Test procedure shows that the temperature variables can also be dropped (not shown).

The final model includes only two variables, the log(cars/hour) and the wind speed. The final estimates are:
As the number of cars/hour increases, the pollution level increases. As both the pollution level and the number of cars have been measured on the log scale, the coefficient must be interpreted carefully. A doubling of the number of cars corresponds to an increase of .7 on the natural logarithm scale (log(2) = .7). Hence, log(PM10) increases by .7(.32) = .22, which corresponds to an exp(.22) = 1.25-fold increase on the anti-log scale. In other words, a doubling of cars/hour corresponds to a 25% increase in the PM10 levels.

As wind speed increases, the concentration of PM10 decreases. A similar exercise shows that an increase in wind speed of 1 m/second causes the PM10 concentration to decrease by about 10%.
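The back-of-the-envelope interpretation above can be checked numerically. A minimal Python sketch, assuming the fitted slope for log(cars/hour) is 0.32 as quoted in the text; the wind-speed slope of −0.11 is an illustrative assumption (the actual value is in the estimates table, not reproduced here):

```python
import math

beta_log_cars = 0.32    # slope for log(cars/hour), quoted in the text
beta_wind = -0.11       # assumed wind-speed slope (illustrative value only)

# Doubling cars/hour adds log(2) on the log scale, so log(PM10) rises by
# log(2) * 0.32 ~= 0.22, i.e. a multiplicative effect of exp(0.22):
doubling_effect = math.exp(math.log(2) * beta_log_cars)
print(round(doubling_effect, 2))   # 1.25, i.e. about a 25% increase

# A 1 m/s increase in wind speed multiplies PM10 by exp(beta_wind):
wind_effect = math.exp(beta_wind)
print(round(wind_effect, 2))       # 0.9, i.e. roughly a 10% decrease
```

The same multiply-the-slopes logic applies to any log-log regression coefficient.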
The leverage plots and residual plots show no problems in the data.

How well does the model perform in practice? One way to assess this is to save the Std Err of predictions of the mean and of individual predictions to the data table:
(similar actions are done to save the std error for individual predictions and the actual predicted values). Then compute the ratio of each of the standard errors to the predicted values:
(again, only one formula is shown) and use the Analyze->Distribution platform to see the histograms of the relative prediction errors:
Predictions of the MEAN response are fairly good – the relative standard errors are under 5%, so the 95% confidence intervals for the predicted response will be fairly tight. However, as expected, the prediction intervals for individual responses are fairly poor – the relative prediction standard errors are around 25%, which means that the 95% prediction intervals will be ±50%! It is unclear how useful this is for advising individuals to take preventive actions under certain conditions of traffic volume and wind speed.
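The jump from a 25% relative standard error to ±50% intervals is just the usual normal-theory half-width of roughly two standard errors; a quick sketch:

```python
rel_se_mean = 0.05     # relative SE for the mean response (under 5%)
rel_se_indiv = 0.25    # relative SE for an individual prediction

# A 95% interval has a half-width of roughly 2 standard errors:
print(round(2 * rel_se_mean, 2))    # 0.1  -> mean response known to about +/-10%
print(round(2 * rel_se_indiv, 2))   # 0.5  -> individual predictions only to +/-50%
```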
6.7 Variable selection methods

6.7.1 Introduction

Up to now, it has been assumed that the variables to be used in the regression equation are basically known, and all that matters is perhaps deleting some variables as being unimportant, or deciding upon the degree of the polynomial needed for a variable.
In some cases, researchers are faced with several tens (sometimes hundreds or thousands) of predictors, and help is needed in even selecting a reasonable subset of variables to describe the relationship. The techniques in this section are called variable selection methods. CAUTION: Variable selection methods, despite their apparent objectivity, are no substitute for intelligent thought. As you will see in the remainder of this section, there are numerous caveats that must be kept in mind when using these methods.
There are two philosophies underlying variable selection methods. The first philosophy is that there is a unique correct model that explains the data. This MAY be true in physical systems where the goal of the project is to understand mechanisms of action. The role of variable selection is to try to come up with the variables that describe the mechanism of action. The second philosophy (and one that I personally find more appealing) is that reality is hopelessly complex and all our models are wrong. We hope via regression methods to come up with a prediction function that works satisfactorily. There is NO unique set of predictors which is “correct” – there may be several sets of predictors that all give reasonable answers, and the choice among these sets is not obvious.
In both cases, model selection follows five general steps:

1. Specify the maximum model (i.e. the largest set of predictors).

2. Specify a criterion for selecting a model.

3. Specify a strategy for selecting variables.

4. Specify a mechanism for fitting the models – usually least squares.

5. Assess the goodness-of-fit of the models and the predictions.
6.7.2 Maximum model

The maximum model is the set of predictors that contains all potential predictors of interest. Often researchers will add polynomial terms (e.g. X1²), cross-product terms (e.g. X1X2), or transformations of variables (e.g. ln(X1)).

If the first philosophy is correct, this maximal model must contain the correct model as a subset of the potential predictor variables. As the maximum model, this model has the highest predictive power, but some predictors may be redundant. Under the second philosophy, we know that this (and all models) are wrong, but we hope that this maximal model is a reasonable prediction function. Again, some predictors may be redundant.
Some caution must be used in specifying a maximum model. First, try to avoid including many variables that are collinear. For example, height and weight are highly collinear – are both variables really needed? If including polynomial or cross-product terms, center (i.e. subtract the mean) before squaring the variables or taking cross-products. Use scientific knowledge to select the potential predictors and the shape of the prediction function. Classification variables (i.e. nominal or ordinal scaled variables) will generate a separate indicator variable for each level of the variable. Some computer programs (e.g. JMP) may generate contrasts among these indicator variables as well.
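The advice to center before squaring can be seen in a small simulation. This is a sketch with invented data (a uniform(10, 20) predictor, chosen so its values sit far from zero): the raw variable is almost perfectly correlated with its own square, and centering removes most of that collinearity.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(10, 20, size=200)    # predictor whose values sit far from zero

r_raw = np.corrcoef(x, x ** 2)[0, 1]           # x and x^2: nearly collinear
xc = x - x.mean()                              # center first...
r_centered = np.corrcoef(xc, xc ** 2)[0, 1]    # ...then square: correlation near zero

print(round(r_raw, 3), round(abs(r_centered), 3))
```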
Second, there are various rules of thumb for the maximum number of predictors that should be entertained for a dataset. Generally, you want about 10 observations for each potential predictor variable. Hence, if your maximum model has 30 potential predictor variables, this rule of thumb would require you to have at least 300 observations! Remember that a nominal scaled variable with k values will require k − 1 indicator variables!
Third, examine the contrast within variables. If a variable is essentially constant (e.g. every subject had essentially the same weight), then this is a useless predictor variable, as no “effect” of weight will be apparent. If an indicator variable only points to a single case (e.g. only a single female in the dataset), then the results may be highly specific to the dataset analyzed. Low-contrast variables should not be included in the maximum model.
6.7.3 Selecting a model criterion

The model criterion is an “index” that is computed for each candidate model and used to compare the various models. Given a particular criterion, one can order the models from “best” to “worst”.

The criterion used should be related to the goal of the analysis. If the goal is prediction, the selection criterion should be related to errors in predictions. If the goal is variable subset selection, then the criterion should be related to the quality of the subset.

There is NO single best criterion. A literature search will reveal at least 10 criteria that have been proposed. In this chapter, five of the criteria will be discussed – this is not to say that these five are the optimal criteria, but rather the most frequently chosen. These criteria are R², F_p, MSE_p, C_p, and AIC.
R²

The R² criterion is the simplest criterion in use. The value of R² measures, in some sense, the proportion of total variation in the data that is explained by the predictors. Consequently, higher values of R² are “better”.

However, this criterion has a number of defects. First, R² will never decrease as you add variables (regardless of usefulness) to models. But in many cases, a plot of R² by the number of variables shows a rapid increase as variables are added, then a leveling off where new variables essentially add very little new information. Models near the bend of the curve seem to offer a reasonable description of the data. Some packages attempt to adjust the value of R² for the number of variables (called the adjusted R²), and so the value of the adjusted R² near the bend of the curve would again be the target.
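The adjusted R² mentioned above penalizes the raw R² for the number of predictors. A sketch of the standard formula (the numerical values are invented to illustrate the behaviour):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 with n observations and p predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A nearly useless extra predictor nudges R^2 up but pulls adjusted R^2 down:
print(round(adjusted_r2(0.800, 50, 3), 3))   # 0.787
print(round(adjusted_r2(0.801, 50, 4), 3))   # 0.783
```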
F_p

The F_p criterion is essentially a series of hypothesis tests to see which set of p variables is not statistically different from the full model. If the test statistic for a set of p predictors is not statistically significant, then the other variables can be dropped.
The danger with this criterion is that every test has an α probability of a Type I (false positive) error. So if you do 50 tests, each at α = .05, there is a very good chance that at least one of the tests will show a statistically significant result when in fact it is not. If you decide to use this criterion, you likely want to do the tests at a more stringent level, i.e. use α = .01 or α = .001.
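The “very good chance” is easy to quantify: with 50 independent tests at α = .05, the probability of at least one false positive is 1 − (1 − α)^50.

```python
alpha, m = 0.05, 50
p_at_least_one = 1 - (1 - alpha) ** m
print(round(p_at_least_one, 2))   # 0.92 -- a false positive is almost guaranteed
```

At α = .01 the same calculation gives about 0.39, which is why a more stringent level helps.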
MSE_p

This criterion uses the estimated residual variance about the regression line. This residual variance is a combination of unexplainable variation and excess variation caused by unknown predictors. In many cases, there is a subset that has the minimal residual variation.
C_p and AIC

These are two related (and in linear regression equivalent) criteria.

Mallow's C_p is computed as:

C_p = SSE(p)/MSE(k) − [n − 2(p + 1)]

where SSE(p) is the error sum of squares from the subset with p predictors EXCLUDING the intercept 27 ; MSE(k) is the MSE from the maximum model; and n is the number of observations.

If the maximum model does contain the “truth”, then Mallows showed that C_p should be close to p + 1 28 for a subset model that is closest to the “truth”.
The Akaike Information Criterion (AIC) is a 1-1 transformation of C_p and can be thought of as

AIC = fit + penalty for predictors.

In the case of multiple regression, AIC has a simple form:

AIC = n log(SSE/n) + 2p

where now p is the number of predictors INCLUDING the intercept. The model with the smallest AIC is usually preferred, as this model has the best fit after accounting for a penalty for adding too many predictors.
However, AIC goes further. Under the philosophy that all models are wrong, but some are useful, it is possible to obtain model weights for several potential models, and to “average” the results of several competing models. This avoids the entire discussion of which is the best wrong model, but rather works on the philosophy that if several models that all seem to fit the data similarly give wildly different answers, then this uncertainty in the response must be incorporated. Burnham and Anderson (2002) have a very nice book on the use of AIC and its philosophy. Unfortunately, the use of model weights is beyond the scope of this course.

27 Some textbooks define p to INCLUDE the intercept, and so the last term may look like n − 2p rather than n − 2(p + 1). Both are equivalent.

28 Again, if p is defined to include the intercept, then C_p should be close to p rather than p + 1.
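The AIC formula above is simple to compute directly. A sketch with invented SSE values, showing how a large enough drop in SSE justifies an extra parameter:

```python
import math

def aic(sse, n, p):
    """AIC for a least-squares fit: n*log(SSE/n) + 2p, with p counting the intercept."""
    return n * math.log(sse / n) + 2 * p

# Hypothetical fits on n = 50 observations:
print(round(aic(120.0, 50, 3), 1))   # 49.8 for the 3-parameter model
print(round(aic(80.0, 50, 4), 1))    # 31.5 -- smaller AIC, so this model is preferred
```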
6.7.4 Which subsets should be examined

When we start with k potential predictors, there are many, many potential models that involve subsets of the k predictors. How are these subsets chosen?
All possible subsets

If there are k predictor variables in the maximum model, there are around 2^k possible subsets. This number can be enormous – for example, with 10 potential predictors, there are around 2^10 = 1024 subsets; with 20 predictors, there are around 2^20 = 1,048,576 possible models, etc.
With modern computers and good algorithms, it is actually possible to search all subsets for up to about 15 predictors (and this number gets higher each year). 29 Don't use Excel!

The all-possible-subsets strategy is preferred for reasonably sized problems. Because it looks at all possible models, it is unlikely that you would miss the “correct” model among the subsets. However, there may be several different models that are all essentially the same, and being forced to select one of these models is a bit arbitrary – hence one of the driving forces behind the AIC.
Backward elimination

If you have many predictors, then all possible subsets may not be feasible. The backward elimination procedure starts with the maximum model and successively “deletes” variables until no further variables can be deleted.
The algorithm proceeds as follows:

1. Fit the maximum model.

2. Decide which variable to delete. Look at each of the individual p-values for variables still in the model. If all of the p-values are less than some α (say .05, but this varies among packages), then stop. Else, find the variable with the largest (why?) p-value and drop this variable.

3. Refit the model. Refit the model after dropping this variable, and repeat step 2 until no further variables can be deleted.

29 It turns out that by cleverly computing various statistics, you can actually predict the results from many subsets without actually having to fit all the subsets.
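The three steps above can be sketched in code. Rather than p-value tests, this toy version drops the variable whose removal most improves the AIC and stops when no removal helps – an AIC-based variant of the same idea, not the exact p-value rule described above. The simulated data and variable names are invented for illustration:

```python
import numpy as np

def aic_ls(y, X):
    """AIC for a least-squares fit: n*log(SSE/n) + 2p, p = X.shape[1] (intercept included)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ beta) ** 2)
    return n * np.log(sse / n) + 2 * p

def backward_eliminate(y, X, names):
    """Repeatedly drop the predictor whose removal most lowers AIC; keep the intercept (col 0)."""
    keep = list(range(1, X.shape[1]))
    best = aic_ls(y, X)
    while keep:
        trials = [(aic_ls(y, X[:, [0] + [k for k in keep if k != j]]), j) for j in keep]
        cand_aic, j = min(trials)
        if cand_aic >= best:          # no deletion improves AIC: stop
            break
        best, keep = cand_aic, [k for k in keep if k != j]
    return [names[k] for k in keep]

rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)      # only x1 actually matters
X = np.column_stack([np.ones(n), x1, x2, x3])
selected = backward_eliminate(y, X, ["intercept", "x1", "x2", "x3"])
print(selected)
```

The truly useful predictor x1 always survives; the noise predictors are usually, though not guaranteed to be, dropped.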
One must be careful to ensure that models are hierarchical, i.e. if an X² term remains in the model, then the corresponding X term must also remain. Many computer packages will violate this restriction if left to their own devices.
Forward additi<strong>on</strong><br />
This is the reverse of the backward eliminati<strong>on</strong> procedure. Start with a null model, and keep adding variables<br />
until no more can be added. The variable at each step with the smallest increment p-value is the variable that<br />
is added.<br />
Again, you must ensure that if X 2 terms are entered, that the corresp<strong>on</strong>ding X term is also entered.<br />
Stepwise selection

It may turn out that adding a variable during a forward process makes an existing variable redundant. The forward addition process has no mechanism for deleting variables once they've entered the model.

In a stepwise selection procedure, after a variable is entered, a backward elimination procedure is attempted to see if any variable can be removed.
Closing words

In all of these automated selection procedures, there is no guarantee that the chosen model will be “optimal” in any sense. As well, because of the many, many statistical tests performed, none of the p-values at the final step should be interpreted literally. It is also well known that if data generated completely at random is used with stepwise methods, they will often select a model for prediction that is just noise.

Consequently, the results that you obtain may be highly specific to the dataset collected and may not be reproducible with other datasets. Refer to Section 6.7.5 for ideas on evaluating the reliability of the analysis.
6.7.5 Goodness-of-fit

Even with automated variable selection methods, there is no guarantee that the fitted models actually fit the data well. Consequently, the usual residual diagnostics must be performed as outlined in earlier sections.

At the same time, the analyst should avoid becoming fixated on the results from a single dataset. There is no guarantee that the results from this particular dataset translate into other datasets. There are several ways to try and assess how well the chosen relationship will work in the future:
• Try on a new dataset. In some cases, the study can be repeated, and a comparison of the model selected from the existing and new study is instructive.

• Split-sample. If there are many observations, the sample can be split into two. Model selection is done on each half independently, and the two analyses compared. If a variable is selected in one half, but not the other, this is an indication of instability in the analysis.

How well does the model do in predictions? Recall that R² measures the percentage of variation explained by the model. Use the first half of the data, fit a model, and find the R² for the first half. Use the model from the first sample to predict the data points for the second sample and compute the squared correlation between the observed and predicted values. This second R² will typically be smaller than the R² based on the first sample. If the shrinkage in R² is large, this is bad news – it implies that the results from the first sample did not do well in predicting the values in the second sample.
• Cross-validation. In some cases, you do not have sufficient data to split into two halves. In these cases, single-case cross-validation is often attempted. In this method, you fit a model excluding each case in turn, and then use the fitted model to predict the held-out case. A comparison of the fitted vs. actual values is a measure of predictive ability.
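In linear regression, single-case cross-validation need not refit n separate models: the standard hat-matrix identity e_i/(1 − h_ii) gives every leave-one-out prediction error from one fit. A numpy sketch with simulated data for illustration:

```python
import numpy as np

def loo_errors(y, X):
    """Leave-one-out prediction errors for least squares: e_i / (1 - h_ii)."""
    H = X @ np.linalg.pinv(X)          # hat matrix H = X (X'X)^{-1} X'
    e = y - H @ y                      # ordinary in-sample residuals
    return e / (1 - np.diag(H))

rng = np.random.default_rng(2)
n = 60
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

e_loo = loo_errors(y, X)
e_ols = y - X @ (np.linalg.pinv(X) @ y)
press = np.sum(e_loo ** 2)             # PRESS: sum of squared hold-out errors
print(press > np.sum(e_ols ** 2))      # True: hold-out errors always exceed in-sample ones
```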
6.7.6 Example: Calories of candy bars

The JMP installation includes a dataset on the composition of popular candy bars. This is available under the Help → Sample Data Library → Food and Nutrition section, or in the candybar.jmp file in the Sample Program Library in the http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms directory.

For each of about 50 brands of candy bars, the total calories and the composition (grams of fat, grams of fiber, etc.) were measured. Can the total calories be predicted from the various constituents?
A preliminary scatter plot of the data:
shows a strong relationship between calories and total grams of fat and/or grams of saturated fat, but a weaker relationship between calories and grams of protein and grams of carbohydrates.

There are no obvious outliers, except for a few candy bars which appear to have unusual levels of vitamins (?).
The Analyze->Fit Model platform is used to request a stepwise regression analysis to try and predict the number of calories in the candy bars:
In this case, the philosophy that the correct model must be a subset of these variables is likely correct. The mechanism by which calories “appear” in food is well understood – likely a combination of fat, protein, and carbohydrates. It is unlikely that fiber or vitamins contribute anything substantial to the total calories.

The stepwise dialogue box has a number (!) of options and statistics available:
Detailed explanation of these features is available in the JMP help, but a summary is below:

• The direction of the stepwise procedure can be changed from forward, to backwards, or to mixed. If you wish to do backwards elimination, you will have to Enter All variables first before selecting this option. All possible regressions is available from the red-triangle pop-down menu.

• The probability to enter and to leave are set fairly liberally. A probability to enter of 0.25 indicates that variables that have any chance of being useful are added; the probability to leave indicates that as long as some marginal predictive ability is available, the variable should be retained.
• If the Go button is pressed, the procedure is completely automatic. If the Step button is pressed, the procedure goes step-by-step through the algorithm. The Make Model button is used at the end to fit the final selected model and obtain the usual diagnostic features.
• The package reports the MSE, R², the adjusted R², C_p, and AIC for each model. These can be used to assess the progress of the procedure.

• The actual model under consideration consists of those variables with check marks inside the Entered boxes. If you wish to force a variable to be always present, this is possible by entering the variable and locking it in.
Change the direction to Mixed and then repeatedly press the Step button.

For the first step, the program computes the p-values for each new variable to enter the model. The variable with the smallest p-value below the Prob to Enter will be selected to enter; here this is the Total Fat variable.
The model now consists of the intercept and the total fat variable, for a total of p = 2 predictors. The C_p is extremely large; the R² has increased from the previous model; the MSE has decreased.

None of the variables has a p-value greater than the Prob to Leave, so nothing happens in the “leaving step” and the Step button must be pressed again.

Based on the previous output, the carbohydrate variable will be entered (why?):
and then the protein variable (why?):
At this point we are now getting models with enormous R² values (close to 100%), which is practically unheard of in ecological contexts. Note that C_p is becoming close to p.

At this point, which variable would be entered next? Surprisingly, sodium is entered next, followed by saturated fat, and finally the procedure halts:
Both backward elimination and forward selection also pick this final model (try it).

The Make Model button will take these selected variables and create the Analyze->Fit Model dialogue box to fit this final model:
None of the leverage plots shows anything amiss; the residual plots look good. The final estimates are:

The VIF for total fat is a bit worrisome – notice that both the total fat and saturated fat variables are in the model. Presumably, saturated fat is included in the total fat and is redundant. Try refitting this model dropping the saturated fat variable and re-examine the estimates:
Again all the leverage plots look fine, and the VIFs are all small. In our final model, each additional gram of total fat increases calories by 8.9 calories;³⁰ each additional gram of protein increases calories by 4.7 calories;³¹ each additional gram of carbohydrate increases calories by 4.1 calories;³² and each mg of sodium decreases calories by a minuscule amount. The biological relevance of the sodium contribution is unknown. Perhaps this is an artifact of this particular data set?

This particular example was "easy", as the true model is known and the response is almost exactly predicted by the predictors. As noted earlier, most ecological contexts are not so nearly perfect.
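JMP reports the VIF directly, but the idea is easy to check by hand: when a model has exactly two predictors, the VIF of either one is 1/(1 − r²), where r is the correlation between them. The sketch below uses made-up fat values (the real cereal dataset is not reproduced here), so the numbers are illustrative only:

```python
import math

def vif_two_predictors(x1, x2):
    """VIF for either predictor when a model has exactly two predictors:
    VIF = 1 / (1 - r^2), where r is the correlation between the predictors."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sx = math.sqrt(sum((a - m1) ** 2 for a in x1))
    sy = math.sqrt(sum((b - m2) ** 2 for b in x2))
    r = sxy / (sx * sy)
    return 1.0 / (1.0 - r ** 2)

# Made-up grams of total fat and saturated fat for a handful of cereals;
# saturated fat is close to a fixed fraction of total fat, so the two are
# highly correlated and the VIF is large.
total_fat = [1.0, 2.0, 3.0, 1.5, 5.0, 0.5, 4.0]
sat_fat = [0.3, 0.7, 1.1, 0.5, 1.8, 0.1, 1.5]

print(round(vif_two_predictors(total_fat, sat_fat)))
```

A VIF far above 10, as here, is the numerical signature of the redundancy between total fat and saturated fat that prompted dropping one of them.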
6.7.7 Example: Fitness dataset

- this will be demonstrated in class

6.7.8 Example: Predicting zooplankton biomass
What drives the biomass of zooplankton on reefs? The zooplankton was broken into two size classes (190–600 µm and >600 µm), and environmental variables were sampled at 51 irregularly spaced sites (sampling interval: 156–37 m) arranged along a straight-line cross-shelf transect 8.4 km in length.
The raw data are available at http://www.esapubs.org/archive/ecol/E085/050/suppl-1.htm#anchorFilelist in the Guadeloupe.txt file, and in the guadeloupe.jmp file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The response variable is the log-transformed zooplankton biomass of the two size classes (original units: mg/m³ ash-free dry mass).³³ The predictor variables include
• coordinate (km) of the sampling site along the transect.
³⁰ The accepted value for fat is 9 calories/gram.
³¹ The accepted value for protein is 4 calories/gram.
³² The accepted value for carbohydrates is 4 calories/gram.
³³ Why was a log-transform used?
• environmental variables such as dissolved oxygen (mg/L), salinity (psu), wind speed (m/s), phytoplankton biomass (log-transformed, original units: µg/L), turbidity (NTU), swell height (m)

• habitat variables coded as 14 indicator variables indicating various habitat classes.
We will try to develop a prediction equation for the larger zooplankton category.

It is always good practice to do some preliminary plots of the data to search for outliers and general trends before beginning a more sophisticated analysis.

Start with a scatterplot matrix of the continuous variables, obtained from the Analyze->MultiVariateMethods->Multivariate platform:
There appears to be a strong bivariate relationship of biomass with distance along the transect line and with phytoplankton biomass. At the same time, several of the predictors appear to be highly related. For example, the distance along the transect line and phytoplankton biomass are very strongly related, as are wind speed and swell height. A quadratic relationship between some of the predictor variables is also apparent (e.g. wind speed vs. distance). A few unusual points appear; e.g. look at the plot of salinity vs. log(zooplankton), where two points seem at odds with the rest of the data. By clicking on these points, we see that they correspond to site 5 (whose marker I subsequently changed to an X to see where it fit in the rest of the plot) and site 1 (whose marker I subsequently changed to a triangle for the remainder of the analysis).
A common problem with indicator variables is insufficient contrast, i.e. there are only a few sampling sites with a particular habitat variable. You can see how many of each habitat type are present by simply counting the number of 1's in each indicator-variable column or finding the "sum" of each column.
These indicate that there is only 1 site with under 25% coverage of sea-grass on muddy sand, and most of the indicator variables occur on less than 10% of the sites. I would be hesitant to read too much into any regression equation that includes most of these indicator variables, as I suspect they will be specific to this particular dataset and not generalizable to other datasets.
So, based on this preliminary analysis, I would expect that distance and/or phytoplankton and/or turbidity would be the primary predictors for zooplankton biomass in this category. With only 51 data points, I would be reluctant to include more than about 5 predictor variables, using the rule of thumb of 10 observations per predictor.
The Analyze->Fit Model platform is used to request a stepwise regression analysis:
A stepwise analysis is requested.
The step history:
shows that R² increases fairly rapidly until it hits around 80% and then tends to level off; the Cp approaches p³⁴ also around step 9 or 10.
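For reference, one common form of Mallows' criterion is Cp = SSE_p/MSE_full − (n − 2p), with JMP's convention that p counts the intercept. The sketch below uses hypothetical numbers (the real SSE values are in the JMP output, not reproduced here) to show why Cp lands near p for an adequate submodel:

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' Cp = SSE_p / MSE_full - (n - 2p), where p counts the
    estimated coefficients INCLUDING the intercept (JMP's convention)."""
    return sse_p / mse_full - (n - 2 * p)

# Hypothetical numbers: 51 sites, a candidate model with p = 5 coefficients.
# If the candidate model is unbiased, E[SSE_p] is about (n - p) * sigma^2,
# so Cp lands close to p.
n, p, mse_full = 51, 5, 2.0
sse_p = (n - p) * mse_full
print(mallows_cp(sse_p, mse_full, n, p))   # -> 5.0, equal to p
```

A submodel that omits important predictors inflates SSE_p and pushes Cp well above p, which is why watching Cp fall toward p along the step history is a useful stopping signal.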
The summary of the steps shows that the transect location is the first variable in, followed, surprisingly, by several indicator variables, followed by phytoplankton biomass. It is somewhat surprising that both the transect location and the phytoplankton biomass are entered into the model, as they are highly related.

Rerun the stepwise procedure, a step at a time for the first 9 steps, and then press the Make Model button:
³⁴ Note that JMP uses the convention that the count p INCLUDES the intercept.
to actually fit this model. The plot of actual vs. predicted:
shows a reasonable fit. Some of the leverage plots for the indicator variables show that the fit is determined by a single site or a pair of sites:

The VIFs for the transect location and phytoplankton biomass variables:
are large – a consequence of the strong relationship between these two variables.

I would subsequently remove one of the transect location or phytoplankton biomass variables, and would likely remove any indicator variable that is entered but depends on a single site, as this is surely an artifact of this particular dataset.

All-possible-subsets regression is barely feasible with this size of problem. It took less than three minutes to fit on my Macintosh G4 at home, but the output file was enormous! I suspect that unless some way is found to condense the output to something more user-friendly, this would not be a feasible way to proceed.
Chapter 7

Logistic Regression

7.1 Introduction

7.1.1 Difference between standard and logistic regression
In regular multiple-regression problems, the Y variable is assumed to have a continuous distribution, with the vertical deviations around the regression line being independently normally distributed with a mean of 0 and a constant variance σ². The X variables are either continuous or indicator variables.

In some cases, the Y variable is a categorical variable, often with two distinct classes. The X variables can be either continuous or indicator variables. The object is now to predict the CATEGORY in which a particular observation will lie.
For example:

• The Y variable is over-winter survival of a deer (yes or no) as a function of body mass, condition factor, and winter severity index.

• The Y variable is fledging (yes or no) of birds as a function of distance from the edge of a field, food availability, and a predation index.

• The Y variable is breeding (yes or no) of birds as a function of nest density, predators, and temperature.
Consequently, the linear regression model with normally distributed vertical deviations really doesn't make much sense – the response variable is a category and does NOT follow a normal distribution. In these cases, a popular methodology is logistic regression.

There are a number of good books on the use of logistic regression:
• Agresti, A. (2002). Categorical Data Analysis. Wiley: New York.

• Hosmer, D.W. and Lemeshow, S. (2000). Applied Logistic Regression. Wiley: New York.

These should be consulted for all the gory details on the use of logistic regression.
7.1.2 The Binomial Distribution
A common probability model for outcomes that come in only two states (e.g. alive or dead, success or failure, breeding or not breeding) is the Binomial distribution. The Binomial distribution counts the number of times that a particular event will occur in a sequence of observations.¹ The binomial distribution is used when a researcher is interested in the occurrence of an event, not in its magnitude. For instance, in a clinical trial, a patient may survive or die. The researcher studies the number of survivors, not how long the patient survives after treatment. In a study of bird nests, the number in the clutch that hatch is measured, not the length of time to hatch.

In general, the binomial distribution counts the number of events in a set of trials, e.g. the number of deaths in a cohort of patients, the number of broken eggs in a box of eggs, or the number of eggs that hatch from a clutch. Other situations in which binomial distributions arise are quality control, public opinion surveys, medical research, and insurance problems.
It is important to examine the assumptions being made before a Binomial distribution is used. The conditions for a Binomial distribution are:

• n identical trials (n could be 1);
• all trials are independent of each other;
• each trial has only one outcome, success or failure;
• the probability of success is constant for the set of n trials. Some books use p to represent the probability of success; other books use π;²
• the response variable Y is the number of successes³ in the set of n trials.
However, not all experiments that on the surface look like binomial experiments satisfy all the required assumptions. Typical failures of the assumptions include non-independence (e.g. the first bird that hatches destroys the remaining eggs in the nest) or a changing p within a set of trials (e.g. measuring genetic abnormalities for a particular mother as a function of her age; for many species, older mothers have a higher probability of genetic defects in their offspring as they age).
¹ The Poisson distribution is a close cousin of the Binomial distribution and is discussed in other chapters.
² Following the convention that Greek letters refer to population parameters, just like µ refers to the population mean.
³ There is great flexibility in defining what is a success. For example, you could count either the number of eggs that hatch or the number of eggs that fail to hatch in a clutch. You will get the same answers from the analysis after making the appropriate substitutions.
The probability of observing Y successes in n trials, if each success has probability p of occurring, can be computed using:

\[ P(Y = y \mid n, p) = \binom{n}{y} p^y (1-p)^{n-y} \]

where the binomial coefficient is computed as

\[ \binom{n}{y} = \frac{n!}{y!\,(n-y)!} \]

and where n! = n(n − 1)(n − 2) · · · (2)(1).
For example, the probability of observing Y = 3 eggs hatching from a nest with n = 5 eggs in the clutch, if the probability of success is p = .2, is

\[ P(Y = 3 \mid n = 5, p = .2) = \binom{5}{3} (.2)^3 (1 - .2)^{5-3} = .0512 \]
Fortunately, we will have little need for these probability computations. There are many tables that tabulate the probabilities for various combinations of n and p – check the web.
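Although tables (or the web) suffice, the computation is a one-liner in most languages. A short sketch in plain Python, reproducing the hatching example above (`binom_pmf` is our own helper name, not from any package):

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y | n, p) for the Binomial distribution."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Probability that 3 of 5 eggs hatch when each hatches with probability 0.2:
print(round(binom_pmf(3, 5, 0.2), 4))   # -> 0.0512
```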
There are two important properties of a binomial distribution that will serve us in the future. If Y is Binomial(n, p), then:

• E[Y] = np
• V[Y] = np(1 − p), and the standard deviation of Y is √(np(1 − p))

For example, if n = 20 and p = .4, then the average number of successes in these 20 trials is E[Y] = np = 20(.4) = 8.
If an experiment is observed, and a certain number of successes is observed, then the estimator for the success probability is found as:

\[ \hat{p} = \frac{Y}{n} \]

For example, if a clutch of 5 eggs is observed (the set of trials) and 3 successfully hatch, then the estimated proportion of eggs that hatch is p̂ = 3/5 = .60. This is exactly analogous to the case where a sample is drawn from a population and the sample average Ȳ is used to estimate the population mean µ.
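The two moment properties and the estimator p̂ can be checked with a few lines of Python (the function names are ours, chosen for illustration):

```python
import math

def binom_mean(n, p):
    """E[Y] = np for Y ~ Binomial(n, p)."""
    return n * p

def binom_sd(n, p):
    """Standard deviation sqrt(np(1-p)) for Y ~ Binomial(n, p)."""
    return math.sqrt(n * p * (1.0 - p))

def p_hat(successes, trials):
    """Estimated success probability from an observed set of trials."""
    return successes / trials

print(binom_mean(20, 0.4))          # -> 8.0, as in the text
print(round(binom_sd(20, 0.4), 2))  # sqrt(4.8), about 2.19
print(p_hat(3, 5))                  # -> 0.6, the clutch example
```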
7.1.3 Odds, risk, odds-ratio, and probability
The odds of an event and the odds ratio of two events are very common terms in logistic contexts. Consequently, it is important to understand exactly what they do and do not say.
The odds of an event are defined as:

\[ \text{Odds(event)} = \frac{P(\text{event})}{P(\text{not event})} = \frac{P(\text{event})}{1 - P(\text{event})} \]

The notation used is often a colon separating the odds values. Some sample values are tabulated below:

Probability   Odds
   .01        1:99
   .1         1:9
   .5         1:1
   .6         6:4 or 3:2 or 1.5
   .9         9:1
   .99        99:1
For very small odds, the probability of the event is approximately equal to the odds. For example, if the odds are 1:99, then the probability of the event is 1/100, which is roughly equal to 1/99.
The odds ratio (OR) is, by definition, the ratio of two odds:

\[ \text{OR}_{A \text{ vs. } B} = \frac{\text{odds}(A)}{\text{odds}(B)} = \frac{P(A)/(1-P(A))}{P(B)/(1-P(B))} \]
For example, if the probability of an egg hatching under condition A is 1/10 and the probability of an egg hatching under condition B is 1/20, then the odds ratio is OR = (1:9)/(1:19) = 2.1:1. Again, for very small odds, the odds ratio is approximately equal to the ratio of the probabilities.

An odds ratio of 1 would indicate that the probabilities of the two events are equal.
In many studies, you will hear reports that the odds of an event have doubled. This gives NO information about the base rate. For example, did the odds increase from 1:million to 2:million or from 1:10 to 2:10?
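A short sketch makes the odds and odds-ratio computations concrete; `odds` and `odds_ratio` are hypothetical helper names, and the call reproduces the egg-hatching example above:

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def odds_ratio(p_a, p_b):
    """Odds ratio comparing two events with probabilities p_a and p_b."""
    return odds(p_a) / odds(p_b)

print(round(odds(0.6), 2))                   # -> 1.5, i.e. 3:2
print(round(odds_ratio(1 / 10, 1 / 20), 1))  # -> 2.1, the hatching example
```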
It turns out that it is convenient to model probabilities on the log-odds scale. The log-odds (LO), also known as the logit, is defined as:

\[ \text{logit}(A) = \log_e(\text{odds}(A)) = \log_e\left(\frac{P(A)}{1 - P(A)}\right) \]
We can extend the previous table to compute the log-odds:
Probability   Odds                 Logit
   .01        1:99                 −4.59
   .1         1:9                  −2.20
   .5         1:1                   0
   .6         6:4 or 3:2 or 1.5     .41
   .9         9:1                   2.20
   .99        99:1                  4.59
Notice that the log-odds is zero when the probability is .5, and that the log-odds of .01 and of .99 are equal in magnitude but opposite in sign.
It is also easy to go back from the log-odds scale to the regular probability scale in two equivalent ways:

\[ p = \frac{e^{LO}}{1 + e^{LO}} = \frac{1}{1 + e^{-LO}} \]

Notice the minus sign in the second back-translation. For example, LO = 10 translates to p = .9999; LO = 4 translates to p = .98; LO = 1 translates to p = .73; etc.
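The logit and its inverse are easy to code directly. The sketch below (plain Python, with hypothetical helper names) reproduces the back-translations and the symmetry noted above:

```python
import math

def logit(p):
    """Log-odds (logit) of a probability p."""
    return math.log(p / (1.0 - p))

def inv_logit(lo):
    """Back-transform log-odds to a probability; note the minus sign."""
    return 1.0 / (1.0 + math.exp(-lo))

print(round(inv_logit(4), 2))   # -> 0.98, as quoted in the text
print(round(inv_logit(1), 2))   # -> 0.73
print(round(logit(0.01), 2), round(logit(0.99), 2))  # symmetric: -4.6 and 4.6
```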
7.1.4 Modeling the probability of success
Now if the probability of success were the same for all sets of trials, the analysis would be trivial: simply tabulate the total number of successes and divide by the total number of trials to estimate the probability of success. However, what we are really interested in is the relationship of the probability of success to some covariate X, such as temperature or condition factor.

For example, consider the following (hypothetical) example of an experiment where various clutches of bird eggs were found, and the number of eggs that hatched and fledged was measured along with the height of the nest above the ground:
Height   Clutch Size   Fledged    p̂
 2.0         4            0      0.00
 3.0         3            0      0.00
 2.5         5            0      0.00
 3.3         3            2      0.67
 4.7         4            1      0.25
 3.9         2            0      0.00
 5.2         4            2      0.50
10.5         5            5      1.00
 4.7         4            2      0.50
 6.8         5            3      0.60
 7.3         3            3      1.00
 8.4         4            3      0.75
 9.2         3            2      0.67
 8.5         4            4      1.00
10.0         3            3      1.00
12.0         6            6      1.00
15.0         4            4      1.00
12.2         3            3      1.00
13.0         5            5      1.00
12.9         4            4      1.00
Notice that the probability of fledging seems to increase with height above the ground (potentially reflecting distance from predators?).

We would like to model the probability of success as a function of height. As a first attempt, suppose that we plot the estimated probability of success (p̂) against height and try to fit a straight line to the plotted points.

The Analyze->Fit Y-by-X platform was used, with p̂ treated as the Y variable and Height as the X variable:
This procedure is not entirely satisfactory for a number of reasons:

• The data points seem to follow an S-shaped relationship, with probabilities of success near 0 at lower heights and near 1 at greater heights.

• The fitted line gives predictions for the probability of success that are greater than 1 or less than 0, which is impossible.

• The fitted line cannot deal properly with the fact that the probability of success is likely close to 0% for a wide range of small heights and essentially close to 100% for a wide range of taller heights.

• The assumption of a normal distribution for the deviations from the fitted line is not tenable, as the p̂ are essentially discrete for the small clutch sizes found in this experiment.

• While not apparent from this graph, the variability of the response changes over the different parts of the regression line. For example, when the true probability of success is very low (say 0.1), the standard deviation of the number fledged for a clutch with 5 eggs is √(5(.1)(.9)) = .67, while the standard deviation of the number fledged for a clutch with 5 eggs and a probability of success of 0.5 is √(5(.5)(.5)) = 1.1, which is almost twice as large as the previous standard deviation.
For these (and other) reasons, the analysis of this type of data is commonly done on the log-odds (also called the logit) scale. The odds of an event are computed as:

\[ \text{ODDS} = \frac{p}{1 - p} \]

and the log-odds is found as the (natural) logarithm of the odds:

\[ LO = \log\left(\frac{p}{1 - p}\right) \]
This transformation converts the 0–1 scale of probability to a −∞ to ∞ scale, as illustrated below:
p       LO
0.001   -6.91
0.01    -4.60
0.05    -2.94
0.1     -2.20
0.2     -1.39
0.3     -0.85
0.4     -0.41
0.5      0.00
0.6      0.41
0.7      0.85
0.8      1.39
0.9      2.20
0.95     2.94
0.99     4.60
0.999    6.91
Notice that the log-odds scale is symmetric about 0 and that, for moderate values of p, equal changes on the p-scale correspond to nearly constant changes on the log-odds scale. For example, going from .5 → .6 → .7 on the p-scale corresponds to moving from 0 → .41 → .85 on the log-odds scale.
It is also easy to go back from the log-odds scale to the regular probability scale:

\[ p = \frac{e^{LO}}{1 + e^{LO}} = \frac{1}{1 + e^{-LO}} \]
For example, LO = 10 translates to p = .9999; LO = 4 translates to p = .98; LO = 1 translates to p = .73; etc.
We can now return to the previous data. At first glance, it would seem that the log-odds could simply be estimated as:

\[ \widehat{LO} = \log\left(\frac{\hat{p}}{1 - \hat{p}}\right) \]

but this doesn't work well with small sample sizes (it can be shown that the simple logit function is biased) or when values of p̂ are close to 0 or 1 (the simple logit function hits ±∞). Consequently, in small samples or when the observed probability of success is close to 0 or 1, the empirical log-odds is often computed as:

\[ \widehat{LO}_{\text{empirical}} = \log\left(\frac{n\hat{p} + .5}{n(1 - \hat{p}) + .5}\right) = \log\left(\frac{\hat{p} + .5/n}{1 - \hat{p} + .5/n}\right) \]
We compute the empirical log-odds for the hatching data:
Height   Clutch   Fledged    p̂     LÔ(emp)
 2.0        4        0      0.00    -2.20
 3.0        3        0      0.00    -1.95
 2.5        5        0      0.00    -2.40
 3.3        3        2      0.67     0.51
 4.7        4        1      0.25    -0.85
 3.9        2        0      0.00    -1.61
 5.2        4        2      0.50     0.00
10.5        5        5      1.00     2.40
 4.7        4        2      0.50     0.00
 6.8        5        3      0.60     0.34
 7.3        3        3      1.00     1.95
 8.4        4        3      0.75     0.85
 9.2        3        2      0.67     0.51
 8.5        4        4      1.00     2.20
10.0        3        3      1.00     1.95
12.0        6        6      1.00     2.56
15.0        4        4      1.00     2.20
12.2        3        3      1.00     1.95
13.0        5        5      1.00     2.40
12.9        4        4      1.00     2.20
and now plot the empirical log-odds against height:
The fit is much nicer: the relationship has been linearized, and now, no matter what the prediction is, it can always be translated back to a probability between 0 and 1 using the inverse transform seen earlier.
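The empirical log-odds values can be reproduced with a short script; the 0.5 correction is what keeps the logit finite even when p̂ is exactly 0 or 1 (the helper name is ours, and only the first few rows of the hypothetical data are shown):

```python
import math

def empirical_log_odds(fledged, clutch):
    """Empirical log-odds with the 0.5 correction: stays finite even when
    the observed proportion is exactly 0 or 1."""
    return math.log((fledged + 0.5) / (clutch - fledged + 0.5))

# First few rows of the hypothetical hatching data: (height, clutch, fledged)
rows = [(2.0, 4, 0), (3.0, 3, 0), (2.5, 5, 0), (3.3, 3, 2), (4.7, 4, 1)]
for height, clutch, fledged in rows:
    print(height, round(empirical_log_odds(fledged, clutch), 2))
```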
7.1.5 Logistic regression
But this is still not enough. Even on the log-odds scale, the data points are not normally distributed around the regression line. Consequently, rather than using ordinary least squares to fit the line, a technique called generalized linear modeling is used.

In generalized linear models, a method called maximum likelihood is used to find the parameters of the model (in this case, the intercept and the regression coefficient of height) that give the best fit to the data. While the details of maximum likelihood estimation are beyond the scope of this course, it is closely related to weighted least squares in this class of problems. Maximum likelihood estimators (often abbreviated as MLEs) are, under fairly general conditions, guaranteed to be the "best" (in the sense of having the smallest standard errors) in large samples. In small samples there is no guarantee that MLEs are optimal, but in practice MLEs seem to work well. In most cases, the calculations must be done numerically – there are no simple formulae as in simple linear regression.⁴
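To make the numerical idea concrete, here is a minimal sketch of maximum likelihood for the two-parameter model logit(p) = β₀ + β₁·height, fit by Newton-Raphson to the clutch data from the previous section. This is an illustration only, not JMP's implementation; the fixed iteration count, zero starting values, and the clamp on the linear predictor are simplistic choices made for the sketch:

```python
import math

# (height, clutch size, number fledged) -- the hypothetical data from the text
data = [(2.0, 4, 0), (3.0, 3, 0), (2.5, 5, 0), (3.3, 3, 2), (4.7, 4, 1),
        (3.9, 2, 0), (5.2, 4, 2), (10.5, 5, 5), (4.7, 4, 2), (6.8, 5, 3),
        (7.3, 3, 3), (8.4, 4, 3), (9.2, 3, 2), (8.5, 4, 4), (10.0, 3, 3),
        (12.0, 6, 6), (15.0, 4, 4), (12.2, 3, 3), (13.0, 5, 5), (12.9, 4, 4)]

def fit_logistic(data, iters=25):
    """Fit logit(p) = b0 + b1*x to binomial data by Newton-Raphson
    (maximum likelihood); returns the estimated intercept and slope."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0   # score vector and information matrix
        for x, n, y in data:
            z = max(-30.0, min(30.0, b0 + b1 * x))   # guard against overflow
            p = 1.0 / (1.0 + math.exp(-z))
            w = n * p * (1.0 - p)
            g0 += y - n * p
            g1 += (y - n * p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: beta += I^(-1) g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(data)
print(round(b0, 2), round(b1, 2))
```

With a positive estimated slope, the fitted curve rises from near 0 at low nests toward 1 at tall ones, matching the empirical log-odds plot.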
In order to fit a logistic regression using maximum likelihood estimation, the data must be in a standard format. In particular, both successes and failures must be recorded, along with a classification variable that is nominally scaled. For example, the first clutch (at 2.0 m) will generate two lines of data – one for the successful fledges and one for the unsuccessful fledges. If the count for a particular outcome is zero, it can be omitted from the data table, but I prefer to record a value of 0 so that there is no doubt that all eggs were examined and none of this outcome were observed.
A new column was created in JMP for the number of eggs that failed to fledge and, after stacking the revised dataset, the dataset in JMP that can be used for logistic regression looks like:⁵
⁴ Other methods that are quite popular are non-iterative weighted least squares and discriminant function analysis. These are beyond the scope of this course.
⁵ This stacked data is available in the eggsfledge2.jmp dataset available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The Analyze->Fit Y-by-X platform is used to launch simple logistic regression:
Note that the Outcome is the actual Y variable (and is nominally scaled) while the Count column simply indicates how many of this outcome were observed. The X variable is Height as before. JMP knows this is a logistic regression by the combination of a nominally or ordinally scaled Y variable and a continuously scaled X variable, as seen by the reminder at the left of the platform dialogue box.
This gives the output:
The first point to note is that most computer packages make arbitrary decisions on what is a “success” and what is a “failure” when fitting the logistic regression. It is important to always look at the output carefully to see what has been defined as a success. In this case, at the bottom of the output, JMP has indicated that fledged is considered a “success” and not fledged a “failure”. If it had reversed the roles of these two
categories, everything would be “identical” except reversed appropriately.
Second, rather bizarrely, the actual data points plotted by JMP really don’t have any meaning! According to the JMP help screens:
Markers for the data are drawn at their x-coordinate, with the y position jittered randomly within the range corresponding to the response category for that row.
So if you do the analysis on the exact same data, the data points are jittered and will look different even though the fit is the same. The explanation on the JMP support pages on the web states: 6
The exact vertical placement of points in the logistic regression plots (for instance, on pages 308 and 309 of the JMP User’s Guide, Version 2, and pages 114 and 115 of the JMP Statistics and Graphics Guide, Version 3) has no particular interpretation. The points are placed midway between curves so as to assure their visibility. However, the location of a point between a particular set of curves is important. All points between a particular set of curves have the same observed value for the dependent variable. Of course, the horizontal placement of each point is meaningful with respect to the horizontal axis.
This is rather unfortunate, to say the least! It means that the user must create a nice plot by hand. This plot should show the estimated proportions as a function of height, with the fitted curve then overdrawn.
Fortunately, the fitted curves are correct (whew). The curves presented don’t look linear only because JMP has transformed back from the log-odds scale to the regular probability scale. A linear curve on the log-odds scale has a characteristic “S” shape on the regular probability scale with the ends of the curve flattening out at 0 and 1. Using the Cross Hairs tool, you can see that a height of 5 m gives a predicted probability of success (fledged) of about .39; by 7 m the estimated probability of success has risen to about .73.
The table of parameter estimates gives the estimated fit on the log-odds scale:
LO = −4.03 + .72(Height)
Substituting in the value Height = 5 gives an estimated log-odds of −.43, which on the regular probability scale corresponds to .394, as seen before from using the cross hairs.
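The back-transformation from the log-odds scale to the probability scale can be sketched in a few lines, using the fitted intercept and slope from the table above (the helper names are my own):

```python
import math

def inv_logit(lo):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-lo))

# Fitted model from the table above: log-odds of fledging = -4.03 + .72 * height
def predicted_p_fledge(height):
    return inv_logit(-4.03 + 0.72 * height)

print(round(predicted_p_fledge(5), 3))  # log-odds -0.43 -> probability about 0.394
```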
The coefficient associated with height is interpreted as the increase in the log-odds of fledging when height is increased by 1 m.
As in simple regression, the precision of the estimates is given by the standard error. An approximate 95% confidence interval for the coefficient associated with height is found in the usual fashion, i.e. estimate ± 2se. 7 This confidence interval does NOT include 0; therefore there is good evidence that the probability of fledging is not constant over the various heights.
6 http://www.jmp.com/support/techsup/notes/001897.html
7 It is not possible to display the 95% confidence intervals in the Analyze->Fit Y-by-X platform output by right-clicking in the table (don’t ask me why not). However, if the Analyze->Fit Model platform is used to fit the model, then right-clicking in the Estimates table does make the 95% confidence intervals available.
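The estimate ± 2se recipe can be sketched as follows; the standard error below is a made-up value, since the notes do not quote it for this fit:

```python
# Approximate 95% Wald interval for a logistic-regression coefficient:
# estimate +/- 2 * se.  The se here is hypothetical, for illustration only.
def wald_ci(estimate, se):
    return (estimate - 2 * se, estimate + 2 * se)

low, high = wald_ci(0.72, 0.20)   # slope for height; hypothetical se of 0.20
print(round(low, 2), round(high, 2))
print("excludes zero:", low > 0 or high < 0)
```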
Similarly, the p-value is interpreted in the same way – how consistent are the data with the hypothesis of NO effect of height upon the survival rate. Rather than the t-test seen in linear regression, maximum likelihood methods often construct the test statistics in a different fashion (called χ2 likelihood ratio tests). The test statistic is not particularly of interest – only the final p-value matters. In this case, it is well below α = .05, so there is good evidence that the probability of success is not constant across heights. As in all cases, statistical significance is no guarantee of biological relevance.
In theory, it is possible to obtain prediction intervals and confidence intervals for the MEAN probability of success at new values of X – JMP does not provide these in the Analyze->Fit Y-by-X platform with logistic regression. It does do Inverse Predictions and can give confidence bounds on the inverse prediction, which require the confidence bounds to be computed, so it is a mystery to me why the confidence intervals for the mean probability of success at future X values are not provided.
The Analyze->Fit Model platform can also be used to fit a logistic regression in the same way:
Be sure to specify the Y variable as a nominally or ordinally scaled variable; the count as the frequency variable; and the X variables in the usual fashion. The Analyze->Fit Model platform automatically switches to indicate that a logistic regression will be run.
The same information as previously seen is shown again. But you can now obtain 95% confidence
intervals for the parameter estimates, and there are additional options under the red-triangle pop-down menu. These features will be explored in more detail in further examples.
Lastly, the Analyze->Fit Model platform using the Generalized Linear Model option in the personality box in the upper right corner can also be used to fit this model. Specify a binomial distribution with the logit link. You get similar results with more goodies under the red-triangles, such as confidence intervals for the MEAN probability of success that can be saved to the data table, residual plots, and more. Again, these will be explored in more detail in the examples.
7.2 Data Structures
There are two common ways in which data can be entered for logistic regression: either as individual observations or as grouped counts.
If individual data points are entered, each line of the data file corresponds to a single individual. The columns correspond to the predictors (X), which can be continuous (interval or ratio scales) or classification variables (nominal or ordinal). The response (Y) must be a classification variable with any two possible outcomes 8 . Most packages will arbitrarily choose one of these classes to be the success – often this is the first category when sorted alphabetically. I would recommend that you do NOT code the response variable as 0/1 – it is far too easy to forget that the 0/1 correspond to nominally or ordinally scaled variables and not to continuous variables.
As an example, suppose you wish to predict if an egg will hatch given the height in a tree. The data structure for individuals would look something like:
Egg   Height   Outcome
  1       10   hatch
  2       15   not hatch
  3        5   hatch
  4       10   hatch
  5       10   not hatch
. . .
Notice that even though three eggs were all at 10 m height, separate data lines for each of the three eggs appear in the data file.
In grouped counts, each line in the data file corresponds to a group of events with the same predictor (X) variables. Often researchers record the number of events and the number of successes in two separate columns, or the number of successes and the number of failures in two separate columns. These data must be converted to two rows per group – one for the successes and one for the failures – with one variable representing
8 In more advanced classes this restriction can be relaxed.
the outcome and a second variable representing the frequency of this event. The outcome will be the Y variable, while the count will be the frequency variable. 9
For example, the above data could be originally entered as:
Height   Hatch   Not Hatch
    10       2           1
    15       0           1
     5       1           0
. . .
but must be translated (e.g. using the Tables → Stack command) to:
Height   Outcome     Count
    10   Hatch           2
    10   Not Hatch       1
    15   Hatch           0
    15   Not Hatch       1
. . .
     5   Hatch           1
     5   Not Hatch       0
While it is not required that counts of zero have data lines present, it is good statistical practice to remind yourself that you did look for failures but failed to find any.
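The Tables → Stack translation above can be mimicked in plain Python; a minimal sketch (the column names are illustrative):

```python
# Reshape "wide" counts (one row per height, hatch / not-hatch columns)
# into the stacked format logistic-regression routines expect:
# one row per (height, outcome) pair with a frequency column.
wide = [
    {"height": 10, "hatch": 2, "not_hatch": 1},
    {"height": 15, "hatch": 0, "not_hatch": 1},
    {"height": 5,  "hatch": 1, "not_hatch": 0},
]

stacked = []
for row in wide:
    # Keep zero counts so there is no doubt that all eggs were examined.
    stacked.append({"height": row["height"], "outcome": "Hatch", "count": row["hatch"]})
    stacked.append({"height": row["height"], "outcome": "Not Hatch", "count": row["not_hatch"]})

for r in stacked:
    print(r)
```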
7.3 Assumptions made in logistic regression
Many of the assumptions made for logistic regression parallel those made for ordinary regression with obvious modifications.
1. Check sampling design. In these course notes it is implicitly assumed that the data are collected either as a simple random sample or under a completely randomized design experiment. This implies that the units selected must be a random sample (with equal probability) from the relevant populations, or complete randomization during the assignment of treatments to experimental units. The experimental unit must equal the observational unit (no pseudo-replication), and there must be no pairing, blocking, or stratification.
It is possible to generalize logistic regression to cases where pairing, blocking, or stratification took place (for example, in case-control studies), but these are not covered during this course.
9 Refer to the section on Poisson regression for an alternate way to analyze this type of data where the count is the response variable.
Common ways in which these assumptions are violated include:
• Collecting data under a cluster design. For example, classrooms are selected at random from a school district and individuals within a classroom are then measured. Or herds or schools of animals are selected and all individuals within the herd or school are measured.
• Quota samples are used to select individuals with certain classifications. For example, exactly 100 males and 100 females are sampled and you are trying to predict sex as the outcome measure.
2. No outliers. This is usually pretty easy to check. A logistic regression only allows two categories within the response variable. If there are more than two categories of responses, this may represent a typographical error and should be corrected. Or, categories should be combined into larger categories. It is possible to generalize logistic regression to the case of more than two possible outcomes. Please contact a statistician for assistance.
3. Missing values are MCAR. The usual assumption as listed in earlier chapters.
4. Binomial distribution. This is a crucial assumption. A binomial distribution is appropriate when there is a fixed number of trials at a given set of covariates (could be 1 trial); there is a constant probability of “success” within that set of trials; each trial is independent; and the number of successes in the n trials is measured.
Common ways in which this assumption is violated are:
• Items within a set of trials do not operate independently of each other. For example, subjects could be litter mates, twins, or share environmental variables. This can lead to over- or under-dispersion.
• The probability of success within the set of trials is not constant. For example, suppose a set of trials is defined by weight class. Not everyone in the weight class is exactly the same weight and so their probability of “success” could vary. Animals don’t all have exactly the same survival rates.
• The number of trials is not fixed. For example, sampling could occur until a certain number of successes occur. In this case, a negative binomial distribution would be more appropriate.
5. Independence among subjects. See above.
7.4 Example: Space Shuttle - Single continuous predictor
In January 1986, the space shuttle Challenger was destroyed on launch. Subsequent investigations showed that an O-ring, a piece of rubber used to seal two segments of the booster rocket, failed, allowing highly flammable fuel to leak, ignite, and destroy the ship. 10
As part of the investigation, the following chart of previous launches and the temperature at which each shuttle was launched was presented:
10 Refer to http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster.
The raw data is available in the JMP file spaceshuttleoring.jmp available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Notice that the raw data has a single line for each previous launch even though there are multiple launches at some temperatures. The X variable is temperature and the Y variable is the outcome – either f for failure of the O-ring, or OK for a launch where the O-ring did not fail.
With the data in single-observation form, it is impossible to make a simple plot of the empirical logistic function. If some of the temperatures were pooled, you might be able to do a simple plot.
The Analyze->Fit Y-by-X platform was used and gave the following results:
First notice that JMP treats a failure f as a “success”, and will model the probability of failure as a function
of temperature. This is why it is important that you examine computer output carefully to see exactly what a package is doing.
The graph showing the fitted logistic curve must be interpreted carefully. While the plotted curve is correct, the actual data points are randomly placed – groan – see the notes in the previous section.
The estimated model is:
logit(failure) = 10.875 − .17(temperature)
So, the log-odds of failure decrease by .17 (se .083) units for every degree (°F) increase in launch temperature. Conversely, the log-odds of failure increase by .17 for every degree (°F) decrease in temperature.
The p-value for no effect of temperature is just below α = .05.
Using the same reasoning as was done for ordinary regression, the odds of failure increase by a factor of e^.17 = 1.18, i.e. almost an 18% increase per degree drop.
To predict the failure rate at a given temperature, a two-stage process is required. First, estimate the log-odds by substituting in the X values of interest. Second, convert the estimated log-odds to a probability using
p(x) = e^LO(x) / (1 + e^LO(x)) = 1 / (1 + e^−LO(x)).
The actual launch was at 32 °F. While it is extremely dangerous to try and predict outside the range of observed data, the estimated log-odds of failure of the O-ring are 10.875 − .17(32) = 5.43, and then p(failure) = e^5.43 / (1 + e^5.43) = .99+, i.e. well over 99%!
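The two-stage prediction, and the odds-ratio interpretation above, can be checked numerically; a sketch using the fitted coefficients from the text:

```python
import math

# Fitted model from the notes: log-odds of O-ring failure
# LO = 10.875 - 0.17 * temperature (temperature in degrees F).
def p_failure(temp_f):
    lo = 10.875 - 0.17 * temp_f
    return math.exp(lo) / (1.0 + math.exp(lo))

# Odds of failure multiply by e^0.17 for each one-degree DROP in temperature.
print(round(math.exp(0.17), 3))   # about 1.18, i.e. almost an 18% increase
print(round(p_failure(32), 3))    # about 0.996 -- well over 99%
```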
It is possible to find confidence bounds for these predictions – the easiest way is to create some “dummy” rows in the data table corresponding to the future predictions with the response variable left blank. Use JMP’s Exclude Rows feature to exclude these rows from the model fit. Then use the red-triangle to save predictions and confidence bounds back to the data table.
The Analyze->Fit Model platform gives the same results with additional analysis options that we will examine in future examples.
The Analyze->Fit Model platform using the Generalized Linear Model option also gives the same results with additional analysis options. For example, it is possible to compute confidence intervals for the predicted probability of success at the new X. Use the pop-down menu beside the red-triangle:
The predicted values and 95% confidence intervals for the predicted probability are stored in the data table:
These are found by finding the predicted log-odds and a 95% confidence interval for the predicted log-odds, and then inverting the confidence interval endpoints in the same way as the predicted probabilities are obtained from the predicted log-odds.
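A sketch of that endpoint inversion, using hypothetical values for the predicted log-odds and its standard error (the notes do not quote them directly):

```python
import math

def inv_logit(lo):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-lo))

# Build the approximate 95% interval on the log-odds scale, then
# back-transform each endpoint.  Both numbers below are hypothetical.
lo_hat, se = 5.435, 3.0
lower, upper = lo_hat - 2 * se, lo_hat + 2 * se
print(round(inv_logit(lower), 3), round(inv_logit(upper), 3))
```

Because the transformation is monotone, the endpoints of the interval on the log-odds scale map directly to the endpoints on the probability scale.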
While the predicted value and the 95% confidence interval are available, for some odd reason the se of the predicted probability is not presented – this is odd as it is easily computed. The confidence intervals are quite wide given that there were only 24 data values and only a few failures.
It should be noted that only predictions of the probability of success and confidence intervals for the
probability of success are computed. These intervals apply to all future subjects that have the particular value of the covariates. Unlike the case of linear regression, it really doesn’t make sense to predict individual outcomes as these are categories. It is sensible to look at which category is most probable and then use this as a “guess” for the individual response, but that is about it. This area of predicting categories for individuals is called discriminant analysis and has a long history in statistics. There are many excellent books on this topic.
7.5 Example: Predicting Sex from physical measurements - Multiple continuous predictors
The extension to multiple continuous X variables is immediate. As before, there are now several predictors. It is usually highly unlikely to have multiple observations with exactly the same set of X values, so the data sets usually consist of individual observations.
Let us proceed by example using the Fitness data set available in the JMP sample data library. This dataset has variables on age, weight, and measurements of performance taken during a fitness assessment. In this case we will try and predict the sex of the subject given the various attributes.
As usual, before doing any computations, examine the data for unusual points. Look at pairwise plots, the pattern of missing values, etc.
It is important that the data be collected under a completely randomized design or simple random sample. If your data are collected under a different design, e.g. a cluster design, please seek suitable assistance.
Use the Analyze->Fit Model platform to fit a logistic regression trying to predict sex from the age, weight, oxygen consumption and run time:
This gives the summary output:
First determine which category is being predicted. In this case, the sex = f category will be predicted.
The Whole Model Test examines if there is evidence of any predictive ability in the 4 predictor variables. The p-value is very small, indicating that there is predictive ability.
Because we have NO categorical predictors, the Effect Tests can be ignored for now. The Parameter Estimates look for the marginal contribution of each predictor to predicting the probability of being a Female. Just like in regular regression, these are MARGINAL contributions, i.e. how much would the log-odds for the probability of being female change if this variable changed by one unit and all other variables remained in the model and did not change. In this case, there is good evidence that weight is a good predictor (not surprisingly), but also some evidence that oxygen consumption may be useful. 11 If you look at the dot plots for the weight for the two sexes and for the oxygen consumption for the two sexes, the two groups seem to be separated on these variables:
11 The output above actually appears to be a bit contradictory. The chi-square value for the effect of weight is 17 with a p-value < .0001. Yet the 95% confidence interval for the coefficient associated with weight ranges from (−1.57 → .105), which INCLUDES zero, and so would not be statistically significant! It turns out that JMP has mixed two (asymptotically) equivalent methods in this one output. The chi-square value and p-value are computed using a likelihood ratio test (a model with and without this variable is fit and the difference in fit is measured), while the confidence intervals are computed using a Wald approximation (estimate ± 2(se)). In small samples, the sampling distribution for an estimate may not be very symmetric or close to normally shaped, and so the Wald intervals may not perform well.
The estimated coefficient for weight is −.73. This indicates that the log-odds of being female decrease by .73 for every additional unit of weight, all other variables held fixed. This often appears in scientific reports as the adjusted effect of weight – the adjusted term implies that it is the marginal contribution. Confidence intervals for the individual coefficients (for predicting the log-odds of being female) are interpreted in the same way.
Just like in regular regression, collinearity can be a problem in the X values. There is no easy test for collinearity in logistic regression in JMP, but diagnostics similar to those in ordinary regression are becoming available.
Before dropping more than one variable, it is possible to test if two or more variables can be dropped. Use the Custom Test options from the drop-down menu:
Complete the boxes in a similar way as in ordinary linear regression. For example, to test if both age and runtime can be dropped:
which gives:
It appears safe to drop both variables.
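The likelihood-ratio computation behind such a test can be sketched as follows; the two log-likelihood values are hypothetical, and the one-line p-value formula holds only for the 2-degree-of-freedom case (two dropped terms), where the chi-square survival function is exp(-x/2):

```python
import math

# Likelihood-ratio test for dropping two predictors at once.
# The fitted log-likelihoods below are hypothetical, for illustration only.
ll_full, ll_reduced = -10.0, -10.9

lr_stat = 2 * (ll_full - ll_reduced)        # 2 * difference in log-likelihood
p_value = math.exp(-lr_stat / 2)            # chi-square survival, df = 2 ONLY
print(round(lr_stat, 2), round(p_value, 3)) # a large p-value -> safe to drop both
```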
Just as in regular regression, you can fit quadratic and product terms to try and capture some non-linearity in the log-odds. This affects the interpretation of the estimated coefficients in the same way as in ordinary regression. The simpler model involving weight and oxygen consumption, their quadratic terms, and their cross-product term was fit using the Analyze->Fit Model platform:
Surprisingly, the model has problems:
Ironically, it is because the model is too good a fit. It appears that you can discriminate perfectly between men and women by fitting this model. Why does a perfect fit cause problems? The reason is that if p(sex = f) = 1, the log-odds is +∞, and it is hard to get a predicted value of ∞ from an equation without some terms also being infinite.
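To see numerically why a perfect fit breaks the estimation, here is a small Python sketch (not part of the JMP analysis itself) showing that the log-odds diverge as the fitted probability approaches 1, so no finite coefficients can reproduce a perfectly separated fit:

```python
import math

def logit(p):
    """Log-odds corresponding to a probability p."""
    return math.log(p / (1 - p))

# As the fitted probability approaches 1, the log-odds grow without bound.
for p in (0.9, 0.99, 0.999999):
    print(f"p = {p}: log-odds = {logit(p):.2f}")
```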
If you plot the weight against oxygen consumption using different symbols for males and females, you can see the near-complete separation based on simply looking at oxygen consumption and weight, without the need for quadratic and cross-product terms:
I'll continue by fitting just a model with linear effects of weight and oxygen consumption as an illustration. Use the Analyze->Fit Model platform to fit this model with just the two covariates:
Both covariates are now statistically significant and cannot be dropped.
The goodness-of-fit statistic is computed in two ways (which are asymptotically equivalent), but both are tedious to compute by hand. The deviance of a model is a measure of how well the model performs. As there are 31 data points, you could get a perfect fit by fitting a model with 31 parameters; this is exactly what happens if you try to fit a line through 2 points, where 2 parameters (the slope and intercept) will fit exactly two data points. A measure of goodness of fit is then found for the model in question based on the fitted parameters of this model. In both cases, the measure of fit is called the deviance, which is simply twice the negative of the log-likelihood, which in turn is related to the probability of observing this data given the parameter values. The difference in deviances is the deviance goodness-of-fit statistic. If the current model is a good model, the difference in deviances should be small (this is the column labeled chi-square). There is no simple calibration of deviances 12, so a p-value must be found which says how large this difference is. The p-value of .96 indicates that the difference is actually quite small; almost 96% of the time you would get a larger difference in deviances.
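The calibration of a deviance difference against a chi-square distribution can be sketched in Python. The deviance value of 17.0 below is only a placeholder for illustration (the actual statistic is read off the JMP output); the 28 df come from footnote 12:

```python
from scipy.stats import chi2

deviance_stat = 17.0   # difference in deviances (assumed value for illustration)
df = 31 - 3            # data points minus fitted parameters

# Upper-tail probability: chance of a larger difference if the model fits.
p_value = chi2.sf(deviance_stat, df)
print(f"p-value = {p_value:.2f}")
```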
Similarly, the row labeled the Pearson goodness-of-fit is based on the same idea. A perfect fit is obtained with a model of 31 parameters. A comparison of the observed and predicted values is found for the model with 3 parameters. How big is the difference in fit? How unusual is it?
NOTE that for goodness-of-fit tests, you DO NOT WANT TO REJECT the null hypothesis. Hence p-values for a goodness-of-fit test that are small (e.g. less than α = .05) are NOT good!
12 The df = 31 − 3 = 28.
So for this model, there is no reason to be upset with the fit.
The residual plots look strange, but this is an artifact of the data:
Along the bottom axis is the predicted probability of being female. Now consider a male subject. If the predicted probability of being female is small (e.g. close to 0 because the subject is quite heavy), then there is an almost perfect agreement of the observed response with the predicted probability. If you compute a residual by defining male=0 and female=1, then the residual here would be computed as (obs − predicted)/se(predicted) = (0 − 0)/blah = 0. This corresponds to points near the (0,0) area of the plots.
What about males whose predicted probability of being female is almost .7 (which corresponds to observation 15)? This is a poor prediction, and the residual is computed as (0 − .7)/se(predicted), which is approximately equal to (0 − .7)/√(.7 × .3) ≈ −1.53, with some further adjustment to compute the se of the predicted value. This corresponds to the point near (.7, −1.5).
On the other hand, a female with a predicted probability of being female of .7 will have a residual equal to approximately (1 − .7)/√(.7 × .3) ≈ .65.
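The residual arithmetic above can be checked with a few lines of Python; this is a rough sketch that omits JMP's further adjustment for the se of the predicted value:

```python
import math

def crude_residual(obs, p):
    """(observed - predicted) / sqrt(p(1-p)); JMP applies a further
    adjustment when it uses the se of the predicted value."""
    return (obs - p) / math.sqrt(p * (1 - p))

# Male (coded 0) with predicted probability of being female of .7:
print(round(crude_residual(0, 0.7), 2))   # -1.53

# Female (coded 1) with the same predicted probability:
print(round(crude_residual(1, 0.7), 2))   # 0.65
```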
Hence the two lines on the graph correspond to males and females respectively. What you want to see is
this two-parallel-line system, particularly with few males near a probability of being female close to 1, and few females with a probability of being female close to 0.
There are four possible residual plots available in JMP; they are all based on a similar procedure with minor adjustments in the way they compute a standard error. Usually, all four plots are virtually the same; anomalies among the plots should be investigated carefully.
7.6 Examples: Lung Cancer vs. Smoking; Marijuana use of students based on parental usage - Single categorical predictor
7.6.1 Retrospective and Prospective odds-ratios
In this section, the case where the predictor (X) variable is also a categorical variable will be examined. As seen in multiple linear regression, categorical X variables are handled by the creation of indicator variables. A categorical variable with k classes will generate k − 1 indicator variables. As before, there are many ways to define these indicator variables, and the user must examine the computer software carefully before using any of the raw estimated coefficients associated with a particular indicator variable.
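As a sketch of one common scheme (reference-cell coding; JMP's own internal coding differs, which is exactly why the raw coefficients need care), a categorical variable with k = 4 classes generates k − 1 = 3 indicators:

```python
levels = ["control", "low", "medium", "high"]   # k = 4 classes
reference = "control"

def indicators(value):
    """Return the k-1 = 3 indicator variables for one observation,
    using the 'control' class as the reference cell."""
    return [1 if value == lev else 0 for lev in levels if lev != reference]

print(indicators("medium"))   # [0, 1, 0]
print(indicators("control"))  # [0, 0, 0] -- the reference level
```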
It turns out that there are multiple ways to analyze such data, all of which are asymptotically equivalent. Also, this particular topic is usually divided into two sub-categories: problems where there are only two levels of the predictor variable, and cases where there are three or more levels of the predictor variable. This division actually has a good reason: it turns out that in the case of 2 levels for the predictor and 2 levels for the response variable (the classic 2 × 2 contingency table), it is possible to use a retrospective study and actually get valid estimates of the prospective odds ratio.
For example, suppose you were interested in looking at the relationship between smoking and lung cancer. In a prospective study, you could randomly select 1000 smokers and 1000 non-smokers from their respective populations and follow them over time to see how many developed lung cancer. Suppose you obtained the following results:
Cohort        Lung Cancer   No lung cancer
Smokers               100              900
Non-smokers            10              990
Because this is a prospective study, it is quite valid to say that the probability of developing lung cancer if you are a smoker is 100/1000 and the probability of developing lung cancer if you are not a smoker is 10/1000. The odds of developing cancer if you are a smoker are 100:900 and the odds of developing cancer if you are a non-smoker are 10:990. The odds ratio of developing cancer of a smoker vs. a non-smoker is then

OR(LC)_{S vs. NS} = (100 : 900) / (10 : 990) = 11 : 1
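The prospective odds ratio can be verified directly from the counts:

```python
# Prospective study: among 1000 smokers, 100 developed lung cancer;
# among 1000 non-smokers, 10 did (counts from the table above).
odds_smoker = 100 / 900
odds_nonsmoker = 10 / 990

odds_ratio = odds_smoker / odds_nonsmoker
print(round(odds_ratio, 1))   # 11.0
```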
But a prospective study takes too long, so an alternate way of studying the problem is to do a retrospective study. Here samples of 1000 people with lung cancer and 1000 people without lung cancer are selected at random from their respective populations. For each subject, you determine if they smoked in the past. Suppose you get the following results:
Lung Cancer   Smoker   Non-smoker
yes              810          190
no               280          720
Now you can't directly find the probability of lung cancer if you are a smoker. It is NOT simply 810/(810 + 280), because you selected equal numbers of people with and without lung cancer rather than sampling smokers at random from the population, where generally less than 30% of people smoke. Unless that proportion is known, it is impossible to compute the probability of getting lung cancer if you are a smoker or non-smoker directly, and so it would seem that finding the odds of lung cancer would be impossible.
However, not all is lost. Let P(smoker) represent the probability that a randomly chosen person is a smoker; then P(non-smoker) = 1 − P(smoker). Bayes' Rule 13 gives:

P(lung cancer | smoker) = P(smoker | lung cancer) P(lung cancer) / P(smoker)
P(no lung cancer | smoker) = P(smoker | no lung cancer) P(no lung cancer) / P(smoker)
P(lung cancer | non-smoker) = P(non-smoker | lung cancer) P(lung cancer) / P(non-smoker)
P(no lung cancer | non-smoker) = P(non-smoker | no lung cancer) P(no lung cancer) / P(non-smoker)
This doesn't appear to be helpful, as P(smoker) and P(non-smoker) are unknown. But look at the odds-ratio of getting lung cancer of a smoker vs. a non-smoker:

OR(LC)_{S vs. NS} = ODDS(lung cancer if smoker) / ODDS(lung cancer if non-smoker)
                  = [P(lung cancer | smoker) / P(no lung cancer | smoker)] / [P(lung cancer | non-smoker) / P(no lung cancer | non-smoker)]

If you substitute in the above expressions, you find that:

OR(LC)_{S vs. NS} = [P(smoker | lung cancer) / P(smoker | no lung cancer)] / [P(non-smoker | lung cancer) / P(non-smoker | no lung cancer)]
which can be computed from the retrospective study. Based on the above table, we obtain

OR(LC)_{S vs. NS} = (.810 / .280) / (.190 / .720) ≈ 11 : 1
This symmetry in odds-ratios between prospective and retrospective studies only works in the 2×2 case for simple random sampling.
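The retrospective calculation can be checked the same way; since the P(smoker) and P(lung cancer) terms cancel, only the exposure proportions within each disease group are needed:

```python
# Retrospective study: rows are lung cancer yes/no, columns smoker/non-smoker.
lc_smoker, lc_non = 810, 190
no_smoker, no_non = 280, 720

# OR from the conditional distributions of exposure given disease status.
or_retro = (lc_smoker / no_smoker) / (lc_non / no_non)
print(round(or_retro, 2))   # 10.96, i.e. roughly 11:1
```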
13 See http://en.wikipedia.org/wiki/Bayes_rule
7.6.2 Example: Parental and student usage of recreational drugs
A study was conducted where students at a college were asked about their personal use of marijuana and if their parents used alcohol and/or marijuana. 14 The following data is a collapsed version of the table that appears in the report:
                 Student Usage
Parental Usage    Yes     No
Yes               125     85
No                 94    141
This is a retrospective analysis, as the students are interviewed and the past behavior of parents is recorded. The data are entered in JMP in the usual format. There will be four lines, and three variables corresponding to parental usage, student usage, and the count.
Start using the Analyze->Fit Y-by-X platform:
14 "Marijuana Use in College," Youth and Society, 1979, 323-334.
but don't forget to specify the Count as the frequency variable. It doesn't matter which variable is entered as the X or Y variable. Note that JMP actually will switch from the logistic platform to the contingency platform 15, as noted by the diagram at the lower left of the dialogue box.
The mosaic plot shows the relative percentages in each of the student usage groups:
15 Refer to the chapter on Chi-square tests.
The contingency table (after selecting the appropriate percentages for display from the red-triangle pop-down menu) 16
16 In my opinion, I would never display percentages to more than integer values. Displays such as 42.92% are just silly as they imply a precision of 1 part in 10,000, but you only have 219 subjects in the first row.
The contingency table approach tests the hypothesis of independence between the X and Y variables, i.e. is the proportion of parents who use marijuana the same for the two groups of students:
As explained in the chapter on chi-square tests, there are two (asymptotically) equivalent ways to test this hypothesis: the Pearson chi-square statistic and the likelihood-ratio statistic. In this case, you would come to the same conclusion with either.
The odds-ratio is obtained from the red-triangle at the top of the display:
and gives:
It is estimated that the odds of children using marijuana if their parents use marijuana or alcohol are about 2.2 times the odds of children using marijuana if their parents don't use marijuana or alcohol. The 95% confidence interval for the odds-ratio is between 1.51 and 3.22. In this case, you would examine if the confidence interval for the odds-ratio includes the value of 1 (why?) to see if anything interesting is happening.
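The odds-ratio and its 95% confidence interval can be reproduced from the 2 × 2 counts with the usual large-sample (Wald) interval on the log scale; this is a sketch of the standard textbook formula, not necessarily the exact computation JMP performs:

```python
import math

# Counts from the parental/student usage table.
a, b = 125, 85    # parental usage yes: student yes, student no
c, d = 94, 141    # parental usage no:  student yes, student no

odds_ratio = (a / b) / (c / d)

# Wald interval: se of the log odds-ratio is sqrt(1/a + 1/b + 1/c + 1/d).
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

This reproduces the reported interval of roughly (1.51, 3.22).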
If the Analyze->Fit Model platform is used and a logistic regression is fit:
This gives the output:
The coefficient of interest is the effect of student usage on the no/yes log-odds for parental usage. The test for the effect of student usage has a chi-square test value of 17.02 with a small p-value, which matches the likelihood-ratio test from the contingency table approach. Many packages use different codings for categorical X variables (as seen in the section on multiple regression), so you need to check the computer manual carefully to understand exactly what the coefficient measures.
However, the odds-ratio can be found from the red-triangle pop-down menu:
and matches what was seen earlier.
Finally, the Analyze->Fit Model platform can be used with the Generalized Linear Model option:
This gives:
The test for a student effect has the same results as seen previously. But, ironically, there is no easy way to compute the odds ratio. It turns out that, given the parameterization used by JMP, the log-odds ratio is twice the coefficient of the student usage, i.e. twice −.3955. The odds-ratio would be found as the anti-log of this value, i.e. e^{2×−.3955} = .4534, and the confidence interval for the odds-ratio can be found by anti-logging twice the confidence interval limits for this coefficient, i.e. ranging from e^{2×−.5866} = .31 to e^{2×−.2068} = .66. 17 These values are the inverses of the values seen earlier, but this is an artefact of which category is modelled. For example, the odds ratios satisfy

OR_{Parents Y vs. N}(Student Y vs. N) = 1 / OR_{Parents N vs. Y}(Student Y vs. N) = 1 / OR_{Parents Y vs. N}(Student N vs. Y) = OR_{Parents N vs. Y}(Student N vs. Y)
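This back-transformation can be sketched in Python; the coefficient and its confidence limits are read off the JMP output:

```python
import math

# Estimated coefficient of student usage and its 95% confidence limits
# from the JMP Generalized Linear Model output.
coef, lo_coef, hi_coef = -0.3955, -0.5866, -0.2068

# With JMP's coding, the log-odds ratio is twice the coefficient.
odds_ratio = math.exp(2 * coef)
ci = (math.exp(2 * lo_coef), math.exp(2 * hi_coef))
print(f"OR = {odds_ratio:.4f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")

# The reciprocal recovers the odds-ratio seen from the contingency table:
print(round(1 / odds_ratio, 2))   # 2.21
```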
7.6.3 Example: Effect of selenium on tadpole deformities
The generalization of the above to more than two levels of the X variable is straightforward and parallels the analysis of a single-factor CRD ANOVA. Again, we will assume that the experimental design is a completely randomized design or a simple random sample.
17 This simple relationship may not be true with other computer packages. YMMV.
Selenium (Se) is an essential element required for the health of humans, animals, and plants, but becomes a toxicant at elevated concentrations. The most sensitive species to selenium toxicity are oviparous (egg-laying) animals. Ecological impacts in aquatic systems are usually associated with teratogenic effects (deformities) in early life stages of oviparous biota as a result of maternal sequestering of selenium in eggs. In aquatic environments, inorganic selenium, found in water or in sediments, is converted to organic selenium at the base of the food chain (e.g., bacteria and algae) and then transferred through dietary pathways to other aquatic organisms (invertebrates, fish). Selenium also tends to biomagnify up the food chain, meaning that it accumulates to higher tissue concentrations among organisms higher in the food web.
Selenium often occurs naturally in ores and can leach from mine tailings. This leached selenium can make its way to waterways and potentially contaminate organisms.
As a preliminary survey, samples of tadpoles were selected from a control site and from three sites identified as having low, medium, and high concentrations of selenium based on hydrologic maps and expert opinion. These tadpoles were examined, and the number that had deformities was counted.
Here is the raw data:
Site      Tadpoles   Deformed   % deformed
Control        208         56          27%
low            687        243          35%
medium         832        329          40%
high           597        283          47%
The data are entered in JMP in the usual fashion:
Notice that the status of the tadpoles as deformed or not deformed is entered along with the count of each status.
As the selenium level has an ordering, it should be declared as an ordinal scale, and the ordering of the values for the selenium levels should be specified using the Column Information → Column Properties → Value Ordering dialogue box.
The hypothesis to be tested can be written in a number of equivalent ways:
• H: p(deformity) is the same for all levels of selenium.
• H: odds(deformity) is the same for all levels of selenium.
• H: log-odds(deformity) is the same for all levels of selenium.
• H: p(deformity) is independent of the level of selenium. 18
• H: odds(deformity) is independent of the level of selenium.
• H: log-odds(deformity) is independent of the level of selenium.
18 The use of independent in the hypothesis is a bit old-fashioned and not the same as statistical independence.
• H: p_C(D) = p_L(D) = p_M(D) = p_H(D), where p_L(D) is the probability of deformities at low doses, etc.
There are again several ways in which this data can be analyzed.
Start with the Analyze->Fit Y-by-X platform:
This will give a standard contingency table analysis (see chapter on chi-square tests).
The mosaic plot:
seems to show an increasing trend in deformities with increasing selenium levels. It is a pity that JMP doesn't display any measure of precision (such as se bars or confidence intervals) on this plot.
The contingency table (with suitable percentages shown 19)
19 I would display percentages to the nearest integer. Unfortunately, there doesn't appear to be an easy way to control this in JMP.
also gives the same impression.
A formal test for equality of the proportions of deformities across all levels of the factor gives the following test statistics and p-values:
There are two common test statistics: the Pearson chi-square test statistic, which examines the difference between observed and expected counts (see chapter on chi-square tests), and the likelihood-ratio test, which compares the model when the hypothesis is true vs. the model when the hypothesis is false. Both are asymptotically equivalent. There is strong evidence against the hypothesis of equal proportions of deformities.
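The Pearson test can be reproduced from the deformed/not-deformed counts; this is a sketch using scipy, not the JMP computation itself:

```python
from scipy.stats import chi2_contingency

# Deformed / not-deformed counts for the four selenium sites
# (control, low, medium, high) from the tadpole table.
table = [
    [56, 208 - 56],
    [243, 687 - 243],
    [329, 832 - 329],
    [283, 597 - 283],
]

stat, p_value, dof, expected = chi2_contingency(table)
print(f"Pearson X2 = {stat:.1f} on {dof} df, p = {p_value:.2g}")
```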
Unfortunately, most contingency table analyses stop here. A naked p-value, which indicates that there is evidence of a difference but does not tell you where the differences might lie, is not very informative! In the same way that an ANOVA must be followed by a comparison of the means among the treatment levels, this test should be followed by a comparison of the proportions of deformities among the factor levels.
Logistic regression methods will enable us to estimate the relative odds of deformities among the various
classes.
Start with the Analyze->Fit Model platform:
This gives the output:
First, the Effect Tests section tests the hypothesis of equality of the proportions of deformities among the four levels of selenium. The test statistic and p-value match those seen earlier, so there is good evidence of a difference among the deformity proportions at the various levels.
At this point in an ANOVA, a multiple comparison procedure (such as Tukey's HSD) would be used to examine which levels may have different means from the other levels. There is no simple equivalent for logistic regression implemented in JMP. 20 It would be possible to use a simple Bonferroni correction if the number of groups is small.
JMP provides some information on comparisons among the levels. In the Parameter Estimates section, it presents comparisons of the proportions of deformities among the successive levels of selenium. 21 The estimated difference in the log-odds of deformed for the low vs. control group is .39 (se .18). The associated p-value for no difference in the proportion of deformed is .02, which is less than the α = .05 level, so there is evidence of a difference in the proportion of deformed between these two levels.
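The low vs. control comparison can be checked directly from the counts; this is a sketch of the standard large-sample calculation, which closely matches the reported .39 (se .18) and p-value of .02:

```python
import math

# Deformed / not-deformed counts at the control and low-selenium sites.
control_def, control_ok = 56, 208 - 56     # 56 / 152
low_def, low_ok = 243, 687 - 243           # 243 / 444

# Difference in log-odds and its large-sample standard error.
diff = math.log(low_def / low_ok) - math.log(control_def / control_ok)
se = math.sqrt(1/low_def + 1/low_ok + 1/control_def + 1/control_ok)

# Two-sided p-value from the normal approximation.
z = diff / se
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"diff = {diff:.3f}, se = {se:.2f}, p = {p_value:.3f}")
```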
By requesting the confidence interval and the odds-ratio, these can be transformed to the odds scale (rather than the log-odds scale).
20 This is somewhat puzzling as the theory should be straightforward.
21 This is purely a function of the internal coding used by JMP. Other packages may use different codings. YMMV.
Unfortunately, there is no simple mechanism to do more general contrasts in this variant of the Analyze->Fit Model platform.
The Generalized Linear Model platform in the Analyze->Fit Model platform gives more options:
The output you get is very similar to what was seen previously. Suppose that a comparison between the proportions of deformities at the high and control levels of selenium is wanted.
Use the red-triangle pop-down menu to select the Contrast option:
Then select the radio buttons for comparisons among selenium levels:
Click on the + and − to form the contrast. Here you are interested in LO_high − LO_control, where the LO are the log-odds for a deformity.
This gives:
The estimated log-odds ratio is .89 (se .18). This implies that the odds ratio for deformity is e^.89 = 2.43, i.e. the odds of deformity are 2.43 times greater at the high selenium site than at the control site. The p-value is well below α = .05, so there is strong evidence that this effect is real. It is possible to compute the se of the odds ratio using the Delta method – pity that JMP doesn’t do this directly. 22 An approximate 95% confidence interval for the log-odds ratio could be found using the usual rule of estimate ± 2se. The 95% confidence interval for the odds ratio would be found by taking anti-logs of the end points.
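These back-transformations are easy to sketch numerically. A minimal check using only the reported estimate (.89) and se (.18), with the delta-method se from the footnote included for comparison:

```python
import math

# Reported estimate and standard error of the log-odds ratio (from the output above).
log_or = 0.89
se_log_or = 0.18

# Back-transform to the odds-ratio scale: about 2.43.
odds_ratio = math.exp(log_or)

# Approximate 95% CI on the log-odds scale (estimate plus or minus 2 se),
# then anti-log the endpoints to get a CI for the odds ratio.
lo, hi = log_or - 2 * se_log_or, log_or + 2 * se_log_or
ci_odds_ratio = (math.exp(lo), math.exp(hi))   # roughly (1.7, 3.5)

# Delta-method se of the odds ratio: se(theta-hat) * exp(theta-hat), about .44.
se_odds_ratio = se_log_or * odds_ratio
```

The same two lines of arithmetic apply to any estimate reported on the log-odds scale.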
This procedure could then be repeated for any contrast of interest.
7.7 Example: Pet fish survival as function of covariates - Multiple categorical predictors
There is no conceptual problem in having multiple categorical X variables. Unlike the case of a single categorical X variable, there is no simple contingency table approach. However, in more advanced classes, you will learn about a technique called log-linear modeling that can often be used for these types of tables.
Again, before analyzing any dataset, ensure that you understand the experimental design. In these notes, it is assumed that the design is a completely randomized design or a simple random sample. If your design is more complex, please seek suitable help.
A fish is a popular pet for young children – yet the survival rate of many of these fish is likely poor. What factors seem to influence the survival probabilities of pet fish?
A large pet store conducted a customer follow-up survey of purchasers of pet fish. A number of customers were called and asked about the hardness of the water used for the fish (soft, medium, or hard), where the fish was kept (which was then classified into cool or hot locations within the living dwelling), if they had previous experience with pet fish (yes or no), and if the pet fish was alive six months after purchase (yes or no).
Here is the raw data: 23
22 For those so inclined, if θ̂ is the estimator with associated se, then the se of e^θ̂ is found as se(e^θ̂) = se(θ̂) × e^θ̂. In this case, the se of the odds ratio would be .18 × e^.89 = .44.
23 Taken from Cox and Snell, Analysis of Binary Data.
Softness Temp PrevPet N Alive
h c n 89 37
h h n 67 24
m c n 102 47
m h n 70 23
s c n 106 57
s h n 48 19
h c y 110 68
h h y 72 42
m c y 116 66
m h y 56 33
s c y 116 63
s h y 56 29
There are three factors in this study:
• Softness with three levels (h, m or s);
• Temperature with two levels (c or h);
• Previous ownership with two levels (y or n).
This is a factorial experiment because all 12 treatment combinations appear in the experiment.
The experimental unit is the household. The observational unit is also the household. There is no pseudo-replication.
The randomization structure is likely complete. It seems unlikely that people would pick particular individual fish depending on their water hardness, temperature, or previous history of pet ownership.
The response variable is the Alive/Dead status at the end of six months. This is a discrete binary outcome. For example, in the first row of the data table, there were 37 households where the fish was still alive after 6 months and therefore 89 − 37 = 52 households where the fish had died somewhere in the 6 month interval.
One way to analyze this data would be to compute the proportion of households that had fish alive after six months, and then use a three-factor CRD ANOVA on the estimated proportions. However, because each treatment combination is based on a different number of trials (ranging from 48 to 116), the variance of the estimated proportion is not constant. This violates (but likely not too badly) one of the assumptions of ANOVA – that of constant variance in each treatment combination. Also, this seems to throw away data, as these 1000 observations are basically collapsed into 12 cells.
Because the outcome is a discrete binary response and each trial within each treatment is independent, a logistic regression (or generalized linear model) approach can be used.
The data is available in the JMP data file fishsurvive.jmp available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Here is the data file:
To begin with, construct some profile plots to get a feel for what is happening. Create new variables corresponding to the proportion of fish alive and its logit 24 . These are created using the formula editor of JMP in the usual fashion. Also, for reasons which will become apparent in a few minutes, create a variable which is the concatenation of the Temperature and Previous Ownership factor levels. This gives:
24 Recall that logit(p) = log(p/(1 − p)).
Now use the Analyze->Fit Y-by-X platform and specify that the p(alive) or logit(alive) is the response variable, with the WaterSoftness as the factor.
Then specify a matching column for the plot (do this on both plots) using the concatenated variable defined above.
This creates the two profile plots 25 :
The profile plots seem to indicate that p(alive) tends to increase with water softness if this is a first-time pet owner, and (ironically) tends to decrease for a previous pet owner. Of course, without standard error bars, it is difficult to tell if these trends are real or not. The sample sizes in each group are around 100 households. If p(alive) = .5, then the approximate size of a standard error is se = sqrt(.5(.5)/100) = .05, so the approximate 95% confidence intervals are ±.1. It looks as if any trends will be hard to detect with the sample sizes used in this experiment.
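The back-of-the-envelope se above follows the usual binomial formula; a one-line sketch:

```python
import math

def se_proportion(p, n):
    """Approximate standard error of an estimated proportion from n trials."""
    return math.sqrt(p * (1 - p) / n)

# Worst case p(alive) = .5 with roughly 100 households per group, as in the text:
se = se_proportion(0.5, 100)   # = .05
half_width_95 = 2 * se         # approximate 95% interval is plus or minus .1
```

Plugging in the smallest cell (n = 48) instead of 100 gives a noticeably larger se, which is why the unequal cell sizes matter for the ANOVA approach mentioned earlier.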
25 To get the labels on the graph, set the concatenated variable to be a label variable and the rows corresponding to the h softness level to be labeled rows.
In order to fit a logistic-regression model, you must first create a new variable representing the number Dead in each trial 26 , and then stack 27 the Alive and Dead variables, labeling the columns as Status and the Count of each Status to give the final table:
Whew! Now we can finally fit a model to the data and test for various effects. In JMP 6.0 and later, there are two ways to proceed (both give the same answers, but the generalized linear model platform gives a richer set of outputs). Use the Analyze->Fit Model platform:
26 Use a formula to subtract the number alive from the number of trials.
27 Use the Tables->Stack command.
Notice that the response variable is Status and that the frequency variable is the Count of the number of times each status occurs. The model effects box is filled with each factor's main effect, and the second- and third-order interactions.
This gives the following output:
Check to see exactly what is being modeled. In this case, it is the probability of the first level of the responses, logit(alive).
Then examine the effect tests. Just as in ordinary ANOVA modeling, start with the most complex term and work backwards, successively eliminating terms until nothing more can be eliminated. The third-order interaction is not statistically significant. Eliminate this term from the Analyze->Fit Model dialog box, and refit using only main effects and two-factor interactions. 28
Successive terms were dropped to give the final model:
28 Just like regular ANOVA, you can’t examine the p-values of lower-order interaction terms if a higher-order interaction is present. In this case, you can’t look at the p-values for the second-order interactions when the third-order interaction is present in the model. You must first refit the model after the third-order interaction is dropped.
It appears that there is good evidence of an effect of Previous Ownership, marginal evidence of an effect of Temperature, and an interaction between water softness and previous ownership. [Because the two-factor interaction was retained, the main effects of softness and previous ownership must be retained in the model even though it looks as if there is no main effect of softness. Refer to the previous notes on two-factor ANOVA for details.]
Save the predicted p(alive) to the data table 29
29 CAUTION: the predicted p(alive) is saved to the data line even if the actual status is dead.
and plot the observed proportions against the predicted values as seen in regression examples earlier. 30
30 Use the Analyze->Fit Y-by-X platform, and then the Fit Special option to draw a line with slope=1 on the plot.
The plot isn’t bad and seems to have captured most of what is happening. Use the Analyze->Fit Y-by-X platform, with the Matching Column as before, to create the profile plot of the predicted values:
It is a pity that JMP gives you no easy way to annotate the standard error or confidence intervals for the predicted mean p(alive), but the confidence bounds can be saved to the data table.
Unlike regular regression, it makes no sense to make predictions for individual fish.
By using the Contrast pop-down menu, you can estimate the difference in survival rates (but, unfortunately, on the logit scale) as needed. For example, suppose that you wished to estimate the difference in survival rates between fish raised in hard water with no previous experience and fish raised in hard water with previous experience. Use the Contrast pop-down menu:
The contrast is specified by pressing the - and + boxes as needed:
This gives:
Again this is on the logit scale and implies that logit(p(alive))_hn − logit(p(alive))_hy = −.86 (se .22). This is highly statistically significant. But what does this mean? Working backwards, we get:
logit(p(alive)_hn) − logit(p(alive)_hy) = −.86
log[ p(alive)_hn / (1 − p(alive)_hn) ] − log[ p(alive)_hy / (1 − p(alive)_hy) ] = −.86
log[ odds(alive)_hn / odds(alive)_hy ] = −.86
odds(alive)_hn / odds(alive)_hy = e^−.86 = .423
Or, the odds of a fish being alive from a non-owner in hard water are about 1/2 of the odds of a fish being alive from a previous owner in hard water. If you look at the previous graphs, this indeed does match. It is possible to compute a se for this odds ratio, but that is beyond the scope of this course.
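As a rough cross-check, a similar contrast can be computed directly from the raw counts by pooling the two hard-water rows within each ownership level. This ignores how the fitted model adjusts for temperature and shares information across cells, so it should only roughly match the model-based −.86:

```python
import math

# Hard-water rows from the data table, pooled over temperature: (N, Alive).
hard_no_prev = (89 + 67, 37 + 24)    # h,n rows
hard_prev = (110 + 72, 68 + 42)      # h,y rows

def log_odds(n, alive):
    # log of (alive : dead) odds
    return math.log(alive / (n - alive))

diff = log_odds(*hard_no_prev) - log_odds(*hard_prev)   # close to -.86
odds_ratio = math.exp(diff)                             # close to .423
```

Here the pooled empirical contrast lands very near the fitted one, which is reassuring but not guaranteed in general.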
7.8 Example: Horseshoe crabs - Continuous and categorical predictors
As might be expected, combinations of continuous and categorical X variables can also be fit using similar reasoning as the ANCOVA models discussed in the chapter on multiple regression.
If the categorical X variable has k categories, k − 1 indicator variables will be created using an appropriate coding. Different computer packages use different codings, so you must read the package documentation carefully in order to interpret the estimated coefficients. However, the different codings must, in the end, arrive at the same final estimates of effects.
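To make the k − 1 indicator idea concrete, here is a sketch of one common scheme, reference-cell (treatment) coding. This is only one possibility; JMP's default coding differs, so individual coefficients will not transfer directly between packages even though the fitted effects agree:

```python
def reference_coding(levels, value):
    """Build the k-1 indicator variables for one observation, using the
    first level as the reference cell (all indicators zero)."""
    return [1 if value == lev else 0 for lev in levels[1:]]

# Hypothetical 4-level factor (codes as in the crab color variable below):
levels = ["2", "3", "4", "5"]
print(reference_coding(levels, "2"))   # reference level -> [0, 0, 0]
print(reference_coding(levels, "4"))   # -> [0, 1, 0]
```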
Unlike the ANCOVA model with continuous responses, there are no simple plots in logistic regression to examine visually the parallelism of the response or the equality of intercepts. 31 Preliminary plots where data are pooled into various classes so that empirical logistic plots can be made seem to be the best that can be done.
As in the ANCOVA model, there are three models that are usually fit. Let X represent the continuous predictor, let Cat represent the categorical predictor, and p the probability of success. The three models are:
• logit(p) = X Cat X ∗ Cat - different intercepts and slopes for each group;
• logit(p) = X Cat - different intercepts but common slope (on the logit scale);
• logit(p) = X - same slope and intercept for all groups - coincident lines.
The choice among these models is made by examining the Effect Tests for the various terms. For example, to select between the first and second model, look at the p-value of the X ∗ Cat term; to select between the second and third model, examine the p-value for the Cat term.
31 This is a general problem in logistic regression because the responses are one of two discrete categories.
These concepts will be illustrated using a dataset on nesting horseshoe crabs 32 that is analyzed in Agresti’s book. 33
The design of the study is given in Brockmann H.J. (1996). Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102, 1-21. Again it is important to check that the design is a completely randomized design or a simple random sample. As in regression models, you do have some flexibility in the choice of the X settings, but for a particular weight and color, the data must be selected at random from that relevant population.
Each female horseshoe crab had a male resident in her nest. The study investigated other factors affecting whether the female had any other males, called satellites, residing nearby. These other factors include:
• crab color, where 2=light medium, 3=medium, 4=dark medium, 5=dark.
• spine condition, where 1=both good, 2=one worn or broken, or 3=both worn or broken.
• weight
• carapace width
The number of satellites was measured; for this example we will convert the number of satellite males into a presence (number at least 1) or absence (no satellites) value.
A JMP dataset crabsatellites.jmp is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A portion of the datafile is shown below:
32 See http://en.wikipedia.org/wiki/Horseshoe_crab.
33 These are available from Agresti’s web site at http://www.stat.ufl.edu/~aa/cda/sas/sas.html.
Note that the color and spine condition variables should be declared with an ordinal scale despite having numerical codes. The number of satellite males was converted to a presence/absence value using the JMP formula editor.
A preliminary scatter plot of the variables shows some interesting features.
There is a very high positive relationship between carapace width and weight, but there are a few anomalous crabs that should be investigated further, as shown in this magnified plot:
There are three points with weights in the 1200-1300 g range whose carapace widths suggest that the weights should be in the 2200-2300 g range, i.e. a typographical error in the first digit. There is a single crab whose weight suggests a width of 24 cm rather than 21 cm – perhaps a typo in the last digit. Finally, there is one crab which is extremely large compared to the rest of the group. In the analysis that follows, I’ve excluded these five crabs.
The final point also appears to have an unusual number of satellite males compared to the other crabs in the dataset.
The Analyze->Fit Y-by-X platform was then used to examine the differences in means or proportions in the other variables when grouped by the presence/absence score. These are not shown in these notes, but generally demonstrate some separation in the means or proportions between the two groups; however, there is considerable overlap in the individual values between the two groups. The group with no satellite males tends to have darker colors than the presence group, while the distinction between the spine conditions is not clear cut.
Because of the high correlation between carapace size and weight, the weight variable was used as the continuous covariate and the color variable was used as the discrete covariate.
A preliminary analysis divided weight into four classes (up to 2000 g; 2000-2500 g; 2500-3000 g; and over 3000 g). 34 Similarly, a new variable (PA) was created to be 0 (for absence) or 1 (for presence) for the presence/absence of satellite males. The Tables->Summary command was used to compute the mean PA (which then corresponds to the estimated probability of presence) for each combination of weight class and color:
34 The formula commands of JMP were used.
Finally, the Analyze->Fit Y-by-X platform was used to plot the probability of presence by weight class, using the Matching Column to join lines of the same color:
Note that despite the appearance of non-parallelism for the bottom line, the point in the 2500-3000 gram category is based on only 4 crabs and so has very poor precision. Similarly, the point near 100% in the 0-2000 g category is based on 1 data point! The parallelism hypothesis may be appropriate.
A generalized linear model using the Analyze->Fit Y-by-X platform was used to fit the most general model using the raw data:
This gives the results:
The p-value for non-parallelism (refer to the line corresponding to the Color*Weight term) is just over α = .05, so there is some evidence that perhaps the lines are not parallel. The parameter estimates are not interpretable without understanding the coding scheme used for the indicator variables. The goodness-of-fit test does not indicate any problems.
Let us continue with the parallel-slopes model by dropping the interaction term. This gives the following results:
There is good evidence that the log-odds of NO males present decreases as weight increases (i.e. the log-odds of a male being present increases as weight increases), with an estimated change of .0016 in the log-odds per gram increase in weight. There is very weak evidence that the intercepts are different, as the p-value is just under 10%.
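A slope on the log-odds scale is easier to read after exponentiating it over a meaningful change in the covariate. Taking the reported .0016 per gram at face value, a sketch of the conversion:

```python
import math

slope_per_gram = 0.0016   # estimated change in log-odds per gram (reported above)

# Odds multiplier for a 100 g increase in weight:
multiplier_100g = math.exp(slope_per_gram * 100)   # about 1.17
```

That is, each additional 100 g of weight multiplies the odds by roughly 1.17, holding color fixed.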
The goodness-of-fit test seems to indicate no problem. The residual plot must be interpreted carefully, but its appearance was explained in a previous section.
The different intercepts will be retained to illustrate how to graph the final model. Use the red-triangle to save the predicted probabilities to the data table. Note that you may wish to rename the predicted column to remind yourself that it is the probability of NO male that is being predicted.
Use the Analyze->Fit Y-by-X platform to plot the predicted probability of absence against weight, use the group-by option to separate by color, and then fit a spline (a smooth flexible curve) to draw the four curves:
to give the final plot:
Notice that while the models are linear on the log-odds scale, the plots will show a non-linear shape on the regular probability scale.
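This curvature comes directly from the inverse-logit back-transformation. With hypothetical intercept and slope values (illustration only, not the fitted estimates), equal steps in weight give equal steps in log-odds but unequal steps in probability:

```python
import math

def inverse_logit(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical coefficients on the log-odds scale (illustration only).
b0, b1 = 3.0, -0.0016
weights = [1000, 2000, 3000, 4000]
probs = [inverse_logit(b0 + b1 * w) for w in weights]

# The probability steps shrink as the curve flattens near 0 and 1.
steps = [probs[i + 1] - probs[i] for i in range(len(probs) - 1)]
```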
It appears that the color=5 group is different from the rest. If you do a contrast among the intercepts (not really a good idea, as this could be considered data dredging), you indeed find evidence that the intercept (on the log-odds scale) for color 5 may be different from the average of the intercepts for the other three colors:
7.9 Assessing goodness of fit
As is the case in all model fitting in Statistics, it is important that the model provides an adequate fit to the data at hand. Without such an analysis, the inferences drawn from the model may be misleading or even totally wrong!
One of the “flaws” of many published papers is a lack of detail on how the fit of the model to the data was assessed. The logistic regression model is a powerful statistical tool, but it must be used with caution.
Goodness-of-fit methods for logistic regression models are more difficult than similar methods for multiple regression because of the binary (success/failure) nature of the response variable. Nevertheless, many of the methods used in multiple regression have been extended to the logistic regression case.
A nice review paper of the methods of assessing fit is given by
Hosmer, D. W., Taber, S., and Lemeshow, S. (1991). The importance of assessing the fit of logistic regression models: a case study. American Journal of Public Health, 81, 1630–1635. http://dx.doi.org/10.2105/AJPH.81.12.1630
In any statistical model, there are two components – the structural portion (e.g. the fitted curve) and the residual (or noise) portion (e.g. the deviation of the actual values from the fitted curve). The process of building a model focuses on the structural portion. Which variables are important in predicting the response? Is the correct scale used (e.g. should x or x² be used)? After the structural model is fit, the analyst should assess the degree of fit.
Assessing goodness-of-fit (GOF) usually entails two stages. First, computing a statistic that summarizes the general fit of the model to the data. Second, computing statistics for individual observations that assess the (lack of) fit of the model to individual observations and their leverage in the fit. This may identify particular observations that are outliers or have undue influence or leverage on the fit. These points need to be inspected carefully, but it is important to remember that data should not be arbitrarily deleted based solely on a statistical measure.
Let π̂_i represent the predicted probability for case i whose response y_i is either 0 (for failure) or 1 (for success). The deviance of a point is defined as
d_i = sqrt( 2 | ln( π̂_i^y_i (1 − π̂_i)^(1−y_i) ) | )
and is basically a function of the log-likelihood for that observation.
The total deviance is defined as:
D = Σ d_i²
Another statistic, the Pearson residual, is defined as:
r_i = (y_i − π̂_i) / sqrt( π̂_i (1 − π̂_i) )
and the Pearson chi-square statistic is defined as
χ² = Σ r_i²
The summary statistics D and χ² each have degrees of freedom approximately equal to n − (p + 1), where p is the number of predictor variables, but they don’t have any nice distributional forms (i.e. you can’t assume that they follow a chi-square distribution). This is because the individual components are essentially formed from an n × 2 contingency table with all counts 1 or 0, so the problem of small expected counts found in chi-square tests is quite serious. So any p-value reported for these overall goodness-of-fit measures is not very reliable, and about the only thing that is useful is to compare these statistics to their degrees of freedom to compute an approximate variance inflation factor as seen earlier in the Fitness example.
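The residual definitions above can be sketched directly. The fitted probabilities here are made up purely for illustration, since d_i and r_i require a fitted model:

```python
import math

def deviance_residual(y, pi):
    """d_i = sqrt(2 |ln(pi^y (1-pi)^(1-y))|) for a 0/1 response y."""
    loglik = y * math.log(pi) + (1 - y) * math.log(1 - pi)
    return math.sqrt(2 * abs(loglik))

def pearson_residual(y, pi):
    """r_i = (y - pi) / sqrt(pi (1 - pi))."""
    return (y - pi) / math.sqrt(pi * (1 - pi))

# Made-up (response, fitted probability) pairs for illustration:
cases = [(1, 0.8), (0, 0.8), (1, 0.3)]
D = sum(deviance_residual(y, p) ** 2 for y, p in cases)     # total deviance
chi2 = sum(pearson_residual(y, p) ** 2 for y, p in cases)   # Pearson chi-square
```

Notice how the observation (0, 0.8), a failure with a high fitted success probability, dominates both statistics.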
One strategy for sparse tables is to pool. The Hosmer–Lemeshow test divides the data into 10 groups of equal sizes based on the deciles of the fitted values. The observed and expected counts are computed by summing the estimated probabilities and the observed values in the usual fashion, and then computing a standard chi-square goodness-of-fit statistic. It is compared to a chi-square distribution with 8 df.
Any assessment of goodness of fit should then start with the examination of the D, χ² and Hosmer–Lemeshow statistics. Then do a careful evaluation of the individual terms d_i and r_i.
To start with, examine the residual plots. Suppose we wish to predict membership in a category as a function of a continuous covariate. For example, can we predict the sex of an individual based on their weight? This is known as logistic regression and is discussed in another chapter in this series of notes.

Again refer to the Fitness dataset. The (Generalized Linear) model is:

Y_i distributed as Binomial(p_i)
φ_i = logit(p_i)
φ_i = Weight

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like: [35]

[35] I added reference lines at zero, 2, and −2 by clicking on the Y axis of the plot.
This plot looks a bit strange!

Along the bottom of the plot is the predicted probability of being female. [36] This is found by substituting the weight of each person into the estimated linear part, and then back-transforming from the logit scale to the ordinary probability scale. The first point on the plot, identified by a square box, is from a male who weighs over 90 kg. The predicted probability of being female is very small, about 5%.

The first question is exactly how a residual is defined when the Y variable is a category. For example, how would the residual for this point be computed? It makes no sense to simply take the observed (male) minus the predicted probability (.05).

Many computer packages redefine the categories using 0 and 1 labels. Because JMP was modeling the probability of being female, all males are assigned the value of 0, and all females are assigned the value of 1. Hence the residual for this point is 0 − .05 = −.05, which, after studentization, plots as shown.
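A tiny sketch of this recoding and the resulting raw residuals (the subjects and fitted probabilities below are hypothetical):

```python
import numpy as np

# Hypothetical subjects; JMP was modeling the probability of being female,
# so "F" is recoded as 1 and "M" as 0
sex = np.array(["M", "F", "F", "M"])
pi_female = np.array([0.05, 0.80, 0.40, 0.30])  # assumed fitted P(female)

y = (sex == "F").astype(float)   # 0/1 recoding of the category
resid = y - pi_female            # the heavy male: 0 - 0.05 = -0.05
```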
The bottom line in the residual plot corresponds to the male subjects; the top line corresponds to the female subjects. Where are areas of concern? You would be concerned about females who have a very small predicted probability of being female, and males who have a large predicted probability of being female. These are located in the plot in the circled areas.

[36] The first part of the output from the platform states that the probability of being female is being modeled.

The residual plot's strange appearance is an artifact of the modeling process.
What happens if the predictors in a logistic regression are also categorical? Based on what was seen for the ordinary regression case, you might expect to see a set of vertical lines. But there are only two possible responses, so the plot reduces to a (non-informative) set of lattice points.

For example, consider predicting survival rates of Titanic passengers as a function of their sex. This model is:

Y_i distributed as Binomial(p_i)
φ_i = logit(p_i)
φ_i = Sex

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like: [37]

[37] I added reference lines at zero, 2, and −2 by clicking on the Y axis of the plot.
The same logic applies as in the previous sections. Because Sex is a discrete predictor with two possible values, there are only two possible predicted probabilities of survival, corresponding to the two vertical lines in the plot. Because the response variable is categorical, it is converted to 0 or 1 values, and the residuals computed, which then correspond to the two dots in each vertical line. Note that each dot represents several hundred data values!

This residual plot is rarely informative – after all, if there are only two outcomes and only two categories for the predictors, some people have to lie in the two outcomes for each of the two categories of predictors.
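To see why only a lattice of points can appear, consider a sketch with made-up Titanic-style counts (the real data differ); with a single binary predictor, the fitted probabilities are just the two group means, so only four residual values exist:

```python
import numpy as np

# Hypothetical counts: sex (predictor) and survival (0/1 response)
sex = np.array(["F"] * 300 + ["M"] * 700)
survived = np.array([1] * 220 + [0] * 80 + [1] * 140 + [0] * 560)

# With one binary predictor, the fitted probability is the group survival rate
pi_hat = np.where(sex == "F",
                  survived[sex == "F"].mean(),
                  survived[sex == "M"].mean())
resid = survived - pi_hat

# Only four distinct residual values: one per (sex, outcome) combination
n_distinct = len(np.unique(np.round(resid, 10)))
```

Every passenger in the same (sex, outcome) cell lands on exactly the same dot, which is why each dot can represent hundreds of observations.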
The leverage of a point measures how extreme the set of predictors is relative to the rest of the predictors in the study. Leverage in logistic regression depends not only on this distance, but also on the weight in prediction, which is a function of π(1 − π). Consequently, points with very small predicted probabilities (i.e. π̂_i < 0.15) or very large predicted probabilities (i.e. π̂_i > 0.85) actually have little weight on the fit, and the maximum leverage occurs with points where the predicted probability is close to 0.15 or 0.85.
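The dependence of leverage on the weight π̂(1 − π̂) can be made explicit; here is a sketch of the hat-matrix diagonal using the standard GLM formula (a made-up design, not package output):

```python
import numpy as np

def logistic_leverage(X, pi_hat):
    """Diagonal of the logistic hat matrix
    H = W^(1/2) X (X'WX)^(-1) X' W^(1/2), with W = diag(pi_hat*(1-pi_hat))."""
    w = pi_hat * (1 - pi_hat)
    Xw = X * np.sqrt(w)[:, None]                 # W^(1/2) X
    M = np.linalg.inv(X.T @ (X * w[:, None]))    # (X'WX)^(-1)
    return np.einsum("ij,jk,ik->i", Xw, M, Xw)   # row-wise Xw_i M Xw_i'

# Small made-up design: intercept plus one covariate
X = np.column_stack([np.ones(5), np.array([-2.0, -1.0, 0.0, 1.0, 2.0])])
pi_hat = np.array([0.10, 0.30, 0.50, 0.70, 0.90])
lev = logistic_leverage(X, pi_hat)
```

In this small example the most extreme covariate values (x = ±2, with π̂ = 0.10 and 0.90) end up with slightly less leverage than the x = ±1 points, illustrating how small weights π̂(1 − π̂) pull down the leverage of extreme points.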
Hosmer et al. (1991) suggest plotting the leverage of each point vs. π̂_i to determine the regions where the leverage is highest. These values may not be available in your package of choice.

Hosmer et al. (1991) also suggest computing Cook's distance – how much do the regression coefficients change if a case is dropped from the model. These values may not be available in your package of choice.
7.10 Variable selection methods

7.10.1 Introduction

In the previous examples, there were only a few predictor variables and, generally, only one model really of interest. In many cases, the form of the model is unknown, and some sort of variable selection method is required to build a realistic model.

As in ordinary regression, these variable selection methods are NO substitute for intelligent thought, experience, and common sense.

As always, before starting any analysis, check the sample or experimental design. This chapter only deals with data collected under a simple random sample or completely randomized design. If the sample or experimental design is more complex, please consult with a friendly statistician.

Epidemiologists often advise that all clinically relevant variables should be included regardless of whether or not they are statistically significant. The rationale for this approach is to provide as complete control of confounding as possible – we saw in regular regression that collinearity among variables can mask statistical significance. The major problem with this approach is over-fitting. Over-fitted models have too many variables relative to the number of observations, leading to numerically unstable estimates with large standard errors.
I prefer a more subdued approach rather than this shotgun approach, and would follow these steps to find a reasonable model:

• Start with a multi-variate scatter-plot matrix to investigate pairwise relationships among variables. Are there pairs of variables that appear to be highly correlated? Are there any points that don't seem to follow the pattern seen with the other points?

• Examine each variable separately using the Analyze->Distribution platform to check for anomalous values, etc.

• Start with a simple univariate logistic regression with each variable in turn.

For continuous variables, there are three suggested analyses. First, use the binary variable as the X variable and do a simple two-sample t-test to look for differences among the means of the potential predictors. The dot plots should show some separation of the two groups. Second, try a simple univariate logistic regression using the binary variable as the Y variable with each individual predictor. Third, although it seems odd to do so, convert the binary response variable to a 0/1 continuous response and try some of the standard smoothing methods, such as a spline fit, to investigate the general form of the response. Does it look logistic? Are quadratic terms needed?

For nominal or ordinal variables, the above analyses often start with a contingency table. Particular attention should be paid to problem cases – cells in a contingency table which have a zero count. For example, suppose an experiment was testing different doses of a drug for the LD50 [38] and no deaths occurred at a particular dose. In these situations, the log-odds of success are ±∞, which is impossible to model properly using virtually any standard statistical package. [39] If there are cells with 0 counts, some pooling is often required.

Looking at all the variables, which variables appear to be statistically significant? Approximately how large are these simple effects – can the predictor variables be ranked in approximate order of univariate importance?

• Based upon the above results, start with a model that includes what appear to be the most important variables. As a rule of thumb, [40] include variables that have a p-value under .25 rather than relying on a stricter criterion. At this stage of the game, building a good starting model is of primary importance.

• Use standard variable selection methods, such as stepwise selection (forward, backward, combined) or all-subsets regression, to investigate potential models. These mechanical methods are not to be used as a substitute for thinking! Remember that highly collinear variables can mask the importance of each other.

If categorical variables are to be included, then some care must be used in how the various indicator variables are included. The reason for this is that the coding of the indicator variables is arbitrary, and the selection of a particular indicator variable may be an artifact of the coding used. One strategy is that all the indicator variables should be included or excluded as a set, rather than individually selecting separate indicator variables. As you will see in the example, JMP has four different rules that could be used.

[38] LD50 = Lethal Dose, 50th percentile – that dose which kills 50% of the subjects.
[39] However, refer to Hosmer and Lemeshow (2000) for details on alternate approaches.
[40] Hosmer and Lemeshow (2000), p. 95.
• Once the main effects have been identified, look at quadratic, interaction, and crossproduct terms.

• Verify the final model. Look for collinearity, high leverage, etc. Check if the response to the selected variables is linear on the logistic scale. For example, break a continuous variable into 4 classes, and refit the same model with these discretized classes. The estimates of the effects for each class should then follow an approximate linear pattern.

• Cross-validate the model so that artifacts of that particular dataset are not highlighted.
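The entry rule (p-value under .25) and the refit-after-entry idea can be sketched end to end. The following is my own self-contained illustration (a Newton-Raphson logistic fit with Wald p-values), not JMP's algorithm, and the data are simulated:

```python
import math
import numpy as np

def fit_logit(X, y, iters=25):
    """Logistic regression by Newton-Raphson (IRLS); returns (beta, cov)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ beta, -30, 30)       # guard against overflow in exp
        pi = 1 / (1 + np.exp(-eta))
        W = pi * (1 - pi)
        H = X.T @ (X * W[:, None])             # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - pi))
    return beta, np.linalg.inv(H)              # cov from the last Newton step

def wald_p(b, se):
    """Two-sided p-value for the Wald z-statistic b/se."""
    return math.erfc(abs(b / se) / math.sqrt(2))

def forward_select(y, candidates, alpha=0.25):
    """At each step add the candidate with the smallest Wald p-value,
    stopping once the best candidate exceeds the 0.25 entry threshold."""
    selected, remaining = [], dict(candidates)
    n = len(y)
    while remaining:
        best, best_p = None, 1.0
        for name, x in remaining.items():
            X = np.column_stack([np.ones(n)]
                                + [candidates[s] for s in selected] + [x])
            beta, cov = fit_logit(X, y)
            p = wald_p(beta[-1], math.sqrt(cov[-1, -1]))
            if p < best_p:
                best, best_p = name, p
        if best is None or best_p > alpha:
            break
        selected.append(best)
        remaining.pop(best)
    return selected
```

With one strong predictor and one pure-noise predictor, the strong predictor enters first; the noise variable may or may not clear the deliberately loose .25 threshold, which is exactly why this threshold is only for building a starting model.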
7.10.2 Example: Predicting credit worthiness

In the credit business, banks are interested in whether prospective consumers will pay back their credit or not. The aim of credit-scoring is to model or predict the probability that a consumer with certain covariates is to be considered a potential risk.

If you visit http://www.stat.uni-muenchen.de/service/datenarchiv/welcome_e.html you will find a dataset consisting of 1000 consumer credits from a German bank. For each consumer the binary response variable "creditability" is available. In addition, 20 covariates that are assumed to influence creditability were recorded. The dataset is available in the creditcheck.jmp datafile from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The variable descriptions are available at http://www.stat.uni-muenchen.de/service/datenarchiv/kredit/kreditvar_e.html and in the Sample Program Library.

I will assume that the initial steps in variable selection have been done, such as scatter-plots, looking for outliers, etc.

This dataset has a mixture of continuous variables (such as length of time an account has been paid in full), nominal scaled variables (such as sex, or the purpose of the credit request), and ordinal scaled variables (such as length of employment). Some of the ordinal variables may even be close enough to interval or ratio scaled to be usable as continuous variables (such as length of employment). Both approaches should be tried, particularly if the estimates for the individual categories appear to be increasing in a linear fashion.

The Analyze->Fit Model platform was used to specify the response variable, the potential covariates, and that a variable selection method will be used:
This brings up the standard dialogue box for stepwise and other variable selection methods.
In the stepwise paradigm, the usual forward, backward, and mixed (i.e. a forward step followed by a backward step at each iteration) methods are available:

In cases where variables are nominally or ordinally scaled (and discrete), JMP provides a number of ways to include/exclude the individual indicator variables.

For example, consider the variable Repayment, which had levels 0 to 4, corresponding from 0 = repayment problems in the past, to 4 = completely satisfactory repayment of past credit. JMP will create 4 indicator variables to represent these 5 categories. These indicator variables are derived in a hierarchical fashion:
The first indicator variable splits the classes in such a way as to maximize the difference in the proportion of credit worthiness between the two parts of the split. This corresponds to grouping levels 0 and 1 vs. levels 2, 3, and 4. The next indicator variables then split the splits, again, if possible, to maximize the difference in the credit worthiness between the two parts of the split. [If the split is of a pair of categories, there is no choice in the split.] This corresponds to splitting the 0&1 class into another indicator variable that distinguishes category 0 from 1. The 2&3&4 class is split into two sub-splits corresponding to categories 2&3 vs. category 4. Finally, the 2&3 class is split into an indicator variable differentiating categories 2 and 3.
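The hierarchical coding just described can be written out directly; a sketch (the split points below are the ones reported for this credit example, and the sample values are made up):

```python
import numpy as np

# Hypothetical Repayment codes 0..4 for a few consumers
repay = np.array([0, 1, 2, 3, 4, 4, 2])

# Hierarchical indicators mirroring the splits described in the text
ind = np.column_stack([
    np.isin(repay, [0, 1]),   # split 1: {0,1} vs {2,3,4}
    repay == 0,               # split 2: 0 vs 1 (within the 0&1 branch)
    np.isin(repay, [2, 3]),   # split 3: {2,3} vs {4} (within 2&3&4)
    repay == 2,               # split 4: 2 vs 3 (within 2&3)
]).astype(int)
```

Four indicator columns are enough to give each of the 5 categories its own pattern, which is why JMP creates exactly 4 indicators for this variable.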
Now the rules for entering effects correspond to:

• Combined. When terms enter the model, they are combined with all higher terms in the hierarchy and tested as a group to enter or leave.

• Restrict. Terms cannot be entered into the model unless terms higher in the hierarchy are already entered. Hence the indicator variable that distinguishes categories 0 and 1 in the repayment variable cannot enter before the indicator variable that contrasts 0&1 and 2&3&4.

• No Rules. Each indicator variable is free to enter or leave the model regardless of the presence or absence of other variables in the set.

• Whole Effects. All indicator variables in a set must enter or leave together as a set.

Combined and Whole Effects are the two most common choices.

This platform also supports all-possible-subsets regression:
This should be used cautiously with a large number of variables.

Because it is computationally difficult to fit thousands of models using maximum likelihood methods for each of the potential new variables that enter the model, a computationally simpler (but asymptotically equivalent) test procedure (called the Wald or score test) is used in the table of variables to enter or leave. In a forward selection, the variable with the smallest p-value or the largest Wald test statistic is chosen.

Once this variable is chosen, the current model is refit using maximum likelihood, so the report in the Step History may show a slightly different test statistic (the L-R ChiSquare) than the score statistic, and the p-value may be different.

The stepwise selection continues.

In a few steps, the next variable to enter is the indicator variable that distinguishes categories 2&3 and 4. Because of the restriction on entering terms, if this indicator variable is entered, the first cut must also be entered. Hence, this step actually enters 2 variables and the number of predictors jumps from 3 to 5:
In a few more steps, some of the credit purpose variables are entered, again as a pair.

The stepwise selection continues for a total of 18 steps.

As before, once you have identified a candidate model, it must be fit and examined in more detail. Use the Make Model button to fit the final model. Note that JMP must add new columns to the data tables corresponding to the indicator variables created during the stepwise report. These can be confusing to the novice, but just keep in mind that any set of indicator variables is somewhat arbitrary.
The model fit then has separate variables used for each indicator variable created:
The log-odds of NOT repaying the loan is computed (see the bottom of the estimates table). Do the coefficients make sense?

Can some variables be dropped?

Pay attention to how the indicator variables have been split. For example, do you understand what terms are used if the borrower intends to use the credit to do repairs (CreditPurpose value = 6)?

Models that are similar to this one should also be explored.

Again, just as in the case of ordinary regression, model validation using other data sets or hold-out samples should be explored.
7.11 Model comparison using AIC

Sorry, to be added later.
7.12 Final Words

7.12.1 Two common problems

Two common problems can be encountered with logistic regression.

Zero counts

As noted earlier, zero counts for one category of a nominal or ordinal predictor (X) variable are problematic, as the log-odds of that category then approach ±∞, which is somewhat difficult to model.

One simplistic approach is similar to the computation of the empirical logistic estimate – add a small constant (e.g. 1/2) to each cell so that the counts are no longer integers; most packages will deal with non-integer counts without problems.
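A sketch of this continuity correction; I use the common add-1/2 form of the empirical logit here:

```python
import math

def empirical_logit(successes, n):
    """Empirical logit with the usual 1/2 continuity correction; stays
    finite even when a cell has zero successes (or zero failures)."""
    return math.log((successes + 0.5) / (n - successes + 0.5))

# A dose cell with zero deaths out of 20 no longer gives log-odds of -infinity
lo = empirical_logit(0, 20)
```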
If the zero counts arise from spreading the data over too many cells, perhaps some pooling of adjacent cells is warranted. If the data are sufficiently dense that pooling is not needed, perhaps this level of the variable can be dropped.
Complete separation

Ironically, this is a problem because the logistic model is performing too well! We saw an example of this earlier, when the fitness data could predict perfectly the sex of the subject.

This is a problem because now the predicted log-odds for the groups must again be ±∞. This can only happen if some of the estimated coefficients are also infinite, which is difficult to deal with numerically. Theoretical considerations show that in the case of complete separation, maximum likelihood estimates do not exist!

Sometimes this complete separation is an artifact of too many variables and not enough observations. Furthermore, it is not so much a problem of the total number of observations, but also of the division of observations between the two binary outcomes. If you have 1000 observations, but only 1 "success", then any model with more than a few variables will be 100% efficient in capturing the single success – however, it is almost certain to be an artifact of the particular dataset.
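The non-existence of the MLE under complete separation can be seen numerically: with made-up perfectly separated data, the log-likelihood keeps increasing as the slope grows, so there is no finite maximum:

```python
import numpy as np

# Hypothetical completely separated data: every x above 0 is a "success"
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def loglik(beta):
    """Binomial log-likelihood for the no-intercept model logit(p) = beta*x."""
    eta = beta * x
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

# The likelihood climbs toward (but never reaches) 0 as beta grows
lls = [loglik(b) for b in (1.0, 5.0, 25.0, 125.0)]
```

A software fit on such data will report exploding coefficients and standard errors, which is the numerical symptom of this theoretical problem.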
7.12.2 Extensions

Choice of link function

The logit link function is the most common choice for the link function between the probability of an outcome and the scale on which the predictors operate in a linear fashion.

However, other link functions have been used in different situations. For example, the log link (log(p)), the log-log link (log(−log(p))), the complementary log-log link (log(−log(1 − p))), the probit function (the inverse normal distribution), and the identity link (p) have all been proposed for various special cases. Please consult a statistician for details.
More than two response categories

Logistic regression traditionally has two response categories that are classified as "success" or "failure". It is possible to extend this modelling framework to cases where the response variable has more than two categories.

This is known as multinomial logistic regression, discrete choice, polychotomous logistic, or polytomous logistic modelling, depending upon your field of expertise.

There is a difference in the analysis if the responses can be ordered (i.e. the response variable takes an ordinal scale), or remain unordered (i.e. the response variable takes a nominal scale).

The basic idea is to compute a logistic regression of each category against a reference category. So a response variable with three categories is translated into two logistic regressions where, for example, the first regression is category 1 vs. category 0 and the second regression is category 2 vs. category 0. These can be used to derive the results of category 2 vs. category 1. What is of particular interest is the role of the predictor variables in each of the possible comparisons, e.g. does weight have the same effect upon mortality for three different disease outcomes?
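A sketch of how the two baseline-category logits determine all three probabilities (the linear-predictor values are hypothetical):

```python
import numpy as np

def multinomial_probs(eta):
    """Category probabilities from baseline-category logits:
    eta[k] is the linear predictor of category k+1 vs. reference category 0."""
    expo = np.concatenate([[1.0], np.exp(eta)])   # reference gets exp(0) = 1
    return expo / expo.sum()

# Two logits (category 1 vs 0, category 2 vs 0) imply all three probabilities,
# and the category-2-vs-1 log-odds is just their difference
eta = np.array([0.7, -0.2])
p = multinomial_probs(eta)
logodds_2v1 = np.log(p[2] / p[1])   # equals eta[1] - eta[0]
```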
Consult one of the many books on logistic regression for details.

Exact logistic regression with very small datasets

The methods presented in this chapter rely upon maximum likelihood methods and asymptotic arguments. In very small datasets, these large-sample approximations may not perform well.

There are several statistical packages which perform exact logistic regression and do not rely upon asymptotic arguments. A simple search of the web brings up several such packages.
More complex experimental designs

The results of this chapter have all assumed that the sampling design was a simple random sample or that the experimental design was a completely randomized design.

Logistic regression can be extended to many more complex designs.

In matched-pair designs, each "success" in the outcome is matched with a randomly chosen "failure" along as many covariates as possible. For example, lung cancer patients could be matched with healthy patients with common age, weight, occupation, and other covariates. These designs are very common in health studies. There are many good books on the analysis of such designs.

Clustered designs are also very common, where groups of subjects all receive a common treatment. For example, classrooms may be randomly assigned to different reading programs, and the success or failure of individual students within the classrooms in obtaining reading goals is assessed. Here the experimental unit is the classroom, not the individual student, and the methods of this chapter are not directly applicable. Several extensions have been proposed for this type of "correlated" binary data (students within the same classroom are all exposed to exactly the same set of experimental and non-experimental factors). The most common is known as Generalized Estimating Equations and is described in many books.

More complex experimental designs (e.g. split-plot designs) can also be run with binary outcomes. These complex designs require high-powered computational machinery to analyze.
7.12.3 Yet to do

- examples - Dov's example used in a comprehensive exam in previous years
Chapter 8<br />
Poiss<strong>on</strong> Regressi<strong>on</strong><br />
8.1 Introducti<strong>on</strong><br />
In past chapters, multiple-regressi<strong>on</strong> methods were used to predict a c<strong>on</strong>tinuous Y variable given a set of<br />
predictors, and logistic regressi<strong>on</strong> methods were used to predict a dichotomous categorical variable given a<br />
set of predictors.<br />
In this chapter, we will explore the use of Poiss<strong>on</strong>-regressi<strong>on</strong> methods that are typically used to predict<br />
counts of (rare) events given a set of predictors.<br />
Just as multiple-regressi<strong>on</strong> implicitly assumed that the Y variable had a normal distributi<strong>on</strong> and logisticregressi<strong>on</strong><br />
assumed that the choice of categories in Y was based <strong>on</strong> binomial distributi<strong>on</strong>, Poiss<strong>on</strong> regressi<strong>on</strong><br />
assumes that the observed counts are generated from a Poiss<strong>on</strong> distributi<strong>on</strong>.<br />
The Poisson distribution is often used to model count data when the events being counted are somewhat rare, e.g. cancer cases, the number of accidents, the number of satellite males around a female bird, etc. It is characterized by the expected number of events µ, with probability mass function:

P(Y = y | µ) = e^(−µ) µ^y / y!

where y! = y(y − 1)(y − 2) · · · (2)(1), and y ≥ 0. The probability mass function is available in tabular form, or can be computed by many statistical packages. While the values of Y are restricted to being non-negative integers, it is not necessary for µ to be an integer.
In the following graph, 1000 observations were each generated from a Poisson distribution with differing means.
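The pmf is easy to evaluate numerically. As a quick sketch (my own illustration using Python's scipy, which the notes themselves do not use):

```python
import math

from scipy.stats import poisson

mu = 3.0  # expected number of events; need not be an integer

# P(Y = y | mu) = exp(-mu) * mu**y / y!, checked here against scipy
y = 2
by_hand = math.exp(-mu) * mu**y / math.factorial(y)
assert abs(poisson.pmf(y, mu) - by_hand) < 1e-12

# The probabilities over y = 0, 1, 2, ... sum to one (up to a tiny tail)
total = sum(poisson.pmf(k, mu) for k in range(25))
print(total)  # very close to 1
```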
For very small values of µ, virtually all the counts are zero, with only a few counts that are positive. As µ increases, the shape of the distribution looks more and more like a normal distribution – indeed, for large µ, a normal distribution can be used as an approximation to the distribution of Y.
Sometimes µ is further parameterized by a rate parameter and a group size, i.e. µ = Nλ where λ is the rate per unit and N is the group size. For example, the number of cancers in a group of 100,000 people could be modeled using λ as the rate per 1000 people, and N = 100.
Two important properties of the Poisson distribution are:

E[Y] = µ
V[Y] = µ
Unlike the normal distribution, which has separate parameters for the mean and variance, the Poisson distribution's variance is equal to its mean. This means that once you estimate the mean, you have also estimated the variance, and so it is not necessary to have replicate counts to estimate the sample variance from data. As will be seen later, this can be quite limiting because for many populations the data are over-dispersed, i.e. the variance is greater than you would expect from a simple Poisson distribution.
Another important property is that the Poisson distribution is additive. If Y_1 is Poisson(µ_1) and Y_2 is Poisson(µ_2), and the two are independent, then Y = Y_1 + Y_2 is also Poisson(µ = µ_1 + µ_2).
Lastly, the Poisson distribution is a limiting distribution of a binomial distribution as n becomes large and p becomes very small, with np held constant.
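Both properties are easy to verify by simulation; a small sketch (Python/numpy, my own illustration rather than anything from the notes):

```python
import numpy as np

rng = np.random.default_rng(2012)

# Mean and variance of a Poisson sample are both (approximately) mu
y = rng.poisson(lam=4.0, size=100_000)
print(y.mean(), y.var())  # both near 4

# Additivity: independent Poisson(1.5) + Poisson(2.5) behaves like Poisson(4)
s = rng.poisson(1.5, 100_000) + rng.poisson(2.5, 100_000)
print(s.mean(), s.var())  # again both near 4
```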
Poisson regression is another example of a Generalized Linear Model (GLIM). 1 As in all GLIMs, the modeling process is a three-step affair:

Y_i is assumed Poisson(µ_i)
φ_i = log(µ_i)
φ_i = β_0 + β_1 X_i1 + β_2 X_i2 + . . .
Here the link function is the natural logarithm, log. In many cases, the mean changes in a multiplicative fashion. For example, if population size doubled, then the expected number of cancer cases should also double. As populations age, the rate of cancer increases linearly on a log-scale. Additionally, by modeling log(µ_i), it is impossible to get negative estimates of the mean.
The linear part of the GLIM can consist of continuous X or categorical X or mixtures of both types of predictors. Categorical variables will be converted to indicator variables in exactly the same way as in multiple- and logistic-regression.
Unlike multiple-regression, there are no closed-form solutions for the parameter estimates. Standard maximum likelihood estimation (MLE) methods are used. 2 MLEs are guaranteed to be the “best” estimators
1 Logistic regression is another GLIM.
2 A discussion of the theory of MLE is beyond the scope of this course, but is covered in Stat-330 and Stat-402.
(smallest standard errors) as the sample size increases, and seem to work well even if the sample sizes are not large. Standard methods are used to estimate the standard errors of the estimates. Model comparisons are done using likelihood-ratio tests whose test statistics follow a chi-square distribution, which is used to give a p-value that is interpreted in the standard fashion. Predictions are done in the usual fashion – these initially appear on the log-scale and must be anti-logged to provide estimates on the ordinary scale.
8.2 Experimental design
In this chapter, we will again assume that the data are collected under a completely randomized design. In some of the examples that follow, blocked designs will be analyzed, but we will not explore how to analyze split-plot or repeated-measures designs, or designs with pseudo-replication.
The analysis of such designs in a generalized linear models framework is possible – please consult with a statistician if you have a complex experimental design.
8.3 Data structure
The data structure is straightforward. Columns represent variables and rows represent observations. The response variable, Y, will be a count of the number of events and will be set to a continuous scale. The predictor variables, X, can be either continuous or categorical – in the latter case, indicator variables will be created.
As usual, the coding that a package uses for indicator variables is important if you want to interpret directly the estimates of the effect of the indicator variable. Consult the documentation for the package for details.
8.4 Single continuous X variable
The JMP file salamanders-burn.jmp, available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms, contains data on the number of salamanders in a fixed-size quadrat at various locations in a large forest. The locations of the quadrats were chosen to represent a range of years since a forest fire burned the understory.
A simple plot of the data:
shows an increasing relationship between the number of salamanders and the time since the forest understory burned.
Why can’t a simple regression analysis using standard normal theory be used to fit the curve?
First, the assumption of normality is suspect. The counts of the number of salamanders are discrete, with most under 10. It is impossible to get a negative number of salamanders, so the bottom left part of the graph would require the normal distribution to be truncated at Y = 0.
Second, it appears that the variance of the counts at any particular age increases with age since burned. This violates the assumption of equal variance for all X values made in standard regression models.
Third, the fitted line from ordinary regression could go negative. It is impossible to have a negative number of salamanders.
It seems reasonable that a Poisson distribution could be used to model the number of salamanders. They are relatively rare and seem to forage independently of each other. These conditions are the underpinnings of a Poisson distribution.
The process of fitting the model and interpreting the output is analogous to that used in logistic regression.
The basic model is then:

Y_i ∼ Poisson(µ_i)
θ_i = log(µ_i)
θ_i = β_0 + β_1 Years_i
As in the logistic model, the distribution of the data about the mean (line 1) has a link function (line 2) between the mean for each Y and the linear structural part of the model (line 3). In logistic regression, the logit link was used to ensure that all values of p were between 0 and 1. In Poisson regression, the log (natural logarithm) is traditionally used to ensure that the mean is always positive.
The model must be fit using maximum likelihood methods, just like in logistic regression.
This model is fit in JMP using the Analyze->Fit Model platform:
Be sure to specify the proper distribution and link function.
This gives the output:<br />
Most of the output parallels that seen in logistic regressi<strong>on</strong>. At the top of the output is a summary of variable<br />
being analyzed, the distributi<strong>on</strong> <str<strong>on</strong>g>for</str<strong>on</strong>g> the raw data, the link used, and the total number of observati<strong>on</strong> (rows in<br />
the dataset).<br />
The Whole Model Test is analogous to that in multiple-regressi<strong>on</strong> - is there evidence that the set of<br />
predictors (in this case there is <strong>on</strong>ly <strong>on</strong>e predictor) have any predictive ability over that seen by random<br />
chance. The test statistic is computed using a likelihood-ratio test comparing this model to a model with<br />
<strong>on</strong>ly the intercept. The p-value is very small, indicating that the model has some predictive ability. [Because<br />
there is <strong>on</strong>ly 1 predictor, this test is equivalent to the Effect Test discussed below.]<br />
The goodness-of-fit statistic compares the model with the intercept and the single predictor to a model where every observation is predicted individually. If the model fits well, the chi-square test statistic should be approximately equal to the degrees of freedom, and the p-value should be LARGE, i.e. much larger than .05. 3 There is no evidence of a problem in the fit. Later in this section, we will examine how to adjust for slight lack of fit.
The Effect tests examine whether each predictor (or, in the case of a categorical variable, the entire set of indicator variables) makes a statistically significant marginal contribution to the fit. As in the multiple-regression model, these are MARGINAL contributions, i.e. assuming that all other variables remain in the model and are fixed at their current values. There is only one predictor, and there is strong evidence against the hypothesis of no marginal contribution.
3 Remember that in goodness-of-fit tests, you DON’T want to find evidence against the null hypothesis.
Finally, the Parameter Estimates section reports the estimated β’s. So our fitted model is:

Y_i ∼ Poisson(µ_i)
θ_i = log(µ_i)
θ_i = 0.59 + 0.045 Years_i
Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a set of categories), the results of the parameter-estimate tests match the effect tests.
We can obtain predictions by following the drop down menu:
For example, consider the first row of the data. At 12 years since the last burn, we estimate the mean response by starting at the bottom of the model and working upwards:

θ_1 = 0.59 + 0.045(12) = 1.12
µ_1 = exp(1.12) = 3.08

which is the predicted value in the table.
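With the rounded coefficients 0.59 and 0.045, the back-transform can be checked in two lines (the 1.12 and 3.08 reported in the text come from the unrounded estimates, so the rounded arithmetic lands slightly higher):

```python
import math

theta = 0.59 + 0.045 * 12  # linear predictor at Years = 12
mu_hat = math.exp(theta)   # anti-log to return to the count scale
print(round(theta, 2), round(mu_hat, 2))  # 1.13 and 3.1 with these rounded inputs
```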
As in ordinary normal-theory regression, confidence limits for the mean response and for an individual response may be found. The above table shows the confidence interval for the mean response.
Finally, a residual plot may also be constructed:
There is no evidence of a lack-of-fit.
8.5 Single continuous X variable - dealing with overdispersion
One of the weaknesses of Poisson regression is the very restrictive assumption that the variance of a Poisson distribution is equal to its mean. In some cases, data are over-dispersed, i.e. the variance is greater than predicted by a simple Poisson distribution. In this section, we will illustrate how to detect overdispersion and how to adjust the analysis to account for overdispersion.
In the section on Logistic Regression, a dataset was examined on nesting horseshoe crabs 4 that is analyzed in Agresti’s book. 5
The design of the study is given in Brockmann H.J. (1996). Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102, 1-21. Again it is important to check that the design is a completely randomized design or a simple random sample. As in regression models, you do have some flexibility in the choice of the X settings, but for a particular weight and color, the data must be selected at random from the relevant population.
Each female horseshoe crab had a male resident in her nest. The study investigated other factors affecting whether the female had any other males, called satellites, residing nearby. These other factors include:
• crab color where 2=light medium, 3=medium, 4=dark medium, 5=dark.
• spine condition where 1=both good, 2=one worn or broken, or 3=both worn or broken.
• weight
• carapace width
In the section on Logistic Regression, a derived variable on the presence or absence of satellite males was examined. In this section, we will examine the actual number of satellite males.
A JMP dataset crabsatellites.jmp is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A portion of the datafile is shown below:
4 See http://en.wikipedia.org/wiki/Horseshoe_crab.
5 These are available from Agresti’s web site at http://www.stat.ufl.edu/~aa/cda/sas/sas.html.
Note that the color and spine condition variables should be declared with an ordinal scale despite having numerical codes. In this analysis we will use the actual number of satellite males.
As noted in the section on Logistic Regression, a preliminary scatter plot of the variables shows some interesting features.
There is a very high positive relationship between carapace width and weight, but there are a few anomalous crabs that should be investigated further, as shown in this magnified plot:
There are three points with weights in the 1200-1300 g range whose carapace widths suggest that the weights should be in the 2200-2300 g range, i.e. a typographical error in the first digit. There is a single crab whose weight suggests a width of 24 cm rather than 21 cm – perhaps a typo in the last digit. Finally, there is one crab which is extremely large compared to the rest of the group. In the analysis that follows, I’ve excluded these five data values.
To begin with, fit a model that attempts to predict the mean number of satellite crabs as a function of the weight of the female crab, i.e.
Y_i distributed Poisson(µ_i)
λ_i = log(µ_i)
λ_i = β_0 + β_1 Weight_i
The Generalized Linear Model platform of JMP is used:
This gives selected output:
There are two parts of the output which show that the fit is not very satisfactory. First, while the studentized residual plot does not show any structural defects (the residuals are scattered around zero) 6 , it does show substantial numbers of points outside of the (−2, 2) range. This suggests that the data are too variable relative to the Poisson assumption. Second, the goodness-of-fit statistic has a very small p-value, indicating that the data are not well fit by the model.
This is an example of overdispersion. To see this overdispersion, divide the weight classes into categories, e.g. 0–2500 g, 2500–3000 g, etc. [This has already been done in the dataset.] 7 Now find the mean and variance of the number of satellite males for each weight class using the Tables->Summary platform:
6 The “lines” in the plot are artifacts of the discrete nature of the response. See the chapter on residual plots for more details.
7 The choice of 4 weight classes is somewhat arbitrary. I would usually try to subdivide the data into between 4 and 10 classes, ensuring that at least 20-30 observations are in each class.
If the Poisson assumption were true, then the variance of the number of satellite males should be roughly equal to the mean in each class. In fact, the variance in the number of satellite males appears to be roughly 3× that of the mean.
With generalized linear models, there are two ways to adjust for over-dispersion.
A different distribution can be used that is more flexible in the mean-to-variance ratio. A common distribution used in these cases is the negative binomial distribution. In more advanced classes, you will learn that the negative binomial distribution can arise from a Poisson distribution with extra variation in the mean rates. JMP does not allow the fitting of a negative binomial distribution, but this option is available in SAS.
An “ad hoc” method, that nevertheless has theoretical justification, is to allow some flexibility in the variance. For example, rather than restricting V[Y] = E[Y] = µ, perhaps V[Y] = cµ, where c is called the over-dispersion factor. Note that if this formulation is used, the data are no longer distributed as a Poisson distribution; in fact, there is NO actual probability function that has this property. Nevertheless, this quasi-distribution still has nice properties, and the over-dispersion factor can be estimated using quasi-likelihood methods that are analogous to regular likelihood methods.
The end result is that the over-dispersion factor is used to adjust the se and the test-statistics. The adjusted se are obtained by multiplying the se from the Poisson model by √ĉ. The adjusted chi-square test statistics are found by dividing the test statistics from the Poisson model by ĉ, and the p-value is adjusted by looking up the adjusted test-statistic in the appropriate table.
How is the over-dispersion factor c estimated? There are two methods, both of which are asymptotically equivalent. Both involve taking a goodness-of-fit statistic and dividing by its degrees of freedom:

ĉ = goodness-of-fit statistic / df

Usually, ĉ’s of less than 10 (corresponding to a potential inflation in the se by a factor of about 3) are acceptable – if the inflation factor is more than about 10, the lack-of-fit is so large that alternate methods should be used.
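The adjustment itself is simple arithmetic. A sketch using the chi-square and df values reported for the crab fit later in this section, with made-up (hypothetical) Poisson-fit values for the se and test statistic:

```python
import math

c_hat = 519.7857 / 166  # goodness-of-fit statistic / df
print(round(c_hat, 2))  # 3.13

se_poisson = 0.00038    # hypothetical se of a slope from the unadjusted Poisson fit
se_adjusted = se_poisson * math.sqrt(c_hat)  # se inflated by sqrt(c-hat)

chi2_poisson = 31.0     # hypothetical chi-square statistic from the Poisson fit
chi2_adjusted = chi2_poisson / c_hat         # test statistic deflated by c-hat
print(se_adjusted, chi2_adjusted)
```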
In JMP, the adjustment for over-dispersion occurs in the Analyze->Fit Model dialogue box:
The revised output is now:
Notice that the overdispersion factor has been estimated as

ĉ = chi-square / df = 519.7857 / 166 = 3.13

This is very close to the “guess” that we made based on looking at the variance-to-mean ratio among weight classes.
The estimated intercept and slope are unchanged and their interpretation is as before. For example, the
estimated slope of .000668 is the estimated increase in the log number of male satellite crabs when the female crab’s weight increases by 1 g. A 1000 g increase in body-weight corresponds to a 1000 × .000668 = .668 increase in the log(number of satellite males), which corresponds to an increase by a factor of e^.668 = 1.95, i.e. the mean number of male satellite crabs almost doubles. The estimated se has been “inflated” by √ĉ = √3.13 = 1.77. The confidence intervals for the slope and intercept are now wider.
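The arithmetic in this paragraph can be verified in a line each (my own check, using the reported estimates):

```python
import math

# 1000 g more body-weight multiplies the expected satellite count by exp(0.668)
factor = math.exp(1000 * 0.000668)
print(round(factor, 2))  # 1.95, i.e. the mean count nearly doubles

# se inflation implied by the estimated over-dispersion factor
inflation = math.sqrt(3.13)
print(round(inflation, 2))  # 1.77
```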
The chi-square test statistics have been “deflated” by ĉ and the p-values have been adjusted accordingly.
Finally, the residual plot has been rescaled by the factor of √ĉ and now most residuals lie between −2 and 2. Note that the pattern of the residual plot doesn’t change; all that the over-dispersion adjustment does is to change the residual variance so that the standardization brings them closer to 0.
Predictions of the mean response at levels of X are obtained in the usual fashion:
giving (partial output):
The se of the predicted mean will also have been adjusted for overdispersion, as will the confidence intervals for the mean number of male satellite crabs at each weight value.
However, notice that the menu item for a prediction interval for the INDIVIDUAL response is “grayed out”, and it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution – in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio that is implicitly assumed when using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.
We save the predicted values to the dataset and do a plot of the final results on both the ordinary scale:
and on the log-scale (the scale where the model is “linear”):
8.6 Single Continuous X variable with an OFFSET
In the previous examples, the sampling units (where the counts were obtained) were all the same size (e.g. the number of satellite males around a single female). In some cases, the sampling units are of different sizes.
For example, if the number of weeds is counted in a quadrat plot, then hopefully the size of the plot is constant. However, it is conceivable that the size of the plot varies because different people collected different parts of the data. Or if the number of events is counted in a time interval (e.g. the number of fish captured in a fishing trip), the time intervals could be of different sizes.
Often these types of data are pre-standardized, i.e. converted to a per-m² or per-hour basis, and then an
analysis is attempted on this standardized variable. However, standardization destroys the Poisson shape of the data and turns out to be unnecessary if the size of the sampling unit is also collected.
The incidence of non-melanoma skin cancer among women in the early 1970’s in Minneapolis-St Paul, Minnesota, and Dallas-Fort Worth, Texas is summarized below:
City  Age Class  Age Mid  Count  Pop Size
msp   15-24        20        1    172,675
msp   25-34        30       16    123,065
msp   35-44        40       30     96,216
msp   45-54        50       71     92,051
msp   55-64        60      102     72,159
msp   65-74        70      130     54,722
msp   75-84        80      133     32,185
msp   85+          90       40      8,328
dfw   15-24        20        4    181,343
dfw   25-34        30       38    146,207
dfw   35-44        40      119    121,374
dfw   45-54        50      221    111,353
dfw   55-64        60      259     83,004
dfw   65-74        70      310     55,932
dfw   75-84        80      226     29,007
dfw   85+          90       65      7,538
We will first examine the relationship of cancer incidence to age by using the age midpoint as our continuous X variable and only using the Minneapolis data (for now).
The data set is available in the JMP data file skincancer.jmp from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Is there a relationship between the age of a cohort and the cancer incidence rate? Notice that a comparison of the raw counts is not very sensible because of the different sizes of the age cohorts. Most people would first STANDARDIZE the incidence rate, e.g. find the incidence per person by dividing the number of cancers by the number of people in each cohort:
A plot of the standardized incidence rate by the mid-age of each cohort:
shows a curved relationship between the incidence rate and the mid-point of the age-cohort. This suggests a
theoretical model of the form:

Incidence = C e^age

i.e. an exponential increase in the cancer rates with age.
This suggests that a log-transform be applied to BOTH sides. However, a plot of the logarithm of the incidence rate against log(age midpoint) is still not linear, with a dip for the youngest cohorts. There appears to be a strong relationship between the log(cancer rate) and log(age) that may not be linear, but a quadratic looks as if it could fit quite nicely, i.e. a model of the form:

    log(incidence) = β0 + β1 log(age) + β2 log(age)² + residual
Is it possible to include the population size directly? Expand the above model:

    log(incidence) = β0 + β1 log(age) + β2 log(age)² + residual
    log(count / pop size) = β0 + β1 log(age) + β2 log(age)² + residual
    log(count) − log(pop size) = β0 + β1 log(age) + β2 log(age)² + residual
    log(count) = log(pop size) + β0 + β1 log(age) + β2 log(age)² + residual

Notice that log(pop size) has a known coefficient of 1 associated with it, i.e. there is NO β coefficient associated with log(pop size).
Also notice that log(POP SIZE) is known in advance and is NOT a parameter to be estimated. Variables such as population size are often called offset variables, and most packages expect to see the offset variable pre-transformed depending upon the link function used. In this case, the log link was used, so the offset is log(POP SIZE_age), as you will see in a minute.
Our GLIM model will then be:

    Y_age distributed Poisson(µ_age)
    φ_age = log(µ_age) = log(POP SIZE_age) + log(λ_age)
    log(λ_age) = β0 + β1 log(AGE) + β2 log(AGE)²

This can be rewritten slightly as:

    φ_age = log(POP SIZE_age) + β0 + β1 log(AGE) + β2 log(AGE)²

or

    log(λ_age) = φ_age − log(POP SIZE_age) = β0 + β1 log(AGE) + β2 log(AGE)²

So the modeling can be done in terms of estimating the effect of log(age) upon the incidence rate, rather than the raw counts, as long as the offset variable log(POP SIZE_age) is known.
To perform a Poisson regression, first create the offset variable log(POP SIZE_age) using the formula editor of JMP.
The Analyze->Fit Model platform launches the analysis:
Note that the raw count is the Y variable, and that the offset variable is specified separately from the X variables.
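These notes use JMP's point-and-click platform; for readers who want a scriptable cross-check, the same Poisson regression with an offset can be sketched in Python via iteratively reweighted least squares (IRLS). Everything below — the variable names and the IRLS loop — is my own sketch, not JMP output. It fits the linear model log(λ) = β0 + β1 log(age) to the Minneapolis data and should land near the estimates reported later for the final model (−21.32 and 3.60):

```python
import numpy as np

# Minneapolis data from the table at the start of this example.
age = np.array([20, 30, 40, 50, 60, 70, 80, 90], dtype=float)
count = np.array([1, 16, 30, 71, 102, 130, 133, 40], dtype=float)
pop = np.array([172675, 123065, 96216, 92051, 72159, 54722, 32185, 8328],
               dtype=float)

X = np.column_stack([np.ones_like(age), np.log(age)])  # intercept, log(age)
offset = np.log(pop)                                   # known coefficient of 1

# Start at the overall log-rate so the Newton/IRLS steps stay well behaved.
beta = np.array([np.log(count.sum() / pop.sum()), 0.0])
for _ in range(50):
    eta = offset + X @ beta          # linear predictor on the log scale
    mu = np.exp(eta)                 # fitted mean counts
    # Log link: the working weights are mu, and the working response
    # (with the offset removed) is the current fit plus (count - mu)/mu.
    z = (eta - offset) + (count - mu) / mu
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))
```

The fit is weighted toward the older cohorts with large counts, which is why the single case in the 15-24 cohort has little influence on the slope.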
The output is:
The goodness-of-fit statistic indicates no evidence of lack-of-fit, i.e. no need to adjust for over-dispersion.
Based on the results of the Effect Test for the quadratic term, it appears that a linear fit may actually be sufficient, as the p-value for the quadratic term is almost 10%. The reason for this apparent non-need for the quadratic term is that the smaller age-cohorts have very few counts and so the actual incidence rate is very imprecisely estimated.
Finally, the Parameter Estimates section reports the estimated β's (remember these are on the log-scale). Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a category), the results of the parameter estimate tests match the effect tests.
Based on the output so far, it appears that we can drop the quadratic term. This term was dropped, and the model refit:
The final model is

    log(λ̂_age) = −21.32 + 3.60 log(age)

The predicted log(λ) for age 40 is found as:

    log(λ̂_40) = −21.32 + 3.60 log(40) = −8.04

This incidence rate is on the log-scale, so the predicted incidence rate is found by taking anti-logs: e^−8.04 = .000322, or .322/thousand people, or 322/million people.
In order to make predictions about the expected number of cancers in each age cohort that would be seen under this model, you would need to add back the log(POP SIZE) for the appropriate age class:

    log(µ̂_40) = log(λ̂_40) + log(POP SIZE_40) = −8.04 + 11.47 = 3.42

Finally, the predicted number of cases is simply the anti-log of this value:

    Ŷ_40 = e^{log(µ̂_40)} = e^{3.42} = 30.96
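Because the estimates are rounded to two decimals, the chain above can be re-checked with a few lines (a sketch; small discrepancies from the JMP output are rounding):

```python
import math

b0, b1 = -21.32, 3.60                         # estimates from the fitted model
log_lambda_40 = b0 + b1 * math.log(40)        # predicted log incidence rate
rate_40 = math.exp(log_lambda_40)             # incidence per person
log_mu_40 = log_lambda_40 + math.log(96216)   # add back log(POP SIZE) for 35-44
pred_count_40 = math.exp(log_mu_40)           # expected number of cases
```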
Of course, this can be done automatically by the platform by requesting:
This also allows you to save the confidence limits for the average number of skin cancers expected for this age class (the mean confidence bounds, assuming the same population size) and the confidence limits for the actual number of cases (the individual confidence bounds).
In this case, the expected number of skin cancer cases for the 35-44 age group is 30.69, with a 95% confidence interval for the mean number of cases ranging from (26.0 → 36.8). The confidence bound for the actual number of cases (assuming the model is correct) is somewhere between 19 and 43 cases.
By adding new data lines to the data table (before the model fit) with the Y variable missing, but the age and offset variable present, you can make forecasts for any set of new X values.
The residual plot:
isn’t too bad – the large negative residual for the first age class (near where 0 skin cancers are predicted) is a bit worrisome; I suspect this is where the quadratic curve may provide a better fit.
A plot of actual vs. predicted values can be obtained directly:
or by saving the predicted values to the data sheet, and using the Analyze->Fit Y-by-X platform with Fit Special to add the reference line:
These plots show excellent agreement with the data.
Finally, it is nice to construct an overlay plot of the empirical log(rates) (the first plot constructed) with the estimated log(rate) and confidence bounds as a function of log(age). Create the predicted log(rate) using the formula editor from the predicted skin cancer numbers by subtracting the log(POP SIZE) (why?):
Repeat the same formula for the lower and upper bounds of the 95% confidence interval for the mean number of cases:
Finally, use the Graph → Overlay Plot to plot the empirical estimates, the predicted values of λ and the 95% confidence interval for λ on the same plot:
and fiddle⁸ with the plot to join up predictions and confidence bounds but leave the actual empirical points as is to give the final plot:
8 I had to turn on the connect-through-missing option under the red triangle.
Remember that the point with the smallest log(rate) is based on a single skin cancer case and is not very reliable. That is why the quadratic fit was likely not selected.
8.7 ANCOVA models
Just like in regular multiple-regression, it is possible to mix continuous and categorical variables and test for parallelism of the effects. Of course this parallelism is assessed on the link scale (in most cases for Poisson data, on the log scale).
There is nothing new compared to what was seen with ordinary regression and logistic regression. The three appropriate models are:

    log(λ) = X
    log(λ) = X Cat
    log(λ) = X Cat X*Cat

where X is the continuous predictor, and Cat is the categorical predictor. The first model assumes a common line for all categories of the Cat variable. The second model assumes parallel slopes, but differing intercepts. The third model assumes separate lines for each category.
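To make the shorthand concrete, here is one common way the three models expand into design matrices, sketched in Python with a 0/1 dummy for a two-level Cat (JMP's own ±1 effect coding differs, as discussed later in the chapter; the data values here are purely illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])   # continuous predictor X
cat = np.array([0, 0, 0, 1, 1, 1])             # 0/1 dummy for a two-level Cat

ones = np.ones_like(x)
X1 = np.column_stack([ones, x])                # common line for all categories
X2 = np.column_stack([ones, x, cat])           # parallel slopes, shifted intercepts
X3 = np.column_stack([ones, x, cat, x * cat])  # separate slopes and intercepts
```

The X*Cat column is literally the elementwise product of the continuous predictor and the dummy, which is why a nonzero coefficient on it means the slopes differ.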
Fitting would start with the most complex model (the third model) and test if there is evidence of non-parallelism. If none were found, the second model would be examined, and a test would be made for common intercepts. Finally, the simplest model may be an adequate fit.
Let us return to the skin cancer data examined earlier in this chapter. It is of interest to see if there is a consistent difference in skin cancer rates between the two cities. Presumably, Dallas, which receives more intense sun, would have a higher skin cancer rate.
The data is available in the skincancer.jmp data set in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Use all of the data. As before, the log(population size) will be the offset variable.
A preliminary data plot of the empirical cancer rate for the two cities:
shows roughly parallel responses, but now the curvature is much more pronounced in Dallas.
Perhaps a quadratic model should be fit first, with separate response curves for the two cities. In short-hand model notation, this is:

    log(λ) = City log(Age) log(Age)² City*log(Age) City*log(Age)²

where City is the effect of the two cities, log(Age) is the continuous X variable, and the interaction terms represent the non-parallelism of the responses. This is specified as:
As before, use the Generalized Linear Model option of the Analyze->Fit Model platform and don't forget to specify the log(popsize) as the offset variable. This gives the output:
The Whole Model Test shows evidence that the model has predictive ability. The Goodness-of-fit Test shows that this model is a reasonable fit (p-values around .30). The Effect Test shows that perhaps both of the interaction terms can be dropped, but some care must be taken as these are marginal tests and cannot simply be combined.
A “Chunk Test” similar to that seen in logistic regression can be done to see if both interaction terms can be dropped simultaneously:
The p-value is just above α = .05 so I would be a little hesitant to drop both interaction terms. On the other hand, some of the larger age classes have such large sample sizes and large count values that very minor differences in fit can likely be detected.
The simpler model with two parallel quadratic curves was then fit:
This simpler model also has no strong evidence of lack-of-fit. Now, however, the quadratic term cannot be dropped.
The parameter estimates must be interpreted carefully for categorical data. Every package codes indicator variables in different ways, and so the interpretation of the estimates associated with the indicator variables differs among packages. JMP codes indicator variables so that estimates are the difference in response between the specified level and the AVERAGE of all other levels. So in this case, the estimate associated with City[dfw], .401, represents 1/2 the distance between the two parallel curves. Consequently, the difference in log(λ) between Minneapolis and Dallas is 2 × .401 = .802 (SE 2 × .026 = .052). This is a consistent difference for all age groups.
This can also be estimated, without having to worry too much about the coding details, by doing a contrast between the estimates for the city effects:
This gives the same results as above.
This is a difference on the log-scale. As seen in an earlier chapter, this can be converted to an estimate of the ratio of incidence by taking anti-logs. In this case, Dallas is estimated to have e^.802 = 2.22 TIMES the skin cancer rate of Minneapolis. This is consistent with what is seen in the raw data. The SE of this ratio is found using an application of the Delta method⁹. The delta-method indicates that the SE of an exponentiated estimate is found as

    SE(e^θ̂) = SE(θ̂) e^θ̂

In this case

    SE(ratio) = .052 × 2.22 = .11
Confidence bounds are found by finding the usual confidence bounds on the log-scale and then taking anti-logs of the end points. In this case, the 95% confidence interval for the difference in log(λ) is (.802 − 2(.052) → .802 + 2(.052)) or (.698 → .906). Taking anti-logs gives a 95% confidence interval for the ratio of skin cancer rates of (2.01 → 2.47).
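Both the delta-method se and the back-transformed interval are easy to reproduce (a sketch using the rounded estimates above):

```python
import math

diff, se = 0.802, 0.052    # difference in log(lambda) and its se
ratio = math.exp(diff)     # estimated ratio of skin cancer rates, about 2.2
se_ratio = se * ratio      # delta method: SE(e^theta) = SE(theta) * e^theta

lo, hi = diff - 2 * se, diff + 2 * se            # 95% interval on the log scale
ratio_lo, ratio_hi = math.exp(lo), math.exp(hi)  # anti-log the end points
```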
The residual plot (not shown) looks reasonable.
8.8 Categorical X variables - a designed experiment
Just as ANOVA is used to analyze data from designed experiments, generalized linear models can also be used to analyze count data from designed experiments. However, JMP is limited to designs without random effects, e.g. no GLIMs that involve split-plot designs.
Consider an experiment to investigate 10 treatments (a control vs. a 3x3 factorial structure for two factors A and B) on controlling insect numbers. The experiment was run in a randomized block design (see earlier chapters). In each block, the 10 treatments were randomized to 10 different trees. On each tree, a trap was mounted, and the number of insects caught in each trap was recorded.
Here is the raw data.¹⁰
9 A form of a Taylor Series Expansion. Consult many books on statistics for details.
10 This is example 10.4.1 from SAS for Linear Models, 4th Edition. Data extracted from http://ftp.sas.com/samples/A56655 on 2006-07-19.
    Block  Treatment  A  B  Count
      1        1      1  1     6
      1        2      1  2     2
      1        5      2  2     3
      1        8      3  2     3
      1        7      3  1     1
      1        0      0  0    16
      1        3      1  3     4
      1        6      2  3     1
      1        9      3  3     1
      1        4      2  1     5
      2        1      1  1     9
      2        2      1  2     6
      2        5      2  2     4
      2        8      3  2     2
      2        7      3  1     2
      2        0      0  0    25
      2        3      1  3     3
      2        6      2  3     5
      2        9      3  3     0
      2        4      2  1     3
      3        1      1  1     2
      3        2      1  2    14
      3        5      2  2     6
      3        8      3  2     3
      3        7      3  1     2
      3        0      0  0     5
      3        3      1  3     5
      3        6      2  3    17
      3        9      3  3     2
      3        4      2  1     3
      4        1      1  1    22
      4        2      1  2     4
      4        5      2  2     3
      4        8      3  2     4
      4        7      3  1     3
      4        0      0  0     9
      4        3      1  3     5
      4        6      2  3     1
      4        9      3  3     9
      4        4      2  1     2
The data are available in the JMP data file insectcount.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
The RCB model was fit using a generalized linear model with a log link:

    Count_i distributed Poisson(µ_i)
    φ_i = log(µ_i)
    φ_i = Block Treatment

where the simplified syntax Block and Treatment refers to block and treatment effects. Both Block and Treatment are categorical, and will be translated to sets of indicator variables in the usual way.
This model is fit in JMP using the Analyze->Fit Model platform:
Note that the block and treatment variables must be nominally scaled. There is NO offset variable as the insect traps were all of equal size.
This produces the output:
The Goodness-of-fit test shows strong evidence that the model doesn't fit, as the p-values are very small. Lack-of-fit can be caused by inadequacies of the actual model (perhaps a more complex model with block and treatment interactions is needed?), failure of the Poisson assumption, or using the wrong link-function.
The residual plot:
shows that the data are more variable than expected under a Poisson distribution (about 95% of the residuals should be within ± 2). The base model and link function seem reasonable as there is no pattern to the residuals, merely an over-dispersion relative to a Poisson distribution.
The adjustment for over-dispersion is made as seen earlier in the Analyze->Fit Model dialogue box:
which gives the revised output:
Note that the over-dispersion factor ĉ = 3.5. The test-statistics for the Effect Tests are adjusted by this factor (compare the chi-square of 76.37 for the treatment effects in the absence of adjusting for over-dispersion with the chi-square of 21.79 after adjusting for over-dispersion), and the p-values have been adjusted as well.
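The adjustment itself is just a rescaling; with the reported (rounded) ĉ the chi-square and the standard errors scale as (a sketch; JMP's 21.79 comes from the unrounded ĉ):

```python
import math

chat = 3.5                      # estimated over-dispersion factor
chisq_raw = 76.37               # treatment chi-square before adjustment
chisq_adj = chisq_raw / chat    # scaled test statistic, about 21.8
se_inflation = math.sqrt(chat)  # reported se's grow by sqrt(c-hat)
```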
The residuals have been adjusted by √ĉ and now look more acceptable:
Note that the pattern of the residual plot doesn't change; all that the over-dispersion adjustment does is change the residual variance so that the standardization brings the residuals closer to 0.
If you compare the parameter estimates between the two models, you will find that the estimates are unchanged, but the reported se are increased by √ĉ to account for over-dispersion. As is the case with all categorical X variables, the interpretation of the estimates for the indicator variables depends upon the coding used by the package. JMP uses a coding where each indicator variable is compared to the mean response over all indicator variables.
Predictions of the mean response at levels of X are obtained in the usual fashion. The se will also be adjusted for overdispersion. However, it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution – in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio that is implicitly assumed when using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.
If comparisons are of interest among the treatment levels, it is better to use the built-in Contrast facilities of the package to compute the estimates and standard errors rather than trying to do this by hand. For example, suppose we are interested in comparing treatment 0 (the control) to the treatment with factor A at level 1 and factor B at level 1 (corresponding to treatment 1). The contrast is estimated as:
The estimated difference in the log(mean) is −.34 (se .39) which corresponds to a ratio of e^−.34 = .71 of treatment 1 to control, i.e. on average, the number of insects in the treatment 1 traps is 71% of the number of insects in the control traps. An application of the delta-method shows that the se of the ratio is computed as se(e^θ̂) = se(θ̂) e^θ̂ = .39(.71) = .28. However, there was no evidence of a difference in trap counts as the standard error was sufficiently large. A 95% confidence interval for the difference in log(mean) is found as −.34 ± 2(.39) which gives (−1.12 → .44). Because the p-value was larger than α = .05, this confidence interval includes zero. When this interval is anti-logged, the 95% confidence interval for the ratio of mean counts is (.32 → 1.55), i.e. the true ratio of treatment counts to control counts is between .32 and 1.55. Because the p-value was greater than α = .05, this interval contains the value of 1 (indicating that the ratio of counts was 1:1). It is also correct to compute the 95% confidence interval for the ratio using the estimated ratio ± 2(its se). This gives (.71 ± 2(.28)) or (.15 → 1.27). In large samples, these confidence intervals are equivalent. In smaller samples, there is no real objective way to choose between them.
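The contrast arithmetic above can likewise be checked in a few lines (a sketch using the rounded output values):

```python
import math

diff, se = -0.34, 0.39    # difference in log(mean) and its se
ratio = math.exp(diff)    # about .71: treatment 1 traps vs control traps

# Interval on the log scale, then anti-logged.
lo, hi = diff - 2 * se, diff + 2 * se
ratio_ci = (math.exp(lo), math.exp(hi))

# Delta-method se of the ratio, and the (large-sample) direct interval.
se_ratio = se * ratio
direct_ci = (ratio - 2 * se_ratio, ratio + 2 * se_ratio)
```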
8.9 Log-linear models for multi-dimensional contingency tables
In the chapter on logistic regression, k × 2 contingency tables were analyzed to see if the proportions of responses in the population that fell in the two categories (e.g. survived or died) were the same across the k levels of the factor (e.g. sex, or passenger class, or dose of a drug).
The use of logistic regression is a special case of the general r × c contingency table where observations are classified by r levels of a factor and c levels of a response. In a separate chapter, χ² tests were used to test the hypothesis of equal population proportions in the c levels of the response across all levels of the factor. This is also known as the test of independence of the response to levels of the factor.
This can be generalized to the analysis of multi-dimensional tables using Poisson-regression. In more advanced courses, you can learn how the two previous cases are simple cases of this more general modelling approach. Consult Agresti's book for a fuller account of this topic.
8.10 Variable selection methods
To be added later
8.11 Summary
Poiss<strong>on</strong>-regressi<strong>on</strong> is the standard tool <str<strong>on</strong>g>for</str<strong>on</strong>g> the analysis of “smallish” count data. If the counts are large (say<br />
in the orders of hundreds), you could likely use ordinary or weighted regressi<strong>on</strong> methods without difficulty.<br />
This chapter <strong>on</strong>ly c<strong>on</strong>cerns itself with data collected under a simple random sample or a completely<br />
randomized design. If the data are collected under other designs, please c<strong>on</strong>sult with a statistician <str<strong>on</strong>g>for</str<strong>on</strong>g> the<br />
proper analysis.<br />
A comm<strong>on</strong> problem that have encountered are data that have been prestandardized. For example, data<br />
may recorded <strong>on</strong> the number of tree stems in a 100 m 2 test plots. This data could likely be modeled using<br />
poiss<strong>on</strong> regressi<strong>on</strong>. But, then the data are standardized to a “per hectare” basis. These standardized data<br />
c○2012 Carl James Schwarz 695 November 23, 2012
CHAPTER 8. POISSON REGRESSION<br />
are NO LONGER distributed as a Poiss<strong>on</strong> distributi<strong>on</strong>. It would be preferable to analyze the data using the<br />
sampling units that were used to collect the data with an offset variable being used to adjust <str<strong>on</strong>g>for</str<strong>on</strong>g> differing<br />
sizes of survey units.<br />
A common cause of overdispersion is non-independence in the data. For example, data may be collected using a cluster design rather than by a simple random sample. Overdispersion can be accounted for using quasi-likelihood methods. As a rule of thumb, overdispersion factors ĉ of 10 or less are acceptable. Very large overdispersion factors indicate other serious problems in the model. An alternative to the use of the correction factor is a different distribution such as the negative binomial distribution.
Related models for this chapter are the Zero-inflated Poisson (ZIP) models. In these models there is an excess number of zeroes relative to what would be expected under a Poisson model. The ZIP model has two parts – the probability that an observation will be zero, and then the distribution of the non-zero counts. There is a substantial base in the literature on this model.
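A minimal sketch of the ZIP probability function (the values of pi and lam below are illustrative only, not estimates from any data in this chapter):

```python
import math

def zip_pmf(k, pi, lam):
    """P(Y = k) under a zero-inflated Poisson: a point mass at zero
    with probability pi, mixed with a Poisson(lam) count."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi * (k == 0) + (1 - pi) * poisson

# The probabilities still sum to 1, but zeroes are more common
# than under a plain Poisson(lam).
pi, lam = 0.3, 2.0
total = sum(zip_pmf(k, pi, lam) for k in range(50))
p_zero = zip_pmf(0, pi, lam)
```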