

5.1.1 How Many Predictors Can You Use?

Data are unbalanced on Y if y = 1 occurs relatively few times or if y = 0 occurs relatively few times. This limits the number of predictors for which effects can be estimated precisely. One guideline suggests there should ideally be at least 10 outcomes of each type for every predictor. For example, if y = 1 only 30 times out of n = 1000 observations, the model should have no more than about three predictors even though the overall sample size is large.
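The arithmetic behind the guideline is simple; here is a minimal Python sketch (the helper max_predictors and its threshold argument are illustrative, not from the text):

```python
# Events-per-variable rule of thumb: allow roughly one predictor per
# 10 outcomes of the rarer type (the min of the y = 1 and y = 0 counts).
def max_predictors(n_y1, n_y0, per_predictor=10):
    return min(n_y1, n_y0) // per_predictor

# The example above: y = 1 occurs 30 times in n = 1000 observations.
print(max_predictors(30, 970))  # 3, i.e., about three predictors
```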

This guideline is approximate. When it is not satisfied, software still fits the model. In practice the number of variables is often large, sometimes even of similar magnitude to the number of observations. However, when the guideline is violated, ML estimates may be quite biased and estimates of standard errors may be poor. From results to be discussed in Section 5.3.1, as the number of model predictors increases, it becomes more likely that some ML model parameter estimates are infinite.
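To see this phenomenon in a small artificial example, consider fitting a logistic regression to perfectly separated data with statsmodels; whether the fit raises an error or merely warns depends on the statsmodels version, so the sketch hedges with a try/except:

```python
# Perfect separation: x > 2.5 exactly predicts y = 1, so the ML
# estimate of the slope is infinite and iterative fitting diverges.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 1, 1, 1, 1])          # separated at x = 2.5

try:
    fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
    print(fit.params)   # if a fit is returned, the slope is huge and unstable
except Exception as err:  # e.g., PerfectSeparationError in some versions
    print("separation detected:", err)
```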

Cautions that apply to building ordinary regression models hold for any GLM. For example, models with several predictors often suffer from multicollinearity: correlations among predictors making it seem that no one variable is important when all the others are in the model. A variable may seem to have little effect because it overlaps considerably with other predictors in the model, itself being predicted well by the other predictors. Deleting such a redundant predictor can be helpful, for instance to reduce standard errors of other estimated effects.
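A standard diagnostic for this overlap is the variance inflation factor, VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the other predictors. A sketch with statsmodels, using simulated data in place of real predictors (the variable names and the built-in correlation are ours):

```python
# Large VIFs (a common cutoff is 10) flag predictors that the other
# predictors explain well -- candidates for deletion.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(0)
weight = rng.normal(2.4, 0.6, 100)
width = 10.0 + 6.0 * weight + rng.normal(0.0, 0.5, 100)  # near-collinear

X = add_constant(pd.DataFrame({"weight": weight, "width": width}))
for j, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, j), 1))
```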

5.1.2 Example: Horseshoe Crabs Revisited

The horseshoe crab data set of Table 3.2, analyzed in Sections 3.3.2, 4.1.3, and 4.4.1, has four predictors: color (four categories), spine condition (three categories), weight, and width of the carapace shell. We now fit logistic regression models using all of these to predict whether a female crab has satellites (males that could mate with her). Again we let y = 1 if there is at least one satellite, and y = 0 otherwise.

Consider a model with all the main effects. Let {c1, c2, c3} be indicator variables for the first three (of four) colors, and let {s1, s2} be indicator variables for the first two (of three) spine conditions. The model

logit[P(Y = 1)] = α + β1 weight + β2 width + β3 c1 + β4 c2 + β5 c3 + β6 s1 + β7 s2

treats color and spine condition as nominal-scale factors. Table 5.1 shows the results.
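In software this model can be specified directly; a sketch with the statsmodels formula interface, assuming the crab data sit in a local file crabs.dat with columns y, weight, width, color, and spine (the file name and column names are assumptions):

```python
# Fit the main-effects logistic model; C(...) builds indicator
# variables for the nominal factors color and spine.
import pandas as pd
import statsmodels.formula.api as smf

crabs = pd.read_table("crabs.dat", sep=r"\s+")   # assumed local copy
fit = smf.logit("y ~ weight + width + C(color) + C(spine)", data=crabs).fit()
print(fit.summary())
```

Note that the default treatment coding drops the first level of each factor as the reference category, whereas the model above takes the last color and spine categories as the baseline; the fits are equivalent, only the parameterization differs.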

A likelihood-ratio test that Y is jointly independent of these predictors simultaneously tests H0: β1 = ··· = β7 = 0. The test statistic is −2(L0 − L1) = 40.6 with df = 7 (P < 0.0001), extremely strong evidence that at least one predictor has an effect.
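The statistic compares the maximized log-likelihoods of the intercept-only and full models; a self-contained sketch under the same file and column assumptions as above:

```python
# -2(L0 - L1), with L0 from the null model and L1 from the full model;
# under H0 this is approximately chi-squared with df = 7.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

crabs = pd.read_table("crabs.dat", sep=r"\s+")   # assumed local copy
fit1 = smf.logit("y ~ weight + width + C(color) + C(spine)",
                 data=crabs).fit(disp=False)
fit0 = smf.logit("y ~ 1", data=crabs).fit(disp=False)

lr = -2 * (fit0.llf - fit1.llf)      # the text reports 40.6
print(lr, chi2.sf(lr, df=7))         # P-value far below 0.0001
```

(statsmodels also reports this test directly as fit1.llr and fit1.llr_pvalue.)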
