Logistic Regression residuals (UC Berkeley Admissions example)
UC Berkeley Admissions example:

> odds.ratio=function(x,addtocounts=0){ # odds ratio of a 2x2 table; addtocounts adds a constant to every cell
+ x=x+addtocounts
+ (x[1,1]*x[2,2])/(x[1,2]*x[2,1])
+ }

> data(UCBAdmissions)
> dimnames(UCBAdmissions)
$Admit
[1] "Admitted" "Rejected"

$Gender
[1] "Male"   "Female"

$Dept
[1] "A" "B" "C" "D" "E" "F"

> ftable(UCBAdmissions,row.vars="Dept",col.vars=c("Gender","Admit"))
     Gender     Male            Female
     Admit  Admitted Rejected Admitted Rejected
Dept
A               512      313       89       19
B               353      207       17        8
C               120      205      202      391
D               138      279      131      244
E                53      138       94      299
F                22      351       24      317

Exploratory Analysis

> ftable(round(prop.table(UCBAdmissions,c(2,3)),2),row.vars="Dept",col.vars=c("Gender","Admit"))
     Gender     Male            Female
     Admit  Admitted Rejected Admitted Rejected
Dept
A              0.62     0.38     0.82     0.18
B              0.63     0.37     0.68     0.32
C              0.37     0.63     0.34     0.66
D              0.33     0.67     0.35     0.65
E              0.28     0.72     0.24     0.76
F              0.06     0.94     0.07     0.93

> round(apply(UCBAdmissions,3,odds.ratio),2) # conditional odds ratio (on department)
   A    B    C    D    E    F
0.35 0.80 1.13 0.92 1.22 0.83

The odds of men being admitted are lower than those of women in departments A, B, D and F.

> UCBGbyA=margin.table(UCBAdmissions,c(2,1))
> UCBGbyA
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278



> round(prop.table(UCBGbyA,1),2)
        Admit
Gender   Admitted Rejected
  Male       0.45     0.55
  Female     0.30     0.70

> odds.ratio(UCBGbyA) # Marginal odds ratio.
[1] 1.84108

Note that Simpson's paradox is present: the marginal odds ratio of 1.84108 indicates that the odds of men being admitted are higher overall, even though the department-conditional odds ratios above show the opposite in most departments. Let us now test for independence between admission status and gender.

> chisq.test(UCBGbyA,correct=FALSE)

        Pearson's Chi-squared test

data:  UCBGbyA
X-squared = 92.2053, df = 1, p-value < 2.2e-16

There is strong evidence that admission status and gender are not (marginally) independent.
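
As a complementary check (not part of the original handout), a Cochran-Mantel-Haenszel test assesses the Gender/Admit association while stratifying by department; mantelhaen.test() in base R accepts the 2 x 2 x 6 table directly and also reports a common odds-ratio estimate across departments.

> # Hypothetical follow-up (output not shown in the original handout):
> # test the Gender-Admit association conditional on Dept and estimate a
> # common (Mantel-Haenszel) odds ratio across departments.
> mantelhaen.test(UCBAdmissions)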

> round(prop.table(margin.table(UCBAdmissions,c(2,3)),1),2)
        Dept
Gender      A    B    C    D    E    F
  Male   0.31 0.21 0.12 0.15 0.07 0.14
  Female 0.06 0.01 0.32 0.20 0.21 0.19

> round(prop.table(margin.table(UCBAdmissions,c(1,3)),2),2)
          Dept
Admit        A    B    C    D    E    F
  Admitted 0.64 0.63 0.35 0.34 0.25 0.06
  Rejected 0.36 0.37 0.65 0.66 0.75 0.94

Most men apply to Departments A and B, where the acceptance rate is high, while most women apply to Departments C, D, E and F, where the acceptance rate is low; this application pattern produces the reversal seen above, as illustrated in the sketch below.
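
To make this explicit, the marginal admission rate for each gender is a weighted average of the department-specific rates, weighted by where that gender applies. The sketch below is not part of the original handout; the objects rate and weight are introduced purely for illustration, and the result should reproduce the marginal rates of 0.45 (men) and 0.30 (women) obtained from UCBGbyA above.

> # Hypothetical check: P(Admit | Gender) as a weighted average over departments.
> rate=prop.table(UCBAdmissions,c(2,3))["Admitted",,]      # P(Admit | Gender, Dept)
> weight=prop.table(margin.table(UCBAdmissions,c(2,3)),1)  # P(Dept | Gender)
> round(rowSums(rate*weight),2)                            # should match 0.45 and 0.30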

Logistic Regression

> Dept=rep(c("A","B","C","D","E","F"),each=2)
> Gender=rep(c("Male","Female"),6)
> Counts=matrix(UCBAdmissions,ncol=2,byrow=TRUE,dimnames=list(NULL,c("Admit","Reject")))
> berk=data.frame(Dept,Gender,Counts);berk
   Dept Gender Admit Reject
1     A   Male   512    313
2     A Female    89     19
3     B   Male   353    207
4     B Female    17      8
5     C   Male   120    205
6     C Female   202    391
7     D   Male   138    279
8     D Female   131    244
9     E   Male    53    138
10    E Female    94    299
11    F   Male    22    351
12    F Female    24    317



> #------ Model 1
> UCB.logit=glm(cbind(Admit,Reject)~Gender+Dept,family=binomial(link=logit),
+ contrasts=list(Dept=contr.treatment(6,base=6,contrasts=TRUE)),data=berk)
> summary(UCB.logit)

Call:
glm(formula = cbind(Admit, Reject) ~ Gender + Dept, family = binomial(link = logit),
    data = berk, contrasts = list(Dept = contr.treatment(6, base = 6,
        contrasts = TRUE)))

Deviance Residuals:
      1        2        3        4        5        6        7        8
-1.2487   3.7189  -0.0560   0.2706   1.2533  -0.9243   0.0826  -0.0858
      9       10       11       12
 1.2205  -0.8509  -0.2076   0.2052

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.62456    0.15773 -16.640  < 2e-16 ***


We note that residuals 1 and 2, which correspond to Department A, are large. Residual 1 is negative, indicating that fewer males were accepted in Department A than the model predicts. The individual Wald test for gender (p-value 0.217) suggests that the gender predictor can be removed.
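
A drop-in-deviance (likelihood-ratio) test is a natural complement to the Wald test here. The call below is only a sketch and not part of the original output; drop1() with test="Chisq" reports the change in deviance from dropping each term of Model 1.

> # Hypothetical complementary check (output not shown in the original):
> # likelihood-ratio tests for dropping each term from Model 1.
> drop1(UCB.logit,test="Chisq")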

> #------ Model 2
> UCBnoG.logit=update(UCB.logit,.~.-Gender)
> summary(UCBnoG.logit)

Call:
glm(formula = cbind(Admit, Reject) ~ Dept, family = binomial(link = logit),
    data = berk, contrasts = list(Dept = contr.treatment(6, base = 6,
        contrasts = TRUE)))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4064  -0.4550   0.1456   0.5471   4.1323

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.6756     0.1524 -17.553  < 2e-16 ***


Residual analysis

> # standardized pearson
> round(residuals(UCBnoG.logit,type="pearson")/sqrt(1-lm.influence(UCBnoG.logit)$hat),2)
    1     2     3     4     5     6     7     8     9    10    11    12
-4.15  4.15 -0.50  0.50  0.87 -0.87 -0.55  0.55  1.00 -1.00 -0.62  0.62
> round(rstandard(UCBnoG.logit),2) # standardized deviance
    1     2     3     4     5     6     7     8     9    10    11    12
-4.13  4.39 -0.50  0.51  0.86 -0.87 -0.55  0.54  0.99 -1.01 -0.63  0.61
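
As an aside not in the original handout, the manual hat-value adjustment above should be equivalent to asking rstandard() for Pearson-type residuals directly (the dispersion is 1 for the binomial family):

> # Hypothetical equivalent of the manual standardized Pearson residuals above.
> round(rstandard(UCBnoG.logit,type="pearson"),2)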

The Department A residuals still indicate a lack of fit. The Department C residuals are marginally better than before (they were 1.88). We now remove Department A from the data set entirely and check whether we can at least fit a "good" model for the remaining departments.

> #------ Model 3
> UCBnoGA.logit=glm(cbind(Admit,Reject)~Dept,family=binomial(link=logit),
+ contrasts=list(Dept=contr.treatment(5,base=5,contrasts=TRUE)),
+ data=berk,subset=(Dept!="A"))
> summary(UCBnoGA.logit)

Call:
glm(formula = cbind(Admit, Reject) ~ Dept, family = binomial(link = logit),
    data = berk, subset = (Dept != "A"), contrasts = list(Dept = contr.treatment(5,
        base = 5, contrasts = TRUE)))

Deviance Residuals:
      3        4        5        6        7        8        9       10
-0.1041   0.4978   0.6950  -0.5177  -0.3764   0.3952   0.8119  -0.5754
     11       12
-0.4341   0.4418

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.6756     0.1524 -17.553  < 2e-16 ***


Residual analysis

> # standardized pearson
> round(residuals(UCBnoGA.logit,type="pearson")/sqrt(1-lm.influence(UCBnoGA.logit)$hat),2)
    3     4     5     6     7     8     9    10    11    12
-0.50  0.50  0.87 -0.87 -0.55  0.55  1.00 -1.00 -0.62  0.62
> round(rstandard(UCBnoGA.logit),2) # standardized deviance
    3     4     5     6     7     8     9    10    11    12
-0.50  0.51  0.86 -0.87 -0.55  0.54  0.99 -1.01 -0.63  0.61

Remark: A likelihood-ratio test cannot be used to compare this model with any of the previous ones, since it is fit to a subset of the original data.
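
A rough overall check that is still available (a sketch only, not part of the original handout) is to compare the residual deviance of Model 3 with its chi-squared reference distribution; a large p-value would be consistent with an adequate fit for the remaining departments.

> # Hypothetical goodness-of-fit check: residual deviance vs. chi-squared
> # on the residual degrees of freedom of Model 3.
> pchisq(deviance(UCBnoGA.logit),df.residual(UCBnoGA.logit),lower.tail=FALSE)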

