Logistic Regression residuals (UC Berkeley Admissions example)
UC Berkeley Admissions example:

> odds.ratio=function(x,addtocounts=0){ # odds ratio of a 2x2 table; addtocounts adds a constant to every cell
+ x=x+addtocounts
+ (x[1,1]*x[2,2])/(x[1,2]*x[2,1])
+ }

> data(UCBAdmissions)
> dimnames(UCBAdmissions)
$Admit
[1] "Admitted" "Rejected"

$Gender
[1] "Male"   "Female"

$Dept
[1] "A" "B" "C" "D" "E" "F"

> ftable(UCBAdmissions,row.vars="Dept",col.vars=c("Gender","Admit"))
     Gender     Male            Female
     Admit  Admitted Rejected Admitted Rejected
Dept
A               512      313       89       19
B               353      207       17        8
C               120      205      202      391
D               138      279      131      244
E                53      138       94      299
F                22      351       24      317

Exploratory Analysis

> ftable(round(prop.table(UCBAdmissions,c(2,3)),2),row.vars="Dept",col.vars=c("Gender","Admit"))
     Gender     Male            Female
     Admit  Admitted Rejected Admitted Rejected
Dept
A              0.62     0.38     0.82     0.18
B              0.63     0.37     0.68     0.32
C              0.37     0.63     0.34     0.66
D              0.33     0.67     0.35     0.65
E              0.28     0.72     0.24     0.76
F              0.06     0.94     0.07     0.93

> round(apply(UCBAdmissions,3,odds.ratio),2) # conditional odds ratio (on department)
   A    B    C    D    E    F
0.35 0.80 1.13 0.92 1.22 0.83

The odds of men being admitted are lower than those of women in departments A, B, D and F.

> UCBGbyA=margin.table(UCBAdmissions,c(2,1))
> UCBGbyA
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278



> round(prop.table(UCBGbyA,1),2)
        Admit
Gender   Admitted Rejected
  Male       0.45     0.55
  Female     0.30     0.70

> odds.ratio(UCBGbyA) # Marginal odds ratio.
[1] 1.84108

Note that Simpson's paradox is present: the marginal odds ratio of 1.84108 indicates that the odds of men being admitted are higher overall, even though the department-conditional odds ratios above show the opposite in most departments. Let us now test for independence between admission status and gender.

> chisq.test(UCBGbyA,correct=FALSE)

        Pearson's Chi-squared test

data:  UCBGbyA
X-squared = 92.2053, df = 1, p-value < 2.2e-16

There is strong evidence that admission status and gender are not (marginally) independent.
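
As a complementary check (not part of the original handout), a Cochran-Mantel-Haenszel test assesses the Gender/Admit association while stratifying by department; mantelhaen.test() in base R accepts the 2 x 2 x 6 table directly and also reports a common odds-ratio estimate across departments.

> # Hypothetical follow-up (output not shown in the original handout):
> # test the Gender-Admit association conditional on Dept and estimate a
> # common (Mantel-Haenszel) odds ratio across departments.
> mantelhaen.test(UCBAdmissions)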

> round(prop.table(margin.table(UCBAdmissions,c(2,3)),1),2)
        Dept
Gender      A    B    C    D    E    F
  Male   0.31 0.21 0.12 0.15 0.07 0.14
  Female 0.06 0.01 0.32 0.20 0.21 0.19

> round(prop.table(margin.table(UCBAdmissions,c(1,3)),2),2)
          Dept
Admit        A    B    C    D    E    F
  Admitted 0.64 0.63 0.35 0.34 0.25 0.06
  Rejected 0.36 0.37 0.65 0.66 0.75 0.94

Most men apply to Departments A and B, where the acceptance rate is high, while most women apply to Departments C, D, E and F, where the acceptance rate is low; this application pattern produces the reversal seen above, as illustrated in the sketch below.
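
To make this explicit, the marginal admission rate for each gender is a weighted average of the department-specific rates, weighted by where that gender applies. The sketch below is not part of the original handout; the objects rate and weight are introduced purely for illustration, and the result should reproduce the marginal rates of 0.45 (men) and 0.30 (women) obtained from UCBGbyA above.

> # Hypothetical check: P(Admit | Gender) as a weighted average over departments.
> rate=prop.table(UCBAdmissions,c(2,3))["Admitted",,]      # P(Admit | Gender, Dept)
> weight=prop.table(margin.table(UCBAdmissions,c(2,3)),1)  # P(Dept | Gender)
> round(rowSums(rate*weight),2)                            # should match 0.45 and 0.30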

Logistic Regression

> Dept=rep(c("A","B","C","D","E","F"),each=2)
> Gender=rep(c("Male","Female"),6)
> Counts=matrix(UCBAdmissions,ncol=2,byrow=TRUE,dimnames=list(NULL,c("Admit","Reject")))
> berk=data.frame(Dept,Gender,Counts);berk
   Dept Gender Admit Reject
1     A   Male   512    313
2     A Female    89     19
3     B   Male   353    207
4     B Female    17      8
5     C   Male   120    205
6     C Female   202    391
7     D   Male   138    279
8     D Female   131    244
9     E   Male    53    138
10    E Female    94    299
11    F   Male    22    351
12    F Female    24    317



> #------ Model 1
> UCB.logit=glm(cbind(Admit,Reject)~Gender+Dept,family=binomial(link=logit),
+ contrasts=list(Dept=contr.treatment(6,base=6,contrasts=TRUE)),data=berk)
> summary(UCB.logit)

Call:
glm(formula = cbind(Admit, Reject) ~ Gender + Dept, family = binomial(link = logit),
    data = berk, contrasts = list(Dept = contr.treatment(6, base = 6,
        contrasts = TRUE)))

Deviance Residuals:
      1        2        3        4        5        6        7        8
-1.2487   3.7189  -0.0560   0.2706   1.2533  -0.9243   0.0826  -0.0858
      9       10       11       12
 1.2205  -0.8509  -0.2076   0.2052

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.62456    0.15773 -16.640  < 2e-16 ***


We note that residuals 1 and 2, which correspond to Department A, are large. Residual 1 is negative, indicating that fewer males were accepted in Department A than the model predicts. The individual Wald test for gender (p-value 0.217) suggests that the gender predictor can be removed.
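
A drop-in-deviance (likelihood-ratio) test is a natural complement to the Wald test here. The call below is only a sketch and not part of the original output; drop1() with test="Chisq" reports the change in deviance from dropping each term of Model 1.

> # Hypothetical complementary check (output not shown in the original):
> # likelihood-ratio tests for dropping each term from Model 1.
> drop1(UCB.logit,test="Chisq")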

> #------ Model 2
> UCBnoG.logit=update(UCB.logit,.~.-Gender)
> summary(UCBnoG.logit)

Call:
glm(formula = cbind(Admit, Reject) ~ Dept, family = binomial(link = logit),
    data = berk, contrasts = list(Dept = contr.treatment(6, base = 6,
        contrasts = TRUE)))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4064  -0.4550   0.1456   0.5471   4.1323

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.6756     0.1524 -17.553  < 2e-16 ***


Residual analysis

> # standardized pearson
> round(residuals(UCBnoG.logit,type="pearson")/sqrt(1-lm.influence(UCBnoG.logit)$hat),2)
    1     2     3     4     5     6     7     8     9    10    11    12
-4.15  4.15 -0.50  0.50  0.87 -0.87 -0.55  0.55  1.00 -1.00 -0.62  0.62
> round(rstandard(UCBnoG.logit),2) # standardized deviance
    1     2     3     4     5     6     7     8     9    10    11    12
-4.13  4.39 -0.50  0.51  0.86 -0.87 -0.55  0.54  0.99 -1.01 -0.63  0.61
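
As an aside not in the original handout, the manual hat-value adjustment above should be equivalent to asking rstandard() for Pearson-type residuals directly (the dispersion is 1 for the binomial family):

> # Hypothetical equivalent of the manual standardized Pearson residuals above.
> round(rstandard(UCBnoG.logit,type="pearson"),2)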

The Department A residuals still indicate a lack of fit. The Department C residuals are marginally better than before (they were 1.88). We now remove Department A from the data set entirely and check whether we can at least fit a "good" model for the remaining departments.

> #------ Model 3
> UCBnoGA.logit=glm(cbind(Admit,Reject)~Dept,family=binomial(link=logit),
+ contrasts=list(Dept=contr.treatment(5,base=5,contrasts=TRUE)),
+ data=berk,subset=(Dept!="A"))
> summary(UCBnoGA.logit)

Call:
glm(formula = cbind(Admit, Reject) ~ Dept, family = binomial(link = logit),
    data = berk, subset = (Dept != "A"), contrasts = list(Dept = contr.treatment(5,
        base = 5, contrasts = TRUE)))

Deviance Residuals:
      3        4        5        6        7        8        9       10
-0.1041   0.4978   0.6950  -0.5177  -0.3764   0.3952   0.8119  -0.5754
     11       12
-0.4341   0.4418

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.6756     0.1524 -17.553  < 2e-16 ***


Residual analysis

> # standardized pearson
> round(residuals(UCBnoGA.logit,type="pearson")/sqrt(1-lm.influence(UCBnoGA.logit)$hat),2)
    3     4     5     6     7     8     9    10    11    12
-0.50  0.50  0.87 -0.87 -0.55  0.55  1.00 -1.00 -0.62  0.62
> round(rstandard(UCBnoGA.logit),2) # standardized deviance
    3     4     5     6     7     8     9    10    11    12
-0.50  0.51  0.86 -0.87 -0.55  0.54  0.99 -1.01 -0.63  0.61

Remark: A likelihood-ratio test cannot be used to compare this model with any of the previous ones, since it is fit to a subset of the original data.
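
A rough overall check that is still available (a sketch only, not part of the original handout) is to compare the residual deviance of Model 3 with its chi-squared reference distribution; a large p-value would be consistent with an adequate fit for the remaining departments.

> # Hypothetical goodness-of-fit check: residual deviance vs. chi-squared
> # on the residual degrees of freedom of Model 3.
> pchisq(deviance(UCBnoGA.logit),df.residual(UCBnoGA.logit),lower.tail=FALSE)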

