SUGI 26: A SAS/IML(r) Macro for Goodness-of-Fit Testing in Logistic ...

Statistics, Data Analysis, and Data Mining

Paper 265-26

A SAS/IML® Macro for Goodness-of-Fit Testing in Logistic Regression Models with Sparse Data

Oliver Kuss, Institute of Medical Epidemiology, Biometry and Informatics, Halle/Saale, Germany

ABSTRACT

The logistic regression model has become the standard analyzing tool for binary responses in a variety of disciplines. Methods for assessing goodness-of-fit, however, are less developed, and this is especially pronounced in calculating goodness-of-fit tests with sparse data, when the standard tests (deviance and Pearson test) behave unsatisfactorily.

In our paper we show two solutions to the problem that are implemented in the LOGISTIC procedure in SAS® software, and introduce five additional testing procedures from the statistical literature. By means of a simulation study we show that these additional tests are valid instruments for assessing goodness-of-fit in logistic regression models, even with sparse data. Finally, we present the SAS/IML macro %GOFLOGIT, which allows calculation of the introduced tests, and illustrate the macro with an example from occupational epidemiology on hand eczema in hairdressers.

INTRODUCTION

The logistic regression model has become the standard analyzing tool for binary responses in a variety of disciplines. There are many reasons for this: ease of interpretation of parameters as adjusted odds ratios, the possibility of calculating prognoses for the event of interest, and the availability of standard software. The LOGISTIC procedure is the standard tool in SAS software for fitting logistic regression models, but solutions with the GENMOD, the PROBIT, or the CATMOD procedure are also possible.

Methods for assessing goodness-of-fit, however, are less developed, which may be due to the relative youth and the greater mathematical complexity of the logistic regression model compared to, for example, the linear regression model.

In principle, there are two different approaches to assessing goodness-of-fit in logistic regression models. The first, known as residual analysis, investigates the model on the level of individual observations and looks for those observations which are not adequately described by the model or which are highly influential on the model fit. Among the different SAS procedures for logistic regression, PROC LOGISTIC offers the most extensive possibilities for residual analysis: the INFLUENCE option in the MODEL statement supplies a number of influence and outlier diagnostics, and the IPLOTS option provides the corresponding plots.

The second approach to goodness-of-fit, on which we will focus, seeks to combine the information on the amount of lack-of-fit in a single number. Statistical tests, so-called goodness-of-fit tests, are then performed to judge whether the observed lack-of-fit is statistically significant or due to random chance. There are two standard procedures, the deviance and the Pearson test, and these are routinely provided by PROC GENMOD and PROC PROBIT and optionally by PROC LOGISTIC (use the SCALE=none option in the MODEL statement).

These tests, however, have a serious problem with sparse data, where "sparse data" means that for every pattern of covariate values we have only a small number of observations. In general, this would be the case if there are continuous or many covariates. In extreme cases each individual observation has its own risk profile or pattern of covariates. In this case of sparseness, which in our view is more the rule than the exception in today's data sets, the deviance and the Pearson test no longer have a chi-square distribution under the null hypothesis and so are no longer valid measures of model fit. Note that this is just an extension of the familiar problem of small cell counts in contingency tables.

In the following, we state the problem with some more mathematical rigor, show two possibilities to circumvent the problem with PROC LOGISTIC, give five additional testing procedures from the statistical literature, and present some results from a simulation study which demonstrate that these additional tests are valid goodness-of-fit tests for logistic regression models, even with sparse data. Finally, we present the SAS/IML macro %GOFLOGIT that allows the calculation of these new procedures and illustrate the macro with an example from occupational epidemiology on hand eczema in hairdressers.

GOODNESS-OF-FIT TESTS IN LOGISTIC REGRESSION WITH SPARSE DATA

THE MODEL

Let $y_i$ be the response with $y_i \sim \mathrm{binomial}(m_i, \pi_i)$. The model equation is $\mathrm{logit}(\pi_i) = x_i\beta$, $i = 1, \ldots, N$, where $\beta = (\beta_0, \ldots, \beta_p)'$ is a vector of regression parameters corresponding to a vector of $p+1$ covariates $x_i = (1, x_{i1}, \ldots, x_{ip})$. Estimates of the $\beta_j$ are usually calculated by maximum likelihood, and we get estimates $\hat{\pi}_i$ of the $\pi_i$ by plugging the $\hat{\beta}_j$ into the model equation.

Note that we consider grouped observations, where two individual observations with the same covariate pattern belong to the same group.
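As a rough illustration of the grouping just described, the following Python sketch (not part of the %GOFLOGIT macro) collapses individual 0/1 responses into covariate patterns with counts $y_i$ and $m_i$, and evaluates $\hat{\pi}_i$ through the inverse logit. The data, the coefficient values, and the function names `inv_logit` and `group_by_pattern` are hypothetical and chosen only for this example; the coefficients are not maximum likelihood estimates.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor x_i*beta to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def group_by_pattern(observations):
    """Collapse (covariates, 0/1 response) records into covariate patterns.

    Returns (pattern, y_i, m_i) triples in order of first appearance, where
    y_i is the number of events and m_i the number of observations.
    """
    counts = {}
    for x, response in observations:
        events, trials = counts.get(x, (0, 0))
        counts[x] = (events + response, trials + 1)
    return [(x, e, t) for x, (e, t) in counts.items()]


# Hypothetical individual-level data: covariates (x_i1, x_i2) and a 0/1 response.
observations = [((1, 0), 1), ((1, 0), 0), ((0, 1), 1), ((1, 0), 1), ((0, 1), 0)]
patterns = group_by_pattern(observations)

# Hypothetical coefficients (beta_0, beta_1, beta_2); assumed, not fitted.
beta_hat = (-0.5, 1.0, 0.25)
pi_hat = [
    inv_logit(beta_hat[0] + beta_hat[1] * x[0] + beta_hat[2] * x[1])
    for x, _, _ in patterns
]

for (x, y, m), p in zip(patterns, pi_hat):
    print(f"pattern={x} events={y} trials={m} pi_hat={p:.3f}")
```

With continuous covariates almost every pattern would contain a single observation ($m_i = 1$), which is the sparse situation discussed in the introduction.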
Translated into PROC LOGISTIC language, this means that we consider the model to be specified in the events/trials syntax, where events counts the number of events ($y_i$) and trials counts the number of individual observations ($m_i$) in a specific covariate pattern.

STANDARD GOODNESS-OF-FIT TESTS

To assess goodness-of-fit in logistic regression, one generally calculates the Pearson statistic

$$X^2 = \sum_{i=1}^{N} \frac{(y_i - m_i\hat{\pi}_i)^2}{m_i\hat{\pi}_i(1 - \hat{\pi}_i)}$$

or the deviance

$$D = 2\sum_{i=1}^{N} \left[ y_i \log\left(\frac{y_i}{m_i\hat{\pi}_i}\right) + (m_i - y_i)\log\left(\frac{m_i - y_i}{m_i(1 - \hat{\pi}_i)}\right) \right].$$

Both rely on the principle of comparing observed ($y_i$) to predicted ($m_i\hat{\pi}_i$) values and should be large if the model does not fit the data well. To judge statistical significance, they are usually compared to a $\chi^2_{N-p-1}$ distribution. The validity of this distribution, however, relies on the assumption of large $m_i$, and both tests show unsatisfactory behaviour with sparse data, that is, small $m_i$. It can be shown (McCullagh and Nelder, 1986) that D degenerates to ...
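The Pearson statistic and the deviance defined in this section can be computed directly from the grouped counts. The Python sketch below is an illustration only, not the %GOFLOGIT macro. The function names and the example numbers are hypothetical, the fitted probabilities are assumed rather than estimated, and terms of the form 0·log(0) are set to zero, which is the usual convention for the deviance.

```python
import math


def pearson_chi2(y, m, pi_hat):
    """Pearson statistic: sum of (y_i - m_i*pi_i)^2 / (m_i*pi_i*(1 - pi_i))."""
    return sum(
        (yi - mi * p) ** 2 / (mi * p * (1.0 - p))
        for yi, mi, p in zip(y, m, pi_hat)
    )


def _term(observed, expected):
    """observed * log(observed / expected), taken as 0 when observed is 0."""
    if observed == 0:
        return 0.0
    return observed * math.log(observed / expected)


def deviance(y, m, pi_hat):
    """Deviance: 2 * sum of event and non-event log terms over all patterns."""
    total = 0.0
    for yi, mi, p in zip(y, m, pi_hat):
        total += _term(yi, mi * p) + _term(mi - yi, mi * (1.0 - p))
    return 2.0 * total


# Hypothetical grouped data: events y_i, trials m_i, and assumed fitted
# probabilities pi_hat_i (not obtained from an actual model fit).
y = [2, 5, 9]
m = [10, 10, 10]
pi_hat = [0.3, 0.5, 0.8]

print("Pearson X2:", pearson_chi2(y, m, pi_hat))
print("Deviance D:", deviance(y, m, pi_hat))
```

Both statistics are zero when every observed count equals its predicted value $m_i\hat{\pi}_i$ and grow as the two diverge. The sketch stops at the statistics themselves and does not compute a p-value from the reference distribution.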

Osius, G. and Rojek, D. (1992), "Normal Goodness-of-Fit Tests for Multinomial Models With Large Degrees of Freedom," Journal of the American Statistical Association, 87, 1145-1152.

SAS Institute Inc. (1988), SAS/IML User's Guide, Release 6.03 Edition, Cary, NC: SAS Institute Inc.

So, Y. (1993), "A Tutorial on Logistic Regression," Proceedings of the Eighteenth Annual SAS Users Group International Conference, 18, 1290-1295.

White, H. (1982), "Maximum Likelihood Estimation of Misspecified Models," Econometrica, 50, 1-25.

CONTACT INFORMATION

The %GOFLOGIT macro is available on request from the author. Contact him at:

Oliver Kuss
Institute of Medical Epidemiology, Biometry and Informatics
06097 Halle/Saale, Germany
Phone: +49-345-5573582
Fax: +49-345-5573580
Email: Oliver.Kuss@medizin.uni-halle.de
WWW: http://imebmi.medizin.uni-halle.de/
