
Indicator                                        In favor  Against  Reason
small data set                                               x      linear approximation usually adequate
large data set                                      x               nonlinearities and interactions likely
more variables than observations (or close)         x               linear (Gaussian and logistic) regression methods fail
suspected nonlinearities                            x                nonlinearities need not be explicitly modeled
suspected interactions                              x                interactions need not be explicitly specified
ordered categorical x-variables                     x                awkward in parametric regression
correlated data                                              x      potential for overfitting
x-variables consist of indicator variables only              x      nonlinearities cannot arise from indicator variables; interactions still might

Table 4: Some indicators in favor of and against the use of boosting.

I would like to point out a few issues that the reader may run into as he or she starts experimenting with the boosting plugin. The "tree fully fit" error means that more tree splits are required than are possible. This arises, for example, if the data contain only 10 observations but interaction=11 is specified (or if bag=0.5 and interaction=6 is specified): loosely speaking, a tree with k splits has k+1 terminal nodes, so the sample available for fitting each tree (after the training fraction and bagging are applied) must contain at least interaction+1 observations. The error will also tend to arise when the number of iterations (maxiter) and the number of interactions (interaction) are accidentally switched. Reducing the number of interactions always solves this problem. A second issue is missing values, which are not supported in the current version of the boosting plugin. I suggest imputing variables ahead of time, for example using a stratified hotdeck imputation (my implementation of hotdeck, "hotdeckvar", is available from www.schonlau.net, or type "net search hotdeckvar" within Stata). Another option that is popular in the social sciences is to create variables that flag missing data and add them to the list of covariates; short sketches of a typical call to the plugin and of this flagging approach appear at the end of this section. Discarding observations with missing values is a less desirable alternative. I am working on a future release that allows saving the boosting model in Stata.

Acknowledgement

I am most grateful to Gregory Ridgeway, developer of the GBM boosting package in R, for many discussions on boosting, advice on C++ programming, and for comments on earlier versions of this paper. I am equally grateful to Nelson Lim at the
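To make the maxiter/interaction distinction concrete, here is a minimal sketch of a call to the boost plugin. The outcome y and the covariates x1-x5 are hypothetical, and the distribution() and shrink() options are included only for completeness and should be checked against the syntax given earlier in this tutorial; the point is simply that maxiter (the number of boosting iterations) is typically large while interaction (the number of splits per tree) stays small relative to the number of observations.

    * illustrative sketch only: maxiter is large, interaction is small
    boost y x1-x5, distribution(normal) maxiter(1000) interaction(4) ///
        bag(0.5) shrink(0.01)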
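A minimal sketch of the missing-data flagging approach for a single covariate follows; the variable names x1 and x1_flag and the fill-in value 0 are arbitrary choices for illustration, not part of the plugin.

    * flag missing values and fill them with a constant
    gen byte x1_flag = missing(x1)
    replace x1 = 0 if x1_flag == 1
    * x1_flag is then included alongside x1 in the covariate list passed to boost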
