Boosted Regression (Boosting): An introductory tutorial and a Stata ...

Figure 7: Scatter plot of the pseudo R² computed on a test data set versus the number of interactions. (Horizontal axis: number of interactions; vertical axis: R-squared.)

I compare classification rates on the test data. Roughly 49% of the test data are classified as zero (y=0), and 51% as one (y=1). When using a coin flip to classify observations one would have been right about half the time. Using logistic regression, 52.0% of observations in the test data set are classified correctly. This is just barely better than the rate one could have obtained by a coin flip. Because there were a lot of unrelated x-variables, I also tried a backward regression using p>0.15 as the criterion to remove a variable. Using the backward regression, 54.1% of the observations were classified correctly. The boosting model with 8 interactions, shrinkage=0.5 and bag=0.5 classifies 76.0% of the test data observations correctly.

The Stata output displays the pseudo R² values for logistic regression (pseudo R²=0.02) and backward logistic regression (pseudo R²=0.01). Because the training data were used to compute the pseudo R² values, the backward logistic regression necessarily has a lower value. Both values are much lower than the value obtained by boosted logistic regression (test R²=0.27).

Because there are only two response values (0 and 1), I use a different plot for calibration than the scatter plot shown in Figure 5. If the predicted values are accurate, one would expect that the predicted values are roughly the same as the fraction of response values classified as "1" that give rise to a given predicted value. The fraction of response values classified as "1" can be estimated by averaging or smoothing over response values with similar predictions. In Stata I use a lowess smoother to compare the predictions from the boosted logistic regression and the linear logistic regression:

    twoway (lowess y logit_pred, bwidth(0.2)) (lowess y boost_pred, bwidth(0.2)) (lfit straight y), xtitle("Actual Values") legend(label(1 "Logistic Regression") label(2 "Boosting") label(3 "Fitted Values=Actual Values")) xsize(4) ysize(4)

Calibration plots for the test data are shown in Figure 8. The near horizontal line for logistic regression in the test calibration plot implies that logistic regression classifies 50% of the observations correctly regardless of the actual predicted value. The logistic regression model does not generalize well.
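The idea of estimating the fraction of "1" responses by averaging over observations with similar predictions can be illustrated with a small sketch. It is written in Python rather than Stata, and it uses equal-width bins instead of the lowess smoother used in the tutorial. The function name calibration_table, the number of bins, and the simulated data are assumptions made for illustration only; none of them come from the tutorial or its data set.

```python
import random


def calibration_table(y, pred, n_bins=5):
    """Bin observations by predicted probability (equal-width bins on [0, 1]).

    Returns one (mean prediction, observed fraction of ones, count) tuple
    per non-empty bin, ordered from the lowest bin to the highest.
    """
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, pred):
        idx = min(int(pi * n_bins), n_bins - 1)  # a prediction of 1.0 goes in the last bin
        bins[idx].append((yi, pi))
    table = []
    for members in bins:
        if members:
            n = len(members)
            mean_pred = sum(p for _, p in members) / n
            frac_ones = sum(v for v, _ in members) / n
            table.append((mean_pred, frac_ones, n))
    return table


# Simulated data (hypothetical, not the tutorial's data):
# each response is 1 with its own true probability p.
random.seed(0)
p_true = [random.random() for _ in range(20000)]
y = [1 if random.random() < p else 0 for p in p_true]

# Predictions equal to the true probabilities: well calibrated.
calibrated = calibration_table(y, p_true)

# Predictions unrelated to the response: the observed fraction stays flat.
noise_pred = [random.random() for _ in y]
uninformative = calibration_table(y, noise_pred)

for name, table in (("calibrated", calibrated), ("uninformative", uninformative)):
    print(name)
    for mean_pred, frac_ones, n in table:
        print(f"  mean prediction {mean_pred:.2f}  fraction of ones {frac_ones:.2f}  n={n}")
```

For the calibrated predictions, the fraction of ones in each bin tracks the mean prediction. For the uninformative predictions it stays near the overall fraction of ones in every bin. That is the binned analogue of a near horizontal line in a calibration plot.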
