11.07.2015 Views

Preface to First Edition - lib



large tree using a trivial stopping criterion as the number of observations in a leaf, say, and then prunes branches that are not necessary.

Once a tree has been grown, a simple summary statistic is computed for each leaf. The mean or median can be used for continuous responses, and for nominal responses the proportions of the classes are commonly used. The prediction of a new observation is simply the corresponding summary statistic of the leaf to which this observation belongs.

However, even the right-sized tree consists of binary splits which are, of course, hard decisions. When the underlying relationship between covariate and response is smooth, such a split point estimate will be affected by high variability. This problem is addressed by so-called ensemble methods. Here, multiple trees are grown on perturbed instances of the data set and their predictions are averaged. The simplest representative of such a procedure is called bagging (Breiman, 1996) and works as follows. We draw B bootstrap samples from the original data set, i.e., we draw n out of n observations with replacement from our n original observations. For each of those bootstrap samples we grow a very large tree. When we are interested in the prediction for a new observation, we pass this observation through all B trees and average their predictions. It has been shown that the goodness of the predictions for future cases can be improved dramatically by this or similar simple procedures. More details can be found in Bühlmann (2004).

9.3 Analysis Using R

9.3.1 Predicting Body Fat Content

The rpart function from the rpart package can be used to grow a regression tree. The response variable and the covariates are defined by a model formula in the same way as for lm, say.
By default, a large initial tree is grown; here we restrict the number of observations required to establish a potential binary split to at least ten:

R> library("rpart")
R> data("bodyfat", package = "mboost")
R> bodyfat_rpart
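
The bagging procedure described earlier in this section (draw B bootstrap samples, grow a large tree on each, average the predictions) can be sketched in a few lines with rpart. This is only an illustration under stated assumptions, not the analysis from the text: the simulated data frame sim, the choice B = 25, and the names trees, newobs, pred_matrix and bagged are all hypothetical. Setting cp = 0 is one way to obtain large, unpruned trees, and minsplit = 10 mirrors the restriction mentioned above.

```r
library("rpart")

set.seed(1)

# Hypothetical simulated data: a smooth relationship plus noise
n <- 200
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- sin(sim$x) + rnorm(n, sd = 0.3)

# Number of bootstrap samples (an arbitrary choice for this sketch)
B <- 25
trees <- vector("list", B)

for (b in seq_len(B)) {
    # Draw n out of n observations with replacement
    idx <- sample(n, n, replace = TRUE)
    # Grow a large tree: no complexity pruning (cp = 0), no cross-validation,
    # and at least ten observations required to attempt a split
    trees[[b]] <- rpart(y ~ x, data = sim[idx, , drop = FALSE],
                        control = rpart.control(minsplit = 10, cp = 0, xval = 0))
}

# Pass two new observations through all B trees ...
newobs <- data.frame(x = c(2.5, 7))
pred_matrix <- sapply(trees, predict, newdata = newobs)

# ... and average the B predictions for each new observation
bagged <- rowMeans(pred_matrix)
bagged
```

Each column of pred_matrix holds the predictions of one tree, so the row means are the bagged predictions. Individual trees give step-function predictions; averaging over bootstrap samples smooths out the variability of the single split points.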
