presentation slides - INFORMS NY

Note on Missing Values.

1) Missingness is NOT in Y (see Wang and Sheng, 2007, JMLR, for a semi-supervised method for missing Y).
2) Different methods of imputation:
   1) C4.5 (probabilistic split): observations with a missing value on the split variable are sent to every child node, weighted by the proportion of non-missing observations in each child.
   2) Complete case: eliminate all observations with missing values, then train.
   3) Grand mode/mean: impute the mode if categorical, the mean if continuous.
   4) Separate class: appropriate for categorical variables. For continuous variables, impute an extremely large value so that missings are separated from non-missings.
   5) Complete variable case: delete all variables with missing values.
   6) Surrogate (CART default): use surrogate variable(s) whenever the split variable is missing. At testing or scoring, if the variable is missing, the surrogate(s) are used.
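The simpler options in the list above (complete case, complete variable case, grand mode/mean, separate class) can be illustrated with a minimal sketch. It assumes a toy data set stored as a list of dicts with `None` marking a missing value. The rows, the function names, and the sentinel values `big` and `label` are hypothetical and chosen only for illustration. The C4.5 and surrogate approaches live inside the tree algorithm itself and are not shown.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data; None marks a missing value.
rows = [
    {"age": 34.0, "region": "east", "visits": 3},
    {"age": None, "region": "west", "visits": 1},
    {"age": 51.0, "region": None, "visits": 7},
    {"age": 29.0, "region": "east", "visits": 2},
]

def observed(rows, var):
    """Non-missing values of one variable."""
    return [r[var] for r in rows if r[var] is not None]

def is_numeric(rows, var):
    """Treat a variable as continuous if all observed values are numbers."""
    return all(isinstance(v, (int, float)) for v in observed(rows, var))

def complete_case(rows):
    """Complete case: keep only observations with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def complete_variable(rows):
    """Complete variable case: keep only variables that are never missing."""
    keep = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in keep} for r in rows]

def grand_mean_mode(rows):
    """Grand mean for continuous variables, grand mode for categorical ones."""
    fill = {}
    for k in rows[0]:
        vals = observed(rows, k)
        fill[k] = mean(vals) if is_numeric(rows, k) else Counter(vals).most_common(1)[0][0]
    return [{k: fill[k] if v is None else v for k, v in r.items()} for r in rows]

def separate_class(rows, big=1e9, label="MISSING"):
    """Separate class: new category for categoricals, extreme value for continuous."""
    fill = {k: big if is_numeric(rows, k) else label for k in rows[0]}
    return [{k: fill[k] if v is None else v for k, v in r.items()} for r in rows]

print(len(complete_case(rows)), "complete observations")
print(complete_variable(rows)[0])
print(grand_mean_mode(rows)[1], grand_mean_mode(rows)[2])
print(separate_class(rows)[1], separate_class(rows)[2])
```

Each function returns a new list and leaves `rows` untouched, so the methods can be compared side by side on the same data.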

Tree Derivative: Random Forests.
(Breiman, 1999)

Random Forests proceed in the following steps. Note that there is no need to create separate training, validation, and test data sets:

1. Take a random sample of N observations with replacement (“bagging”) from the data set. On average, this selects about 2/3 of the rows. The remaining 1/3 are called “out of bag (OOB)” observations. A new random selection is performed for each tree constructed.
2. Using the observations selected in step 1, construct a decision tree to its maximum size, without pruning. As the tree is built, allow only a subset of the total set of predictor variables to be considered as possible splitters for each node. Select the set of predictors to be considered as a random subset of the total set of available predictors. For example, if there are ten predictors, choose five of them randomly as candidate splitters. Perform a new random selection for each split. Some predictors (possibly the best one) will not be considered for each split, but a predictor excluded from one split may be used for another split in the same tree.
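The two sources of randomness in steps 1 and 2 can be sketched as follows. This is a minimal illustration using only the Python standard library, not a Random Forest implementation: no tree is grown, and the function names, the seed, the row count, and the predictor names `x1` to `x10` are assumptions made for the example. With large N the share of distinct rows drawn approaches 1 - 1/e (about 63%), which is the "about 2/3" quoted above.

```python
import random

def bootstrap_split(n_rows, rng):
    """Step 1: draw n_rows row indices with replacement.

    Returns the in-bag draws (with repeats) and the out-of-bag (OOB)
    indices, i.e. the rows that were never drawn.
    """
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    oob = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, oob

def candidate_splitters(predictors, m, rng):
    """Step 2: choose m predictors at random as candidates for ONE split."""
    return rng.sample(predictors, m)

rng = random.Random(0)  # fixed seed only so the run is repeatable
n_rows = 10_000         # hypothetical data set size

# A fresh bootstrap sample would be drawn for every tree in the forest.
in_bag, oob = bootstrap_split(n_rows, rng)
oob_fraction = len(oob) / n_rows
print(f"OOB fraction: {oob_fraction:.3f}")

# Ten predictors, five candidates per split, redrawn at each split.
predictors = [f"x{i}" for i in range(1, 11)]
for split in range(3):
    print("split", split, candidate_splitters(predictors, 5, rng))
```

Because the candidate set is redrawn at every split, a predictor left out at one node can still appear at another node of the same tree, as the second step describes.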

