An Introduction to Recursive Partitioning Using the ... - Mayo Clinic
1=Improved, 2=No change, 3=Worse
X3 = initial response to drugs
1=Improved, 2=No change, 3=Worse
The other 11 variables did not appear in the final model. This procedure seems
to work especially well for variables such as X1, where there is a definite ordering,
but spacings are not necessarily equal.
The tree is built by the following process: first the single variable is found which
best splits the data into two groups (`best' will be defined later). The data is
separated, and then this process is applied separately to each sub-group, and so on
recursively until the subgroups either reach a minimum size (5 for this data) or until
no improvement can be made.
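The growing step above can be sketched in a few lines of Python. This is a minimal illustration, not rpart's implementation: it assumes numeric predictors and uses Gini impurity as the split criterion (the report's notion of `best' is defined later), and all function names here are made up for the example.

```python
from collections import Counter

def gini(ys):
    """Gini impurity of a list of class labels."""
    n = len(ys)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(ys).values())

def best_split(xs, ys):
    """Best threshold on one variable; returns (improvement, threshold) or None."""
    parent, n, best = gini(ys), len(ys), None
    for t in sorted(set(xs))[:-1]:        # the largest value would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or gain > best[0]:
            best = (gain, t)
    return best

def grow(rows, ys, min_size=5):
    """Recursively partition until groups are small or no split improves impurity."""
    # Find the single variable (and threshold) that best splits this group.
    best = None                           # (improvement, variable index, threshold)
    for f in range(len(rows[0])):
        s = best_split([r[f] for r in rows], ys)
        if s and (best is None or s[0] > best[0]):
            best = (s[0], f, s[1])
    if len(ys) <= min_size or best is None or best[0] <= 0:
        return {"class": Counter(ys).most_common(1)[0][0]}   # terminal group
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return {"var": f, "threshold": t,
            "left":  grow([rows[i] for i in li], [ys[i] for i in li], min_size),
            "right": grow([rows[i] for i in ri], [ys[i] for i in ri], min_size)}
```

On a toy sample whose two classes separate cleanly at one value, `grow` finds that single split and stops, since neither child can be improved further.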
The resultant model is, with certainty, too complex, and the question arises, as it
does with all stepwise procedures, of when to stop. The second stage of the procedure
consists of using cross-validation to trim back the full tree. In the medical example
above the full tree had ten terminal regions. A cross-validated estimate of risk was
computed for a nested set of subtrees; the final model, presented in figure 1, is the
subtree with the lowest estimate of risk.
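The selection step can be sketched as follows. Two simplifications relative to the report are worth flagging: the risk estimate here is misclassification on a single held-out set (standing in for a full cross-validated estimate), and the nested family of subtrees is produced by truncating the full tree at successive depths (standing in for rpart's cost-complexity sequence). The dict encoding of a tree, with each node carrying its majority class, is hypothetical.

```python
def predict(tree, x):
    """Follow splits of the form x <= threshold down to a terminal class."""
    while "threshold" in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["class"]

def risk(tree, xs, ys):
    """Misclassification rate of the tree on the data (xs, ys)."""
    return sum(predict(tree, x) != y for x, y in zip(xs, ys)) / len(ys)

def truncate(tree, depth):
    """Copy of `tree` cut off at `depth`; varying depth gives a nested family."""
    if "threshold" not in tree or depth == 0:
        return {"class": tree["class"]}        # collapse to a terminal node
    return {"threshold": tree["threshold"], "class": tree["class"],
            "left": truncate(tree["left"], depth - 1),
            "right": truncate(tree["right"], depth - 1)}

def prune(full_tree, xs_val, ys_val, max_depth=10):
    """Among the nested subtrees, return the one with the lowest estimated risk."""
    candidates = [truncate(full_tree, d) for d in range(max_depth + 1)]
    return min(candidates, key=lambda t: risk(t, xs_val, ys_val))
```

Given a full tree whose deepest split fits noise, `prune` keeps the smaller subtree because its held-out risk is lower, which is the essence of the trimming stage described above.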
2 Notation<br />
The partitioning method can be applied to many different kinds of data. We will
start by looking at the classification problem, which is one of the more instructive
cases (but also has the most complex equations). The sample population consists
of n observations from C classes. A given model will break these observations into
k terminal groups; to each of these groups is assigned a predicted class (this will be
the response variable). In an actual application, most parameters will be estimated
from the data; such estimates are given by formulae.
π_i      i = 1, 2, ..., C       Prior probabilities of each class.

L(i, j)  i, j = 1, 2, ..., C    Loss matrix for incorrectly classifying
                                an i as a j. L(i, i) ≡ 0.

A        Some node of the tree.
                                Note that A represents both a set of individuals in
                                the sample data, and, via the tree that produced it,
                                a classification rule for future data.
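To make the role of the loss matrix concrete, here is a small sketch of one standard way a predicted class could be assigned to a terminal group: choose the class j that minimizes the expected loss within the node. The within-node class probabilities are assumed given here (in practice they would be estimated from the data using the priors π_i; the estimates rpart actually uses appear later in the report), and the function name is illustrative.

```python
def predicted_class(posterior, loss):
    """Class j minimizing the expected loss sum_i posterior[i] * L(i, j).

    posterior[i] -- probability that a member of the node has true class i
                    (in practice built from the priors and the data)
    loss[i][j]   -- loss for classifying a true i as a j, with loss[i][i] == 0
    """
    C = len(posterior)
    expected = lambda j: sum(posterior[i] * loss[i][j] for i in range(C))
    return min(range(C), key=expected)
```

With a symmetric 0-1 loss this reduces to the majority class; an asymmetric loss can overturn that. For instance, with within-node probabilities (0.6, 0.4), a 0-1 loss predicts the first class, but making misclassification of the second class five times as costly shifts the prediction to the second, less probable, class.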