
Chapter 13 Recursively Partitioning Data

Validation

If you grow a tree with enough splits, partitioning can overfit the data. When this happens, the model predicts the fitted data very well, but predicts future observations poorly. Validation is the process of using part of a data set to estimate model parameters, and using the other part to assess the predictive ability of the model.

• The training set is the part that estimates model parameters.
• The validation set is the part that assesses or validates the predictive ability of the model.
• The test set is a final, independent assessment of the model's predictive ability. The test set is available only when using a validation column (see Table 13.1).
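To make these three roles concrete, here is a minimal sketch of a random training/validation/test split. It is written in Python using only the standard library; the function name, proportions, and seed are illustrative assumptions and are not part of JMP's Partition platform.

```python
import random

def split_rows(n_rows, val_portion=0.25, test_portion=0.25, seed=1):
    """Randomly assign row indices to training, validation, and test sets.

    The proportions are placeholders; in JMP, the Validation Portion or a
    validation column controls this assignment.
    """
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    n_test = int(n_rows * test_portion)
    n_val = int(n_rows * val_portion)
    test = rows[:n_test]                      # final, independent assessment
    validation = rows[n_test:n_test + n_val]  # assesses predictive ability
    training = rows[n_test + n_val:]          # estimates model parameters
    return training, validation, test

training, validation, test = split_rows(1000)
print(len(training), len(validation), len(test))  # 500 250 250
```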

The training, validation, and test sets are created by subsetting the original data into parts. Table 13.2 describes several methods for subsetting a data set.

Table 13.2 Validation Methods

Excluded Rows
Uses row states to subset the data. Rows that are unexcluded are used as the training set, and excluded rows are used as the validation set. For more information about using row states and how to exclude rows, see Using JMP.

Holdback
Randomly divides the original data into the training and validation data sets. The Validation Portion (see Table 13.1) on the platform launch window is used to specify the proportion of the original data to use as the validation data set (holdback).

KFold
Divides the original data into K subsets. In turn, each of the K subsets is used to validate the model fit on the rest of the data, fitting a total of K models. The model giving the best validation statistic is chosen as the final model. KFold validation can be used only with the Decision Tree method. To use KFold, select K Fold Crossvalidation from the platform red-triangle menu; see "Platform Options" on page 325. This method is best for small data sets, because it makes efficient use of limited amounts of data.
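To illustrate the KFold description above, the sketch below shows each of the K subsets taking one turn as the validation set while a model is fit on the remaining folds, with the best-scoring model kept at the end. The `fit` and `score` callables and the fold count are hypothetical placeholders, not JMP's Decision Tree implementation.

```python
def kfold_indices(n_rows, k=5):
    """Split row indices 0..n_rows-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)
    return folds

def cross_validate(n_rows, k, fit, score):
    """Fit k models; each fold is held out once as the validation set.

    `fit(training_rows)` and `score(model, validation_rows)` are hypothetical
    callables supplied by the caller; higher scores are assumed to be better.
    """
    folds = kfold_indices(n_rows, k)
    results = []
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for j, fold in enumerate(folds)
                    if j != held_out for i in fold]
        model = fit(training)  # fit on the other k - 1 folds
        results.append((score(model, validation), model))
    # Keep the model with the best validation statistic, as the table describes.
    return max(results, key=lambda r: r[0])
```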
