05.08.2013 Views

New Approaches to in silico Design of Epitope-Based Vaccines

New Approaches to in silico Design of Epitope-Based Vaccines

New Approaches to in silico Design of Epitope-Based Vaccines

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

3.2. MACHINE LEARNING 27<br />

Figure 3.5: Two-fold nested five-fold cross validation. The tra<strong>in</strong><strong>in</strong>g data D is partitioned<br />

<strong>in</strong><strong>to</strong> five disjo<strong>in</strong>t subsets. In each <strong>of</strong> the outer cross validation rounds i = 1, . . . , 5, model selection<br />

and tra<strong>in</strong><strong>in</strong>g are performed on Ei and test<strong>in</strong>g on Ci. Model selection comprises an <strong>in</strong>ner cross<br />

validation. In each <strong>of</strong> the <strong>in</strong>ner cross validation rounds j = 1, . . . , 5 models for different parameter<br />

comb<strong>in</strong>ations are tra<strong>in</strong>ed on Eij and evaluated on Cij. The parameter comb<strong>in</strong>ation yield<strong>in</strong>g the<br />

best performance <strong>in</strong> the <strong>in</strong>ner cross validation is used for tra<strong>in</strong><strong>in</strong>g on Ei <strong>in</strong> the respective outer<br />

cross validation round. <strong>Based</strong> on [68].<br />

The de fac<strong>to</strong> standard <strong>to</strong> thoroughly assess how a predic<strong>to</strong>r will perform <strong>in</strong> practice, is<br />

k-fold cross validation. k-fold cross validation comprises k rounds <strong>of</strong> model selection and<br />

subsequent model evaluation. The data set D is adequately partitioned <strong>in</strong><strong>to</strong> k partitions <strong>of</strong><br />

equal size: D = D1∪. . .∪Dk with Di∩Dj = ∅ for all i = j. In each round one <strong>of</strong> the subsets<br />

serves as test set Ci, while the rema<strong>in</strong><strong>in</strong>g k − 1 subsets are comb<strong>in</strong>ed <strong>to</strong> form the tra<strong>in</strong><strong>in</strong>g<br />

set Ei. Model selection and tra<strong>in</strong><strong>in</strong>g are performed on the tra<strong>in</strong><strong>in</strong>g set, model evaluation<br />

on the test set. Each <strong>of</strong> the k subsets is used as test set once. The overall performance<br />

<strong>of</strong> the predic<strong>to</strong>r with respect <strong>to</strong> k-fold cross validation corresponds <strong>to</strong> the average over the<br />

performances from the k model evaluations.<br />

In two-times nested k-fold cross validation the model selection phase <strong>of</strong> the k-fold cross<br />

validation comprises a cross validation-based grid search (Figure 3.5): The tra<strong>in</strong><strong>in</strong>g data,<br />

Ei = D\Ci, is partitioned <strong>in</strong><strong>to</strong> n parts. For each parameter comb<strong>in</strong>ation a model is tra<strong>in</strong>ed<br />

on all comb<strong>in</strong>ations <strong>of</strong> n − 1 <strong>in</strong>ner subsets and evaluated on the rema<strong>in</strong><strong>in</strong>g <strong>in</strong>ner subset<br />

Cij. The parameter comb<strong>in</strong>ation with the best average performance over all n <strong>in</strong>ner cross<br />

validation rounds is used <strong>to</strong> tra<strong>in</strong> a model on Ei. The result<strong>in</strong>g model is evaluated on the<br />

outer test set Ci.<br />

S<strong>in</strong>ce on small data sets the <strong>in</strong>itial partition<strong>in</strong>g may affect the result<strong>in</strong>g performance,<br />

the performance assessment is more precise if it is based on several rounds <strong>of</strong> k-fold cross<br />

validation, each employ<strong>in</strong>g a different partition<strong>in</strong>g.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!