Peptide-Based Drug Design


152 Hilpert et al.

than control. Given the large amount of data available, a rigorous methodology was feasible. We performed a 10-fold stratified cross-validation in which each of the two sets of peptides was randomly divided into 10 groups, requiring the number of active peptides to be approximately the same in each group. By consecutively leaving one group of 10% out for use in validation and combining the other 9 groups into a single training group, 10 groups were constructed for training the models. ANNs were constructed using simulation software (86) by training on each of the 10 training groups and using the left-out validation group to signal the end of training and prevent overfitting. (Overfitting occurs when there are too few data for the complexity of the model: the model then tends to “memorize” the data and, while it describes the training data very well, it performs poorly when used to predict data outside the training set.)
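The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of the simulation software (86): the function names `stratified_folds` and `train_validation_splits`, the toy labels, and the round-robin assignment are all hypothetical. The ANN training itself is not shown; in each split, the validation indices would be the ones used to signal the end of training.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so that every fold holds
    roughly the same number of samples from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin, so the per-class count differs
        # by at most one between any two folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_validation_splits(labels, k=10, seed=0):
    """Yield (training_indices, validation_indices), one pair per left-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        validation = sorted(folds[held_out])
        training = sorted(
            idx
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for idx in fold
        )
        yield training, validation


# Hypothetical toy data: 1 = more active than control, 0 = not.
labels = [1] * 30 + [0] * 170
splits = list(train_validation_splits(labels, k=10))
```

With these toy labels, each of the 10 validation groups contains 20 peptides, 3 of them active, and the matching training group contains the remaining 180.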

The performance of any model built on data is best measured by assessing the predictions of the model in situations that did not exist in the training data. Given the large amounts of peptide activity data we had available, we chose to assess the model predictions on the set of data that was not involved in any way in the construction of the models: we used the Set A models to predict the activity of Set B peptides, and Set B models to predict the activity of Set A peptides. Rather than attempt to predict IC50, each ANN produced a single number between 0 and 1 to indicate the likelihood that the peptide was more active than Bac2A.

An example of the performance of one selected ANN model built on 90% of Set A peptides for predicting Set B peptide activity is shown in Fig. 3. The positive predictive value (PPV, the fraction of peptides that are active out of all peptides predicted active) is quite low for ANN output threshold values up to 0.9, but the PPV increases dramatically to 1.0 for threshold values greater than 0.9. For threshold values less than 0.9, most peptides predicted to be active are in fact inactive (PPV < 0.5). However, for high threshold values, most peptides predicted to be active are in fact active (PPV > 0.5), but few peptides are predicted to be active (only 2% of peptides at a threshold of 0.999). For such high threshold values, many active peptides are incorrectly classified as inactive. However, this may not be a problem for an in silico screening of large numbers of peptides, where high confidence is needed that peptides predicted to be active are worth the expense of experimental investigation. Such screening programs may not be concerned that many active peptides are incorrectly classified as inactive, only that those predicted active are most likely active.
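The trade-off between PPV and the number of peptides predicted active can be made concrete with a small Python sketch. The scores and activity labels below are invented for illustration and do not reproduce Fig. 3. The function name `threshold_metrics` is hypothetical, and treating a peptide as predicted active when its output is strictly greater than the threshold is an assumption.

```python
def threshold_metrics(scores, is_active, threshold):
    """Return (ppv, recall, fraction_predicted_active) when every peptide
    whose ANN output exceeds `threshold` is predicted active.

    ppv is None when no peptide is predicted active."""
    predicted = [active for score, active in zip(scores, is_active) if score > threshold]
    true_positives = sum(predicted)
    total_active = sum(is_active)

    ppv = true_positives / len(predicted) if predicted else None
    recall = true_positives / total_active if total_active else None
    fraction_predicted = len(predicted) / len(scores)
    return ppv, recall, fraction_predicted


# Hypothetical ANN outputs and measured labels (True = more active than the control).
scores = [0.9995, 0.97, 0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
is_active = [True, True, False, False, True, False, False, False, True, False]

for threshold in (0.5, 0.9, 0.999):
    ppv, recall, fraction = threshold_metrics(scores, is_active, threshold)
    print(f"threshold={threshold}: PPV={ppv:.2f}, recall={recall:.2f}, "
          f"predicted active={fraction:.0%}")
```

In this toy data, raising the threshold increases the PPV while the recall (the fraction of truly active peptides that are predicted active) falls, which mirrors the behaviour described in the text: a strict threshold selects few peptides, but those selected are most likely active.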

This is illustrated by examining the activities of test peptides ranked according to the ANN outputs (values near 1 indicate that a peptide is likely to be more active, and values close to 0 indicate that it is likely not). The results are summarized in Table 4. Here, the proportion of
