New Approaches to in silico Design of Epitope-Based Vaccines

4.2 Improved Kernels for MHC Binding Prediction

For the classification, peptides with IC50 values greater than 500 nM were considered nonbinders, all others binders.

We use three sets of physicochemical descriptors for AAs: (1) five descriptors derived from a principal component analysis of 237 physicochemical properties (pca), (2) three descriptors representing hydrophobicity, size, and electronic properties taken from the AAIndex (zscale), and (3) 20 descriptors corresponding to the respective entries of the BLOSUM50 substitution matrix [88] (blosum50).
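As a concrete illustration of this setup, the following Python sketch binarizes IC50 affinities at the 500 nM threshold and encodes a peptide by concatenating per-residue descriptor vectors. The two-dimensional descriptor table is made up for illustration; it stands in for the actual pca, zscale, or blosum50 values.

```python
# Sketch of the two preprocessing steps described above: binarizing IC50
# affinities at 500 nM and encoding peptides with per-residue descriptors.

THRESHOLD_NM = 500.0

# Hypothetical 2-dimensional descriptors for a few amino acids
# (illustrative numbers only, not real pca/zscale/BLOSUM50 entries).
DESCRIPTORS = {
    "A": [0.1, -0.5],
    "L": [0.9, -0.2],
    "K": [-0.8, 0.7],
    "Y": [0.4, 0.3],
}

def label(ic50_nm):
    """1 = binder (IC50 <= 500 nM), 0 = nonbinder."""
    return 1 if ic50_nm <= THRESHOLD_NM else 0

def encode(peptide):
    """Concatenate per-residue descriptor vectors into one feature vector."""
    vec = []
    for aa in peptide:
        vec.extend(DESCRIPTORS[aa])
    return vec

print(label(120.0))    # prints 1 (binder)
print(encode("ALKY"))  # 8-dimensional vector for a 4-mer
```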

The main goal of the work presented in this section is the methodological improvement of existing string kernels by incorporating prior knowledge of AA properties. To analyze the benefits of the proposed modifications, we conducted performance comparisons between the original and the modified string kernels as well as standard kernels.

Preliminary Performance Analysis

Preliminary classification experiments on three human MHC alleles (HLA-A*23:01, HLA-B*58:01, HLA-A*02:01) were carried out to analyze the performance of the different kernels: WD (3.23), RBF (3.22), poly (3.21), WD-RBF (4.5), and WD-poly (as WD-RBF, but with polynomial-AASK), combined with different encodings (pca, zscale, blosum50). The alleles were chosen to comprise a small data set (HLA-A*23:01, 104 examples) as well as a medium (HLA-B*58:01, 988 examples) and a large (HLA-A*02:01, 3089 examples) data set. The respective cross-validation results are given in Table 4.1. For each of the alleles a different kernel type performs best: poly (pca) for HLA-A*23:01, RBF (blosum50) for HLA-B*58:01, and WD-RBF (blosum50) for HLA-A*02:01. The latter performs second-best on HLA-A*23:01 and HLA-B*58:01. As for the benefits of the modification of the WD kernel, the WD-poly and WD-RBF kernels outperform the WD kernel in 17 out of 18 cases.
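The idea behind the modified WD kernels can be sketched as follows: in a degree-1 weighted-degree kernel, the exact-match indicator between two residues is replaced by an RBF (or polynomial) similarity of their descriptor vectors, so that chemically similar mismatches still contribute to the kernel value. The descriptor numbers and the gamma parameter below are illustrative assumptions, not the settings used in the experiments.

```python
import math

# Toy 2-D descriptors (hypothetical values standing in for pca/zscale/blosum50).
DESC = {"A": [0.1, -0.5], "L": [0.9, -0.2], "K": [-0.8, 0.7], "Y": [0.4, 0.3]}

def rbf(u, v, gamma=0.5):
    """RBF similarity of two descriptor vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * d2)

def wd_like(x, y):
    """Degree-1 WD-style kernel: counts exact per-position matches."""
    return sum(1.0 for a, b in zip(x, y) if a == b)

def wd_rbf_like(x, y, gamma=0.5):
    """Same positional structure, but the hard match delta(a, b) is replaced
    by an RBF similarity of the residues' descriptor vectors."""
    return sum(rbf(DESC[a], DESC[b], gamma) for a, b in zip(x, y))

print(wd_like("ALKY", "ALKA"))      # 3.0: the position-4 mismatch scores 0
print(wd_rbf_like("ALKY", "ALKA"))  # > 3.0: the mismatch gets partial credit
```

Substituting a polynomial similarity for `rbf` gives the analogous WD-poly variant.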

Learning Curve Analysis

From Table 4.1 the trend can be observed that the kernels that use AA properties benefit more on smaller data sets. To validate this hypothesis, we performed learning curve analyses for WD and WD-RBF (blosum50) in a classification and a regression setting on the largest data set, i.e., HLA-A*02:01. Performance is measured by averaging the auROC and the PCC, respectively. To average over different data splits and thereby reduce random fluctuations of the performance, we performed 100 runs of two-times nested five-fold cross-validation. In each run, thirty percent of the available data was used for testing. From the remaining data, training sets of different sizes (20, 31, 50, 80, 128, 204, 324, 516, 822, 1308) were selected randomly. Figure 4.3 shows the mean performances with standard errors. Both for classification and regression, it can clearly be seen that the fewer examples are available for learning, the stronger the improvement of the WD-RBF kernel over the WD kernel. Intuitively this makes sense: the more data is available, the easier it is to infer the relations between the AAs from the sequences in the training data alone.
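The evaluation protocol described above (a held-out test portion per run, random training subsets of increasing size, and averaged auROC) can be sketched on synthetic data as follows. The centroid classifier and the Gaussian toy data are placeholders for the actual kernel machines and peptide data; only the protocol itself is the point.

```python
import random

def auroc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic: the
    fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_centroids(xs, ys):
    """Trivial 1-D classifier: score = distance to the negative centroid
    minus distance to the positive one (stand-in for a kernel SVM)."""
    c1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, ys.count(1))
    c0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, ys.count(0))
    return lambda x: abs(x - c0) - abs(x - c1)

random.seed(0)
# Synthetic stand-in data: positives centred at +1, negatives at -1.
data = [(random.gauss(1, 1), 1) for _ in range(600)] + \
       [(random.gauss(-1, 1), 0) for _ in range(600)]
random.shuffle(data)
test, pool = data[:360], data[360:]   # ~30% held out, as in the text

for size in (20, 80, 324):            # a subset of the sizes used above
    aucs = []
    for _ in range(20):               # repeated random training subsets
        train = random.sample(pool, size)
        clf = fit_centroids([x for x, _ in train], [y for _, y in train])
        aucs.append(auroc([clf(x) for x, _ in test], [y for _, y in test]))
    print(size, round(sum(aucs) / len(aucs), 3))
```

Plotting the mean auROC (with standard errors) against the training-set size yields a learning curve of the kind shown in Figure 4.3.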
