Target Specific Compound Identification using Support Vector Machine.
Dariusz Plewczynski 1,2*, Marcin von Grotthuss 1, Stephane Spieser 3, Leszek Rychlewski 1, Lucjan S. Wyrwicz 1,
Uwe Koch 3
1) BioInfoBank Institute, Limanowskiego 24A/16, 60-744 Poznan, Poland, Tel: +48-61-8653520, Fax:
+48-61-8643350, E-mail: darman@bioinfo.pl
2) Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw,
Warsaw, Poland
3) Istituto di Ricerche di Biologia Molecolare (IRBM) “P. Angeletti”, Merck & Co. Inc., Pomezia, Italy
* Correspondence should be addressed to Dariusz Plewczynski (darman@bioinfo.pl) and Uwe Koch
(uwe_koch@merck.com).
RUNNING TITLE: Target Specific Compound Identification by Support Vector Machine.
KEYWORDS:
1) Compound Identification,
2) Protein target specificity,
3) MDL Drug Data Report,
4) Machine-learning methods,
5) Atom Pairs,
6) Support Vector Machine.
ABBREVIATIONS:
1. SVM – support vector machine,
2. AP – atom pairs,
3. MDDR – MDL Drug Data Report.
ABSTRACT
In many cases some information about active molecules is already available at the beginning of an
HTS campaign. Often active compounds (such as substrate analogues, natural products, inhibitors of a related protein
or ligands published by a pharmaceutical company) have been identified in low-throughput validation studies of
the biochemical target. We evaluate how far a support vector machine can be trained on those
compounds and used to classify a collection with unknown activity. This approach aims at reducing the
number of compounds to be tested against the given target. Our method predicts the biological activity of chemical
compounds based only on Atom Pairs (AP) two-dimensional topological descriptors. The supervised
Support Vector Machine (SVM) method is trained here on compounds from the MDL Drug Data Report
(MDDR) known to be active for a specific protein target. For detailed analysis we have selected five different
biological targets: cyclooxygenase-2, dihydrofolate reductase, thrombin, HIV reverse transcriptase and
antagonists of the estrogen receptor. The accuracy of compound identification is estimated here using the
recall and precision values. The sensitivities for all types of targets are over 80% and the classification
performance reaches 100% for selected targets. The second application of our method addresses the case in which
no initial set of actives is known for a selected protein target at the beginning of an HTS campaign. In that case
virtual high-throughput screening (vHTS) is applied, in most cases by a flexible docking procedure. A vHTS
experiment typically contains a large percentage of false positives that have to be verified by costly and
time-consuming experimental follow-up assays. The subsequent use of our machine learning method improves the
speed (docking does not have to be performed on all compounds of the database) and also the accuracy of
HTS hit lists (the enrichment factor).
INTRODUCTION
Genomic research provides an ever increasing number of potential drug targets. In the past, large compound
collections were tested against a single target. Recently it has become common practice in applied research to screen
large collections of compounds for potential activity in in vitro high-throughput screening (HTS) model studies
in order to identify new lead compounds. However, with a larger number of drug targets the question is often raised of how
to preselect chemical versus biological space more efficiently. The aim of the present study is to present a fast
and reliable in silico method that captures the essential features of inhibitor molecules.
High-throughput screening (HTS) allows for the testing of millions of compounds for activity against the
chosen target. As a result, a set of lead molecules with relatively high activity against the target is identified.
The number of these compounds can be relatively high, and therefore they are usually subjected
to further prioritization based on the assessment of various molecular characteristics. Although highly
successful, this approach cannot be applied simultaneously to the large number of drug targets emerging from
genomic research. One solution is to reduce the number of compounds to be tested to those with a high
probability of activity. There are many ways in which computational methods can contribute to this
process. We focus here on the application of the support vector machine (SVM), a supervised machine
learning approach. We evaluated our method in terms of its capability to recognize known ligands for five
diverse protein targets of high medicinal relevance, which have already been investigated in several
drug discovery programs.
Chemists have gathered expertise on the features of molecular structures that are important for inhibition of
specific targets. Thus, even the 2D structure of a ligand allows for some estimate of its activity against a given protein
target. We tried to describe this empirical knowledge about inhibitors in terms of a computational prediction
model. By applying the support vector machine algorithm we classify compounds that are active against a given
biological target according to the commercially available MDL Drug Data Report [1]. In the past various
machine-learning approaches have been used for a number of compound-based classification problems. For
example, neural networks have been used as drug-likeness filters to distinguish drugs from non-drugs, and to
classify compounds based on their ADME properties, toxicity and target specificity. In many of these
applications the use of parameters describing the compound’s topology gave satisfactory results. That is why
we have used AP two-dimensional ligand descriptors to represent the variety of chemical space.
Target identification is a critical step following the discovery of small molecules. Nidhi et al. [2]
provided an in silico method for predicting potential targets for compounds on the basis of chemical structure
alone. They used a multiple-category Laplacian-modified naive Bayesian model trained on extended-connectivity
fingerprints of compounds from 964 target classes in the WOMBAT (World Of Molecular
BioAcTivity) chemogenomics database. The algorithm was then tested by finding the three most likely
protein targets for all MDDR (MDL Drug Data Report) database compounds [1]. On average, the correct
target was found 77% of the time for compounds from 10 MDDR activity classes with known targets [2]. The
support vector machine was recently used to describe high-throughput screening (HTS) data with great success
[3]. With carefully selected parameters, SVM models increased the hit rates significantly, and 50% of the active
compounds could be recovered by screening just 7% of the test set. The authors found that the size of the
training set played a significant role in the performance of the models, i.e. a training set with 10,000 member
compounds is likely the minimum size required to build a model with reasonable predictive power [3]. In other
work, using an in-house data set of small-molecule structures encoded by Ghose-Crippen parameters,
several machine learning techniques were applied to distinguish between kinase inhibitors and other molecules
with no reported activity on any protein kinase [4]. The authors compared four approaches: support vector machines
(SVM), artificial neural networks (ANN), k-nearest-neighbor classification with GA-optimized feature selection
(GA/kNN), and recursive partitioning (RP). Support vector machines, followed by the GA/kNN combination,
outperformed the other techniques when comparing the average of individual models. An approach similar to ours
is presented in [5]. The authors performed virtual screening using very simple features, employing the
number of atoms per element as molecular descriptors without regard to any structural information
whatsoever. These atom counts are able to outperform virtual-affinity-based fingerprints and Unity fingerprints
in some activity classes. This fact can partly be explained by highly nonlinear structure-activity relationships,
which represent a severe limitation of the "similar property principle" in the context of bioactivity [5].
In our previous paper [6] we answered the following question: “How well do different classification methods
perform in selecting the ligands of a protein target out of large compound collections not used to train the
model?”. In that work support vector machines, random forest, artificial neural networks, k-nearest-neighbor
classification with genetic-algorithm-optimized feature selection, trend vectors, naive Bayesian classification,
and decision trees were used to divide databases into molecules predicted to be active and those predicted to be
inactive. Training and predicted activities were treated as binary. We reported significant differences in the
performance of the methods independent of the biological target and compound class. Different methods can
have different applications; some provide particularly high enrichment, others are strong in retrieving the
maximum number of actives. We also showed that these methods do surprisingly well in predicting recently
published ligands of a target on the basis of initial leads, and that combining the results of different
methods can in certain cases improve results compared to the most consistent method. In the present paper we
focus our attention only on our novel SVM-based method and provide a more in-depth description of the
methodology and results. We believe that this paper can help readers to use our protocol for solving
similar HTS problems. We therefore decided not to include a comparison of all currently known approaches,
which we consider outside the scope of this manuscript. Our results provide higher sensitivity and selectivity
compared to other recently published methods (such as the combination of SVM with naïve Bayesian trained on
Ghose-Crippen parameters, and others [2-5]).
The list of hits generated by virtual high-throughput screening (vHTS) typically contains a large percentage of
false positives, making experimental follow-up assays necessary to distinguish active from inactive substances.
Here we present another application of the SVM-based method, aimed at improving the accuracy of
HTS hit lists by the subsequent use of a machine learning method. Virtual screening is often
performed on large chemical libraries, so selecting hits by statistical algorithms instead of a time-consuming
docking procedure is of great importance [7]. We address this problem by a case study on five protein targets:
HIV reverse transcriptase, COX2, dihydrofolate reductase, estrogen receptor and thrombin, in conjunction with
the MDL Drug Data Report database [1]. The virtual HTS was performed with a set of different flexible docking
and scoring methods. Our results show that the support vector machine algorithm is able to speed up the vHTS
procedure by limiting the set of ligands to be docked on a target, and that it also provides better enrichment of the
HTS hit rate.
COMPUTATIONAL METHODS
Data sets and AP descriptors
Five diverse protein targets were tested: human cyclooxygenase-2 [8], dihydrofolate reductase [9, 10], thrombin
[11, 12], antiestrogen [13] and HIV reverse transcriptase [14]. The datasets used for training and testing
comprise both active and inactive compounds from a subset of the MDL Drug Data Report [1]. All
compounds are clinically tested or already launched on the market. Inhibitors and non-active compounds for
those targets were used for training the supervised machine-learning algorithm. For additional tests we also
selected compounds that are currently being biologically tested for inhibition of the targets and are not yet on the market
or in clinical tests.
The entire pool of compounds for the cyclooxygenase-2 target contains 112 inhibitors and 10452 inactive
compounds, divided randomly into two subsets: training (75 inhibitors, 2106 inactive ones) and testing
(37 and 8346, respectively). For dihydrofolate reductase we have 28 inhibitors (17 for training and
11 for testing) and 10529 inactive ones (2149 and 8380). For reverse transcriptase we selected 114
inhibitors (79 and 35) and 10450 inactive compounds (2130 and 8320), and for thrombin we have 112
inhibitors (77 and 35) and 10459 inactive ones (2036 and 8423). For the last target, antiestrogen, we collected 34
inhibitors (22 and 12) and 11580 inactive compounds (2528 for training and 9052 for testing). For the additional
test we selected molecules biologically tested for inhibition of cyclooxygenase-2 (792 molecules),
dihydrofolate reductase (154), thrombin (1066), reverse transcriptase (597) and antiestrogen (256).
We have utilized the regular atom pair (AP) descriptors [15] due to their proven success in classifying
compounds, ease of use and interpretability. To encode the molecular structures we employed the MIX tools
script [16], which for each pair of atoms counts the number of covalent bonds that join them. For each
compound it yields a binary vector with 1 for every atom pair type present in the molecule and 0 for those that are
absent. In addition we also tested a larger set of 6 additional descriptors, such as: TT (regular
topological torsion), DP (pairs using SQ types), DT (torsions using SQ types), DRUGBITS (substructures) and the
ROF6 set of descriptors [16]. Including those additional descriptors in the training did not yield
any significant improvement of the results.
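The binary atom pair encoding described above can be illustrated with a minimal Python sketch. It is not the MIX tools script [16]: the function names, the use of plain element symbols as atom types (real AP descriptors use richer atom typing) and the toy molecule are assumptions made only for illustration.

```python
from collections import deque
from itertools import combinations

def bond_distances(n_atoms, bonds):
    """Shortest path length (number of bonds) between all connected atoms, via BFS."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    distances = {}
    for start in range(n_atoms):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen[neighbour] = seen[current] + 1
                    queue.append(neighbour)
        for target, d in seen.items():
            distances[(start, target)] = d
    return distances

def atom_pair_types(atom_types, bonds):
    """Set of (type, bond distance, type) triples present in one molecule."""
    distances = bond_distances(len(atom_types), bonds)
    pairs = set()
    for i, j in combinations(range(len(atom_types)), 2):
        if (i, j) in distances:  # skip atoms in disconnected fragments
            first, second = sorted((atom_types[i], atom_types[j]))
            pairs.add((first, distances[(i, j)], second))
    return pairs

def binary_vector(pairs, vocabulary):
    """1 where the pair type occurs in the molecule, 0 otherwise."""
    return [1 if pair in pairs else 0 for pair in vocabulary]

# Toy example (assumption): heavy atoms of ethanol, C-C-O.
ethanol = atom_pair_types(["C", "C", "O"], [(0, 1), (1, 2)])
# The vocabulary would normally be collected over the whole training set.
vocabulary = sorted(ethanol | {("C", 3, "O")})
vector = binary_vector(ethanol, vocabulary)
print(vector)
```

In this sketch the vocabulary fixes the order of the vector positions, so every compound is projected into the same multidimensional space.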
Classification and model validation
We trained the support vector machine algorithm separately for each target. First, we created the dataset of
compounds with experimentally verified activity (positive instances). Then we built the dataset of inactive
compounds (negative instances). The negative instances were chosen randomly from launched or preclinical
compounds that have no experimentally verified activity for the selected target. These two datasets
(positive and negative instances) are then projected as sets of points into a multidimensional space using the AP
two-dimensional descriptors described in the previous subsection.
It is well known that some machine-learning methods have difficulties handling unbalanced training sets, i.e.
when the number of positive instances is substantially different from the number of negatives. Therefore for
each target we randomly selected 1/3 of the negatives for training and 2/3 for testing, but 2/3 of the positive
instances for training and 1/3 for testing. The selections were repeated a few times for all targets, and no
significant differences were observed between the various selections. We present here the average results for the
training and testing phases of the experiments. Models are derived for each training set independently and used for
the prediction of activity of the compounds in the test set.
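The asymmetric random split described above can be sketched in a few lines of Python. The helper name `split_instances`, the fixed random seed and the compound identifiers are hypothetical. The count of 112 positives is the cyclooxygenase-2 inhibitor count quoted in the text, while the 9000 negatives are an arbitrary illustrative number.

```python
import random

def split_instances(items, train_fraction, rng):
    """Shuffle a copy of the items and cut it into (training, testing) parts."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable

positives = [f"active_{i}" for i in range(112)]     # e.g. COX2 inhibitors
negatives = [f"inactive_{i}" for i in range(9000)]  # illustrative size (assumption)

# 2/3 of the positives but only 1/3 of the negatives go into training.
pos_train, pos_test = split_instances(positives, 2 / 3, rng)
neg_train, neg_test = split_instances(negatives, 1 / 3, rng)

print(len(pos_train), len(pos_test), len(neg_train), len(neg_test))
```

Repeating the call with different seeds corresponds to the repeated selections mentioned in the text, and the results can then be averaged over the repeats.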
There are many ways to present the performance of an SVM classifier. We use here the classification error E,
precision P and recall R values, together with confusion tables. Their definitions are given below:

$$
\begin{aligned}
E &= \frac{fp + fn}{tp + fp + tn + fn} \cdot 100\%, \\
R &= \frac{tp}{tp + fn} \cdot 100\%, \qquad \text{[Eq. 1]} \\
P &= \frac{tp}{tp + fp} \cdot 100\%,
\end{aligned}
$$

where tp is the number of true positives, fp is the number of false positives, tn is the number of true negatives
and fn is the number of false negatives. The classification error E provides an overall error measure, whereas
recall R gives the percentage of observed positives that are correctly predicted (the probability that an active
compound is recognized), and precision P gives the percentage of predicted positives that are truly positive
(a measure of the reliability of positive predictions).
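As a worked example of Eq. 1, the following Python sketch computes E, R and P from the four cells of a confusion table. The function name is hypothetical. The input numbers are the cyclooxygenase-2 testing dataset from Table I (8308 and 38 for observed inactives, 10 and 27 for observed actives).

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (E, R, P) in percent, following the definitions of Eq. 1."""
    total = tp + fp + tn + fn
    error = 100.0 * (fp + fn) / total
    # Guard against empty denominators when no positives are observed or predicted.
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    return error, recall, precision

# COX2 testing dataset, Table I.
error, recall, precision = classification_metrics(tp=27, fp=38, tn=8308, fn=10)
print(f"E = {error:.2f}%, R = {recall:.0f}%, P = {precision:.0f}%")
```

The rounded recall and precision (73% and 42%) agree with the values listed for this target in Table I. The very small error E shows why recall and precision are more informative than the overall error when inactive compounds dominate the dataset.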
The Support Vector Machine (SVM) Method.
SVM is an effective statistical learning method [17-19] with good performance, yet it is easier to implement
than neural networks. It has been successfully applied to various problems including text classification [20], image
recognition tasks [21], bioinformatics [22, 23] and medical applications [24, 25]. The SVM approach has
also been used in the analysis of gene expression data [26], the classification of microarray data [27], the inference of gene
functional classification [28-30] and protein analysis [31-33].
Most of those tasks have the property of sparse instance vectors. The SVM approach is able to
construct predictive models with large generalization power even when the dimensionality of the
data is large and the number of observations available for training is low. SVM training seeks a globally optimal
solution and margin maximization limits over-fitting, so a large number of features (as in our binary representation of ligand
topology) is acceptable. The SVMlight implementation by Thorsten Joachims [34], which is used in the field of
bioinformatics [35], exploits the sparseness of instance vectors to obtain a compact and
efficient representation.
The output of the training phase is a classification function, i.e. a model. It consists of the set of D
support vectors T_i and the coefficients α_i, which are nonzero, positive real numbers. Those constants are obtained from the
optimization procedure (a quadratic programming, QP, problem) used to find the maximal margin hyperplane. The
number of free parameters of the QP problem is equal to the number of instances in the training dataset. The
non-zero parameter α_i describes the strength of the i-th support vector in the decision function.
SVM chooses as support vectors those points that lie closest to the separating hyperplane. The kernel function
defines the feature space reached by a nonlinear mapping function Ω from the embedding space. The mapping
function Ω need not be explicitly defined, because the kernel function uses only its inner product. The
kernel function is a positive definite function reflecting the similarity between an input sample and the set of
support vectors T_i. In most cases three types of kernels are used: linear, polynomial or radial basis function.
The reliability of the classification of a ligand [36] as an active one is given by the decision function:

$$f(T_x) = \sum_{i=1}^{D} l_i \, \alpha_i \, K\big(\Omega[T_x], \Omega[T_i]\big) \qquad \text{[Eq. 2]}$$

where K is the kernel function that defines the feature space, Ω is a nonlinear mapping function
from the embedding space T into the feature space, and l_i are the class labels of the support vectors, known a priori. We
use l_i = +1 for positive cases and l_i = −1 for negative ones.
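To make the decision function concrete, here is a small Python sketch that evaluates the weighted kernel sum over support vectors for a query compound. The support vectors, labels and α values are invented for illustration and do not come from a trained model. The kernel parameter `gamma` is likewise an assumption. As in the formula given in the text, no bias term is included, although many SVM implementations add one.

```python
import math

def linear_kernel(x, y):
    """Inner product of two descriptor vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, labels, alphas, kernel):
    """Sum over support vectors of l_i * alpha_i * K(x, T_i)."""
    return sum(
        label * alpha * kernel(x, vector)
        for vector, label, alpha in zip(support_vectors, labels, alphas)
    )

# Hypothetical model: two support vectors over four binary AP features.
support_vectors = [[1, 1, 0, 0], [0, 0, 1, 1]]
labels = [+1, -1]    # +1 = active, -1 = inactive
alphas = [0.8, 0.5]  # invented strengths of the support vectors

query = [1, 1, 1, 0]
score_linear = decision_value(query, support_vectors, labels, alphas, linear_kernel)
score_rbf = decision_value(query, support_vectors, labels, alphas, rbf_kernel)

# A positive score classifies the query as active in this sketch.
print(score_linear > 0, score_rbf > 0)
```

Only the kernel values between the query and the support vectors are needed, which is why the mapping Ω never has to be computed explicitly.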
RESULTS AND DISCUSSION
The major goal of this study was to test to what extent the supervised support vector machine method is capable
of learning and predicting the target-specific inhibition likelihood of chemical compounds based on the MDL Drug
Data Report [1]. For each of the 5 protein targets we prepared 2 datasets, one for training and one for testing. Each
includes compounds which are known to inhibit the protein target (2/3 of all available positives for training, 1/3
for testing) and those known not to be active for the selected target (1/3 of all available negatives for training,
2/3 for testing). The support vector machine algorithm was trained on the first dataset and tested on the second one.
In the following paragraphs we discuss the classification results obtained for each of the biological targets and
present the performance using confusion tables for the training and testing datasets. In addition we provide the
overall classification error and the precision/recall values. In general the SVM models yield a successful
classification of compounds for all targets (see Table I for confusion tables for the training and testing datasets). The method
provides robust and reliable models for all types of protein targets. The SVM algorithm turns out to be a method
of choice for practical purposes: it is very fast, efficient and robust.
We performed two additional in silico experiments. The first one uses for training all active compounds in the MDL Drug
Data Report that are annotated as preclinical or launched. The testing is done on potential inhibitors
that are annotated as biologically tested. The second experiment trains the SVM method on the oldest 1/3 of the
compounds known to be inhibitors, and tests its accuracy on the rest, i.e. the newest 2/3 of the developed and patented
compounds. The performance of the machine-learning models in both experiments for each protein target is
described by the recall R and the precision P. The recall R gives the percentage of observed positives that are correctly
predicted, whereas the precision P gives the percentage of predicted positives that are truly positive. These measures of
accuracy are calculated separately for each protein target and presented in Table II (the first
experiment) and Table III (the second experiment). The typical recall value is around 60%, and the precision P
is close to 100% for all targets (the first experiment). The results for the second experiment are slightly worse,
which is caused by the discovery of novel drug classes that are present in the newest 2/3 of the compounds.
In Table IV we present results of enrichment studies of virtual high-throughput screening. We trained the SVM
algorithm on the top 10% of the best-scoring ligands from the docking and scoring experiments for each
protein target. The subset of MDL Drug Data Report compounds, including both active and non-active compounds
for those targets, was docked on the protein targets using various docking methods (FLOG [37], GLIDE [38, 39],
FRED [40], Dock [41], Autodock [42, 43] and ICM [44]), followed in some cases by scoring (SS,
ChemScore or the internal ICM score). The 10% of best docked compounds were then used for training the supervised
machine-learning algorithms. The classification models were then tested on the rest of the active compounds (data
not shown). Our results support the possibility of training machine learning algorithms on docking and scoring results.
Models trained in this way can later be applied to large databases in order to select ligands for further experimental
verification. This procedure will allow for faster selection of compounds in virtual HTS experiments even in
cases where no initial information about a set of active compounds for the selected protein target is known. The
random scores for recall and precision, obtained by training the SVM on a randomly selected subset of 1000 compounds,
are equal to 50%, and most methods are able to reach recall/precision values of up to 70%.
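The selection of the best docked 10% that serves as training input can be sketched as follows. This is only an illustration: the function name, the compound identifiers and the scores are made up, and the convention that a lower (more negative) docking score is better is an assumption that depends on the docking program used.

```python
def top_docked(scores, fraction=0.10, lower_is_better=True):
    """Return the identifiers of the best-scoring fraction of docked compounds."""
    ranked = sorted(scores, key=scores.get, reverse=not lower_is_better)
    n_top = max(1, round(len(ranked) * fraction))  # keep at least one compound
    return ranked[:n_top]

# Hypothetical docking scores for 20 compounds (cmpd_19 has the best score).
docking_scores = {f"cmpd_{i}": -float(i) for i in range(20)}

training_pool = top_docked(docking_scores, fraction=0.10)
print(training_pool)
```

The compounds returned here would then be encoded with AP descriptors and passed to the SVM training step, so that docking has to be run only on the subset used to build the model.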
Our results are in close agreement with other comparative studies [4, 7, 36, 45, 46]. The SVM method is a fast
and reliable machine-learning method that outperforms other types of algorithms. It is also well suited for the
classification of small molecules using 2D descriptors with respect to their potential inhibition of selected
target classes. The selection of molecular descriptors should balance the
general and the detailed level of description. The MIX tools descriptors [16] are useful for the classification of
compounds by SVM with respect to their potential for inhibition of selected protein targets. Subsequent
SVM training on the results of a docking experiment speeds up the vHTS procedure and enriches the hit list.
Our aim was to present possible in silico applications of machine learning methods to HTS data. The number
of HTS hits can become large, requiring some type of prioritization. The set of active compounds often contains
a substantial number of false positives. False negatives are also important but difficult to identify
experimentally. Application of a machine learning model can help to identify true positives and to select
compounds for retesting. In yet another application the results of the HTS itself can be used for training, and the
model can be used to identify new compounds not present in the screening collection, to be synthesized or bought from
a commercial source.
These tools can also be valuable whenever a large data set of molecules is to be screened in order to select
structures that have a higher likelihood of being inhibitors. Such molecules are often desired in order to enrich
the in-house target-specific libraries of pharmaceutical companies. An empirically derived in silico method can also
help to set priorities within the list of accessible in-house molecules to be tested experimentally. A similar
approach can be used to enrich the initial set of patented molecules by including compounds from large
commercial collections that are predicted to be active for the same protein target family.
ACKNOWLEDGMENTS
This work was supported by EC BioSapiens (LHSG-CT-2003-503265) and EC SEPSDA (SP22-CT-2004-003831)
6FP projects as well as the Polish Ministry of Education and Science (PBZ-MNiI-2/1/2005 and
2P05A00130). MvG and LSW would like to thank the Foundation for Polish Science for the fellowship.
Table I. SVM classification results for selected 5 targets on launched and preclinical inhibitors from
MDDR database.

| Protein target | Observed class | Training: predicted 0 | Training: predicted 1 | Testing: predicted 0 | Testing: predicted 1 | Training set | Testing set |
|---|---|---|---|---|---|---|---|
| COX2 | 0 | 2106 | 0 | 8308 | 38 | Recall 92% | Recall 73% |
| COX2 | 1 | 6 | 69 | 10 | 27 | Precision 100% | Precision 42% |
| DH | 0 | 2151 | 0 | 8390 | 9 | Recall 100% | Recall 73% |
| DH | 1 | 0 | 17 | 3 | 8 | Precision 100% | Precision 47% |
| TH | 0 | 2032 | 0 | 8401 | 22 | Recall 98% | Recall 74% |
| TH | 1 | 1 | 47 | 9 | 26 | Precision 100% | Precision 54% |
| RT | 0 | 2130 | 0 | 8277 | 43 | Recall 100% | Recall 31% |
| RT | 1 | 0 | 54 | 24 | 11 | Precision 100% | Precision 20% |
| AE | 0 | 2351 | 2 | 9028 | 24 | Recall 100% | Recall 42% |
| AE | 1 | 0 | 22 | 7 | 5 | Precision 92% | Precision 17% |

The SVM classification performance on the set of preclinical or launched inhibitors from the MDDR database is
described here using confusion tables. Rows represent the experimentally observed class of a compound for
each target (0 = inactive, 1 = active), whereas columns represent the predicted class. The list of protein targets
includes: cyclooxygenase-2 (COX2), dihydrofolate reductase (DH), thrombin (TH), reverse transcriptase (RT) and
antiestrogen (AE). First we present results on the training datasets, with 2/3 of the available positives (preclinical or
launched inhibitors for the selected target) and 1/3 of the negatives (a randomly selected subset of preclinical or
launched compounds known not to inhibit the selected target). On the right are the results of the SVM models on the
testing datasets, containing 1/3 of the positives and 2/3 of the available negatives, not used in the training of the SVM method.
The last two columns present the recall and precision values on the training and testing datasets.
Table II. The classification accuracy for SVM method on selected 5 targets for biologically tested
inhibitors of MDL drug data report.

| Protein target | Number of positives / negatives | Number of biological testing compounds | Recall / Precision of the best method |
|---|---|---|---|
| cyclooxygenase-2 | 110 / 10408 | 793 | 69% / 95% |
| dihydrofolate | 28 / 10492 | 154 | 71% / 98% |
| thrombin | 111 / 10418 | 1070 | 62% / 98% |
| reverse transcriptase | 113 / 10406 | 596 | 52% / 96% |
| antiestrogen | 34 / 12164 | 255 | 40% / 96% |

The SVM algorithm is trained here on the whole set of preclinical or launched MDL Drug Data Report inhibitors
and non-active compounds for the five selected protein targets. The classification models
were then tested on compounds annotated in MDDR as biologically tested on the selected target
(excluding those that are preclinical or already launched on the market).
The first column presents the protein target name. The second column gives the numbers of positive and negative
instances used in training. The third column shows the number of compounds annotated by the MDL Drug Data
Report as biologically tested on the selected protein target. The recall and precision values for the SVM
method are in the fourth column.
Table III. The classification accuracy for support vector machine method on selected 5 targets for
patented compounds from MDL drug data report.

| Protein target | Number of training negatives / positives (oldest 1/3) | Number of testing negatives / positives (newest 2/3) | Recall / Precision on the training dataset | Recall / Precision on the testing dataset |
|---|---|---|---|---|
| cyclooxygenase-2 | 2106 / 31 | 8347 / 81 | 94% / 100% | 64% / 66% |
| dihydrofolate | 2151 / 12 | 8390 / 16 | 100% / 100% | 50% / 44% |
| thrombin | 2034 / 34 | 8425 / 76 | 100% / 100% | 33% / 66% |
| reverse transcriptase | 2130 / 54 | 8323 / 59 | 100% / 100% | 46% / 33% |
| antiestrogen | 2353 / 6 | 9700 / 22 | 100% / 100% | 9% / 15% |

The supervised SVM method classification performance trained using AP descriptors. The list of protein targets<br />
include: cyclooxygenase-2 (COX2), dihydrofolate (DH), thrombin (TH), reverse transcriptase (RT) and<br />
antiestrogen (AE). The oldest one-third (first four columns) or the oldest two-thirds (last four columns) of a<br />
subset of MDL drug data report inhibitors and non-active compounds for those targets were used for training<br />
supervised machine-learning algorithms. The classification models were tested then on rest of patented<br />
compounds.<br />
The first column gives the protein target name. The second column gives the numbers of negative and positive instances used in training (the oldest 1/3 of the patented compounds). The third column shows the number of compounds in the testing set, i.e. the newest 2/3 of the patented compounds annotated in the MDL Drug Data Report for inhibition of the selected protein target. The recall and precision values for the SVM method on the training dataset are given in the fourth column. The fifth column gives the recall and precision values on the testing dataset.
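The chronological split described in this legend can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the `date` field, the compound identifiers and the `chronological_split` helper are all hypothetical, and the sketch assumes each compound carries a sortable date such as a patent year.

```python
def chronological_split(compounds, train_fraction=1 / 3):
    """Sort compounds by date; return (oldest part, newest part).

    `compounds` is a list of dicts with a hypothetical "date" key.
    """
    ordered = sorted(compounds, key=lambda c: c["date"])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy data: nine hypothetical compounds dated 1988..1996.
compounds = [{"id": f"cmpd{i}", "date": 1988 + i} for i in range(9)]
train, test = chronological_split(compounds)
print(len(train), "training compounds;", len(test), "testing compounds")
```

Splitting by date rather than at random mimics prospective use: the model only sees compounds that were known before the ones it is asked to classify.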
Table IV. Classification accuracy of the support vector machine method on 5 selected targets for the top 10% of best docked compounds.

Protein target | Positives in MDDR db | FLOG docking results | Best docking methods | Best scoring methods | Best D&S results | Recall / precision on the training dataset
Cyclooxygenase-2 | 98 | 1 | GLIDE, FRED | SS, ChemScore | 60, 63 | 93.20% / 98.97%
Dihydrofolate | 27 | 6 | ICM | ICM | 10 | 64.74% / 62.43%
Thrombin | 99 | 46 | ICM | ICM | 99 | 72.64% / 68.28%
Reverse transcriptase | 108 | 0 | ICM | ICM | 10 | 79.29% / 80.16%
Antiestrogen | 32 | 17 | GLIDE, FRED | GlideScore, ChemScore | 24, 23 | 74.00% / 71.36%

Classification performance of the supervised SVM method trained using AP descriptors. The protein targets are cyclooxygenase-2 (COX2), dihydrofolate (DH), thrombin (TH), reverse transcriptase (RT) and antiestrogen (AE). A subset of MDL Drug Data Report compounds, including both active and non-active compounds for those targets, was docked to the protein targets using various docking methods. The best-docked 10% of compounds were then used to train the supervised machine-learning algorithms. The classification models were then tested on the rest of the active compounds.

The first column gives the protein target name. The second column gives the number of active compounds found in the MDL Drug Data Report database for the selected protein target. The number of negatives used in the docking experiment was fixed at 10000 for all targets. The third column shows the results of the FLOG fast and flexible docking procedure, i.e. the number of active compounds found in the first 10% of the ligands ordered by FLOG docking score. The fourth and fifth columns give the names of the best docking and scoring methods. The sixth column gives the number of active compounds found in the first 10% of the list of ligands ordered by the best docking program followed by scoring. This set of the 10% best-scoring compounds was then used to train the support vector machine (SVM). The recall and precision values for training are given in the seventh column.
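The quantity reported in the third and sixth columns, the number of actives found in the first 10% of a score-ordered ligand list, can be sketched as below. The helper name, the toy scores and the convention that a lower score means a better pose are assumptions made for illustration; scoring conventions differ between docking programs.

```python
def actives_in_top_percent(scored, top_percent=10, lower_is_better=True):
    """Count actives in the best-scoring `top_percent` of ligands.

    `scored` is a list of (score, is_active) pairs. Returns the count of
    actives in the top slice and the slice itself (e.g. for later training).
    """
    ranked = sorted(scored, key=lambda item: item[0],
                    reverse=not lower_is_better)
    n_top = max(1, len(ranked) * top_percent // 100)
    top = ranked[:n_top]
    n_actives = sum(1 for _, is_active in top if is_active)
    return n_actives, top


# Toy data: 20 hypothetical ligands, three of them active.
scored = [(-float(i), i in {3, 15, 19}) for i in range(20)]
n_actives, top = actives_in_top_percent(scored)
print(n_actives, "active(s) among the top", len(top), "ligands")
```

The returned top slice corresponds to the set that the legend describes as being passed on to SVM training.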
REFERENCES
1. MDL, MDL Drug Data Report (2004). Coverage: 1988-present; updated monthly. Focus: Drugs launched or under development, as referenced in the patent literature, conference proceedings, and other sources; descriptions of therapeutic action and biological activity; tracking of compounds through development phases. Size: 132726 molecules, 129459 models. Updates add approximately 10,000 new compounds per year. 2004.
2. Nidhi, et al., Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases. J Chem Inf Model, 2006. 46(3): p. 1124-33.
3. Fang, J., et al., Support vector machines in HTS data mining: Type I MetAPs inhibition study. J Biomol Screen, 2006. 11(2): p. 138-44.
4. Briem, H. and J. Gunther, Classifying "kinase inhibitor-likeness" by using machine-learning methods. Chembiochem, 2005. 6(3): p. 558-66.
5. Bender, A. and R.C. Glen, A discussion of measures of enrichment in virtual screening: comparing the information content of descriptors with increasing levels of sophistication. J Chem Inf Model, 2005. 45(5): p. 1369-75.
6. Plewczynski, D., S.A. Spieser, and U. Koch, Assessing different classification methods for virtual screening. J Chem Inf Model, 2006. 46(3): p. 1098-106.
7. Jenkins, J.L., R.Y. Kao, and R. Shapiro, Virtual screening to enrich hit lists from high-throughput screening: a case study on small-molecule inhibitors of angiogenin. Proteins, 2003. 50(1): p. 81-93.
8. Kalgutkar, A.S. and Z. Zhao, Discovery and design of selective cyclooxygenase-2 inhibitors as nonulcerogenic, anti-inflammatory drugs with potential utility as anti-cancer agents. Curr Drug Targets, 2001. 2(1): p. 79-106.
9. Dicker, A.P., et al., Identification and characterization of a mutation in the dihydrofolate reductase gene from the methotrexate-resistant Chinese hamster ovary cell line Pro-3 MtxRIII. J Biol Chem, 1990. 265(14): p. 8317-21.
10. Schweitzer, B.I., A.P. Dicker, and J.R. Bertino, Dihydrofolate reductase as a therapeutic target. Faseb J, 1990. 4(8): p. 2441-52.
11. Ambler, J., et al., The discovery of orally available thrombin inhibitors: studies towards the optimisation of CGH1668. Bioorg Med Chem Lett, 1998. 8(24): p. 3583-8.
12. Menear, K., Progress towards the discovery of orally active thrombin inhibitors. Curr Med Chem, 1998. 5(6): p. 457-68.
13. Gustafsson, J.A., Therapeutic potential of selective estrogen receptor modulators. Curr Opin Chem Biol, 1998. 2(4): p. 508-11.
14. Castro, H.C., et al., HIV-1 reverse transcriptase: a therapeutical target in the spotlight. Curr Med Chem, 2006. 13(3): p. 313-24.
15. Sheridan, R.P., The centroid approximation for mixtures: calculating similarity and deriving structure-activity relationships. J Chem Inf Comput Sci, 2000. 40(6): p. 1456-69.
16. Miller, M.D., R.P. Sheridan, and S.K. Kearsley, SQ: a program for rapidly producing pharmacophorically relevent molecular superpositions. J Med Chem, 1999. 42(9): p. 1505-14.
17. Vapnik, V.N., The nature of statistical learning theory. Vol. xv. 1995, New York: Springer.
18. Vapnik, V.N., Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. Vol. xxiv. 1998, New York: Wiley. 736.
19. Cristianini, N. and J. Shawe-Taylor, An introduction to support vector machines : and other kernel-based learning methods. 2000, Cambridge, U.K. ; New York: Cambridge University Press. xiii, 189.
20. Joachims, T., Learning to classify text using support vector machines. Kluwer international series in engineering and computer science SECS. Vol. xvi. 2002, Boston: Kluwer Academic Publishers. 205.
21. Franc, V. and V. Hlaváč, An iterative algorithm learning the maximal margin classifier. Pattern Recognition, 2003. 36(9): p. 1985-1996.
22. Kim, H. and H. Park, Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng, 2003. 16(8): p. 553-60.
23. Minakuchi, Y., K. Satou, and A. Konagaya. Prediction of protein-protein interaction sites using support vector machines. in International conference on mathematics and engineering techniques in medicine and biological sciences. 2003.
24. Valentini, G., Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artif Intell Med, 2002. 26(3): p. 281-304.
25. Guyon, I., et al., Gene selection for cancer classification using support vector machines. Mach. Learn., 2002. 46: p. 389-422.
26. Brown, M.P., et al., Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A, 2000. 97(1): p. 262-7.
27. Furey, T.S., et al., Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 2000. 16(10): p. 906-14.
28. Krishnan, V.G. and D.R. Westhead, A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics, 2003. 19(17): p. 2199-209.
29. Pavlidis, P., et al. Gene functional classification from heterogeneous data. in 5th International Conference on Computational Molecular Biology. 2001. Montreal, Canada: ACM Press.
30. Zien, A., et al., Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 2000. 16(9): p. 799-807.
31. Jaakkola, T., M. Diekhans, and D. Haussler, A discriminative framework for detecting remote protein homologies. J Comput Biol, 2000. 7(1-2): p. 95-114.
32. Hua, S. and Z. Sun, A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol, 2001. 308(2): p. 397-407.
33. Ding, C.H. and I. Dubchak, Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 2001. 17(4): p. 349-58.
34. Schölkopf, B., C.J.C. Burges, and A.J. Smola, Advances in kernel methods : support vector learning. Vol. vii. 1999, Cambridge, Mass.: MIT Press. 376.
35. Zavaljevski, N., F.J. Stevens, and J. Reifman, Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions. Bioinformatics, 2002. 18(5): p. 689-96.
36. Burbidge, R., et al., Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput Chem, 2001. 26(1): p. 5-14.
37. Miller, M.D., et al., FLOG: a system to select 'quasi-flexible' ligands complementary to a receptor of known three-dimensional structure. J Comput Aided Mol Des, 1994. 8(2): p. 153-74.
38. Friesner, R.A., et al., Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem, 2004. 47(7): p. 1739-49.
39. Halgren, T.A., et al., Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J Med Chem, 2004. 47(7): p. 1750-9.
40. Miteva, M.A., et al., Fast structure-based virtual ligand screening combining FRED, DOCK, and Surflex. J Med Chem, 2005. 48(19): p. 6012-22.
41. Ewing, T.J., et al., DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases. J Comput Aided Mol Des, 2001. 15(5): p. 411-28.
42. Buzko, O.V., A.C. Bishop, and K.M. Shokat, Modified AutoDock for accurate docking of protein kinase inhibitors. J Comput Aided Mol Des, 2002. 16(2): p. 113-27.
43. Vaque, M., et al., BDT: an easy-to-use front-end application for automation of massive docking tasks and complex docking strategies with AutoDock. Bioinformatics, 2006.
44. Fernandez-Recio, J., M. Totrov, and R. Abagyan, ICM-DISCO docking by global energy optimization with fully flexible side-chains. Proteins, 2003. 52(1): p. 113-7.
45. Byvatov, E., et al., Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. J Chem Inf Comput Sci, 2003. 43(6): p. 1882-9.
46. Glick, M., et al., Enrichment of high-throughput screening data with increasing levels of noise using support vector machines, recursive partitioning, and laplacian-modified naive bayesian classifiers. J Chem Inf Model, 2006. 46(1): p. 193-200.