04.11.2012 Views

T xT Kl xTf - ICM


Target Specific Compound Identification using Support Vector Machine.

Dariusz Plewczynski 1,2*, Marcin von Grotthuss 1, Stephane Spieser 3, Leszek Rychlewski 1, Lucjan S. Wyrwicz 1, Uwe Koch 3

1) BioInfoBank Institute, Limanowskiego 24A/16, 60-744 Poznan, Poland, Tel: +48-61-8653520, Fax: +48-61-8643350, E-mail: darman@bioinfo.pl

2) Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Warsaw, Poland

3) Istituto di Ricerche di Biologia Molecolare (IRBM) “P. Angeletti”, Merck&Co. Inc., Pomezia, Italy

* Correspondence should be addressed to Dariusz Plewczynski (darman@bioinfo.pl) and Uwe Koch (uwe_koch@merck.com).

RUNNING TITLE: Target Specific Compound Identification by Support Vector Machine.

KEYWORDS:
1) Compound Identification,
2) Protein target specificity,
3) MDL Drug Data Report,
4) Machine-learning methods,
5) Atom Pairs,
6) Support Vector Machine.

ABBREVIATIONS:
1. SVM – support vector machine,
2. AP – atom pairs,
3. MDDR – MDL Drug Data Report.


ABSTRACT

At the beginning of an HTS campaign, some information about active molecules is often already available. Active compounds (such as substrate analogues, natural products, inhibitors of a related protein or ligands published by a pharmaceutical company) have frequently been identified in low-throughput validation studies of the biochemical target. We evaluate to what extent a support vector machine can be trained on those compounds and used to classify a collection with unknown activity. This approach is aimed at reducing the number of compounds to be tested against the given target. Our method predicts the biological activity of chemical compounds based only on Atom Pairs (AP) two-dimensional topological descriptors. The supervised Support Vector Machine (SVM) method is trained here on compounds from the MDL Drug Data Report (MDDR) known to be active for a specific protein target. For detailed analysis we have selected five different biological targets: cyclooxygenase-2, dihydrofolate reductase, thrombin, HIV reverse transcriptase and antagonists of the estrogen receptor. The accuracy of compound identification is estimated here using the recall and precision values. The sensitivities for all types of targets are over 80% and the classification performance reaches 100% for selected targets. The second application of our method addresses the situation in which no initial set of actives is known for the selected protein target at the beginning of an HTS campaign. In that case virtual high-throughput screening (vHTS) is applied, in most cases by a flexible docking procedure. A vHTS experiment typically yields a large percentage of false positives that must be verified by costly and time-consuming experimental follow-up assays. The subsequent use of our machine learning method improves both the speed (docking need not be performed on all compounds of the database) and the accuracy of HTS hit lists (the enrichment factor).

INTRODUCTION

Genomic research provides an ever-increasing number of potential drug targets. In the past, large compound collections were tested against a single target. Recently it has become common practice in applied research to screen large compound collections for potential activity in in vitro high-throughput screening (HTS) model studies in order to identify new lead compounds. However, with a larger number of drug targets, the question is often raised of how to preselect chemical versus biological space more efficiently. The aim of the present study is to present a fast and reliable in silico method that captures the essential features of inhibitor molecules.

High-throughput screening (HTS) allows millions of compounds to be tested for activity against the chosen target. As a result, a set of lead molecules with relatively high activity against the target is identified. The number of these compounds can be relatively high, and they are therefore usually subjected to further prioritization based on the assessment of various molecular characteristics. Although highly successful, this approach cannot be applied simultaneously to the large number of drug targets emerging from genomic research. One solution is to reduce the number of compounds to be tested to those with a high probability of activity. Computational methods can contribute to this process in many ways. We focus here on the application of the support vector machine (SVM), a supervised machine learning approach. We evaluated our method in terms of its capability to recognize known ligands for five diverse protein targets of high medicinal relevance, which have already been investigated in several drug discovery programs.

Chemists have gathered expertise on the features of molecular structures that are important for inhibition of specific targets. Thus, even the 2D structure of a ligand allows some estimate of its activity for a given protein target. We tried to express this empirical knowledge about inhibitors as a computational prediction model. By applying the support vector machine algorithm we classify compounds that are active against a given biological target according to the commercially available MDL Drug Data Report [1]. In the past, various machine-learning approaches have been used for a number of compound-based classification problems. For example, neural networks have been used as drug-likeness filters to distinguish drugs from non-drugs, and to classify compounds based on their ADME properties, toxicity and target specificity. In many of these applications the use of parameters describing the compound’s topology gave satisfactory results. That is why we have used AP two-dimensional ligand descriptors to represent the variety of chemical space.

Target identification is a critical step following the discovery of small molecules. Nidhi et al. [2] provided an in silico method for predicting potential targets for compounds on the basis of chemical structure alone. They used a multiple-category Laplacian-modified naive Bayesian model trained on extended-connectivity fingerprints of compounds from 964 target classes in the WOMBAT (World Of Molecular BioAcTivity) chemogenomics database. The algorithm was then tested by finding the three most likely protein targets for all compounds of the MDDR (MDL Drug Data Report) database [1]. On average, the correct target was found 77% of the time for compounds from 10 MDDR activity classes with known targets [2]. The support vector machine was recently used to describe high-throughput screening (HTS) data with great success [3]. With carefully selected parameters, SVM models increased the hit rates significantly, and 50% of the active compounds could be recovered by screening just 7% of the test set. The authors found that the size of the training set played a significant role in the performance of the models: a training set of 10,000 compounds is likely the minimum size required to build a model with reasonable predictive power [3]. In other work, using an in-house data set of small-molecule structures encoded by Ghose-Crippen parameters, several machine learning techniques were applied to distinguish between kinase inhibitors and other molecules with no reported activity on any protein kinase [4]. The authors compared four approaches: support vector machines (SVM), artificial neural networks (ANN), k-nearest-neighbor classification with GA-optimized feature selection (GA/kNN), and recursive partitioning (RP). Support vector machines, followed by the GA/kNN combination, outperformed the other techniques when comparing the average of individual models. An approach similar to ours is presented in [5]. The authors performed virtual screening using very simple features, employing the number of atoms per element as molecular descriptors without regard to any structural information whatsoever. These atom counts are able to outperform virtual-affinity-based fingerprints and Unity fingerprints in some activity classes. This fact can partly be explained by highly nonlinear structure-activity relationships, which represent a severe limitation of the "similar property principle" in the context of bioactivity [5].

In our previous paper [6] we addressed the following question: “How well do different classification methods perform in selecting the ligands of a protein target out of large compound collections not used to train the model?”. In that work support vector machines, random forest, artificial neural networks, k-nearest-neighbor classification with genetic-algorithm-optimized feature selection, trend vectors, naive Bayesian classification, and decision trees were used to divide databases into molecules predicted to be active and those predicted to be inactive. Training and predicted activities were treated as binary. We reported significant differences in the performance of the methods, independent of the biological target and compound class. Different methods can have different applications; some provide particularly high enrichment, others are strong in retrieving the maximum number of actives. We also showed that these methods do surprisingly well in predicting recently published ligands of a target on the basis of initial leads, and that combining the results of different methods can in certain cases improve on the most consistent single method. In the present paper we focus only on our novel SVM-based method and provide a more in-depth description of the methodology and results. We believe that this paper can help readers to use our protocol for solving similar HTS problems. We therefore decided not to include a comparison of all currently known approaches, which is outside the scope of this manuscript. Our results provide higher sensitivity and selectivity compared to other recently published methods (such as the combination of SVM with naïve Bayesian trained on Ghose-Crippen parameters, and others [2-5]).

The list of hits generated by virtual high-throughput screening (vHTS) typically contains a large percentage of false positives, making experimental follow-up assays necessary to distinguish active from inactive substances. Here we present another application of the SVM-based method, aimed at improving the accuracy of HTS hit lists by the subsequent use of a machine learning method. Virtual screening is often performed on large chemical libraries, so selecting hits by statistical algorithms instead of a time-consuming docking procedure is of great importance [7]. We address this problem in a case study on five protein targets: HIV reverse transcriptase, COX2, dihydrofolate reductase, estrogen receptor and thrombin, in conjunction with the MDL Drug Data Report database [1]. The virtual HTS was performed with a set of different flexible docking and scoring methods. Our results reveal that the support vector machine algorithm is able to speed up the vHTS procedure by limiting the set of ligands to be docked on a target, and also to provide better enrichment of the HTS hit rate.

COMPUTATIONAL METHODS

Data sets and AP descriptors

Five diverse protein targets were tested: human cyclooxygenase-2 [8], dihydrofolate reductase [9, 10], thrombin [11, 12], antiestrogen [13] and HIV reverse transcriptase [14]. The datasets used for training and testing comprise both active and inactive compounds from a subset of the MDL Drug Data Report [1]. All compounds are clinically tested or already launched on the market. Inhibitors and non-active compounds for those targets were used for training the supervised machine-learning algorithm. For additional tests we also selected compounds that are currently being biologically tested for inhibition of the targets and are not yet on the market or in clinical tests.

The entire pool of compounds for the cyclooxygenase-2 target contains 112 inhibitors and 10452 inactive compounds, divided randomly into two subsets: training (75 inhibitors, 2106 inactive compounds) and testing (37 and 8346, respectively). For dihydrofolate reductase we have 28 inhibitors (17 for training and 11 for testing) and 10529 inactive compounds (2149 and 8380). For reverse transcriptase we have selected 114 inhibitors (79 and 35) and 10450 inactive compounds (2130 and 8320), and for thrombin we have 112 inhibitors (77 and 35) and 10459 inactive compounds (2036 and 8423). For the last target, antiestrogen, we have collected 34 inhibitors (22 and 12) and 11580 inactive compounds (2528 for training and 9052 for testing). For the additional test we selected molecules biologically tested for inhibition of cyclooxygenase-2 (792 molecules), dihydrofolate reductase (154), thrombin (1066), reverse transcriptase (597) and antiestrogen (256).

We have used the regular atom pair (AP) descriptors [15] because of their proven success in classifying compounds, their ease of use and their interpretability. To encode the molecular structures we employed the MIX tools script [16], which counts, for each pair of atoms, the number of covalent bonds that join them. For each compound it thus yields a binary vector with 1 for every atom pair type present in the molecule and 0 for those that are absent. In addition, we also tested a larger set of six additional descriptors, such as TT (regular topological torsions), DP (pairs using SQ types), DT (torsions using SQ types), DRUGBITS (substructures) and the ROF6 set of descriptors [16]. Including those additional descriptors in the training did not produce any significant improvement of the results.
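The binary atom pair encoding can be illustrated with a minimal Python sketch. It is not the MIX tools script [16]: the function names are hypothetical, plain element symbols stand in for the richer atom types of the regular AP descriptors [15], and the distance is taken as the number of bonds on the shortest path between two atoms.

```python
from collections import deque


def bond_distances(adjacency, start):
    """Breadth-first search: bonds on the shortest path from `start` to each atom."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return distance


def atom_pair_keys(atom_types, bonds):
    """Return the set of (type, bond distance, type) keys present in a molecule."""
    adjacency = {i: [] for i in range(len(atom_types))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    keys = set()
    for a in range(len(atom_types)):
        distance = bond_distances(adjacency, a)
        for b in range(a + 1, len(atom_types)):
            if b in distance:  # skip atoms in disconnected fragments
                first, second = sorted((atom_types[a], atom_types[b]))
                keys.add((first, distance[b], second))
    return keys


def binary_vector(keys, vocabulary):
    """1 for every atom pair type of the vocabulary present in the molecule, else 0."""
    return [1 if key in keys else 0 for key in vocabulary]


# Toy molecule (assumption): heavy atoms of ethanol, C-C-O.
ethanol = atom_pair_keys(["C", "C", "O"], [(0, 1), (1, 2)])
# Hypothetical vocabulary; in practice it would cover all pair types in the dataset.
vocabulary = [("C", 1, "C"), ("C", 1, "O"), ("C", 2, "O"), ("C", 1, "N")]
print(binary_vector(ethanol, vocabulary))
```

Because most molecules contain only a small fraction of all possible atom pair types, the resulting vectors are sparse.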

Classification and model validation

We trained the support vector machine algorithm for each type of target. First, we created the dataset of compounds with experimentally verified activity (positive instances). Then we built the dataset of inactive compounds (negative instances). The negative instances are chosen randomly from launched or preclinical compounds that have no experimentally verified activity for the selected type of target. These two datasets (positive and negative instances) are then projected as sets of points into a multidimensional space using the AP two-dimensional descriptors described in the previous subsection.

It is well known that some machine-learning methods have difficulties handling unbalanced training sets, i.e. sets in which the number of positive instances differs substantially from the number of negatives. Therefore, for each target we randomly selected 1/3 of the negatives for training and 2/3 for testing, and 2/3 of the positive instances for training and 1/3 for testing. The selections were repeated a few times for all targets, yet no significant differences were observed between the various selections. We present here the average results for the training and testing phases of the experiments. Models are derived for each training set independently and used to predict the activity of the compounds in the test set.
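The asymmetric random split described above can be sketched as follows. This is an illustrative reading of the protocol, not the authors' script: the function name, the seed handling and the integer rounding of the 1/3 and 2/3 fractions are assumptions.

```python
import random


def split_dataset(positives, negatives, seed=0):
    """Put 2/3 of positives and 1/3 of negatives in training; the rest go to testing."""
    rng = random.Random(seed)  # fixed seed so that a selection can be repeated
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_pos_train = (2 * len(pos)) // 3
    n_neg_train = len(neg) // 3
    train = [(x, +1) for x in pos[:n_pos_train]] + [(x, -1) for x in neg[:n_neg_train]]
    test = [(x, +1) for x in pos[n_pos_train:]] + [(x, -1) for x in neg[n_neg_train:]]
    return train, test


# Hypothetical compound identifiers: 30 actives and 90 inactives.
actives = [f"p{i}" for i in range(30)]
inactives = [f"n{i}" for i in range(90)]
train, test = split_dataset(actives, inactives, seed=1)
print(len(train), len(test))
```

Calling the function with several seeds reproduces the repeated selections whose results are averaged.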

There are many ways to present the performance of an SVM classifier. We use here the classification error E, precision P and recall R values, together with confusion tables. Their definitions are given below:

$$E = \frac{fp + fn}{tp + fp + tn + fn} \cdot 100\%, \qquad R = \frac{tp}{tp + fn} \cdot 100\%, \qquad P = \frac{tp}{tp + fp} \cdot 100\%, \qquad \text{[Eq. 1]}$$

where tp is the number of true positives, fp is the number of false positives, tn is the number of true negatives and fn is the number of false negatives. The classification error E provides an overall error measure, whereas recall R gives the percentage of observed positives that are correctly predicted (the probability that an active compound is recognized), and precision P gives the percentage of predicted positives that are indeed observed positives (a measure of the reliability of positive predictions).
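As a worked example of the definitions of E, R and P, the short Python sketch below evaluates them for one confusion table. The function name is hypothetical, and the counts are taken from the COX2 testing dataset of Table I under the assumption that its rows are observed classes and its columns predicted classes.

```python
def classification_measures(tp, fp, tn, fn):
    """Return classification error E, recall R and precision P, all in percent."""
    total = tp + fp + tn + fn
    error = 100.0 * (fp + fn) / total
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    return error, recall, precision


# COX2 testing dataset: 27 actives found, 10 actives missed,
# 38 inactives predicted active, 8308 inactives predicted inactive.
error, recall, precision = classification_measures(tp=27, fp=38, tn=8308, fn=10)
print(f"E = {error:.2f}%, R = {recall:.0f}%, P = {precision:.0f}%")
```

The example shows why E alone is uninformative for unbalanced data: the error stays below 1% because inactive compounds dominate, while recall and precision reveal how well the few actives are retrieved.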

The Support Vector Machine (SVM) Method.

SVM is an effective statistical learning method [17-19] with good performance, yet it is easier to implement than neural networks. It has been successfully applied to various problems including text classification [20], image recognition tasks [21], bioinformatics [22, 23] and medical applications [24, 25]. The SVM approach has also been used in the analysis of gene expression data [26], the classification of microarray data [27], gene functional classification [28-30] and protein analysis [31-33].

Most of those tasks have the property of sparse instance vectors. The SVM approach is able to construct predictive models with large generalization power even when the dimensionality of the data is large and the number of observations available for training is low. SVM training seeks a globally optimal solution and is designed to limit over-fitting, so a large number of features (as in our binary representation of ligand topology) is acceptable. The SVMlight implementation by Thorsten Joachims [34] is used in the field of bioinformatics [35]. The crucial idea behind it is to exploit the sparsity of the instance vectors to obtain a compact and efficient representation.
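The sparse representation can be illustrated by writing a binary descriptor vector in the plain-text input format read by SVMlight, where each line holds a class label followed by `index:value` pairs for the non-zero features only, with 1-based indices in ascending order. The helper below is a hypothetical sketch, not part of SVMlight or of the authors' pipeline.

```python
def to_svmlight_line(label, bits):
    """Format one compound as '<label> <index>:<value> ...' listing only set bits."""
    features = " ".join(f"{i}:1" for i, bit in enumerate(bits, start=1) if bit)
    target = "+1" if label > 0 else "-1"
    return f"{target} {features}".rstrip()


# Hypothetical 5-bit descriptor vectors for one active and one inactive compound.
print(to_svmlight_line(+1, [0, 1, 0, 0, 1]))
print(to_svmlight_line(-1, [1, 0, 0, 0, 0]))
```

Only the positions of the set bits are stored, so the size of each record grows with the number of atom pair types present in a molecule rather than with the size of the descriptor vocabulary.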

The output of the training phase is a classification function, i.e. a model. It consists of a set of D support vectors $T_i$ and coefficients $\alpha_i$, which are nonzero, positive real numbers. These constants are obtained from the optimization procedure (a quadratic programming, QP, problem) used to find the maximal margin hyperplane. The number of free parameters of the QP problem is equal to the number of instances in the training dataset. The non-zero parameter $\alpha_i$ describes the strength of the i-th support vector in the decision function. SVM chooses as support vectors those points that lie closest to the separating hyperplane. The kernel function defines the feature space reached by a nonlinear mapping function $\Omega$ from the embedding space. The mapping function $\Omega$ need not be defined explicitly, because the kernel function uses only inner products of the mapped vectors. The kernel function is a positive definite function reflecting the similarity between an input sample and the set of support vectors $T_i$. In most cases three types of kernels are used: linear, polynomial or radial basis function.

The reliability of the classification of a ligand [36] as an active one is given by the decision function:

$$f(T_x) = \sum_{i=1}^{D} l_i \, \alpha_i \, K\big(\Omega[T_x], \Omega[T_i]\big), \qquad \text{[Eq. 2]}$$

where $K$ is the kernel function that defines the feature space, $\Omega$ is a nonlinear mapping function from the embedding space $T$ into the feature space, and $l_i$ are the class labels of the support vectors, known a priori. We use $l_i = +1$ for positive cases and $l_i = -1$ for negative ones.
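The kernel expansion above can be evaluated directly once the support vectors, their labels and their coefficients are known. The following Python sketch uses invented support vectors and coefficients purely for illustration. Like the expression in the text, it omits the bias (threshold) term that practical SVM implementations usually add.

```python
import math


def linear_kernel(x, y):
    """Inner product of two descriptor vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis kernel; gamma is an assumed, untuned width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def decision_value(x, support_vectors, labels, alphas, kernel=linear_kernel):
    """Sum of l_i * alpha_i * K(x, T_i) over all support vectors T_i."""
    return sum(l * a * kernel(x, t)
               for t, l, a in zip(support_vectors, labels, alphas))


# Invented model with two support vectors: one active (+1), one inactive (-1).
support_vectors = [[1, 1, 0], [0, 0, 1]]
labels = [+1, -1]
alphas = [0.5, 1.0]

print(decision_value([1, 0, 0], support_vectors, labels, alphas))
print(decision_value([0, 0, 1], support_vectors, labels, alphas))
print(decision_value([1, 1, 0], support_vectors, labels, alphas, kernel=rbf_kernel))
```

A positive value places the query compound on the active side of the hyperplane, and its magnitude can be read as the reliability of that assignment.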

RESULTS AND DISCUSSION

The major goal of this study was to test to what extent the supervised support vector machine method is capable of learning and predicting the target-specific inhibition likelihood for chemical compounds based on the MDL Drug Data Report [1]. For each of the 5 protein targets we prepared 2 datasets, one for training and one for testing. Each includes compounds that are known to inhibit the protein target (2/3 of all available positives for training, 1/3 for testing) and compounds known not to be active for the selected target (1/3 of all available negatives for training, 2/3 for testing). The support vector machine algorithm was trained on the first dataset and tested on the second one. In the following paragraphs we discuss the classification results obtained for each of the biological targets and present the performance using confusion tables for the training and testing datasets. In addition we provide the overall value of the classification error and the precision/recall values. In general the SVM models yield a successful classification of compounds for all targets (see Table I for confusion tables for the training and testing datasets). The method provides robust and reliable models for all types of protein targets. The SVM algorithm turns out to be a method of choice for practical purposes: it is fast, efficient and robust.

We performed two additional in silico experiments. The first one uses for training all active compounds in the MDL Drug Data Report that are annotated as preclinical or launched. Testing is done on potential inhibitors that are annotated as biologically tested. The second experiment trains the SVM method on the oldest 1/3 of the compounds known to be inhibitors and tests its accuracy on the rest, i.e. the 2/3 most recently developed and patented compounds. The performance of the machine-learning models in both experiments for each type of protein target is described by the recall R and the precision P. The recall R gives the percentage of observed positives that are correctly predicted, whereas the precision P gives the percentage of predicted positives that are observed positives. These measures of accuracy are calculated separately for each type of protein target and presented in Table II (the first experiment) and Table III (the second experiment). In the first experiment the typical recall value is around 60%, and the precision P is close to 100% for all targets. The results for the second experiment are slightly worse, which is caused by the discovery of novel drug classes that are present in the newest 2/3 of the compounds.
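The chronological split of the second experiment can be sketched in Python as below. The record layout (an identifier paired with a year) and the function name are assumptions made for illustration; the text does not specify which MDDR date field was used for ordering.

```python
def chronological_split(compounds, train_fraction=1 / 3):
    """Sort (compound_id, year) records by year; the oldest fraction trains the model."""
    ordered = sorted(compounds, key=lambda record: record[1])
    n_train = round(len(ordered) * train_fraction)
    return ordered[:n_train], ordered[n_train:]


# Hypothetical inhibitors with the year of their first patent.
inhibitors = [("a", 1995), ("b", 1990), ("c", 2001),
              ("d", 1998), ("e", 1992), ("f", 2003)]
oldest, newest = chronological_split(inhibitors)
print([cid for cid, _ in oldest], [cid for cid, _ in newest])
```

Unlike a random split, this setup asks the model to extrapolate to chemistry developed after the training compounds, which is consistent with the lower recall reported for this experiment.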

In Table IV we present the results of enrichment studies of virtual high-throughput screening. We trained the SVM algorithm on the top 10% of the best-scoring ligands from the docking and scoring experiments for each protein target. A subset of the MDL Drug Data Report, including both active and non-active compounds for those targets, was docked on the protein targets using various docking methods (FLOG [37], GLIDE [38, 39], FRED [40], Dock [41], Autodock [42, 43] and ICM [44]), followed in some cases by scoring (SS, ChemScore or the internal ICM score). The 10% of best docked compounds were then used for training the supervised machine-learning algorithms. The classification models were then tested on the rest of the active compounds (data not shown). Our results support the possibility of training machine learning algorithms on docking and scoring results. Such trained models can later be applied to large databases in order to select ligands for further experimental verification. This procedure allows faster selection of compounds in virtual HTS experiments even in cases where no initial information about a set of active compounds for the selected protein target is known. The random scores for recall and precision, obtained by training the SVM on a randomly selected subset of 1000 compounds, are equal to 50%, and most methods are able to reach a recall/precision of up to 70%.
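Selecting the best-docked 10% and measuring the enrichment of such a selection can be sketched as follows. The function names, the toy scores and the convention that a lower docking score is better are assumptions. The enrichment factor is computed in its usual form, as the hit rate in the selected subset divided by the hit rate in the whole library.

```python
def top_ranked(scores, fraction=0.10, lower_is_better=True):
    """Return the identifiers of the best-scoring fraction of docked compounds."""
    ranked = sorted(scores, key=scores.get, reverse=not lower_is_better)
    n_top = max(1, round(len(ranked) * fraction))
    return ranked[:n_top]


def enrichment_factor(selected, actives, library_size):
    """Hit rate among the selected compounds relative to the whole library."""
    hits = sum(1 for compound in selected if compound in actives)
    return (hits / len(selected)) / (len(actives) / library_size)


# Toy library of 20 docked compounds; c0 has the best (lowest) score.
scores = {f"c{i}": float(i) for i in range(20)}
actives = {"c0", "c5", "c9", "c15"}

best = top_ranked(scores)  # would serve as the SVM training positives
print(best, enrichment_factor(best, actives, len(scores)))
```

In this toy case one of the two selected compounds is active, against four actives in twenty compounds overall, so the selection is enriched 2.5-fold over random picking.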

Our results are in close agreement with other comparative studies [4, 7, 36, 45, 46]. The SVM method is a fast and reliable machine-learning method that outperforms other types of algorithms. It is also well suited for the classification of small molecules using 2D descriptors with respect to their potential inhibition of selected target classes. The molecular descriptors should be selected so as to balance the general and the detailed level of description. The MIX tools descriptors [16] are useful for the classification of compounds by SVM with respect to their potential for inhibition of selected protein targets. Subsequent SVM training on the results of docking experiments speeds up the vHTS procedure and enriches the hit list.

Our aim was to present a possible in silico application of machine learning methods to HTS data. The number of HTS hits can become large, requiring some type of prioritization. The set of active compounds often contains a substantial number of false positives. False negatives are also important but difficult to identify experimentally. Applying a machine learning model can help to identify true positives and to select compounds for retesting. In yet another application, the results of the HTS itself can be used for training, and the model can then be used to identify new compounds not present in the screening collection, to be synthesized or bought from a commercial source.

These tools can also be valuable whenever a large data set of molecules is to be screened in order to select structures that have a higher likelihood of being inhibitors. Such molecules are often desired in order to enrich the in-house target-specific libraries of pharmaceutical companies. An empirically derived in silico method can also help to set priorities within the list of accessible in-house molecules to be tested experimentally. A similar approach can be used to enrich the initial set of patented molecules by including compounds from large commercial collections that are predicted to be active for the same protein target family.

ACKNOWLEDGMENTS<br />

This work was supported by EC BioSapiens (LHSG-CT-2003-503265) and EC SEPSDA (SP22-CT-2004-<br />

003831) 6FP projects as well as the Polish Ministry of Education and Science (PBZ-MNiI-2/1/2005 and<br />

2P05A00130). MvG and LSW would like to thank the Foundation for Polish Science for the fellowship.<br />



Table I. SVM classification results for the 5 selected targets on launched and preclinical inhibitors from the MDDR database.

| Protein target | Dataset | Observed 0, predicted 0 | Observed 0, predicted 1 | Observed 1, predicted 0 | Observed 1, predicted 1 | Recall | Precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| COX2 | Training | 2106 | 0 | 6 | 69 | 92% | 100% |
| COX2 | Testing | 8308 | 38 | 10 | 27 | 73% | 42% |
| DH | Training | 2151 | 0 | 0 | 17 | 100% | 100% |
| DH | Testing | 8390 | 9 | 3 | 8 | 73% | 47% |
| TH | Training | 2032 | 0 | 1 | 47 | 98% | 100% |
| TH | Testing | 8401 | 22 | 9 | 26 | 74% | 54% |
| RT | Training | 2130 | 0 | 0 | 54 | 100% | 100% |
| RT | Testing | 8277 | 43 | 24 | 11 | 31% | 20% |
| AE | Training | 2351 | 2 | 0 | 22 | 100% | 92% |
| AE | Testing | 9028 | 24 | 7 | 5 | 42% | 17% |

The SVM classification performance on the set of preclinical or launched inhibitors from the MDDR database is described here using confusion tables. For each target, compounds are grouped by the class observed in experiments (0 = inactive, 1 = active) and by the class predicted by the model. The protein targets are: cyclooxygenase-2 (COX2), dihydrofolate (DH), thrombin (TH), reverse transcriptase (RT) and antiestrogen (AE). Training datasets contain 2/3 of the available positives (preclinical or launched inhibitors for the selected target) and 1/3 of the negatives (a randomly selected subset of preclinical or launched compounds known not to inhibit the selected target). Testing datasets contain 1/3 of the positives and 2/3 of the available negatives not used in training the SVM method. The last two columns give the recall and precision values on the training and testing datasets.


Table II. The classification accuracy of the SVM method on the 5 selected targets for biologically tested inhibitors from the MDL Drug Data Report.

| Protein target | Number of training negatives / positives | Number of biologically tested compounds | Recall / precision of the best method |
| --- | --- | --- | --- |
| cyclooxygenase-2 | 10408 / 110 | 793 | 69% / 95% |
| dihydrofolate | 10492 / 28 | 154 | 71% / 98% |
| thrombin | 10418 / 111 | 1070 | 62% / 98% |
| reverse transcriptase | 10406 / 113 | 596 | 52% / 96% |
| antiestrogen | 12164 / 34 | 255 | 40% / 96% |

The SVM algorithm is trained here on the whole set of preclinical or launched inhibitors and non-active compounds from the MDL Drug Data Report for the five selected protein targets. The classification models were then tested on compounds annotated in MDDR as biologically tested on the selected target (excluding those that are preclinical or already launched on the market).

The first column gives the protein target name. The second column gives the numbers of negative and positive instances used in training. The third column shows the number of compounds annotated by the MDL Drug Data Report as biologically tested on the selected protein target. The recall and precision values for the SVM method are in the fourth column.


Table III. The classification accuracy for support vector machine method on selected 5 targets for<br />

patented compounds from MDL drug data report.<br />

Protein Target Number<br />

of training<br />

negatives/<br />

positives<br />

oldest 1/3<br />

Cyclooxygenase-2<br />

Number<br />

of testing<br />

negatives/<br />

positives<br />

newest 2/3<br />

12<br />

Recall/<br />

Precision<br />

on the<br />

training<br />

dataset<br />

Recall/<br />

Precision<br />

on the<br />

testing<br />

dataset<br />

2106 8347 94% 64%<br />

31 81 100% 66%<br />

dihydrofolate 2151 8390 100% 50%<br />

12 16 100% 44%<br />

thrombin 2034 8425 100% 33%<br />

34 76 100% 66%<br />

reverse 2130 8323 100% 46%<br />

transcriptase 54 59 100% 33%<br />

antiestrogen 2353 9700 100% 9%<br />

6 22 100% 15%<br />

Classification performance of the supervised SVM method trained using AP descriptors. The list of protein targets includes: cyclooxygenase-2 (COX2), dihydrofolate (DH), thrombin (TH), reverse transcriptase (RT) and antiestrogen (AE). The oldest one-third (first four columns) or the oldest two-thirds (last four columns) of a subset of MDL drug data report inhibitors and non-active compounds for those targets were used for training supervised machine-learning algorithms. The classification models were then tested on the rest of the patented compounds.

The first column presents the protein target name. The second column gives the numbers of negative and positive instances used in training (the oldest 1/3 of the patented compounds). The third column shows the numbers of compounds among the newest 2/3 that are patented and annotated in the MDL drug data report for inhibition of the selected protein target. The recall and precision values for the SVM method on the training datasets are given in the fourth column. The fifth column presents the recall and precision values on the testing dataset.
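The time-based split described in this caption (train on the oldest third of the patented compounds, test on the newer remainder) can be sketched as follows. This is an illustrative sketch only: the `Compound` record, its `year` field and the `chronological_split` helper are hypothetical names, and the paper does not state how compounds with equal dates were ordered.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Compound:
    name: str
    year: int      # hypothetical date field used only for ordering
    active: bool   # True for annotated inhibitors, False for non-actives


def chronological_split(
    compounds: Sequence[Compound], train_fraction: float = 1 / 3
) -> Tuple[List[Compound], List[Compound]]:
    """Sort by date and return (oldest part for training, newer part for testing)."""
    ordered = sorted(compounds, key=lambda c: c.year)  # stable sort, oldest first
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical toy data, for illustration only.
compounds = [
    Compound("c1", 1995, True),
    Compound("c2", 1990, False),
    Compound("c3", 2001, True),
    Compound("c4", 1992, False),
    Compound("c5", 1999, False),
    Compound("c6", 2003, True),
]
train, test = chronological_split(compounds)
print([c.name for c in train])  # oldest third
print([c.name for c in test])   # newest two-thirds
```

Splitting by date rather than at random mimics the prospective setting, where a model built on older compounds is asked to recognise chemistry that was patented later.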


Table IV. Classification accuracy of the support vector machine method on the 5 selected targets for the top 10% of best-docked compounds.

| Protein target | # positives in MDDR db | FLOG docking results | Best docking methods | Best scoring methods | Best D&S results | Recall / Precision on the training dataset |
|---|---|---|---|---|---|---|
| Cyclooxygenase-2 | 98 | 1 | GLIDE / FRED | SS / ChemScore | 60 / 63 | 93.20% / 98.97% |
| dihydrofolate | 27 | 6 | ICM | ICM | 10 | 64.74% / 62.43% |
| thrombin | 99 | 46 | ICM | ICM | 99 | 72.64% / 68.28% |
| reverse transcriptase | 108 | 0 | ICM | ICM | 10 | 79.29% / 80.16% |
| antiestrogen | 32 | 17 | GLIDE / FRED | GlideScore / ChemScore | 24 / 23 | 74.00% / 71.36% |

Classification performance of the supervised SVM method trained using AP descriptors. The list of protein targets includes: cyclooxygenase-2 (COX2), dihydrofolate (DH), thrombin (TH), reverse transcriptase (RT) and antiestrogen (AE). A subset of MDL drug data report compounds, including both active and non-active compounds for those targets, was docked to the protein targets using various docking methods. The best-docked 10% of compounds were then used for training supervised machine-learning algorithms. The classification models were then tested on the rest of the active compounds.

The first column presents the protein target name. The second column gives the number of active compounds found in the MDL drug data report database for the selected protein target. The number of negatives used in the docking experiment was fixed at 10000 for all targets. The third column shows the results of the FLOG fast and flexible docking procedure, i.e. the number of active compounds found in the first 10% of the ligands ordered by FLOG docking score. The fourth and fifth columns give the names of the best docking and scoring methods. The sixth column presents the number of active compounds found in the first 10% of the list of ligands ordered by the best docking program followed by the scoring. The set of the 10% best-scoring compounds was then used to train the support vector machine (SVM). The recall and precision values for the training are presented in the seventh column.
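The selection step described in this caption (rank the docked ligands by score, keep the best 10%, and count the known actives among them) can be sketched as below. This is a sketch under stated assumptions: the function `top_fraction_actives`, the tuple layout and the convention that lower scores are better are all illustrative, and real docking programs differ in the sign and scale of their scores.

```python
from typing import List, Sequence, Tuple

# (ligand name, docking score, is a known active) -- hypothetical record layout
ScoredLigand = Tuple[str, float, bool]


def top_fraction_actives(
    scored: Sequence[ScoredLigand],
    fraction: float = 0.10,
    lower_is_better: bool = True,
) -> Tuple[List[ScoredLigand], int]:
    """Return the best-scoring fraction of ligands and the number of actives in it."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=not lower_is_better)
    n_top = max(1, round(len(ranked) * fraction))  # keep at least one ligand
    top = ranked[:n_top]
    n_actives = sum(1 for _, _, is_active in top if is_active)
    return top, n_actives


# Hypothetical toy data: 20 ligands with scores 0, -1, ..., -19; two are actives.
known_actives = {"lig19", "lig3"}
ligands = [(f"lig{i}", -float(i), f"lig{i}" in known_actives) for i in range(20)]

top, n_actives = top_fraction_actives(ligands)
print([name for name, _, _ in top], n_actives)
```

The ligands returned in `top` would then form the training set for the classifier, and `n_actives` corresponds to the kind of count reported in the docking-result columns of the table.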


