Target Specific Compound Identification using Support Vector Machine. 

Dariusz Plewczynski 1,2* , Marcin von Grotthuss 1 , Stephane Spieser 3 , Leszek Rychewski 1 , Lucjan S. Wyrwicz 1 , 

Uwe Koch 3 

1) BioInfoBank Institute, Limanowskiego 24A/16, 60-744 Poznan, Poland, Tel: +48-61-8653520, Fax: 

+48-61-8643350, E-mail: darman@bioinfo.pl 

2) Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, 

Warsaw, Poland 

3) Istituto di Ricerche di Biologia Molecolare (IRBM) “P. Angeletti”, Merck&Co. Inc., Pomezia, Italy 

* The correspondence should be addressed to Dariusz Plewczynski (darman@bioinfo.pl) and Uwe Koch 

(uwe_koch@merck.com). 

RUNNING TITLE: Target Specific Compound Identification by Support Vector Machine. 

KEYWORDS: 

1) Compound Identification, 

2) Protein target specificity, 

3) MDL Drug Data Report, 

4) Machine-learning methods, 

5) Atom Pairs, 

6) Support Vector Machine. 

ABBREVIATIONS: 

1. SVM – support vector machine, 

2. AP – atom pairs, 

3. MDDR - MDL Drug Data Report. 

ABSTRACT 

In many cases at the beginning of a HTS-campaign some information about active molecules is already 

available. Often active compounds (such as substrate analogues, natural products, inhibitors of a related protein 

or ligands published by a pharmaceutical company) have been identified in low-throughput validation studies of 

the biochemical target. We evaluate to what extent a support vector machine can be trained on those 

compounds and used to classify a collection with unknown activity. This approach is aimed at reducing the 

number of compounds to be tested against the given target. Our method predicts biological activity of chemical 

compounds based only on the Atom Pairs (AP) two dimensional topological descriptors. The supervised 

Support Vector Machine (SVM) method is trained here on compounds from the MDL drug data report 

(MDDR) known to be active against a specific protein target. For detailed analysis we selected five different 

biological targets: cyclooxygenase-2, dihydrofolate reductase, thrombin, HIV reverse transcriptase and 

antagonists of the estrogen receptor. The accuracy of compound identification is estimated here using the 

recall and precision values. The sensitivities for all types of targets are over 80% and the classification 

performance reaches 100% for selected targets. The second application of our method addresses the case 

in which no initial set of actives is known for the selected protein target at the beginning of an HTS campaign. In 

that case virtual high-throughput screening (vHTS) is applied, in most cases using a flexible docking procedure. A vHTS 

experiment typically yields a large percentage of false positives that must be verified by costly and 

time-consuming experimental follow-up assays. The subsequent use of our machine learning method improves the 

speed (docking does not have to be performed on all compounds of the database) and also the accuracy of 

HTS hit lists (the enrichment factor). 

INTRODUCTION 

Genomic research provides an ever increasing number of potential drug targets. In the past large compound 

collections were tested for a single target. Recently it has become common practice in applied research to screen 

large collections of compounds for potential activity in in vitro high-throughput screening (HTS) model studies 

to identify new lead compounds. However, with a larger number of drug targets the question often arises how 

to preselect chemical versus biological space more efficiently. The aim of the present study is to present a fast 

and reliable in silico method that captures the essential features of inhibitor molecules. 

High throughput screening (HTS) allows for the testing of millions of compounds for activity against the 

chosen target. As a result a set of lead molecules with relatively high activity against the target is identified. 

The number of these compounds can be relatively high, and therefore they are usually subjected 

to further prioritization based on the assessment of various molecular characteristics. Although highly 

successful, this approach cannot be applied simultaneously to the large number of drug targets emerging from 

genomic research. One solution is to reduce the number of compounds to be tested to those with a high 

probability of activity. There are many ways in which computational methods can contribute to this 

process. We focus here on the application of the support vector machine (SVM), a supervised machine 

learning approach. We evaluated our method in terms of its capability to recognize known ligands for five 

divergent protein targets of the highest medicinal relevance, which already have been investigated in several 

drug discovery programs. 

Chemists have gathered expertise on the features of molecular structures that are important for the inhibition of 

specific targets. Thus, even the 2D structure of a ligand allows for some estimate of its activity against a given protein 

target. We tried to express this empirical knowledge about inhibitors in the form of a computational prediction 

model. Using the support vector machine algorithm, we classify compounds as active or inactive against a given 

biological target according to the commercially available MDL drug data report [1]. In the past, various 

machine-learning approaches have been used for a number of compound-based classification problems. For 

example, neural networks have been used as drug-likeness filters to distinguish drugs from non-drugs, and to 

classify compounds based on their ADME properties, toxicity and target specificity. In many of these 

applications the use of parameters describing the compound’s topology gave satisfactory results. That is why 

we used two-dimensional AP ligand descriptors to represent the variety of chemical space. 

Target identification is a critical step following the discovery of small molecules. In [2] Nidhi et al. 

provided an in silico method for predicting potential targets for compounds on the basis of chemical structure 

alone. They used the multiple-category Laplacian-modified naive Bayesian model trained on extended- 

connectivity fingerprints of compounds from 964 target classes in the WOMBAT (World Of Molecular 

BioAcTivity) chemogenomics database. The algorithm was then tested by finding the three most likely 

protein targets for all MDDR (MDL Drug Data Report) database compounds [1]. On average, the correct 

target was found 77% of the time for compounds from 10 MDDR activity classes with known targets [2]. The 

support vector machine was used recently to describe high-throughput screening (HTS) data with great success 

[3]. With carefully selected parameters, SVM models increased the hit rates significantly, and 50% of the active 

compounds could be recovered by screening just 7% of the test set. The authors found that the size of the 

training set played a significant role in the performance of the models, i.e. a training set with 10,000 member 

compounds is likely the minimum size required to build a model with reasonable predictive power [3]. In other 

work, several machine learning techniques were applied to an in-house data set of small-molecule structures, 

encoded by Ghose-Crippen parameters, to distinguish between kinase inhibitors and other molecules 

with no reported activity on any protein kinase [4]. The authors compared four approaches: support vector machines 

(SVM), artificial neural networks (ANN), k nearest neighbor classification with GA-optimized feature selection 

(GA/kNN), and recursive partitioning (RP). Support-vector machines, followed by the GA/kNN combination, 

outperformed the other techniques when comparing the average of individual models. An approach similar to ours 

is presented in [5]. The authors performed virtual screening using very simple features, employing the 

number of atoms per element as molecular descriptors but without regard to any structural information 

whatsoever. These atom counts are able to outperform virtual-affinity-based fingerprints and Unity fingerprints 

in some activity classes. This fact can partly be explained by highly nonlinear structure-activity relationships, 

which represent a severe limitation of the "similar property principle" in the context of bioactivity [5]. 

In our previous paper [6] we answered the following question: “How well do different classification methods 

perform in selecting the ligands of a protein target out of large compound collections not used to train the 

model?”. In this work support vector machines, random forest, artificial neural networks, k-nearest-neighbor 

classification with genetic-algorithm-optimized feature selection, trend vectors, naive Bayesian classification, 

and decision tree were used to divide databases into molecules predicted to be active and those predicted to be 

inactive. Training and predicted activities were treated as binary. We reported significant differences in the 

performance of the methods independent of the biological target and compound class. Different methods can 

have different applications; some provide particularly high enrichment, others are strong in retrieving the 

maximum number of actives. We also showed that these methods do surprisingly well in predicting recently 

published ligands of a target on the basis of initial leads and that a combination of the results of different 

methods in certain cases can improve results compared to the most consistent method. In the present paper we 

focus only on our novel SVM-based method and provide a more in-depth description of the 

methodology and results. We believe that this paper can help readers to use our protocol for solving 

similar HTS problems. We therefore decided not to include a comparison of all currently known approaches, 

which is outside the scope of this manuscript. Our results provide higher sensitivity and selectivity 

compared to other recently published methods (such as a combination of SVM with naïve Bayesian classifiers trained on 

Ghose-Crippen parameters, and others [2-5]). 

The list of hits generated by virtual high-throughput screening (vHTS) typically contains a large percentage of 

false positives, making experimental follow-up assays necessary to distinguish active from inactive substances. 

Here we present another application of the SVM-based method, aimed at improving the accuracy of 

HTS hit lists by the subsequent use of a machine learning method. Virtual screening is often 

performed on large chemical libraries, so selecting hits with statistical algorithms instead of a time-consuming 

docking procedure is of great importance [7]. We address this problem in a case study on five protein targets: 

HIV reverse transcriptase, COX2, dihydrofolate reductase, estrogen receptor and thrombin, in conjunction with the 

MDL Drug Data Report database [1]. The virtual HTS was performed with a set of different flexible docking 

and scoring methods. Our results reveal that the support vector machine algorithm is able to speed up the vHTS 

procedure by limiting the set of ligands to be docked on a target, and also provides better enrichment of the 

HTS hit rate. 

COMPUTATIONAL METHODS 

Data sets and AP descriptors 

Five diverse protein targets were tested: human cyclooxygenase-2 [8], dihydrofolate reductase [9, 10], thrombin 

[11, 12], antiestrogen [13] and HIV reverse transcriptase [14]. The datasets used for training and testing 

comprise both active and inactive compounds from a subset of the MDL drug data report [1]. All 

compounds are clinically tested or already launched on the market. Inhibitors and non-active compounds for 

those targets were used for training the supervised machine-learning algorithm. For additional tests we 

also selected compounds that are currently being biologically tested for inhibition of the targets and are not yet on the market 

or in clinical tests. 

The entire pool of compounds for cyclooxygenase-2 target contains 112 inhibitors and 10452 inactive 

compounds divided randomly into two subsets: training (75 inhibitors, 2106 inactive ones) and testing 

(respectively 37 and 8346). In the case of dihydrofolate we have 28 inhibitors (divided into 17 for training and 

11 for testing) and 10529 inactive ones (2149 and 8380). For reverse transcriptase we have selected 114 

inhibitors (79 and 35) and 10450 inactive compounds (2130 and 8320), and for thrombin we have 112 

inhibitors (77 and 35) and 10459 inactive ones (2036 and 8423). For the last target we have collected 34 

inhibitors (22 and 12) and 11580 inactive compounds (2528 for training and 9052 for testing). For the additional 

test we selected molecules biologically tested for inhibition of cyclooxygenase-2 (792 molecules), 

dihydrofolate (154), thrombin (1066), reverse transcriptase (597) and antiestrogen (256). 

We used the regular atom pair (AP) descriptors [15] because of their proven success in classifying 

compounds, ease of use and interpretability. To encode the molecular structures we employed the MIX tools 

script [16], which determines for each pair of atoms the number of covalent bonds that join them. For each 

compound it thus yields a binary vector with 1 for every atom pair type present in the molecule and 0 for every 

type that is absent. In addition, we tested a larger set of 6 additional descriptors, such as TT (regular 

topological torsion), DP (pairs using SQ types), DT (torsions using SQ types), DRUGBITS (substructures) and 

the ROF6 set of descriptors [16]. Including those additional descriptors in the training did not yield 

any significant improvement of the results. 
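
The encoding described above can be illustrated with a minimal sketch. It is not the MIX tools script [16]: the atom typing is simplified to the element symbol only (an assumption; the descriptors of [15] use richer atom types), and the function names `bond_distances`, `atom_pair_types` and `binary_vectors` are hypothetical. Molecules are given as heavy-atom graphs, and the bond count between two atoms is taken along the shortest path.

```python
from collections import deque
from itertools import combinations

def bond_distances(n_atoms, bonds):
    """All-pairs shortest path lengths (counted in bonds) via breadth-first search."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    distances = {}
    for start in range(n_atoms):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        distances[start] = seen
    return distances

def atom_pair_types(elements, bonds):
    """Set of (element, element, bond distance) types present in one molecule.

    Assumption: atoms are typed by element symbol only.
    """
    distances = bond_distances(len(elements), bonds)
    types = set()
    for i, j in combinations(range(len(elements)), 2):
        if j in distances[i]:  # ignore pairs in disconnected fragments
            a, b = sorted((elements[i], elements[j]))
            types.add((a, b, distances[i][j]))
    return types

def binary_vectors(molecules):
    """Binary vectors over the vocabulary of all atom pair types seen in the set."""
    per_molecule = [atom_pair_types(e, b) for e, b in molecules]
    vocabulary = sorted(set().union(*per_molecule))
    index = {t: k for k, t in enumerate(vocabulary)}
    vectors = []
    for types in per_molecule:
        vector = [0] * len(vocabulary)
        for t in types:
            vector[index[t]] = 1
        vectors.append(vector)
    return vocabulary, vectors

# Toy heavy-atom graphs (hydrogens omitted): ethanol C-C-O and dimethyl ether C-O-C.
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
ether = (["C", "O", "C"], [(0, 1), (1, 2)])
vocabulary, vectors = binary_vectors([ethanol, ether])
```

Because only the presence of a pair type is recorded, the resulting vectors are sparse for realistic vocabularies, which is the property the SVM implementation discussed later relies on.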

Classification and model validation 

We trained the support vector machine algorithm separately for each target. First, we created the dataset of 

compounds with experimentally verified activity (positive instances). Then we built the dataset of inactive 

compounds (negative instances). The negative instances were chosen randomly from launched or preclinical 

compounds that have no experimentally verified activity for the selected target. These two datasets 

(positive and negative instances) were then projected as sets of points into a multidimensional space using the 

two-dimensional AP descriptors described in the previous subsection. 

It is well known that some machine-learning methods have difficulties handling unbalanced training sets, i.e. 

when the number of positive instances is substantially different from the number of negatives. Therefore, for 

each target we randomly selected 1/3 of the negatives for training and 2/3 for testing, and 2/3 of the positive 

instances for training and 1/3 for testing. The selections were repeated a few times for all targets, yet no 

significant differences were observed between the various selections. We present here the average results for 

the training and testing phases of the experiments. Models are derived for each training set independently and used for 

prediction of the activity of the compounds in the test set. 
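
The asymmetric split can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_for_training`, the fixed random seed and the rounding of the 2/3 and 1/3 fractions are hypothetical choices, and the labels +1/−1 follow the convention used for the SVM below.

```python
import random

def split_for_training(positives, negatives, seed=0):
    """Random split: 2/3 of positives and 1/3 of negatives go to training.

    Returns (train, test) as lists of (compound, label) with label +1 or -1.
    """
    rng = random.Random(seed)  # seeded so that one selection is reproducible
    pos = list(positives)
    neg = list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_pos_train = round(len(pos) * 2 / 3)
    n_neg_train = round(len(neg) / 3)
    train = [(c, +1) for c in pos[:n_pos_train]] + [(c, -1) for c in neg[:n_neg_train]]
    test = [(c, +1) for c in pos[n_pos_train:]] + [(c, -1) for c in neg[n_neg_train:]]
    return train, test

# Toy identifiers standing in for compounds.
actives = [f"active_{i}" for i in range(30)]
inactives = [f"inactive_{i}" for i in range(300)]
train, test = split_for_training(actives, inactives, seed=1)
```

Repeating the call with different seeds corresponds to repeating the selection a few times and averaging the results.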

There are many ways to present the performance of an SVM classifier. We use here the classification error E, precision P and 

recall R values, together with confusion tables. Their definitions are given below: 

$$E = \frac{fp + fn}{tp + fp + tn + fn} \cdot 100\%, \qquad R = \frac{tp}{tp + fn} \cdot 100\%, \qquad P = \frac{tp}{tp + fp} \cdot 100\%, \qquad \text{[Eq. 1]}$$

where tp is the number of true positives, fp is the number of false positives, tn is the number of true negatives 

and fn is the number of false negatives. The classification error E provides an overall error measure, whereas 

recall R gives the percentage of observed positives that are correctly predicted (the probability that an active 

compound is recognized), and precision P gives the percentage of positive predictions that are correct (the 

measure of the reliability of positive instances prediction). 
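
As a worked example, the three measures can be computed directly from a confusion table. The sketch below uses the counts reported in Table I for the cyclooxygenase-2 testing dataset; the function name `classification_metrics` is hypothetical, and returning 0 when a denominator is empty is an assumption made only to keep the example total.

```python
def classification_metrics(tp, fp, tn, fn):
    """Classification error E, recall R and precision P, all in percent."""
    total = tp + fp + tn + fn
    error = 100.0 * (fp + fn) / total
    # Assumption: report 0 when there are no observed or no predicted positives.
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    return error, recall, precision

# COX2 testing dataset, Table I: 8308 true negatives, 38 false positives,
# 10 false negatives and 27 true positives.
error, recall, precision = classification_metrics(tp=27, fp=38, tn=8308, fn=10)
print(f"E = {error:.2f}%, R = {recall:.0f}%, P = {precision:.0f}%")
```

The rounded recall and precision reproduce the 73% and 42% listed for this target in Table I, which also shows how a very small classification error can coexist with modest precision when negatives dominate the test set.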

The Support Vector Machine (SVM) Method. 

SVM is an effective statistical learning method [17-19] with good performance, yet it is easier to implement 

than neural networks. It has been successfully applied to various problems including text classification [20], image 

recognition tasks [21], bioinformatics [22, 23] and medical applications [24, 25]. The SVM approach has 

also been used in the analysis of gene expression data [26], the classification of microarray data [27], the inference of gene 

functional classification [28-30] and protein analysis [31-33]. 

Most of those tasks share the property of sparse instance vectors. The SVM approach is able to 

construct predictive models with large generalization power even when the dimensionality of the 

data is high and the number of observations available for training is low. SVM training seeks a globally optimal 

solution and is designed to limit over-fitting, so a large number of features (as in our binary representation of ligand 

topology) can be used. The SVMlight implementation by Thorsten Joachims [34] is used in the field of 

bioinformatics [35]. Its crucial idea is to exploit the sparsity of the instance vectors to obtain a compact and 

efficient representation. 

The output of the training phase is a classification function, i.e. a model. It consists of the set of D 

support vectors T_i and the coefficients α_i, which are positive real numbers. Those constants are obtained from the 

optimization procedure (a quadratic programming, QP, problem) used to find the maximal margin hyperplane. The 

number of free parameters of the QP problem is equal to the number of all instances in the training dataset. The 

non-zero parameter α_i describes the weight of the i-th support vector in the decision function. 

SVM chooses as support vectors those points that lie closest to the separating hyperplane. The kernel function 

defines the feature space reached from the embedding space through a nonlinear mapping function Ω. The mapping 

function Ω need not be defined explicitly, because the kernel function uses only the inner product of mapped points. The 

kernel function is a positive definite function reflecting the similarity between an input sample and the set of 

support vectors T_i. In most cases one of three types of kernels is used: linear, polynomial or radial basis function. 

The reliability of a classification of a ligand [36] as an active one is given by the cost function: 

$$f(T_x) = \sum_{i=1}^{D} l_i \, \alpha_i \, K\big(\Omega[T_x], \Omega[T_i]\big), \qquad \text{[Eq. 2]}$$

where K(·,·) is the proper kernel function that defines the feature space, Ω is a nonlinear mapping function 

from the embedding space T into the feature space, and l_i are the class labels of the support vectors, known a priori. We 

use l_i = +1 for positive cases and l_i = −1 for negative ones. 
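
A minimal sketch of how such a decision function is evaluated is given below. It is not the SVMlight code: the support vectors, labels and coefficients are toy values rather than the output of a QP solver, the kernel parameters (`degree`, `coef0`, `gamma`) and the optional `bias` term are assumptions that do not appear in the equation above, and all function names are hypothetical.

```python
import math

def linear_kernel(x, y):
    """Inner product of two descriptor vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel; degree and coef0 are illustrative defaults."""
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel; gamma is an illustrative default."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, labels, alphas, kernel=linear_kernel, bias=0.0):
    """Sum over support vectors of l_i * alpha_i * K(x, T_i), plus an optional bias."""
    return bias + sum(
        l * a * kernel(x, t) for t, l, a in zip(support_vectors, labels, alphas)
    )

def predict(x, support_vectors, labels, alphas, kernel=linear_kernel, bias=0.0):
    """Class +1 (active) for a positive decision value, otherwise -1 (inactive)."""
    value = decision_value(x, support_vectors, labels, alphas, kernel, bias)
    return 1 if value > 0 else -1

# Toy model with two support vectors over three binary atom pair bits.
support_vectors = [[1, 1, 0], [0, 0, 1]]
labels = [+1, -1]
alphas = [0.5, 0.5]

active_like = predict([1, 0, 0], support_vectors, labels, alphas)
inactive_like = predict([0, 0, 1], support_vectors, labels, alphas, kernel=rbf_kernel)
```

Only the kernel is ever evaluated; the mapping Ω never appears in the code, which mirrors the statement that it need not be defined explicitly.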

RESULTS AND DISCUSSION 

The major goal of this study was to test to what extent the supervised support vector machine method is capable 

of learning and predicting the target specific inhibition likelihood for chemical compounds based on MDL drug 

data report [1]. For each of the 5 protein targets we prepared 2 datasets, one for training and one for testing. Each 

includes compounds that are known to inhibit the protein target (2/3 of all available positives for training, 1/3 

for testing) and those known not to be active for the selected target (1/3 of all available negatives for training, 

2/3 for testing). The support vector machine algorithm was trained on the first data set and tested on the second one. 

In the following paragraphs we discuss the classification results obtained for each of the biological targets and 

present the performance using confusion tables for the training and testing datasets. In addition, we provide the 

overall value of the classification error and the precision/recall values. In general, the SVM models successfully 

classify compounds for all targets (see Table I for confusion tables for training and testing datasets). The method 

provides robust and reliable models for all types of protein targets. The SVM algorithm turns out to be a method 

of choice for practical purposes: it is very fast, efficient and robust. 

We performed two additional in silico experiments. The first one uses for training all active compounds available in the MDL drug 

data report that are annotated as preclinical or launched. The testing is done on potential inhibitors 

that are annotated as biologically tested. The second experiment trains the SVM method on the oldest 1/3 of the 

compounds known to be inhibitors, and tests its accuracy on the rest: the newest 2/3 of the developed and patented 

compounds. The performance of the machine-learning models in both experiments for each type of protein target is 

described by the recall R and the precision P. The recall R gives the percentage of observed positives that are correctly predicted, 

whereas the precision P gives the percentage of positive predictions that are correct. These measures of 

accuracy are calculated separately for each type of protein target and presented in Table II (the first 

experiment) and Table III (the second experiment). The typical recall value is around 60%, and the precision P 

is close to 100% for all targets (the first experiment). The results for the second experiment are slightly worse, 

which is caused by the discovery of novel drug classes that are present in the newest 2/3 of the compounds. 

In Table IV we present the results of enrichment studies of virtual high-throughput screening. We trained the SVM 

algorithm on the top 10% of the best scoring ligands from the docking and scoring experiments for each 

protein target. The subset of the MDL drug data report, including both active and non-active compounds 

for those targets, was docked on the protein targets using various docking methods (FLOG [37], GLIDE [38, 39], 

FRED [40], Dock [41], Autodock [42, 43] and ICM [44]), followed in some cases by scoring (SS, 

ChemScore or the internal ICM score). The 10% of best docked compounds were then used for training the supervised 

machine-learning algorithms. The classification models were then tested on the rest of the active compounds (data 

not shown). Our results support the possibility of training machine learning algorithms on docking and scoring results. 

Models trained in this way can later be applied to large databases in order to select ligands for further experimental 

verification. This procedure will allow for faster selection of compounds in virtual HTS experiments even in 

cases where no initial information about a set of active compounds for the selected protein target is known. The 

random scores for recall and precision, obtained by training the SVM on a randomly selected subset of 1000 compounds, 

are equal to 50%, and most methods are able to reach recall/precision of up to 70%. 
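
The enrichment of a ranked docking list can be quantified with a short sketch. The text does not define the enrichment factor explicitly, so the commonly used definition is assumed here: the fraction of actives among the top-ranked compounds divided by the fraction of actives in the whole list. The function name `enrichment_factor`, the `higher_is_better` flag and the toy scores are hypothetical.

```python
def enrichment_factor(scores, is_active, fraction=0.10, higher_is_better=True):
    """Enrichment of actives in the top `fraction` of a score-ordered list.

    Assumed definition: (actives in top / size of top) / (all actives / all compounds).
    """
    n = len(scores)
    n_top = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: scores[i], reverse=higher_is_better)
    hits_in_top = sum(is_active[i] for i in order[:n_top])
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("enrichment is undefined without any active compound")
    return (hits_in_top / n_top) / (total_actives / n)

# Toy list of 10 docked compounds, two of which are active and score best.
scores = [9.1, 8.7, 7.5, 6.0, 5.5, 5.1, 4.2, 3.3, 2.0, 1.0]
is_active = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_top20 = enrichment_factor(scores, is_active, fraction=0.20)
```

A value of 1 corresponds to random selection; whether larger or smaller docking scores are better depends on the docking program, hence the `higher_is_better` flag.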

Our results are in close agreement with other comparative studies [4, 7, 36, 45, 46]. The SVM method is a fast 

and reliable machine-learning method that outperformed the other types of algorithms in those comparisons. It is also well suited for the 

classification of small molecules using 2D descriptors with respect to their potential inhibition of selected 

target classes. The selection of molecular descriptors should strike a balance between a 

general and a detailed level of description. The MIX tools descriptors [16] are useful for the classification of 

compounds by SVM with respect to their potential for inhibition of selected protein targets. The subsequent 

SVM training on the results of a docking experiment speeds up the vHTS procedure and enriches the hit list. 

Our aim was to present possible in silico applications of machine learning methods to HTS data. The number 

of HTS hits can become large requiring some type of prioritization. The set of active compounds often contains 

a substantial number of false positives. False negatives are also important but difficult to identify 

experimentally. Application of a machine learning model can help to identify true positives and help to select 

compounds for retesting. In still another application the results of the HTS itself can be used for training and the 

model used to identify new compounds not present in the screening collection to be synthesized or bought from 

a commercial source. 

These tools can be also valuable whenever a large data set of molecules is to be screened in order to select 

structures that have a higher likelihood of being inhibitors. Such molecules are often desired in order to enrich 

in-house target-specific libraries of pharmaceutical companies. An empirically derived in silico method can also 

help to set priorities within the list of accessible in-house molecules to be tested experimentally. A similar 

approach can be used to enrich the initial set of patented molecules by including compounds from the large 

commercial collections that are predicted to be active for the same protein target family. 

ACKNOWLEDGMENTS 

This work was supported by EC BioSapiens (LHSG-CT-2003-503265) and EC SEPSDA (SP22-CT-2004- 

003831) 6FP projects as well as the Polish Ministry of Education and Science (PBZ-MNiI-2/1/2005 and 

2P05A00130). MvG and LSW would like to thank the Foundation for Polish Science for the fellowship. 

Table I. SVM classification results for selected 5 targets on launched and preclinical inhibitors from 

MDDR database. 

Protein Target   Observed   Training dataset            Testing dataset             Recall/Precision   Recall/Precision
                            predicted 0   predicted 1   predicted 0   predicted 1   Training set       Testing set

COX2             0          2106          0             8308          38            92%                73%
                 1          6             69            10            27            100%               42%

DH               0          2151          0             8390          9             100%               73%
                 1          0             17            3             8             100%               47%

TH               0          2032          0             8401          22            98%                74%
                 1          1             47            9             26            100%               54%

RT               0          2130          0             8277          43            100%               31%
                 1          0             54            24            11            100%               20%

AE               0          2351          2             9028          24            100%               42%
                 1          0             22            7             5             92%                17%

The SVM classification performance on the set of preclinical or launched inhibitors from the MDDR database is 

described here using confusion tables. Rows represent the experimentally observed class of a compound for 

each target (1 = active, 0 = inactive), whereas columns represent the prediction results. The list of protein targets 

includes: cyclooxygenase-2 (COX2), dihydrofolate (DH), thrombin (TH), reverse transcriptase (RT) and 

antiestrogen (AE). First we present results on the training datasets with 2/3 of the available positives (preclinical or 

launched inhibitors for the selected target) and 1/3 of the negatives (a randomly selected subset of preclinical or 

launched compounds known not to inhibit the selected target). On the right are the results of the SVM models on the 

testing datasets containing 1/3 of the positives and 2/3 of the available negatives not used in training of the SVM method. 

The last two columns present the recall (upper value) and precision (lower value) on the training and testing datasets.

Table II. The classification accuracy for SVM method on selected 5 targets for biologically tested 

inhibitors of MDL drug data report. 

Protein Target          Number of             Number of biological   Recall/Precision
                        positives/negatives   testing compounds      of the best method

cyclooxygenase-2        110 / 10408           793                    69% / 95%
dihydrofolate           28 / 10492            154                    71% / 98%
thrombin                111 / 10418           1070                   62% / 98%
reverse transcriptase   113 / 10406           596                    52% / 96%
antiestrogen            34 / 12164            255                    40% / 96%

Inhibitors and non-active compounds from the MDL drug data report for the five selected targets. The SVM algorithm is trained 

here on the whole set of preclinical or launched inhibitors for the five protein targets. The classification models 

were then tested on compounds annotated in MDDR as biologically tested on the selected biological target 

(excluding those that are preclinical or already launched on the market). 

The first column presents the protein target name. The second column gives the numbers of positive and negative 

instances used in training. The third column shows the number of compounds annotated by the MDL drug data 

report as being biologically tested on the selected protein target. The recall and precision values for the SVM 

method are in the fourth column.

Table III. The classification accuracy for support vector machine method on selected 5 targets for 

patented compounds from MDL drug data report. 

Protein Target          Number of training    Number of testing     Recall/Precision   Recall/Precision
                        negatives/positives   negatives/positives   on the training    on the testing
                        (oldest 1/3)          (newest 2/3)          dataset            dataset

Cyclooxygenase-2        2106 / 31             8347 / 81             94% / 100%         64% / 66%
dihydrofolate           2151 / 12             8390 / 16             100% / 100%        50% / 44%
thrombin                2034 / 34             8425 / 76             100% / 100%        33% / 66%
reverse transcriptase   2130 / 54             8323 / 59             100% / 100%        46% / 33%
antiestrogen            2353 / 6              9700 / 22             100% / 100%        9% / 15%

The classification performance of the supervised SVM method trained using AP descriptors. The list of protein targets 

includes: cyclooxygenase-2 (COX2), dihydrofolate (DH), thrombin (TH), reverse transcriptase (RT) and 

antiestrogen (AE). The oldest one-third (first four columns) or the oldest two-thirds (last four columns) of a 

subset of MDL drug data report inhibitors and non-active compounds for those targets were used for training the 

supervised machine-learning algorithms. The classification models were then tested on the rest of the patented 

compounds. 

The first column presents the protein target name. The second column gives the numbers of negative and positive 

instances used in training (the oldest 1/3 of the patented compounds). The third column shows the number of compounds in the newest 2/3 

annotated by the MDL drug data report for inhibition of the selected protein target and patented. The recall 

and precision values for the SVM method on the training datasets are included in the fourth column. The fifth 

column presents the recall and precision values on the testing dataset.

Table IV. The classification accuracy for support vector machine method on selected 5 targets for top 

10% of best docked compounds. 

Protein Target          #positives in   FLOG docking   Best Docking   Best Scoring   Best D&S   Recall/Precision on
                        MDDR db         results        Methods        Methods        Results    the training dataset

Cyclooxygenase-2        98              1              GLIDE          SS             60         93.20% / 98.97%
                                                       FRED           ChemScore      63
dihydrofolate           27              6              ICM            ICM            10         64.74% / 62.43%
thrombin                99              46             ICM            ICM            99         72.64% / 68.28%
reverse transcriptase   108             0              ICM            ICM            10         79.29% / 80.16%
antiestrogen            32              17             GLIDE          GlideScore     24         74.00% / 71.36%
                                                       FRED           ChemScore      23

The supervised SVM method classification performance trained using AP descriptors. The list of protein targets 


antiestrogen (AE). The subset of MDL drug data report inhibitors including both active and non-active 

compounds for those targets were docked on protein targets using various docking methods. The 10% of best 

docked compounds were then used for training supervised machine-learning algorithms. The classification 

models were tested then on the rest of active compounds. 

The first column presents the protein target name. The second column gives the number of active compounds found in the MDL Drug Data Report database for the selected protein target. The number of negatives used in the docking experiment was fixed at 10000 for all targets. The third column shows the results of the FLOG fast and flexible docking procedure, i.e. the number of active compounds found in the first 10% of ligands ordered by FLOG docking score. The fourth and fifth columns give the names of the best docking and scoring methods. The sixth column presents the number of active compounds found in the first 10% of the list of ligands ordered by the best docking program followed by the scoring. This set of the 10% best-scoring compounds was then used to train the support vector machine (SVM). The recall and precision values for the training are presented in the seventh column.
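The "actives in the first 10%" counts in the third and sixth columns amount to ranking all docked ligands by score and counting the known actives in the top tenth. A minimal Python sketch follows, assuming hypothetical (score, is_active) pairs and assuming that lower docking scores are better (the convention differs between docking programs); it is not the authors' implementation.

```python
def actives_in_top_fraction(scored, fraction=0.10, lower_is_better=True):
    """Rank (score, is_active) pairs and count actives in the top fraction.

    Returns (number_of_actives_in_top, top_subset).
    """
    # Best-scoring ligands first; the score direction is an assumption.
    ranked = sorted(scored, key=lambda s: s[0], reverse=not lower_is_better)
    n_top = max(1, int(len(ranked) * fraction))  # size of the top slice (truncated)
    top = ranked[:n_top]
    n_actives = sum(1 for _, is_active in top if is_active)
    return n_actives, top


# Toy data (made up): 20 ligands with scores 0..19, three of them active.
scored = [(float(i), i in (0, 1, 5)) for i in range(20)]
n_actives, top = actives_in_top_fraction(scored)
```

The returned top subset corresponds to the set of compounds that would then be passed on as SVM training data.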

