Software for Data Management & (Q)SAR Applications - Cefic LRI

Software for Data Management & (Q)SAR Applications
Project EEM-9 Update, 2004-2006
Joanna Jaworska, Central Product Safety, Procter and Gamble, Belgium
Nina Jeliazkova, Bulgarian Academy of Sciences, Institute of Parallel Processing, Bulgaria


Software overview
‣ Data import and export, format conversions (EEM9-1, 2, 3)
‣ Chemical grouping, descriptor relevance assessment function (EEM9-1b)
‣ Database and search engine (EEM9-1a, b, 2, 3)
• Searches by CAS, SMILES, name
• Substructure search
• Similarity search
‣ Applicability domain (EEM9-1a)


AMBIT software design
‣ Open source
‣ Modular approach
‣ Stand-alone and web versions
‣ Implemented in Java, i.e.
• Platform independent
• Suitable for web applications without the need for rewriting the code
‣ The cheminformatics functionality relies on the open source Java library, the Chemistry Development Kit, available at http://cdk.sourceforge.net/


Database design: EEM9-3


AMBIT database compared …
‣ NCI database
• Format: SDF download
• Number: 250 000
• Comments: + 3D Corina optimized structures; + public; - very big (2G) file, needs preprocessing to be used
‣ EPA DSSTox
• Format: SDF
• Number: several thousands?
• Comments: + toxicological data available; - an (unsuccessful) approach to standardize the SDF file format
‣ EPA Aquire
• Format: text files, organized relationally; online query
• Number: ~2000
• Comments: - not well designed format (http://www.epa.gov/ecotox/data_download/aquire/aquire_structure.pdf)
‣ EPA Terretox
• Format: text files, organized relationally; online query
• Number: ~7000
• Comments: + well designed structure, tox data available; - needs preprocessing in order to use downloaded files; - small number of compounds compared to NCI
‣ NMRShiftDB
• Format: MySQL, online only
• Number: 13773 (April 8, 2005)
• Comments: + open source, open content; + complex procedure for structure contribution by registered users; - optimized for spectra (measured 14965)
‣ Ligand.info
• Format: SDF
• Number: ~1 000 000
• Comments: + huge number of compounds, CAS, names; - most compounds have only 2D structure; - needs preprocessing; - no toxicological data, but some pharma related
‣ ZINC
• Format: MySQL, online query
• Number: ~2 000 000
• Comments: + 3D structures for docking; + user is allowed to select and download subsets
‣ WWW Molecular Matrix
• Format: XML (XIndices), online
• Number: ~250 000
• Comments: + open; + interface to GAMESS, Condor, MOPAC, ChemDraw, Babel, InChI


Homework
‣ Many online sites and downloadable files with chemical data
‣ Most are not “databases” in an IT sense
• NCI
‣ Diverse formats (even if it is an SDF, it could differ widely)
• DSSTox
‣ Require preprocessing to be used by an end user
• Ligand Info
‣ Not directly suitable to store QSAR data and models
‣ More … (commercial solutions, data sharing, data publishing)
AMBIT database is an attempt to address these issues


The AMBIT database
‣ In the IT sense, a database is a software system that allows users to store, retrieve and search structured information.
‣ The database
• stores chemical information (structure, names, CAS, SMILES, properties, etc.),
• provides the storage and functions needed for
• EEM9-1a Applicability Domain,
• EEM9-1b Chemical Grouping,
• EEM9-2 CAS-SMILES converter.
‣ The software is based on a Relational Database Management System (RDBMS),
• which allows much faster and more convenient access to the data than flat text files.
• Our choice is the MySQL database (www.mysql.com), which is the most popular open source relational database.
• The complete documentation of the AMBIT Database is available at http://ambit.acad.bg/docs/


Ambit Database: Literature reference tables (cont.)
All these details are hidden behind a convenient user interface


Why XML and CML? (eXtensible Markup Language, Chemical Markup Language)
‣ Many approaches to the internal computer representation of a chemical exist. But they are:
• Mutually incompatible and incomplete (different formats represent different subsets of chemical information; conversion between them loses information)
• Proprietary (SMILES, connection table formats such as SDF, etc.). The de facto standard, Daylight SMILES, is not fully published: the last publication is from 1989, and further SMILES extensions are not published. This often results in SMILES strings that not all software can understand.
‣ In contrast, CML is a common chemical data format that is:
• Open (documentation available)
• A universal standard (able to represent any aspect of chemical information)
• Suitable for internal (in memory, for all kinds of processing: calculation, substructure searching, etc.) and external (e.g. visualization) computer representation
• Able to represent complex compounds (e.g. hydrates, salts, etc.) unambiguously, in contrast to SMILES


Ambit Database today
‣ 463 426 compounds
• NCI dataset: 249071 structures, http://cactus.nci.nih.gov/ncidb2/download.html
• Ligand.info: 251369, http://ligand.info
• SRC KOWWIN Training data set: 2464
• SRC KOWWIN Validation data set: 10839 structures
‣ Structures stored in a compressed CML format
‣ SMILES and fingerprints generated


EEM9-1a Applicability domain


Applicability Domain (AD)
AD estimation by projection of training set in model's descriptor space
‣ Review of statistical approaches to regression based QSAR AD estimation, ATLA, 33, 445-459, 2005
‣ An approach to determining applicability domain for QSAR group contribution models: an analysis of SRC KOWWIN, ATLA, 33, 461-470, 2005
‣ Current status of methods for defining the applicability domain of QSARs, ECVAM workshop, ATLA, 1-19, 2005


Applicability Domain (AD)
structural similarity assessment
‣ Need to further develop AD estimation methodology
• Queried chemical is in the descriptor domain but is dissimilar to chemicals in the training set
• Queried chemical is outside the descriptor domain but is similar to chemicals in the training set
• High dimensional models (SRC, MCASE)
‣ Structural similarity assessment (structural domain)
• Different computer representations of structure
• Different measures of similarity
• Robust AD estimation requires combining several methods, each representing chemical structure in a different way.


Applicability Domain (AD)
by structural similarity assessment in AMBIT
‣ Daylight fingerprints
• Missing fragments
• Tanimoto coefficient (Jaccard version)
‣ Atom environments
• Consensus, Hellinger distance
• k-Nearest neighbor, Tanimoto coefficient
‣ Option for combining different approaches
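
The similarity measures named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AMBIT code: a fingerprint is modeled as a set of "on" bit positions, an atom-environment profile as a dictionary of fragment counts, and all function names and toy inputs are hypothetical.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints count as identical
    return len(a & b) / union


def missing_fragments(query_fp, training_fps):
    """Bits set in the query that never occur in any training fingerprint."""
    seen = set()
    for fp in training_fps:
        seen |= set(fp)
    return set(query_fp) - seen


def hellinger(p, q):
    """Hellinger distance between two count profiles (dicts), after normalisation."""
    total_p, total_q = sum(p.values()), sum(q.values())
    if total_p == 0 or total_q == 0:
        raise ValueError("profiles must contain at least one count")
    squared = 0.0
    for key in set(p) | set(q):
        squared += (math.sqrt(p.get(key, 0) / total_p)
                    - math.sqrt(q.get(key, 0) / total_q)) ** 2
    return math.sqrt(squared / 2)


def nearest_similarity(query_fp, training_fps):
    """Highest Tanimoto similarity of the query to any training compound (k = 1)."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)
```

A query compound would then be flagged as outside the structural domain when, for example, `nearest_similarity` falls below a chosen cut-off or `missing_fragments` is non-empty; the cut-off itself is a modelling decision that the slide does not specify.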


AD Estimation in AMBIT
‣ Methods:
• Ranges
• Distances (City block, Euclidean, Mahalanobis)
• Probability Density
• Structural similarity
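
As an illustration of the range and distance methods listed here, the following pure-Python sketch checks a query compound against a training set of descriptor vectors. It is a sketch under assumptions, not the AMBIT implementation: the function names are hypothetical, distances are measured to the training-set centroid, and the covariance matrix is assumed to be non-singular.

```python
import math


def centroid(points):
    """Mean descriptor vector of the training set."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def in_ranges(x, training):
    """Range method: every descriptor lies within the training min/max."""
    return all(
        min(p[j] for p in training) <= x[j] <= max(p[j] for p in training)
        for j in range(len(x))
    )


def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def covariance(points):
    """Sample covariance matrix (divides by n - 1)."""
    n, mu = len(points), centroid(points)
    d = len(mu)
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def mahalanobis(x, training):
    """Mahalanobis distance of x from the training-set centroid."""
    mu = centroid(training)
    diff = [a - b for a, b in zip(x, mu)]
    z = solve(covariance(training), diff)  # z = S^-1 (x - mu)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))
```

Unlike the city block and Euclidean distances, the Mahalanobis distance takes the correlation between descriptors into account, which is why it needs the covariance matrix of the training set.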


AD Estimation: options
‣ Preprocessing:
• PCA
• Center
‣ Threshold
‣ Structural similarity
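
The centering and threshold options can be illustrated with a short Python sketch. The slide does not say how the threshold is defined, so the percentile rule below (the domain covers a chosen fraction of the training compounds) is an assumption, as are all function names; the PCA rotation is left out for brevity.

```python
import math


def center(points):
    """Subtract the training mean from every descriptor vector."""
    n = len(points)
    mean = [sum(p[j] for p in points) / n for j in range(len(points[0]))]
    centered = [[v - m for v, m in zip(p, mean)] for p in points]
    return centered, mean


def distance_threshold(distances, fraction=0.95):
    """Smallest training distance that covers at least `fraction` of the training set."""
    ordered = sorted(distances)
    k = max(1, math.ceil(fraction * len(ordered)))
    return ordered[k - 1]


def in_domain(x, mean, threshold):
    """True if the Euclidean distance of x from the training mean is within the threshold."""
    d = math.sqrt(sum((v - m) ** 2 for v, m in zip(x, mean)))
    return d <= threshold
```

In use, the distances of all training compounds to the mean would be computed first, `distance_threshold` would turn them into a cut-off, and `in_domain` would then be applied to each queried compound.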


View compounds and scatter plot


Domain vs. Prediction error


Export results to a text file
‣ Menu: File / Save / Training set
‣ Exported results: example:


Methods for grouping EEM9-1b
‣ OECD workshop on category development (2004) observed lack of formal methods/tools in this field.
‣ AMBIT approach: reapply AD methods
• Projection of groups (subspaces) in chemical space
• Grouping by structural similarity methods
[Figure: three plots of pKa against E_HOMO. (a) Probabilistic classification (b) Classification by Euclidean distance (c) Classification by Mahalanobis distance]


Methods for grouping EEM9-1b
‣ Projection of groups in chemical space
• Use information theory methods to select and rank descriptors based on relevance and ability to discriminate between groups
• Estimate AD for each group/category
• Classify the new compounds using Bayesian decision theory
‣ Reapplication of AD assessment by structural methods
• Daylight fingerprints
  • Missing fragments
  • Tanimoto coefficient (Jaccard version)
• Atom environments
  • Consensus, Hellinger distance
  • k-Nearest neighbor, Tanimoto coefficient
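
The Bayesian classification step can be sketched as follows. This is a minimal illustration, not the AMBIT method: it assumes each group is described by independent Gaussian densities per descriptor (a naive Bayes model), uses group sizes as priors, and assigns a compound to the group with the highest posterior, which is the Bayes decision under zero-one loss. The function names and the variance floor are hypothetical choices.

```python
import math


def fit_group(points):
    """Per-descriptor mean and sample variance of one group."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    variances = [
        max(sum((p[j] - means[j]) ** 2 for p in points) / (n - 1), 1e-9)  # floor avoids division by zero
        for j in range(d)
    ]
    return means, variances


def log_density(x, means, variances):
    """Log of a product of independent Gaussian densities."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )


def classify(x, groups):
    """Return the name of the group with the highest posterior for x.

    groups: dict mapping group name -> list of descriptor vectors (at least two each).
    """
    total = sum(len(points) for points in groups.values())
    best_name, best_score = None, -math.inf
    for name, points in groups.items():
        prior = len(points) / total
        score = math.log(prior) + log_density(x, *fit_group(points))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

With two descriptors such as pKa and E_HOMO, `groups` would hold the training compounds of each category and `classify` would place a new compound in the most probable one.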


EEM9-2
Conversions CAS to SMILES
MOA Assessment


Ambit converter
Ambit converter can open: CML, CSV, HIN, ICHI, INCHI, MDL MOL, MDL SDF, MOL2, PDB, SMI, TXT and XYZ file types.
Ambit converter can save: SDF, MOL, CSV, TXT, SMI file types.
CAS-SMILES conversion based on a database lookup


CAS-SMILES conversion based on a database lookup (MySQL database)
The database may be local (on the user's computer) or remote (on a server).
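
A database lookup of this kind can be sketched in Python. The example below is only an illustration: it uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server, and the table name `structure`, its columns and the function names are hypothetical rather than the AMBIT schema. The CAS check-digit rule (digits weighted by position from the right, sum taken modulo 10) is the standard one.

```python
import re
import sqlite3


def is_valid_cas(cas):
    """Check the format and the check digit of a CAS registry number."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    checksum = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return checksum % 10 == int(match.group(3))


def build_demo_db():
    """Create a tiny in-memory lookup table (stand-in for the MySQL database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE structure (cas TEXT PRIMARY KEY, smiles TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO structure (cas, smiles) VALUES (?, ?)",
        [("50-00-0", "C=O"), ("71-43-2", "c1ccccc1"), ("7732-18-5", "O")],
    )
    return conn


def cas_to_smiles(conn, cas):
    """Return the SMILES stored for a CAS number, or None if it is not in the table."""
    if not is_valid_cas(cas):
        raise ValueError("not a valid CAS registry number: " + cas)
    row = conn.execute(
        "SELECT smiles FROM structure WHERE cas = ?", (cas,)
    ).fetchone()
    return row[0] if row else None
```

Validating the check digit before querying catches typing errors early; switching between a local and a remote database would only change how the connection is opened, not the lookup itself.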


Verhaar scheme
Verhaar H.J.M., Van Leeuwen C., Hermens J.L.M., Classifying Environmental Pollutants. 1: Structure-Activity Relationships for Prediction of Aquatic Toxicity, Chemosphere, Vol. 25, No. 4, pp. 471-491, 1992
‣ 34 rules
‣ 5 classes
• Class 1. Narcosis or baseline toxicity
• Class 2. Less inert compounds
• Class 3. Unspecific reactivity
• Class 4. Compounds and groups of compounds acting by a specific mechanism
• Class 5. Not possible to classify according to these rules


Some of the rules


LogP prediction (Rule 0.3)
‣ Based on an open source implementation of XLOGP: http://cdk.sourceforge.net/api/org/openscience/cdk/qsar/XLogPDescriptor.html
‣ R. Wang, Y. Fu and L. Lai, A new atom-additive method for calculating partition coefficients, J. Chem. Inf. Comput. Sci. 37 (1997) 615-621.
‣ XLOGP is an atom-additive method for calculating the octanol/water partition coefficient (logP). It gives the logP value for a given compound by summing the contributions from component atoms and correction factors. The program classifies atoms by their hybridization states and their neighboring atoms. The program also includes correction factors to account for some intramolecular interactions. The logP calculation is described by the equation:
$$\log P = \sum_i a_i A_i + \sum_j b_j B_j$$
where $a_i$ and $b_j$ are regression coefficients, $A_i$ is the number of occurrences of the $i$th atom type, and $B_j$ is the number of occurrences of the $j$th correction factor identified by the program.
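
The atom-additive sum described on this slide can be written as a short Python function. This is a sketch of the arithmetic only: atom typing and correction-factor detection are not shown, the function name is hypothetical, and the coefficient values in the usage note are invented placeholders, not XLOGP parameters.

```python
def additive_logp(atom_counts, correction_counts, atom_coeffs, correction_coeffs):
    """Sum atom-type and correction-factor contributions to logP.

    atom_counts / correction_counts: dict name -> number of occurrences in the molecule.
    atom_coeffs / correction_coeffs: dict name -> regression coefficient.
    """
    atom_term = sum(atom_coeffs[name] * n for name, n in atom_counts.items())
    correction_term = sum(
        correction_coeffs[name] * n for name, n in correction_counts.items()
    )
    return atom_term + correction_term
```

For example, with placeholder coefficients `{"type_1": 0.5, "type_2": -0.25}` and `{"corr_1": 0.125}`, a molecule with two `type_1` atoms, four `type_2` atoms and one `corr_1` correction gives 2 × 0.5 + 4 × (−0.25) + 0.125 = 0.125.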


Summary
‣ This presentation reports the status of the project at its midpoint.
‣ Many tools have been developed, and we are working on their seamless integration.
‣ Both the standalone and the web application have been developed and are being extensively tested.
‣ Synergies with other projects are becoming feasible
• ECB Cramer rules software
• Fraunhofer Institute subchronic toxicity database.
