
Comparative Assessment of QSAR Models for Aquatic Toxicity


EUROPEAN COMMISSION
DIRECTORATE GENERAL JOINT RESEARCH CENTRE
Institute for Health and Consumer Protection
Toxicology and Chemical Substances Unit
European Chemicals Bureau
I-21020 Ispra (VA) Italy

Comparative Assessment of QSAR Models for Aquatic Toxicity

Manuela Pavan, Andrew P. Worth and Tatiana I. Netzeva

2005 EUR 21750 EN


LEGAL NOTICE

Neither the European Commission nor any person acting on behalf of the Commission is responsible for the use which might be made of the following information.

A great deal of additional information on the European Union is available on the Internet. It can be accessed through the Europa server (http://europa.eu.int).

EUR 21750 EN
© European Communities, 2005
Reproduction is authorised provided the source is acknowledged.
Printed in Italy


7.2 Model comparison by ratio of QSAR prediction/SIDS data
7.2.1 Comparison between non-polar narcosis model (QSAR1) and SIDS LC50
7.2.2 Comparison between polar narcosis model (QSAR2) and SIDS LC50
7.2.3 Comparison between narcosis model (QSAR3) and SIDS LC50
7.2.4 Comparison between mixed model (QSAR4) and SIDS LC50
7.2.5 Comparison between E-state indices model (QSAR5) and SIDS LC50
7.2.6 Comparison between TerraQSAR model (QSAR6) and SIDS LC50
ACKNOWLEDGEMENTS
TABLES
Table I – SIDS test data
Table II – Mixed model (QSAR4) training set
Table III – SIDS chemicals not suitable for QSAR 4
Table IV – QSAR 4 predictions for the SIDS subset defined by model domain in descriptor and response space (XY-D)
Table V – E-state indices model (QSAR5) training set
Table VI – QSAR5 predictions for the 9 test set chemicals
Table VII – SIDS chemicals not suitable for QSAR 5
Table VIII – QSAR 5 predictions for the SIDS subset defined by model domain in descriptor and response space (XY-D)
Table IX – TerraQSAR (QSAR6) training set: measured versus predicted FHM values in TQ-FHM model, pT units (log[mmol/L])
Table X – TerraQSAR predictions for the SIDS test data
Table XI – Model performance comparison
APPENDIX I: TERMINOLOGY AND STATISTICAL BACKGROUND


LIST OF ABBREVIATIONS

AIC        Akaike Information Criterion
E-state    Electrotopological index
F          Fisher statistic
FIT        Kubinyi function
GETAWAY    GEometry, Topology, and Atom-Weights AssemblY
LC50       Concentration of a compound that causes 50% lethality of the animals in a test batch
LOO        Leave-one-out cross-validation
OLS        Ordinary Least Squares
PCA        Principal component analysis
QSAR       Quantitative Structure-Activity Relationships
Q²boot     Average predictive power calculated by bootstrapping validation
Q²ext      Explained variance in prediction calculated by external validation
R²         Coefficient of determination
R²cv       Cross-validated R²
R²adj      Adjusted R²
RMS        Residual Mean Square
s          Standard error of estimate
SDEC       Standard Deviation Error in Calculation
SDEP       Standard Deviation Error of Prediction
SDEPext    External Standard Deviation Error of Prediction
WHIM       Weighted Holistic Invariant Molecular descriptors


5. Estimation of predictive ability by internal validation techniques (cross-validation, bootstrap, response randomization).
6. Evaluation of QSAR applicability domains by making predictions of SIDS test data: checking the domain of applicability with respect to descriptor ranges and any structural rules defining the group of substances for which the models are valid.
7. Application of the models to the SIDS chemicals.
8. Evaluation of predictive performance in terms of explained variance (Q²ext) and prediction reliability (order of magnitude between estimated and experimental data). Predictive performance was assessed for the full set of SIDS substances, and for subsets based on different hypotheses about the applicability domain.
9. Comparative analysis of the model quality.

2. SIDS TOXICITY DATA SELECTION

Experimental toxicity values were available for 32 SIDS chemicals; interval values were provided for 4 chemicals and open intervals (>) for 6 chemicals. All measured effect concentrations expressed as ">" were disregarded, since these values are difficult to compare with QSAR predictions.

To allow a deeper and more realistic evaluation and validation of the selected models, the AQUIRE (AQUatic toxicity Information REtrieval) database, developed by the U.S. EPA Mid-Continent Ecology Division, Duluth, MN (MED-Duluth) (http://www.epa.gov/ecotox/), was searched to fill in the missing experimental values of the SIDS data.

The AQUIRE database provided experimental toxicity values for 25 of the SIDS chemicals with missing values. Since the database gave more than one value for each chemical, the average value was used to fill the data gaps. The final integrated SIDS dataset thus comprised 57 experimental toxicity values out of the 177 SIDS chemicals. The 177 SIDS chemicals investigated in this study, their toxicity in terms of LogLC50 (mol/l), their LogKow values and their mechanism of action are listed in Table I.

The mechanism of toxic action (MOA) of the SIDS chemicals was identified by comparing three classification schemes and developing a consensus classification scheme. The consensus follows a majority principle, under which each chemical is assigned to the class most represented among the classifications compared, and a precautionary principle, under which all chemicals whose MOA is interpreted differently by the classification schemes are classified as potentially specifically reactive chemicals. The details of the three classification schemes, together with the consensus classification scheme, are given in the European Commission Report EUR 21749 EN (Pavan, M. et al. 2005).
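The consensus rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `consensus_moa`, the class labels and the tie rule (no strict majority means "potentially specifically reactive") are hypothetical readings of the text, not the procedure documented in EUR 21749 EN.

```python
from collections import Counter

# Hypothetical label for the precautionary class; the report's own code may differ.
REACTIVE = "potentially specifically reactive"


def consensus_moa(labels):
    """Combine the MOA labels given by several classification schemes.

    Majority principle: return the class assigned by more than half of the schemes.
    Precautionary principle (assumed reading): if no class has a strict majority,
    return the reactive class.
    """
    counts = Counter(labels)
    top_class, top_count = counts.most_common(1)[0]
    if top_count * 2 > len(labels):
        return top_class
    return REACTIVE


# Two of three schemes agree, so the majority class is kept.
print(consensus_moa(["non-polar narcosis", "non-polar narcosis", "polar narcosis"]))
# All three schemes disagree, so the chemical is treated as reactive.
print(consensus_moa(["non-polar narcosis", "polar narcosis", "unknown"]))
```

With three schemes, a strict majority simply means that at least two of them agree.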


3. SELECTION OF LITERATURE-BASED MODELS TO PREDICT SIDS FISH TOXICITY

The following six QSAR models for acute fish toxicity to Pimephales promelas were analyzed with respect to their predictive capability on the SIDS data set:

• QSAR 1 (non-polar narcosis): Veith, GD, Call, DJ and Brooke, LT. (1983). Structure-toxicity relationships for the fathead minnow, Pimephales promelas: Narcotic industrial chemicals. Canadian Journal of Fisheries and Aquatic Sciences 40, 743-748. Published by the European Commission (European Commission, 1995) and recommended for use in the European Union Technical Guidance Document (European Economic Community 1996).
• QSAR 2 (polar narcosis): Verhaar, H.J.M., Mulder, W., Hermens, J.L.M. (1995). QSARs for ecotoxicity. In Overview of structure-activity relationships for environmental endpoints, Part I: general outline and procedure. Hermens, J.L.M. (ed), Report in QSAR for Predicting Fate and Effects of Chemicals in the Environment, Final Report of DG XII Contract No. EV5V-CT92-0211 (available at http://ecb.jrc.it/QSAR/).
• QSAR 3 (narcosis model): developed by ECB by combining the training sets of the two models above.
• QSAR 4 (mixed mechanism of toxic action): Veith, GD, Mekenyan, O.G. (1993). A QSAR approach for estimating the aquatic toxicity of soft electrophiles [QSAR for soft electrophiles]. Quantitative Structure-Activity Relationships 12, 349-356.
• QSAR 5 (QSAR based on atom-type electrotopological state (E-state) indices): Huuskonen, J. (2003). QSAR modeling with the electrotopological state indices: predicting the toxicity of organic chemicals. Chemosphere 50, 949-953.
• QSAR 6 (TerraQSAR-FHM): TerraQSAR™ - FHM, Fathead minnow 96-hr LC50 Estimation, Software vs 1.1.

The first two models represent QSARs for two very well known mechanisms of action: non-polar narcosis (QSAR1) and polar narcosis (QSAR2). The third model, developed by ECB, is intended to represent the narcosis mechanism of action as a whole, covering both non-polar and polar narcosis.

The three QSAR models for narcosis were evaluated in the European Commission Report EUR 21749 EN (Pavan, M. et al. 2005).

The fourth model is more general than the previous ones: by including an electrophilicity descriptor, it is intended to describe potentially bioreactive (electrophilic) chemicals.

The fifth model is a more recently proposed model based on hydrophobic and polar atom-type electrotopological state (E-state) indices.

The sixth model is a commercially available neural-network-based software program, developed by the software company TerraBase Inc., designed and optimized solely for the computation of acute (96 hr) median lethal concentrations (LC50) of organic (carbon-containing) substances.

Each model was analyzed for its correspondence with the OECD principles and for its capability to provide reliable predictions of the fish toxicity of the SIDS chemicals.


4. MIXED MODE OF ACTION QSAR4 EVALUATION

4.1 Defined endpoint and algorithm

This QSAR, developed for predicting the acute toxicity of organic chemicals to the fathead minnow, was proposed by Veith and Mekenyan (Veith, GD, Mekenyan, O.G. 1993):

LogLC50 = -0.579 LogKow + 0.473 E_LUMO - 2.414

where LC50 is the concentration (in moles per litre) causing 50% lethality in Pimephales promelas after an exposure of 96 hours, Kow is the octanol-water partition coefficient and E_LUMO is the energy of the lowest unoccupied molecular orbital.

The E_LUMO values recalculated with ChemOffice3D [Chem3D Ultra 9.0] were slightly different from those published in the paper; consequently, the OLS equation reproduced here is slightly, but not significantly, different from the original one:

LogLC50 = -0.574 LogKow + 0.454 E_LUMO - 2.445

4.2 Mechanistic basis

The term "mixed" mode of action usually refers to compounds acting by narcosis mechanisms as well as those acting by unspecific "bioreactive" mechanisms, which involve electrophilic (and in some cases nucleophilic) reactions within the cell or organism. The latter exhibit a higher toxicity than that expected from narcosis, and fish acute toxicity syndrome studies demonstrate that they are distinct from narcosis. The exact mechanism of action is not known, but it is assumed to involve a covalent reaction with a biological macromolecule (e.g. a protein or DNA). A further parameter is thus included in the model to account for the reactivity component of the toxicity. Typically this has been a molecular orbital property, such as the energy of the lowest unoccupied molecular orbital (E_LUMO), or a nucleophilic superdelocalisability.

The QSAR was developed for aromatic chemicals considered to act by a number of mechanisms of toxic action. These include non-polar and polar narcosis as well as unspecific electrophilicity as defined by Russom (Russom et al. 1997).

The model is based on two descriptors. The first descriptor, for hydrophobicity (log Kow), is relevant to the mechanism of action, i.e. toxicity results from the accumulation of molecules in biological membranes. The second descriptor, for electrophilicity (E_LUMO), relates to the reactivity of the chemicals with biological macromolecules.

4.3 Domain of applicability

The QSAR model was defined by the developers as applicable to chemicals having log Kow values in the range from 0.34 to 7.54, and E_LUMO values in the range from -2.51 to 0.53. The compounds in the training set may operate by a number of mechanisms of action, including non-polar and polar narcosis as well as unspecific electrophilicity. The training set comprises aromatic compounds, including alkyl and halogen benzenes, as well as phenols and anilines with similar substituents.
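As a minimal sketch of how the refitted equation of Section 4.1 and the descriptor ranges of Section 4.3 might be applied together, the Python snippet below computes a prediction and a simple range check. The function names are hypothetical; the coefficients and range limits are the ones quoted above.

```python
# Coefficients of the refitted OLS equation reported in Section 4.1.
LOGKOW_COEFF = -0.574
ELUMO_COEFF = 0.454
INTERCEPT = -2.445

# Descriptor ranges quoted by the developers (Section 4.3).
LOGKOW_RANGE = (0.34, 7.54)
ELUMO_RANGE = (-2.51, 0.53)


def predict_log_lc50(log_kow, e_lumo):
    """Predicted LogLC50 (mol/l) from the refitted mixed-model equation."""
    return LOGKOW_COEFF * log_kow + ELUMO_COEFF * e_lumo + INTERCEPT


def in_descriptor_domain(log_kow, e_lumo):
    """True if both descriptors fall inside the ranges quoted by the developers."""
    return (LOGKOW_RANGE[0] <= log_kow <= LOGKOW_RANGE[1]
            and ELUMO_RANGE[0] <= e_lumo <= ELUMO_RANGE[1])


# Illustrative descriptor values, not taken from the report's tables.
log_kow, e_lumo = 2.0, 0.0
if in_descriptor_domain(log_kow, e_lumo):
    print(round(predict_log_lc50(log_kow, e_lumo), 3))
```

A range check of this kind covers only the descriptor ranges; the structural limits of the training set (aromatic compounds) are not captured by it.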


The domain of applicability was verified by the leverage approach, which provides a measure of the distance between the descriptor values for an observation and the mean of the x-values for all observations.

4.4 Model performance

The model quality was evaluated according to its internal performance (data quality and goodness-of-fit) and its predictivity on SIDS test data (external validation).

4.4.1 Internal performance

• Data quality

The model training set is made up of 114 chemicals, listed in Table II. The biological data are considered to be of high quality, having been obtained by a single protocol and measured in the same laboratory.

The LogKow data are a mixture of experimental and calculated values. Kow is considered to be a high quality physicochemical descriptor, and the range of log Kow values is well within the range commonly used. However, there is no certainty that the measurements of Kow were made by the same protocol, or in the same laboratory, so this could result in a small amount of variability. Furthermore, using a mixture of calculated and experimental values will also result in some variability.

The calculation of E_LUMO was performed by the same method used and suggested by the authors (MNDO calculation method). This descriptor is known to be conformation-dependent, so some minimal variability is expected.

• Goodness of fit

The model was trained with the 114 chemicals listed in Table II.

Predictor   Coeff.   SE
Constant    -2.445   0.115
LogKow      -0.574   0.033
E_LUMO       0.454   0.061

The following regression fitness parameters were calculated:

R² (%)   R²adj (%)   s       F        LOF
77.57    77.17       0.485   191.99   0.246

SDEC    AIC     FIT
0.478   0.248   3.240

R² = Coefficient of determination; R²adj = Coefficient of determination adjusted for the degrees of freedom; s = standard error of the estimate; F = Fisher function; LOF = Friedman modified lack-of-fit function; SDEC = Standard Deviation Error in Calculation; AIC = Akaike Information Criterion; FIT = Kubinyi function.

• Outlier detection

The regression line of the equation, the Williams plot and the residual plot are reported below. Several chemicals were identified as Y-outliers that lie inside the descriptor-space applicability domain (X-AD) of the model, meaning that either their toxicity values are wrong or these chemicals have some additional feature not accounted for by the model.

The Williams plot identifies 1,3-dichloro-4,6-dinitrobenzene (112) as a strong outlier, with a standardized error in prediction greater than 3, together with four minor outliers: catechol (16), 4-chlorocatechol (24), 1,4-dinitrobenzene (110) and 1,3,5-trichloro-2,4-dinitrobenzene (113).

Moreover, two influential chemicals with leverage values greater than 3p/n (h* = 0.079) were identified: 2,4,6-tri-tert-butylphenol (13) and 2,2'-methylenebis(3,4,6-trichlorophenol) (46). These chemicals greatly influence the regression line: the line is forced towards their observed values, so their residuals (observed minus predicted value) are small, i.e. they are well predicted.

Figure 1 - Mixed model regression plot (predicted vs. experimental LogLC50, mol/L; regression line model LogLC50 = -0.574 LogKow + 0.454 E_LUMO - 2.445).
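The leverage approach and the 3p/n warning value can be illustrated with a small pure-Python sketch. It assumes that p counts the model parameters including the intercept, which is the convention that reproduces the reported h* = 0.079 (p = 3, n = 114). The helper names are hypothetical and the toy data are not from the report.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverage(train_rows, query_row):
    """Hat value h = x (X'X)^-1 x' of a query chemical; an intercept column is added."""
    X = [[1.0] + list(r) for r in train_rows]
    x = [1.0] + list(query_row)
    p = len(x)
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    z = solve(xtx, x)  # z = (X'X)^-1 x
    return sum(xi * zi for xi, zi in zip(x, z))


def warning_leverage(n_params, n_train):
    """Warning leverage h* = 3p/n."""
    return 3.0 * n_params / n_train


# Toy one-descriptor training set (illustrative values only).
rows = [(0.0,), (1.0,), (2.0,), (3.0,)]
print(leverage(rows, (0.0,)))              # leverage of an edge point
print(round(warning_leverage(3, 114), 3))  # mixed model: 2 descriptors + intercept
```

A query chemical whose hat value exceeds h* lies far from the centre of the training descriptor space, so its prediction is an extrapolation.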


Figure 4 - Histogram of training set LogKow distribution.

Figure 5 - Histogram of training set E_LUMO distribution.

Since this QSAR model is based on more than one descriptor, its applicability domain has to be evaluated not only by analysing the distribution of each descriptor separately but also by accounting for its overall model space, which is two-dimensional. For this purpose, a Principal Component Analysis (PCA) was performed and the Hotelling control chart was used to evaluate how far away each chemical was from the PC model hyperplane. The Hotelling ellipse was computed at a 0.05 significance level (95% confidence).

Figure 6 - Score plot PC1 vs PC2 calculated on LogKow and E_LUMO descriptors.

This plot highlights the already established highly influential behavior of 2,4,6-tri-tert-butylphenol (13) and 2,2'-methylenebis(3,4,6-trichlorophenol) (46), confirming the results provided by the leverage approach.

• Internal validation

The model, evaluated by leave-one-out internal cross-validation (Q²LOO) and by bootstrapping with 5000 iterations, shows good predictive power. It was also verified by Y-scrambling with 300 iterations: the models based on randomized responses all have extremely low R² and Q² compared with the real model, meaning that the model was not obtained by chance correlation.

Q²LOO (%)   Q²bootstrap (%) (5000 iterations)   SDEP
75.94       75.83                               0.495

Q²LOO = explained variance in prediction; Q²bootstrap = explained variance in prediction by bootstrapping; SDEP = Standard Deviation Error in Prediction.
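Leave-one-out cross-validation, as used for Q²LOO above, can be sketched as follows. For brevity the sketch refits a one-descriptor least-squares line, whereas the report's model has two descriptors. The definition Q² = 1 - PRESS/TSS with the overall response mean is an assumption about the convention used, and the function names and data are illustrative only.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def q2_loo(xs, ys):
    """Return (Q2_LOO, SDEP): each point is predicted by a model fitted without it."""
    n = len(xs)
    press = 0.0
    for i in range(n):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / n
    tss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / tss, math.sqrt(press / n)


# Illustrative data, not taken from the report.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.2, 4.9, 7.1, 9.0]
q2, sdep = q2_loo(xs, ys)
print(q2 * 100.0, sdep)  # the report quotes Q2 values as percentages
```

Y-scrambling follows the same pattern: the responses are shuffled, the model is refitted, and the resulting R² and Q² are compared with those of the real model.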


4.4.2 External validation on SIDS test data

The QSAR model was used to make predictions of SIDS test data.

• Model descriptor applicability domain

The domain of applicability with respect to descriptor ranges was evaluated by analyzing the distribution of the SIDS LogKow and E_LUMO values with respect to the corresponding distribution of the training set.

Figure 7 - SIDS and training set LogKow distribution comparison.


Figure 8 - SIDS and training set LogKow distribution with MOA highlighted.

The LogKow range of the SIDS test set includes that of the training set but is much wider: LogKow values for the SIDS set range from -3.89 to 18.08.

Figure 9 - SIDS and training set E_LUMO distribution comparison.


Figure 10 - SIDS and training set E_LUMO distribution comparison.

The E_LUMO range of the SIDS test set includes that of the training set but is much wider: E_LUMO values for the SIDS set range from -3.204 to 3.351.

Moreover, the distribution of SIDS test chemicals in the two-dimensional model space was investigated by projecting the data into the two-dimensional space provided by the PCA and using the Hotelling ellipse to evaluate how far away each chemical was from the PC model hyperplane.


Figure 11 - Score plot PC1 vs PC2 calculated on LogKow and E_LUMO descriptors (model descriptor space; training and SIDS test chemicals).

The SIDS space is much bigger than that of the training set; thus, according to the applicability domain of the model descriptors, predictions can be made only for the SIDS chemicals within the descriptor ranges of the model. The SIDS chemicals that fall outside the model domain were disregarded; the complete list is given in Table III.

• QSAR application on the SIDS subset defined by model domain in descriptor and response space (XY-D)

Predictions were used only for chemicals with log Kow values in the range from 0.34 to 7.54 and E_LUMO values in the range from -2.51 to 0.53, according to the pre-defined applicability domain of the model.

The predicted toxicities of the 77 SIDS test chemicals, together with their MOA, leverage and standardized error in prediction, are collected in Table IV.


Figure 12 - Mixed model regression plot: training and SIDS test data colored by MOA (predicted vs. experimental LogLC50, mol/l; regression line model LogLC50 (mol/l) = -0.574 LogKow + 0.454 E_LUMO - 2.445).

Figure 13 - Mixed model Williams plot: training and SIDS test data colored by MOA (standardized error in prediction vs. hat value).

In the Williams plot, three SIDS chemicals can be identified as outliers: 2-Cyclohexen-1-one, 3,5,5-trimethyl (S18), 2-Propenoic acid, 2-methylpropyl ester (S73) and 2-Propenoic acid, ethyl ester (S117). These chemicals are outliers only in the Y-response space, since they are inside the X-AD of the model: either the toxicity value of a given outlier is wrong or the model lacks some additional feature.


Moreover, one SIDS chemical (Hexadecanoic acid, 2-sulfo-, 1-methyl ester, sodium salt (S155)), not displayed in the Williams plot because no experimental toxicity value was available, is outside the applicability domain of the model according to its leverage. Its prediction is therefore not reliable.

Evaluation of predictive performance

The prediction capability of the model in terms of explained variance (Q²ext) and External Standard Deviation Error of Prediction (SDEPext), evaluated by including only those SIDS test data with reliable predictions according to the leverage approach, is satisfactory.

N ext = 25
Q²ext = 75.06
SDEPext = 0.623

The model predictive power is, however, reduced by the three Y-outliers (2-Cyclohexen-1-one, 3,5,5-trimethyl (S18), 2-Propenoic acid, 2-methylpropyl ester (S73) and 2-Propenoic acid, ethyl ester (S117)). If they are removed from the calculation of explained variance (Q²ext) and external standard deviation error of prediction (SDEPext), because of their suspicious toxicity values or their possession of additional features, the model predictive power increases noticeably:

N ext = 22
Q²ext = 87.10
SDEPext = 0.458

4.5 Conclusions

In conclusion, having checked the correspondence of the model with the OECD principles, it can be highlighted that the OECD principles were completely fulfilled for the investigated QSAR model; on the basis of this information, this QSAR model could certainly be regarded as sufficiently well developed to be used for regulatory purposes.

The model was developed for a clear endpoint defined on a specific experimental system, and it has an unambiguous algorithm, which ensures transparency. The applicability domain of the model was defined by the developers, and the model exhibits a satisfactory goodness-of-fit, robustness and predictivity.

Finally, the model has a mechanistic interpretation, since the descriptors used in the model are associated with the predicted endpoint.

Moreover, the exercise pointed out the importance of properly identifying the model applicability domain when the model is applied to make predictions on the SIDS test set. The applicability domain has to be considered in all three phases of the (Q)SAR life cycle: in model development, to ensure that the domain is defined as broadly as possible; in model validation, to verify and, if necessary, refine the domain; and in model application.

To apply a QSAR model properly and to identify the subset of reliable predictions provided by the model, its domain has to be investigated.
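The external validation statistics quoted under "Evaluation of predictive performance" in Section 4.4.2 can be sketched as below. The formulas are an assumption about the convention used: Q²ext is taken as 1 - PRESS/SS, with SS computed around the training-set mean response, and SDEPext as the root mean squared prediction error on the external set. The function name and the numbers in the example are illustrative only.

```python
import math


def external_validation(y_obs, y_pred, y_train_mean):
    """Return (Q2_ext, SDEP_ext) for an external test set.

    y_obs        experimental responses of the external chemicals
    y_pred       model predictions for the same chemicals
    y_train_mean mean response of the training set (assumed reference value)
    """
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_ext = sum((o - y_train_mean) ** 2 for o in y_obs)
    q2_ext = 1.0 - press / ss_ext
    sdep_ext = math.sqrt(press / len(y_obs))
    return q2_ext, sdep_ext


# Illustrative LogLC50 values, not taken from Table IV.
y_obs = [-3.0, -4.0, -5.0, -2.5]
y_pred = [-3.2, -3.9, -4.6, -2.9]
q2_ext, sdep_ext = external_validation(y_obs, y_pred, y_train_mean=-4.0)
print(q2_ext * 100.0, sdep_ext)  # the report quotes Q2_ext as a percentage
```

Because both statistics are driven by the squared residuals, a few Y-outliers with large residuals can dominate them, which is the effect described above when three chemicals are removed.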


5. E-STATE INDICES <strong>QSAR</strong> 5 EVALUATION5.1 Defined endpoint and algorithmThe model was recently proposed in 2003 <strong>for</strong> predicting the acute toxicity <strong>of</strong> organic chemicalsto the fathead minnow. It is based on atom-type electrotopological state (E-state) indices. Theoriginal data set comprising 140 chemicals (130 training and 10 test chemicals) was reduced byeliminating chemical repetitions. The resulted toxicity data set, consisting <strong>of</strong> 130 compounds,was divided by the model developers into a training set <strong>of</strong> 121 compounds <strong>for</strong> developing the<strong>QSAR</strong> model, and into a test set <strong>of</strong> 9 compounds <strong>for</strong> evaluating the predictive capability <strong>of</strong> themodel. The multiple linear regression model obtained is the following:LogLC 50 = -Σ (a i S i ) - 0.916being a i and S i the regression coefficients and the corresponding structural parameters <strong>for</strong> a set <strong>of</strong>14 atom-type E-state indices.5.2 Mechanistic basisIt is not known if the chemicals act by narcotic and/or reactivity modes <strong>of</strong> action. However, thedevelopers indicated that the parameters used can be divided in two classes, i.e. hydrophobic andpolar. The parameters SsCH3, SdsCH, SaaCH, SsssCH, SaasC, SssssC, SsCl and SsBr all have anegative sign and suggest that an increase in hydrophobicity also increases acute toxicity(decreasing LogLC50) <strong>of</strong> the chemical; in the same way the polar parameter SsOH indicates thereactivity mode <strong>of</strong> action. Electron withdrawing groups, like halogens increase acute toxicity,indicating the increasing reactivity <strong>for</strong> substituted phenols when they are in ortho position to thehydroxyl group. 
Finally, the halogens (Cl and Br) increase the hydrophobicity of the chemicals.

5.3 Domain of applicability

The model was defined by the developer to be applicable to chemicals with LogLC50 values in the range from -0.85 to -6.09. The 14 atom-type E-state indices together with their corresponding range values are listed below:

| No. | Symbol | Atom type | Range values |
|-----|--------|-----------|--------------|
| 1 | SsCH3 | -CH3 | 0.000 to 8.167 |
| 2 | SdsCH | -CH= | -0.434 to 0.833 |
| 3 | SaaCH | aCHa | 0.000 to 19.517 |
| 4 | SsssCH | -CH< | -1.346 to 0.750 |
| 5 | SaasC | aasC | -12.225 to 4.105 |
| 6 | SssssC | >C< | -3.699 to 0.042 |
| 7 | SsNH2 | -NH2 | 0.000 to 5.466 |
| 8 | StN | ≡N | 0.000 to 8.565 |
| 9 | SddsN | -N | |
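As a minimal sketch of how the response range and the descriptor ranges listed above could serve as a first applicability check, the Python snippet below tests whether a query chemical falls inside the tabulated training ranges. Only the eight ranges legible in the table are included. The function names and the example descriptor values are hypothetical, and the regression coefficients a_i are not given in this section, so the prediction helper takes them as an argument rather than assuming values.

```python
# Training-set ranges for the E-state indices legible in the table above.
ESTATE_RANGES = {
    "SsCH3": (0.000, 8.167),
    "SdsCH": (-0.434, 0.833),
    "SaaCH": (0.000, 19.517),
    "SsssCH": (-1.346, 0.750),
    "SaasC": (-12.225, 4.105),
    "SssssC": (-3.699, 0.042),
    "SsNH2": (0.000, 5.466),
    "StN": (0.000, 8.565),
}
LOG_LC50_RANGE = (-6.09, -0.85)  # response range stated by the developers
INTERCEPT = -0.916               # constant term of the published equation


def out_of_range_descriptors(descriptors, ranges=ESTATE_RANGES):
    """Return the names of descriptors whose value lies outside its range."""
    flagged = []
    for name, value in descriptors.items():
        if name in ranges:
            low, high = ranges[name]
            if not low <= value <= high:
                flagged.append(name)
    return flagged


def predict_log_lc50(descriptors, coefficients):
    """Evaluate LogLC50 = -sum(a_i * S_i) - 0.916.

    `coefficients` maps descriptor names to a_i and must be supplied by the
    caller (hypothetical input). Descriptors missing from the query are
    treated as 0, i.e. the atom type is absent from the molecule.
    """
    total = sum(coefficients[name] * descriptors.get(name, 0.0)
                for name in coefficients)
    return -total + INTERCEPT


def in_response_range(log_lc50, limits=LOG_LC50_RANGE):
    """True if a predicted LogLC50 lies inside the stated response range."""
    low, high = limits
    return low <= log_lc50 <= high
```

A range check of this kind is only a necessary condition: a chemical can sit inside every individual range and still be far from the training chemicals in the joint descriptor space, which is why leverage and projection methods are also used in this evaluation.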


The following fitness regression parameters were calculated:

| R² | R²adj | s | F | LOF | SDEC | AIC | FIT |
|----|-------|---|---|-----|------|-----|-----|
| 84.04 | 81.93 | 0.389 | 39.86 | 0.225 | 0.364 | 0.194 | 1.749 |

R² = Coefficient of determination; R²adj = Coefficient of determination adjusted for the degrees of freedom; s = standard error of the estimate; F = Fisher function; LOF = Friedman modified lack-of-fit function; SDEC = Standard Deviation Error in Calculation; AIC = Akaike Information Criterion; FIT = Kubinyi function.

• Outlier detection:
The regression line of the equation, the Williams plot and the residual plot are illustrated below. Two chemicals (1-amino-2-methyl-3,6-dinitrobenzene (48) and 2,2,2-trichloroethanol (98)) were identified as Y-outliers. They are inside the X-AD of the model, meaning that either their toxicity values are wrong or these chemicals have some additional feature not accounted for by the model.
The Williams plot identifies three chemicals (1-amino-2,3,4,5,6-pentafluorobenzene (82), 1-aldehydo-pentafluorobenzene (85), and hexachloroethane (121)) as outliers with high influence. Moreover, 1,3,5-tribromo-2-hydroxybenzene (53) and 1,1,2,2-tetrachloroethane (119) are highly influential chemicals with leverage values greater than 3p/n (= 0.372).

Figure 14 - E-State model regression plot (predicted vs. experimental Log(LC50), mol/l; training and test chemicals).
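The adjusted R², the F ratio and the warning leverage quoted above follow from standard formulas, and the short sketch below reproduces them from n = 121 training chemicals and 14 descriptors. It assumes that p in 3p/n counts the 14 descriptors plus the intercept (p = 15); this is an inference from the reported value 0.372, not something stated in the text. The function names are illustrative.

```python
def adjusted_r2(r2, n, k):
    """R2 adjusted for degrees of freedom; k = number of descriptors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_ratio(r2, n, k):
    """F statistic of the regression from R2, n objects and k descriptors."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


def leverage_warning(n, k):
    """Warning leverage h* = 3p/n, assuming p = k + 1 (intercept included)."""
    return 3.0 * (k + 1) / n


n, k, r2 = 121, 14, 0.8404
print(round(100 * adjusted_r2(r2, n, k), 2))  # 81.93, as reported
print(round(f_ratio(r2, n, k), 2))            # close to the reported 39.86
print(round(leverage_warning(n, k), 3))       # 0.372, as reported
```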


Figure 15 - E-State model Williams plot (standardized error in prediction vs. hat values; training and test chemicals).

Figure 16 - E-State model residual plot (standardized error in prediction vs. chemical ID).

The model descriptor space was investigated by principal component analysis (PCA) to identify anomalous or isolated chemicals. In the first two PCs, the strongest outliers and influential chemicals (1-amino-2,3,4,5,6-pentafluorobenzene (82), 1-aldehydo-pentafluorobenzene (85), 1,1,2,2-tetrachloroethane (119) and hexachloroethane (121)) are well isolated from all the other chemicals.


The first two chemicals (1-amino-2,3,4,5,6-pentafluorobenzene (82) and 1-aldehydo-pentafluorobenzene (85)) are characterized by high values of the E-state sum for singly bonded fluorine atoms (SsF), while hexachloroethane (121) is characterized by high values of the E-state sum for singly bonded chlorine atoms (SsCl).

Figure 17 - Training score plot PC1 vs PC2 calculated on E-state indices (Cum. E.V. = 33.5%).

Figure 18 - Loading plot PC1 vs PC2 calculated on E-state indices.


The fourth component highlights the singular behavior of 1,3,5-tribromo-2-hydroxybenzene (53), which is due to its high value of the E-state sum for singly bonded bromine atoms (SsBr).

Figure 19 - Score plot PC3 vs PC4 calculated on E-state indices (Cum. E.V. = 21.4%).

Figure 20 - Loading plot PC3 vs PC4 calculated on E-state indices.
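The cumulative explained variance quoted with each score plot is the share of total variance carried by the plotted components. The sketch below shows one way to compute it without external libraries, using power iteration with deflation on the covariance matrix. It is an illustration only: the report does not state how the descriptors were scaled, so the code works on the columns as given (autoscale them first for a correlation-matrix PCA), and in practice a linear-algebra library would normally be used.

```python
import math
import random


def covariance(rows):
    """Sample covariance matrix of the columns (rows = chemicals)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(m)]
            for a in range(m)]


def leading_eigenvalues(matrix, count, iterations=1000, seed=0):
    """Largest eigenvalues of a symmetric matrix: power iteration + deflation."""
    rng = random.Random(seed)
    m = len(matrix)
    work = [row[:] for row in matrix]
    values = []
    for _ in range(count):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        for _ in range(iterations):
            nxt = [sum(work[a][b] * vec[b] for b in range(m)) for a in range(m)]
            norm = math.sqrt(sum(v * v for v in nxt))
            if norm < 1e-12:  # no variance left to extract
                break
            vec = [v / norm for v in nxt]
        length2 = sum(v * v for v in vec)
        image = [sum(work[a][b] * vec[b] for b in range(m)) for a in range(m)]
        value = sum(v * w for v, w in zip(vec, image)) / length2  # Rayleigh quotient
        values.append(value)
        for a in range(m):  # deflate: remove the component just found
            for b in range(m):
                work[a][b] -= value * vec[a] * vec[b] / length2
    return values


def cumulative_explained_variance(rows, count):
    """Fraction of total variance explained by the first `count` PCs."""
    cov = covariance(rows)
    total = sum(cov[j][j] for j in range(len(cov)))
    return sum(leading_eigenvalues(cov, count)) / total
```

Multiplying the returned fraction by 100 gives a figure comparable to the "Cum. E.V." values shown with the score plots.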


• Internal validation:
The model evaluated by leave-one-out internal cross-validation (Q²LOO) shows a moderate predictive power, and according to the bootstrap it is not predictive at all.

| Q²LOO | Q²bootstrap (5000 iterations) | Q²ext | SDEP |
|-------|-------------------------------|-------|------|
| 68.28 | 9.30 | 60.73 | 0.505 |

Q²LOO = explained variance in prediction; Q²bootstrap = explained variance in prediction by bootstrapping; Q²ext = external explained variance; SDEP = Standard Deviation Error in Prediction.

Predictions performed for the 9 evaluation chemicals, together with their leverage and standardized error in prediction, are collected in Table VI. Two chemicals in the test set are outside the applicability domain of the model, since they are identified as Y-outliers: pentachloronaphthalene (128) and 1-formyl-2-fluorobenzene (130).

5.4.2 External validation on SIDS test data

The QSAR model was used to make predictions for the SIDS test data.

• Model descriptor applicability domain
The domain of applicability with respect to descriptor ranges was evaluated by analyzing the distribution of the SIDS chemicals in the principal component space.
The score plot of the first two PCs clearly highlights that not all SIDS chemicals are covered by the model training chemicals.

Figure 21 - SIDS and training score plot PC1 vs PC2 calculated on E-state indices (Cum. E.V. = 28.2%).


Figure 22 - SIDS and training score plot PC3 vs PC4 calculated on E-state indices (Cum. E.V. = 20.2%).

Since all the model descriptors are meaningful and relevant, the first four PCs capture only about 50% of the total explained variance. Thus, to avoid analyzing all the model dimensions, the applicability domain according to the descriptors selected in the model was investigated by the multidimensional scaling approach. Multidimensional scaling (MDS) can be considered an alternative to factor analysis, typically used as an exploratory technique to visualize objects in a low-dimensional space.
In general, the analysis allows detection of meaningful underlying dimensions for similarities or dissimilarities (distances) between the investigated chemicals. In factor analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix. With MDS it is possible to analyze not only correlation matrices but also any kind of similarity or dissimilarity matrix. Non-metric multidimensional scaling is based on a distance matrix computed with any distance measure. The algorithm then attempts to place the data points in a two-dimensional coordinate system such that the rank order of the distances is preserved.
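To make "rank order is preserved" concrete, the sketch below computes Kruskal's stress-1, the quantity that non-metric MDS is usually set up to minimise: the distances in the low-dimensional configuration are compared with their best monotone (isotonic) fit against the ranked input dissimilarities. This only evaluates a given configuration and does not perform the optimisation itself. Whether the report's software used this exact criterion is an assumption, and the function names are illustrative.

```python
import math
from itertools import combinations


def isotonic_fit(values):
    """Non-decreasing least-squares fit (pool-adjacent-violators)."""
    blocks = []  # each block holds [sum, count]
    for v in values:
        blocks.append([v, 1])
        # merge while the previous block mean exceeds the current one
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def kruskal_stress(dissimilarity, points):
    """Stress-1 of a configuration.

    dissimilarity: dict {(i, j): value} with i < j
    points: list of low-dimensional coordinates, one per object
    """
    pairs = sorted(dissimilarity, key=dissimilarity.get)  # rank order of inputs
    dist = [math.dist(points[i], points[j]) for i, j in pairs]
    fitted = isotonic_fit(dist)
    num = sum((d - f) ** 2 for d, f in zip(dist, fitted))
    den = sum(d * d for d in dist)
    return math.sqrt(num / den)


# Toy example: dissimilarities are a monotone transform of the 2D distances,
# so the rank order is reproduced exactly and the stress is zero.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
delta = {(i, j): math.dist(pts[i], pts[j]) ** 2
         for i, j in combinations(range(len(pts)), 2)}
print(kruskal_stress(delta, pts))
```

Because only the rank order of the dissimilarities enters the fit, any monotone transformation of the input distances leaves the stress unchanged, which is what distinguishes the non-metric variant from metric MDS.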


Figure 23 - SIDS and training projection in the MDS two-dimensional space (Dim 1 vs Dim 2).

According to the applicability domain of the model descriptors, predictions can be performed only for the SIDS chemicals within the domain highlighted with the red Hotelling ellipse. The complete list of the SIDS chemicals that fall outside the model domain and have been disregarded is given in Table VII.

• QSAR application on the SIDS subset defined by model domain in descriptor and response space (XY-D)
Predictions were used only for the chemicals within the ellipse in the multidimensional scaling graph, according to the applicability domain of the model descriptors.
The predicted toxicities of the 152 SIDS test chemicals, together with their leverage and standardized error in prediction, are collected in Table VIII.
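The membership test behind a Hotelling ellipse in a two-dimensional projection can be sketched as follows: a query point is inside the ellipse when its squared Mahalanobis distance from the training centre does not exceed a critical T² value. The sketch assumes that the ellipse is built from the training chemicals' two MDS coordinates at a 95% confidence level; neither detail is stated in the text, so both are labelled assumptions, and the function names are illustrative. For two dimensions the F critical value has a closed form, so no statistics library is needed.

```python
def mean_and_covariance(points):
    """Centre and sample covariance (sxx, sxy, syy) of 2D training points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def t_squared(point, centre, cov):
    """Squared Mahalanobis distance of a 2D point from the training centre."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def t_squared_limit(n, alpha=0.05):
    """Critical value for a new observation, two dimensions, n training points.

    Uses the closed form of the F(2, n-2) quantile:
    F_crit = (m / 2) * (alpha ** (-2 / m) - 1) with m = n - 2.
    """
    m = n - 2
    f_crit = (m / 2.0) * (alpha ** (-2.0 / m) - 1.0)
    return 2.0 * (n - 1) * (n + 1) / (n * m) * f_crit


def inside_ellipse(point, training_points, alpha=0.05):
    """True if `point` lies within the Hotelling ellipse of the training set."""
    centre, cov = mean_and_covariance(training_points)
    return t_squared(point, centre, cov) <= t_squared_limit(len(training_points), alpha)
```

A chemical for which `inside_ellipse` returns False would be treated like the entries of Table VII, i.e. excluded before any prediction is interpreted.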


Figure 24 - E-state model regression plot: training and SIDS test data (predicted vs. experimental LogLC50, mol/l).

Figure 25 - E-state model Williams plot: training and SIDS test data (standardized error in prediction vs. hat values).

In the Williams plot it is possible to identify five SIDS chemicals which are both outliers and highly influential chemicals, thus being outside the applicability domain of the model: Phenol, 4,4'-(1-methylethylidene)bis (S31), 1,2,3-Propanetriol, triacetate (S67), 2-Propenoic acid, 2-methylpropyl ester (S73), 1-Butene, 3,4-dichloro (S131), Cyclohexanol, 5-methyl-2-(1-methylethyl)- (S141).
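The reading of a Williams plot can be summarised as a two-threshold rule, sketched below. The warning leverage 0.372 is the value reported for the training set; the residual limit of 3 standardized units is an assumption (a commonly used cut-off, not stated in the text), and the category labels are illustrative. A chemical without an experimental value has no residual, so only its leverage can be judged.

```python
def classify(leverage, std_residual, h_warning=0.372, residual_limit=3.0):
    """Place a chemical in one of the four regions of a Williams plot.

    std_residual may be None when no experimental toxicity value exists.
    """
    high_leverage = leverage > h_warning
    outlier = std_residual is not None and abs(std_residual) > residual_limit
    if high_leverage and outlier:
        return "outlier with high leverage"
    if high_leverage:
        return "high leverage (extrapolation)"
    if outlier:
        return "Y-outlier inside the X-domain"
    return "inside the domain"


print(classify(0.50, 4.2))   # outlier with high leverage
print(classify(0.50, None))  # high leverage (extrapolation)
print(classify(0.10, -3.5))  # Y-outlier inside the X-domain
print(classify(0.10, 0.4))   # inside the domain
```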


It has to be pointed out that, while high-leverage chemicals in the QSAR model training set reinforce the model itself, test chemicals with leverage values greater than the warning value have unreliable predicted data, since these predictions result from substantial extrapolation of the model.
Several other SIDS chemicals have unreliable predictions according to their leverage values (1,2,3-Propanetriol (S2), Ethene, chloro (S12), 2-Cyclohexen-1-one, 3,5,5-trimethyl- (S18), 1,6-Octadien-3-ol, 3,7-dimethyl (S19), 2-Propenamide (S23), 2-Pyrrolidinone, 1-ethenyl (S37), 1,2-Benzenedicarbonitrile (S43), Propanoic acid, 2-methyl-, anhydride (S54), 2-Propenoic acid, 2-ethylhexyl ester (S68), 2-Butenal, 3-methyl (S79), Cyclohexene (S90), 5-Hepten-2-one, 6-methyl- (S92), 2-Propanol, 1,1'-oxybis (S93), 1-Propene (S96), 1,6-Octadien-3-ol, 3,7-dimethyl-, acetate (S99), Phenol, 2,6-bis(1,1-dimethylethyl)-4-methyl- (S115), 3,5,9-Undecatrien-2-one, 6,10-dimethyl- (S118), 2-Propenoic acid, butyl ester (S119), 1-Hexadecen-3-ol, 3,7,11,15-tetramethyl- (S126), 1,2,4-Benzenetricarboxylic acid (S127), 5-Isobenzofurancarboxylic acid, 1,3-dihydro-1,3-dioxo (S128), 2-Buten-1-ol, 3-methyl (S129), 1,4-Benzenediamine, N-(1,3-dimethylbutyl)-N'-phenyl- (S133), 1,3,5-Triazine-2,4,6(1H,3H,5H)-trione, 1,3,5-tris(2-hydroxyethyl)- (S135), HCFC 141b (S143), Cyclohexanol, 5-methyl-2-(1-methylethyl)-, [1R-(1alpha,2beta,5alpha)]- (S144), 2-Propenoic acid, 2-(dimethylamino)ethyl ester (S147), Cyclohexanemethanamine, 5-amino-1,3,3-trimethyl- (S149), 1,2,4-Benzenetricarboxylic acid, tris(2-ethylhexyl) ester (S152), Hexadecanoic acid, 2-sulfo-, 1-methyl ester, sodium salt (S155), 2H-Pyran, 3,4-dihydro-2-methoxy (S157), 2-Benzothiazolesulfenamide, N,N-dicyclohexyl- (S159), 2,6-Octadienal, 3,7-dimethyl (S161), Benzene, 1,4-dimethyl-2-(1-phenylethyl)- (S163), Benzenepropanoic acid, 3,5-bis(1,1-dimethylethyl)-4-hydroxy-, methyl ester (S165),
1,4-Benzenedicarboxylic acid, bis(2-ethylhexyl) ester (S166), Cyclohexanamine, 4,4'-methylenebis[2-methyl- (S167), Benzene, bis(1-methylethyl)- (S171), Benzene, 1,1'-oxybis-, pentabromo deriv. (S174)). These are not displayed in the Williams plot because their experimental toxicity values are missing. They are outside the applicability domain of the model according to their leverage, and thus their predictions are not reliable.

Then 22 SIDS chemicals were identified as strong outliers (Formaldehyde (S1), 1,2-Propanediol (S3), Formamide, N,N-dimethyl- (S8), 2-Butanol (S21), 2-Propenoic acid, 2-methyl-, methyl ester (S32), 1,2-Benzenedicarboxylic acid, dibutyl ester (S35), Butanamide, N-(2-methylphenyl)-3-oxo- (S45), 1,2-Ethanediamine (S76), 2,4-Pentanediol, 2-methyl- (S78), 2-Propanol, 1-methoxy (S81), 2-Butyne-1,4-diol (S89), 1,2-Benzenediol (S103), 2,4-Pentanedione (S108), Acetic acid, butyl ester (S110), Phosphoric-acid-tributyl-ester- (S112), 2-Propenoic acid, ethyl ester (S117), Acetic-acid-ethyl-ester (S120), 2-Propanol, 1-phenoxy (S132), Phosphonic acid, dimethyl ester (S137), 1-Propanol, 2-phenoxy (S156), Phenol, nonyl- (S169), Phenol, 4-nonyl-, branched (S177)). These chemicals are outliers only in the Y-response space, since they are inside the X-AD of the model: either their toxicity values are wrong or the model is lacking some additional feature.

Evaluation of predictive performance

The prediction capability of the model in terms of explained variance (Q²ext) and external standard deviation error of prediction (SDEP ext), evaluated by including only those SIDS test data with reliable predictions according to the leverage approach, is satisfactory.
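For orientation, the two external statistics can be computed as in the sketch below. It uses one common definition of Q²ext, in which the residual sum of squares on the external chemicals is compared with their squared deviations from the training-set mean response; several variants exist and the text does not say which one was used, so this choice is an assumption. The function returns a fraction, whereas the report quotes percentages. The numbers in the example are made up for illustration.

```python
import math


def sdep(observed, predicted):
    """Standard deviation error of prediction over the external chemicals."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def q2_ext(observed, predicted, training_mean):
    """External explained variance, referenced to the training-set mean."""
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total = sum((o - training_mean) ** 2 for o in observed)
    return 1.0 - press / total


# Made-up LogLC50 values for three external chemicals.
observed = [-2.0, -3.0, -4.0]
predicted = [-2.5, -3.0, -3.5]
print(q2_ext(observed, predicted, training_mean=-3.0))  # 0.75
print(round(sdep(observed, predicted), 3))              # 0.408
```

Both statistics are driven by squared residuals, so a handful of chemicals with very large errors can dominate them.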


N ext = 39
Q²ext = 48.98
SDEP ext = 1.144

The model predictive power is strongly reduced by the Y-outliers: Formaldehyde (S1), 1,2-Propanediol (S3), Formamide, N,N-dimethyl- (S8), 2-Butanol (S21), 2-Propenoic acid, 2-methyl-, methyl ester (S32), 1,2-Benzenedicarboxylic acid, dibutyl ester (S35), Butanamide, N-(2-methylphenyl)-3-oxo- (S45), 1,2-Ethanediamine (S76), 2,4-Pentanediol, 2-methyl- (S78), 2-Propanol, 1-methoxy (S81), 2-Butyne-1,4-diol (S89), 1,2-Benzenediol (S103), 2,4-Pentanedione (S108), Acetic acid, butyl ester (S110), Phosphoric-acid-tributyl-ester- (S112), 2-Propenoic acid, ethyl ester (S117), Acetic-acid-ethyl-ester (S120), 2-Propanol, 1-phenoxy (S132), Phosphonic acid, dimethyl ester (S137), 1-Propanol, 2-phenoxy (S156), Phenol, nonyl- (S169), Phenol, 4-nonyl-, branched (S177). If these outliers are removed from the calculation of the explained variance (Q²ext) and external standard deviation error of prediction (SDEP ext), because of their suspicious toxicity values or their possession of additional features, the model predictive power increases substantially:

N ext = 17
Q²ext = 89.43
SDEP ext = 0.398

5.5 Conclusions

In conclusion, it should be noted that, in general, a QSAR model would aim either to have a broad applicability, sacrificing to some extent the level of predictivity, or to have a narrow applicability, but with greater predictivity.
Since the model analyzed was intended to be a global model, developed to have a broad applicability and to make predictions for chemicals acting with different modes of toxic action, the model could exhibit better predictive performance if trained with a wider training dataset.
In fact, a group contribution approach is expected to provide better results if applied to a more diverse training dataset.
The atom-type electrotopological state (E-state) indices used as structural parameters to develop the model are attractive theoretical descriptors, because they can be calculated easily and rapidly and are free of experimental error, and thus not affected by measurement variability.
The investigated QSAR model fulfills the OECD principles; in fact, it was developed for a clear endpoint defined on a specific experimental system, and it has an unambiguous algorithm, which ensures the transparency of the model. The applicability domain of the model was defined by the developers, and the model exhibits satisfactory goodness-of-fit, robustness and predictivity. Finally, the model has a mechanistic interpretation, since the descriptors used in the model are associated with the predicted endpoint.


6. TERRAQSAR FHM QSAR 6 EVALUATION

6.1 Introduction

TerraQSAR™ FHM is a specialized neural-network-based software program, designed and optimized solely for the computation of acute (96hr) median lethal concentrations (LC50) of organic (carbon-containing) substances with a defined chemical structure to the fish fathead minnow (Pimephales promelas).
TerraQSAR™, developed by TerraBase Inc., is based on the probabilistic neural network methodology using the molecular structure of the substances under investigation. The TerraQSAR™ FHM program estimates the 96hr lethal concentrations to 50% of a population (LC50) of the North American fish fathead minnow (Pimephales promelas), a widely used test species.
TerraQSAR modules use as input a chemical's SMILES code (2D or 3D).
The TerraQSAR™ FHM module computes the LC50 in both mg/L and pT (log[L/mmole]) units, as well as the molecular weight (MW) of the substances entered.

6.1.1 Theory

The TerraQSAR products make use of the neural network methodologies developed in recent years by researchers and programmers both within and outside the company.
In contrast to linear methodologies, such as simple regression methods, principal components analysis and others, neural networks make use of nonlinear relationships in addition to linear ones, which makes them particularly useful for chemical/biological problems where different and/or unknown modes of action are known or likely to be present.

6.1.2 Computation process

The TerraQSAR™ FHM fathead minnow toxicity estimation program is based on a data set of measured values for 886 organic (carbon-containing) compounds.
The majority of the fragments used in TerraQSAR have been described in several publications, especially in the Kaiser et al. papers.


An overview of the basic fragment types considered is given below.

| Fragment Type | Example |
|---------------|---------|
| Acidity fragment | C(=O)O, S(=O)(=O)O |
| Aliphatic ring fragment | C1CCCCC1, C1CCCC1 |
| Aromatic ring fragment | c1ccccc1, c1ccccn1 |
| Atom fragment | C, H, N, O |
| Bond fragment | CC, C=C, C#C |
| Group fragment | C-O-H, C-O-C, O=C-O-C |
| Hydrophobicity fragment | C(C)(C)C, CCCC |
| Ionisation fragment | [O-], [Na+] |
| Polarity fragment, Reactivity fragment | O=N(=O)CC(O)C=CC=O |
| Stereo fragment | Cl[C@H](C)N, Cl[C@@H](C)N |
| Weight fragment | molecular weight |

The computer evaluates the number and type of fragments present in the query string and computes the resulting estimate on the basis of the same types of fragments present in a data set of 886 compounds for which measured values have been published in the literature.

6.2 Application of the OECD principles to TerraQSAR™

The TerraQSAR software has been checked for its correspondence with the OECD principles in order to evaluate to what extent the model fulfils the agreed OECD principles for the validation, for regulatory purposes, of (Q)SAR models, according to which a (Q)SAR model for regulatory purposes should be associated with the following information:
1) a defined endpoint
2) an unambiguous algorithm
3) a defined domain of applicability
4) appropriate measures of goodness-of-fit, robustness and predictivity
5) a mechanistic interpretation, if possible
These principles aim to provide generic baseline guidance for integrating the use of (Q)SAR models into regulatory frameworks.
It should be emphasized that these principles identify the types of information that are considered useful for the application of (Q)SAR models in a regulatory context.

6.2.1 Defined endpoint

The intent of this principle is to ensure clarity in the endpoint being predicted by the model, since a given endpoint could be determined by different experimental protocols and under different experimental conditions. It is therefore important to identify the experimental system that is being modeled by the (Q)SAR.


The TerraQSAR™ FHM program estimates the 96hr lethal concentrations to 50% of a population (LC50) of the North American fish fathead minnow (Pimephales promelas), which is one of the endpoints referred to in OECD Test Guideline 203.

6.2.2 Defined algorithm

In order to use the QSAR for regulatory purposes, transparency in the model algorithm that generates predictions of an endpoint is required. It is recognized that, in the case of commercially developed models, this information is not always made publicly available. However, without this information, the performance of a model cannot be independently established, which is likely to represent a barrier to regulatory acceptance.
The QSAR considered here for predicting the acute toxicity of organic chemicals to the fathead minnow (Pimephales promelas) has been developed by a neural network methodology. The values used to train the network are all in the public domain, i.e. these values are published in the scientific literature. They are also available in a number of databases, such as the US government produced "AQUIRE" database (available free) or the TerraTox – Explorer database (a commercial product). However, this should not be interpreted as a simple reproduction of the AQUIRE data in the TerraTox database.
In consulting the original references, TerraBase Inc. used its own system of data evaluation and, therefore, the values used for the training of the TerraQSAR - FHM model may or may not be the same as for other fish toxicity estimation models, such as ECOSAR, TOPKAT, and so forth.
The TerraQSAR program is based on a "probabilistic neural network" algorithm, as developed by Specht. This type of network does not work like other neural networks (such as, for example, the back-propagation network), where the number of cycles, neurons, layers, etc. is largely a matter of trial and error and, hence, the results are highly variable and depend very much on these settings. In contrast, the TerraQSAR neural network is based on the "optimal estimator of the conditional average", defined as:

$$\hat{q}(P) = \int_{Q} Q \, f(Q \mid P) \, dQ$$

where:
$P = (m_1, m_2, \ldots, m_M, \#)$ ... input variables
$Q = (\#, m_{M+1}, m_{M+2}, m_L)$ ... output variables

This results in a unique, automatically optimized network, which is not dependent on training cycle optimization, number of layers, neurons, initialization, etc.

6.2.3 Mechanistic basis

The absence of a mechanistic interpretation does not mean that a model is not potentially useful in the regulatory context, since a mechanistic interpretation of a given (Q)SAR is not always possible. Nevertheless, the possibility of a mechanistic association between the descriptors used in a model and the endpoint being predicted should be considered.
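Returning briefly to the estimator defined in Section 6.2.2: in Specht's general regression formulation, the conditional average is approximated by a kernel-weighted mean of the training responses, which is why no iterative training is needed. The sketch below illustrates that idea with a Gaussian kernel. It is not the TerraQSAR implementation, whose fragment inputs and smoothing settings are not public; the function name, the smoothing parameter `sigma` and the toy data are all hypothetical.

```python
import math


def conditional_average(query, train_x, train_y, sigma):
    """Kernel-weighted mean of training responses (general regression form).

    query:   descriptor vector of the chemical to be estimated
    train_x: list of descriptor vectors of the training chemicals
    train_y: list of measured responses for the training chemicals
    sigma:   smoothing parameter of the Gaussian kernel (hypothetical value)
    """
    sq_dist = [sum((q - v) ** 2 for q, v in zip(query, x)) for x in train_x]
    nearest = min(sq_dist)  # subtracting it avoids underflow of all weights
    weights = [math.exp(-(d - nearest) / (2.0 * sigma ** 2)) for d in sq_dist]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Toy data: two training chemicals described by one fragment count each.
train_x = [(0.0,), (1.0,)]
train_y = [-2.0, -4.0]
print(conditional_average((0.5,), train_x, train_y, sigma=0.5))   # -3.0
print(conditional_average((0.0,), train_x, train_y, sigma=0.01))  # -2.0
```

With a small `sigma` the estimate approaches the response of the nearest training chemical, and with a large one it approaches the overall training mean, so the smoothing parameter is the main quantity left to tune in this type of network.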


This QSAR was not developed for a specific class of chemicals acting with a defined mode of action.
The model is based on fragment descriptors already described in several publications in the literature. An overview is given above; however, some adjustments and variations have been introduced by the developers, and these details are part of their own knowledge base and are not public.

6.2.4 Domain of applicability

The applicability domain of the (Q)SAR model has to be analyzed in order to evaluate its limitations in terms of the types of chemical structures, physicochemical properties and mechanisms of action for which the model can provide reliable predictions.
According to the developers of the TerraQSAR - FHM program, the model does not have any applicability or domain limits, other than that it can only estimate values for organic (carbon-containing) compounds. Its domain is not limited to or determined by any type of chemical substructure or affected by a compound's practical use (such as dye, surfactant).

6.2.5 Model performance

6.2.5.1 Internal performance

This is intended to evaluate the model quality, distinguishing between the internal performance of the model (goodness-of-fit and robustness) and the predictivity of the model (external validation).

• Data quality
The TerraQSAR™ FHM fathead minnow toxicity estimation program is based on a data set of measured values for 886 organic (carbon-containing) compounds. The training set experimental (96h LC50) values are listed in Table IX, together with their predicted values and ordinary residuals in prediction. The measured vs.
predicted fathead minnow (FHM) values in the training set cover approximately 10 orders of magnitude. Their correlation coefficient is 0.975.

• Goodness-of-fit
The following statistics were reported for this QSAR: n = 886, R² = 94.56. The Root Mean Square Error (RMSE) for a leave-one-out cross-validation of the TerraQSAR - FHM model is 0.19 pT units.

Fitness regression parameters:

| R² | SDEC |
|----|------|
| 94.56 | 0.347 |

R² = Coefficient of determination; SDEC = Standard Deviation Error in Calculation.


• Outlier detection:
Figure 26 - TerraQSAR residual plot (residual in prediction vs. experimental Log LC50, mmol/L, for the 886 training chemicals).

• Internal validation:
No information about further internal validation statistics is available.

The regression line is illustrated below:

Figure 27 - TerraQSAR regression plot (TerraQSAR NN model; predicted vs. experimental Log LC50, mmol/L).


6.2.5.2 External validation on SIDS test data

The QSAR model has been used to make predictions for the SIDS test data.

The response distribution of the training chemicals has been compared with that of the SIDS test data: the histogram below shows that the experimental values of the SIDS test data are completely covered by the training set.

Figure 28 - Response distribution of training and SIDS test data (number of observations vs. experimental Log LC50, mmol/L).

The chemical domain of applicability, i.e. the region in the space defined by the modeled response and the descriptors of the model for which the QSAR model should make reliable predictions, is defined by the nature of the chemicals in the training set. It can be characterized in various ways, for example by the Williams plot of the regression, which allows graphical detection of both response outliers and structurally influential chemicals in a model.

As the names of the training set chemicals and their descriptor values were not provided by the authors, it was not possible to establish the applicability domain of the model for the SIDS test set. Only the outliers for the response could be detected.

The projections of the SIDS data on the model regression line and in the residual plot are illustrated below.


Figure 29 - TerraQSAR model regression plot: training and SIDS test data (TerraQSAR NN model SIDS predictions; predicted vs. experimental Log LC50, mmol/L).

Figure 30 - TerraQSAR model residual plot: training and SIDS test data (residual in prediction vs. experimental Log LC50, mmol/L).
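For models whose descriptor values are available, the descriptor-space part of the Williams plot mentioned in this section rests on leverage values. The sketch below is only an illustration for a one-descriptor linear model with intercept, where the leverage reduces to h = 1/n + (x − mean)² / Sxx. The cut-off h* = 3p'/n (p' = number of descriptors + 1) is a commonly used convention assumed here, not a value given in the report, and the descriptor values are hypothetical. It could not be applied to TerraQSAR itself, because its descriptors were not disclosed.

```python
def leverages(x_train, x_query):
    """Leverage of each query point for a one-descriptor linear model with intercept."""
    n = len(x_train)
    mean_x = sum(x_train) / n
    sxx = sum((x - mean_x) ** 2 for x in x_train)
    # h = 1/n + (x - mean)^2 / Sxx is the diagonal of the hat matrix in this simple case.
    return [1.0 / n + (x - mean_x) ** 2 / sxx for x in x_query]


def warning_leverage(n_train, n_descriptors=1):
    """Assumed cut-off h* = 3p'/n, with p' = number of descriptors + 1."""
    return 3.0 * (n_descriptors + 1) / n_train


# Hypothetical descriptor values (for example log Kow); not data from the report.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
x_query = [2.0, 10.0]

h = leverages(x_train, x_query)
h_star = warning_leverage(len(x_train))
# A query chemical is treated as inside the descriptor domain if h <= h*.
inside_domain = [value <= h_star for value in h]
print(h, h_star, inside_domain)
```

A chemical far from the training descriptor range (the second query point) gets a leverage well above the cut-off, which is how extrapolations are flagged.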


The predicted toxicities of the test set chemicals, together with the residuals in prediction, are reported in Table X.

Evaluation of predictive performance

The prediction capability of the model, in terms of explained variance (Q²ext) and external standard deviation error of prediction (SDEPext), shows a very high predictive power.

N. ext = 57
Q²ext = 99.39
SDEPext = 0.116

6.3 Conclusions

Having checked the correspondence of the model with the OECD principles, it can be highlighted that the principles were not completely fulfilled for the investigated QSAR model; thus, on the basis of this information, the model cannot be regarded as sufficiently well developed to be used for regulatory purposes.

The model was developed for a clearly defined endpoint, but the required unambiguous algorithm was not fully available, because the developers wished to preserve their company know-how. The applicability domain could not be estimated, since the identities of the training set chemicals are missing, together with the precise list of descriptors used to train the net. A full evaluation of the model performance could not be performed. Finally, the mechanistic interpretation was not provided.

Thus, even if the model is very well trained and powerful, it does not fulfil the OECD principles and could not be used for regulatory purposes.
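The external validation statistics quoted in section 6.2.5.2 (Q²ext and SDEPext) can be sketched as follows. The definitions used, Q²ext = 100 · (1 − PRESS / SS) with SS taken around the training-set mean response and SDEPext = sqrt(PRESS / n_ext), are common conventions assumed here; the report does not state its formulas, and the numbers below are hypothetical.

```python
import math


def external_validation(y_ext, y_pred, y_train_mean):
    """Return (Q2ext in percent, SDEPext) for an external test set.

    Assumed definitions (not stated in the report):
    Q2ext = 100 * (1 - PRESS / SS), SS around the training-set mean;
    SDEPext = sqrt(PRESS / n_ext).
    """
    # Predictive residual sum of squares on the external chemicals.
    press = sum((y - p) ** 2 for y, p in zip(y_ext, y_pred))
    # Sum of squares of external responses around the training mean.
    ss_ext = sum((y - y_train_mean) ** 2 for y in y_ext)
    q2_ext = 100.0 * (1.0 - press / ss_ext)
    sdep_ext = math.sqrt(press / len(y_ext))
    return q2_ext, sdep_ext


# Hypothetical log LC50 values, used only to exercise the function.
y_ext = [0.0, 1.0, 2.0]
y_pred = [0.1, 0.9, 2.2]
q2_ext, sdep_ext = external_validation(y_ext, y_pred, y_train_mean=1.0)
print(f"Q2ext = {q2_ext:.2f}, SDEPext = {sdep_ext:.3f}")
```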


7. COMPARATIVE ANALYSIS OF THE MODEL QUALITY

7.1 Fitness and predictive model comparison

The evaluation of the six QSAR models has been collected in the Excel file RIVMSIAMFishAQUIRE-JRC. The first part of the file is a copy of the DK-file, extended with cells providing the results of the evaluated models. For each model the following columns have been filled in:

• SIDS fish toxicity prediction
• Leverage value
• Training set membership (1 if the SIDS chemical was in the model training set; 0 if not)
• XY+MOA applicability domain (1 if the chemical is within the model applicability domain according to the descriptor and response space (XY) and the mechanism of action (MOA); 0 if not)
• XY applicability domain (1 if the substance is within the model applicability domain based on the descriptor and response space (XY); 0 if not)
• Use value, which was intended to provide a measure of the prediction reliability:
Score = 1: good value according to all criteria (the substance is in the model domain defined by the descriptor and response space (XY domain) and in the domain assessed by its mode of action (MOA domain))
Score = 2: good value even if it does not fulfil all criteria (the substance is in the model domain defined by the descriptor and response space (XY domain) but not in the domain assessed by its mode of action (MOA domain))
Score = 3: unreliable value (the substance is outside the model domain defined by the descriptor and response space (XY domain) and outside the domain assessed by its mode of action (MOA domain))
Score = 4: the reliability cannot be assessed.

The six models evaluated for their predictivity on the SIDS data set are compared in Table XI, which collects the main fitting and predictive regression parameters together with the total number of SIDS chemicals in the training set of each model, the number of SIDS chemicals used to calculate the explained variance in prediction, and the total number of reliable predictions provided by each model.

7.2 Model comparison by ratio of QSAR prediction/SIDS data

For each model, predictions and experimental fish toxicity have been compared simply by counting the number of chemicals whose predicted effect concentration was within a factor of 10, 100 or 1000 of the SIDS test data.

Thus the ratio of the QSAR prediction over the SIDS LC50 for Pimephales promelas has been calculated:


A ratio equal to 1 identifies a perfect prediction, while a ratio lower than 1 indicates that the predicted LC50 is underestimated and thus the chemical toxicity overestimated, as the lethal concentration is inversely related to toxicity (low LC50 values characterize highly toxic chemicals).

For each model the ratio has been calculated first on the entire SIDS data set (57 SIDS experimental test data) and then only on the chemicals falling in the model domain. When the model domain is taken into account, the ratio is always near one, in the range from 0.1 to 10.

This confirms the value of defining the model applicability domain, so that only reliable predictions are provided and predicted values that may be unreliable results of extrapolation are discarded.

7.2.1 Comparison between non-polar narcosis model (QSAR1) and SIDS LC50.

The QSAR model for non-polar narcosis provided 36 reliable predictions for the SIDS chemicals. All the measured effect concentrations expressed as “>” were disregarded, since these values were difficult to compare with QSAR predictions.

Number of chemicals per ratio bin (TGD-NPN model prediction / SIDS test data); ratios below 1 are underestimated, ratios above 1 are overestimated:

Ratio bin      All SIDS chemicals    Chemicals outside domain excluded
< 0.001        0                     0
0.001-0.01     0                     0
0.01-0.1       0                     0
0.1-1.0        15 (12*)              12 (9*)
1.0-10         27 (22*)              24 (19*)
10-100         6 (6*)                0
100-1000       8 (8*)                0
> 1000         1 (1*)                0
Total          57 (49*)              36 (28*)

Fraction of chemicals within a factor of 1, 10 and 100 (TGD-NPN model prediction / SIDS test data, %):

Within a factor of    All SIDS chemicals              Only chemicals in model domain
1                     26 (= (15/57)*100)              32 (= (12/36)*100)
                      24* (= (12/49)*100)             32* (= (9/28)*100)
10                    74 (= ((15+27)/57)*100)         100 (= ((12+24)/36)*100)
                      69* (= ((12+22)/49)*100)        100* (= ((9+19)/28)*100)
100                   84 (= ((15+27+6)/57)*100)
                      82* (= ((12+22+6)/49)*100)

*Eight of the SIDS chemicals were included in the training set of the TGD NPN fish model. If those chemicals are left out when comparing TGD NPN predictions with SIDS data, the percentages drop from 26 to 24, 74 to 69 and 84 to 82 within a factor of 1, 10 and 100, respectively, while when the model domain is taken into account the percentages do not change within a factor of 1 and 10.

Figure 31 illustrates the ratio values of the QSAR prediction over the SIDS LC50, first without accounting for the model applicability domain and then considering only reliable predictions.

When the model domain is taken into account, the ratio is always near one, in the range from 0.1 to 10, which means that, since the domain was correctly identified, only reliable predictions are counted.

Figure 31 - Ratio values of the QSAR1 prediction over the SIDS LC50 (number of chemicals per ratio bin; upper panel: TGD-NPN model; lower panel: TGD-NPN model applied only to chemicals in its domain).
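The ratio binning used throughout section 7.2 can be sketched as below. The bin edges follow the axis labels of the ratio figures; how a ratio falling exactly on a bin edge is assigned (here to the lower bin) is an assumption, and the LC50 pairs are hypothetical values in concentration units rather than SIDS data.

```python
from collections import Counter

# Bin labels as used on the ratio figures of section 7.2.
BIN_LABELS = ["< 0.001", "0.001-0.01", "0.01-0.1", "0.1-1.0",
              "1.0-10", "10-100", "100-1000", "> 1000"]
BIN_UPPER_BOUNDS = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]


def ratio_bin(predicted_lc50, experimental_lc50):
    """Return the bin label for the ratio prediction / experimental LC50."""
    ratio = predicted_lc50 / experimental_lc50
    for label, upper in zip(BIN_LABELS, BIN_UPPER_BOUNDS):
        # Assumption: a ratio exactly on an edge goes to the lower bin.
        if ratio <= upper:
            return label
    return BIN_LABELS[-1]


# Hypothetical (predicted, experimental) LC50 pairs in the same units.
pairs = [(0.5, 1.0), (3.0, 1.0), (50.0, 1.0), (2.0, 4.0)]
counts = Counter(ratio_bin(pred, exp) for pred, exp in pairs)
for label in BIN_LABELS:
    print(label, counts[label])
```

If the data are stored as log LC50 values, as in the regression plots, they would first have to be converted back to concentrations (or the log difference binned instead).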


7.2.2 Comparison between polar narcosis model (QSAR2) and SIDS LC50.

The QSAR model for polar narcosis provided 29 reliable predictions for the SIDS chemicals. All the measured effect concentrations expressed as “>” were disregarded.

Number of chemicals per ratio bin (TGD-PN model prediction / SIDS test data); ratios below 1 are underestimated, ratios above 1 are overestimated:

Ratio bin      All SIDS chemicals    Chemicals outside domain excluded
< 0.001        0                     0
0.001-0.01     0                     0
0.01-0.1       8 (8*)                0
0.1-1.0        27 (24*)              21 (18*)
1.0-10         12 (11*)              8 (7*)
10-100         8 (8*)                0
100-1000       1 (1*)                0
> 1000         1 (1*)                0
Total          57 (53*)              29 (25*)

Fraction of chemicals within a factor of 1, 10 and 100 (TGD-PN model prediction / SIDS test data, %):

Within a factor of    All SIDS chemicals              Only chemicals in model domain
1                     61 (= ((8+27)/57)*100)          72 (= (21/29)*100)
                      60* (= ((8+24)/53)*100)         72* (= (18/25)*100)
10                    82 (= ((8+27+12)/57)*100)       100 (= ((21+8)/29)*100)
                      81* (= ((8+24+11)/53)*100)      100* (= ((18+7)/25)*100)
100                   74 (= ((8+27+12+8)/57)*100)
                      69* (= ((8+24+11+8)/53)*100)

*Five of the SIDS chemicals were included in the training set of the TGD PN fish model. If those chemicals are left out when comparing TGD PN predictions with SIDS data, the percentages drop from 61 to 60, 82 to 81 and 74 to 69 within a factor of 1, 10 and 100, respectively, while when the model domain is taken into account the percentages do not change within a factor of 1 and 10.

Figure 32 illustrates the ratio values of the QSAR prediction over the SIDS LC50, first without accounting for the model applicability domain and then considering only reliable predictions.

As expected, when the model domain is taken into account the ratio is always near one, in the range from 0.1 to 10, which means that, since the domain was correctly identified, only reliable predictions are counted.


Figure 32 - Ratio values of the QSAR2 prediction over the SIDS LC50 (number of chemicals per ratio bin; upper panel: TGD-PN model; lower panel: TGD-PN model applied only to chemicals in its domain).


7.2.3 Comparison between narcosis model (QSAR3) and SIDS LC50.

The QSAR global model for narcosis provided 36 reliable predictions for the SIDS chemicals. All the measured effect concentrations expressed as “>” were disregarded.

Number of chemicals per ratio bin (Narcosis model prediction / SIDS test data); ratios below 1 are underestimated, ratios above 1 are overestimated:

Ratio bin      All SIDS chemicals    Chemicals outside domain excluded
< 0.001        0                     0
0.001-0.01     0                     0
0.01-0.1       4 (4*)                0
0.1-1.0        26 (16*)              23 (13*)
1.0-10         13 (11*)              13 (11*)
10-100         9 (9*)                0
100-1000       4 (4*)                0
> 1000         1 (1*)                0
Total          57 (45*)              36 (24*)

Fraction of chemicals within a factor of 1, 10 and 100 (Narcosis model prediction / SIDS test data, %):

Within a factor of    All SIDS chemicals              Only chemicals in model domain
1                     53 (= ((4+26)/57)*100)          64 (= (23/36)*100)
                      44* (= ((4+16)/45)*100)         54* (= (13/24)*100)
10                    75 (= ((4+26+13)/57)*100)       100 (= ((23+13)/36)*100)
                      69* (= ((4+16+11)/45)*100)      100* (= ((13+11)/24)*100)
100                   91 (= ((4+26+13+9)/57)*100)
                      89* (= ((4+16+11+9)/45)*100)

*Thirteen of the SIDS chemicals were included in the training set of the narcotic fish model. If those chemicals are left out when comparing narcotic predictions with SIDS data, the percentages drop from 53 to 44, 75 to 69 and 91 to 89 within a factor of 1, 10 and 100, respectively. When the model domain is taken into account, the percentages drop from 64 to 54 within a factor of 1, while they do not change within a factor of 10.

The ratio values of the QSAR prediction over the SIDS LC50, first without accounting for the model applicability domain and then considering only reliable predictions, are provided in Figure 33.

As expected, and as for the previous models, when the model domain is taken into account the ratio is always near one, in the range from 0.1 to 10, which means that, since the domain was correctly identified, only reliable predictions are counted.


Figure 33 - Ratio values of the QSAR3 prediction over the SIDS LC50 (number of chemicals per ratio bin; upper panel: N model; lower panel: N model applied only to chemicals in its domain).


7.2.4 Comparison between mixed model (QSAR4) and SIDS LC50.

The QSAR mixed model provided 29 reliable predictions for the SIDS chemicals. All the measured effect concentrations expressed as “>” were disregarded.

Number of chemicals per ratio bin (Mixed model prediction / SIDS test data); ratios below 1 are underestimated, ratios above 1 are overestimated:

Ratio bin      All SIDS chemicals    Chemicals outside domain excluded
< 0.001        0                     0
0.001-0.01     0                     0
0.01-0.1       3 (2*)                1 (0*)
0.1-1.0        25 (21*)              21 (17*)
1.0-10         21 (20*)              6 (5*)
10-100         6 (5*)                1 (0*)
100-1000       1 (1*)                0
> 1000         1 (1*)                0
Total          57 (50*)              29 (22*)

Fraction of chemicals within a factor of 1, 10 and 100 (Mixed model prediction / SIDS test data, %):

Within a factor of    All SIDS chemicals              Only chemicals in model domain
1                     49 (= ((3+25)/57)*100)          76 (= ((1+21)/29)*100)
                      46* (= ((2+21)/50)*100)         77* (= ((0+17)/22)*100)
10                    86 (= ((3+25+21)/57)*100)       97 (= ((1+21+6)/29)*100)
                      86* (= ((2+21+20)/50)*100)      100* (= ((0+17+5)/22)*100)
100                   96 (= ((3+25+21+6)/57)*100)
                      96* (= ((2+21+20+5)/50)*100)

*Nine of the SIDS chemicals were included in the training set of the Mixed fish model. If those chemicals are left out when comparing mixed model predictions with SIDS data, the percentages drop from 49 to 46 within a factor of 1, while they do not change within a factor of 10 and 100. When the model domain is taken into account, the percentages grow from 76 to 77 and from 97 to 100 within a factor of 1 and 10, respectively, meaning that not all the training chemicals were very well predicted.

The ratio values of the QSAR prediction over the SIDS LC50, first without accounting for the model applicability domain and then considering only reliable predictions, are provided in Figure 34.


Figure 34 - Ratio values of the QSAR4 prediction over the SIDS LC50 (number of chemicals per ratio bin; upper panel: Mixed model; lower panel: Mixed model applied only to chemicals in its domain).


7.2.5 Comparison between E-state indices model (QSAR5) and SIDS LC50.

The QSAR model based on E-state indices provided 25 reliable predictions for the SIDS chemicals. All the measured effect concentrations expressed as “>” were disregarded.

Number of chemicals per ratio bin (E-state model prediction / SIDS test data); ratios below 1 are underestimated, ratios above 1 are overestimated:

Ratio bin      All SIDS chemicals    Chemicals outside domain excluded
< 0.001        0                     0
0.001-0.01     3 (3*)                0
0.01-0.1       7 (7*)                0
0.1-1.0        21 (14*)              16 (9*)
1.0-10         17 (16*)              9 (8*)
10-100         7 (7*)                0
100-1000       2 (2*)                0
> 1000         0 (0*)                0
Total          57 (49*)              25 (17*)

Fraction of chemicals within a factor of 1, 10 and 100 (E-state model prediction / SIDS test data, %):

Within a factor of    All SIDS chemicals                  Only chemicals in model domain
1                     54 (= ((3+7+21)/57)*100)            64 (= (16/25)*100)
                      49* (= ((3+7+14)/49)*100)           53* (= (9/17)*100)
10                    84 (= ((3+7+21+17)/57)*100)         100 (= ((16+9)/25)*100)
                      82* (= ((3+7+14+16)/49)*100)        100* (= ((9+8)/17)*100)
100                   96 (= ((3+7+21+17+7)/57)*100)
                      96* (= ((3+7+14+16+7)/49)*100)

*Eight of the SIDS chemicals were included in the training set of the E-state fish model. If those chemicals are left out when comparing E-state predictions with SIDS data, the percentages drop from 54 to 49 and from 84 to 82 within a factor of 1 and 10, respectively. When the model domain is taken into account, the percentages drop from 64 to 53 within a factor of 1, while they do not change within a factor of 10.

The ratio values of the QSAR prediction over the SIDS LC50, first without accounting for the model applicability domain and then considering only reliable predictions, are provided in Figure 35.

As for the previous models, when the model domain is taken into account the ratio is always near one, in the range from 0.1 to 10, which means that, since the domain was correctly identified, only reliable predictions are counted.


Figure 35 - Ratio values of the QSAR5 prediction over the SIDS LC50 (number of chemicals per ratio bin; upper panel: E-state model; lower panel: E-state model applied only to chemicals in its domain).


7.2.6 Comparison between TerraQSAR model (QSAR6) and SIDS LC50.

Neither the number of reliable predictions of the SIDS chemicals provided by the TerraQSAR model nor the model applicability domain could be evaluated.

Number of chemicals per ratio bin (TerraQSAR model prediction / SIDS test data); ratios below 1 are underestimated, ratios above 1 are overestimated:

Ratio bin      N. chemicals
< 0.001        0
0.001-0.01     0
0.01-0.1       1
0.1-1.0        33
1.0-10         22
10-100         0
100-1000       1
> 1000         0
Total          57

Fraction of chemicals within a factor of 1, 10 and 100 (TerraQSAR model prediction / SIDS test data, %):

Within a factor of    TerraQSAR model prediction / SIDS test data (%)
1                     60 (= ((33+1)/57)*100)
10                    98 (= ((33+1+22)/57)*100)
100                   98 (= ((33+1+22)/57)*100)

Figure 36 - Ratio values of the QSAR6 prediction over the SIDS LC50 (number of chemicals per ratio bin, TerraQSAR model).

In this case it was not possible to evaluate the model applicability domain, due to the lack of information provided by the authors.
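The percentages in the "within a factor of" tables follow from the bin counts by cumulative summation, as the formulas printed next to each value indicate. A minimal sketch, assuming bins ordered from "< 0.001" to "> 1000" and rounding to the nearest integer (the function name and the cut-off indices are illustrative, not from the report):

```python
def cumulative_percentages(bin_counts, cutoffs=(4, 5, 6)):
    """Percentage of chemicals in the first k bins, for each k in cutoffs.

    With eight bins ordered from '< 0.001' to '> 1000', the first 4, 5 and 6
    bins end at the upper bounds 1, 10 and 100, matching the report's
    'within a factor of 1, 10, 100' rows.
    """
    total = sum(bin_counts)
    return [round(100.0 * sum(bin_counts[:k]) / total) for k in cutoffs]


# Bin counts as listed in the report for the TerraQSAR model (57 chemicals).
terraqsar_counts = [0, 0, 1, 33, 22, 0, 1, 0]
percentages = cumulative_percentages(terraqsar_counts)
print(percentages)  # [60, 98, 98]

# Bin counts as listed for the TGD-NPN model, all 57 SIDS chemicals.
tgd_npn_counts = [0, 0, 0, 15, 27, 6, 8, 1]
print(cumulative_percentages(tgd_npn_counts))  # [26, 74, 84]
```

Note that, as computed in the report, these fractions accumulate from the lowest ratio bin upward, so "within a factor of 10" includes every chemical with a ratio below 10.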


ACKNOWLEDGEMENTS

The authors would like to thank Eva Bay Wedebye (Danish Institute for Food and Veterinary Research), Terry Schultz (Department of Comparative Medicine, College of Veterinary Medicine at the University of Tennessee), Klaus Kaiser (TerraBase Inc., Ontario), Christine Russom (U.S. EPA, Office of Research and Development, National Health and Environmental Effects Research Laboratory (NHEERL), Mid-Continent Ecology Division (MED), Duluth, MN) and Karin Aschberger (European Chemicals Bureau, Joint Research Centre, Italy). Manuela Pavan acknowledges receipt of a post-doctoral grant from the Joint Research Centre (JRC contract 22460-2004-11 P1B30 ISP IT).


REFERENCES

AQUIRE (AQUatic toxicity Information REtrieval), U.S. Environmental Protection Agency. 2002. ECOTOX User Guide: ECOTOXicology Database System. Version 3.0. Available: http://www.epa.gov/ecotox/

Chem3D Ultra 9.0. Chemistry software developed and provided by the CambridgeSoft company.

European Commission (1995). QSAR for Predicting Fate and Effects of Chemicals in the Environment, Final Report of DG XII Contract No. EV5V-CT92-0211.

European Economic Community (1996). Technical Guidance Document in Support of Commission Directive 93/67/EEC on Risk Assessment for New Notified Substances and Commission Regulation (EC) No 1488/94 on Risk Assessment for Existing Substances, Luxembourg: European Commission, Office for Official Publications of the European Communities.

Huuskonen, J. (2003). QSAR modeling with the electrotopological state indices: predicting the toxicity of organic chemicals. Chemosphere, 50, 949 – 953.

Kaiser, K.L.E., S.P. Niculescu, and M.B. McKinnon. On the simple linear regression, the multiple linear regression and the elementary probabilistic neural network with Gaussian kernel’s performance in modeling toxicity values to fathead minnow based on Microtox data, the octanol/water partition coefficient and various structural descriptors for a 419 compound data set. QSAR in Environmental Sciences - VII, F. Chen and G. Schüürmann (Eds.), SETAC Press, Pensacola, FL, pp. 285-297 (1997).

Kaiser, K.L.E., S.P. Niculescu, and G. Schüürmann. Feed forward backpropagation neural networks and their use in predicting the acute toxicity of chemicals to the fathead minnow. Water Quality Res. J. Canada, 32: 637-657 (1997); http://www.cciw.ca/wqrjc/32-3/32-3-637.htm

Kaiser, K.L.E., and S.P. Niculescu. Using probabilistic neural networks to model the toxicity of chemicals to the fathead minnow (Pimephales promelas): A study based on 865 compounds. Chemosphere, 38: 3237-3245 (1999).

Kaiser, K.L.E., and S.P. Niculescu. Probabilistic neural network (PNN) methodology for the prediction of acute toxicity of chemicals to fathead minnow based solely on chemical structure-derived input parameters. National Water Research Institute Contribution, No. AEP-TN99-OO1, 39 p. (1999).

Kaiser, K.L.E., and S.P. Niculescu. Modeling the acute toxicity of chemicals to Daphnia magna: a probabilistic neural network approach. Environ. Toxicol. Chem., 20: 420-431 (2001).

Kaiser, K.L.E., S.P. Niculescu, and T.W. Schultz. Probabilistic neural network modeling of the toxicity of chemicals to Tetrahymena pyriformis with molecular fragment descriptors. SAR & QSAR Environ. Res, 13: 57-67 (2002).


Kaiser, K.L.E., and S.P. Niculescu. On the PNN modeling of estrogen receptor binding data for carboxylic acid esters and organochlorine compounds. Water Qual. Res. J. Canada, 36: 619-630 (2001); http://www.cciw.ca/wqrjc/36-3/36-3-619..htm.

OECD ENV/JM/Mono(2004)24. Environment Directorate Joint Meeting of the Chemicals Committee and the Working Party on Chemicals, Pesticides and Biotechnology. OECD Series on Testing and Assessment. Number 49. The report from the Expert Group on (Quantitative) Structure-Activity Relationships [(Q)SARs] on the Principles for the Validation of (Q)SARs.

OECD ENV/JM/TG(2004)26. Comparison of SIDS Test data with (Q)SAR predictions for acute aquatic toxicity, biodegradability and mutagenicity on organic chemicals discussed at SIAM 11-18.

Pavan, M., Worth, A., Netzeva, T. (2005). Preliminary analysis of an aquatic toxicity dataset and assessment of QSAR models for narcosis. European Commission Report EUR 21749 EN.

Russom, C.L., Bradbury, S.P., Broderius, S.J., Hammermeister, D.E., Drummond, R.A. (1997). Predicting modes of action from chemical structure: Acute toxicity in the fathead minnow (Pimephales promelas). Environmental Toxicology and Chemistry, 16, 948-967.

TerraQSAR™ – FHM, Fathead minnow 96-hr LC50 Estimation, Software vs 1.1. TerraBase Inc. 1063 King St. West, Suite 130. http://www.terrabase-inc.com/

Veith, G.D., Call, D.J., and Brooke, L.T. 1983. Structure-toxicity relationships for the fathead minnow, Pimephales promelas: Narcotic industrial chemicals. Can. J. Fish. Aquat. Sci. 40: 743-748.

Veith, G.D., Broderius, S.J. (1987). Structure-toxicity relationships for industry chemicals causing type (II) narcosis syndrome. In: Kaiser, K.L.E. (ed) QSAR in Environmental Toxicology – II. D. Reidel, Dordrecht, pp. 385-391.

Veith, G.D., Mekenyan, O.G. 1993. A QSAR approach for estimating the aquatic toxicity of soft electrophiles [QSAR for soft electrophiles]. Quantitative Structure-Activity Relationships 12, 349-356.

Verhaar, H.J.M., van Leeuwen, C.J. and Hermens, J.L.M. (1992). Classifying environmental pollutants. 1. Structure-activity relationships for prediction of aquatic toxicity. Chemosphere, 25, 471-491.

Verhaar, H.J.M., Mulder, W. and Hermens, J.L.M. (1995). QSARs for ecotoxicity. In: Overview of structure-activity relationships for environmental endpoints, Part 1: General outline and procedure. Hermens, J.L.M. (Ed), Report prepared within the framework of the project “QSAR for Prediction of Fate and Effects of Chemicals in the Environment”, an international project of the Environmental Technologies RTD Programme (DGXII/D-1) of the European Commission under contract number EV5V-CT92-0211.


Verhaar, H.J.M., Solbe, J., Speksnijder, J., van Leeuwen, C.J. and Hermens, J.L.M. (2000). Classifying environmental pollutants: Part 3. External validation of the classification system. Chemosphere, 40, 875-883.


TABLES

Table I – SIDS test data.
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
1 | 50-00-0 | Formaldehyde- | SB | SB | NPN | SB | R | R | -3.081 | 0.35
2 | 56-81-5 | 1,2,3-Propanetriol | NPN | NPN | NPN | NPN | N | NPN |  | -1.65
3 | 57-55-6 | 1,2-Propanediol | NPN | NPN | NPN | NPN | N | NPN | -0.838 | -0.78
4 | 58-08-2 | 1H-Purine-2,6-dione, 3,7-dihydro-1,3,7-trimethyl- | CNS | NPN | CNS | CNS | S |  |  | 0.16
5 | 58-55-9 | 1H-Purine-2,6-dione, 3,7-dihydro-1,3-dimethyl- | CNS | NPN | NPN | NPN | N |  |  | -0.39
6 | 60-00-4 | Glycine, N,N'-1,2-ethanediylbis[N-(carboxymethyl)- | NPN_Log D | CARB. ACID | NPN | UNK | N* |  | -3.689 | -3.86
7 | 64-02-8 | Glycine, N,N'-1,2-ethanediylbis[N-(carboxymethyl)-, tetrasodium salt | NPN_Log D | CARB. ACID | NPN | UNK | N* | R/S |  | -3.86
8 | 68-12-2 | Formamide, N,N-dimethyl- | NPN | SB | NPN | NPN | N | R | -0.839 | -0.93
9 | 71-36-3 | 1-Butanol | NPN | NPN | NPN | NPN | N | NPN | -1.601 | 0.84
10 | 74-83-9 | Methane, bromo- | NPN | NPN | NPN | NPN | N | NPN |  | 1.18
11 | 74-87-3 | Methane, chloro- | NPN | NPN | NPN | NPN | N | NPN |  | 1.09
12 | 75-01-4 | Ethene, chloro- | SN2 | NPN | NPN | NPN | N |  |  | 1.62
13 | 75-10-5 | Methane, difluoro- | NPN | NPN | NPN | NPN | N |  |  | 0.71
14 | 75-38-7 | Ethene, 1,1-difluoro- | SN2 | NPN | NPN | NPN | N |  |  | 1.24
15 | 75-56-9 | Oxirane, methyl- | RAD | EPOX | ALKY-ARYL | UNK | R | R |  | 0.37
16 | 75-68-3 | Ethane, 1-chloro-1,1-difluoro- | NPN | NPN | NPN | NPN | N | NPN |  | 2.05
17 | 77-92-9 | 1,2,3-Propanetricarboxylic acid, 2-hydroxy- | NPN_Log D | CARB. ACID | NPN | NPN | N* |  |  | -1.67
18 | 78-59-1 | 2-Cyclohexen-1-one, 3,5,5-trimethyl- | MTA | MTA | NPN | MTA | R | R | -2.762 | 2.62


Table I – SIDS test data (continued).
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
19 | 78-70-6 | 1,6-Octadien-3-ol, 3,7-dimethyl- | PE | PE | NPN | PE | R | R |  | 3.38
20 | 78-87-5 | Propane, 1,2-dichloro- | NPN | NPN | NPN | NPN | N | NPN | -2.907 | 2.25
21 | 78-92-2 | 2-Butanol | NPN | NPN | NPN | NPN | N | NPN | -1.305 | 0.77
22 | 79-00-5 | Ethane, 1,1,2-trichloro- | NPN | NPN | NPN | NPN | N | NPN | -3.214 | 2.01
23 | 79-06-1 | 2-Propenamide | MTA | MTA | ALKY-ARYL | MTA | R | R | -2.767 | -0.81
24 | 79-10-7 | 2-Propenoic acid | MTA | REAC. ACID | ALKY-ARYL | UNK | R | R |  | 0.44
25 | 79-11-8 | Acetic acid, chloro- | SN2 | REAC. ACID | ALKY-ARYL | UNK | R |  |  | 0.34
26 | 79-20-9 | Acetic acid, methyl ester | NPN | NPN | EN | NPN | N |  | -2.365 | 0.37
27 | 79-31-2 | Propanoic acid, 2-methyl- | NPN_Log D | CARB. ACID | NPN | UNK | N* |  |  | 1
28 | 79-34-5 | Ethane, 1,1,2,2-tetrachloro- | NPN | NPN | NPN | NPN | N | NPN | -3.917 | 2.19
29 | 79-39-0 | 2-Propenamide, 2-methyl- | MTA | MTA | NPN | MTA | R | R |  | -0.26
30 | 79-41-4 | 2-Propenoic acid, 2-methyl- | MTA | REAC. ACID | NPN | UNK | R | R |  | 0.99
31 | 80-05-7 | Phenol, 4,4'-(1-methylethylidene)bis- | PN | PN | PN | PN | N | PN | -4.696 | 3.64
32 | 80-62-6 | 2-Propenoic acid, 2-methyl-, methyl ester | MTA | MTA | EN | MTA | R | R | -2.552 | 1.28
33 | 81-14-1 | Ethanone, 1-[4-(1,1-dimethylethyl)-2,6-dimethyl-3,5-dinitrophenyl]- | PN | REAC. | REAC. DINITRO | UNK | R |  |  | 4.31
34 | 81-15-2 | Benzene, 1-(1,1-dimethylethyl)-3,5-dimethyl-2,4,6-trinitro- | PN | REAC. | REAC. DINITRO | UNK | R |  |  | 4.45
35 | 84-74-2 | 1,2-Benzenedicarboxylic acid, dibutyl ester | PN | NPN | DE | UNK | N | R | -5.306 | 4.61
36 | 87-56-9 | 2-Butenoic acid, 2,3-dichloro-4-oxo-, (Z)- | MTA | REAC. ACID | ALKY-ARYL | UNK | R | R |  | 1.37


Table I – SIDS test data (continued).
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
37 | 88-12-0 | 2-Pyrrolidinone, 1-ethenyl- | MTA | REAC. | NPN | UNK | R |  |  | 0.25
38 | 88-19-7 | Benzenesulfonamide, 2-methyl- | UNK | NONSPEC. ELECT | NPN | UNK | UNK |  |  | 0.92
39 | 88-44-8 | Benzenesulfonic acid, 2-amino-5-methyl- | UNK | STRONG ACID | PN | UNK | UNK | R |  | -1.53
40 | 88-60-8 | Phenol, 2-(1,1-dimethylethyl)-5-methyl- | PN | PN | PN | PN | N |  |  | 3.97
41 | 88-73-3 | Benzene, 1-chloro-2-nitro- | PN | SOFT ELECT | NPN | UNK | N | PN |  | 2.46
42 | 88-74-4 | Benzenamine, 2-nitro- | PN | PN | NPN | PN | N |  |  | 2.02
43 | 91-15-6 | 1,2-Benzenedicarbonitrile | UNK | SOFT ELECT | NPN | UNK | UNK | R |  | 1.09
44 | 91-76-9 | 1,3,5-Triazine-2,4-diamine, 6-phenyl- | CNS | CNS | PN | CNS | S |  |  | 1.44
45 | 93-68-5 | Butanamide, N-(2-methylphenyl)-3-oxo- | PN | NONSPEC. ELECT | REAC. DIKE | UNK | R |  | -2.782 | 0.99
46 | 94-36-0 | Peroxide, dibenzoyl | UNK | NPN | SULPHY | UNK | UNK | R |  | 3.43
47 | 95-31-8 | 2-Benzothiazolesulfenamide, N-(1,1-dimethylethyl)- | CNS | NPN | NPN | NPN | N |  |  | 2.56
48 | 95-49-8 | Benzene, 1-chloro-2-methyl- | PN | NPN | NPN | NPN | N | NPN |  | 3.18
49 | 95-50-1 | Benzene, 1,2-dichloro- | PN | NPN | NPN | NPN | N | NPN | -3.411 | 3.28
50 | 96-18-4 | Propane, 1,2,3-trichloro- | NPN | NPN | NPN | NPN | N | NPN | -3.346 | 2.5
51 | 96-29-7 | 2-Butanone, oxime | NPN | NPN | NPN | NPN | N | R | -2.014 | 1.69
52 | 96-31-1 | Urea, N,N'-dimethyl- | NPN | NPN | NPN | NPN | N | R |  | -0.62
53 | 96-33-3 | 2-Propenoic acid, methyl ester | MTA | MTA | ACRY | MTA | R | R |  | 0.73
54 | 97-72-3 | Propanoic acid, 2-methyl-, anhydride | UNK | REAC. HYD. | DE | UNK | R |  |  | 1.24
55 | 98-07-7 | Benzene, (trichloromethyl)- | PN | NPN | NPN | NPN | N | R |  | 3.9


Table I – SIDS test data (continued).
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
56 | 98-54-4 | Phenol, 4-(1,1-dimethylethyl)- | PN | PN | PN | PN | N | PN | -4.466 | 3.42
57 | 98-59-9 | Benzenesulfonyl chloride, 4-methyl- | UNK | SN2 | SULPHY | UNK | R |  |  | 3.49
58 | 98-92-0 | 3-Pyridinecarboxamide | PN | PN | PN | PN | N |  |  | -0.45
59 | 99-04-7 | Benzoic acid, 3-methyl- | PN_Log D | CARB. ACID | NPN | UNK | N* |  |  | 2.42
60 | 99-54-7 | Benzene, 1,2-dichloro-4-nitro- | PN | SN2 | NPN | UNK | N | PN |  | 3.1
61 | 99-99-0 | Benzene, 1-methyl-4-nitro- | PN | NONSPEC. ELECT | NPN | UNK | N | PN | -3.438 | 2.36
62 | 100-00-5 | Benzene, 1-chloro-4-nitro- | PN | SN2 | NPN | UNK | N | PN |  | 2.46
63 | 100-21-0 | 1,4-Benzenedicarboxylic acid | PN_Log D | CARB. ACID | NPN | UNK | N* |  |  | 1.76
64 | 100-37-8 | Ethanol, 2-(diethylamino)- | NUC | AMIN. ALCH | NPN | UNK | UNK |  | -1.818 | 0.05
65 | 100-41-4 | Benzene, ethyl- | PN | NPN | NPN | NPN | N | NPN | -3.943 | 3.03
66 | 102-06-7 | Guanidine, N,N'-diphenyl- | UNK | NPN | NPN | NPN | N | R |  | 2.89
67 | 102-76-1 | 1,2,3-Propanetriol, triacetate | EN | NPN | DE | UNK | UNK |  | -3.121 | 0.36
68 | 103-11-7 | 2-Propenoic acid, 2-ethylhexyl ester | MTA | MTA | ACRY | MTA | R | R |  | 4.09
69 | 103-84-4 | Acetamide, N-phenyl- | PN | NONSPEC. ELECT | NPN | UNK | N | R |  | 1.1
70 | 105-60-2 | 2H-Azepin-2-one, hexahydro- | PN | NPN | NPN | NPN | N | R |  | 0.66
71 | 106-31-0 | Butanoic acid, anhydride | UNK | REAC. HYD. | DE | UNK | R | R |  | 1.39
72 | 106-46-7 | Benzene, 1,4-dichloro- | PN | NPN | NPN | NPN | N | NPN | -4.015 | 3.28
73 | 106-63-8 | 2-Propenoic acid, 2-methylpropyl ester | MTA | MTA | ACRY | MTA | R |  | -4.788 | 2.13
74 | 106-88-7 | Oxirane, ethyl- | RAD | EPOX | ALKY-ARYL | UNK | R | R |  | 0.86


Table I – SIDS test data (continued).
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
75 | 107-06-2 | Ethane, 1,2-dichloro- | NPN | NPN | NPN | NPN | N | NPN | -2.931 | 1.83
76 | 107-15-3 | 1,2-Ethanediamine | AN | NPN | REAC. | UNK | UNK |  | -2.576 | -1.62
77 | 107-22-2 | Ethanedial- | MTA | SB | CARB. REAC. | UNK | R | R | -2.431 | -1.66
78 | 107-41-5 | 2,4-Pentanediol, 2-methyl- | NPN | NPN | NPN | NPN | N | NPN | -1.089 | 0.58
79 | 107-86-8 | 2-Butenal, 3-methyl- | MTA | MTA | ALKY-ARYL | MTA | R | R |  | 1.15
80 | 107-92-6 | Butanoic-acid- | NPN_Log D | CARB. ACID | NPN | UNK | N* |  |  | 1.07
81 | 107-98-2 | 2-Propanol, 1-methoxy- | NPN | NPN | NPN | NPN | N | NPN | -0.637 | -0.49
82 | 108-44-1 | Benzenamine, 3-methyl- | PN | PN | PN | PN | N | PN |  | 1.62
83 | 108-65-6 | 2-Propanol, 1-methoxy-, acetate | EN | NPN | EN | EN | N |  |  | 0.52
84 | 108-77-0 | 1,3,5-Triazine, 2,4,6-trichloro- | UNK | SN2 | NPN | UNK | UNK | R |  | 1.73
85 | 108-88-3 | Benzene, methyl- | PN | NPN | NPN | NPN | N | NPN | -3.549 | 2.54
86 | 109-66-0 | Pentane- | NPN | NPN | NPN | NPN | N | NPN |  | 2.8
87 | 110-16-7 | 2-Butenedioic acid (Z)- | MTA | DICARB. ACID. | ALKY-ARYL | UNK | R |  | -4.366 | 0.05
88 | 110-19-0 | Acetic acid, 2-methylpropyl ester | EN | NPN | EN | EN | N |  |  | 1.77
89 | 110-65-6 | 2-Butyne-1,4-diol | PE | PE | ALKY-ARYL | PE | R | R | -3.206 | -0.93
90 | 110-83-8 | Cyclohexene- | PN | NPN | NPN | NPN | N |  |  | 2.96
91 | 110-85-0 | Piperazine- | PN | NPN | NPN | NPN | N |  |  | -0.8
92 | 110-93-0 | 5-Hepten-2-one, 6-methyl- | NPN | NPN | NPN | NPN | N |  | -3.167 | 2.06
93 | 110-98-5 | 2-Propanol, 1,1'-oxybis- | NPN | NPN | NPN | NPN | N | NPN |  | -0.64
94 | 112-57-2 | 1,2-Ethanediamine, N-(2-aminoethyl)-N'-[2-[(2-aminoethyl)amino]ethyl]- | NPN | NPN | NPN | NPN | N |  |  | -3.16
95 | 112-85-6 | Docosanoic-acid- | NPN_Log D | CARB. ACID | NPN | UNK | N* |  |  | 9.91
96 | 115-07-1 | 1-Propene | NPN | NPN | NPN | NPN | N | NPN |  | 1.68
97 | 115-11-7 | 1-Propene, 2-methyl- | NPN | NPN | NPN | NPN | N | NPN |  | 2.23


Table I – SIDS test data (continued).
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
98 | 115-86-6 | Phosphoric acid, triphenyl ester | AChE | REAC. | NPN | UNK | UNK |  | -5.594 | 4.7
99 | 115-95-7 | 1,6-Octadien-3-ol, 3,7-dimethyl-, acetate | MTA | REAC. | EN | UNK | R |  |  | 4.39
100 | 118-79-6 | Phenol, 2,4,6-tribromo- | WARE | PN | PN | PN | N | PN | -4.705 | 4.18
101 | 119-47-1 | Phenol, 2,2'-methylenebis[6-(1,1-dimethylethyl)-4-methyl- | NPN | NPN | PN | NPN | N | PN |  | 7.97
102 | 120-61-6 | 1,4-Benzenedicarboxylic acid, dimethyl ester | PN_Log D | NPN | DE | UNK | N* |  |  | 1.66
103 | 120-80-9 | 1,2-Benzenediol | PE_RAD | PE | PN | UNK | R |  | -4.288 | 1.03
104 | 120-83-2 | Phenol, 2,4-dichloro- | PN | PN | PN | PN | N | PN | -4.277 | 2.8
105 | 121-91-5 | 1,3-Benzenedicarboxylic acid | PN_Log D | DICARB. ACID. | NPN | UNK | N* |  |  | 1.76
106 | 122-52-1 | Phosphorous acid, triethyl ester | AChE | REAC. | NPN | UNK | UNK | R/S |  | 0.74
107 | 122-99-6 | Ethanol, 2-phenoxy- | PN | NPN | NPN | NPN | N |  | -2.604 | 1.1
108 | 123-54-6 | 2,4-Pentanedione | UNK | REAC. NONSPECIFIC | REAC. DIKE | UNK | R | NPN | -2.860 | 0.05
109 | 123-77-3 | Diazenedicarboxamide- | MTA | NTAS | NPN | UNK | UNK | R |  | -3.89
110 | 123-86-4 | Acetic acid, butyl ester | EN | NPN | EN | EN | N |  | -3.810 | 1.85
111 | 124-04-9 | Hexanedioic-acid- | NPN_Log D | DICARB. ACID. | NPN | UNK | N* |  | -3.178 | 0.23
112 | 126-73-8 | Phosphoric-acid-tributyl-ester- | AChE | REAC. PHOSP. | OP-AchE | AChE | S | R/S | -4.774 | 3.82
113 | 126-98-7 | 2-Propenenitrile, 2-methyl- | MTA | MTA | NPN | MTA | R |  |  | 0.76
114 | 127-19-5 | Acetamide, N,N-dimethyl- | NPN | NONSPEC. ELECT | NPN | NPN | N | R |  | -0.49
115 | 128-37-0 | Phenol, 2,6-bis(1,1-dimethylethyl)-4-methyl- | PN | PN | PN | PN | N | PN |  | 5.03
116 | 135-19-3 | 2-Naphthalenol | PN | PN | UNCOUPL | PN | N |  | -4.620 | 2.69
117 | 140-88-5 | 2-Propenoic acid, ethyl ester | MTA | MTA | ACRY | MTA | R | R | -4.603 | 1.22


Table I – SIDS test data (continued).
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
118 | 141-10-6 | 3,5,9-Undecatrien-2-one, 6,10-dimethyl- | MTA | MTA | ALKY-ARYL | MTA | R |  |  | 4.43
119 | 141-32-2 | 2-Propenoic acid, butyl ester | MTA | MTA | ACRY | MTA | R | R |  | 2.2
120 | 141-78-6 | Acetic-acid-ethyl-ester- | EN | MTA | EN | EN | N |  | -2.583 | 0.86
121 | 141-97-9 | Butanoic acid, 3-oxo-, ethyl ester | EN | REAC. NONSPECIFIC | REAC. DIKE | UNK | R | R |  | -0.2
122 | 144-55-8 | Carbonic-acid-monosodium-salt- | PE | CARB. ACID | NPN | UNK | N* |  |  | -0.46
123 | 150-90-3 | Butanedioic acid, disodium salt | NPN_Log D | CARB. ACID | NPN | UNK | N* |  |  | -0.75
124 | 288-32-4 | 1H-Imidazole | UNK | NPN | NPN | NPN | N |  |  | 0.06
125 | 461-58-5 | Guanidine, cyano- | UNK | REAC. | NPN | UNK | UNK | R |  | -1.34
126 | 505-32-8 | 1-Hexadecen-3-ol, 3,7,11,15-tetramethyl- | NPN | PE | NPN | NPN | N |  |  | 8.23
127 | 528-44-9 | 1,2,4-Benzenetricarboxylic acid | PN_Log D | CARB. ACID | NPN | UNK | N* |  |  | 0.95
128 | 552-30-7 | 5-Isobenzofurancarboxylic acid, 1,3-dihydro-1,3-dioxo- | UNK | REAC. | ACY | UNK | R | R |  | 1.96
129 | 556-82-1 | 2-Buten-1-ol, 3-methyl- | PE | NPN | NPN | NPN | N | R |  | 1.17
130 | 611-19-8 | Benzene, 1-chloro-2-(chloromethyl)- | PN | NPN | ALKY-ARYL | UNK | N |  |  | 3.44
131 | 760-23-6 | 1-Butene, 3,4-dichloro- | SN2 | SN2 | ALKY-ARYL | SN2 | R | R | -4.184 | 2.6
132 | 770-35-4 | 2-Propanol, 1-phenoxy- | PN | NPN | NPN | NPN | N |  | -2.735 | 1.52
133 | 793-24-8 | 1,4-Benzenediamine, N-(1,3-dimethylbutyl)-N'-phenyl- | NPN | NPN | NPN | NPN | N | R/S |  | 4.68
134 | 822-06-0 | Hexane, 1,6-diisocyanato- | ISOCYA | REAC. HYD. | ISOCYA | UNK | R |  |  | 3.2


Table I – SIDS test data (continued).
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
135 | 839-90-7 | 1,3,5-Triazine-2,4,6(1H,3H,5H)-trione, 1,3,5-tris(2-hydroxyethyl)- | UNK | REAC. | NPN | UNK | UNK |  |  | 0.07
136 | 868-77-9 | 2-Propenoic acid, 2-methyl-, 2-hydroxyethyl ester | MTA | MTA | EN | MTA | R | R | -2.758 | 0.3
137 | 868-85-9 | Phosphonic acid, dimethyl ester | UNK | REAC. | NPN | UNK | UNK |  | -2.689 | -1.13
138 | 919-30-2 | 3-Aminopropyltriethoxysilane | UNK | REAC. | UNK | UNK | UNK |  |  | 0.31
139 | 1163-19-5 | Benzene, 1,1'-oxybis[2,3,4,5,6-pentabromo- | NPN | NPN | NPN | NPN | N | R/S |  | 12.11
140 | 1477-55-0 | 1,3-Benzenedimethanamine | PN | NPN | NPN | NPN | N |  |  | 0.15
141 | 1490-04-6 | Cyclohexanol, 5-methyl-2-(1-methylethyl)- | NPN | NPN | NPN | NPN | N | NPN | -3.929 | 3.38
142 | 1634-04-4 | Propane, 2-methoxy-2-methyl- | NPN | NPN | NPN | NPN | N | NPN | -2.118 | 1.43
143 | 1717-00-6 | HCFC 141b | NPN | NPN | NPN | NPN | N |  |  | 2.37
144 | 2216-51-5 | Cyclohexanol, 5-methyl-2-(1-methylethyl)-, [1R-(1alpha,2beta,5alpha)]- | NPN | NPN | NPN | NPN | N |  |  | -1.67
145 | 2403-88-5 | 4-Piperidinol, 2,2,6,6-tetramethyl- | NPN | NPN | NPN | NPN | N |  |  | 0.94
146 | 2432-99-7 | Undecanoic acid, 11-amino- | NPN_Log D | CARB. ACID | NPN | UNK | N* |  |  | -0.16
147 | 2439-35-2 | 2-Propenoic acid, 2-(dimethylamino)ethyl ester | MTA | MTA | ACRY | MTA | R | R |  | 0.42
148 | 2837-89-0 | Ethane, 2-chloro-1,1,1,2-tetrafluoro- | NPN | NPN | NPN | NPN | N |  |  | 1.86
149 | 2855-13-2 | Cyclohexanemethanamine, 5-amino-1,3,3-trimethyl- | NPN | NPN | NPN | NPN | N |  |  | 1.9


Table I – SIDS test data (continued).
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
150 | 2867-47-2 | 2-Propenoic acid, 2-methyl-, 2-(dimethylamino)ethyl ester | MTA | MTA | EN | MTA | R | R |  | 0.97
151 | 3268-49-3 | Propanal, 3-(methylthio)- | NPN | SB | CARB. REAC. | UNK | R | R |  | 0.41
152 | 3319-31-1 | 1,2,4-Benzenetricarboxylic acid, tris(2-ethylhexyl) ester | NPN_Log D | NPN | DE | UNK | N* |  |  | 11.59
153 | 3323-53-3 | Hexanedioic acid, compd. with 1,6-hexanediamine (1:1) | NPN_Log D | NPN | NPN | NPN | N* |  |  | 0.23
154 | 3452-97-9 | 1-Hexanol, 3,5,5-trimethyl- | NPN | NPN | NPN | NPN | N | NPN |  | 3.11
155 | 4016-24-4 | Hexadecanoic acid, 2-sulfo-, 1-methyl ester, sodium salt | NPN_Log D | NPN | EN | UNK | N* |  |  | 6.21
156 | 4169-04-4 | 1-Propanol, 2-phenoxy- | PN | NPN | NPN | NPN | N |  | -2.735 | 1.52
157 | 4454-05-1 | 2H-Pyran, 3,4-dihydro-2-methoxy- | PN | NPN | NPN | NPN | N | NPN |  | 0.88
158 | 4457-71-0 | 1,5-Pentanediol, 3-methyl- | NPN | NPN | NPN | NPN | N |  |  | 0.69
159 | 4979-32-2 | 2-Benzothiazolesulfenamide, N,N-dicyclohexyl- | CNS | NPN | NPN | NPN | N |  |  | 5.96
160 | 5102-83-0 | Butanamide, 2,2'-[(3,3'-dichloro[1,1'-biphenyl]-4,4'-diyl)bis(azo)]bis[N-(2,4-dimethylphenyl)-3-oxo- | NPN | REAC. | NPN | NPN | N |  |  | 8.11
161 | 5392-40-5 | 2,6-Octadienal, 3,7-dimethyl- | MTA | MTA | ALKY-ARYL | MTA | R |  |  | 3.45


Table I – SIDS test data (continued).
ID | CASN | EINECS name | ECB-MOA | Schultz MOA | EPA-MOA | CONS1 | CONS2 | Verhaar MOA | LogLC50 (mol/l) | LogP
162 | 5567-15-7 | Butanamide, 2,2'-[(3,3'-dichloro[1,1'-biphenyl]-4,4'-diyl)bis(azo)]bis[N-(4-chloro-2,5-dimethoxyphenyl)-3-oxo- | NPN | REAC. | NPN | NPN | N |  |  | 7.94
163 | 6165-51-1 | Benzene, 1,4-dimethyl-2-(1-phenylethyl)- | PN | NPN | NPN | NPN | N |  |  | 5.24
164 | 6358-85-6 | Butanamide, 2,2'-[(3,3'-dichloro[1,1'-biphenyl]-4,4'-diyl)bis(azo)]bis[3-oxo-N-phenyl- | NPN | REAC. | NPN | NPN | N |  |  | 7.06
165 | 6386-38-5 | Benzenepropanoic acid, 3,5-bis(1,1-dimethylethyl)-4-hydroxy-, methyl ester | EN | NPN | EN | EN | N |  |  | 5.06
166 | 6422-86-2 | 1,4-Benzenedicarboxylic acid, bis(2-ethylhexyl) ester | NPN | NPN | DE | NPN | N |  |  | 8.39
167 | 6864-37-5 | Cyclohexanamine, 4,4'-methylenebis[2-methyl- | AN | NPN | NPN | NPN | N |  |  | 4.1
168 | 11070-44-3 | 1,3-Isobenzofurandione, tetrahydromethyl- | MTA | REAC. | CARB. Based | UNK | R |  |  | 2.64
169 | 25154-52-3 | Phenol, nonyl- | PN | PN | PN | PN | N | PN | -6.236 | 5.99
170 | 25265-71-8 | Propanol, oxybis- | NPN | NPN | NPN | NPN | N | NPN |  | -0.49
171 | 25321-09-9 | Benzene, bis(1-methylethyl)- | PN | NPN | NPN | NPN | N | NPN |  | 4.9
172 | 25321-14-6 | Benzene, methyldinitro- | PN | REAC. | REAC. DINITRO | UNK | R |  | -4.030 | 2.18
173 | 31570-04-4 | Phenol, 2,4-bis(1,1-dimethylethyl)-, phosphite (3:1) | NPN | NPN | NPN | NPN | N |  |  | 18.08
174 | 32534-81-9 | Benzene, 1,1'-oxybis-, pentabromo deriv. | NPN | NPN | NPN | NPN | N |  |  | 7.66
175 | 32536-52-0 | Benzene, 1,1'-oxybis-, octabromo deriv. | NPN | NPN | NPN | NPN | N |  |  | 10.33
176 | 56539-66-3 | 1-Butanol, 3-methoxy-3-methyl- | PN | NPN | NPN | NPN | N |  |  | 0.46
177 | 84852-15-3 | Phenol, 4-nonyl-, branched | PN | PN | PN | PN | N |  | -6.236 | 5.92
AQUIRE values are highlighted in bold.


Table II – Mixed model (QSAR4) training set.
ID | CASN | Chemical | LogKow | ELUMO | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred | Hat | Err.Calc. | Err.Pred.
1 | 108-95-2 | Phenol | 1.46 | 0.23 | -3.46 | -3.17 | 0.029 | 0.28 | 0.29
2 | 95-48-7 | o-cresol | 2.12 | 0.18 | -3.89 | -3.57 | 0.020 | 0.31 | 0.32
3 | 106-44-5 | p-cresol | 1.94 | 0.19 | -3.82 | -3.46 | 0.022 | 0.35 | 0.36
4 | 105-67-9 | 2,4-dimethylphenol | 2.30 | 0.17 | -3.87 | -3.69 | 0.019 | 0.18 | 0.18
5 | 527-60-6 | 2,4,6-trimethylphenol | 3.42 | 0.10 | -4.02 | -4.37 | 0.015 | -0.34 | -0.35
6 | 2416-94-6 | 2,3,6-trimethylphenol | 3.42 | 0.03 | -4.22 | -4.40 | 0.014 | -0.18 | -0.18
7 | 123-07-9 | 4-ethylphenol | 2.58 | 0.20 | -4.07 | -3.83 | 0.018 | 0.23 | 0.24
8 | 645-56-7 | 4-propylphenol | 3.18 | 0.19 | -4.09 | -4.19 | 0.016 | -0.09 | -0.10
9 | 1745-81-9 | 2-allylphenol | 2.64 | 0.18 | -3.95 | -3.88 | 0.017 | 0.07 | 0.07
10 | 98-54-4 | 4-tert-butylphenol | 3.31 | 0.20 | -4.46 | -4.25 | 0.017 | 0.21 | 0.21
11 | 80-46-6 | 4-tert-pentylphenol | 3.98 | 0.20 | -4.80 | -4.64 | 0.020 | 0.16 | 0.16
12 | 128-37-0 | 2,6-di-tert-butyl-4-methylphenol | 6.07 | -0.05 | -5.78 | -5.96 | 0.052 | -0.17 | -0.18
13 | 732-26-3 | 2,4,6-tri-tert-butylphenol | 7.40 | -0.03 | -6.63 | -6.72 | 0.097 ** | -0.08 | -0.09
14 | 104-40-5 | 4-nonylphenol | 6.36 | 0.19 | -6.20 | -6.00 | 0.066 * | 0.19 | 0.20
15 | 92-69-3 | 4-phenylphenol | 3.36 | 0.19 | -4.44 | -4.29 | 0.017 | 0.15 | 0.15
16 | 120-80-9 | catechol | 0.88 | 0.16 | -4.08 | -2.83 | 0.038 | 1.20 * | 1.25 *
17 | 108-46-3 | resorcinol | 0.80 | 0.14 | -3.34 | -2.82 | 0.039 | 0.50 | 0.52
18 | 150-19-6 | 3-methoxyphenol | 1.58 | 0.23 | -3.22 | -3.25 | 0.027 | -0.03 | -0.03
19 | 150-76-5 | 4-methoxyphenol | 1.34 | 0.06 | -3.05 | -3.19 | 0.028 | -0.14 | -0.14
20 | 831-82-3 | 4-phenoxyphenol | 3.75 | 0.05 | -4.58 | -4.58 | 0.016 | 0.00 | 0.00
21 | 95-57-8 | 2-chlorophenol | 2.15 | -0.14 | -3.97 | -3.74 | 0.015 | 0.23 | 0.23
22 | 106-48-9 | 4-chlorophenol | 2.48 | -0.18 | -4.32 | -3.95 | 0.012 | 0.37 | 0.37
23 | 106-48-9 | 4-chloro-3-methylphenol | 3.10 | -0.20 | -4.42 | -4.31 | 0.010 | 0.10 | 0.11
24 | 2138-22-9 | 4-chlorocatechol | 1.97 | -0.28 | -4.96 | -3.68 | 0.016 | 1.26 * | 1.28 *
25 | 120-83-2 | 2,4-dichlorophenol | 2.92 | -0.52 | -4.32 | -4.36 | 0.009 | -0.04 | -0.04
26 | 3428-24-8 | 4,5-dichlorocatechol | 2.90 | -0.57 | -5.30 | -4.36 | 0.009 | 0.93 | 0.94
27 | 2460-49-3 | 4,5-dichloroguaiacol | 3.26 | -0.70 | -4.64 | -4.63 | 0.009 | 0.01 | 0.01
28 | 88-06-2 | 2,4,6-trichlorophenol | 3.69 | -0.86 | -4.61 | -4.96 | 0.012 | -0.34 | -0.35
29 | 58-90-2 | 2,3,4,6-tetrachlorophenol | 4.45 | -1.22 | -5.35 | -5.56 | 0.024 | -0.20 | -0.21
30 | 4901-51-3 | 2,3,4,5-tetrachlorophenol | 4.21 | -1.21 | -5.75 | -5.40 | 0.021 | 0.34 | 0.35
31 | 1198-55-6 | tetrachlorocatechol | 4.29 | -1.25 | -5.29 | -5.48 | 0.023 | -0.19 | -0.19
32 | 87-86-5 | pentachlorophenol | 5.12 | -1.51 | -6.04 | -6.07 | 0.040 | -0.03 | -0.03
33 | 118-79-6 | 2,4,6-tribromophenol | 4.02 | -0.72 | -4.70 | -5.09 | 0.013 | -0.38 | -0.39
34 | 608-71-9 | pentabromophenol | 5.74 | -1.26 | -6.72 | -6.29 | 0.046 | 0.41 | 0.43
35 | 609-23-4 | 2,4,6-triiodophenol | 4.80 | -0.79 | -5.59 | -5.56 | 0.022 | 0.03 | 0.03
36 | 88-75-5 | 2-nitrophenol | 1.85 | -0.79 | -2.94 | -3.88 | 0.019 | -0.93 | -0.94
37 | 100-02-7 | 4-nitrophenol | 1.91 | -0.81 | -3.53 | -3.92 | 0.018 | -0.38 | -0.39
38 | 573-56-8 | 2,6-dinitrophenol | 1.91 | -1.69 | -3.67 | -4.34 | 0.041 | -0.64 | -0.67
39 | 329-71-5 | 2,5-dinitrophenol | 1.75 | -1.81 | -4.74 | -4.25 | 0.048 | 0.47 | 0.49
40 | 51-28-5 | 2,4-dinitrophenol | 1.54 | -2.03 | -4.23 | -4.25 | 0.062 * | -0.02 | -0.02
41 | 88-85-7 | 2-sec-butyl-4,6-dinitrophenol | 3.69 | -1.55 | -5.65 | -5.26 | 0.027 | 0.38 | 0.39
42 | 119-34-6 | 4-amino-2-nitrophenol | 0.96 | -0.81 | -3.63 | -3.36 | 0.033 | 0.27 | 0.27


Table II – Mixed model (QSAR4) training set (continued).
ID | CASN | Chemical | LogKow | ELUMO | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred | Hat | Err.Calc. | Err.Pred.
43 | 88-30-2 | 3-trifluoromethyl-4-nitrophenol | 3.00 | -1.58 | -4.36 | -4.90 | 0.028 | -0.53 | -0.54
44 | 534-52-1 | 4,6-dinitro-o-cresol | 2.56 | -1.59 | -5.06 | -4.62 | 0.031 | 0.42 | 0.44
45 | 97-23-4 | 2,2'-methylene-bis(4-chlorophenol) | 4.26 | -0.25 | -5.94 | -4.99 | 0.016 | 0.94 | 0.95
46 | 70-30-4 | 2,2'-methylenebis(3,4,6-trichlorophenol) | 7.54 | -1.01 | -7.29 | -7.23 | 0.098 ** | 0.06 | 0.06
47 | 62-53-3 | aniline | 0.90 | 0.52 | -2.91 | -2.72 | 0.046 | 0.18 | 0.19
48 | 106-49-0 | 4-toluidine | 1.39 | 0.53 | -2.83 | -3.01 | 0.038 | -0.17 | -0.18
49 | 589-16-2 | 4-ethylaniline | 1.96 | 0.22 | -3.22 | -3.48 | 0.022 | -0.25 | -0.26
50 | 104-13-2 | 4-butylaniline | 3.15 | 0.22 | -4.16 | -4.15 | 0.017 | 0.01 | 0.01
51 | 16245-79-7 | 4-octylaniline | 5.27 | 0.21 | -6.23 | -5.34 | 0.039 | 0.85 | 0.89
52 | 37529-30-9 | 4-decylaniline | 6.32 | 0.21 | -6.58 | -5.94 | 0.065 * | 0.60 | 0.64
53 | 24544-04-5 | 4,6-diisopropylaniline | 3.18 | 0.13 | -4.06 | -4.21 | 0.015 | -0.15 | -0.15
54 | 95-80-7 | 2,4-diaminotoluene | 0.34 | 0.20 | -1.94 | -2.58 | 0.050 | -0.61 | -0.64
55 | 1484-26-0 | 3-benzyloxyaniline | 2.79 | 0.18 | -4.34 | -3.96 | 0.016 | 0.37 | 0.38
56 | 39905-57-2 | 4-hexyloxyaniline | 3.66 | 0.44 | -4.81 | -4.34 | 0.025 | 0.46 | 0.47
57 | 95-51-2 | 2-chloroaniline | 1.90 | -0.11 | -4.35 | -3.57 | 0.018 | 0.76 | 0.78
58 | 106-47-8 | 4-chloroaniline | 1.83 | -0.14 | -3.61 | -3.56 | 0.018 | 0.05 | 0.05
59 | 615-65-6 | 2-chloro-4-methylaniline | 2.58 | -0.16 | -3.60 | -4.00 | 0.012 | -0.40 | -0.40
60 | 95-76-1 | 3,4-dichloroaniline | 2.69 | -0.50 | -4.33 | -4.22 | 0.010 | 0.11 | 0.11
61 | 634-67-3 | 2,3,4-trichloroaniline | 3.33 | -0.76 | -4.73 | -4.70 | 0.010 | 0.03 | 0.03
62 | 3481-20-7 | 2,3,5,6-tetrachloroaniline | 4.10 | -1.05 | -5.93 | -5.26 | 0.017 | 0.65 | 0.67
63 | 106-40-1 | 4-bromoaniline | 2.26 | -0.10 | -3.56 | -3.79 | 0.015 | -0.23 | -0.23
64 | 371-40-4 | 4-fluoroaniline | 1.15 | 0.14 | -3.81 | -3.02 | 0.032 | 0.77 | 0.79
65 | 771-60-8 | 2,3,4,5,6-pentafluoroaniline | 2.22 | -1.66 | -3.69 | -4.50 | 0.036 | -0.78 | -0.81
66 | 2357-47-3 | a,a,a-4-tetrafluoro-3-toluidine | 2.62 | -1.00 | -3.77 | -4.41 | 0.015 | -0.63 | -0.64
67 | 393-39-5 | a,a,a-4-tetrafluoro-2-toluidine | 2.62 | -0.97 | -3.78 | -4.40 | 0.014 | -0.61 | -0.62
68 | 100-01-6 | 4-nitroaniline | 1.31 | -0.48 | -3.04 | -3.42 | 0.024 | -0.38 | -0.38
69 | 616-86-4 | 4-ethoxy-2-nitroaniline | 2.47 | -0.87 | -3.85 | -4.26 | 0.014 | -0.41 | -0.41
70 | 121-87-9 | 2-chloro-4-nitroaniline | 2.17 | -1.16 | -3.93 | -4.22 | 0.021 | -0.29 | -0.29
71 | 97-02-9 | 2,4-dinitroaniline | 1.84 | -1.64 | -4.09 | -4.25 | 0.040 | -0.16 | -0.16
72 | 71-43-2 | benzene | 2.13 | 0.37 | -3.65 | -3.50 | 0.025 | 0.15 | 0.15
73 | 108-88-3 | toluene | 2.73 | 0.26 | -3.43 | -3.90 | 0.018 | -0.46 | -0.47
74 | 95-47-6 | o-xylene | 3.12 | 0.19 | -3.81 | -4.16 | 0.016 | -0.34 | -0.35
75 | 108-38-3 | m-xylene | 3.20 | 0.20 | -3.82 | -4.20 | 0.017 | -0.37 | -0.38
76 | 106-42-3 | p-xylene | 3.15 | 0.13 | -4.08 | -4.20 | 0.015 | -0.11 | -0.12
77 | 100-41-4 | ethylbenzene | 3.15 | 0.26 | -4.00 | -4.14 | 0.018 | -0.14 | -0.14
78 | 98-82-8 | isopropylbenzene | 3.66 | 0.27 | -4.28 | -4.43 | 0.020 | -0.14 | -0.15
79 | 95-63-6 | 1,2,4-trimethylbenzene | 3.78 | 0.10 | -4.19 | -4.58 | 0.017 | -0.38 | -0.39
80 | 68411-44-9 | butylbenzene | 4.26 | 0.26 | -4.83 | -4.77 | 0.025 | 0.06 | 0.06
81 | 141-93-5 | 1,3-diethylbenzene | 4.50 | 0.23 | -4.51 | -4.94 | 0.027 | -0.41 | -0.43
82 | 100-42-5 | styrene | 2.95 | 0.27 | -4.41 | -4.01 | 0.018 | 0.39 | 0.40


Table II – Mixed model (QSAR4) training set (continued).
ID | CASN | Chemical | LogKow | ELUMO | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred | Hat | Err.Calc. | Err.Pred.
83 | 1746-23-2 | 4-tert-butylstyrene | 4.84 | 0.15 | -5.52 | -5.14 | 0.030 | 0.36 | 0.38
84 | 538-68-1 | amylbenzene | 4.91 | 0.25 | -4.94 | -5.16 | 0.033 | -0.21 | -0.22
85 | 92-52-4 | biphenyl | 4.09 | 0.24 | -4.90 | -4.68 | 0.022 | 0.22 | 0.22
86 | 150-78-7 | 1,4-dimethoxybenzene | 2.15 | -0.10 | -3.07 | -3.74 | 0.015 | -0.66 | -0.67
87 | 5673-07-4 | 2,6-dimethoxytoluene | 2.80 | -0.02 | -3.88 | -4.06 | 0.013 | -0.18 | -0.18
88 | 140-67-0 | 1-allyl-4-methoxybenzene | 3.31 | 0.03 | -4.28 | -4.33 | 0.013 | -0.05 | -0.05
89 | 108-90-7 | chlorobenzene | 2.86 | -0.13 | -3.82 | -4.15 | 0.011 | -0.33 | -0.33
90 | 95-50-1 | 1,2-dichlorobenzene | 3.38 | -0.51 | -4.19 | -4.62 | 0.009 | -0.43 | -0.43
91 | 541-73-1 | 1,3-dichlorobenzene | 3.60 | -0.53 | -4.26 | -4.76 | 0.010 | -0.49 | -0.50
92 | 106-46-7 | 1,4-dichlorobenzene | 3.37 | -0.59 | -4.56 | -4.65 | 0.009 | -0.09 | -0.09
93 | 95-75-0 | 3,4-dichlorotoluene | 4.22 | -0.58 | -4.74 | -5.14 | 0.014 | -0.39 | -0.40
94 | 120-82-1 | 1,2,4-trichlorobenzene | 4.02 | -0.93 | -4.78 | -5.18 | 0.015 | -0.40 | -0.40
95 | 634-66-2 | 1,2,3,4-tetrachlorobenzene | 4.99 | -1.18 | -5.29 | -5.86 | 0.030 | -0.56 | -0.57
96 | 608-93-5 | pentachlorobenzene | 5.17 | -1.49 | -6.00 | -6.09 | 0.040 | -0.09 | -0.09
97 | 95-94-3 | 1,2,4,5-tetrachlorobenzene | 4.82 | -1.27 | -5.80 | -5.79 | 0.029 | 0.01 | 0.01
98 | 81-19-6 | a,a-2,6-tetrachlorotoluene | 4.64 | -1.21 | -5.38 | -5.67 | 0.026 | -0.28 | -0.29
99 | 1825-21-4 | pentachloroanisole | 5.34 | -1.50 | -5.64 | -6.22 | 0.044 | -0.55 | -0.58
100 | 2176-62-7 | pentachloropyridine | 4.34 | -1.76 | -5.73 | -5.74 | 0.039 | -0.01 | -0.01
101 | 108-86-1 | bromobenzene | 2.99 | -0.09 | -3.94 | -4.21 | 0.011 | -0.26 | -0.27
102 | 583-53-9 | 1,2-dibromobenzene | 3.64 | -0.42 | -4.76 | -4.73 | 0.010 | 0.03 | 0.03
103 | 13209-15-9 | a,a,a',a'-tetrabromo-o-xylene | 5.17 | -1.16 | -5.98 | -5.94 | 0.033 | 0.04 | 0.04
104 | 95-52-3 | 2-fluorotoluene | 2.93 | -0.15 | -3.75 | -4.20 | 0.011 | -0.45 | -0.45
105 | 98-95-3 | nitrobenzene | 1.85 | -0.78 | -3.02 | -3.88 | 0.018 | -0.84 | -0.86
106 | 99-08-1 | 3-nitrotoluene | 2.45 | -1.21 | -3.73 | -4.42 | 0.020 | -0.67 | -0.69
107 | 350-46-9 | 1-fluoro-4-nitrobenzene | 1.80 | -1.21 | -3.70 | -4.04 | 0.027 | -0.33 | -0.34
108 | 121-73-3 | 1-chloro-3-nitrobenzene | 2.41 | -1.14 | -3.92 | -4.35 | 0.019 | -0.43 | -0.43
109 | 88-73-3 | 1-chloro-2-nitrobenzene | 2.24 | -1.13 | -3.73 | -4.26 | 0.020 | -0.51 | -0.53
110 | 100-25-4 | 1,4-dinitrobenzene | 1.46 | -1.82 | -5.37 | -4.04 | 0.053 * | 1.26 * | 1.33 *
111 | 121-14-2 | 2,4-dinitrotoluene | 2.00 | -1.71 | -3.88 | -4.39 | 0.041 | -0.49 | -0.51
112 | 3698-83-7 | 1,3-Dichloro-4,6-dinitrobenzene | 2.49 | -2.34 | -6.71 | -4.81 | 0.067 * | 1.77 ** | 1.90 **
113 | 6284-83-9 | 1,3,5-trichloro-2,4-dinitrobenzene | 2.65 | -2.51 | -6.09 | -5.03 | 0.077 * | 0.98 * | 1.06 *
114 | 77-47-4 | hexachlorocyclopentadiene | 5.04 | -2.11 | -6.95 | -6.25 | 0.063 * | 0.65 | 0.70
* Chemicals with values between 2 times SDEC (or SDEP or critical HAT) and 3 times SDEC (or SDEP or critical HAT).
** Chemicals with values greater than 3 times SDEC (or SDEP or average value of HAT).
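The star convention in the footnote to Table II can be illustrated with a short sketch. This is not the authors' implementation: the function name `flag` is hypothetical, `scale` stands in for the reference value (SDEC, SDEP or the HAT reference, none of which is given numerically in this table), and taking the absolute value so that negative errors are treated symmetrically is an assumption.

```python
def flag(value: float, scale: float) -> str:
    """Return '', '*' or '**' following the Table II footnote convention.

    value: an Err.Calc., Err.Pred. or Hat entry from the table.
    scale: the reference value (SDEC, SDEP or HAT reference); hypothetical
           here, since the table does not state it numerically.
    """
    magnitude = abs(value)  # assumption: the sign of the error is ignored
    if magnitude > 3 * scale:
        return "**"  # greater than 3 times the reference value
    if magnitude > 2 * scale:
        return "*"   # between 2 and 3 times the reference value
    return ""        # not flagged


# Illustrative call with a made-up reference value of 0.5
print(repr(flag(-1.2, 0.5)))
```

The same function can be applied to the Hat column by passing the corresponding leverage reference value as `scale`.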


Table III – SIDS chemicals not suitable for QSAR 4:
N. Comp. | SIDS Chemicals | Motivation
91 | 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19 21 23 26 27 29 37 39 51 52 57 58 64 67 70 74 76 77 78 80 81 83 86 87 88 89 90 91 93 94 95 96 97 99 101 106 108 109 110 111 114 120 121 122 123 124 125 126 129 135 136 137 138 139 140 141 142 144 145 146 149 151 152 153 154 157 158 160 162 166 167 170 173 174 175 176 | Out of the X-domain (0.34 ≤ LogKow ≤ 7.54; -2.51 ≤ ELUMO ≤ 0.53)
9 | 41 49 56 65 72 85 103 104 115 | In the training set
XY-domain
1 | 155 | High leverage chemicals (structurally distant from the training chemicals)
3 | 18 73 117 | Y-Outliers (cross-validated standardised residual greater than two standard deviation units)
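The X-domain criterion in Table III amounts to a pair of range checks on the two model descriptors. A minimal sketch follows, assuming a hypothetical helper `in_x_domain`; only the numeric bounds are taken from the table, and the leverage and Y-outlier criteria are not covered.

```python
# Descriptor ranges of the QSAR4 training set, as quoted in Table III
LOGKOW_RANGE = (0.34, 7.54)
ELUMO_RANGE = (-2.51, 0.53)


def in_x_domain(log_kow: float, e_lumo: float) -> bool:
    """True if both descriptors fall inside the training-set ranges."""
    return (LOGKOW_RANGE[0] <= log_kow <= LOGKOW_RANGE[1]
            and ELUMO_RANGE[0] <= e_lumo <= ELUMO_RANGE[1])


# S14 (Ethene, 1,1-difluoro-): LogKow 1.24 and ELUMO -0.01 per Table IV
print(in_x_domain(1.24, -0.01))
# Chemical 139 has LogP 12.11 in Table I; its ELUMO is not listed in this
# section, so 0.0 is a placeholder value used only for illustration
print(in_x_domain(12.11, 0.0))
```

A range check of this kind treats the domain as a rectangle in descriptor space, which is why chemicals inside it can still be flagged as high-leverage.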


Table IV – QSAR 4 predictions for the SIDS subset defined by model domain in descriptor and response space (XY-D).
ID | CASN | EINECS name | MOA | LogKow | ELUMO | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred | Hat | Std.Err.Pred.
S14 | 75387 | Ethene, 1,1-difluoro- | NPN | 1.24 | -0.01 | - | -3.16 | 0.028 | -
S16 | 75683 | Ethane, 1-chloro-1,1-difluoro- | NPN | 2.05 | -0.62 | - | -3.90 | 0.015 | -
S18 | 78591 | 2-Cyclohexen-1-one, 3,5,5-trimethyl- | MTA | 2.62 | -0.24 | -2.76 | -4.06 | 0.011 | -2.69
S20 | 78875 | Propane, 1,2-dichloro- | NPN | 2.25 | 0.37 | -2.91 | -3.57 | 0.023 | -1.38
S22 | 79005 | Ethane, 1,1,2-trichloro- | NPN | 2.01 | -0.53 | -3.21 | -3.84 | 0.015 | -1.30
S24 | 79107 | 2-Propenoic acid | UNK | 0.44 | 0.16 | - | -2.63 | 0.047 | -
S25 | 79118 | Acetic acid, chloro- | UNK | 0.34 | 0.30 | - | -2.50 | 0.052 | -
S28 | 79345 | Ethane, 1,1,2,2-tetrachloro- | NPN | 2.19 | -0.87 | -3.92 | -4.10 | 0.016 | -0.37
S30 | 79414 | 2-Propenoic acid, 2-methyl- | UNK | 0.99 | -0.19 | - | -3.10 | 0.031 | -
S31 | 80057 | Phenol, 4,4'-(1-methylethylidene)bis- | PN | 3.64 | 0.16 | -4.70 | -4.46 | 0.017 | 0.50
S32 | 80626 | 2-Propenoic acid, 2-methyl-, methyl ester | MTA | 1.28 | 0.41 | -2.55 | -2.99 | 0.036 | -0.92
S33 | 81141 | Ethanone, 1-[4-(1,1-dimethylethyl)-2,6-dimethyl-3,5-dinitrophenyl]- | UNK | 4.31 | -1.92 | - | -5.79 | 0.045 | -
S34 | 81152 | Benzene, 1-(1,1-dimethylethyl)-3,5-dimethyl-2,4,6-trinitro- | UNK | 4.45 | -2.45 | - | -6.11 | 0.074 | -
S35 | 84742 | 1,2-Benzenedicarboxylic acid, dibutyl ester | UNK | 4.61 | -0.52 | -5.31 | -5.33 | 0.018 | -0.05
S36 | 87569 | 2-Butenoic acid, 2,3-dichloro-4-oxo-, (Z)- | UNK | 1.37 | -1.06 | - | -3.71 | 0.030 | -
S38 | 88197 | Benzenesulfonamide, 2-methyl- | UNK | 0.92 | -1.53 | - | -3.67 | 0.052 | -
S40 | 88608 | Phenol, 2-(1,1-dimethylethyl)-5-methyl- | PN | 3.97 | -0.02 | - | -4.73 | 0.016 | -
S42 | 88744 | Benzenamine, 2-nitro- | PN | 2.02 | -0.76 | - | -3.95 | 0.016 | -
S43 | 91156 | 1,2-Benzenedicarbonitrile | UNK | 1.09 | -1.13 | - | -3.58 | 0.036 | -
S44 | 91769 | 1,3,5-Triazine-2,4-diamine, 6-phenyl- | CNS | 1.44 | -0.34 | - | -3.43 | 0.022 | -
S45 | 93685 | Butanamide, N-(2-methylphenyl)-3-oxo- | UNK | 0.99 | 0.03 | -2.78 | -3.00 | 0.033 | -0.46
S46 | 94360 | Peroxide, dibenzoyl | UNK | 3.43 | -0.61 | - | -4.69 | 0.009 | -
S47 | 95318 | 2-Benzothiazolesulfenamide, N-(1,1-dimethylethyl)- | NPN | 2.56 | -0.65 | - | -4.21 | 0.011 | -
S48 | 95498 | Benzene, 1-chloro-2-methyl- | NPN | 3.18 | -0.18 | - | -4.35 | 0.010 | -
S50 | 96184 | Propane, 1,2,3-trichloro- | NPN | 2.50 | 0.02 | -3.35 | -3.87 | 0.015 | -1.08
S53 | 96333 | 2-Propenoic acid, methyl ester | MTA | 0.73 | -0.02 | - | -2.87 | 0.038 | -


Table IV – QSAR 4 predictions for the SIDS subset defined by model domain in descriptor and response space (XY-D) (continued).
ID | CASN | EINECS name | MOA | LogKow | ELUMO | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred | Hat | Std.Err.Pred.
S54 | 97723 | Propanoic acid, 2-methyl-, anhydride | UNK | 1.24 | 0.50 | - | -2.93 | 0.039 | -
S55 | 98077 | Benzene, (trichloromethyl)- | NPN | 3.90 | -0.87 | - | -5.08 | 0.013 | -
S59 | 99047 | Benzoic acid, 3-methyl- | UNK | 2.42 | -0.23 | - | -3.94 | 0.012 | -
S60 | 99547 | Benzene, 1,2-dichloro-4-nitro- | UNK | 3.10 | -1.84 | - | -5.06 | 0.038 | -
S61 | 99990 | Benzene, 1-methyl-4-nitro- | UNK | 2.36 | -1.27 | -3.44 | -4.38 | 0.022 | -1.96
S62 | 100005 | Benzene, 1-chloro-4-nitro- | UNK | 2.46 | -1.59 | - | -4.58 | 0.031 | -
S63 | 100210 | 1,4-Benzenedicarboxylic acid | UNK | 1.76 | -0.73 | - | -3.79 | 0.019 | -
S66 | 102067 | Guanidine, N,N'-diphenyl- | NPN | 2.89 | 0.03 | - | -4.09 | 0.013 | -
S68 | 103117 | 2-Propenoic acid, 2-ethylhexyl ester | MTA | 4.09 | 0.01 | - | -4.79 | 0.017 | -
S69 | 103844 | Acetamide, N-phenyl- | UNK | 1.10 | 0.13 | - | -3.02 | 0.033 | -
S71 | 106310 | Butanoic acid, anhydride | UNK | 1.39 | 0.46 | - | -3.03 | 0.035 | -
S73 | 106638 | 2-Propenoic acid, 2-methylpropyl ester | MTA | 2.13 | 0.00 | -4.79 | -3.67 | 0.017 | 2.33
S75 | 107062 | Ethane, 1,2-dichloro- | NPN | 1.83 | -0.08 | -2.93 | -3.53 | 0.019 | -1.25
S79 | 107868 | 2-Butenal, 3-methyl- | MTA | 1.15 | 0.29 | - | -2.97 | 0.035 | -
S82 | 108441 | Benzenamine, 3-methyl- | PN | 1.62 | 0.22 | - | -3.27 | 0.026 | -
S84 | 108770 | 1,3,5-Triazine, 2,4,6-trichloro- | UNK | 1.73 | -1.59 | - | -4.16 | 0.039 | -
S92 | 110930 | 5-Hepten-2-one, 6-methyl- | NPN | 2.06 | 0.52 | -3.17 | -3.39 | 0.029 | -0.46
S98 | 115866 | Phosphoric acid, triphenyl ester | UNK | 4.70 | -1.40 | -5.59 | -5.78 | 0.031 | -0.39
S100 | 118796 | Phenol, 2,4,6-tribromo- | PN | 4.18 | -0.72 | -4.71 | -5.17 | 0.014 | -0.96
S102 | 120616 | 1,4-Benzenedicarboxylic acid, dimethyl ester | UNK | 1.66 | -1.12 | - | -3.91 | 0.026 | -
S105 | 121915 | 1,3-Benzenedicarboxylic acid | UNK | 1.76 | -0.62 | - | -3.74 | 0.018 | -
S107 | 122996 | Ethanol, 2-phenoxy- | NPN | 1.1 | 0.06 | -2.60 | -3.05 | 0.031 | -0.93
S112 | 126738 | Phosphoric-acid-tributyl-ester- | AChE | 3.82 | -0.84 | -4.77 | -5.02 | 0.012 | -0.51
S113 | 126987 | 2-Propenenitrile, 2-methyl- | MTA | 0.76 | -0.04 | - | -2.90 | 0.037 | -
S116 | 135193 | 2-Naphthalenol | PN | 2.69 | -0.39 | -4.62 | -4.16 | 0.010 | 0.95
S117 | 140885 | 2-Propenoic acid, ethyl ester | MTA | 1.22 | 0.01 | -4.60 | -3.14 | 0.029 | 3.06
S118 | 141106 | 3,5,9-Undecatrien-2-one, 6,10-dimethyl- | MTA | 4.43 | 0.13 | - | -4.93 | 0.023 | -
S119 | 141322 | 2-Propenoic acid, butyl ester | MTA | 2.20 | 0.01 | - | -3.70 | 0.016 | -
S127 | 528449 | 1,2,4-Benzenetricarboxylic acid | UNK | 0.95 | -1.07 | - | -3.48 | 0.038 | -
S128 | 552307 | 5-Isobenzofurancarboxylic acid, 1,3-dihydro-1,3-dioxo- | UNK | 1.96 | -1.81 | - | -4.39 | 0.045 | -


Table IV – QSAR 4 predictions for the SIDS subset defined by model domain in descriptor and response space (XY-D) (continued).
ID | CASN | EINECS name | MOA | LogKow | ELUMO | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred | Hat | Std.Err.Pred.
S130 | 611198 | Benzene, 1-chloro-2-(chloromethyl)- | UNK | 3.44 | -0.64 | - | -4.71 | 0.009 | -
S131 | 760236 | 1-Butene, 3,4-dichloro- | SN2 | 2.60 | -0.04 | -4.18 | -3.96 | 0.013 | 0.47
S132 | 770354 | 2-Propanol, 1-phenoxy- | NPN | 1.52 | 0.17 | -2.74 | -3.24 | 0.027 | -1.05
S133 | 793248 | 1,4-Benzenediamine, N-(1,3-dimethylbutyl)-N'-phenyl- | NPN | 4.68 | 0.11 | - | -5.08 | 0.026 | -
S134 | 822060 | Hexane, 1,6-diisocyanato- | UNK | 3.20 | 0.50 | - | -4.06 | 0.025 | -
S143 | 1717006 | HCFC 141b | NPN | 2.37 | -0.70 | - | -4.13 | 0.013 | -
S147 | 2439352 | 2-Propenoic acid, 2-(dimethylamino)ethyl ester | MTA | 0.42 | 0.30 | - | -2.55 | 0.050 | -
S148 | 2837890 | Ethane, 2-chloro-1,1,1,2-tetrafluoro- | NPN | 1.86 | -0.98 | - | -3.96 | 0.021 | -
S150 | 2867472 | 2-Propenoic acid, 2-methyl-, 2-(dimethylamino)ethyl ester | MTA | 0.97 | 0.39 | - | -2.83 | 0.040 | -
S155 | 4016244 | Hexadecanoic acid, 2-sulfo-, 1-methyl ester, sodium salt | UNK | 6.21 | -2.45 | - | -7.13 | 0.104 | -
S156 | 4169044 | 1-Propanol, 2-phenoxy- | NPN | 1.52 | 0.08 | -2.74 | -3.28 | 0.025 | -1.13
S159 | 4979322 | 2-Benzothiazolesulfenamide, N,N-dicyclohexyl- | NPN | 5.96 | -0.60 | - | -6.14 | 0.044 | -
S161 | 5392405 | 2,6-Octadienal, 3,7-dimethyl- | MTA | 3.45 | 0.32 | - | -4.28 | 0.020 | -
S163 | 6165511 | Benzene, 1,4-dimethyl-2-(1-phenylethyl)- | NPN | 5.24 | 0.11 | - | -5.40 | 0.036 | -
S164 | 6358856 | Butanamide, 2,2'-[(3,3'-dichloro[1,1'-biphenyl]-4,4'-diyl)bis(azo)]bis[3-oxo-N-phenyl- | NPN | 7.06 | -0.71 | - | -6.82 | 0.077 | -
S165 | 6386385 | Benzenepropanoic acid, 3,5-bis(1,1-dimethylethyl)-4-hydroxy-, methyl ester | EN | 5.06 | -0.21 | - | -5.44 | 0.027 | -
S168 | 11070443 | 1,3-Isobenzofurandione, tetrahydromethyl- | UNK | 2.64 | -0.81 | - | -4.33 | 0.012 | -
S169 | 25154523 | Phenol, nonyl- | PN | 5.99 | 0.19 | -6.24 | -5.80 | 0.055 | 0.93
S171 | 25321099 | Benzene, bis(1-methylethyl)- | NPN | 4.90 | 0.22 | - | -5.16 | 0.032 | -
S172 | 25321146 | Benzene, methyldinitro- | UNK | 2.18 | -1.61 | -4.03 | -4.43 | 0.034 | -0.84
S177 | 84852153 | Phenol, 4-nonyl-, branched | PN | 5.92 | 0.19 | -6.24 | -5.76 | 0.053 | 1.02
Y outliers are highlighted in bold in the standardized residual in prediction column.
Unreliable predictions according to the leverage approach are highlighted in bold in the leverage column.


Table V – E-state indices model (QSAR5) training set.
ID | CASN | Chemical | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred | Hat | Err.Pred.
T1 | 71-43-2 | Benzene | -3.40 | -2.89 | 0.151 | 0.51
T2 | 108-86-1 | Bromobenzene | -3.89 | -3.59 | 0.127 | 0.30
T3 | 108-90-7 | Chlorobenzene | -3.77 | -3.61 | 0.063 | 0.16
T4 | 108-95-2 | Hydroxybenzene | -3.51 | -3.14 | 0.075 | 0.37
T5 | 95-50-1 | 1,2-Dichlorobenzene | -4.40 | -4.25 | 0.041 | 0.15
T6 | 541-73-1 | 1,3-Dichlorobenzene | -4.30 | -4.26 | 0.041 | 0.04
T7 | 106-46-7 | 1,4-Dichlorobenzene | -4.62 | -4.25 | 0.042 | 0.37
T8 | 95-57-8 | 1-Chloro-2-hydroxybenzene | -4.02 | -3.77 | 0.051 | 0.25
T9 | 108-41-8 | 1-Chloro-3-methylbenzene | -3.84 | -3.99 | 0.035 | -0.15
T10 | 106-43-4 | 1-Chloro-4-methylbenzene | -4.33 | -3.97 | 0.035 | 0.36
T11 | 108-46-3 | 1,3-Dihydroxybenzene | -3.04 | -3.40 | 0.154 | -0.36
T12 | 150-19-6 | 1-Hydroxy-3-methoxybenzene | -3.21 | -3.26 | 0.047 | -0.05
T13 | 95-48-7 | 1-Hydroxy-2-methylbenzene | -3.77 | -3.48 | 0.048 | 0.29
T14 | 108-39-4 | 1-Hydroxy-3-methylbenzene | -3.29 | -3.52 | 0.048 | -0.23
T15 | 106-44-5 | 1-Hydroxy-4-methylbenzene | -3.58 | -3.52 | 0.048 | 0.06
T16 | 100-02-7 | 1-Hydroxy-4-nitrobenzene | -3.36 | -3.57 | 0.061 | -0.21
T17 | 150-78-7 | 1,4-Dimethoxybenzene | -3.07 | -3.17 | 0.048 | -0.10
T18 | 95-47-6 | 1,2-Dimethylbenzene | -3.48 | -3.73 | 0.072 | -0.25
T19 | 106-42-3 | 1,4-Dimethylbenzene | -4.21 | -3.67 | 0.070 | 0.54
T20 | 88-72-2 | 1-Methyl-2-nitrobenzene | -3.57 | -3.52 | 0.041 | 0.05
T21 | 99-08-1 | 1-Methyl-3-nitrobenzene | -3.63 | -3.56 | 0.039 | 0.07
T22 | 99-99-0 | 1-Methyl-4-nitrobenzene | -3.76 | -3.58 | 0.040 | 0.18
T23 | 99-65-0 | 1,3-Dinitrobenzene | -4.38 | -4.03 | 0.061 | 0.35
T24 | 603-83-8 | 1-Amino-2-methyl-3-nitrobenzene | -3.48 | -3.55 | 0.060 | -0.07
T25 | 99-52-5 | 1-Amino-2-methyl-4-nitrobenzene | -3.24 | -3.61 | 0.064 | -0.37
T26 | 99-55-8 | 1-Amino-2-methyl-5-nitrobenzene | -3.35 | -3.64 | 0.060 | -0.29
T27 | 570-24-1 | 1-Amino-2-methyl-6-nitrobenzene | -3.80 | -3.60 | 0.055 | 0.20
T28 | 578-46-1 | 1-Amino-3-methyl-6-nitrobenzene | -3.80 | -3.63 | 0.054 | 0.17
T29 | 89-62-3 | 1-Amino-2-nitro-4-methylbenzene | -3.79 | -3.61 | 0.054 | 0.18
T30 | 119-32-4 | 1-Amino-3-nitro-4-methylbenzene | -3.77 | -3.55 | 0.059 | 0.22
T31 | 87-61-6 | 1,2,3-Trichlorobenzene | -4.89 | -4.89 | 0.054 | 0.00
T32 | 120-82-1 | 1,2,4-Trichlorobenzene | -5.00 | -4.89 | 0.058 | 0.11
T33 | 108-70-3 | 1,3,5-Trichlorobenzene | -4.74 | -4.91 | 0.060 | -0.17
T34 | 120-83-2 | 1,3-Dichloro-4-hydroxybenzene | -4.30 | -4.42 | 0.056 | -0.12
T35 | 95-75-0 | 1,2-Dichloro-4-methylbenzene | -4.74 | -4.61 | 0.050 | 0.13
T36 | 95-73-8 | 1,3-Dichloro-4-methylbenzene | -4.54 | -4.63 | 0.054 | -0.09
T37 | 105-67-9 | 1-Hydroxy-2,4-dimethylbenzene | -3.86 | -3.85 | 0.092 | 0.01
T38 | 576-26-1 | 1-Hydroxy-2,6-dimethylbenzene | -3.75 | -3.83 | 0.084 | -0.08
T39 | 95-65-8 | 1-Hydroxy-3,4-dimethylbenzene | -3.90 | -3.86 | 0.102 | 0.04
T40 | 51-28-5 | 1-Hydroxy-2,4-dinitrobenzene | -4.04 | -4.55 | 0.097 | -0.51
T41 | 95-63-6 | 1,2,4-Trimethylbenzene | -4.21 | -4.06 | 0.186 | 0.15
T42 | 602-01-7 | 1-Methyl-2,3-dinitrobenzene | -5.01 | -4.36 | 0.051 | 0.65
T43 | 121-14-2 | 1-Methyl-2,4-dinitrobenzene | -3.75 | -4.26 | 0.054 | -0.51


Table V – E-state indices model (QSAR5) training set (continued).
ID | CASN | Chemical | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred. | Hat | Err. Pred.
T44 | 606-20-2 | 1-Methyl-2,6-dinitrobenzene | -3.99 | -4.17 | 0.051 | -0.18
T45 | 610-39-9 | 1-Methyl-3,4-dinitrobenzene | -5.08 | -4.43 | 0.057 | 0.65
T46 | 618-85-9 | 1-Methyl-3,5-dinitrobenzene | -3.91 | -4.27 | 0.055 | -0.36
T47 | 35572-78-2 | 1-Amino-2-methyl-3,5-dinitrobenzene | -4.12 | -4.31 | 0.067 | -0.19
T48 | 56207-39-7 | 1-Amino-2-methyl-3,6-dinitrobenzene | -5.34 | -4.08 | 0.069 | 1.26 *
T49 | 10202-92-3 | 1-Amino-2,4-dinitro-3-methylbenzene | -4.26 | -4.24 | 0.069 | 0.02
T50 | 70343-06-5 | 1-Amino-2,6-dinitro-3-methylbenzene | -4.21 | -4.34 | 0.076 | -0.13
T51 | 6393-42-6 | 1-Amino-2,6-dinitro-4-methylbenzene | -4.18 | -4.37 | 0.074 | -0.19
T52 | 19406-51-0 | 1-Amino-3,5-dinitro-4-methylbenzene | -4.46 | -4.22 | 0.066 | 0.24
T53 | 118-79-6 | 1,3,5-Tribromo-2-hydroxybenzene | -4.70 | -5.39 | 0.496 ** | -0.69
T54 | 634-66-2 | 1,2,3,4-Tetrachlorobenzene | -5.43 | -5.53 | 0.093 | -0.10
T55 | 95-94-3 | 1,2,4,5-Tetrachlorobenzene | -5.85 | -5.49 | 0.098 | 0.36
T56 | 118-96-7 | 1-Methyl-2,4,6-trinitrobenzene | -4.88 | -5.46 | 0.204 | -0.58
T57 | 87-86-5 | 1-Hydroxy-2,3,4,5,6-pentachlorobenzene | -6.06 | -6.28 | 0.264 * | -0.22
T58 | 106-40-1 | 1-Amino-4-bromobenzene | -3.56 | -3.66 | 0.133 | -0.10
T59 | 95-76-1 | 1-Amino-3,4-dichlorobenzene | -4.33 | -4.26 | 0.070 | 0.07
T60 | 554-00-7 | 1-Amino-2,4-dichlorobenzene | -4.07 | -4.28 | 0.070 | -0.21
T61 | 615-65-6 | 1-Amino-2-chloro-4-methylbenzene | -3.60 | -4.01 | 0.072 | -0.41
T62 | 121-87-9 | 1-Amino-2-chloro-4-nitrobenzene | -3.93 | -3.99 | 0.054 | -0.06
T63 | 634-67-3 | 1-Amino-2,3,4-trichlorobenzene | -4.74 | -4.90 | 0.089 | -0.16
T64 | 6284-83-9 | 1,3,5-Trichloro-2,4-dinitrobenzene | -6.09 | -6.23 | 0.219 | -0.14
T65 | 3481-20-7 | 1-Amino-2,3,5,6-tetrachlorobenzene | -5.93 | -5.44 | 0.133 | 0.49
T66 | 1689-84-5 | 1-Cyano-3,5-dibromo-4-hydroxybenzene | -4.38 | -4.54 | 0.327 * | -0.16
T67 | 5922-60-1 | 1-Cyano-2-amino-5-chlorobenzene | -3.73 | -3.75 | 0.224 | -0.02
T68 | 6575-09-3 | 1-Cyano-2-chloro-6-methylbenzene | -4.00 | -4.08 | 0.193 | -0.08
T69 | 529-19-1 | 1-Cyano-2-methylbenzene | -3.42 | -3.44 | 0.189 | -0.02
T70 | 6361-21-3 | 1-Aldehydo-2-chloro-5-nitrobenzene | -4.72 | -4.79 | 0.084 | -0.07
T71 | 874-42-0 | 1-Aldehydo-2,4-dichlorobenzene | -4.99 | -5.41 | 0.115 | -0.42
T72 | 104-88-1 | 1-Aldehydo-4-chlorobenzene | -4.81 | -4.91 | 0.133 | -0.10
T73 | 552-89-6 | 1-Aldehydo-2-nitrobenzene | -4.02 | -4.15 | 0.080 | -0.13


Table V – E-state indices model (QSAR5) training set (continued).
ID | CASN | Chemical | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred. | Hat | Err. Pred.
T74 | 100-52-7 | 1-Aldehydobenzene | -4.14 | -4.37 | 0.176 | -0.23
T75 | 613-45-6 | 1-Aldehydo-2,4-dimethoxybenzene | -3.92 | -4.36 | 0.136 | -0.44
T76 | 1761-61-1 | 1-Aldehydo-2-hydroxy-5-bromobenzene | -5.19 | -4.69 | 0.122 | 0.50
T77 | 635-93-8 | 1-Aldehydo-2-hydroxy-5-chlorobenzene | -5.31 | -4.61 | 0.083 | 0.70
T78 | 90-02-8 | 1-Aldehydo-2-hydroxybenzene | -4.73 | -4.09 | 0.095 | 0.64
T79 | 18278-34-7 | 1-Aldehydo-2-methoxy-4-hydroxybenzene | -4.02 | -4.33 | 0.104 | -0.31
T80 | 121-33-5 | 1-Aldehydo-3-methoxy-4-hydroxybenzene | -4.81 | -4.28 | 0.109 | 0.53
T81 | 708-76-9 | 1-Aldehydo-2-hydroxy-4,6-dimethoxybenzene | -4.83 | -4.01 | 0.096 | 0.82
T82 | 771-60-8 | 1-Amino-2,3,4,5,6-pentafluorobenzene | -3.69 | -6.10 | 0.526 ** | -2.41 **
T83 | 350-46-9 | 1-Fluoro-4-nitrobenzene | -3.70 | -3.88 | 0.132 | -0.18
T84 | 371-40-4 | 1-Amino-4-fluorobenzene | -3.82 | -3.36 | 0.138 | 0.46
T85 | 653-37-2 | 1-Aldehydo-pentafluorobenzene | -5.25 | -2.74 | 0.535 ** | 2.51 **
T86 | 387-45-1 | 1-Aldehydo-2-chloro-6-fluorobenzene | -4.23 | -4.68 | 0.097 | -0.45
T87 | 5465-65-6 | 1-Acyl-4-chloro-3-nitrobenzene | -4.56 | -4.28 | 0.091 | 0.28
T88 | 2234-16-4 | 1-Acyl-2,4-dichlorobenzene | -4.21 | -4.55 | 0.064 | -0.34
T89 | 98-86-2 | 1-Acylbenzene | -2.87 | -3.29 | 0.062 | -0.42
T90 | 619-24-9 | 3-Nitrobenzonitrile | -3.39 | -3.53 | 0.188 | -0.14
T91 | 619-72-7 | 4-Nitrobenzonitrile | -3.79 | -3.41 | 0.189 | 0.38
T92 | 4920-77-8 | 3-Methyl-2-nitrophenol | -3.52 | -3.88 | 0.050 | -0.36
T93 | 700-38-9 | 5-Methyl-2-nitrophenol | -3.51 | -3.95 | 0.049 | -0.44
T94 | 616-72-8 | 1,5-Dimethyl-2,4-dinitrobenzene | -4.39 | -4.42 | 0.080 | -0.03
T95 | 616-73-9 | 2,4-Dinitro-5-methylphenol | -4.92 | -4.63 | 0.094 | 0.29
T96 | 99-35-4 | 1,3,5-Trinitrobenzene | -5.29 | -5.29 | 0.255 * | 0.00
T97 | 127-00-4 | 1-Chloro-2-propanol | -2.58 | -2.61 | 0.098 | -0.03
T98 | 115-20-8 | 2,2,2-Trichloroethanol | -2.70 | -3.77 | 0.166 | -1.07 *
T99 | 96-13-9 | 2,3-Dibromopropanol | -3.49 | -3.06 | 0.259 * | 0.43
T100 | 108-93-0 | Cyclohexanol | -2.15 | -1.53 | 0.197 | 0.62
T101 | 122-99-6 | 2-Phenoxyethanol | -2.60 | -3.42 | 0.066 | -0.82
T102 | 67-64-1 | Acetone | -0.85 | -1.88 | 0.096 | -1.03 *
T103 | 78-93-3 | 2-Butanone | -1.35 | -1.92 | 0.089 | -0.57
T104 | 107-87-9 | 2-Pentanone | -1.75 | -1.92 | 0.087 | -0.17
T105 | 563-80-4 | 3-Methyl-2-butanone | -2.00 | -2.37 | 0.088 | -0.37
T106 | 110-12-3 | 5-Methyl-2-hexanone | -2.86 | -2.58 | 0.167 | 0.28
T107 | 108-10-1 | 4-Methyl-2-pentanone | -2.29 | -2.57 | 0.130 | -0.28
T108 | 75-97-8 | 3,3-Dimethyl-2-butanone | -3.07 | -2.51 | 0.163 | 0.56
T109 | 119-61-9 | Benzophenone | -4.09 | -4.93 | 0.237 | -0.84
T110 | 13608-87-2 | 2,3,4-Trichloroacetophenone | -5.00 | -5.17 | 0.106 | -0.17


Table V – E-state indices model (QSAR5) training set (continued).
ID | CASN | Chemical | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred. | Hat | Err. Pred.
T111 | 937-20-2 | 2,4-Dichloroacetophenone | -4.16 | -4.47 | 0.067 | -0.31
T112 | 1634-04-4 | tert-Butylmethyl ether | -2.10 | -2.52 | 0.199 | -0.42
T113 | 108-20-3 | Diisopropyl ether | -3.05 | -2.71 | 0.274 * | 0.34
T114 | 5671-07-4 | 2,6-Dimethoxytoluene | -3.88 | -3.44 | 0.088 | 0.44
T115 | 101-84-8 | Diphenyl ether | -4.63 | -4.59 | 0.220 | 0.04
T116 | 620-88-2 | p-Nitrophenyl phenyl ether | -4.91 | -4.80 | 0.130 | 0.11
T117 | 107-06-2 | 1,2-Dichloroethane | -2.92 | -2.56 | 0.137 | 0.36
T118 | 79-00-5 | 1,1,2-Trichloroethane | -3.21 | -3.32 | 0.110 | -0.11
T119 | 79-34-5 | 1,1,2,2-Tetrachloroethane | -3.92 | -3.68 | 0.493 ** | 0.24
T120 | 76-01-7 | Pentachloroethane | -4.44 | -4.27 | 0.320 * | 0.17
T121 | 67-72-1 | Hexachloroethane | -5.19 | -3.90 | 0.775 ** | 1.29 *
T122 | 108-88-3 | Methylbenzene | -3.32 | -3.34 | 0.062 | -0.02
T123 | 59-50-7 | 1-Chloro-2-methyl-4-hydroxybenzene | -4.27 | -4.14 | 0.057 | 0.13
T124 | 62-53-3 | 1-Aminobenzene | -2.84 | -2.99 | 0.128 | -0.15
T125 | 609-22-3 | 1-Formyl-2-hydroxy-3,5-dibromobenzene | -5.52 | -4.76 | 0.266 | 0.76
T126 | 97-02-9 | 5-Amino-2,4-dinitroaniline | -4.91 | -4.14 | 0.073 | 0.77
T127 | 90-13-1 | 1-Chloronaphthalene | -4.85 | -4.47 | 0.132 | 0.38
T128 | 1321-64-8 | Pentachloronaphthalene | -6.01 | -7.19 | 0.221 | -1.18
T129 | 56-23-5 | Carbon tetrachloride | -3.75 | -3.63 | 0.182 | 0.12
T130 | 95-52-3 | 1-Formyl-2-fluorobenzene | -4.96 | -3.72 | 0.112 | 1.24
* Chemicals with values between 2 times SDEC (or SDEP or critical HAT) and 3 times SDEC (or SDEP or critical HAT).
** Chemicals with values greater than 3 times SDEC (or SDEP or average value of HAT).
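The starring rule in the Table V footnote can be sketched in a few lines of Python. This is only an illustration: the function name `significance_flag`, the handling of values exactly at the 2x and 3x boundaries, and the reference value `ASSUMED_SDEC = 0.5` are assumptions made here, not values or definitions given in the report.

```python
def significance_flag(value: float, reference: float) -> str:
    """Return '', '*' or '**' for a value compared with a reference
    (SDEC, SDEP or a critical hat value), following the 2x / 3x rule
    of the Table V footnote. Boundary handling is an assumption."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    magnitude = abs(value)
    if magnitude > 3 * reference:
        return "**"          # greater than 3 times the reference
    if magnitude >= 2 * reference:
        return "*"           # between 2 and 3 times the reference
    return ""                # unflagged


# Hypothetical reference value, NOT taken from the report.
ASSUMED_SDEC = 0.5

for err in (0.82, 1.26, -2.41):
    print(err, repr(significance_flag(err, ASSUMED_SDEC)))
```

The same function can be applied to the Hat column by passing a critical hat value as `reference`; that threshold is likewise not stated in this section.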


Table VI – QSAR5 predictions for the 9 test set chemicals.
ID | CASN | EINECS name | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred. | Hat | Std. Err. Pred.
T122 | 108-88-3 | Methylbenzene | -3.32 | -3.34 | 0.062 | -0.05
T123 | 59-50-7 | 1-Chloro-2-methyl-4-hydroxybenzene | -4.27 | -4.14 | 0.057 | 0.34
T124 | 62-53-3 | 1-Aminobenzene | -2.84 | -2.99 | 0.128 | -0.41
T125 | 609-22-3 | 1-Formyl-2-hydroxy-3,5-dibromobenzene | -5.52 | -4.76 | 0.266 | 2.28
T126 | 97-02-9 | 5-Amino-2,4-dinitroaniline | -4.91 | -4.14 | 0.073 | 2.06
T127 | 90-13-1 | 1-Chloronaphthalene | -4.85 | -4.47 | 0.132 | 1.05
T128 | 1321-64-8 | Pentachloronaphthalene | -6.01 | -7.19 | 0.221 | -3.44
T129 | 56-23-5 | Carbon tetrachloride | -3.75 | -3.63 | 0.182 | 0.34
T130 | 95-52-3 | 1-Formyl-2-fluorobenzene | -4.96 | -3.72 | 0.112 | 3.38
Y outliers are highlighted in bold in the standardized residual in prediction column.
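As a rough illustration of how a standardized residual in prediction can be related to the Exp, Pred. and Hat columns, the sketch below uses one common definition, r = (Pred − Exp) / (s · sqrt(1 − h)). Both the formula and the residual standard deviation `ASSUMED_S = 0.389` are assumptions made for this example; neither is stated in this section, and the value of `s` was simply chosen so that the sketch lands close to the tabulated figures.

```python
import math


def standardized_residual(exp: float, pred: float, hat: float, s: float) -> float:
    """Standardized residual under one common definition:
    (pred - exp) / (s * sqrt(1 - hat)).
    The sign convention (pred minus exp) follows the tables above."""
    if not 0.0 <= hat < 1.0:
        raise ValueError("this definition needs a hat value in [0, 1)")
    if s <= 0:
        raise ValueError("s must be positive")
    return (pred - exp) / (s * math.sqrt(1.0 - hat))


# Hypothetical residual standard deviation, NOT given in the report.
ASSUMED_S = 0.389

# (ID, Exp, Pred., Hat) copied from Table VI.
rows = [
    ("T125", -5.52, -4.76, 0.266),
    ("T128", -6.01, -7.19, 0.221),
]
for chem_id, exp, pred, hat in rows:
    r = standardized_residual(exp, pred, hat, ASSUMED_S)
    print(chem_id, round(r, 2))
```

Under this assumed definition, a chemical with a large hat value gets a larger standardized residual than its raw error alone would suggest, which is why the Hat column matters when reading the last column.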


Table VII – SIDS chemicals not suitable for QSAR 5.
N. Comp. | SIDS Chemicals | Motivation | Domain
17 | 6 7 13 14 16 17 33 34 98 101 139 148 160 162 164 173 175 | Out of the X-domain |
8 | 22 28 49 72 100 104 107 142 | In the training set |
39 | 2 12 18 19 23 37 43 54 68 79 90 92 93 96 99 115 118 119 126 127 128 129 133 135 143 144 147 149 152 155 157 159 161 163 165 166 167 171 174 | High leverage chemicals (structurally distant from the training chemicals) | XY domain
22 | 1 3 8 21 32 35 45 76 78 81 89 103 108 110 112 117 120 132 137 156 169 177 | Y-Outliers (cross-validated standardised residual greater than two standard deviation units) | XY domain
5 | 31 67 73 131 141 | Y-Outliers and high leverage chemicals | XY domain
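The three XY-domain categories of Table VII can be expressed as a small decision rule. The sketch below is an illustration under stated assumptions: the function name `classify_prediction`, the label strings, and the critical hat value `ASSUMED_CRITICAL_HAT = 0.3` are hypothetical, while the limit of two standard deviation units comes from the table itself.

```python
from typing import Optional


def classify_prediction(
    hat: float,
    std_residual: Optional[float],
    critical_hat: float,
    residual_limit: float = 2.0,
) -> str:
    """Assign a chemical to one of the XY-domain categories of Table VII.

    std_residual may be None when no experimental value is available,
    in which case only the leverage check can be applied."""
    high_leverage = hat > critical_hat
    y_outlier = std_residual is not None and abs(std_residual) > residual_limit
    if high_leverage and y_outlier:
        return "Y-outlier and high leverage"
    if high_leverage:
        return "High leverage"
    if y_outlier:
        return "Y-outlier"
    return "Within XY-domain"


# Hypothetical threshold, NOT taken from the report.
ASSUMED_CRITICAL_HAT = 0.3

# (hat, standardized residual) pairs in the style of Table VIII.
print(classify_prediction(0.569, -5.91, ASSUMED_CRITICAL_HAT))
print(classify_prediction(0.102, -0.97, ASSUMED_CRITICAL_HAT))
print(classify_prediction(0.565, None, ASSUMED_CRITICAL_HAT))
```

Chemicals listed as out of the X-domain or already in the training set are excluded before this step, so the rule only covers the last three rows of the table.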


Table VIII –<strong>QSAR</strong> 5 predictions <strong>for</strong> the SIDS subset defined by model domain indescriptor and response space (XY-D).Log(LC50)Std.ID CASN EINECS name(mol/l) HatErr.Pred.Exp Pred.S1 50000 Formaldehyde- -3.08 -1.15 0.252 5.74S2 56815 1,2,3-Propanetriol - -2.46 0.543 -S3 57556 1,2-Propanediol -0.84 -2.27 0.235 -4.21S4 580821H-Purine-2,6-dione,3,7-dihydro-1,3,7-trimethyl-- -2.64 0.234 -S5 585591H-Purine-2,6-dione,3,7-dihydro-1,3-dimethyl-- -2.21 0.200 -S8 68122 Formamide,N,N-dimethyl- -0.84 -3.12 0.187 -6.50S9 71363 1-Butanol -1.60 -1.96 0.102 -0.97S10 74839 Methane,bromo- - -1.96 0.171 -S11 74873 Methane,chloro- - -1.98 0.134 -S12 75014 Ethene,chloro- - -3.80 0.565 -S15 75569 Oxirane,methyl- - -1.55 0.288 -S18 78591 2-Cyclohexen-1-one,3,5,5-trimethyl- -2.76 -5.55 1.006 * -S19 78706 1,6-Octadien-3-ol,3,7-dimethyl- - -8.94 4.394 ** -S20 78875 Propane,1,2-dichloro- -2.91 -3.09 0.082 -0.48S21 78922 2-Butanol -1.31 -2.27 0.088 -2.58S23 79061 2-Propenamide -2.77 -3.24 0.387 -1.55S24 79107 2-Propenoicacid - -3.22 0.212 -S25 79118 Aceticacid,chloro- - -2.59 0.123 -S26 79209 Aceticacid,methylester -2.36 -1.72 0.105 1.74S27 79312 Propanoicacid,2-methyl- - -2.38 0.105 -S29 79390 2-Propenamide,2-methyl- - -1.76 0.129 -S30 79414 2-Propenoicacid,2-methyl- - -2.11 0.109 -S31 80057 Phenol,4,4'-(1-methylethylidene)bis- -4.70 -6.21 0.569 -5.91S32 80626 2-Propenoicacid,2-methyl-,methylester -2.55 -1.78 0.099 2.09S35 847421,2-Benzenedicarboxylicacid,dibutylester-5.31 -3.63 0.241 4.95S36 875692-Butenoicacid,2,3-dichloro-4-oxo-,(Z)-- -4.09 0.194 -S37 88120 2-Pyrrolidinone,1-ethenyl- - -3.94 0.689 -S38 88197 Benzenesulfonamide,2-methyl- - -3.45 0.187 -S39 88448Benzenesulfonicacid,2-amino-5-methyl-- -3.73 0.285 -S40 88608Phenol,2-(1,1-dimethylethyl)-5-methyl-- -4.80 0.348 -S41 88733 Benzene,1-chloro-2-nitro- - -3.97 0.036 -S42 88744 Benzenamine,2-nitro- - -3.34 0.070 -S43 91156 1,2-Benzenedicarbonitrile - -3.22 0.705 -S44 91769 
1,3,5-Triazine-2,4-diamine,6-phenyl- - -3.45 0.272 -S45 93685Butanamide,N-(2-methylphenyl)-3-oxo--2.78 -3.82 0.184 -2.9676


Table VIII –<strong>QSAR</strong> 5 predictions <strong>for</strong> the SIDS subset defined by model domain indescriptor and response space (XY-D) (continued).ID CASN EINECS nameLog(LC50)Std.(mol/l) HatErr.Pred.Exp Pred.S46 94360 Peroxide,dibenzoyl - -4.54 0.361 -S47 953182-Benzothiazolesulfenamide,N-(1,1-dimethylethyl)-- -3.85 0.201 -S48 95498 Benzene,1-chloro-2-methyl- - -3.98 0.034 -S50 96184 Propane,1,2,3-trichloro- -3.35 -3.55 0.092 -0.54S51 96297 2-Butanone,oxime -2.01 -2.28 0.079 -0.71S52 96311 Urea,N,N'-dimethyl- - -1.81 0.094 -S53 96333 2-Propenoicacid,methylester - -3.35 0.360 -S54 97723 Propanoicacid,2-methyl-,anhydride - -2.66 0.377 -S55 98077 Benzene,(trichloromethyl)- - -5.00 0.155 -S56 98544 Phenol,4-(1,1-dimethylethyl)- -4.47 -4.52 0.246 -0.16S57 98599 Benzenesulfonylchloride,4-methyl- - -4.07 0.169 -S58 98920 3-Pyridinecarboxamide - -2.65 0.102 -S59 99047 Benzoicacid,3-methyl- - -3.69 0.078 -S60 99547 Benzene,1,2-dichloro-4-nitro- - -4.62 0.043 -S61 99990 Benzene,1-methyl-4-nitro- -3.44 -3.59 0.040 -0.39S62 100005 Benzene,1-chloro-4-nitro- - -3.95 0.041 -S63 100210 1,4-Benzenedicarboxylicacid - -3.77 0.362 -S64 100378 Ethanol,2-(diethylamino)- -1.82 -2.41 0.084 -1.58S65 100414 Benzene,ethyl- -3.94 -3.40 0.062 1.43S66 102067 Guanidine,N,N'-diphenyl- - -4.65 0.211 -S67 102761 1,2,3-Propanetriol,triacetate -3.12 -2.24 0.575 3.47S68 103117 2-Propenoicacid,2-ethylhexylester - -4.35 0.459 -S69 103844 Acetamide,N-phenyl- - -3.28 0.060 -S70 105602 2H-Azepin-2-one,hexahydro- - -1.22 0.246 -S71 106310 Butanoicacid,anhydride - -2.26 0.168 -S73 106638 2-Propenoicacid,2-methylpropylester -4.79 -4.14 0.407 2.17S74 106887 Oxirane,ethyl- - -1.59 0.302 -S75 107062 Ethane,1,2-dichloro- -2.93 -2.61 0.136 0.89S76 107153 1,2-Ethanediamine -2.58 -1.45 0.337 3.56S77 107222 Ethanedial- -2.43 -2.09 0.209 0.98S78 107415 2,4-Pentanediol,2-methyl- -1.09 -2.92 0.328 -5.74S79 107868 2-Butenal,3-methyl- - -5.88 1.484 * -S80 107926 Butanoic-acid- - -2.19 0.098 -S81 107982 
2-Propanol,1-methoxy- -0.64 -2.09 0.119 -3.97S82 108441 Benzenamine,3-methyl- - -3.35 0.082 -S83 108656 2-Propanol,1-methoxy-,acetate - -2.08 0.098 -S84 108770 1,3,5-Triazine,2,4,6-trichloro- - -3.59 0.095 -S85 108883 Benzene,methyl- -3.55 -3.34 0.062 0.56S86 109660 Pentane- - -1.77 0.116 -S87 110167 2-Butenedioicacid(Z)- -4.37 -4.63 0.341 -0.84S88 110190 Aceticacid,2-methylpropylester - -2.44 0.111 -S89 110656 2-Butyne-1,4-diol -3.21 -2.19 0.200 2.93S90 110838 Cyclohexene- - -8.66 6.009 ** -77


Table VIII –<strong>QSAR</strong> 5 predictions <strong>for</strong> the SIDS subset defined by model domain indescriptor and response space (XY-D) (continued).ID CASN EINECS nameLog(LC50)Std.(mol/l) HatErr.Pred.Exp Pred.S91 110850 Piperazine- - -0.92 0.313 -S92 110930 5-Hepten-2-one,6-methyl- -3.17 -5.88 1.309 -S93 110985 2-Propanol,1,1'-oxybis- - -2.59 0.422 -S94 1125721,2-Ethanediamine,N-(2-aminoethyl)-N'-[2-[(2-aminoethyl)amino]ethyl]-- -1.49 0.362 -S95 112856 Docosanoic-acid- - -2.35 0.095 -S96 115071 1-Propene - -4.27 1.023 * -S97 115117 1-Propene,2-methyl- - -1.68 0.121 -S99 1159571,6-Octadien-3-ol,3,7-dimethyl-,acetate- -9.01 4.514 ** -S102 1206161,4-Benzenedicarboxylicacid,dimethylester- -3.26 0.160 -S103 120809 1,2-Benzenediol -4.29 -3.33 0.162 2.69S105 121915 1,3-Benzenedicarboxylicacid - -3.77 0.366 -S106 122521 Phosphorousacid,triethylester - -2.04 0.125 -S108 123546 2,4-Pentanedione -2.86 -2.04 0.162 2.30S109 123773 Diazenedicarboxamide- - -1.96 0.292 -S110 123864 Aceticacid,butylester -3.81 -1.88 0.088 5.20S111 124049 Hexanedioic-acid- -3.18 -2.80 0.352 1.21S112 126738 Phosphoric-acid-tributyl-ester- -4.77 -2.47 0.121 6.32S113 126987 2-Propenenitrile,2-methyl- - -1.69 0.252 -S114 127195 Acetamide,N,N-dimethyl- - -2.17 0.086 -S115 128370Phenol,2,6-bis(1,1-dimethylethyl)-4-methyl-- -6.09 1.238 * -S116 135193 2-Naphthalenol -4.62 -3.99 0.150 1.76S117 140885 2-Propenoicacid,ethylester -4.60 -3.50 0.365 3.56S118 1411063,5,9-Undecatrien-2-one,6,10-dimethyl-- -15.79 17.663 ** -S119 141322 2-Propenoicacid,butylester - -3.62 0.377 -S120 141786 Acetic-acid-ethyl-ester- -2.58 -1.82 0.093 2.05S121 141979 Butanoicacid,3-oxo-,ethylester - -2.11 0.164 -S122 144558 Carbonic-acid-monosodium-salt- - -2.29 0.192 -S123 150903 Butanedioicacid,disodiumsalt - -2.75 0.334 -S124 288324 1H-Imidazole - -1.73 0.170 -S125 461585 Guanidine,cyano- - -1.62 0.304 -S126 5053281-Hexadecen-3-ol,3,7,11,15-tetramethyl-- -7.54 2.895 ** -S127 528449 1,2,4-Benzenetricarboxylicacid - -4.15 
0.986 -S128 5523075-Isobenz<strong>of</strong>urancarboxylicacid,1,3-dihydro-1,3-dioxo-- -3.19 0.479 -S129 556821 2-Buten-1-ol,3-methyl- - -5.34 0.986 -S130 611198 Benzene,1-chloro-2-(chloromethyl)- - -4.45 0.044 -S131 760236 1-Butene,3,4-dichloro- -4.18 -5.42 0.810 -7.29S132 770354 2-Propanol,1-phenoxy- -2.74 -3.57 0.135 -2.2978


Table VIII –<strong>QSAR</strong> 5 predictions <strong>for</strong> the SIDS subset defined by model domain indescriptor and response space (XY-D) (continued).ID CASN EINECS nameLog(LC50)Std.(mol/l) HatErr.Pred.Exp Pred.S133 7932481,4-Benzenediamine,N-(1,3-dimethylbutyl)-N'-phenyl-- -6.60 0.777 -S134 822060 Hexane,1,6-diisocyanato- - -1.47 0.279 -S135 8399071,3,5-Triazine-2,4,6(1H,3H,5H)-trione,1,3,5-tris(2-hydroxyethyl)-- -3.53 1.144 * -S136 8687792-Propenoicacid,2-methyl-,2-hydroxyethylester-2.76 -2.18 0.110 1.58S137 868859 Phosphonicacid,dimethylester -2.69 -1.72 0.106 2.64S138 919302 3-Aminopropyl-triethoxysilane - -2.35 0.205 -S140 1477550 1,3-Benzenedimethanamine - -3.33 0.248 -S141 1490046Cyclohexanol,5-methyl-2-(1-methylethyl)--3.93 -3.77 0.999 13.01S143 1717006 HCFC141b - -3.11 0.439 -S144 1760243Cyclohexanol,5-methyl-2-(1-methylethyl)-,[1R-- -3.77 0.999 -(1alpha,2beta,5alpha)]-S145 2403885 4-Piperidinol,2,2,6,6-tetramethyl- - -3.34 0.349 -S146 2432997 Undecanoicacid,11-amino- - -2.18 0.175 -S147 24393522-Propenoicacid,2-(dimethylamino)ethylester- -3.95 0.381 -S149 2855132Cyclohexanemethanamine,5-amino-1,3,3-trimethyl-- -3.23 0.583 -S150 28674722-Propenoicacid,2-methyl-,2-(dimethylamino)ethylester- -2.29 0.096 -S151 3268493 Propanal,3-(methylthio)- - -3.19 0.269 -S152 33193111,2,4-Benzenetricarboxylicacid,tris(2-ethylhexyl)ester- -5.70 1.527 * -S153 3323533Hexanedioicacid,compd.with1,6-hexanediamine(1:1)- -2.80 0.352 -S154 3452979 1-Hexanol,3,5,5-trimethyl- - -3.74 0.364 -S155 4016244Hexadecanoicacid,2-sulfo-,1-methylester,sodiumsalt- -2.68 1.223 * -S156 4169044 1-Propanol,2-phenoxy- -2.74 -3.69 0.085 -2.55S157 4454051 2H-Pyran,3,4-dihydro-2-methoxy- - -7.53 4.065 ** -S158 4457710 1,5-Pentanediol,3-methyl- - -2.85 0.216 -S159 49793222-Benzothiazolesulfenamide,N,N-dicyclohexyl-- -3.28 0.849 -S161 5392405 2,6-Octadienal,3,7-dimethyl- - -10.34 6.434 ** -S163 6165511Benzene,1,4-dimethyl-2-(1-phenylethyl)-- -6.47 0.495 -S165 
6386385Benzenepropanoicacid,3,5-bis(1,1-dimethylethyl)-4-hydroxy-,methylester- -6.04 1.214 * -79


Table VIII – QSAR 5 predictions for the SIDS subset defined by model domain in descriptor and response space (XY-D) (continued).
ID | CASN | EINECS name | Log(LC50) (mol/l) Exp | Log(LC50) (mol/l) Pred. | Hat | Std. Err. Pred.
S166 | 6422862 | 1,4-Benzenedicarboxylic acid, bis(2-ethylhexyl) ester | - | -4.93 | 0.604 | -
S167 | 6864375 | Cyclohexanamine, 4,4'-methylenebis[2-methyl- | - | -4.22 | 5.166 ** | -
S168 | 11070443 | 1,3-Isobenzofurandione, tetrahydromethyl- | - | -1.83 | 0.231 | -
S169 | 25154523 | Phenol, nonyl- | -6.24 | -3.73 | 0.056 | 6.64
S170 | 25265718 | Propanol, oxybis- | - | -2.25 | 0.207 | -
S171 | 25321099 | Benzene, bis(1-methylethyl)- | - | -5.27 | 0.597 | -
S172 | 25321146 | Benzene, methyldinitro- | -4.03 | -4.40 | 0.051 | -0.98
S174 | 32534819 | Benzene, 1,1'-oxybis-, pentabromo deriv. | - | -7.88 | 1.941 * | -
S176 | 56539663 | 1-Butanol, 3-methoxy-3-methyl- | - | -2.62 | 0.113 | -
S177 | 84852153 | Phenol, 4-nonyl-, branched | -6.24 | -4.52 | 0.255 | 5.12
Y outliers are highlighted in bold in the standardized residual in prediction column. Unreliable predictions according to the leverage approach are highlighted in bold in the leverage column.


Table IX – Terra<strong>QSAR</strong> (<strong>QSAR</strong>6) training set: measured versus Predicted FHMvalues in TQ-FHM model, pT units (log[mmol/L]).ID Measured Predicted Residual ID Measured Predicted Residual1 6.38 6.37 0.01 39 3.63 3.58 0.052 6.00 5.97 0.03 40 3.61 3.19 0.423 5.77 5.43 0.34 41 3.60 3.58 0.024 5.74 5.74 0.00 42 3.59 3.58 0.015 5.43 5.29 0.14 43 3.57 3.42 0.156 5.34 5.33 0.01 44 3.55 4.00 -0.457 4.83 4.82 0.01 45 3.55 3.50 0.058 4.78 4.01 0.77 46 3.53 2.80 0.739 4.74 4.72 0.02 47 3.52 3.38 0.1410 4.66 4.64 0.02 48 3.50 3.24 0.2611 4.65 4.52 0.13 49 3.48 3.52 -0.0412 4.59 4.51 0.08 50 3.48 3.30 0.1813 4.46 4.48 -0.02 51 3.45 3.31 0.1414 4.40 2.97 1.43 52 3.44 2.88 0.5615 4.40 4.40 0.00 53 3.42 3.11 0.3116 4.39 4.39 0.00 54 3.40 2.55 0.8517 4.36 4.27 0.09 55 3.39 2.33 1.0618 4.34 4.05 0.29 56 3.38 3.37 0.0119 4.33 4.29 0.04 57 3.38 3.35 0.0320 4.33 4.67 -0.34 58 3.37 2.90 0.4721 4.29 3.85 0.44 59 3.30 2.02 1.2822 4.24 4.06 0.18 60 3.26 3.22 0.0423 4.22 2.81 1.41 61 3.26 3.13 0.1324 4.21 4.23 -0.02 62 3.24 2.65 0.5925 4.12 3.96 0.16 63 3.24 3.29 -0.0526 4.11 3.64 0.47 64 3.24 3.07 0.1727 4.09 3.91 0.18 65 3.21 2.73 0.4828 3.92 3.88 0.04 66 3.20 3.41 -0.2129 3.89 3.93 -0.04 67 3.20 2.23 0.9730 3.83 3.34 0.49 68 3.20 2.73 0.4731 3.77 3.60 0.17 69 3.18 2.95 0.2332 3.76 3.92 -0.16 70 3.18 3.19 -0.0133 3.72 3.60 0.12 71 3.18 3.02 0.1634 3.72 3.58 0.14 72 3.17 3.01 0.1635 3.69 1.90 1.79 73 3.16 3.16 0.0036 3.69 4.01 -0.32 74 3.16 2.99 0.1737 3.68 3.60 0.08 75 3.14 2.96 0.1838 3.64 3.09 0.55 76 3.09 3.35 -0.2681


Table IX – Terra<strong>QSAR</strong> (<strong>QSAR</strong>6) training set (continued).ID Measured Predicted Residual ID Measured Predicted Residual77 3.06 2.64 0.42 118 2.64 2.61 0.0378 3.06 2.83 0.23 119 2.64 2.78 -0.1479 3.04 3.03 0.01 120 2.63 2.29 0.3480 3.04 2.29 0.75 121 2.62 2.22 0.4081 3.04 3.00 0.04 122 2.62 2.49 0.1382 3.02 3.31 -0.29 123 2.60 2.51 0.0983 3.02 2.90 0.12 124 2.59 2.53 0.0684 3.00 2.87 0.13 125 2.59 2.41 0.1885 3.00 2.99 0.01 126 2.58 2.54 0.0486 2.98 2.64 0.34 127 2.57 2.54 0.0387 2.95 2.78 0.17 128 2.57 2.56 0.0188 2.94 2.01 0.93 129 2.54 2.62 -0.0889 2.92 2.59 0.33 130 2.54 2.56 -0.0290 2.91 2.69 0.22 131 2.53 1.92 0.6191 2.90 2.86 0.04 132 2.52 2.22 0.3092 2.87 2.91 -0.04 133 2.51 2.52 -0.0193 2.86 2.74 0.12 134 2.51 2.24 0.2794 2.86 2.88 -0.02 135 2.50 1.09 1.4195 2.85 2.41 0.44 136 2.49 2.38 0.1196 2.84 2.78 0.06 137 2.49 2.04 0.4597 2.82 2.11 0.71 138 2.47 2.44 0.0398 2.82 2.57 0.25 139 2.46 2.41 0.0599 2.78 2.81 -0.03 140 2.45 2.45 0.00100 2.77 1.96 0.81 141 2.45 1.82 0.63101 2.74 2.67 0.07 142 2.44 2.47 -0.03102 2.74 2.47 0.27 143 2.43 2.41 0.02103 2.73 2.74 -0.01 144 2.42 2.21 0.21104 2.73 2.74 -0.01 145 2.40 1.84 0.56105 2.72 2.74 -0.02 146 2.40 2.38 0.02106 2.72 2.46 0.26 147 2.39 2.54 -0.15107 2.71 2.57 0.14 148 2.38 1.34 1.04108 2.71 1.87 0.84 149 2.37 2.03 0.34109 2.69 1.63 1.06 150 2.36 1.95 0.41110 2.69 2.46 0.23 151 2.35 2.36 -0.01111 2.68 1.88 0.80 152 2.35 2.14 0.21112 2.68 1.30 1.38 153 2.35 2.46 -0.11113 2.67 1.37 1.30 154 2.34 1.99 0.35114 2.67 2.39 0.28 155 2.33 2.38 -0.05115 2.67 2.38 0.29 156 2.32 2.30 0.02116 2.65 1.58 1.07 157 2.31 1.93 0.38117 2.65 2.49 0.16 158 2.31 1.80 0.5182


Table IX – Terra<strong>QSAR</strong> (<strong>QSAR</strong>6) training set (continued).ID Measured Predicted Residual ID Measured Predicted Residual159 2.30 1.33 0.97 200 2.09 2.16 -0.07160 2.30 1.84 0.46 201 2.08 1.56 0.52161 2.29 2.33 -0.04 202 2.07 1.90 0.17162 2.29 2.46 -0.17 203 2.07 1.25 0.82163 2.28 1.07 1.21 204 2.06 2.02 0.04164 2.27 2.02 0.25 205 2.06 2.10 -0.04165 2.27 2.30 -0.03 206 2.06 1.70 0.36166 2.26 2.20 0.06 207 2.05 1.38 0.67167 2.26 1.99 0.27 208 2.05 1.90 0.15168 2.26 1.74 0.52 209 2.05 1.86 0.19169 2.25 2.32 -0.07 210 2.04 1.43 0.61170 2.25 2.20 0.05 211 2.02 1.99 0.03171 2.24 2.17 0.07 212 2.01 2.04 -0.03172 2.22 1.92 0.30 213 2.01 1.56 0.45173 2.22 2.05 0.17 214 2.00 2.00 0.00174 2.21 2.22 -0.01 215 2.00 2.08 -0.08175 2.21 1.44 0.77 216 2.00 1.93 0.07176 2.20 2.09 0.11 217 1.99 1.66 0.33177 2.19 2.19 0.00 218 1.99 1.92 0.07178 2.19 2.16 0.03 219 1.98 1.80 0.18179 2.19 2.33 -0.14 220 1.97 1.97 0.00180 2.19 1.78 0.41 221 1.97 2.29 -0.32181 2.19 1.51 0.68 222 1.96 1.93 0.03182 2.18 2.08 0.10 223 1.96 1.58 0.38183 2.18 2.29 -0.11 224 1.96 1.47 0.49184 2.18 2.05 0.13 225 1.95 1.95 0.00185 2.16 2.34 -0.18 226 1.95 1.43 0.52186 2.16 1.90 0.26 227 1.95 1.87 0.08187 2.16 1.53 0.63 228 1.94 1.79 0.15188 2.15 1.56 0.59 229 1.94 1.94 0.00189 2.14 2.15 -0.01 230 1.94 2.35 -0.41190 2.13 2.11 0.02 231 1.93 2.13 -0.20191 2.13 2.10 0.03 232 1.92 1.92 0.00192 2.13 1.91 0.22 233 1.92 1.51 0.41193 2.12 2.12 0.00 234 1.92 1.90 0.02194 2.11 2.03 0.08 235 1.91 1.96 -0.05195 2.11 1.53 0.58 236 1.90 1.86 0.04196 2.11 1.76 0.35 237 1.90 1.90 0.00197 2.10 1.36 0.74 238 1.90 2.24 -0.34198 2.10 2.13 -0.03 239 1.90 1.68 0.22199 2.09 1.92 0.17 240 1.89 1.91 -0.0283


Table IX – Terra<strong>QSAR</strong> (<strong>QSAR</strong>6) training set (continued).ID Measured Predicted Residual ID Measured Predicted Residual241 1.89 1.56 0.33 282 1.70 2.22 -0.52242 1.89 2.38 -0.49 283 1.70 1.01 0.69243 1.88 1.85 0.03 284 1.70 1.76 -0.06244 1.86 2.01 -0.15 285 1.70 1.40 0.30245 1.85 1.80 0.05 286 1.68 1.70 -0.02246 1.85 2.01 -0.16 287 1.68 1.74 -0.06247 1.84 1.53 0.31 288 1.67 2.05 -0.38248 1.84 1.78 0.06 289 1.67 1.24 0.43249 1.83 1.64 0.19 290 1.66 1.46 0.20250 1.83 1.60 0.23 291 1.65 1.41 0.24251 1.82 1.79 0.03 292 1.64 1.26 0.38252 1.82 1.82 0.00 293 1.64 1.75 -0.11253 1.82 1.87 -0.05 294 1.63 1.01 0.62254 1.81 1.78 0.03 295 1.63 1.64 -0.01255 1.81 1.76 0.05 296 1.63 1.76 -0.13256 1.81 1.67 0.14 297 1.62 1.53 0.09257 1.80 1.44 0.36 298 1.62 1.60 0.02258 1.80 0.93 0.87 299 1.61 1.34 0.27259 1.80 1.86 -0.06 300 1.60 1.53 0.07260 1.80 1.61 0.19 301 1.60 1.44 0.16261 1.80 1.61 0.19 302 1.58 1.59 -0.01262 1.79 1.59 0.20 303 1.58 1.23 0.35263 1.78 1.91 -0.13 304 1.58 1.34 0.24264 1.78 1.84 -0.06 305 1.58 1.21 0.37265 1.77 1.55 0.22 306 1.56 1.29 0.27266 1.76 1.48 0.28 307 1.56 1.53 0.03267 1.76 1.11 0.65 308 1.56 1.48 0.08268 1.75 1.67 0.08 309 1.56 1.30 0.26269 1.74 1.91 -0.17 310 1.56 1.22 0.34270 1.74 1.53 0.21 311 1.55 1.54 0.01271 1.74 1.79 -0.05 312 1.54 1.53 0.01272 1.74 1.74 0.00 313 1.54 0.14 1.40273 1.74 1.66 0.08 314 1.54 1.06 0.48274 1.74 1.65 0.09 315 1.54 1.54 0.00275 1.73 1.37 0.36 316 1.53 1.60 -0.07276 1.73 1.72 0.01 317 1.52 1.52 0.00277 1.73 1.91 -0.18 318 1.52 1.01 0.51278 1.73 1.18 0.55 319 1.51 1.26 0.25279 1.73 1.34 0.39 320 1.51 1.86 -0.35280 1.72 1.46 0.26 321 1.51 1.50 0.01281 1.71 1.69 0.02 322 1.51 1.50 0.0184


Table IX – Terra<strong>QSAR</strong> (<strong>QSAR</strong>6) training set (continued).ID Measured Predicted Residual ID Measured Predicted Residual323 1.49 1.48 0.01 364 1.35 1.10 0.25324 1.48 1.48 0.00 365 1.35 1.27 0.08325 1.47 0.35 1.12 366 1.34 0.95 0.39326 1.47 1.75 -0.28 367 1.34 1.06 0.28327 1.47 1.50 -0.03 368 1.33 1.12 0.21328 1.47 1.61 -0.14 369 1.32 1.51 -0.19329 1.47 1.33 0.14 370 1.32 1.64 -0.32330 1.46 1.30 0.16 371 1.32 1.31 0.01331 1.46 1.27 0.19 372 1.32 1.65 -0.33332 1.45 1.37 0.08 373 1.32 1.23 0.09333 1.44 1.28 0.16 374 1.30 1.32 -0.02334 1.44 1.39 0.05 375 1.30 1.53 -0.23335 1.43 1.39 0.04 376 1.29 1.50 -0.21336 1.42 1.47 -0.05 377 1.28 1.11 0.17337 1.42 1.45 -0.03 378 1.28 1.24 0.04338 1.42 1.72 -0.30 379 1.27 1.09 0.18339 1.42 1.39 0.03 380 1.26 1.51 -0.25340 1.42 1.33 0.09 381 1.26 1.17 0.09341 1.42 1.05 0.37 382 1.25 1.70 -0.45342 1.41 1.40 0.01 383 1.25 1.31 -0.06343 1.41 1.50 -0.09 384 1.24 1.23 0.01344 1.41 1.21 0.20 385 1.24 1.24 0.00345 1.40 1.53 -0.13 386 1.24 1.24 0.00346 1.40 1.04 0.36 387 1.23 1.12 0.11347 1.40 1.03 0.37 388 1.23 1.49 -0.26348 1.40 1.07 0.33 389 1.22 1.04 0.18349 1.40 1.35 0.05 390 1.22 1.04 0.18350 1.40 1.19 0.21 391 1.21 0.94 0.27351 1.40 1.35 0.05 392 1.21 1.05 0.16352 1.39 1.36 0.03 393 1.21 1.26 -0.05353 1.39 1.30 0.09 394 1.21 1.51 -0.30354 1.38 1.29 0.09 395 1.21 1.01 0.20355 1.38 1.44 -0.06 396 1.21 1.20 0.01356 1.38 1.82 -0.44 397 1.20 1.62 -0.42357 1.37 1.42 -0.05 398 1.20 1.18 0.02358 1.37 1.20 0.17 399 1.19 0.61 0.58359 1.37 1.68 -0.31 400 1.18 1.19 -0.01360 1.36 1.36 0.00 401 1.18 1.18 0.00361 1.36 1.34 0.02 402 1.18 1.20 -0.02362 1.36 1.35 0.01 403 1.17 1.36 -0.19363 1.35 1.29 0.06 404 1.17 1.18 -0.0185


Table IX – Terra<strong>QSAR</strong> (<strong>QSAR</strong>6) training set (continued).ID Measured Predicted Residual ID Measured Predicted Residual405 1.17 1.02 0.15 446 0.97 0.54 0.43406 1.17 1.20 -0.03 447 0.97 0.83 0.14407 1.16 1.32 -0.16 448 0.96 0.70 0.26408 1.15 0.93 0.22 449 0.96 0.69 0.27409 1.14 1.12 0.02 450 0.95 0.60 0.35410 1.14 1.15 -0.01 451 0.95 1.08 -0.13411 1.13 0.83 0.30 452 0.94 1.01 -0.07412 1.12 1.23 -0.11 453 0.94 0.87 0.07413 1.12 1.21 -0.09 454 0.93 1.04 -0.11414 1.12 1.47 -0.35 455 0.93 0.88 0.05415 1.11 0.89 0.22 456 0.93 1.08 -0.15416 1.10 1.10 0.00 457 0.93 0.85 0.08417 1.10 0.87 0.23 458 0.92 0.66 0.26418 1.10 1.43 -0.33 459 0.92 0.76 0.16419 1.09 1.38 -0.29 460 0.92 0.81 0.11420 1.09 1.08 0.01 461 0.92 1.01 -0.09421 1.09 1.18 -0.09 462 0.92 0.88 0.04422 1.08 1.29 -0.21 463 0.91 1.56 -0.65423 1.08 1.09 -0.01 464 0.89 1.11 -0.22424 1.08 1.32 -0.24 465 0.89 1.05 -0.16425 1.08 0.86 0.22 466 0.89 0.90 -0.01426 1.07 1.29 -0.22 467 0.89 0.89 0.00427 1.07 0.81 0.26 468 0.88 0.86 0.02428 1.05 1.29 -0.24 469 0.88 0.22 0.66429 1.05 1.13 -0.08 470 0.87 0.94 -0.07430 1.04 1.03 0.01 471 0.87 0.97 -0.10431 1.04 1.00 0.04 472 0.87 0.78 0.09432 1.03 1.10 -0.07 473 0.86 0.80 0.06433 1.03 0.95 0.08 474 0.86 0.70 0.16434 1.03 0.87 0.16 475 0.86 0.87 -0.01435 1.02 1.07 -0.05 476 0.85 0.85 0.00436 1.02 1.21 -0.19 477 0.85 0.76 0.09437 1.02 1.13 -0.11 478 0.85 0.82 0.03438 1.02 1.04 -0.02 479 0.84 1.12 -0.28439 1.02 1.23 -0.21 480 0.84 0.72 0.12440 1.00 0.93 0.07 481 0.84 0.87 -0.03441 1.00 0.86 0.14 482 0.84 0.76 0.08442 1.00 0.94 0.06 483 0.82 0.94 -0.12443 0.99 1.56 -0.57 484 0.82 0.71 0.11444 0.98 0.93 0.05 485 0.82 0.73 0.09445 0.98 0.72 0.26 486 0.81 0.94 -0.1386


Table IX – Terra<strong>QSAR</strong> (<strong>QSAR</strong>6) training set (continued).ID Measured Predicted Residual ID Measured Predicted Residual487 0.81 0.55 0.26 528 0.69 0.67 0.02488 0.81 0.64 0.17 529 0.69 0.62 0.07489 0.80 0.82 -0.02 530 0.67 0.64 0.03490 0.80 0.82 -0.02 531 0.67 0.73 -0.06491 0.80 0.85 -0.05 532 0.67 0.76 -0.09492 0.79 0.82 -0.03 533 0.66 0.83 -0.17493 0.78 0.67 0.11 534 0.66 0.69 -0.03494 0.78 0.62 0.16 535 0.66 0.49 0.17495 0.78 0.76 0.02 536 0.65 0.57 0.08496 0.78 1.49 -0.71 537 0.65 0.61 0.04497 0.78 0.81 -0.03 538 0.64 0.64 0.00498 0.77 1.22 -0.45 539 0.64 0.74 -0.10499 0.77 0.54 0.23 540 0.63 0.29 0.34500 0.77 0.81 -0.04 541 0.63 0.66 -0.03501 0.77 0.51 0.26 542 0.62 0.72 -0.10502 0.77 0.64 0.13 543 0.62 0.36 0.26503 0.76 0.44 0.32 544 0.62 0.89 -0.27504 0.76 0.67 0.09 545 0.61 0.39 0.22505 0.75 0.59 0.16 546 0.61 0.62 -0.01506 0.75 0.89 -0.14 547 0.61 0.90 -0.29507 0.75 0.68 0.07 548 0.60 0.60 0.00508 0.75 0.80 -0.05 549 0.60 0.75 -0.15509 0.75 1.56 -0.81 550 0.60 0.37 0.23510 0.74 0.74 0.00 551 0.60 0.73 -0.13511 0.74 0.87 -0.13 552 0.60 0.61 -0.01512 0.74 1.00 -0.26 553 0.60 0.60 0.00513 0.73 0.73 0.00 554 0.59 1.13 -0.54514 0.73 0.75 -0.02 555 0.58 0.64 -0.06515 0.73 -0.24 0.97 556 0.57 0.83 -0.26516 0.73 0.77 -0.04 557 0.57 0.77 -0.20517 0.73 0.86 -0.13 558 0.57 0.54 0.03518 0.73 1.01 -0.28 559 0.57 0.58 -0.01519 0.72 0.53 0.19 560 0.57 0.97 -0.40520 0.72 0.81 -0.09 561 0.57 0.66 -0.09521 0.72 0.78 -0.06 562 0.57 0.66 -0.09522 0.70 1.10 -0.40 563 0.56 0.62 -0.06523 0.70 0.77 -0.07 564 0.56 0.79 -0.23524 0.70 0.91 -0.21 565 0.56 0.82 -0.26525 0.70 0.71 -0.01 566 0.55 0.54 0.01526 0.70 0.73 -0.03 567 0.55 0.84 -0.29527 0.69 0.74 -0.05 568 0.54 0.54 0.0087


Table IX – Terra<strong>QSAR</strong> (<strong>QSAR</strong>6) training set (continued).ID Measured Predicted Residual ID Measured Predicted Residual569 0.54 0.58 -0.04 610 0.38 0.34 0.04570 0.54 0.58 -0.04 611 0.37 0.79 -0.42571 0.53 0.56 -0.03 612 0.37 0.38 -0.01572 0.53 0.41 0.12 613 0.36 0.80 -0.44573 0.53 0.38 0.15 614 0.36 0.45 -0.09574 0.53 0.54 -0.01 615 0.35 0.22 0.13575 0.52 0.53 -0.01 616 0.35 0.51 -0.16576 0.52 0.60 -0.08 617 0.34 0.06 0.28577 0.51 0.37 0.14 618 0.33 0.39 -0.06578 0.51 0.65 -0.14 619 0.33 0.77 -0.44579 0.51 0.70 -0.19 620 0.33 0.67 -0.34580 0.51 0.68 -0.17 621 0.33 0.09 0.24581 0.50 0.66 -0.16 622 0.32 0.24 0.08582 0.50 0.60 -0.10 623 0.30 0.65 -0.35583 0.50 0.24 0.26 624 0.30 -0.02 0.32584 0.49 0.42 0.07 625 0.29 0.72 -0.43585 0.48 0.50 -0.02 626 0.29 0.64 -0.35586 0.48 0.51 -0.03 627 0.29 0.31 -0.02587 0.48 0.09 0.39 628 0.27 -0.27 0.54588 0.47 0.44 0.03 629 0.26 0.52 -0.26589 0.47 0.20 0.27 630 0.25 0.33 -0.08590 0.46 0.44 0.02 631 0.25 0.68 -0.43591 0.46 0.42 0.04 632 0.25 0.24 0.01592 0.46 0.91 -0.45 633 0.24 0.26 -0.02593 0.45 0.08 0.37 634 0.24 0.60 -0.36594 0.45 0.53 -0.08 635 0.24 0.28 -0.04595 0.45 0.28 0.17 636 0.24 0.27 -0.03596 0.44 0.51 -0.07 637 0.24 0.60 -0.36597 0.44 0.60 -0.16 638 0.23 0.42 -0.19598 0.43 0.72 -0.29 639 0.23 0.23 0.00599 0.42 0.40 0.02 640 0.23 0.93 -0.70600 0.42 0.46 -0.04 641 0.22 0.66 -0.44601 0.42 0.43 -0.01 642 0.22 0.31 -0.09602 0.41 1.00 -0.59 643 0.21 0.21 0.00603 0.40 0.90 -0.50 644 0.21 0.38 -0.17604 0.40 1.00 -0.60 645 0.21 0.47 -0.26605 0.39 0.38 0.01 646 0.21 0.33 -0.12606 0.39 0.69 -0.30 647 0.19 0.34 -0.15607 0.39 1.42 -1.03 648 0.18 0.34 -0.16608 0.38 0.73 -0.35 649 0.18 0.71 -0.53609 0.38 0.71 -0.33 650 0.17 0.01 0.1688


Table IX – Terra<strong>QSAR</strong> (<strong>QSAR</strong>6) training set (continued).ID Measured Predicted Residual ID Measured Predicted Residual651 0.17 0.80 -0.63 692 0.00 0.15 -0.15652 0.17 -0.04 0.21 693 -0.01 0.17 -0.18653 0.16 0.42 -0.26 694 -0.02 0.10 -0.12654 0.15 0.45 -0.30 695 -0.02 0.20 -0.22655 0.15 0.17 -0.02 696 -0.03 -0.12 0.09656 0.15 0.09 0.06 697 -0.03 -0.26 0.23657 0.14 1.00 -0.86 698 -0.04 -0.10 0.06658 0.13 -0.06 0.19 699 -0.04 0.52 -0.56659 0.13 0.04 0.09 700 -0.05 0.23 -0.28660 0.12 0.33 -0.21 701 -0.05 0.18 -0.23661 0.12 0.06 0.06 702 -0.06 0.31 -0.37662 0.11 0.11 0.00 703 -0.06 0.18 -0.24663 0.11 0.23 -0.12 704 -0.06 -0.01 -0.05664 0.11 0.33 -0.22 705 -0.08 -0.17 0.09665 0.10 -0.08 0.18 706 -0.08 1.04 -1.12666 0.10 -0.79 0.89 707 -0.08 0.32 -0.40667 0.10 0.09 0.01 708 -0.10 -0.13 0.03668 0.08 0.12 -0.04 709 -0.11 -0.45 0.34669 0.07 0.16 -0.09 710 -0.11 0.20 -0.31670 0.07 0.12 -0.05 711 -0.12 -0.52 0.40671 0.07 0.62 -0.55 712 -0.12 0.06 -0.18672 0.07 -0.22 0.29 713 -0.13 0.81 -0.94673 0.06 0.87 -0.81 714 -0.14 0.04 -0.18674 0.05 0.82 -0.77 715 -0.14 0.19 -0.33675 0.05 0.10 -0.05 716 -0.14 -0.05 -0.09676 0.04 0.07 -0.03 717 -0.14 -0.13 -0.01677 0.04 0.08 -0.04 718 -0.14 -0.11 -0.03678 0.04 0.22 -0.18 719 -0.14 -0.04 -0.10679 0.04 0.33 -0.29 720 -0.15 -0.40 0.25680 0.03 0.10 -0.07 721 -0.16 -0.14 -0.02681 0.03 0.70 -0.67 722 -0.17 -0.06 -0.11682 0.02 0.01 0.01 723 -0.19 -0.42 0.23683 0.02 0.59 -0.57 724 -0.19 -0.11 -0.08684 0.02 -0.42 0.44 725 -0.22 -0.50 0.28685 0.02 0.17 -0.15 726 -0.24 0.00 -0.24686 0.02 -0.12 0.14 727 -0.25 -0.28 0.03687 0.01 0.03 -0.02 728 -0.25 0.22 -0.47688 0.01 0.05 -0.04 729 -0.26 -0.17 -0.09689 0.01 0.54 -0.53 730 -0.26 -0.11 -0.15690 0.00 0.02 -0.02 731 -0.27 -0.27 0.00691 0.00 0.51 -0.51 732 -0.28 -0.60 0.3289


Table IX – TerraQSAR (QSAR6) training set (continued).

ID | Measured | Predicted | Residual | ID | Measured | Predicted | Residual
733 | -0.28 | -0.34 | 0.06 | 774 | -0.58 | -0.51 | -0.07
734 | -0.29 | -0.27 | -0.02 | 775 | -0.59 | -0.81 | 0.22
735 | -0.29 | -0.32 | 0.03 | 776 | -0.59 | -0.33 | -0.26
736 | -0.30 | -0.35 | 0.05 | 777 | -0.61 | -0.70 | 0.09
737 | -0.31 | -0.11 | -0.20 | 778 | -0.61 | -0.26 | -0.35
738 | -0.31 | -0.64 | 0.33 | 779 | -0.62 | -0.31 | -0.31
739 | -0.31 | 0.66 | -0.97 | 780 | -0.63 | -0.52 | -0.11
740 | -0.31 | 0.49 | -0.80 | 781 | -0.63 | 0.52 | -1.15
741 | -0.32 | -0.30 | -0.02 | 782 | -0.64 | -0.38 | -0.26
742 | -0.33 | 0.13 | -0.46 | 783 | -0.64 | -0.39 | -0.25
743 | -0.35 | -0.32 | -0.03 | 784 | -0.64 | -0.81 | 0.17
744 | -0.36 | -0.11 | -0.25 | 785 | -0.68 | -0.50 | -0.18
745 | -0.36 | 0.08 | -0.44 | 786 | -0.71 | -0.55 | -0.16
746 | -0.38 | 0.43 | -0.81 | 787 | -0.71 | -0.37 | -0.34
747 | -0.38 | -0.11 | -0.27 | 788 | -0.72 | -0.65 | -0.07
748 | -0.39 | -0.02 | -0.37 | 789 | -0.73 | -0.33 | -0.40
749 | -0.40 | 0.56 | -0.96 | 790 | -0.73 | -0.68 | -0.05
750 | -0.40 | -0.43 | 0.03 | 791 | -0.74 | -0.71 | -0.03
751 | -0.41 | 0.01 | -0.42 | 792 | -0.74 | -0.33 | -0.41
752 | -0.41 | -0.29 | -0.12 | 793 | -0.77 | -0.70 | -0.07
753 | -0.42 | -0.09 | -0.33 | 794 | -0.81 | -1.02 | 0.21
754 | -0.43 | -0.16 | -0.27 | 795 | -0.82 | -0.49 | -0.33
755 | -0.43 | -0.44 | 0.01 | 796 | -0.84 | -0.66 | -0.18
756 | -0.43 | -0.50 | 0.07 | 797 | -0.84 | -0.83 | -0.01
757 | -0.44 | -0.37 | -0.07 | 798 | -0.85 | -0.22 | -0.63
758 | -0.45 | -0.13 | -0.32 | 799 | -0.87 | -0.24 | -0.63
759 | -0.45 | 0.00 | -0.45 | 800 | -0.88 | -0.81 | -0.07
760 | -0.46 | -0.37 | -0.09 | 801 | -0.88 | -0.69 | -0.19
761 | -0.50 | -0.12 | -0.38 | 802 | -0.88 | -0.65 | -0.23
762 | -0.50 | -0.12 | -0.38 | 803 | -0.89 | -0.82 | -0.07
763 | -0.51 | -0.39 | -0.12 | 804 | -0.89 | -0.48 | -0.41
764 | -0.53 | 0.59 | -1.12 | 805 | -0.90 | -1.14 | 0.24
765 | -0.53 | -0.61 | 0.08 | 806 | -0.92 | -1.10 | 0.18
766 | -0.53 | -0.52 | -0.01 | 807 | -0.93 | -0.82 | -0.11
767 | -0.54 | -0.35 | -0.19 | 808 | -0.93 | -0.51 | -0.42
768 | -0.56 | -0.52 | -0.04 | 809 | -0.94 | -0.50 | -0.44
769 | -0.56 | -0.40 | -0.16 | 810 | -0.96 | -0.92 | -0.04
770 | -0.56 | -0.43 | -0.13 | 811 | -0.98 | -0.42 | -0.56
771 | -0.57 | -0.58 | 0.01 | 812 | -0.99 | -0.70 | -0.29
772 | -0.57 | -0.42 | -0.15 | 813 | -0.99 | -0.91 | -0.08
773 | -0.58 | -0.58 | 0.00 | 814 | -1.00 | -0.73 | -0.27


Table IX – TerraQSAR (QSAR6) training set (continued).

ID | Measured | Predicted | Residual | ID | Measured | Predicted | Residual
815 | -1.07 | -0.77 | -0.30 | 851 | -1.70 | -1.28 | -0.42
816 | -1.07 | -0.21 | -0.86 | 852 | -1.75 | -1.13 | -0.62
817 | -1.09 | -1.00 | -0.09 | 853 | -1.77 | -1.56 | -0.21
818 | -1.10 | -0.74 | -0.36 | 854 | -1.81 | -1.70 | -0.11
819 | -1.10 | -0.04 | -1.06 | 855 | -1.82 | -1.91 | 0.09
820 | -1.11 | -0.63 | -0.48 | 856 | -1.84 | -0.94 | -0.90
821 | -1.13 | -1.04 | -0.09 | 857 | -1.85 | -1.55 | -0.30
822 | -1.16 | -1.14 | -0.02 | 858 | -1.85 | -1.66 | -0.19
823 | -1.18 | -0.60 | -0.58 | 859 | -1.85 | -0.90 | -0.95
824 | -1.19 | -1.20 | 0.01 | 860 | -1.88 | -1.26 | -0.62
825 | -1.19 | -1.20 | 0.01 | 861 | -1.88 | -1.70 | -0.18
826 | -1.21 | -0.98 | -0.23 | 862 | -1.90 | -1.76 | -0.14
827 | -1.22 | -1.13 | -0.09 | 863 | -1.91 | -1.21 | -0.70
828 | -1.22 | -1.02 | -0.20 | 864 | -1.94 | -1.71 | -0.23
829 | -1.23 | -1.31 | 0.08 | 865 | -1.94 | -1.38 | -0.56
830 | -1.25 | -0.88 | -0.37 | 866 | -1.95 | -1.77 | -0.18
831 | -1.25 | -1.21 | -0.04 | 867 | -1.95 | -1.69 | -0.26
832 | -1.30 | -1.31 | 0.01 | 868 | -1.96 | -1.92 | -0.04
833 | -1.32 | -1.25 | -0.07 | 869 | -1.96 | -0.90 | -1.06
834 | -1.33 | -1.12 | -0.21 | 870 | -2.05 | -1.78 | -0.27
835 | -1.35 | -1.35 | 0.00 | 871 | -2.10 | -1.85 | -0.25
836 | -1.35 | -1.32 | -0.03 | 872 | -2.11 | -2.11 | 0.00
837 | -1.36 | -1.43 | 0.07 | 873 | -2.15 | -1.54 | -0.61
838 | -1.37 | -1.22 | -0.15 | 874 | -2.16 | -1.06 | -1.10
839 | -1.41 | -1.15 | -0.26 | 875 | -2.21 | -1.72 | -0.49
840 | -1.44 | -1.50 | 0.06 | 876 | -2.40 | -2.48 | 0.08
841 | -1.48 | -0.74 | -0.74 | 877 | -2.45 | -2.17 | -0.28
842 | -1.53 | -1.19 | -0.34 | 878 | -2.51 | -2.08 | -0.43
843 | -1.54 | -1.21 | -0.33 | 879 | -2.55 | -2.43 | -0.12
844 | -1.59 | -0.91 | -0.68 | 880 | -2.64 | -1.86 | -0.78
845 | -1.59 | -0.26 | -1.33 | 881 | -2.67 | -2.42 | -0.25
846 | -1.61 | -1.55 | -0.06 | 882 | -2.81 | -2.56 | -0.25
847 | -1.65 | -1.38 | -0.27 | 883 | -2.85 | -2.52 | -0.33
848 | -1.65 | -1.46 | -0.19 | 884 | -2.93 | -2.69 | -0.24
849 | -1.69 | -1.73 | 0.04 | 885 | -2.95 | -2.23 | -0.72
850 | -1.69 | -1.59 | -0.10 | 886 | -3.07 | -2.92 | -0.15


Table X – TerraQSAR predictions for the SIDS test data.

ID | CASN | EINECS name | MOA | LogLC50 (mmol/l) | Predicted LogLC50 (mmol/l) | Residual in prediction
1 | 50-00-0 | Formaldehyde- | SB | -0.081 | 0.876 | -0.957
2 | 56-81-5 | 1,2,3-Propanetriol | NPN | | 2.798 |
3 | 57-55-6 | 1,2-Propanediol | NPN | 2.162 | 2.213 | -0.052
4 | 58-08-2 | 1H-Purine-2,6-dione, 3,7-dihydro-1,3,7-trimethyl- | CNS | | -0.111 |
5 | 58-55-9 | 1H-Purine-2,6-dione, 3,7-dihydro-1,3-dimethyl- | NPN | | -0.110 |
6 | 60-00-4 | Glycine, N,N'-1,2-ethanediylbis[N-(carboxymethyl)- | UNK | -0.689 | -0.682 | -0.007
7 | 64-02-8 | Glycine, N,N'-1,2-ethanediylbis[N-(carboxymethyl)-, tetrasodium salt | UNK | | -0.682 |
8 | 68-12-2 | Formamide, N,N-dimethyl- | NPN | 2.161 | 2.160 | 0.001
9 | 71-36-3 | 1-Butanol | NPN | 1.399 | 1.238 | 0.161
10 | 74-83-9 | Methane, bromo- | NPN | | -0.276 |
11 | 74-87-3 | Methane, chloro- | NPN | | 0.365 |
12 | 75-01-4 | Ethene, chloro- | NPN | | -1.048 |
13 | 75-10-5 | Methane, difluoro- | NPN | | 0.557 |
14 | 75-38-7 | Ethene, 1,1-difluoro- | NPN | | -0.442 |
15 | 75-56-9 | Oxirane, methyl- | UNK | | 0.480 |
16 | 75-68-3 | Ethane, 1-chloro-1,1-difluoro- | NPN | | 0.109 |
17 | 77-92-9 | 1,2,3-Propanetricarboxylic acid, 2-hydroxy- | NPN | | 0.347 |
18 | 78-59-1 | 2-Cyclohexen-1-one, 3,5,5-trimethyl- | MTA | 0.238 | 0.020 | 0.218
19 | 78-70-6 | 1,6-Octadien-3-ol, 3,7-dimethyl- | PE | | -1.211 |
20 | 78-87-5 | Propane, 1,2-dichloro- | NPN | 0.093 | 0.050 | 0.043
21 | 78-92-2 | 2-Butanol | NPN | 1.695 | 1.165 | 0.530
22 | 79-00-5 | Ethane, 1,1,2-trichloro- | NPN | -0.214 | -0.082 | -0.131
23 | 79-06-1 | 2-Propenamide | MTA | 0.233 | 0.314 | -0.081
24 | 79-10-7 | 2-Propenoic acid | UNK | | -0.483 |
25 | 79-11-8 | Acetic acid, chloro- | UNK | | 0.723 |
26 | 79-20-9 | Acetic acid, methyl ester | NPN | 0.635 | 0.640 | -0.005
27 | 79-31-2 | Propanoic acid, 2-methyl- | UNK | | 0.616 |
28 | 79-34-5 | Ethane, 1,1,2,2-tetrachloro- | NPN | -0.917 | -0.920 | 0.003
29 | 79-39-0 | 2-Propenamide, 2-methyl- | MTA | | 1.318 |
30 | 79-41-4 | 2-Propenoic acid, 2-methyl- | UNK | | -0.283 |


Table X – TerraQSAR predictions for the SIDS test data (continued).

ID | CASN | EINECS name | MOA | LogLC50 (mmol/l) | Predicted LogLC50 (mmol/l) | Residual in prediction
31 | 80-05-7 | Phenol, 4,4'-(1-methylethylidene)bis- | PN | -1.696 | -1.700 | 0.004
32 | 80-62-6 | 2-Propenoic acid, 2-methyl-, methyl ester | MTA | 0.448 | -0.145 | 0.593
33 | 81-14-1 | Ethanone, 1-[4-(1,1-dimethylethyl)-2,6-dimethyl-3,5-dinitrophenyl]- | UNK | | -2.672 |
34 | 81-15-2 | Benzene, 1-(1,1-dimethylethyl)-3,5-dimethyl-2,4,6-trinitro- | UNK | | -2.389 |
35 | 84-74-2 | 1,2-Benzenedicarboxylic acid, dibutyl ester | UNK | -2.306 | -2.400 | 0.094
36 | 87-56-9 | 2-Butenoic acid, 2,3-dichloro-4-oxo-, (Z)- | UNK | | -0.537 |
37 | 88-12-0 | 2-Pyrrolidinone, 1-ethenyl- | UNK | | 1.720 |
38 | 88-19-7 | Benzenesulfonamide, 2-methyl- | UNK | | -1.490 |
39 | 88-44-8 | Benzenesulfonic acid, 2-amino-5-methyl- | UNK | | -0.239 |
40 | 88-60-8 | Phenol, 2-(1,1-dimethylethyl)-5-methyl- | PN | | -1.488 |
41 | 88-73-3 | Benzene, 1-chloro-2-nitro- | UNK | | -1.023 |
42 | 88-74-4 | Benzenamine, 2-nitro- | PN | | -1.000 |
43 | 91-15-6 | 1,2-Benzenedicarbonitrile | UNK | | -0.674 |
44 | 91-76-9 | 1,3,5-Triazine-2,4-diamine, 6-phenyl- | CNS | | -1.667 |
45 | 93-68-5 | Butanamide, N-(2-methylphenyl)-3-oxo- | UNK | 0.218 | -0.740 | 0.959
46 | 94-36-0 | Peroxide, dibenzoyl | UNK | | -1.016 |
47 | 95-31-8 | 2-Benzothiazolesulfenamide, N-(1,1-dimethylethyl)- | NPN | | -1.306 |
48 | 95-49-8 | Benzene, 1-chloro-2-methyl- | NPN | | -1.133 |
49 | 95-50-1 | Benzene, 1,2-dichloro- | NPN | -0.411 | -1.440 | 1.029
50 | 96-18-4 | Propane, 1,2,3-trichloro- | NPN | -0.346 | -0.229 | -0.117
51 | 96-29-7 | 2-Butanone, oxime | NPN | 0.986 | 0.923 | 0.063
52 | 96-31-1 | Urea, N,N'-dimethyl- | NPN | | 1.992 |
53 | 96-33-3 | 2-Propenoic acid, methyl ester | MTA | | -1.593 |
54 | 97-72-3 | Propanoic acid, 2-methyl-, anhydride | UNK | | -0.765 |
55 | 98-07-7 | Benzene, (trichloromethyl)- | NPN | | -1.141 |


Table X – TerraQSAR predictions for the SIDS test data (continued).

ID | CASN | EINECS name | MOA | LogLC50 (mmol/l) | Predicted LogLC50 (mmol/l) | Residual in prediction
56 | 98-54-4 | Phenol, 4-(1,1-dimethylethyl)- | PN | -1.466 | -1.539 | 0.073
57 | 98-59-9 | Benzenesulfonyl chloride, 4-methyl- | UNK | | -0.753 |
58 | 98-92-0 | 3-Pyridinecarboxamide | PN | | 0.085 |
59 | 99-04-7 | Benzoic acid, 3-methyl- | UNK | | -0.028 |
60 | 99-54-7 | Benzene, 1,2-dichloro-4-nitro- | UNK | | -1.437 |
61 | 99-99-0 | Benzene, 1-methyl-4-nitro- | UNK | -0.438 | -0.653 | 0.215
62 | 100-00-5 | Benzene, 1-chloro-4-nitro- | UNK | | -1.023 |
63 | 100-21-0 | 1,4-Benzenedicarboxylic acid | UNK | | -0.009 |
64 | 100-37-8 | Ethanol, 2-(diethylamino)- | UNK | 1.182 | 1.180 | 0.002
65 | 100-41-4 | Benzene, ethyl- | NPN | -0.943 | -0.886 | -0.058
66 | 102-06-7 | Guanidine, N,N'-diphenyl- | NPN | | -0.965 |
67 | 102-76-1 | 1,2,3-Propanetriol, triacetate | UNK | -0.121 | -0.570 | 0.449
68 | 103-11-7 | 2-Propenoic acid, 2-ethylhexyl ester | MTA | | -1.940 |
69 | 103-84-4 | Acetamide, N-phenyl- | UNK | | -0.265 |
70 | 105-60-2 | 2H-Azepin-2-one, hexahydro- | NPN | | 1.750 |
71 | 106-31-0 | Butanoic acid, anhydride | UNK | | -1.134 |
72 | 106-46-7 | Benzene, 1,4-dichloro- | NPN | -1.015 | -1.440 | 0.425
73 | 106-63-8 | 2-Propenoic acid, 2-methylpropyl ester | MTA | -1.788 | -1.799 | 0.012
74 | 106-88-7 | Oxirane, ethyl- | UNK | | 0.460 |
75 | 107-06-2 | Ethane, 1,2-dichloro- | NPN | 0.069 | 0.020 | 0.049
76 | 107-15-3 | 1,2-Ethanediamine | UNK | 0.424 | 0.592 | -0.169
77 | 107-22-2 | Ethanedial- | UNK | 0.569 | 0.372 | 0.197
78 | 107-41-5 | 2,4-Pentanediol, 2-methyl- | NPN | 1.912 | 1.960 | -0.049
79 | 107-86-8 | 2-Butenal, 3-methyl- | MTA | | -0.968 |
80 | 107-92-6 | Butanoic-acid- | UNK | | 0.360 |
81 | 107-98-2 | 2-Propanol, 1-methoxy- | NPN | 2.363 | 1.928 | 0.435
82 | 108-44-1 | Benzenamine, 3-methyl- | PN | | -0.245 |
83 | 108-65-6 | 2-Propanol, 1-methoxy-, acetate | EN | | -0.246 |
84 | 108-77-0 | 1,3,5-Triazine, 2,4,6-trichloro- | UNK | | -3.002 |
85 | 108-88-3 | Benzene, methyl- | NPN | -0.549 | -0.722 | 0.172
86 | 109-66-0 | Pentane- | NPN | | -0.673 |
87 | 110-16-7 | 2-Butenedioic acid (Z)- | UNK | -1.366 | -1.253 | -0.113
88 | 110-19-0 | Acetic acid, 2-methylpropyl ester | EN | | -0.236 |
89 | 110-65-6 | 2-Butyne-1,4-diol | PE | -0.206 | -0.296 | 0.090
90 | 110-83-8 | Cyclohexene- | NPN | | -0.437 |


Table X – TerraQSAR predictions for the SIDS test data (continued).

ID | CASN | EINECS name | MOA | LogLC50 (mmol/l) | Predicted LogLC50 (mmol/l) | Residual in prediction
91 | 110-85-0 | Piperazine- | NPN | | 1.277 |
92 | 110-93-0 | 5-Hepten-2-one, 6-methyl- | NPN | -0.167 | -0.123 | -0.043
93 | 110-98-5 | 2-Propanol, 1,1'-oxybis- | NPN | | 1.752 |
94 | 112-57-2 | 1,2-Ethanediamine, N-(2-aminoethyl)-N'-[2-[(2-aminoethyl)amino]ethyl]- | NPN | | 3.065 |
95 | 112-85-6 | Docosanoic-acid- | UNK | | -2.409 |
96 | 115-07-1 | 1-Propene | NPN | | -0.337 |
97 | 115-11-7 | 1-Propene, 2-methyl- | NPN | | -0.111 |
98 | 115-86-6 | Phosphoric acid, triphenyl ester | UNK | -2.594 | -2.570 | -0.024
99 | 115-95-7 | 1,6-Octadien-3-ol, 3,7-dimethyl-, acetate | UNK | | -1.089 |
100 | 118-79-6 | Phenol, 2,4,6-tribromo- | PN | -1.705 | -1.700 | -0.005
101 | 119-47-1 | Phenol, 2,2'-methylenebis[6-(1,1-dimethylethyl)-4-methyl- | NPN | | -2.900 |
102 | 120-61-6 | 1,4-Benzenedicarboxylic acid, dimethyl ester | UNK | | -0.700 |
103 | 120-80-9 | 1,2-Benzenediol | UNK | -1.288 | -1.080 | -0.208
104 | 120-83-2 | Phenol, 2,4-dichloro- | PN | -1.277 | -1.670 | 0.393
105 | 121-91-5 | 1,3-Benzenedicarboxylic acid | UNK | | -0.009 |
106 | 122-52-1 | Phosphorous acid, triethyl ester | UNK | | 0.329 |
107 | 122-99-6 | Ethanol, 2-phenoxy- | NPN | 0.396 | -0.339 | 0.735
108 | 123-54-6 | 2,4-Pentanedione | UNK | 0.140 | 0.347 | -0.207
109 | 123-77-3 | Diazenedicarboxamide- | UNK | | 2.437 |
110 | 123-86-4 | Acetic acid, butyl ester | EN | -0.810 | -0.810 | 0.000
111 | 124-04-9 | Hexanedioic-acid- | UNK | -0.178 | 1.850 | -2.028
112 | 126-73-8 | Phosphoric-acid-tributyl-ester- | AChE | -1.774 | -1.510 | -0.264
113 | 126-98-7 | 2-Propenenitrile, 2-methyl- | MTA | | 0.543 |
114 | 127-19-5 | Acetamide, N,N-dimethyl- | NPN | | 1.122 |
115 | 128-37-0 | Phenol, 2,6-bis(1,1-dimethylethyl)-4-methyl- | PN | | -2.780 |
116 | 135-19-3 | 2-Naphthalenol | PN | -1.620 | -1.585 | -0.035
117 | 140-88-5 | 2-Propenoic acid, ethyl ester | MTA | -1.603 | -1.658 | 0.055
118 | 141-10-6 | 3,5,9-Undecatrien-2-one, 6,10-dimethyl- | MTA | | -1.329 |
119 | 141-32-2 | 2-Propenoic acid, butyl ester | MTA | | -1.802 |
120 | 141-78-6 | Acetic-acid-ethyl-ester- | EN | 0.417 | 0.420 | -0.003
121 | 141-97-9 | Butanoic acid, 3-oxo-, ethyl ester | UNK | | -0.021 |


Table X – TerraQSAR predictions for the SIDS test data (continued).

ID | CASN | EINECS name | MOA | LogLC50 (mmol/l) | Predicted LogLC50 (mmol/l) | Residual in prediction
122 | 144-55-8 | Carbonic-acid-monosodium-salt- | UNK | | 0.403 |
123 | 150-90-3 | Butanedioic acid, disodium salt | UNK | | 1.251 |
124 | 288-32-4 | 1H-Imidazole | NPN | | 0.498 |
125 | 461-58-5 | Guanidine, cyano- | UNK | | 0.051 |
126 | 505-32-8 | 1-Hexadecen-3-ol, 3,7,11,15-tetramethyl- | NPN | | -1.597 |
127 | 528-44-9 | 1,2,4-Benzenetricarboxylic acid | UNK | | 0.277 |
128 | 552-30-7 | 5-Isobenzofurancarboxylic acid, 1,3-dihydro-1,3-dioxo- | UNK | | -1.495 |
129 | 556-82-1 | 2-Buten-1-ol, 3-methyl- | NPN | | -0.816 |
130 | 611-19-8 | Benzene, 1-chloro-2-(chloromethyl)- | UNK | | -1.386 |
131 | 760-23-6 | 1-Butene, 3,4-dichloro- | SN2 | -1.184 | -1.526 | 0.342
132 | 770-35-4 | 2-Propanol, 1-phenoxy- | NPN | 0.265 | -0.693 | 0.958
133 | 793-24-8 | 1,4-Benzenediamine, N-(1,3-dimethylbutyl)-N'-phenyl- | NPN | | -1.315 |
134 | 822-06-0 | Hexane, 1,6-diisocyanato- | UNK | | -0.020 |
135 | 839-90-7 | 1,3,5-Triazine-2,4,6(1H,3H,5H)-trione, 1,3,5-tris(2-hydroxyethyl)- | UNK | | -1.528 |
136 | 868-77-9 | 2-Propenoic acid, 2-methyl-, 2-hydroxyethyl ester | MTA | 0.242 | -0.098 | 0.339
137 | 868-85-9 | Phosphonic acid, dimethyl ester | UNK | 0.311 | 0.310 | 0.001
138 | 919-30-2 | 3-Aminopropyl-triethoxysilane | UNK | | -0.057 |
139 | 1163-19-5 | Benzene, 1,1'-oxybis[2,3,4,5,6-pentabromo- | NPN | | -3.700 |
140 | 1477-55-0 | 1,3-Benzenedimethanamine | NPN | | -0.180 |
141 | 1490-04-6 | Cyclohexanol, 5-methyl-2-(1-methylethyl)- | NPN | -0.929 | -0.916 | -0.013
142 | 1634-04-4 | Propane, 2-methoxy-2-methyl- | NPN | 0.882 | 0.664 | 0.218
143 | 1717-00-6 | HCFC 141b | NPN | | 0.028 |
144 | 2216-51-5 | Cyclohexanol, 5-methyl-2-(1-methylethyl)-, [1R-(1alpha,2beta,5alpha)]- | NPN | | -0.916 |
145 | 2403-88-5 | 4-Piperidinol, 2,2,6,6-tetramethyl- | NPN | | -0.592 |
146 | 2432-99-7 | Undecanoic acid, 11-amino- | UNK | | -1.213 |
147 | 2439-35-2 | 2-Propenoic acid, 2-(dimethylamino)ethyl ester | MTA | | -1.360 |


Table X – TerraQSAR predictions for the SIDS test data (continued).

ID | CASN | EINECS name | MOA | LogLC50 (mmol/l) | Predicted LogLC50 (mmol/l) | Residual in prediction
148 | 2837-89-0 | Ethane, 2-chloro-1,1,1,2-tetrafluoro- | NPN | | -0.221 |
149 | 2855-13-2 | Cyclohexanemethanamine, 5-amino-1,3,3-trimethyl- | NPN | | -1.120 |
150 | 2867-47-2 | 2-Propenoic acid, 2-methyl-, 2-(dimethylamino)ethyl ester | MTA | | -0.510 |
151 | 3268-49-3 | Propanal, 3-(methylthio)- | UNK | | -0.667 |
152 | 3319-31-1 | 1,2,4-Benzenetricarboxylic acid, tris(2-ethylhexyl) ester | UNK | | -1.536 |
153 | 3323-53-3 | Hexanedioic acid, compd. with 1,6-hexanediamine (1:1) | NPN | | 1.850 |
154 | 3452-97-9 | 1-Hexanol, 3,5,5-trimethyl- | NPN | | -0.385 |
155 | 4016-24-4 | Hexadecanoic acid, 2-sulfo-, 1-methyl ester, sodium salt | UNK | | -1.626 |
156 | 4169-04-4 | 1-Propanol, 2-phenoxy- | NPN | 0.265 | -0.693 | 0.958
157 | 4454-05-1 | 2H-Pyran, 3,4-dihydro-2-methoxy- | NPN | | 0.324 |
158 | 4457-71-0 | 1,5-Pentanediol, 3-methyl- | NPN | | 0.853 |
159 | 4979-32-2 | 2-Benzothiazolesulfenamide, N,N-dicyclohexyl- | NPN | | -1.262 |
160 | 5102-83-0 | Butanamide, 2,2'-[(3,3'-dichloro[1,1'-biphenyl]-4,4'-diyl)bis(azo)]bis[N-(2,4-dimethylphenyl)-3-oxo- | NPN | | -1.110 |
161 | 5392-40-5 | 2,6-Octadienal, 3,7-dimethyl- | MTA | | -0.904 |
162 | 5567-15-7 | Butanamide, 2,2'-[(3,3'-dichloro[1,1'-biphenyl]-4,4'-diyl)bis(azo)]bis[N-(4-chloro-2,5-dimethoxyphenyl)-3-oxo- | NPN | | -0.966 |
163 | 6165-51-1 | Benzene, 1,4-dimethyl-2-(1-phenylethyl)- | NPN | | -2.198 |
164 | 6358-85-6 | Butanamide, 2,2'-[(3,3'-dichloro[1,1'-biphenyl]-4,4'-diyl)bis(azo)]bis[3-oxo-N-phenyl- | NPN | | -1.161 |
165 | 6386-38-5 | Benzenepropanoic acid, 3,5-bis(1,1-dimethylethyl)-4-hydroxy-, methyl ester | EN | | -2.789 |


Table X – TerraQSAR predictions for the SIDS test data (continued).

ID | CASN | EINECS name | MOA | LogLC50 (mmol/l) | Predicted LogLC50 (mmol/l) | Residual in prediction
166 | 6422-86-2 | 1,4-Benzenedicarboxylic acid, bis(2-ethylhexyl) ester | NPN | | -1.465 |
167 | 6864-37-5 | Cyclohexanamine, 4,4'-methylenebis[2-methyl- | NPN | | -2.100 |
168 | 11070-44-3 | 1,3-Isobenzofurandione, tetrahydromethyl- | UNK | | -0.741 |
169 | 25154-52-3 | Phenol, nonyl- | PN | -3.236 | -3.205 | -0.031
170 | 25265-71-8 | Propanol, oxybis- | NPN | | 1.763 |
171 | 25321-09-9 | Benzene, bis(1-methylethyl)- | NPN | | -0.990 |
172 | 25321-14-6 | Benzene, methyldinitro- | UNK | -1.030 | -1.482 | 0.452
173 | 31570-04-4 | Phenol, 2,4-bis(1,1-dimethylethyl)-, phosphite (3:1) | NPN | | -3.629 |
174 | 32534-81-9 | Benzene, 1,1'-oxybis-, pentabromo deriv. | NPN | | -2.384 |
175 | 32536-52-0 | Benzene, 1,1'-oxybis-, octabromo deriv. | NPN | | -3.681 |
176 | 56539-66-3 | 1-Butanol, 3-methoxy-3-methyl- | NPN | | 0.838 |
177 | 84852-15-3 | Phenol, 4-nonyl-, branched | PN | -3.236 | -2.750 | -0.486


Table XI – Model performance comparison.

Model | N.Train | SIDS Train | Q² LOO | Q² bootstrap | SDEP | R² | Test MOA | N.Test | Unknown SIDS predictions | Total SIDS predictions | Q² ext
NPN | 58 | 8 | 91.51 | 91.66 | 0.421 | 92.18 | NPN | 14 | 37 | 51 | 89.06
 | | | | | | | Mixed | 28 | 97 | 125 | 90.86
PN | 86 | 5 | 89.59 | 89.64 | 0.336 | 90.07 | PN | 2 | 4 | 6 | N.A.
 | | | | | | | Mixed | 25 | 98 | 123 | 86.66
N | 144 | 13 | 87.06 | 87.11 | 0.461 | 87.55 | NPN+PN | 13 | 41 | 54 | 92.18
 | | | | | | | Mixed | 24 | 97 | 121 | 91.63
MIXED | 114 | 9 | 75.94 | 75.83 | 0.495 | 77.57 | Mixed | 22 | 51 | 73 | 87.10
E-State | 121 | 8 | 68.28 | 9.30 | 0.505 | 84.04 | Mixed | 17 | 69 | 86 | 89.43
TerraQSAR | 886 | N.A. | N.A. | N.A. | N.A. | 94.56 | Mixed | 57 | 120 | 177 | 99.39




APPENDIX I: TERMINOLOGY AND STATISTICAL BACKGROUND

Boot-strapping

In this validation technique, the original size of the data set (n) is preserved for the training set by selecting n objects with repetition; in this way the training set usually contains repeated objects, and the evaluation set consists of the objects left out [Efron, B. 1982; 1987]. The model is calculated on the training set and responses are predicted for the evaluation set. All the squared differences between the true response and the predicted response of the objects in the evaluation set are collected in PRESS. This procedure of building training sets and evaluation sets is repeated thousands of times, the PRESS values are summed, and the average predictive power is calculated ($Q^2_{Boot}$). Thus, the validation is performed by randomly generating training sets with sample repetitions and then evaluating the predicted responses of the samples not included in the training set.

Chemical Domain of Model applicability

The chemical domain of applicability of a model has recently been defined [Netzeva et al., 2005] as: “The applicability domain of a (Q)SAR model is the response and chemical structure space in which the model makes predictions with a given reliability.” Here the chemical structure can be expressed by physicochemical and/or fragmental information, and the response can be any physicochemical, biological or environmental effect that is being predicted. The relationship between chemical structure and the response can be developed by a variety of SARs and QSARs.
Thus, the chemical domain of applicability is a theoretical region in the space defined by the modeled response and the descriptors of the model, for which a given QSAR should make reliable predictions. This region is defined by the nature of the chemicals in the training set, and can be characterized in various ways: in this work the leverage approach has been used.

The Williams plot, or Ordinary Least Squares (OLS) Outlier and Leverage Plot, is the plot of jackknifed residuals versus leverages (hat diagonals). In this plot the horizontal and vertical straight lines indicate the limits of normal values: the first for the outliers and the second for influential chemicals.

The jackknifed residual, also called the standardized residual in prediction and referred to as Std Error, is calculated as the ordinary residual in prediction divided by the residual standard deviation:

$$\hat{e}_{i/s} = \frac{\hat{e}_{i/i}}{s \cdot \sqrt{1 - h_{ii}}}$$

where $\hat{e}_{i/i}$ is the ordinary residual in prediction of the i-th object, $s$ is the standard error of the estimate and $h_{ii}$ is the leverage value of the i-th object.

It can be noted that, while outliers for the response can be highlighted only for chemicals with known responses, whether a chemical lies outside the structural applicability domain of a model, and thus the reliability of its prediction, can be verified for every new chemical, the only knowledge needed being the molecular structure. The


Williams plot of the regression allows a graphical detection of both the outliers for the response and the structurally influential chemicals in a model.

External validation

The external validation technique makes use of a test set retained to perform a further check on the predictive capabilities of a model obtained from a training set and with predictive power optimized by an evaluation set. Using the selected model, the values of the response for the test objects are calculated, and the quality of these predictions is expressed in terms of $Q^2_{ext}$, defined as:

$$Q^2_{ext} = 1 - \frac{\sum_{i=1}^{n_{ext}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{ext}} (y_i - \bar{y})^2}$$

where the sums run over the test set objects ($n_{ext}$) and $\bar{y}$ is the average value of the training set responses.

Fitness regression parameters

The performance of the QSAR model can be evaluated by several regression parameters. A first group of them is devoted to evaluating the goodness of fit, i.e. the capability of the model to fit the data of the training set, providing a measure of how well the regression model accounts for the variance of the response variable.

Some of the most widely used parameters, proposed for the comparison of models or the selection of the best subset of models, are the following:

• Residual Sum of Squares, RSS (also called error sum of squares).
The sum of squared differences between the observed ($y$) and estimated ($\hat{y}$) responses:

$$RSS = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

where $n$ is the number of training objects. This quantity is minimized by the least squares estimator.

• Model Sum of Squares, MSS, defined as the sum of the squared differences between the estimated responses and the average response:

$$MSS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

This is the part of the total variance explained by the regression model, as opposed to the residual sum of squares RSS.

• Total Sum of Squares, TSS, defined as the sum of the squared differences between the experimental responses and the average response:


$$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

This is the total variance that a regression model has to explain, and it is used as a no-model reference quantity to calculate standard quality parameters such as the coefficient of determination.

• Coefficient of determination, $R^2$. The squared multiple correlation coefficient, i.e. the fraction of the total variance of the response explained by a regression model. It can be calculated from the model sum of squares MSS or from the residual sum of squares RSS:

$$R^2 = \frac{MSS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where TSS is the total sum of squares around the mean. A value of one indicates a perfect fit, i.e. a model with zero error term.

• Residual Mean Square, RMS or $s^2$ (also called mean square error or expected squared error). The estimate $s^2$ of the error variance $\sigma^2$, defined as:

$$s^2 = \frac{RSS}{df_E}$$

where RSS is the residual sum of squares and $df_E$ is the error degrees of freedom, i.e. $n - p'$, where $n$ is the number of objects (samples) and $p'$ is the number of model parameters (for example, $n - p - 1$ for a regression model with $p$ variables and the intercept). The standard error of the estimate $s$ is the square root of the residual mean square.

• Standard Deviation Error in Calculation, SDEC, also known as standard error in calculation, SEC. A function of the residual sum of squares, defined as:

$$SDEC = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} = \sqrt{\frac{RSS}{n}}$$

• F Fisher function. One of the best-known statistical tests, it is defined as the ratio between the model sum of squares MSS and the residual sum of squares RSS, each divided by its degrees of freedom:

$$F = \frac{MSS / df_M}{RSS / df_E}$$

where $df_M$ and $df_E$ refer to the degrees of freedom of the model and error, respectively.
The calculated value is compared with the critical value $F_{crit}$ for the corresponding degrees of freedom. It is a comparison between the model


explained variance and the residual variance: high values of the F-ratio test indicate reliable models.

• Adjusted $R^2$. A fitness parameter adjusted for the degrees of freedom, so that it can be used for comparing models with different numbers of predictor variables:

$$R^2_{adj} = 1 - \frac{RSS / df_E}{TSS / df_T} = 1 - (1 - R^2) \left( \frac{n - 1}{n - p'} \right)$$

where RSS and TSS are the residual sum of squares and the total sum of squares, respectively; $df_T$ refers to the total degrees of freedom; $R^2$ is the coefficient of determination.

• FIT Kubinyi function [Kubinyi, H. 1994]:

$$FIT = \frac{R^2}{(1 - R^2)} \cdot \frac{(n - p')}{(n + p^2)}$$

where $R^2$ is the coefficient of determination.

• Akaike Information Criterion, AIC. A model selection criterion for choosing between models with different parameters and defined as:

$$AIC = RSS \cdot \frac{(n + p')}{(n - p')^2}$$

Hotelling ellipse

Hotelling's $T^2$ statistic is the multivariate equivalent of Student's t statistic, and provides a check for observations adhering to multivariate normality. The Hotelling $T^2$ for observation $i$, based on $p$ components, is defined as:

$$T_i^2 = \sum_{j=1}^{p} \frac{t_{ij}^2}{s_{t_j}^2}$$

where $s_{t_j}^2$ is the variance of $t_j$.

For a given observation $i$, the Hotelling $T^2$ is a combination of all the X-scores ($t$) in all $p$ components.
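As a minimal sketch of the definition above, the following Python function computes the Hotelling $T^2$ value of each observation from a table of component scores. It assumes that the scores are mean-centred (as principal component scores are) and that the variance of each score column is the sample variance; the function name `hotelling_t2` and the toy scores are illustrative only and do not come from this report.

```python
from statistics import variance


def hotelling_t2(scores):
    """Return T^2 for every observation.

    scores: list of rows, one row per observation and one column per
    component (assumed mean-centred, at least two observations).
    """
    n_components = len(scores[0])
    # Sample variance s^2 of each score column t_j.
    variances = [
        variance([row[j] for row in scores]) for j in range(n_components)
    ]
    # T_i^2 = sum over components of t_ij^2 / s_tj^2.
    return [
        sum(row[j] ** 2 / variances[j] for j in range(n_components))
        for row in scores
    ]


# Hypothetical scores for four observations on two components.
toy_scores = [[2.0, 0.5], [-2.0, -0.5], [1.0, -1.0], [-1.0, 1.0]]
toy_t2 = hotelling_t2(toy_scores)
```

A quick sanity check under these assumptions: for centred scores and sample variances, the $T^2$ values of n observations on p components sum to (n − 1)·p.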
The Hotelling $T^2$ control chart yields a summary of all the process variables and all model dimensions, displaying how far away from the center (target) the process is along the PC model hyperplane.

The significance level used to compute the Hotelling $T^2$ ellipse and the critical distance to the model is often 0.05 by default (95% confidence).

Leverage

The leverage of a chemical provides a measure of the distance of the chemical from the centroid of X. Chemicals close to the centroid are less influential in model building than extreme points. The leverages of all chemicals in the data set are generated by


manipulating X to give the so-called Influence Matrix or Hat Matrix (H), a symmetric matrix defined as:

$$H = X \cdot (X^T X)^{-1} \cdot X^T$$

where $X$ is the descriptor matrix, $X^T$ is the transpose of $X$, and $(A)^{-1}$ is the inverse of matrix $A$. The leverages or hat values ($h_{ii}$) of the chemicals ($i$) in the descriptor space are the diagonal elements of $H$, and can be computed as:

$$h_{ii} = x_i \cdot (X^T X)^{-1} \cdot x_i^T$$

where $x_i$ is the descriptor row-vector of the query chemical.

The leverage matrix is related to the response vector $y$ by the following relationship:

$$\hat{y} = H y$$

where $\hat{y}$ is the response vector calculated from the model.

A “warning leverage” (h*) is generally fixed at 3p/n, where n is the number of training chemicals and p is the number of model variables plus one. A chemical with high leverage in the training set greatly influences the regression line: the fitted regression line is forced near the observed value and its residual (observed minus predicted value) is small, so the chemical does not appear to be an outlier, even though it may actually be outside the applicability domain (AD). In contrast, if a chemical in the test set has a hat value greater than the warning leverage h*, the prediction is the result of substantial extrapolation and therefore may not be reliable.

Leave-one-out cross-validation

The simplest and most general cross-validation procedure is the leave-one-out (LOO) technique, where each object is taken away, one at a time. In this case, given n objects, n reduced models have to be calculated.

For each reduced data set, the model is calculated and the response for the deleted object is predicted from the model. The squared differences between the true response and the predicted response for the object left out are added to PRESS (predictive residual sum of squares).
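The leave-one-out loop described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a one-descriptor ordinary least-squares line stands in for the QSAR model, and the helper names `fit_line` and `loo_press` as well as the toy data are hypothetical, not taken from this report.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single descriptor.

    Assumes the x values are not all identical.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def loo_press(xs, ys):
    """Accumulate PRESS by leaving each object out once."""
    press = 0.0
    for i in range(len(xs)):
        # Reduced data set: every object except the i-th one.
        x_train = xs[:i] + xs[i + 1:]
        y_train = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        # Predict the deleted object and add its squared residual.
        predicted = slope * xs[i] + intercept
        press += (ys[i] - predicted) ** 2
    return press


# Hypothetical toy data: three objects, one descriptor.
toy_press = loo_press([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
```

With n objects the loop fits n reduced models, which is why the deletion scheme is unique for a given data set.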
From the final PRESS, the $Q^2$ (or $R^2_{CV}$) and SDEP (standard deviation error of prediction) values are usually calculated.

This technique is particularly important because the deletion scheme is unique, so the predictive ability of different models can be compared accurately. However, in several cases the predictive ability obtained is too optimistic, particularly when the number of objects is quite large. This is because the perturbation of the data is too small when only one object is left out.

Multidimensional scaling (MDS)

Multidimensional scaling (MDS) is a widely used multivariate technique for exploratory data analysis, which can be considered an alternative to factor analysis and is typically used to visualize objects in a low-dimensional space. In general, the analysis allows the detection of meaningful underlying dimensions for similarities or dissimilarities (distances) between the investigated chemicals. In factor analysis, the


similarities between objects (e.g., variables) are expressed in the correlation matrix. With MDS it is possible to analyze not only correlation matrices but also any kind of similarity or dissimilarity matrix. Non-metric multidimensional scaling works on the distance matrix D obtained from the original multidimensional data matrix X, using the Euclidean distance; starting from a scaling of the objects in full-dimensional space, it attempts to obtain a representation in a Cartesian coordinate system of a set of objects whose relationships are measured by a dissimilarity coefficient, i.e. the selected distance. The principal coordinates are functions of the original variables, mediated through the similarity or distance function used and explaining the largest percentage of the total variance.

Outlier

An object that is atypical (different from the average) of the rest of the objects in a data set is deemed an outlier. A chemical may be an outlier with respect to the response variable (Y) and/or with respect to the independent variables. Thus, to make a decision regarding the inclusion of a particular chemical in a model, two aspects have to be accounted for: whether or not that chemical is an outlier, and the influence or weight that the chemical has on the results.

Regarding the first aspect, since normality of the residuals is assumed in any regression equation, the two (or three) times standard deviation rule can be used to identify a potential outlier: the Standard Deviation Error in Calculation (SDEC) is multiplied by 2 (or 3) to obtain the bounds within which all of the residuals should lie. Therefore, if a particular residual lies outside these bounds, it is deemed to be an outlier.
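The two (or three) times standard deviation rule can be illustrated with a short Python sketch. It assumes that SDEC is the square root of the residual sum of squares divided by the number of objects, as defined in the fitness regression parameters; the function name `flag_outliers` and the toy values are hypothetical and are not data from this report.

```python
from math import sqrt


def flag_outliers(observed, calculated, k=2.0):
    """Return (indices of residuals outside +/- k * SDEC, SDEC)."""
    residuals = [y - y_hat for y, y_hat in zip(observed, calculated)]
    # SDEC = sqrt(RSS / n), computed on the fitted (training) objects.
    sdec = sqrt(sum(e * e for e in residuals) / len(residuals))
    bound = k * sdec
    flagged = [i for i, e in enumerate(residuals) if abs(e) > bound]
    return flagged, sdec


# Hypothetical toy responses: the last object is fitted much worse
# than the others.
toy_observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
toy_calculated = [0.9, 2.1, 2.9, 4.1, 4.9, 6.1, 6.9, 7.0]
toy_flagged, toy_sdec = flag_outliers(toy_observed, toy_calculated)
```

In this toy case only the last object exceeds the two-SDEC bound, and none exceeds the three-SDEC bound (`k=3.0`), which shows how strongly the choice of multiplier affects what is flagged.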
If a residual is close to three standard deviations, the chemical should be examined further. Once an observation has been established as an outlier, another decision must be made, namely whether or not it can be retained in the equation. If the outlier is due to miscoding, the user simply makes the correction and proceeds from there. However, if the observation is atypical, the concept of leverage and/or influence enters the analysis.

Predictive regression parameters

This group of regression parameters is devoted to evaluating the goodness of prediction, i.e. the capability of the model to estimate future (test) data, providing a measure of how well the regression model estimates the response variable given a set of values for the predictor variables. These quantities are obtained using validation techniques and are also used as criteria for model selection.

The most important regression parameters are listed below:

• Predictive Residual Sum of Squares, PRESS. The sum of squared differences between the observed response and the response estimated by validation techniques:

$$PRESS = \sum_{i=1}^{n} (y_i - \hat{y}_{i/i})^2$$


where $\hat{y}_{i/i}$ denotes the response of the i-th object estimated by a model obtained without using the i-th object. Model selection based on validation techniques aims to minimize this quantity.
• Cross-validated R², R²cv (or Q²). The explained variance in prediction:
$$R^2_{cv} = Q^2 = 1 - \frac{\mathrm{PRESS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_{i/i} \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$
where PRESS is the predictive residual sum of squares and TSS the total sum of squares.
• External Q². The explained variance in prediction:
$$Q^2_{ext} = 1 - \frac{\sum_{i=1}^{n_{ext}} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n_{ext}} \left( y_i - \bar{y} \right)^2}$$
where the sums run over the test set objects ($n_{ext}$) and $\bar{y}$ is the average value of the training set responses.
• Standard Deviation Error of Prediction, SDEP, also known as standard error in prediction (SEP or PSE). A function of the predictive residual sum of squares, defined as:
$$\mathrm{SDEP} = \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_{i/i} \right)^2}{n}} = \sqrt{\frac{\mathrm{PRESS}}{n}}$$
• External Standard Deviation Error of Prediction, SDEP_ext. A function of the predictive residual sum of squares, defined as:
$$\mathrm{SDEP}_{ext} = \sqrt{\frac{\sum_{i=1}^{n_{ext}} \left( y_i - \hat{y}_i \right)^2}{n_{ext}}}$$
where the sum runs over the test set objects ($n_{ext}$).

Principal component analysis (PCA)
Principal component analysis is a statistical technique for exploratory data analysis, modelling the p variables in the data matrix X (n x p), where n is the number of objects, as linear combinations of the common factors T (n x M), called principal components $t_m$:
$$X = T \cdot L^{T}$$
where T is the score matrix, L (p x M) is the loading matrix and M is the number of significant principal components (M ≤ p). The columns of the loading matrix are the eigenvectors $l_m$; the eigenvector coefficients $l_{jm}$, called loadings, represent the importance of each original variable in the considered eigenvector. The components are calculated according to the maximum variance criterion, i.e. each successive component is an orthogonal linear combination of the original variables such that it covers the maximum of the variance not accounted for by the previous components. The eigenvalue $\lambda_m$ associated with the m-th component represents the variance explained by that component.
The principal components can also be viewed as linear combinations of the p original variables.
The main advantages of principal components are that:
1) each component is orthogonal to all the remaining components, i.e. the information carried by this component is unique;
2) each component represents a macrovariable of the data;
3) components associated with the lowest eigenvalues do not usually contain useful information (noise, spurious information, etc.).
When PCA is performed on a set of compounds characterized by molecular descriptors (physico-chemical properties, structural variables, etc.)
the significant principal components are called principal properties (PP) because they summarize the main information of the original molecular descriptors.
Principal component analysis is often used to identify groups of inter-related variables, to reduce the number of variables, and to discover extreme cases on one variable, or on a combination of variables, which have a strong influence on the calculation of statistics (outlier detection).

Validation techniques
Validation techniques constitute a fundamental tool for the assessment of the validity of models obtained by multivariate regression methods. They are used to check the predictive power of the models, i.e. to give a measure of their capability to perform reliable predictions of the modelled response for new cases where the response is unknown.
A necessary condition for the validity of a regression model is that the multiple correlation coefficient R² is as close as possible to one and the standard error of the estimate s is small. However, this condition (fitting ability) is not sufficient for model validity, because the fit becomes closer (smaller s and larger R²) as the number of parameters and variables in the model increases. Moreover, these parameters are not related to the capability of the model to make reliable predictions on future data.
Other problems for the validity of the models arise when models, often with only a few variables, are obtained by using procedures based on variable selection [Allen, D.M. 1971].
In fact, when a set with a large number of descriptors to select from is available, simple models can be found with apparently good fitting properties due to chance
correlation, i.e. collinearity without predictive ability [Topliss, J.G. and Edwards, R.P. 1979; Wold, S. et al. 1983; Clark, M. and Cramer, R.D. III 1993].
To avoid models with chance correlation, a check with different validation procedures must be adopted.
The more common statistical techniques proposed to simulate the predictive ability of a model are the following:
• leave-one-out
• bootstrap
• Y-scrambling
• external validation

Y-Scrambling
This validation technique is adopted to check models with chance correlation, i.e. models where the independent variables are randomly correlated to the response variables. The test is performed by calculating the quality of the model (usually R² or, better, Q²) after randomly modifying the sequence of the response vector y, i.e. by assigning to each object a response randomly selected from the true responses. Each scrambling is characterised in terms of the correlation of the scrambled response with the unperturbed data ($R^2_{yy'}$). If the original model has no chance correlation, there is a significant difference between the quality of the original model and that of a model obtained with random responses. The procedure is repeated several hundred times.
Once the model validation has been performed, the Y-scrambling parameters a(R²) and a(Q²) are calculated as the intercepts of the equations:
$$R^2 = b_0 + b\,R^2_{yy'}, \qquad b_0 = a(R^2)$$
$$Q^2 = b_0 + b\,Q^2_{yy'}, \qquad b_0 = a(Q^2)$$
Models which are unstable (that is, which change greatly with small changes in the underlying response values) are characterized by a high intercept value. Stable models (that is, which change proportionally with small changes in the underlying data) have a low intercept value.

References
Allen, D.M. (1971).
Mean square error of prediction as a criterion for selecting variables. Technometrics 13, 469-475.
Clark, M. and Cramer, R.D. III (1993). The probability of chance correlation using partial least squares (PLS). Quant. Struct.-Act. Relat. 12, 137-145.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. Society for Industrial and Applied Mathematics, Philadelphia (PA).
Efron, B. (1987). Better bootstrap confidence intervals (with discussion). Journal of the American Statistical Association 82, 171-200.
Kubinyi, H. (1994). Variable selection in QSAR studies. I. An evolutionary algorithm. Quant. Struct.-Act. Relat. 13, 285-294.
Netzeva, T., Worth, A., Aldenberg, T., Benigni, R., Cronin, M., Gramatica, P., Jaworska, J., Kahn, S., Klopman, G., Marchant, C., Myatt, G., Nikolova-Jeliazkova, N., Patlewicz, G., Perkins, R., Roberts, D., Schultz, T.W., Stanton, D., van de Sandt, J., Tong, W., Veith, G., Yang, C. (2005). Current Status of Methods for Defining the Applicability Domain of (Quantitative) Structure-Activity Relationships. ECVAM Workshop 52, ATLA 33, 1-19.
Topliss, J.G. and Edwards, R.P. (1979). Chance Factors in Studies of Quantitative Structure-Activity Relationships. J. Med. Chem. 22, 1238-1244.
Wold, S., Albano, C., Dunn, W.J. III, Esbensen, K., Hellberg, S., Johansson, E. (1983). Pattern recognition: finding and using regularities in multivariate data. In: Food Research and Data Analysis (Martens, H., Russwurm, H., eds). Essex, UK: Applied Science Publishers, 147-188.
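
As a worked illustration of the predictive regression parameters defined in this glossary (PRESS, Q² and SDEP), the following minimal sketch computes them by leave-one-out for a one-descriptor least-squares model. The function names and the data values are hypothetical and chosen only for illustration; the report does not prescribe an implementation.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def loo_statistics(x, y):
    """Leave-one-out PRESS, Q2 and SDEP for a one-descriptor linear model."""
    n = len(y)
    press = 0.0
    for i in range(n):
        # Refit the model without the i-th object, then predict that object.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    return {"PRESS": press, "Q2": 1.0 - press / tss, "SDEP": math.sqrt(press / n)}

# Hypothetical descriptor and response values (illustrative only).
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
stats = loo_statistics(x, y)
print(stats)
```

Each object is predicted by a model that never saw it, so Q² computed this way reflects predictive ability rather than fit; the external variants (Q²ext, SDEP_ext) would use a separate test set in place of the leave-one-out loop.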


Mission of the JRC
The mission of the JRC is to provide customer-driven scientific and technical support for the conception, development, implementation and monitoring of EU policies. As a service of the European Commission, the JRC functions as a reference centre of science and technology for the Union. Close to the policy-making process, it serves the common interest of the Member States, while being independent of special interests, whether private or national.
