15.11.2012 Views

Tutorial Modeling Corticosteroid Binding ... - Molecular Networks

Tutorial

Modeling Corticosteroid Binding Globulin Receptor Activity with ADRIANA

Molecular Networks GmbH Computerchemie

July 2008

http://www.molecular-networks.com


Henkestr. 91<br />

91052 Erlangen<br />

Germany<br />

Phone: +49-9131-815668<br />

Fax: +49-9131-815669<br />

Email: info@molecular-networks.com<br />

WWW: www.molecular-networks.com<br />

This document is copyright © 2007 by Molecular Networks GmbH Computerchemie. All rights reserved. Except as permitted under the terms of the Software Licensing Agreement of Molecular Networks GmbH Computerchemie, no part of this publication may be reproduced or distributed in any form or by any means or stored in a database retrieval system without the prior written permission of Molecular Networks GmbH Computerchemie.

The software described in this document is furnished under a license and may be used and copied only in accordance with the terms of such license.

ADRIANA is a registered trademark in the Federal Republic of Germany. Other product names and company names may be trademarks or registered trademarks of their respective owners, in the Federal Republic of Germany and other countries. All rights reserved.

(Document version: CHS/LT-1.1-2008-07-31)


Contents

Introduction and Objective
The Dataset
Calculating Molecular Descriptors with ADRIANA.Code
  Step 1: Start ADRIANA.Code, Load Structure File and Set Output File Options
  Step 2: Select and Calculate the Molecular Descriptors
  Step 3: Calculate a Descriptor File with Experimental pK Values
Classification of Compounds According to their Biological Activity with SONNIA
  Step 1: Start SONNIA, Load the Descriptor and the Structure File
  Step 2: Create and Train a Kohonen Neural Network
  Step 3: Create a Kohonen Map
  Step 4: Analyze a Kohonen Map
Quantitative Modeling of Biological Activities with SONNIA
  Step 1: Start SONNIA, Load the Descriptor and the Structure File
  Step 2: Create and Train a Counterpropagation Neural Network
  Step 3: Visualize the Trained Counterpropagation Network
  Step 4: Write and Analyze the Prediction File
Tips and Tricks
  Preprocessing Data Files
  Training Parameters of a Neural Network
  Assessing the Quality of an Unsupervised Classification
Problems and Help!
References


Introduction and Objective<br />


Statistical or machine learning methods are widely used to establish relationships<br />

between biological activities, physical or chemical properties of a compound and its<br />

chemical structure. These methods, in combination with structure descriptors, are used<br />

to derive models that can be applied to predict properties of new compounds.<br />

The objective of this tutorial is to show with a simple example how the methods<br />

contained in the software bundle ADRIANA that consists of the tools<br />

• descriptor calculation package ADRIANA.Code [1], and<br />

• neural network package SONNIA [2]<br />

can be applied in the area of qualitative and quantitative structure-activity relationship<br />

(QSAR) studies. The tutorial guides the user through the entire workflow starting from a<br />

dataset of chemical structures with experimentally derived biological activities and<br />

describes<br />

• how to calculate molecular descriptors for a dataset of compounds with<br />

ADRIANA.Code,<br />

• how to classify compounds according to their biological activity with a Kohonen<br />

neural network implemented in SONNIA, and<br />

• how to quantitatively model a biological activity using the counterpropagation neural<br />

network implemented in SONNIA.<br />

In addition, the tutorial gives some hints, tips and tricks that are valuable and helpful<br />

when ADRIANA.Code and SONNIA are applied to other datasets and in QSAR<br />

studies.<br />

For further information about the usage as well as the methods that are implemented in<br />

the program packages ADRIANA.Code and SONNIA, please refer to the respective<br />

program manuals.<br />

The example "Modeling Corticosteroid Binding Globulin (CBG) Receptor Activity" is taken from the literature [3]. The dataset comprises 31 steroid compounds and their experimental CBG receptor binding affinity values (pK values). Based on the pK values, the compounds were pre-classified into three classes: high, medium and low CBG binding affinity. In the example study, each molecule of the dataset is represented by a vector of 12 autocorrelation coefficients that encode the spatial distribution of the electrostatic potential on the molecular surface (calculated by ADRIANA.Code). These descriptors are then used to classify the compounds according to the three different CBG activity classes using an unsupervised Kohonen neural network technique (implemented in SONNIA). Finally, a supervised neural network method (counterpropagation neural network implemented in SONNIA) is used to quantitatively model the pK values.

The dataset of 31 steroid compounds can be downloaded from Molecular Networks' web server at http://www.molecular-networks.com.


The Dataset<br />


The dataset of 31 steroid compounds and their CBG receptor binding affinity values is stored in MDL SDFile format [4]. All chemical structures are fully defined, including hydrogen atoms and stereo information (atom parity flags). For each record, the experimentally determined biological activity (pK value) is contained in the SDF data field CBG_ACTIVITY_pK. Furthermore, the compounds are pre-classified into three different affinity classes:

• high affinity (class 1)

• medium affinity (class 2)

• low affinity (class 3)

The binding affinity class is stored in the SDF data field CBG_ACTIVITY_CLASS.

Figure 1 shows the structures of the dataset sorted by their CBG receptor binding affinity class.

Figure 1 Dataset of 31 steroid compounds, shown in three panels: high affinity (class 1), medium affinity (class 2) and low affinity (class 3).


Calculating Molecular Descriptors with ADRIANA.Code

In the following sections, the calculation of a set of molecular descriptors with<br />

ADRIANA.Code is described. A descriptor file will be generated that represents each<br />

molecule of the dataset by a vector of 12 autocorrelation coefficients encoding the<br />

spatial distribution of the electrostatic potential on the molecular surface.<br />
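The idea behind such an autocorrelation vector can be sketched in a few lines of Python. This is only an illustration of the general technique, not the ADRIANA.Code implementation: the function name, the sampled surface points, and the distance range (1 to 13 in the coordinate units, split into 12 equal intervals) are assumptions made for the example. For each distance interval, the products of the property values of all point pairs whose distance falls into the interval are averaged.

```python
import math
from itertools import combinations


def surface_autocorrelation(points, values, d_min=1.0, d_max=13.0, n_bins=12):
    """Return n_bins autocorrelation coefficients for sampled surface points.

    points: list of (x, y, z) tuples sampled on the molecular surface.
    values: property value (e.g., electrostatic potential) at each point.
    The distance range and the number of bins are illustrative assumptions.
    """
    width = (d_max - d_min) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        d = math.dist(p, q)
        if d < d_min or d >= d_max:
            continue  # pair lies outside the considered distance range
        k = min(int((d - d_min) / width), n_bins - 1)
        sums[k] += values[i] * values[j]
        counts[k] += 1
    # Average the property products per interval; empty intervals give 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

The result is a fixed-length vector regardless of the size of the molecule, which is what makes descriptors of this kind suitable as input for a neural network.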

Step 1: Start ADRIANA.Code, Load Structure File and Set Output File Options

• Start the graphical user interface (GUI) of ADRIANA.Code by double-clicking on<br />

the desktop icon of ADRIANA.Code.<br />

• Load the structure file steroids31_act.sdf by clicking on the button ... in the<br />

section Input of the ADRIANA.Code GUI and selecting the file in the dialog box<br />

Choose a structure file to open (see Figure 2).<br />

Figure 2 Loading a chemical structure file.<br />

• Set the output file format in the drop down menu Format in the section Output to<br />

SONNIA.<br />

• Click on the button ... in the section Output and set the name of the output file to<br />

steroids31_actClass_mep_ac12.dat in the same directory where the input<br />

file is located in the dialog box Choose an output file to write to.



• Note: The full name of the output file (file name and path) is set automatically but<br />

can be changed by the user either in the field File in the section Output or by using<br />

the dialog box as described above.<br />

• Click on the button Select properties in the section Output, choose NAME in the<br />

drop down menu Compound ID property, check the box CBG_ACTIVITY_CLASS<br />

in the list Select properties to copy and confirm with the button OK (see Figure 3).<br />

Figure 3 Selecting the properties of the output file.<br />

Step 2: Select and Calculate the <strong>Molecular</strong> Descriptors<br />

• Select Autocorrelation of <strong>Molecular</strong> Surface Properties → molecular<br />

electrostatic potential (SurfACorr_ESP) in the list Available in the section<br />

Descriptors and press the button > to select the descriptor for calculation.<br />

• SurfACorr_ESP now appears in the list Selected. Use the default settings and<br />

parameters in the section Available Control Parameters (see Figure 4).<br />


Figure 4 Selecting the descriptor.<br />

• Press the button Calculate.<br />

• Note: ADRIANA.Code now calculates for each compound a vector of 12<br />

autocorrelation coefficients that encode the spatial distribution of the electrostatic<br />

potential on the molecular surface.<br />

• After the descriptor calculation is finished a dialog box appears. Press the button<br />

View output file to display the output file in a table formatted view. The first 12<br />

columns contain the 12 autocorrelation coefficients. The last two columns contain<br />

the affinity class (CBG_ACTIVITY_CLASS, 1 = high affinity; 2 = medium affinity; 3 =<br />

low affinity) and the name of the compound with a leading "!" which SONNIA<br />

interprets as the compound name (see Figure 5).<br />

Figure 5 Viewing the output file.
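For processing such a descriptor file outside of SONNIA, a line can be split along the layout described above: 12 coefficients, one property value, and the compound name introduced by "!". The following Python sketch assumes whitespace-separated columns and no header lines, which is not confirmed by this tutorial. The function name is hypothetical.

```python
def parse_descriptor_line(line, n_descriptors=12):
    """Split one data line into (descriptors, property value, compound name).

    Assumes whitespace-separated numeric columns followed by '!name'.
    """
    head, marker, name = line.partition("!")
    if not marker:
        raise ValueError("no compound name marker '!' found")
    fields = head.split()
    if len(fields) != n_descriptors + 1:
        raise ValueError("expected %d numeric columns, found %d"
                         % (n_descriptors + 1, len(fields)))
    descriptors = [float(x) for x in fields[:n_descriptors]]
    prop = float(fields[n_descriptors])  # e.g., the affinity class (1, 2 or 3)
    return descriptors, prop, name.strip()
```

Splitting at the first "!" rather than at whitespace keeps compound names that contain blanks in one piece.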



Step 3: Calculate a Descriptor File with Experimental pK Values

• Change the name of the output file to steroids31_actpK_mep_ac12.dat.<br />

• Select CBG_ACTIVITY_pK in the dialog box Select properties instead of<br />

CBG_ACTIVITY_CLASS and confirm with the button OK (see Figure 6).<br />

Figure 6 Selecting the properties of the output file.<br />

• Calculate the descriptors by pressing the button Calculate.<br />



Classification of Compounds According to their Biological Activity with SONNIA<br />


The following section describes the classification of the steroid compounds according<br />

to their CBG binding affinity class (CBG_ACTIVITY_CLASS) using the Kohonen neural<br />

network algorithm implemented in SONNIA. The Kohonen algorithm is an<br />

unsupervised, non-linear mapping technique that projects the twelve-dimensional<br />

descriptor space (12 autocorrelation coefficients) into a two-dimensional plane<br />

(Kohonen map). The information about the CBG affinity class is not used for the<br />

projection (unsupervised learning). The neurons of the resulting Kohonen map are<br />

color-coded according to the CBG receptor binding affinity class (high, medium or low)<br />

of the compounds that are assigned to a specific neuron.<br />
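The unsupervised mapping described above can be illustrated with a very small self-organizing map in plain Python. This is a generic sketch of the Kohonen technique, not SONNIA's implementation: the learning rate, the number of cycles, the linear decay and the neighborhood function are all assumptions chosen for brevity. Note that the class labels never enter the training.

```python
import random


def toroidal_distance(a, b, width, height):
    """Grid distance between neurons a and b, given as (col, row), with wrap-around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return max(min(dx, width - dx), min(dy, height - dy))


def winner(weights, vector):
    """Grid position of the neuron whose weights are closest to the input vector."""
    return min(weights, key=lambda pos: sum((w - v) ** 2
                                            for w, v in zip(weights[pos], vector)))


def train_kohonen(data, width=5, height=3, cycles=200, rate=0.5, seed=0):
    """Train a small toroidal Kohonen map; returns {(col, row): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(c, r): [rng.uniform(-1, 1) for _ in range(dim)]
               for c in range(width) for r in range(height)}
    radius0 = max(width, height) / 2
    for cycle in range(cycles):
        frac = 1.0 - cycle / cycles  # learning rate and radius shrink over time
        vector = rng.choice(data)
        win = winner(weights, vector)
        for pos, w in weights.items():
            d = toroidal_distance(pos, win, width, height)
            if d <= radius0 * frac:
                # Neurons closer to the winner are adapted more strongly.
                step = rate * frac * (1.0 - d / (radius0 * frac + 1.0))
                for k in range(dim):
                    w[k] += step * (vector[k] - w[k])
    return weights
```

After training, calling `winner(weights, descriptor_vector)` for each compound gives the neuron it is mapped into; only then are the class labels used, namely to color the neurons.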

Step 1: Start SONNIA, Load the Descriptor and the Structure File<br />


• Start the graphical user interface (GUI) of SONNIA by double-clicking the desktop<br />

icon of SONNIA.<br />

• Select Read ... in the menu File in the main menu bar. The dialog box SONNIA<br />

Read appears (see Figure 7).<br />

Figure 7 Loading the descriptor and structure file into SONNIA.<br />

• Select in the list Directory the directory where the structure and descriptor files are<br />

located and select Data File in the drop down menu Object.<br />

• Select the file steroids31_actClass_mep_ac12.dat and press the button OK.



• In order to load the structure file, repeat this procedure, but select Structure File in<br />

the drop down menu Object and select the file steroids31_act.sdf.<br />

Step 2: Create and Train a Kohonen Neural Network<br />

• Select Create ... in the menu Network in the main menu bar. The dialog box<br />

SONNIA Network appears (see Figure 8).<br />

Figure 8 Creating a Kohonen neural network.<br />

• Ensure that Kohonen is set in the drop-down menu in the section Algorithm and<br />

Topology is set to toroidal.<br />

• The size of the network is set automatically by SONNIA. In this example, the network (plane) has a size of 5 (width) x 3 (height) = 15 neurons.

• Enter the number 12 in the field Input in the section Network Dimensions (this is the number of descriptors of each molecule). Use the default settings for all other parameters and press the button Create.

• Select Train ... in the menu Network in the main menu bar. The dialog box SONNIA<br />

Training appears (see Figure 9).<br />


Figure 9 Setting the training parameters for a Kohonen neural network.<br />

• Use the default settings for all parameters (see Figure 9) and press the button<br />

Train. The window SONNIA Monitor appears which shows the changes of the<br />

dynamic error (distance between input vectors and neuron weights) with the number<br />

of training cycles (see Figure 10).<br />

Figure 10 Training a Kohonen neural network.<br />

• The training is finished when the button Stop in the window SONNIA Monitor changes to OK (see also Figure 10).


Step 3: Create a Kohonen Map<br />


• Select Palette Editor ... in the menu Maps in the main menu bar. The dialog box<br />

SONNIA Palette Editor appears.<br />

• Choose 3 in the drop down menu Colors (this is the number of classes) and 1<br />

(default, this is the position of the affinity class in the input vector) in the field Output<br />

(see Figure 11 left). Confirm the settings by pressing the button Apply.<br />

Figure 11 Setting the number and type of used colors for the Kohonen map.<br />

• Note: The default colors can be changed by clicking on a color in the section<br />

Palette of the dialog box SONNIA Palette Editor. The dialog box SONNIA Color<br />

Editor appears (see Figure 11 right). The color can now be changed by using the<br />

sliders or by entering color values for Red, Green and Blue. Confirm by pressing<br />

the button Apply.<br />

• Select Selected Maps in the menu Maps in the main menu bar. The Kohonen maps<br />

are generated and displayed (see Figure 12). Each colored square in the map<br />

corresponds to one neuron.<br />

• Note: By default, two Kohonen maps are generated. The first map is color-coded by<br />

the most frequent pattern that has been mapped into a neuron. In this example, this<br />

is the most frequent CBG binding affinity class. For instance, if two compounds with<br />

high and one compound with medium affinity were mapped into one single neuron<br />

the neuron gets color-coded with the color for high affinity (class 1, red). The second<br />

map additionally shows all neurons that contain compounds of at least two different<br />

classes (collision or conflict neurons). These neurons are marked in black color (see<br />

Figure 12, right map).<br />

• Note: The number and type of default maps can be changed by selecting Selected<br />

Maps ... in the menu Maps in the main menu bar. By default, the map types most<br />

frequent output and average output (conflicts) are checked (selected). Check<br />

further map types to add them to the default maps which are generated when<br />

selecting Selected Maps in the menu Maps in the main menu bar.<br />
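The two default colorings described in the notes above amount to a simple bookkeeping step once it is known which compounds fall into which neuron. The following Python sketch shows that step under the assumption that the assignments are available as a dictionary. The function name and the data layout are hypothetical, and how SONNIA resolves ties between equally frequent classes is not stated in this tutorial.

```python
from collections import Counter


def color_neurons(assignments):
    """Derive the most frequent class per neuron and the set of conflict neurons.

    assignments: dict mapping a neuron (e.g., (col, row)) to the list of
    class labels of the compounds mapped into it.
    """
    most_frequent = {}
    conflicts = set()
    for neuron, classes in assignments.items():
        if not classes:
            continue  # empty neurons stay uncolored
        counts = Counter(classes)
        most_frequent[neuron] = counts.most_common(1)[0][0]
        if len(counts) > 1:
            conflicts.add(neuron)  # compounds of at least two different classes
    return most_frequent, conflicts
```

For the case mentioned in the text, a neuron holding two high-affinity compounds and one medium-affinity compound (`[1, 1, 2]`) is colored as class 1 and is also reported as a conflict neuron.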


Figure 12 Visualizing the Kohonen map colored by the most frequent activity class in each neuron (left) and with marked collision neurons (right; at least two molecules of different classes in the same neuron).

• Note: The generated Kohonen maps have a toroidal geometry (see also Figure 26<br />

on page 24). Therefore, each neuron in the map has the same number of neighbors<br />

(8), also the neurons at the edges. By clicking on the map and holding the left<br />

mouse button, the maps can be shifted in x and y direction. Note that only the<br />

selected map is shifted. All other maps remain unchanged.<br />

• Right-click on a map and select Tile ... in the context menu. The window SONNIA<br />

Tiling appears (see Figure 13). Due to the toroidal geometry of the maps they can<br />

be tiled. Tile more maps by changing the size of the window SONNIA Tiling with the<br />

mouse.<br />

• Note: Tiled maps often better visualize the result of the Kohonen mapping and help<br />

to better assess the quality of the classification.



Figure 13 Tiling of a Kohonen map.<br />
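The statement that every neuron of a toroidal map has eight neighbors, including those at the edges, follows from wrapping the grid coordinates around in both directions. A minimal Python sketch (the function name is hypothetical):

```python
def toroidal_neighbors(col, row, width, height):
    """Return the 8 neighbors of neuron (col, row) on a toroidal grid."""
    neighbors = []
    for dc in (-1, 0, 1):
        for dr in (-1, 0, 1):
            if dc == 0 and dr == 0:
                continue  # skip the neuron itself
            # The modulo operation wraps around at the edges of the map.
            neighbors.append(((col + dc) % width, (row + dr) % height))
    return neighbors
```

On the 5 x 3 map of this example, the corner neuron (0, 0) has, among others, the neuron (4, 2) from the opposite corner as a neighbor. The same wrap-around is the reason why the map can be shifted and tiled without breaking neighborhood relations. For maps with fewer than three rows or columns, some of the eight positions would coincide.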

Step 4: Analyze a Kohonen Map<br />

• In order to visualize which compounds were mapped into which neurons, left-click on a neuron while keeping the Ctrl key pressed. The neuron is now selected and is marked in light-gray color.

• Right-click on the selected neuron and select Export Structures ... in the context<br />

menu. The Structure Browser appears and displays the compounds that have<br />

been mapped into the selected neurons (see Figure 14).<br />

Figure 14 Displaying the chemical structures that are mapped to a specific<br />

neuron.<br />


• Note: The structure file must have been loaded into SONNIA (see also Figure 7) to<br />

use this functionality.<br />

• Note: More than one neuron can be selected by a left-click on the map while keeping the Ctrl key pressed and dragging the mouse over the map. The focus of the selection is shown by a temporary rectangle while dragging the mouse. All selected neurons are finally marked in light-gray color.

• Note: Neurons can be de-selected by left-clicking on the neuron while keeping the Ctrl and the Shift key pressed.

• Note: Additional properties that are stored in the structure file (e.g., compound<br />

names, CBG affinity classes) can be displayed in the Structure Browser by<br />

selecting Chemical Properties ... in the menu Display of the main menu bar of the<br />

structure browser (Prop tabs in the Browser Annotation Display Style).<br />

• Right-click on a map and select Export Centroids ... in the context menu. The<br />

Structure Browser appears. The browser now displays the centroid compounds of<br />

all neurons (see Figure 15). The arrangement of the structure browser always<br />

reproduces the size of the network (here: 5 x 3).<br />

• Note: The centroid compound of a neuron is the compound whose descriptor vector (twelve dimensions) is most similar to the weight vector of the neuron (also twelve dimensions). Among all compounds that have been mapped to this neuron, the descriptor vector of the centroid compound has the minimum Euclidean distance to the vector of the neuron weights.

Figure 15 Displaying the centroid structures of all neurons.<br />

• In order to export the contents of all neurons (i.e., the information which compounds<br />

are mapped into which neurons), select Export Contents ... in the menu Analyze in<br />

the main menu bar. The dialog box SONNIA Write appears (see Figure 16).



Figure 16 Exporting the contents of all neurons.<br />

• Select a directory in the list Directory and select CSV File (Contents Maps) in the<br />

drop down menu Object. Enter a file name, e.g., steroids31_contentMap.csv,<br />

in the field Files and confirm with the button OK.<br />

• Note: The ASCII csv file (csv: comma separated values) can be displayed with a<br />

standard ASCII file browser or loaded into spreadsheet programs (e.g., Microsoft<br />

Excel). Figure 17 shows the content of the csv file (displayed in Microsoft WordPad).<br />

Figure 17 Displaying a contents maps file (csv).<br />
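The centroid compound defined in the note of this step can be determined with a single minimum search over the compounds of a neuron. The Python sketch below assumes that the neuron weights and the descriptor vectors are available as plain lists. The function name and the data layout are hypothetical.

```python
import math


def centroid_compound(neuron_weights, compounds):
    """Name of the compound whose descriptor vector is closest to the neuron weights.

    compounds: dict mapping the names of the compounds mapped into this
    neuron to their descriptor vectors.
    """
    return min(compounds,
               key=lambda name: math.dist(compounds[name], neuron_weights))
```

For example, with neuron weights `[0.0, 0.0]` and the compounds `{"a": [1.0, 1.0], "b": [0.1, 0.2]}`, compound "b" is the centroid because its Euclidean distance to the weights is the smallest.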



Quantitative Modeling of Biological Activities with SONNIA

The following section describes the quantitative modeling of the CBG receptor binding<br />

affinity (CBG_ACTIVITY_pK) of the 31 steroid compounds using the<br />

counterpropagation neural network algorithm implemented in SONNIA. Again, each<br />

compound of the dataset is represented by a twelve-dimensional autocorrelation vector<br />

that encodes the spatial distribution of the electrostatic potential on the molecular<br />

surface. The counterpropagation algorithm is a supervised learning technique. In<br />

contrast to the Kohonen algorithm, the pK values of the CBG receptor binding affinity<br />

are now used to derive a model expressing the relationship between the descriptors<br />

(independent variables) and the biological activity (dependent variable).<br />
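The principle of a counterpropagation network can be sketched as follows: each neuron carries input weights (here 12 values) and output weights (here one value, the pK value). The winning neuron is determined from the input weights only, both parts are adapted during training, and a prediction is read from the output weight of the winning neuron. The Python code below is a strongly simplified illustration, not SONNIA's implementation. It adapts only the winning neuron and omits the neighborhood update, and all parameter values and function names are assumptions.

```python
import random


def nearest(neurons, x):
    """Index of the neuron whose input weights are closest to x."""
    return min(range(len(neurons)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(neurons[i][0], x)))


def train_counterprop(xs, ys, n_neurons=15, cycles=300, rate=0.5, seed=0):
    """Train on descriptor vectors xs and property values ys (winner-only update)."""
    rng = random.Random(seed)
    dim = len(xs[0])
    # Each neuron is a pair (input weights, output weights).
    neurons = [([rng.uniform(-1, 1) for _ in range(dim)], [rng.choice(ys)])
               for _ in range(n_neurons)]
    for cycle in range(cycles):
        step = rate * (1.0 - cycle / cycles)  # learning rate decays linearly
        i = rng.randrange(len(xs))
        w_in, w_out = neurons[nearest(neurons, xs[i])]  # winner from inputs only
        for k in range(dim):
            w_in[k] += step * (xs[i][k] - w_in[k])
        w_out[0] += step * (ys[i] - w_out[0])  # supervised part of the update
    return neurons


def predict(neurons, x):
    """Predicted property: output weight of the neuron that wins for x."""
    return neurons[nearest(neurons, x)][1][0]
```

Because the output weights are moved step by step towards observed values, predictions of this sketch always lie within the range of the training values.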

Step 1: Start SONNIA, Load the Descriptor and the Structure File<br />

• Start the graphical user interface (GUI) of SONNIA by double-clicking the desktop<br />

icon.<br />

• Select Read ... in the menu File in the main menu bar. The dialog box SONNIA<br />

Read appears (see also Figure 7).<br />

• Select in the list Directory the directory where the structure and descriptor files are<br />

located and select Data File in the drop down menu Object.<br />

• Select the file steroids31_actpK_mep_ac12.dat and press the button OK.<br />

• In order to load the structure file, repeat this procedure, but select Structure File in<br />

the drop down menu Object and select the file steroids31_act.sdf.<br />

Step 2: Create and Train a Counterpropagation Neural Network<br />

• Select Create ... in the menu Network in the main menu bar. The dialog box<br />

SONNIA Network appears (see Figure 18).


Figure 18 Creating a counterpropagation network.<br />


• Select Counterprop. in the drop down menu of the section Algorithm. Ensure that<br />

Topology is set to toroidal.<br />

• Enter the number 12 (dimension of descriptor vector) in the field Input and 1 in the<br />

field Output (dimension of the property to model, single value of CBG binding<br />

affinity) in the section Network Dimensions. Use the default settings for all other<br />

parameters and press the button Create.<br />

• Select Train ... in the menu Network in the main menu bar. The dialog box SONNIA<br />

Training appears (see also Figure 9).<br />

• Use the default settings for all parameters and press the button Train. The window<br />

SONNIA Monitor appears which shows the changes of the dynamic error (distance<br />

between input vectors and neuron weights) with the number of training cycles (see<br />

also Figure 10).<br />

• The training is finished when the button Stop in the window SONNIA Monitor<br />

changes to OK (see also Figure 10).<br />

Step 3: Visualize the Trained Counterpropagation Network<br />

• Note: The trained counterpropagation network can be visualized in a style similar to<br />

a Kohonen map. In this example, a continuous value (pK value) is modeled which<br />

ranges from about -7.8 to -5.0. The number of colors that are available in SONNIA<br />

is limited to 10. Therefore, only ranges of the predicted values can be color-coded<br />

by a single color.<br />


• Select Palette Editor ... in the menu Maps in the main menu bar. The dialog box<br />

SONNIA Palette Editor appears.<br />

• Choose 3 in the drop-down menu Colors and 13 (= 12 autocorrelation coefficients + 1 activity value) in the field Output (see Figure 19; the 13th column in the input data file steroids31_actpK_mep_ac12.dat is the pK value).

• Note: The entire range of the pK values from about -7.8 to -5.0 is now represented<br />

by three colors in equidistant ranges, i.e., red: pK values from -7.8 to -6.9; yellow:<br />

pK values from about -6.8 to -5.9; green: pK values from -5.8 to -5.0.<br />

• Confirm the settings by pressing the button Apply.<br />

Figure 19 Setting the number and type of colors for displaying the map of<br />

the counterpropagation network.<br />

• Select Selected Maps in the menu Maps in the main menu bar. The two default<br />

maps are generated and displayed (see Figure 20).<br />

Figure 20 Displaying the maps of the counterpropagation network.
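The equidistant color ranges mentioned in this step can be reproduced with a short calculation: the range from -7.8 to -5.0 is divided into three intervals of equal width (about 0.93 pK units each). The Python sketch below uses the range and colors given above, while the function name and the clamping of values outside the range are assumptions.

```python
def color_for_value(value, lo=-7.8, hi=-5.0, colors=("red", "yellow", "green")):
    """Map a pK value to one of the equidistant color ranges."""
    width = (hi - lo) / len(colors)
    index = int((value - lo) / width)
    # Clamp so that the upper bound and out-of-range values get a valid color.
    index = max(0, min(index, len(colors) - 1))
    return colors[index]
```

With these settings, a pK value of -7.5 is shown in red, -6.5 in yellow and -5.2 in green, in agreement with the ranges listed in the note.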


Step 4: Write and Analyze the Prediction File<br />


• In order to write out the predicted pK values by the counterpropagation network<br />

select Write in the menu File in the main menu bar. The dialog box SONNIA Write<br />

appears (see Figure 21).<br />

• Select Prediction File in the drop down menu Object and enter a file name in the<br />

field Files (e.g., steroids31.prd).<br />

• Confirm with the button OK. The dialog box Prediction appears and suggests in the<br />

field Input Dimensionality the figure 12 (see Figure 21; number of descriptors of<br />

each compound). Confirm with the button Apply.<br />

Figure 21 Writing the prediction file.<br />

• Note: The prediction file steroids31.prd is an ASCII file which lists the input Y<br />

variable(s) (experimental pK values), the predicted Y variable(s) (predicted pK<br />

values) and the name of the compound. The file can be loaded in spreadsheet<br />

applications or standard ASCII text browser for further analysis (see Figure 22 and<br />

Figure 23).<br />


Figure 22 Loading a prediction file into a spreadsheet application (here: MS<br />

Excel).<br />

Figure 23 Analyzing the prediction of SONNIA (here: MS Excel).
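Instead of a spreadsheet, the experimental and predicted pK values can also be compared with a few lines of Python. The sketch below computes two common quality measures, the root mean square error and the squared correlation coefficient. It assumes that the two value columns have already been read from the prediction file into lists, since the exact file layout is not specified here. The function name is hypothetical.

```python
import math


def prediction_stats(experimental, predicted):
    """Return (RMSE, squared Pearson correlation) of predicted vs. experimental."""
    n = len(experimental)
    rmse = math.sqrt(sum((p - e) ** 2
                         for e, p in zip(experimental, predicted)) / n)
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p)
              for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_e == 0 or var_p == 0:
        return rmse, float("nan")  # correlation undefined for constant values
    r = cov / math.sqrt(var_e * var_p)
    return rmse, r * r
```

Both measures are needed: predictions that are shifted by a constant offset still have a squared correlation of 1, but the offset shows up in the RMSE.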


Tips and Tricks<br />

Preprocessing Data Files<br />

Merging Structure and Property Data<br />


Often, the chemical structure data is stored in an MDL SDFile whereas any additional<br />

information related to the chemical structures (e.g., any measured or experimental<br />

data) is stored in a separate file, e.g., in a table-like formatted ASCII file. A primary key<br />

(e.g., a unique name or number of the chemical structures) that is present in both the<br />

SD and the ASCII file is the only link between the structure and the additional data.<br />

In order to merge chemical structure and additional data into a single SDFile, Molecular Networks' tool MN.MERGE (www.molecular-networks.com/software/split_join_merge/) can be used. Figure 24 shows a part of an SDFile (left) and an ASCII file (right) that contains some experimental values (Exp1, Exp2), a categorical value (Class1) and the compound name (CpdName) organized as a table. The primary key is given in the column CpdName; in the corresponding SDFile it can be present either in the name field (see Figure 24) or in a data field. The command line of MN.MERGE to merge the files is:

mn.merge -tablefile tablefile.txt -tablekey CpdName -outfile outfile_merged.sdf infile.sdf

SDFile (left panel, excerpt):

    compound_1
    CS 02280711002D 0
    Molecular Networks 28.02.2007
    54 58 0 0 0 0 0 0 0 0999 V2000
    2.8729 -2.0044 0.0000 C 0 0 0
    2.8920 -0.9195 0.0000 C 0 0 0
    ...
    25 33 1 0 0 0 0
    26 34 1 0 0 0 0
    M END
    $$$$
    ...

ASCII table file (right panel, excerpt):

    Exp1    Exp2  Class1  CpdName
    -6.279  71.5  2       compound_1
    -5.316  63.4  3       compound_2
    -5.334  69.7  3       compound_3
    -5.763  65.7  3       compound_4
    ...
    -5.613  79.5  3       compound_30
    -7.881  69.0  1       compound_31

Figure 24 Merging SDFiles and data files with MN.MERGE.

The resulting SDFile outfile_merged.sdf will contain the values of Exp1, Exp2 and<br />

Class1 in the SDF data fields "", "", "" and "".<br />

Note: Any data field that is already present in the input SDFile is written to the output.<br />
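What such a merge does can be illustrated with a short Python sketch. This is not how MN.MERGE is implemented. It only shows the principle of joining the two files on the primary key, assuming that the key is the first line (name field) of each SD record and that the table is whitespace-separated with a header row. The function name is hypothetical, and leaving the key column out of the data fields is a choice made for this example.

```python
def merge_sdf_with_table(sdf_text, table_text, key_column):
    """Append table values as SD data items to the records with a matching name."""
    rows = [line.split() for line in table_text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    key_index = header.index(key_column)
    table = {row[key_index]: row for row in body}

    merged, record = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":  # end of one SD record
            name = record[0].strip() if record else ""
            for column, value in zip(header, table.get(name, [])):
                if column != key_column:
                    # SD data item: header line, value line, blank line.
                    record += ["> <%s>" % column, value, ""]
            merged += record + ["$$$$"]
            record = []
        else:
            record.append(line)
    return "\n".join(merged) + "\n"
```

Records without a matching table row are written unchanged, and all lines already present in a record, including existing data items, are kept.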


Standardization and Checking Structural State and Integrity of Structure Files<br />

Chemical structure files may originate from different sources. Therefore, the chemical<br />

structures may differ in the way they are coded in their connection table representation<br />

or even show some errors. For instance, functional groups such as nitro groups may be<br />

coded with a pentavalent nitrogen atom or as a charged species, hydrogen atoms may<br />

be given implicitly or explicitly or charges in salts may not be balanced correctly.<br />

However, for a corporate compound database or a dataset under investigation it may be mandatory that all chemical structures and their connection table representation comply with a certain standard, i.e., are coded in a consistent and pre-defined fashion.

Molecular Networks' tool MN.CHECK (www.molecular-networks.com/software/check/) can be helpful to standardize chemical structure data by applying a set of business rules that can be selected by the user. MN.CHECK supports batch mode execution and is able to process large chemical files quickly and efficiently. Furthermore, MN.CHECK can be used to detect and correct errors in the structure coding (e.g., missing charges at counter ions in salts) and to identify and remove duplicate structures in large collections of chemical compounds (based on a 64-bit hash coding technique).

For example, the MN.CHECK command line<br />

mn.check -hydrogen add -nitrostyle ionic -chargebalance -pedantic -unique -outfile outfile_checked.sdf infile.sdf

will read in the file infile.sdf, add implicit hydrogen atoms, re-code all nitro groups<br />

(and similar functional units) as charge pairs (with a tetravalent, positively charged<br />

nitrogen atom, and a negatively charged oxygen atom or another ligand atom), balance<br />

charges in salts, pedantically check the file formatting and structure coding and write<br />

out a message when errors are detected, identify and remove duplicate structures and<br />

write out the normalized and checked structures to the file outfile_checked.sdf.<br />

Complementary Software<br />

Another helpful and valuable tool in this area is Molecular Networks' file format<br />

converter MN.CONVERT, which supports over 50 different file formats for chemical<br />

structure and reaction information and interconverts them with high conversion rates<br />

and reliability. A complete list of all supported file formats can be found on the product<br />

page of MN.CONVERT at www.molecular-networks.com/software/convert/.<br />

2D structure diagrams (2D coordinates) of publication quality can be generated with<br />

Molecular Networks' tool MN.2DCOOR. The tool offers a variety of options and<br />

features to customize the layout of 2D structure plots. For instance, structures can be<br />

aligned to their main x or y axes or to a template structure provided in a separate file<br />

(e.g., to align all structures in a combinatorial library to a predefined orientation of their<br />

common scaffold). Further information about MN.2DCOOR can be found on its product<br />

page at www.molecular-networks.com/software/2dcoor.


Training Parameters of a Neural Network<br />

Network Size<br />

Tips and Tricks<br />

By default, SONNIA suggests a ratio of approximately one neuron per two<br />

compounds/patterns (1:2), which usually works well for initial tests. Another possibility is<br />

to start with a ratio of 1:1 and to gradually reduce the network size in subsequent runs. If the<br />

network is too large, it is likely to merely memorize the input data instead of<br />

revealing the neighborhood relationships between the data patterns, which show up, for example,<br />

as conflict neurons, i.e., neurons that contain patterns of more than one class<br />

(e.g., known actives and unknowns).<br />

Smaller networks (many patterns per neuron) tend to produce more conflict neurons,<br />

which might be of interest for some applications, e.g., for lead hopping. However, in networks<br />

that are too small the data has to be compressed into only a few neurons. This may lead to<br />

conflict neurons that are not very meaningful. A balanced ratio should be aimed for.<br />
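The relation between the number of patterns and the network size can be sketched in a few lines of Python. The function names and the 4:3 aspect ratio are assumptions made for illustration only; SONNIA's own suggestion may be derived differently.<br />

```python
import math


def suggest_map_size(n_patterns, patterns_per_neuron=2.0, aspect=4 / 3):
    """Suggest (width, height) with width * height close to
    n_patterns / patterns_per_neuron.

    patterns_per_neuron=2.0 corresponds to the 1:2 ratio mentioned in
    the text; the aspect ratio (width / height) is an assumption.
    """
    target = max(1, round(n_patterns / patterns_per_neuron))
    height = max(1, round(math.sqrt(target / aspect)))
    width = max(1, math.ceil(target / height))
    return width, height


def load_per_neuron(n_patterns, width, height):
    """Average number of patterns per neuron for a given map size."""
    return n_patterns / (width * height)


# 100 patterns at a 1:2 ratio call for about 50 neurons.
width, height = suggest_map_size(100)

# A map of 80 x 60 neurons holding 404,449 compounds is far more compressed.
large_map_load = load_per_neuron(404449, 80, 60)
```

Starting from a 1:1 ratio, as suggested above, corresponds to calling `suggest_map_size` with `patterns_per_neuron=1.0` and increasing that value in later runs.<br />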

Another example of a network with many patterns per neuron is the visualization of large<br />

chemical spaces. Figure 25 shows the projection of about 404,000 chemical<br />

compounds from different sources into a Kohonen map of 80 x 60 neurons.<br />

Figure 25 Visualization of large chemical spaces with SONNIA. Map details: 404,449 compounds (chemical supplier databases: 139,961; NCI database: 193,339; MDDR: 71,149) projected onto 4,800 neurons (80 x 60), of which 4,799 are occupied. Color coding: most frequent pattern in neuron, scaled.<br />

Network Topology<br />

SONNIA offers two different types of network topology, a toroidal and a rectangular<br />

topology.<br />

Toroidal topology. All neurons have the same neighbor relationship, i.e., eight direct<br />

neighbors. This means that in the resulting Kohonen map the neurons at the corners<br />

and edges are adjacent to the neurons on the opposite side of the map. This can be<br />

illustrated by a torus that is cut twice to obtain a plane (see Figure 26).<br />


Figure 26 Toroidal topology of a Kohonen neural network.<br />

Rectangular topology. The neurons at the corners and the edges form the boundary<br />

of the network. Therefore, a neuron at a corner of the network has only three neurons<br />

as direct neighbors, an edge neuron has five, and all other neurons have eight<br />

neighbors.<br />
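The difference between the two topologies can be made concrete by listing the direct neighbors of a neuron. The following Python sketch is only an illustration; the function name and the zero-based (x, y) indexing are assumptions, and the toroidal case assumes a map of at least 3 x 3 neurons so that all eight neighbors are distinct.<br />

```python
def direct_neighbors(x, y, width, height, toroidal):
    """Return the direct neighbors of neuron (x, y) as a list of (x, y) tuples.

    In a toroidal map, indices wrap around at the borders; in a
    rectangular map, positions outside the map are simply dropped.
    """
    neighbors = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue  # skip the neuron itself
            nx, ny = x + dx, y + dy
            if toroidal:
                neighbors.append((nx % width, ny % height))
            elif 0 <= nx < width and 0 <= ny < height:
                neighbors.append((nx, ny))
    return neighbors


# Neighbor counts in a rectangular map of 5 x 4 neurons.
corner = len(direct_neighbors(0, 0, 5, 4, toroidal=False))
edge = len(direct_neighbors(2, 0, 5, 4, toroidal=False))
inner = len(direct_neighbors(2, 2, 5, 4, toroidal=False))

# In a toroidal map the corner neuron also touches the opposite corner.
wrapped = direct_neighbors(0, 0, 5, 4, toroidal=True)
```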

Rectangular topologies are better suited for classification purposes since, e.g., "outliers"<br />

tend to be pushed toward the edges and corners.<br />

Toroidal topologies are better if the data under investigation represents a "closed"<br />

system, e.g., if a molecular surface and its property is mapped into a two-dimensional<br />

plane by a Kohonen network.<br />

Training and Learning Parameters<br />

SONNIA (Network window, see Figure 9) makes some reasonable suggestions for the<br />

number of training cycles (epochs) and intervals, i.e., how often the data set is<br />

presented to the network before the weights of the neurons are adapted to the input<br />

data. Furthermore, the initial spans and steps (the distance in x and y direction in the<br />

network to which the weights of the neurons are adapted to a central/winning neuron;<br />

this distance is gradually reduced during the training) are set automatically according to<br />

the size of the network.<br />

Reasonable new training parameters for span and step can be calculated as follows<br />

(see Figure 27).<br />

\[
\mathrm{Span}(x) = \frac{\mathrm{Width}}{2}, \qquad \mathrm{Step}(x) = \frac{\mathrm{Span}(x)}{\mathrm{Epochs}}
\]

\[
\mathrm{Span}(y) = \frac{\mathrm{Height}}{2}, \qquad \mathrm{Step}(y) = \frac{\mathrm{Span}(y)}{\mathrm{Epochs}}
\]

Figure 27 Calculation of training parameters for a neural network.<br />
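As a worked example, the span and step values can be computed with a short Python function. The function name and the number of epochs used below are hypothetical; only the formulas themselves are taken from Figure 27.<br />

```python
def training_parameters(width, height, epochs):
    """Compute span and step in x and y direction for a Kohonen network.

    Span is half of the network dimension; step is the span divided by
    the number of epochs, so that the span shrinks to zero over the
    course of the training.
    """
    span_x = width / 2
    span_y = height / 2
    return {
        "span_x": span_x,
        "step_x": span_x / epochs,
        "span_y": span_y,
        "step_y": span_y / epochs,
    }


# Example: a network of 80 x 60 neurons trained for an assumed 100 epochs.
params = training_parameters(width=80, height=60, epochs=100)
```

For this example the spans are 40 and 30 neurons, and they are reduced by 0.4 and 0.3 neurons per epoch, respectively.<br />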

Learning rates (Rate in SONNIA Training window, see Figure 9) of about 0.5 are<br />

recommended. In general, it's preferable to train longer (i.e., higher number of epochs)



but with lower learning rates. High learning rates may cause problems if several input<br />

patterns compete for one neuron.<br />

The rate factor (Rate Factor in SONNIA Training window, see Figure 9) reduces the<br />

learning rate after each epoch by multiplying the learning rate by the rate factor. At<br />

the beginning of the training<br />
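The effect of the rate factor can be sketched as a simple schedule in which the learning rate of each epoch is the previous rate multiplied by the rate factor. The function name and the rate factor of 0.95 are assumptions chosen for illustration; the initial rate of 0.5 is the value recommended above.<br />

```python
def learning_rates(initial_rate, rate_factor, epochs):
    """Return the learning rate used in each epoch.

    After every epoch the current rate is multiplied by rate_factor,
    i.e., the rate in epoch k is initial_rate * rate_factor ** k.
    """
    rates = []
    rate = initial_rate
    for _ in range(epochs):
        rates.append(rate)
        rate *= rate_factor
    return rates


# Initial rate 0.5 with an assumed rate factor of 0.95 over 50 epochs.
schedule = learning_rates(initial_rate=0.5, rate_factor=0.95, epochs=50)
```

A rate factor close to 1 lowers the learning rate slowly, which fits the advice to train longer with lower learning rates.<br />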

In general, Kohonen (or SOM) mapping is quite powerful since it makes a high-dimensional<br />

space accessible to quick visual inspection and allows for a rapid assessment<br />

of whether the descriptors used are able to reveal trends and patterns in the<br />

data.<br />

Assessing the Quality of an Unsupervised Classification<br />

Basically, there are three different criteria which can be used to assess the quality of a<br />

classification done by a Kohonen mapping. These three criteria, visual inspection,<br />

occupancy and number of collisions (conflict neurons) are described in the following.<br />

Note that all three criteria should be taken into account to support the decision whether<br />

a generated Kohonen map shows a "good" classification.<br />
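The number of collisions can be counted directly once each pattern has been assigned to its winning neuron. The following Python sketch assumes that these assignments are available as pairs of neuron position and class label; the function name and the input layout are hypothetical and do not reflect a SONNIA file format.<br />

```python
from collections import defaultdict


def conflict_neurons(assignments):
    """Return the neurons that hold patterns of more than one class.

    assignments: iterable of ((x, y), class_label) pairs, one per pattern.
    """
    classes = defaultdict(set)
    for neuron, label in assignments:
        classes[neuron].add(label)
    return sorted(n for n, labels in classes.items() if len(labels) > 1)


# Illustrative assignments: neuron (1, 0) receives patterns of two classes.
assignments = [
    ((0, 0), "active"),
    ((0, 0), "active"),
    ((1, 0), "active"),
    ((1, 0), "inactive"),
    ((2, 1), "inactive"),
]
conflicts = conflict_neurons(assignments)
```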

Visual Inspection<br />

The strength of Kohonen maps is that they can be generated rather quickly and the<br />

results can be visually inspected. The visual inspection allows for a rapid assessment<br />

of whether the descriptors used are able to reveal trends and patterns in the data<br />

("... human inspection building on the powerful pattern recognition capabilities of the<br />

human mind") [7].<br />

A Kohonen map that shows a clear separation of different classes of compounds in a<br />

dataset can be regarded as an indicator that there is a relationship between the used<br />

descriptor(s) and the property under investigation.<br />

Occupancy<br />

A well-trained Kohonen network should also show a balanced and even distribution of<br />

the patterns (i.e., compounds) over the resulting map as well as a low fraction of<br />

unoccupied neurons (shown as white squares in the map). The distribution of the<br />

patterns and the occupancy of each individual neuron can be checked with an<br />

"occupancy map" (menu Maps in the main menu bar of SONNIA, see Figure 28, right<br />

map). The occupancy map is color-coded by the number of patterns/compounds that<br />

are assigned to each neuron.<br />


Figure 28 Occupancy map of a Kohonen neural network (right map).<br />

A Kohonen map with an unbalanced occupancy of the neurons (e.g., more than<br />

half of the input patterns are located in only 10% of the total number of neurons) may<br />

have several causes, e.g.,<br />

• The training of the network was stopped too early: Train a newly created network<br />

and increase the number of Epochs (adjust the values for Step(x) and Step(y)<br />

accordingly).<br />

• The input values of one or a few input patterns are rather different from the rest of<br />

the input patterns of the dataset ("outliers"): remove these patterns from your<br />

training set and train a newly created network with the reduced dataset.
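The occupancy criterion can also be checked numerically. The following Python sketch computes the fraction of unoccupied neurons and tests the example condition given above (more than half of the patterns in only 10% of the neurons). The function name and the input, a list with the winning neuron of each pattern, are assumptions; the list must not be empty.<br />

```python
from collections import Counter


def occupancy_summary(winners, width, height, top_fraction=0.10):
    """Summarize the occupancy of a Kohonen map.

    winners: non-empty list with the (x, y) position of the winning
             neuron of each pattern.
    Returns the fraction of unoccupied neurons, the share of patterns
    held by the most occupied top_fraction of neurons, and a flag that
    is True if this share exceeds one half.
    """
    counts = Counter(winners)
    n_neurons = width * height
    unoccupied = n_neurons - len(counts)
    top_n = max(1, round(n_neurons * top_fraction))
    top_share = sum(c for _, c in counts.most_common(top_n)) / len(winners)
    return {
        "unoccupied_fraction": unoccupied / n_neurons,
        "top_share": top_share,
        "unbalanced": top_share > 0.5,
    }


# Illustrative map of 5 x 4 neurons in which two neurons hold most patterns.
winners = [(0, 0)] * 6 + [(1, 1)] * 3 + [(2, 2), (3, 3)]
summary = occupancy_summary(winners, width=5, height=4)
```

In this example two of the 20 neurons hold nine of the eleven patterns, so the map is flagged as unbalanced, and 16 neurons remain unoccupied.<br />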


Problems and Help!<br />


If there are any difficulties with the installation of ADRIANA.Code or SONNIA or if any<br />

problems occur while running ADRIANA.Code or SONNIA please send all inquiries to<br />

the following address:<br />

Molecular Networks GmbH Computerchemie<br />

Henkestr. 91<br />

91052 Erlangen<br />

Germany<br />

or contact us by email support@molecular-networks.com<br />

or by Fax +49-9131-815669<br />



References<br />


[1] Descriptor Calculation Package ADRIANA.Code, developed and distributed by<br />

Molecular Networks GmbH, Erlangen, Germany (http://www.molecular-networks.com).<br />

[2] Neural Networks Package SONNIA, developed and distributed by Molecular Networks<br />

GmbH, Erlangen, Germany (http://www.molecular-networks.com).<br />

[3] Wagener, M.; Sadowski, J.; Gasteiger, J. Autocorrelation of Molecular Surface<br />

Properties for Modeling Corticosteroid Binding Globulin and Cytosolic Ah Receptor<br />

Activity by Neural Networks. J. Am. Chem. Soc. 1995, 117, 7769-7775.<br />

[4] a) Dalby, A.; Nourse, J. G.; Hounshell, W. D.; Gushurst, A. K. I.; Grier, D. L.; Leland, B.<br />

A.; Laufer, J. Description of Several Chemical Structure File Formats Used by<br />

Computer Programs Developed at Molecular Design Limited. J. Chem. Inf. Comput.<br />

Sci. 1992, 32, 244-255. b) A detailed description of MDL file formats (Mol, SDF and<br />

RDF) is available for download as a PDF document at http://www.mdli.com.<br />

[5] Sadowski, J.; Gasteiger, J.; Klebe, G. Comparison of Automatic Three-Dimensional<br />

Model Builders Using 639 X-Ray Structures. J. Chem. Inf. Comput. Sci. 1994, 34,<br />

1000-1008.<br />

[6] 3D Structure Generator CORINA, developed and distributed by Molecular Networks<br />

GmbH, Erlangen, Germany (http://www.molecular-networks.com).<br />

[7] Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug Design. Second<br />

Edition, Wiley-VCH, Weinheim, 1999, 380 pages, ISBN 3-527-29778-2.<br />
