Tutorial Modeling Corticosteroid Binding ... - Molecular Networks
Tutorial<br />
Modeling Corticosteroid Binding Globulin Receptor<br />
Activity with ADRIANA<br />
Molecular Networks GmbH Computerchemie<br />
July 2008<br />
http://www.molecular-networks.com
Henkestr. 91<br />
91052 Erlangen<br />
Germany<br />
Phone: +49-9131-815668<br />
Fax: +49-9131-815669<br />
Email: info@molecular-networks.com<br />
WWW: www.molecular-networks.com<br />
This document is copyright © 2007 by Molecular Networks GmbH Computerchemie. All rights<br />
reserved. Except as permitted under the terms of the Software Licensing Agreement of Molecular<br />
Networks GmbH Computerchemie, no part of this publication may be reproduced or distributed in<br />
any form or by any means or stored in a database retrieval system without the prior written<br />
permission of Molecular Networks GmbH Computerchemie.<br />
The software described in this document is furnished under a license and may be used and copied<br />
only in accordance with the terms of such license.<br />
ADRIANA is a registered trademark in the Federal Republic of Germany. Other product names<br />
and company names may be trademarks or registered trademarks of their respective owners, in<br />
the Federal Republic of Germany and other countries. All rights reserved.<br />
(Document version: CHS/LT-1.1-2008-07-31)
Contents<br />
Introduction and Objective<br />
The Dataset<br />
Calculating Molecular Descriptors with ADRIANA.Code<br />
Step 1: Start ADRIANA.Code, Load Structure File and Set Output File Options<br />
Step 2: Select and Calculate the Molecular Descriptors<br />
Step 3: Calculate a Descriptor File with Experimental pK Values<br />
Classification of Compounds According to their Biological Activity with SONNIA<br />
Step 1: Start SONNIA, Load the Descriptor and the Structure File<br />
Step 2: Create and Train a Kohonen Neural Network<br />
Step 3: Create a Kohonen Map<br />
Step 4: Analyze a Kohonen Map<br />
Quantitative Modeling of Biological Activities with SONNIA<br />
Step 1: Start SONNIA, Load the Descriptor and the Structure File<br />
Step 2: Create and Train a Counterpropagation Neural Network<br />
Step 3: Visualize the Trained Counterpropagation Network<br />
Step 4: Write and Analyze the Prediction File<br />
Tips and Tricks<br />
Preprocessing Data Files<br />
Training Parameters of a Neural Network<br />
Assessing the Quality of an Unsupervised Classification<br />
Problems and Help!<br />
References
Introduction and Objective<br />
Statistical or machine learning methods are widely used to establish relationships<br />
between biological activities, physical or chemical properties of a compound and its<br />
chemical structure. These methods, in combination with structure descriptors, are used<br />
to derive models that can be applied to predict properties of new compounds.<br />
The objective of this tutorial is to show, with a simple example, how the tools<br />
of the software bundle ADRIANA, namely<br />
• the descriptor calculation package ADRIANA.Code [1], and<br />
• the neural network package SONNIA [2],<br />
can be applied in qualitative and quantitative structure-activity relationship<br />
(QSAR) studies. The tutorial guides the user through the entire workflow, starting from a<br />
dataset of chemical structures with experimentally derived biological activities, and<br />
describes<br />
• how to calculate molecular descriptors for a dataset of compounds with<br />
ADRIANA.Code,<br />
• how to classify compounds according to their biological activity with a Kohonen<br />
neural network implemented in SONNIA, and<br />
• how to quantitatively model a biological activity using the counterpropagation neural<br />
network implemented in SONNIA.<br />
In addition, the tutorial gives some hints, tips and tricks that are valuable and helpful<br />
when ADRIANA.Code and SONNIA are applied to other datasets and in QSAR<br />
studies.<br />
For further information about the usage as well as the methods that are implemented in<br />
the program packages ADRIANA.Code and SONNIA, please refer to the respective<br />
program manuals.<br />
The example "Modeling Corticosteroid Binding Globulin (CBG) Receptor Activity" is<br />
taken from the literature [3]. The dataset comprises 31 steroid compounds and their<br />
experimental CBG receptor binding affinity values (pK values). Based on the pK<br />
values, the compounds were pre-classified into three classes: high,<br />
medium and low CBG binding affinity. In the example study, each molecule of the<br />
dataset is represented by a vector of 12 autocorrelation coefficients that encode the<br />
spatial distribution of the electrostatic potential on the molecular surface (calculated by<br />
ADRIANA.Code). These descriptors are then used to classify the compounds<br />
according to the three different CBG activity classes using an unsupervised Kohonen<br />
neural network technique (implemented in SONNIA). Finally, a supervised neural<br />
network method (counterpropagation neural network implemented in SONNIA) is used<br />
to quantitatively model the pK values.<br />
The dataset of 31 steroid compounds can be downloaded from Molecular Networks'<br />
web server at http://www.molecular-networks.com.<br />
The Dataset<br />
The 31 steroid compounds and their CBG receptor binding affinity values are<br />
stored in MDL SDFile format [4]. All chemical structures are fully defined, including<br />
hydrogen atoms and stereo information (atom parity flags). For each record, the<br />
experimentally determined biological activity (pK value) is contained in the SDF data<br />
field CBG_ACTIVITY_pK. Furthermore, the compounds are pre-classified into three<br />
different affinity classes:<br />
• high affinity (class 1)<br />
• medium affinity (class 2)<br />
• low affinity (class 3)<br />
The binding affinity class is stored in the SDF data field CBG_ACTIVITY_CLASS.<br />
Figure 1 shows the structures of the dataset sorted by their CBG receptor binding<br />
affinity class.<br />
Figure 1 Dataset of 31 steroid compounds, grouped into high affinity (class 1), medium affinity (class 2) and low affinity (class 3)<br />
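Note: The two SDF data fields can also be inspected outside of the ADRIANA tools. The following Python sketch reads the data fields of each record of an SDFile. It is a minimal illustration only: it assumes the usual SD layout (a data header line such as ">  <FIELD_NAME>", followed by the value and a blank line, with "$$$$" closing each record), and the two example records with empty molecule blocks and their values are invented for demonstration.

```python
def read_sd_data_fields(sdf_text):
    """Return one dict of SD data fields per record of an SDFile."""
    records = []
    for block in sdf_text.split("$$$$"):
        if not block.strip():
            continue  # skip the empty remainder after the last delimiter
        fields = {}
        lines = block.splitlines()
        i = 0
        while i < len(lines):
            line = lines[i]
            # A data header looks like ">  <FIELD_NAME>".
            if line.startswith(">") and "<" in line:
                name = line[line.index("<") + 1 : line.rindex(">")]
                i += 1
                values = []
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    values.append(lines[i].strip())
                    i += 1
                fields[name] = "\n".join(values)
            i += 1
        records.append(fields)
    return records


# Invented two-record example (empty molecule blocks, made-up values).
EXAMPLE_SDF = """steroid_a
  example

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <CBG_ACTIVITY_pK>
-7.5

>  <CBG_ACTIVITY_CLASS>
1

$$$$
steroid_b
  example

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <CBG_ACTIVITY_pK>
-5.5

>  <CBG_ACTIVITY_CLASS>
3

$$$$
"""

records = read_sd_data_fields(EXAMPLE_SDF)
high_affinity = [r for r in records if r["CBG_ACTIVITY_CLASS"] == "1"]
```

All values are returned as strings; converting the pK value with float() is left to the caller.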
Calculating Molecular Descriptors with ADRIANA.Code<br />
In the following sections, the calculation of a set of molecular descriptors with<br />
ADRIANA.Code is described. A descriptor file will be generated that represents each<br />
molecule of the dataset by a vector of 12 autocorrelation coefficients encoding the<br />
spatial distribution of the electrostatic potential on the molecular surface.<br />
Step 1: Start ADRIANA.Code, Load Structure File and Set Output File<br />
Options<br />
• Start the graphical user interface (GUI) of ADRIANA.Code by double-clicking on<br />
the desktop icon of ADRIANA.Code.<br />
• Load the structure file steroids31_act.sdf by clicking on the button ... in the<br />
section Input of the ADRIANA.Code GUI and selecting the file in the dialog box<br />
Choose a structure file to open (see Figure 2).<br />
Figure 2 Loading a chemical structure file.<br />
• Set the output file format in the drop down menu Format in the section Output to<br />
SONNIA.<br />
• Click on the button ... in the section Output and set the name of the output file to<br />
steroids31_actClass_mep_ac12.dat in the same directory where the input<br />
file is located in the dialog box Choose an output file to write to.
• Note: The full name of the output file (file name and path) is set automatically but<br />
can be changed by the user either in the field File in the section Output or by using<br />
the dialog box as described above.<br />
• Click on the button Select properties in the section Output, choose NAME in the<br />
drop down menu Compound ID property, check the box CBG_ACTIVITY_CLASS<br />
in the list Select properties to copy and confirm with the button OK (see Figure 3).<br />
Figure 3 Selecting the properties of the output file.<br />
Step 2: Select and Calculate the Molecular Descriptors<br />
• Select Autocorrelation of Molecular Surface Properties → molecular<br />
electrostatic potential (SurfACorr_ESP) in the list Available in the section<br />
Descriptors and press the button > to select the descriptor for calculation.<br />
• SurfACorr_ESP now appears in the list Selected. Use the default settings and<br />
parameters in the section Available Control Parameters (see Figure 4).<br />
Figure 4 Selecting the descriptor.<br />
• Press the button Calculate.<br />
• Note: ADRIANA.Code now calculates for each compound a vector of 12<br />
autocorrelation coefficients that encode the spatial distribution of the electrostatic<br />
potential on the molecular surface.<br />
• After the descriptor calculation is finished a dialog box appears. Press the button<br />
View output file to display the output file in a table formatted view. The first 12<br />
columns contain the 12 autocorrelation coefficients. The last two columns contain<br />
the affinity class (CBG_ACTIVITY_CLASS, 1 = high affinity; 2 = medium affinity; 3 =<br />
low affinity) and the name of the compound with a leading "!" which SONNIA<br />
interprets as the compound name (see Figure 5).<br />
Figure 5 Viewing the output file.
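Note: The idea behind a distance-binned autocorrelation vector can be sketched in a few lines of Python. The sketch below is not the ADRIANA.Code implementation. The distance range (1 to 13), the number of intervals (12), the averaging over the number of point pairs per interval, and the three sample points are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def surface_autocorrelation(points, values, d_min=1.0, d_max=13.0, n_bins=12):
    """Distance-binned autocorrelation of a property sampled at surface points.

    For every pair of points whose distance falls into an interval, the
    product of the two property values is accumulated; each interval is then
    averaged over the number of pairs it received.
    """
    width = (d_max - d_min) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (p, vp), (q, vq) in combinations(list(zip(points, values)), 2):
        d = math.dist(p, q)
        if d_min <= d < d_max:
            k = min(int((d - d_min) / width), n_bins - 1)
            sums[k] += vp * vq
            counts[k] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Three invented surface points with invented potential values.
example_points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.5, 0.0)]
example_values = [1.0, -2.0, 0.5]
example_vector = surface_autocorrelation(example_points, example_values)
```

The result always has a fixed length (here 12), independent of the size of the molecule, which is what makes such vectors suitable as input for a neural network.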
Step 3: Calculate a Descriptor File with Experimental pK Values<br />
• Change the name of the output file to steroids31_actpK_mep_ac12.dat.<br />
• Select CBG_ACTIVITY_pK in the dialog box Select properties instead of<br />
CBG_ACTIVITY_CLASS and confirm with the button OK (see Figure 6).<br />
Figure 6 Selecting the properties of the output file.<br />
• Calculate the descriptors by pressing the button Calculate.<br />
Classification of Compounds According to their Biological Activity with SONNIA<br />
The following section describes the classification of the steroid compounds according<br />
to their CBG binding affinity class (CBG_ACTIVITY_CLASS) using the Kohonen neural<br />
network algorithm implemented in SONNIA. The Kohonen algorithm is an<br />
unsupervised, non-linear mapping technique that projects the twelve-dimensional<br />
descriptor space (12 autocorrelation coefficients) into a two-dimensional plane<br />
(Kohonen map). The information about the CBG affinity class is not used for the<br />
projection (unsupervised learning). The neurons of the resulting Kohonen map are<br />
color-coded according to the CBG receptor binding affinity class (high, medium or low)<br />
of the compounds that are assigned to a specific neuron.<br />
Step 1: Start SONNIA, Load the Descriptor and the Structure File<br />
• Start the graphical user interface (GUI) of SONNIA by double-clicking the desktop<br />
icon of SONNIA.<br />
• Select Read ... in the menu File in the main menu bar. The dialog box SONNIA<br />
Read appears (see Figure 7).<br />
Figure 7 Loading the descriptor and structure file into SONNIA.<br />
• Select in the list Directory the directory where the structure and descriptor files are<br />
located and select Data File in the drop down menu Object.<br />
• Select the file steroids31_actClass_mep_ac12.dat and press the button OK.
• In order to load the structure file, repeat this procedure, but select Structure File in<br />
the drop down menu Object and select the file steroids31_act.sdf.<br />
Step 2: Create and Train a Kohonen Neural Network<br />
• Select Create ... in the menu Network in the main menu bar. The dialog box<br />
SONNIA Network appears (see Figure 8).<br />
Figure 8 Creating a Kohonen neural network.<br />
• Ensure that Kohonen is set in the drop-down menu in the section Algorithm and<br />
Topology is set to toroidal.<br />
• The size of the network is set automatically by SONNIA. In this example, the<br />
network has a size of 5 (width) x 3 (height) = 15 neurons.<br />
• Enter the number 12 in the field Input in the section Network Dimensions (this is<br />
the number of descriptors of each molecule). Use the default settings for all other<br />
parameters and press the button Create.<br />
• Select Train ... in the menu Network in the main menu bar. The dialog box SONNIA<br />
Training appears (see Figure 9).<br />
Figure 9 Setting the training parameters for a Kohonen neural network.<br />
• Use the default settings for all parameters (see Figure 9) and press the button<br />
Train. The window SONNIA Monitor appears which shows the changes of the<br />
dynamic error (distance between input vectors and neuron weights) with the number<br />
of training cycles (see Figure 10).<br />
Figure 10 Training a Kohonen neural network.<br />
• The training is finished when the button Stop in the window SONNIA Monitor changes<br />
to OK (see also Figure 10).
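Note: For readers who want to see what such a training does in principle, the following Python sketch trains a small toroidal Kohonen map of 5 x 3 neurons with 12 input dimensions. It is a generic textbook-style sketch and not the SONNIA implementation. The linear decay of learning rate and radius, the triangular neighborhood function, the number of cycles and the random example data are assumptions.

```python
import random


def toroidal_distance(a, b, width, height):
    """Grid distance between neurons a and b on a map whose edges wrap around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return max(min(dx, width - dx), min(dy, height - dy))


def winner(weights, x):
    """Grid position of the neuron whose weights are closest to x."""
    return min(weights, key=lambda pos: sum((w - v) ** 2 for w, v in zip(weights[pos], x)))


def train_kohonen(data, width=5, height=3, cycles=50, rate0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(width) for j in range(height)}
    radius0 = max(width, height) / 2
    for cycle in range(cycles):
        frac = 1.0 - cycle / cycles  # decays linearly towards 0
        rate, radius = rate0 * frac, radius0 * frac
        for x in data:
            win = winner(weights, x)  # class information is never used
            for pos, w in weights.items():
                d = toroidal_distance(win, pos, width, height)
                if d <= radius:
                    h = rate * (1.0 - d / (radius + 1.0))
                    for k in range(dim):
                        w[k] += h * (x[k] - w[k])
    return weights


# Random stand-in data: 31 patterns with 12 descriptors each.
_rng = random.Random(1)
example_data = [[_rng.random() for _ in range(12)] for _ in range(31)]
example_weights = train_kohonen(example_data)
```

After training, winner(example_weights, x) returns the neuron into which a pattern x is mapped.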
Step 3: Create a Kohonen Map<br />
• Select Palette Editor ... in the menu Maps in the main menu bar. The dialog box<br />
SONNIA Palette Editor appears.<br />
• Choose 3 in the drop down menu Colors (this is the number of classes) and 1<br />
(default, this is the position of the affinity class in the input vector) in the field Output<br />
(see Figure 11 left). Confirm the settings by pressing the button Apply.<br />
Figure 11 Setting the number and type of used colors for the Kohonen map.<br />
• Note: The default colors can be changed by clicking on a color in the section<br />
Palette of the dialog box SONNIA Palette Editor. The dialog box SONNIA Color<br />
Editor appears (see Figure 11 right). The color can now be changed by using the<br />
sliders or by entering color values for Red, Green and Blue. Confirm by pressing<br />
the button Apply.<br />
• Select Selected Maps in the menu Maps in the main menu bar. The Kohonen maps<br />
are generated and displayed (see Figure 12). Each colored square in the map<br />
corresponds to one neuron.<br />
• Note: By default, two Kohonen maps are generated. The first map is color-coded by<br />
the most frequent pattern that has been mapped into a neuron. In this example, this<br />
is the most frequent CBG binding affinity class. For instance, if two compounds with<br />
high and one compound with medium affinity were mapped into one single neuron<br />
the neuron gets color-coded with the color for high affinity (class 1, red). The second<br />
map additionally shows all neurons that contain compounds of at least two different<br />
classes (collision or conflict neurons). These neurons are marked in black color (see<br />
Figure 12, right map).<br />
• Note: The number and type of default maps can be changed by selecting Selected<br />
Maps ... in the menu Maps in the main menu bar. By default, the map types most<br />
frequent output and average output (conflicts) are checked (selected). Check<br />
further map types to add them to the default maps which are generated when<br />
selecting Selected Maps in the menu Maps in the main menu bar.<br />
Figure 12 Visualizing the Kohonen map colored by the most frequent activity<br />
class in each neuron (left) and with marked collision neurons (right; at<br />
least two molecules of different classes in the same neuron).<br />
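Note: The two map types can be reproduced with a few lines of Python. The sketch below takes, for each occupied neuron, the list of classes of the compounds mapped into it and derives the most frequent class and the set of conflict neurons. The neuron positions and class lists are invented; how SONNIA resolves ties is not covered by this tutorial, so the tie behavior here (first class encountered wins) is an assumption.

```python
from collections import Counter


def color_neurons(assignments):
    """assignments maps a neuron position to the classes of its compounds."""
    most_frequent = {}
    conflicts = set()
    for neuron, classes in assignments.items():
        counts = Counter(classes)
        most_frequent[neuron] = counts.most_common(1)[0][0]
        if len(counts) > 1:  # compounds of at least two different classes
            conflicts.add(neuron)
    return most_frequent, conflicts


# Invented example: two high and one medium affinity compound share neuron (0, 0).
example_assignments = {(0, 0): [1, 1, 2], (1, 0): [3], (2, 1): [2, 2]}
example_colors, example_conflicts = color_neurons(example_assignments)
```

In this example, neuron (0, 0) is colored as class 1 in the first map and marked as a conflict neuron in the second map.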
• Note: The generated Kohonen maps have a toroidal geometry (see also Figure 26<br />
on page 24). Therefore, each neuron in the map has the same number of neighbors<br />
(8), including the neurons at the edges. By clicking on the map and holding the left<br />
mouse button, the maps can be shifted in x and y direction. Note that only the<br />
selected map is shifted. All other maps remain unchanged.<br />
• Right-click on a map and select Tile ... in the context menu. The window SONNIA<br />
Tiling appears (see Figure 13). Due to the toroidal geometry of the maps they can<br />
be tiled. Tile more maps by changing the size of the window SONNIA Tiling with the<br />
mouse.<br />
• Note: Tiled maps often better visualize the result of the Kohonen mapping and help<br />
to better assess the quality of the classification.
Figure 13 Tiling of a Kohonen map.<br />
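Note: The consequences of the toroidal geometry, namely eight neighbors for every neuron and the possibility to tile the map, can be illustrated with a short Python sketch. The function names are hypothetical and the map is represented simply as a list of rows.

```python
def toroidal_neighbors(x, y, width=5, height=3):
    """The 8 neighbors of neuron (x, y) on a map whose edges wrap around."""
    return [((x + dx) % width, (y + dy) % height)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]


def tile(grid, times_x=2, times_y=2):
    """Repeat a map (list of rows) side by side and on top of each other."""
    return [list(row) * times_x for _ in range(times_y) for row in grid]


corner_neighbors = toroidal_neighbors(0, 0)
tiled = tile([[1, 2], [3, 4]])
```

Even the corner neuron (0, 0) of the 5 x 3 map has eight distinct neighbors, because the map continues at the opposite edge.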
Step 4: Analyze a Kohonen Map<br />
• In order to visualize which compounds were mapped into which neurons, left-click<br />
on a neuron while keeping the Ctrl key pressed. The neuron is now selected and is<br />
marked in light-gray color.<br />
• Right-click on the selected neuron and select Export Structures ... in the context<br />
menu. The Structure Browser appears and displays the compounds that have<br />
been mapped into the selected neurons (see Figure 14).<br />
Figure 14 Displaying the chemical structures that are mapped to a specific<br />
neuron.<br />
• Note: The structure file must have been loaded into SONNIA (see also Figure 7) to<br />
use this functionality.<br />
• Note: More than one neuron can be selected by a left-click on the map while<br />
keeping the Ctrl key pressed and dragging the mouse over the map. The focus of<br />
the selection is shown by a temporary rectangle while dragging the mouse. All<br />
selected neurons are finally marked in light-gray color.<br />
• Note: Neurons can be de-selected by left-clicking on the neuron while keeping the<br />
Ctrl and the Shift key pressed.<br />
• Note: Additional properties that are stored in the structure file (e.g., compound<br />
names, CBG affinity classes) can be displayed in the Structure Browser by<br />
selecting Chemical Properties ... in the menu Display of the main menu bar of the<br />
structure browser (Prop tabs in the Browser Annotation Display Style).<br />
• Right-click on a map and select Export Centroids ... in the context menu. The<br />
Structure Browser appears. The browser now displays the centroid compounds of<br />
all neurons (see Figure 15). The arrangement of the structure browser always<br />
reproduces the size of the network (here: 5 x 3).<br />
• Note: The centroid compound of a neuron is the compound whose descriptor<br />
vector (twelve dimensions) is most similar to the weight vector of the neuron<br />
(also twelve dimensions). Of all compounds that have been mapped to this neuron,<br />
the centroid compound has the descriptor vector with the minimum Euclidean<br />
distance to the vector of the neuron weights.<br />
Figure 15 Displaying the centroid structures of all neurons.<br />
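Note: The definition of the centroid compound translates directly into code. The following Python sketch selects, among the compounds mapped to one neuron, the one with the minimum Euclidean distance to the neuron weights. The compound names and the two-dimensional vectors are invented to keep the example short; in this tutorial the vectors have twelve dimensions.

```python
import math


def centroid_compound(neuron_weights, compounds):
    """compounds maps a compound name to its descriptor vector."""
    return min(compounds, key=lambda name: math.dist(compounds[name], neuron_weights))


# Invented example with two-dimensional vectors.
example_neuron = [0.5, 0.5]
example_compounds = {
    "compound_a": [0.9, 0.1],
    "compound_b": [0.6, 0.5],
    "compound_c": [0.0, 0.0],
}
example_centroid = centroid_compound(example_neuron, example_compounds)
```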
• In order to export the contents of all neurons (i.e., the information which compounds<br />
are mapped into which neurons), select Export Contents ... in the menu Analyze in<br />
the main menu bar. The dialog box SONNIA Write appears (see Figure 16).
Figure 16 Exporting the contents of all neurons.<br />
• Select a directory in the list Directory and select CSV File (Contents Maps) in the<br />
drop down menu Object. Enter a file name, e.g., steroids31_contentMap.csv,<br />
in the field Files and confirm with the button OK.<br />
• Note: The ASCII csv file (csv: comma separated values) can be displayed with a<br />
standard ASCII file browser or loaded into spreadsheet programs (e.g., Microsoft<br />
Excel). Figure 17 shows the content of the csv file (displayed in Microsoft WordPad).<br />
Figure 17 Displaying a contents maps file (csv).<br />
Quantitative Modeling of Biological Activities with SONNIA<br />
The following section describes the quantitative modeling of the CBG receptor binding<br />
affinity (CBG_ACTIVITY_pK) of the 31 steroid compounds using the<br />
counterpropagation neural network algorithm implemented in SONNIA. Again, each<br />
compound of the dataset is represented by a twelve-dimensional autocorrelation vector<br />
that encodes the spatial distribution of the electrostatic potential on the molecular<br />
surface. The counterpropagation algorithm is a supervised learning technique. In<br />
contrast to the Kohonen algorithm, the pK values of the CBG receptor binding affinity<br />
are now used to derive a model expressing the relationship between the descriptors<br />
(independent variables) and the biological activity (dependent variable).<br />
Step 1: Start SONNIA, Load the Descriptor and the Structure File<br />
• Start the graphical user interface (GUI) of SONNIA by double-clicking the desktop<br />
icon.<br />
• Select Read ... in the menu File in the main menu bar. The dialog box SONNIA<br />
Read appears (see also Figure 7).<br />
• Select in the list Directory the directory where the structure and descriptor files are<br />
located and select Data File in the drop down menu Object.<br />
• Select the file steroids31_actpK_mep_ac12.dat and press the button OK.<br />
• In order to load the structure file, repeat this procedure, but select Structure File in<br />
the drop down menu Object and select the file steroids31_act.sdf.<br />
Step 2: Create and Train a Counterpropagation Neural Network<br />
• Select Create ... in the menu Network in the main menu bar. The dialog box<br />
SONNIA Network appears (see Figure 18).
Figure 18 Creating a counterpropagation network.<br />
• Select Counterprop. in the drop down menu of the section Algorithm. Ensure that<br />
Topology is set to toroidal.<br />
• Enter the number 12 (dimension of descriptor vector) in the field Input and 1 in the<br />
field Output (dimension of the property to model, single value of CBG binding<br />
affinity) in the section Network Dimensions. Use the default settings for all other<br />
parameters and press the button Create.<br />
• Select Train ... in the menu Network in the main menu bar. The dialog box SONNIA<br />
Training appears (see also Figure 9).<br />
• Use the default settings for all parameters and press the button Train. The window<br />
SONNIA Monitor appears which shows the changes of the dynamic error (distance<br />
between input vectors and neuron weights) with the number of training cycles (see<br />
also Figure 10).<br />
• The training is finished when the button Stop in the window SONNIA Monitor<br />
changes to OK (see also Figure 10).<br />
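Note: The principle of a counterpropagation network can be sketched in Python as follows. Each neuron carries input weights (12 values) and one output weight. The winning neuron is determined from the descriptors alone, both weight parts are then adapted, and a prediction is the output weight of the winning neuron. This is a generic sketch and not the SONNIA implementation. The decay scheme, the neighborhood function and the random example data are assumptions.

```python
import random


def _sqdist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def _grid_dist(a, b, width, height):
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return max(min(dx, width - dx), min(dy, height - dy))  # toroidal map


def train_counterprop(xs, ys, width=5, height=3, cycles=50, rate0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(xs[0])
    net = {(i, j): {"in": [rng.random() for _ in range(dim)],
                    "out": rng.uniform(min(ys), max(ys))}
           for i in range(width) for j in range(height)}
    for cycle in range(cycles):
        frac = 1.0 - cycle / cycles
        rate, radius = rate0 * frac, max(width, height) / 2 * frac
        for x, y in zip(xs, ys):
            # The winner is chosen from the descriptors only, not from y.
            win = min(net, key=lambda p: _sqdist(net[p]["in"], x))
            for pos, neuron in net.items():
                d = _grid_dist(win, pos, width, height)
                if d <= radius:
                    h = rate * (1.0 - d / (radius + 1.0))
                    neuron["in"] = [w + h * (v - w) for w, v in zip(neuron["in"], x)]
                    neuron["out"] += h * (y - neuron["out"])  # supervised part
    return net


def predict(net, x):
    """Predicted value = output weight of the winning neuron."""
    win = min(net, key=lambda p: _sqdist(net[p]["in"], x))
    return net[win]["out"]


# Random stand-in data: 31 patterns, 12 descriptors, pK-like target values.
_rng = random.Random(2)
example_xs = [[_rng.random() for _ in range(12)] for _ in range(31)]
example_ys = [_rng.uniform(-7.8, -5.0) for _ in range(31)]
example_net = train_counterprop(example_xs, example_ys)
example_predictions = [predict(example_net, x) for x in example_xs]
```

Because a prediction is always the output weight of one neuron, the number of distinct predicted values cannot exceed the number of neurons.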
Step 3: Visualize the Trained Counterpropagation Network<br />
• Note: The trained counterpropagation network can be visualized in a style similar to<br />
a Kohonen map. In this example, a continuous value (pK value) is modeled which<br />
ranges from about -7.8 to -5.0. The number of colors that are available in SONNIA<br />
is limited to 10. Therefore, only ranges of the predicted values can be color-coded<br />
by a single color.<br />
• Select Palette Editor ... in the menu Maps in the main menu bar. The dialog box<br />
SONNIA Palette Editor appears.<br />
• Choose 3 in the drop-down menu Colors and 13 (= 12 autocorrelation coefficients +<br />
1 activity value) in the field Output (see Figure 19; the 13th column in the input data<br />
file steroids31_actpK_mep_ac12.dat is the pK value).<br />
• Note: The entire range of the pK values from about -7.8 to -5.0 is now represented<br />
by three colors in equidistant ranges, i.e., red: pK values from about -7.8 to -6.9; yellow:<br />
pK values from about -6.8 to -5.9; green: pK values from about -5.8 to -5.0.<br />
• Confirm the settings by pressing the button Apply.<br />
Figure 19 Setting the number and type of colors for displaying the map of<br />
the counterpropagation network.<br />
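Note: The equidistant color ranges follow from simple arithmetic, shown here as a Python sketch. The range of 2.8 pK units is divided into three intervals of about 0.93 units each, which corresponds to the rounded limits quoted above. How SONNIA treats values that lie exactly on an interval boundary is not described in this tutorial; assigning them to the upper interval is an assumption.

```python
def color_ranges(lo, hi, n_colors):
    """Split [lo, hi] into n_colors intervals of equal width."""
    width = (hi - lo) / n_colors
    return [(lo + k * width, lo + (k + 1) * width) for k in range(n_colors)]


def color_index(value, lo, hi, n_colors):
    """Index of the color interval that contains value (0 = first color)."""
    width = (hi - lo) / n_colors
    return min(int((value - lo) / width), n_colors - 1)


pk_ranges = color_ranges(-7.8, -5.0, 3)
colors = ["red", "yellow", "green"]
example_color = colors[color_index(-6.3, -7.8, -5.0, 3)]
```

A compound with a pK value of -6.3 falls into the middle interval and is therefore displayed in yellow.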
• Select Selected Maps in the menu Maps in the main menu bar. The two default<br />
maps are generated and displayed (see Figure 20).<br />
Figure 20 Displaying the maps of the counterpropagation network.
Step 4: Write and Analyze the Prediction File<br />
• In order to write out the pK values predicted by the counterpropagation network,<br />
select Write in the menu File in the main menu bar. The dialog box SONNIA Write<br />
appears (see Figure 21).<br />
• Select Prediction File in the drop down menu Object and enter a file name in the<br />
field Files (e.g., steroids31.prd).<br />
• Confirm with the button OK. The dialog box Prediction appears and suggests in the<br />
field Input Dimensionality the value 12 (see Figure 21; number of descriptors of<br />
each compound). Confirm with the button Apply.<br />
Figure 21 Writing the prediction file.<br />
• Note: The prediction file steroids31.prd is an ASCII file which lists the input Y<br />
variable(s) (experimental pK values), the predicted Y variable(s) (predicted pK<br />
values) and the name of the compound. The file can be loaded into spreadsheet<br />
applications or a standard ASCII text viewer for further analysis (see Figure 22 and<br />
Figure 23).<br />
Figure 22 Loading a prediction file into a spreadsheet application (here: MS<br />
Excel).<br />
Figure 23 Analyzing the prediction of SONNIA (here: MS Excel).
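Note: Instead of a spreadsheet, the agreement between experimental and predicted pK values can also be assessed with a short script. The following Python sketch computes the root mean square error and the squared correlation coefficient of two value lists. The exact column layout of the prediction file is not reproduced here, so the values are given directly as lists, and the numbers are invented for demonstration.

```python
import math


def rmse(experimental, predicted):
    """Root mean square error between two equally long value lists."""
    n = len(experimental)
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(experimental, predicted)) / n)


def r_squared(experimental, predicted):
    """Squared Pearson correlation (both lists must not be constant)."""
    n = len(experimental)
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_e * var_p)


# Invented experimental and predicted pK values.
example_experimental = [-7.5, -6.5, -5.5, -6.0]
example_predicted = [-7.2, -6.6, -5.8, -6.1]
example_rmse = rmse(example_experimental, example_predicted)
example_r2 = r_squared(example_experimental, example_predicted)
```

Note that values calculated for the training compounds only describe how well the model fits the data. They do not show how well new compounds are predicted.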
Tips and Tricks<br />
Preprocessing Data Files<br />
Merging Structure and Property Data<br />
Often, the chemical structure data is stored in an MDL SDFile whereas any additional<br />
information related to the chemical structures (e.g., any measured or experimental<br />
data) is stored in a separate file, e.g., in a table-like formatted ASCII file. A primary key<br />
(e.g., a unique name or number of the chemical structures) that is present in both the<br />
SD and the ASCII file is the only link between the structure and the additional data.<br />
In order to merge chemical structure and additional data into a single SDFile, Molecular<br />
Networks' tool MN.MERGE (www.molecular-networks.com/software/split_join_merge/)<br />
can be used. Figure 24 shows a part of an SDFile (left) and an ASCII file (right) that<br />
contains some experimental values (Exp1, Exp2), a categorical value (Class1) and the<br />
compound name (CpdName) organized as a table. The primary key is given in the<br />
column CpdName; in the corresponding SDFile it can be present either in the name<br />
field (see Figure 24) or in a data field. The command line of MN.MERGE to merge the<br />
files is:<br />
mn.merge -tablefile tablefile.txt -tablekey CpdName -outfile outfile_merged.sdf infile.sdf
compound_1
CS 02280711002D 0
Molecular Networks 28.02.2007
54 58 0 0 0 0 0 0 0 0999 V2000
2.8729 -2.0044 0.0000 C 0 0 0
2.8920 -0.9195 0.0000 C 0 0 0
...
25 33 1 0 0 0 0
26 34 1 0 0 0 0
M END
$$$$
...

Exp1    Exp2  Class1  CpdName
-6.279  71.5  2       compound_1
-5.316  63.4  3       compound_2
-5.334  69.7  3       compound_3
-5.763  65.7  3       compound_4
...
-5.613  79.5  3       compound_30
-7.881  69.0  1       compound_31
Figure 24 Merging SDFiles and data files with MN.MERGE.<br />
The resulting SDFile outfile_merged.sdf will contain the values of the table columns<br />
in the SDF data fields "Exp1", "Exp2", "Class1" and "CpdName".<br />
Note: Any data field that is already present in the input SDFile is written to the output.<br />
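Note: The principle of merging by a primary key can be illustrated with a small Python sketch. It is not the MN.MERGE implementation and covers only the simplest case: the key is the name line (first line) of each SD record, and the table is whitespace-separated with one header line. The function name and the stub molecule blocks are hypothetical. Passing records without a matching table row through unchanged is an assumption.

```python
def merge_table_into_sdf(sdf_text, table_text, key_column):
    """Append the table columns as SD data fields to the matching records."""
    rows = [line.split() for line in table_text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    key_idx = header.index(key_column)
    by_key = {row[key_idx]: row for row in body}
    out = []
    for record in sdf_text.split("$$$$\n"):
        if not record.strip():
            continue
        name = record.splitlines()[0].strip()  # name line = primary key
        row = by_key.get(name)
        if row is not None:
            for column, value in zip(header, row):
                record += f">  <{column}>\n{value}\n\n"
        out.append(record + "$$$$\n")
    return "".join(out)


# Two stub records (empty molecule blocks); only compound_1 occurs in the table.
EXAMPLE_SDF = (
    "compound_1\n  example\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
    "compound_x\n  example\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
)
EXAMPLE_TABLE = """Exp1 Exp2 Class1 CpdName
-6.279 71.5 2 compound_1
-5.316 63.4 3 compound_2
"""
merged = merge_table_into_sdf(EXAMPLE_SDF, EXAMPLE_TABLE, "CpdName")
```

Data fields that are already present in a record are kept, because the new fields are only appended to the end of the record.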
Standardization and Checking Structural State and Integrity of Structure Files<br />
Chemical structure files may originate from different sources. Therefore, the chemical<br />
structures may differ in the way they are coded in their connection table representation<br />
or even show some errors. For instance, functional groups such as nitro groups may be<br />
coded with a pentavalent nitrogen atom or as a charged species, hydrogen atoms may<br />
be given implicitly or explicitly or charges in salts may not be balanced correctly.<br />
However, for a corporate compound database or a dataset under investigation it may<br />
be mandatory that all chemical structures and their connection table representation<br />
comply with a certain standard, i.e., are coded in a consistent and pre-defined fashion.<br />
Molecular Networks' tool MN.CHECK (www.molecular-networks.com/software/check/)<br />
can be helpful to standardize chemical structure data by applying a set of business<br />
rules that can be selected by the user. MN.CHECK supports batch mode execution<br />
and is able to process large chemical files quickly and efficiently. Furthermore, MN.CHECK<br />
can be used to detect and correct errors in the structure coding (e.g., missing charges<br />
at counter ions in salts) and to identify and remove duplicate structures in large<br />
collections of chemical compounds (based on a 64-bit hash coding technique).<br />
For example, the MN.CHECK command line<br />
mn.check -hydrogen add -nitrostyle ionic -chargebalance -pedantic -unique -outfile outfile_checked.sdf infile.sdf
will read in the file infile.sdf, add implicit hydrogen atoms, re-code all nitro groups<br />
(and similar functional units) as charge pairs (with a tetravalent, positively charged<br />
nitrogen atom, and a negatively charged oxygen atom or another ligand atom), balance<br />
charges in salts, pedantically check the file formatting and structure coding and write<br />
out a message when errors are detected, identify and remove duplicate structures and<br />
write out the normalized and checked structures to the file outfile_checked.sdf.<br />
Complementary Software<br />
Another helpful and valuable tool in this area is Molecular Networks' file format<br />
converter MN.CONVERT that supports over 50 different file formats for chemical<br />
structure and reaction information and interconverts them with high conversion rates<br />
and reliability. A complete list of all supported file formats can be found at the product<br />
page of MN.CONVERT at www.molecular-networks.com/software/convert/.<br />
2D structure diagrams (2D coordinates) in publication quality can be generated with<br />
Molecular Networks' tool MN.2DCOOR. The tool offers a variety of options and<br />
features to customize the layout of 2D structure plots. For instance, structures can be<br />
aligned to their main x or y axes or to a template structure provided in a separate file<br />
(e.g., to align all structures in a combinatorial library to a predefined orientation of their<br />
common scaffold). Further information about MN.2DCOOR can be found at its product<br />
page at www.molecular-networks.com/software/2dcoor.
Training Parameters of a Neural Network<br />
Network Size<br />
Tips and Tricks<br />
By default, SONNIA suggests a ratio of approximately one neuron per two<br />
compounds/patterns (1:2), which usually works well for initial tests. Another possibility is<br />
to start with a ratio of 1:1 and to gradually reduce the size in subsequent runs. If the<br />
network gets too large, there is a high likelihood that it will only memorize the<br />
input data without revealing the actual neighborhood relationships of the<br />
data patterns (which show up, e.g., as conflict neurons, i.e., neurons that contain patterns of<br />
more than one class, such as known actives and compounds of unknown activity).<br />
Smaller networks (high pattern/neuron ratio, i.e., many patterns per neuron) tend to produce<br />
more conflict neurons, which might be of interest for some applications, e.g., for lead hopping.<br />
However, in networks that are too small the data has to be compressed into a few neurons.<br />
This may lead to conflict neurons that are not very meaningful. A balanced ratio should be achieved.<br />
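The choice of a network size for a given ratio can be sketched in a few lines of Python.<br />
The function below is a hypothetical helper, not part of SONNIA. It assumes a 4:3 map<br />
shape and simply rounds to whole neurons; SONNIA's own suggestion may be computed differently.<br />

```python
import math


def suggest_map_size(n_patterns, neurons_per_pattern=0.5, aspect=4 / 3):
    """Return (width, height) of a map with about
    n_patterns * neurons_per_pattern neurons.

    neurons_per_pattern=0.5 corresponds to the 1:2 ratio, 1.0 to 1:1.
    The aspect ratio (width / height) is an assumption of this sketch.
    """
    target = max(1, n_patterns * neurons_per_pattern)
    height = max(1, round(math.sqrt(target / aspect)))
    width = max(1, round(target / height))
    return width, height


# 96 compounds: start with 1:1, then reduce to 1:2 in a following run.
print(suggest_map_size(96, neurons_per_pattern=1.0))  # (12, 8)
print(suggest_map_size(96, neurons_per_pattern=0.5))  # (8, 6)
```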
Another example of a rather high pattern/neuron ratio is the visualization of large<br />
chemical spaces. Figure 25 shows the projection of about 404,000 chemical<br />
compounds from different sources into a Kohonen map of 80 x 60 neurons.<br />
[Figure 25 legend: 404,449 compounds; 4,800 neurons (80 x 60), of which 4,799 are occupied. Sources: chemical supplier databases (139,961), NCI database (193,339), MDDR (71,149). Color coding: most frequent pattern in neuron, scaled.]<br />
Figure 25 Visualization of large chemical spaces with SONNIA.<br />
Network Topology<br />
SONNIA offers two different types of network topology, a toroidal and a rectangular<br />
topology.<br />
Toroidal topology. All neurons have the same neighbor relationship, i.e., eight direct<br />
neighbors. This means that in the resulting Kohonen map the neurons at the corners<br />
and edges are adjacent to the neurons on the opposite side of the map. This can be<br />
illustrated by a torus that is cut twice to obtain a plane (see Figure 26).<br />
Figure 26 Toroidal topology of a Kohonen neural network.<br />
Rectangular topology. The neurons at the corners and the edges form the boundary<br />
of the network. Therefore, a neuron at a corner of the network has only three neurons<br />
as direct neighbors, an edge neuron has five, and all other neurons have eight<br />
neighbors.<br />
Rectangular topologies are better for classification purposes since, e.g., "outliers" tend<br />
to be pushed toward the edges and corners.<br />
Toroidal topologies are better if the data under investigation represents a "closed"<br />
system, e.g., if a molecular surface and its property is mapped into a two-dimensional<br />
plane by a Kohonen network.<br />
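The difference between the two topologies can be made concrete by listing the direct<br />
neighbors of a neuron. The following Python sketch is for illustration only; the function name<br />
and the zero-based (x, y) indexing are assumptions and do not reflect SONNIA's internals.<br />

```python
def direct_neighbors(x, y, width, height, toroidal):
    """Return the direct neighbors of neuron (x, y) on a width x height map."""
    neighbors = set()
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue  # skip the neuron itself
            nx, ny = x + dx, y + dy
            if toroidal:
                # Wrap around: opposite edges of the map are adjacent.
                nx, ny = nx % width, ny % height
            elif not (0 <= nx < width and 0 <= ny < height):
                continue  # rectangular map: no neurons beyond the boundary
            neighbors.add((nx, ny))
    neighbors.discard((x, y))  # only relevant for very small toroidal maps
    return sorted(neighbors)


# Corner neuron of an 80 x 60 map in both topologies.
print(len(direct_neighbors(0, 0, 80, 60, toroidal=False)))  # 3
print(len(direct_neighbors(0, 0, 80, 60, toroidal=True)))   # 8
```

In the toroidal case the corner neuron (0, 0) is, for example, adjacent to neuron (79, 59)<br />
at the opposite corner of the map.<br />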
Training and Learning Parameters<br />
SONNIA (Network window, see Figure 9) makes some reasonable suggestions for the<br />
number of training cycles (epochs) and intervals, i.e., how often the data set is<br />
presented to the network before the weights of the neurons are adapted to the input<br />
data. Furthermore, the initial spans and steps (the distance in x and y direction<br />
around the central/winning neuron within which the weights of the neurons are adapted;<br />
this distance is gradually reduced during the training) are set automatically according to<br />
the size of the network.<br />
Reasonable new training parameters for span and step can be calculated as follows<br />
(see Figure 27).<br />
$$\mathrm{Span}(x) = \frac{\mathrm{Width}}{2}, \qquad \mathrm{Step}(x) = \frac{\mathrm{Span}(x)}{\mathrm{Epochs}}$$
$$\mathrm{Span}(y) = \frac{\mathrm{Height}}{2}, \qquad \mathrm{Step}(y) = \frac{\mathrm{Span}(y)}{\mathrm{Epochs}}$$
Figure 27 Calculation of training parameters for a neural network.<br />
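As a worked example of Figure 27, the following Python sketch computes the four values<br />
(span is half the map dimension, step is span divided by the number of epochs). The function name is<br />
hypothetical, and the choice of 20 epochs is only an assumption for illustration.<br />

```python
def training_parameters(width, height, epochs):
    """Span and step values for a width x height map, following Figure 27."""
    if epochs <= 0:
        raise ValueError("epochs must be a positive number")
    span_x = width / 2
    span_y = height / 2
    return {
        "span_x": span_x,
        "step_x": span_x / epochs,
        "span_y": span_y,
        "step_y": span_y / epochs,
    }


# An 80 x 60 map trained for 20 epochs (assumed value).
print(training_parameters(80, 60, 20))
# {'span_x': 40.0, 'step_x': 2.0, 'span_y': 30.0, 'step_y': 1.5}
```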
Learning rates (Rate in SONNIA Training window, see Figure 9) of about 0.5 are<br />
recommended. In general, it's preferable to train longer (i.e., higher number of epochs)
but with lower learning rates. High learning rates may cause problems if several input<br />
patterns compete for one neuron.<br />
The rate factor (Rate Factor in SONNIA Training window, see Figure 9) reduces the<br />
learning rate after each epoch by multiplying the learning rate by the rate factor. At<br />
the beginning of the training<br />
In general, Kohonen (or SOM) mapping is quite powerful because it allows a quick<br />
visual inspection of a high-dimensional space and a rapid assessment of<br />
whether the descriptors used are able to reveal trends and patterns in the<br />
data.<br />
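The effect of the rate factor described above can be illustrated with a short Python sketch that lists<br />
the learning rate used in each epoch. The initial rate of 0.5 is taken from the recommendation<br />
above; the rate factor of 0.9 and the function name are assumptions chosen for illustration only.<br />

```python
def learning_rate_schedule(initial_rate, rate_factor, epochs):
    """Return the learning rate of each epoch; after every epoch the
    rate is multiplied by the rate factor."""
    rates = []
    rate = initial_rate
    for _ in range(epochs):
        rates.append(rate)
        rate *= rate_factor
    return rates


# Initial rate 0.5, assumed rate factor 0.9, five epochs.
for epoch, rate in enumerate(learning_rate_schedule(0.5, 0.9, 5), start=1):
    print(f"epoch {epoch}: rate = {rate:.4f}")
```

With a rate factor below 1 the learning rate decreases geometrically, so the weights are<br />
adjusted more and more carefully toward the end of the training.<br />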
Assessing the Quality of an Unsupervised Classification<br />
Basically, three different criteria can be used to assess the quality of a<br />
classification done by a Kohonen mapping. These three criteria, visual inspection,<br />
occupancy and number of collisions (conflict neurons), are described in the following.<br />
Note that all three criteria should be taken into account when deciding whether<br />
a generated Kohonen map shows a "good" classification.<br />
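The number of collisions can be counted directly once the winning neuron of each pattern is<br />
known. The following Python sketch assumes that the winning neurons are available as (x, y) tuples<br />
and the classes as labels; the function name and the data layout are hypothetical and not a SONNIA interface.<br />

```python
from collections import defaultdict


def conflict_neurons(winners, labels):
    """Return the neurons that contain patterns of more than one class.

    winners: winning neuron (x, y) of each pattern
    labels:  class label of each pattern, in the same order
    """
    classes_per_neuron = defaultdict(set)
    for neuron, label in zip(winners, labels):
        classes_per_neuron[neuron].add(label)
    return sorted(n for n, classes in classes_per_neuron.items()
                  if len(classes) > 1)


winners = [(0, 0), (0, 0), (1, 2), (1, 2), (3, 1)]
labels = ["active", "inactive", "active", "active", "inactive"]
print(conflict_neurons(winners, labels))  # [(0, 0)]
```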
Visual Inspection<br />
The strength of Kohonen maps is that they can be generated rather quickly and the<br />
results can be visually inspected. The visual inspection allows for a rapid assessment<br />
of whether the descriptors used are able to reveal trends and patterns in the data<br />
("... human inspection building on the powerful pattern recognition capabilities of the<br />
human mind") [7].<br />
A Kohonen map that shows a clear separation of different classes of compounds in a<br />
dataset can be regarded as an indicator that there is a relationship between the<br />
descriptor(s) used and the property under investigation.<br />
Occupancy<br />
A well-trained Kohonen network should also show a balanced and even distribution of<br />
the patterns (i.e., compounds) over the resulting map as well as a low fraction of<br />
unoccupied neurons (shown as white squares in the map). The distribution of the<br />
patterns and the occupancy of each individual neuron can be checked with an<br />
"occupancy map" (menu Maps in the main menu bar of SONNIA, see Figure 28, right<br />
map). The occupancy map is color-coded by the number of patterns/compounds that<br />
are assigned to each neuron.<br />
Figure 28 Occupancy map of a Kohonen neural network (right map).<br />
A Kohonen map with an unbalanced occupancy of the neurons (e.g., more than<br />
half of the input patterns are located in only 10% of the total number of neurons) can<br />
have several causes, e.g.,<br />
• The training of the network was stopped too early: Train a newly created network<br />
and increase the number of Epochs (adjust the values for Step(x) and Step(y)<br />
accordingly).<br />
• The input values of one or a few input patterns are rather different from the rest of<br />
the input patterns of the dataset ("outliers"): remove these patterns from your<br />
training set and train a newly created network with the reduced dataset.
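The occupancy criterion can also be checked numerically. The following Python sketch computes the<br />
fraction of unoccupied neurons and the share of patterns that fall into the most crowded 10% of the<br />
neurons. The function name and the input format (one winning neuron per pattern as an (x, y) tuple)<br />
are assumptions for illustration; the 10% threshold is the example value mentioned above.<br />

```python
from collections import Counter


def occupancy_summary(winners, width, height, top_fraction=0.10):
    """Summarize how evenly the patterns are spread over the map."""
    counts = Counter(winners)        # patterns per occupied neuron
    n_neurons = width * height
    n_patterns = len(winners)
    top_n = max(1, round(n_neurons * top_fraction))
    in_top = sum(c for _, c in counts.most_common(top_n))
    return {
        "unoccupied_fraction": (n_neurons - len(counts)) / n_neurons,
        "top_share": in_top / n_patterns if n_patterns else 0.0,
    }


# Ten patterns on a 5 x 4 map, most of them in two neurons.
winners = [(0, 0)] * 6 + [(1, 1)] * 2 + [(2, 2), (3, 3)]
summary = occupancy_summary(winners, width=5, height=4)
print(summary)  # {'unoccupied_fraction': 0.8, 'top_share': 0.8}
if summary["top_share"] > 0.5:
    print("unbalanced occupancy: train longer or check for outliers")
```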
Problems and Help!<br />
If there are any difficulties with the installation of ADRIANA.Code or SONNIA or if any<br />
problems occur while running ADRIANA.Code or SONNIA please send all inquiries to<br />
the following address:<br />
<strong>Molecular</strong> <strong>Networks</strong> GmbH Computerchemie<br />
Henkestr. 91<br />
91052 Erlangen<br />
Germany<br />
or contact us by email support@molecular-networks.com<br />
or by Fax +49-9131-815669<br />
References<br />
[1] Descriptor Calculation Package ADRIANA.Code, developed and distributed by<br />
Molecular Networks GmbH, Erlangen, Germany (http://www.molecular-networks.com).<br />
[2] Neural Networks Package SONNIA, developed and distributed by Molecular Networks<br />
GmbH, Erlangen, Germany (http://www.molecular-networks.com).<br />
[3] Wagener, M.; Sadowski, J.; Gasteiger, J. Autocorrelation of Molecular Surface<br />
Properties for Modeling Corticosteroid Binding Globulin and Cytosolic Ah Receptor<br />
Activity by Neural Networks. J. Am. Chem. Soc. 1995, 117, 7769-7775.<br />
[4] a) Dalby, A.; Nourse, J. G.; Hounshell, W. D.; Gushurst, A. K. I.; Grier, D. L.; Leland, B.<br />
A.; Laufer, J. Description of Several Chemical Structure File Formats Used by<br />
Computer Programs Developed at Molecular Design Limited. J. Chem. Inf. Comput.<br />
Sci. 1992, 32, 244-255. b) A detailed description of MDL file formats (Mol, SDF and<br />
RDF) is available for download as a PDF document at http://www.mdli.com.<br />
[5] Sadowski, J.; Gasteiger, J.; Klebe, G. Comparison of Automatic Three-Dimensional<br />
Model Builders Using 639 X-Ray Structures. J. Chem. Inf. Comput. Sci. 1994, 34,<br />
1000-1008.<br />
[6] 3D Structure Generator CORINA, developed and distributed by Molecular Networks<br />
GmbH, Erlangen, Germany (http://www.molecular-networks.com).<br />
[7] Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug Design. Second<br />
Edition, Wiley-VCH, Weinheim, 1999, 380 pages, ISBN 3-527-29778-2.<br />