
Automatic functional annotation 

of predicted active sites: 

combining PDB and literature mining 

Kevin Nagel 

Wolfson College 

A dissertation submitted to the University of Cambridge 

for the degree of Doctor of Philosophy 

European Molecular Biology Laboratory, 

European Bioinformatics Institute, 

Wellcome Trust Genome Campus, Hinxton, 

Cambridge CB10 1SD, United Kingdom. 

Email: kevin5jan@googlemail.com 

January 2009

Declaration 

This dissertation is the result of my own work, and includes nothing which is the outcome 

of work done in collaboration, except where specifically indicated in the text. The dissertation 

does not exceed the specified length limit of 300 pages as defined by the Biology 

Degree Committee. This thesis has been typeset in 12pt font using LaTeX 2ε according

to the specifications defined by the Board of Graduate Studies and the Biology Degree 

Committee. 


Summary 

Kevin Nagel 

European Bioinformatics Institute 

University of Cambridge 

Dissertation title: Automatic functional annotation of predicted active sites: 

combining PDB and literature mining. 

Proteins are essential to cell function, and their functions are mainly identified in biological experiments.

The structural models for proteins help to explain their function, but are not direct 

evidence for their function. Nonetheless, we can mine structural databases, such as Protein 

Data Bank (PDB), to filter out shared structural components that are meaningful with

regard to protein function.

This thesis applied mining techniques to PDB to identify evolutionarily conserved structural

patterns, e.g. active sites. This analysis retrieved 3- and 4-bodies with assumed two- and

three-way residue interactions that have been selected from a distribution analysis of

residue triplets. A subset of the mined patterns is assumed to represent an active site, 

which should be confirmed by annotations gathered by automatic literature analysis. 

Literature analysis for the functional annotation of proteins relies on the extraction 

of GO terms from the context of a protein mention. The annotation of protein residues 


requires the identification of chemical functions, which could be found in the context

of residue mentions. MEDLINE abstracts have been processed to identify protein mentions 

in combination with species and residues (F1-measure 0.52; the F1-measure is a 

statistical measure of a test’s accuracy based on the precision and recall of a test). The 

identified protein-species-residue triplets have been validated and benchmarked against 

reference data resources. Then, contextual features were extracted through shallow and 

deep parsing and the features have been classified into predefined categories (F1-measure 

ranges from 0.15 to 0.67). Furthermore, the feature sets have been aligned with annotation 

types in UniProtKB to assess the relevance of the annotations for ongoing curation 

projects. 

Altogether, the annotations have been assessed automatically and manually 

against reference data resources. 

All MEDLINE has been processed to filter out annotations for residues. A subset of 

identified catalytic sites could be cross-validated against the Catalytic Site Atlas (CSA; 

44 out of 221). 429 out of 512 protein residues from MSDsite were then annotated with

contextual data. Altogether, MEDLINE does not provide sufficient data to fully annotate 

the content from PDB. Conversely, residue annotation is achieved with a different feature 

set than provided from GO, and incomplete annotations in the reference datasets can be 

filled from public literature. 


Acknowledgements 

This thesis would not have been possible without the support, direction, and love of a multitude 

of people. First, I would like to thank my supervisor Dietrich Rebholz-Schuhmann 

for his trust, encouragements, and for all his unconditional support and guidance. Dietrich 

has throughout given me opportunity and a sound research methodology. Working 

with him I have learned the value of vision, and persistence in achieving it. 

I am blessed to have had Tom Oldfield for my second supervisor. Ever since I was 

interviewed by Tom, he has been inspiring, helpful and most of all patient. I will look back 

fondly on our discussions, the ”insights” in protein science he gave me, and the cheerful 

and motivational chats. I am deeply indebted for his belief in me. 

I would like to thank my thesis committee members for their valuable and constructive

comments and criticism: Michael Ashburner, Kim Henrick, and Rob Russell.

They all seemed to find time for me despite their busy schedules. 

A special thank you must go to Kim Henrick; had he not encouraged me to pursue a 

research position I would not be a scientist now. 

I would also like to acknowledge Antonio Jimeno for his time, patience, and suggestions 

and especially for reminding me to keep my focus always. But most of all I will remember 

the great times we had cycling to and from work. 

I would like to thank the past and present members of the Rebholz Group (Text 

Mining). During my years of research, the group has expanded and I have had the chance 

to learn from them as well as to have fun with them within the group. 


I am also thankful to the European Molecular Biology Laboratory (EMBL) for the scholarship

and the organised EMBL International PhD programme, throughout which I have 

had the chance to meet many talented and cheerful PhD students from the EMBL/EBI 

Hinxton. 

A special thank you to Christina Granroth and Dagmar Harzheim, who have done the 

proofreading of this thesis. Thank you, Dagmar, for helping me make clearer what I want to say.

Finally, I would like to acknowledge my wife Almut Nagel and my daughter Juli Nagel. 

Without Almut I would have become a working maniac with no joy in life; she helped me 

to maintain balance during my PhD research and also for the future. My special thanks 

and love will go to Juli, aged one, from whom I have learned so much. 


Contents 

1 Introduction 15 

1.1 Proteins and functional sites . . . . . . . . . . . . . . . . . . . . . . . . . . 15 

1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 

1.3 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 

1.4 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 

1.5 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 

1.6 Guide to remaining chapters . . . . . . . . . . . . . . . . . . . . . . . . . . 24 

2 Background 26 

2.1 Protein related data resources . . . . . . . . . . . . . . . . . . . . . . . . . 26 

2.1.1 Protein Data Bank . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 

2.1.2 Universal Protein Knowledge base . . . . . . . . . . . . . . . . . . . 31 

2.1.3 Gene Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 

2.1.4 Biomedical literature . . . . . . . . . . . . . . . . . . . . . . . . . . 33 

2.2 Protein structure data mining . . . . . . . . . . . . . . . . . . . . . . . . . 35 

2.2.1 Hypothesis-driven data analysis . . . . . . . . . . . . . . . . . . . . 36 

2.2.2 Discovery-driven data mining . . . . . . . . . . . . . . . . . . . . . 37 

2.3 Biomedical literature mining . . . . . . . . . . . . . . . . . . . . . . . . . . 38 

2.3.1 Biological entity recognition . . . . . . . . . . . . . . . . . . . . . . 38 

2.3.2 Biological relation extraction . . . . . . . . . . . . . . . . . . . . . . 39 


2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 

3 Mining residue interactions as triads from PDB 42 

3.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 

3.1.1 Structural feature extraction . . . . . . . . . . . . . . . . . . . . . . 44 

3.1.2 Detection of significant configurations as interactions . . . . . . . . 47 

3.1.3 Grouping and selecting frequent configurations . . . . . . . . . . . . 52 

3.2 Analysing available non-redundant protein structure sets . . . . . . . . . . 53 

3.3 Evaluation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 

3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 

3.4.1 Identification of residue interactions is dependent on data selection 55 

3.4.2 The interaction distance correlates with the distribution of residue 

triads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 

3.4.3 Interaction classification is sensitive to the size of cross-validation . 59 

3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 

3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 

4 Prediction of functions for mined residue triads 63 


4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 

4.2.1 Identification of homologous metal binding sites . . . . . . . . . . . 66 

4.2.2 Validation of convergent metal binding sites . . . . . . . . . . . . . 67 

4.2.3 Recovering active sites and catalytic triads from the dataset . . . . 73 

4.2.4 Discovering the conserved serine residue in the catalytic triad (quartet) 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 

4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 

4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 


5 Identification of protein residues in MEDLINE 79 

5.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 

5.1.1 Protein and organism entity recognition . . . . . . . . . . . . . . . 81 

5.1.2 Entity recognition of protein residue . . . . . . . . . . . . . . . . . 82 

5.1.3 Association identification of the entity triplet organism, protein, 

and residue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 

5.2 The construction of evaluation test corpora . . . . . . . . . . . . . . . . . . 86 


5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 

5.4.1 Evaluation of organism, protein, and residue entity recognition . . . 90 

5.4.2 Performance study on the entity triplet association . . . . . . . . . 92 

5.4.3 Cross-validation of identified residues with UniProtKB . . . . . . . 93 

5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins . . . . . 94 

5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 

5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 

6 Information extraction from the context of a residue in text 101 

6.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 

6.1.1 Extraction of contextual features . . . . . . . . . . . . . . . . . . . 103 

6.1.2 Categorisation of contextual features . . . . . . . . . . . . . . . . . 110 


6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 

6.3.1 Contextual feature extraction evaluated . . . . . . . . . . . . . . . . 117 

6.3.2 Performance analysis of the classifiers . . . . . . . . . . . . . . . . . 118 

6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 

6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 


7 Extraction of functional annotation for protein residues from MED- 

LINE 124 


7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 

7.2.1 Evaluation of the developed functional annotation extraction system 126 

7.2.2 Studying mined functional annotations for the proteins p53 and Jak2 129

7.2.3 Cross-validation of mined catalytic residues with CSA . . . . . . . . 132 

7.2.4 Annotation of protein residues in MSDsite . . . . . . . . . . . . . . 134 

7.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 

7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 

8 Combining active site prediction with mined functional annotations 137 

8.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 

8.1.1 Combining protein structure data with literature data . . . . . . . . 138 


8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 

8.3.1 Protein residue mapping between three data resources . . . . . . . . 140 

8.3.2 Rediscovery of active sites and catalytic residues . . . . . . . . . . . 142 

8.3.3 Search for novel catalytic residues . . . . . . . . . . . . . . . . . . . 145 

8.3.4 General correlation found between predicted functional sites and 

extracted functional annotations . . . . . . . . . . . . . . . . . . . . . 146

8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 

8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 

9 Conclusions and future work 150 

9.1 Summary of main contributions . . . . . . . . . . . . . . . . . . . . . . . . 150 

9.2 Limitations and future works . . . . . . . . . . . . . . . . . . . . . . . . . 152 

A Examples of errors in relation extraction. 171 


B Examples of extracted functional annotations compared with UniProtKB 173

C Examples of extracted functional annotations for the protein p53 177 

D Examples of extracted functional annotations for the protein Jak2 183 

E Examples of extracted functional annotations of the category binding 

event 186 

F Examples of extracted functional annotations of active site residues 189 

G Glossary 192 


List of Figures 

1.1 The standard amino acids . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 

1.2 Examples of functional sites in proteins . . . . . . . . . . . . . . . . . . . . 18 

1.3 The protein universe and its knowledge representation . . . . . . . . . . . . 20 

2.1 Data banks in the protein universe . . . . . . . . . . . . . . . . . . . . . . 28 

2.2 Three hyperlinked protein data banks . . . . . . . . . . . . . . . . . . . . . 29 

2.3 Categories for protein sequence annotation UniProtKB . . . . . . . . . . . 32 

2.4 GO terms are not suitable for protein residue annotation . . . . . . . . . . 34 

3.1 Overview of processes and evaluation methods of the developed 3D pattern 

identification system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 

3.2 Four classes of interactions within a 3-body . . . . . . . . . . . . . . . . . . 49 

3.3 Non-redundant structure set for 3D pattern mining . . . . . . . . . . . . . 53 

3.4 Distribution analysis of extracted residue triplets . . . . . . . . . . . . . . 57 

3.5 Comparison of extracted residue triplets based on their interaction type . . 58 

3.6 The effect of varying the cross-validation sample size on significance testing 

of residue interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 

4.1 A metal binding site with the 3Cys pattern in OLDFIELD . . . . . . . . . 68 

4.2 A metal binding site with the Cys-2His pattern in OLDFIELD . . . . . . . 69 

4.3 A metal binding site with the 3Cys pattern in SCOP40 . . . . . . . . . . . 70 

4.4 A metal binding site with the Cys-2His pattern in SCOP40 . . . . . . . . . 71 


4.5 Re-discovery of the catalytic triad as Asp-His-Ser pattern in OLDFIELD . 75 

5.1 Overview of processes and evaluation methods for the developed protein 

residue identification system . . . . . . . . . . . . . . . . . . . . . . . . . . 80 

5.2 Test corpora for information extraction evaluation . . . . . . . . . . . . . . 87 

5.3 Identified protein residues in MEDLINE . . . . . . . . . . . . . . . . . . . 95 

5.4 Cross-validation of citations from identified protein residues with UniProtKB/PDB 97 

6.1 Overview of processes and evaluation methods of the developed contextual 

feature extraction system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 

7.1 Performance evaluation of the functional annotation extraction system . . 127 

7.2 Cross-validation of text mined catalytic residues with CSA . . . . . . . . . 133 

7.3 Cross-validation of text mined binding residues with MSDsite . . . . . . . 134 

8.1 Overview of processes and evaluation methods of combining the protein 

structure dataset and literature dataset . . . . . . . . . . . . . . . . . . . . 138 

8.2 Lookup table for PDB/UniProtKB mapping . . . . . . . . . . . . . . . . . 140 

8.3 Overview of the combined datasets from protein structure data and biomedical 

literature data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 


List of Tables 

3.1 Study on the effect of varying the interaction distance threshold in structure 

triangulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 

4.1 Summary of extracted data at each protein structure data mining step . . 65 

4.2 Identification of metal binding sites in OLDFIELD . . . . . . . . . . . . . 66 

4.3 Convergent metal binding sites identified in SCOP40 . . . . . . . . . . . . 72 

4.4 List of cross-validated active site residues . . . . . . . . . . . . . . . . . . . 74 

4.5 Extending the catalytic triad into 4-bodies . . . . . . . . . . . . . . . . . . 76 

5.1 Regular expression patterns for the detection of residue mentions in text . 84 

5.2 Performance evaluation of residue entity recognition . . . . . . . . . . . . . 90 

5.3 Performance evaluation of protein entity recognition . . . . . . . . . . . . . 91 

5.4 Performance evaluation of organism entity recognition . . . . . . . . . . . . 91 

5.5 Performance evaluation of residue-protein-organism entity association detection 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 

5.6 Performance evaluation of protein-organism and protein-residue entity association 

detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 

5.7 A specialised performance evaluation between GC and XC2. . . . . . . . . 94 

6.1 Biological categories for the classification of protein residue related information 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 

6.2 Category distribution in the text feature reference set . . . . . . . . . . . . 115 


6.3 Evaluation of syntactical language parser performance . . . . . . . . . . . . 117 

6.4 Performance analysis of the classifiers (confusion matrix) . . . . . . . . . . 119 

6.5 Performance evaluation of the classifiers (precision, recall, F1 measure) . . 120 

8.1 Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 

8.2 Identified catalytic residues from MEDLINE extraction . . . . . . . . . . . 144 

8.3 Catalytic triad residues available from the mined functional annotations . . 145 

8.4 Functional annotations of protein residues in predicted functional sites. . . 147 

8.5 Homology-based transfer of extracted functional annotations for protein 

residues in the mined pattern data. . . . . . . . . . . . . . . . . . . . . . . 148 

A.1 Examples of errors in the relation extraction for the detection of contextual 

features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 

B.1 Comparison of extracted functional annotations from GC with UniProtKB. 174 

C.1 Examples of literature mined annotations of protein residues in p53. . . . . 178 

D.1 Examples of literature mined annotations of protein residues in Jak2. . . . 184 

E.1 Mined functional annotations of protein residues with information on binding 

events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 

F.1 Identified catalytic triad residues from MEDLINE extraction. . . . . . . . . 190


Chapter 1 

Introduction 

1.1 Proteins and functional sites 

The genomic information encodes the blueprint to build an organism. The decoding and 

implementation of genetic information depends on the functions of the proteins. Each protein 

is the result of transcribing a gene into mRNA, which is translated into a polypeptide. 

Hence, a protein is a gene product. The elementary units of a protein are the 20 natural 

standard amino acids, each with four invariant parts, namely a central alpha carbon

(Cα, chiral in all amino acids except glycine), an amine group (NH2), a carboxylic acid group (COOH), and a hydrogen atom (H), plus a

characteristic side chain (R). Apart from the invariant amine and carboxylic acid groups,

which give every amino acid the property of a zwitterion, distinctive physicochemical

properties are defined by the side chain group. Side chains can be polar, acidic or basic, aromatic,

bulky, or conformationally flexible, and they can offer cross-linking ability, hydrogen-bond

capability, or chemical reactivity. Figure 1.1 lists all the standard amino acids and their

common classification on the basis of the nature of the side chain group. 

During biosynthesis, ribosomes catalyse the polymerisation of amino acids through

condensation and form peptide bonds between the NH2 and COOH groups of two consecutive

amino acids. The backbone (main chain) of the resulting polypeptide is the repeating

sequence of NH2-C-CO-[NH-C-CO]n-NH-C-CO. This is the primary structure of a protein


Amino Acid      3-Letter  1-Letter  Side-chain polarity
Alanine         Ala       A         nonpolar
Arginine        Arg       R         polar
Asparagine      Asn       N         polar
Aspartic acid   Asp       D         polar
Cysteine        Cys       C         nonpolar
Glutamic acid   Glu       E         polar
Glutamine       Gln       Q         polar
Glycine         Gly       G         nonpolar
Histidine       His       H         polar
Isoleucine      Ile       I         nonpolar
Leucine         Leu       L         nonpolar
Lysine          Lys       K         polar
Methionine      Met       M         nonpolar
Phenylalanine   Phe       F         nonpolar
Proline         Pro       P         nonpolar
Serine          Ser       S         polar
Threonine       Thr       T         polar
Tryptophan      Trp       W         nonpolar
Tyrosine        Tyr       Y         polar
Valine          Val       V         nonpolar

Figure 1.1: The standard amino acids. The trivial names, 3-letter and 1-letter abbreviations are listed 

along with the physicochemical properties of their side chains. 

and it will fold spontaneously due to different interactions of its amino acid composition 

with environmental factors, e.g. solvent, salt, chaperones. The most prominent formation 

during the folding process is the hydrophobic core, which stabilises the protein structure. 

Amino acids, such as alanine, valine, leucine, isoleucine, phenylalanine, and methionine, 

are clustered in the interior of a protein, while charged or polar side chains are turned to 

the solvent-exposed surface and interact with surrounding water molecules. Minimising 

the exposure of hydrophobic side chains to water is the principal driving force of folding.
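The polarity classes of Figure 1.1 can be turned into a small lookup to illustrate how a sequence's hydrophobic content might be estimated. This is only a minimal sketch: the mapping copies the figure, while the function name `nonpolar_fraction` and the idea of using a simple fraction as a summary are illustrative assumptions, not a method used in this thesis.

```python
# Side-chain polarity as listed in Figure 1.1 (1-letter code -> class).
POLARITY = {
    "A": "nonpolar", "R": "polar", "N": "polar", "D": "polar",
    "C": "nonpolar", "E": "polar", "Q": "polar", "G": "nonpolar",
    "H": "polar", "I": "nonpolar", "L": "nonpolar", "K": "polar",
    "M": "nonpolar", "F": "nonpolar", "P": "nonpolar", "S": "polar",
    "T": "polar", "W": "nonpolar", "Y": "polar", "V": "nonpolar",
}


def nonpolar_fraction(sequence: str) -> float:
    """Return the fraction of residues classed as nonpolar in Figure 1.1.

    Hypothetical helper for illustration only.
    """
    residues = sequence.upper()
    if not residues:
        raise ValueError("empty sequence")
    unknown = set(residues) - set(POLARITY)
    if unknown:
        raise ValueError(f"non-standard residue codes: {sorted(unknown)}")
    nonpolar = sum(POLARITY[r] == "nonpolar" for r in residues)
    return nonpolar / len(residues)
```

Such a fraction says nothing about where the residues end up in the folded structure; it only summarises composition.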

The process of protein folding involves the formation of regular secondary structure 

elements (SSE), such as alpha helix and beta strand, which are stabilised by intramolecular 

hydrogen bonds and contacts between side chain atoms (van der Waals interaction). By 

following a helical path, the carboxyl group of residue i and the amino group of residue i+4 

of the main chain are arranged in alignment and stabilise the local structure by hydrogen-bond

formation. The side chains protrude out from the helically coiled backbone and 

define the surface of the helix. In contrast, beta strands are formed by hydrogen bonds 

between distant regions on the peptide. Depending on the direction of the peptide region, 


two adjacent strands can be characterised as parallel or antiparallel. Because the backbone

adopts an almost fully extended conformation, the side chains of residues i and i + 2 face

the same direction. A set of interacting strands is called a sheet. Within the process of

intramolecular stabilisation of the main chain, the regions between secondary structure 

elements adopt a loosely defined conformation such as turns and random coils or loops. 
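The index relationships described above (the i to i+4 hydrogen bond in a helix, and the alternating side-chain orientation along a strand) can be sketched as follows. The function names and the 1-based residue numbering are assumptions made for illustration; real structures deviate from this idealised picture.

```python
def helix_hbond_pairs(n_residues: int) -> list[tuple[int, int]]:
    """Idealised alpha-helix hydrogen bonds.

    Each pair (i, i + 4) links the carboxyl group of residue i to the
    amino group of residue i + 4 (1-based numbering, an assumption).
    """
    return [(i, i + 4) for i in range(1, n_residues - 3)]


def strand_faces(n_residues: int) -> tuple[list[int], list[int]]:
    """Split residues 1..n of an idealised beta strand into two faces.

    Residues i and i + 2 fall on the same face of the strand.
    """
    one_face = list(range(1, n_residues + 1, 2))
    other_face = list(range(2, n_residues + 1, 2))
    return one_face, other_face
```

For an eight-residue helix this yields four hydrogen bonds, and the first four residues have no bonded amino group in this simplified model.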

The attractive and repulsive forces (e.g. ionic or van der Waals interaction between

residues) among the SSEs balance each other during the folding process and lead to a 

relatively stable and complex three-dimensional structure. Stabilisation of the conformation 

may involve covalent bonding, e.g. disulphide bridges between two cysteine residues 

or the formation of metal-binding motifs. The spatial arrangement of sequentially proximate

or distant residues allows the generation of biochemical functional sites. To identify 

those and other novel biologically functional regions in the protein is one of the greatest 

research interests in the protein bioinformatics community, because they explain phenotypic

data, e.g. cellular processes. Figure 1.2 lists some of the well-known functional

sites in various proteins, classified according to a categorisation scheme of my own design.

Finally, the formation of quaternary structure is the assembly of tertiary structures 

within a multi-chain protein. In this respect, each polypeptide chain is regarded as an 

individual functional unit (subunit or domain). Within the interfaces of the subunits, 

a multi-domain based functional site can be formed, which is not present or functional 

in the individual domains. For example, the proteins cAMP-dependent protein kinase 

(PDBID:1rdq), hexokinase (PDBID:1bdq), or maltodextrin phosphorylase (PDBID:1l5w) 

contain ligand binding sites consisting of more than one protein structure domain (A. 

Kahraman, pers. comm.). The identification of these multi-domain functional sites is 

another great challenge in protein bioinformatics. 

First, the prediction system has to 

find the correct assembly of tertiary structures (a crystal structure of a protein does 

not necessarily reflect the biological state of assembly). 

Second, the structure models 

have to be adjusted (proteins are not rigid molecules and have flexible parts), and finally 


site 

1. evolutionary site 

1.1. conserved site 

2. functional site 

2.1. interaction site 

2.1.1. active site 

2.1.1.1. catalytic site / reactive site 

2.1.1.1.1. catalytic residue 

2.1.1.1.2. donor site 

2.1.1.1.3. acceptor site 

2.1.1.2. binding site / contact site / substrate binding site / ligand binding site /

recognition site

2.1.1.2.1. specificity residue / specific site 

2.1.1.2.1.1. high affinity binding site 

2.1.1.2.1.2. low affinity binding site 

2.1.1.2.2. peptide binding site 

2.1.1.2.3. protein binding / receptor site 

2.1.1.2.3.1. nf kappab site 

2.1.1.2.3.2. antibody binding site 

2.1.1.2.3.3. antigen binding site 

2.1.1.2.3.4. actin binding site 

2.1.1.2.4. sugar binding 

2.1.1.2.5. lipid binding 

2.1.1.2.6. nucleic acid binding 

2.1.1.2.6.1. atp binding site 

2.1.1.2.7. metal binding site 

2.1.1.2.7.1. calcium binding site / ca(2+) binding site 

2.1.1.2.7.2. copper site 

2.1.2. passive site / target site 

2.1.2.1. cleavage site / lesion site / processing site / proteolytic cleavage site 

2.1.2.2. PTM site 

2.1.2.2.1. phosphorylation site 

2.1.2.2.1.1. tyrosine phosphorylation site 

2.1.2.2.2. glycosylation site 

2.1.2.2.3. regulatory site 

2.1.2.2.4. inhibitory site 

2.1.2.2.5. activation site 

2.2. structural site 

2.2.1. hydrophobic site

2.2.1.1. hydrophobic core

2.2.1.2. hydrophobic patch 

2.2.2. n terminal site 

2.2.3. c terminal site 

2.2.4. transmembrane site 

2.2.5. intracellular site / cellular site 

2.2.6. extracellular site 

2.2.7. anionic site 

2.2.8. cationic site 

2.2.9. nucleation site 

Figure 1.2: Examples of functional sites in proteins. An excerpt of a proposed classification scheme,

based on my own perspective of the biomolecular function of specific residue configurations in

protein structures.


co-factors, e.g. metal ions, have to be considered. 

1.2 Motivation 

The understanding of the biological function of proteins remains a central challenge in 

biology. 

Our knowledge of the protein universe can be partitioned into at least three 

knowledge spaces (cf. figure 1.3): protein sequence space, protein structure space, and 

protein function space. Each space represents a specific view of proteins. For example, the 

protein structure space contains information about the number of biological conformations 

of protein structures (cf. figure 1.3, top panel), whereas the function space describes the

spectrum of protein function. Although the information in these spaces partially overlaps,

few data are available to explain their relationship.

For example, site-directed 

mutational analysis is often reported in the context of gain or loss of a protein function,

while the biological correlation between sequence and function is not understood. This is 

because the mechanism of protein function is not explained by information within sequence 

space. In contrast, structural data are more expressive than sequence data, because a 

protein structure provides spatial context of residues. Proteins are physical entities and 

as such, they perform interactions with other proteins or ligands. The shape of a protein, 

or more precisely, the spatial configuration of a set of residues in a functional site, is 

one explanation for protein function. While protein structure data mining is concerned 

with the prediction of novel functional sites in proteins, a mined structural pattern carries

no evidence of biological function.

In contrast, biomedical literature reports a range

of biological functions of protein residues without a structural context or an explanation of

molecular mechanism (cf. figure 1.3, middle panel). The combination of information from 

protein structure space and protein function space seems to be an obvious approach in 

order to gain new knowledge on protein function. 


Figure 1.3: The protein universe and its knowledge representation. Information on a protein can be collected 

from at least three different knowledge domains: crystallography provides the spatial coordinates of

a protein, protein sequencing determines the linear composition of amino acids in a protein, and biochemical

experiments characterise the biological function (top panel). In principle, protein function prediction

can be based on information from each knowledge space; however, combining them

can overcome some domain-specific limitations (middle panel).


1.3 Objective 

This thesis aims to discover hypothetical functional sites from the Protein Data Bank (PDB)

and annotate them with functional information from biomedical literature. 

The main 

idea is to combine the information from two currently detached data resources, protein

structure information from PDB, and functional annotations of residues from MEDLINE 

(cf. figure 1.3, lower panel). More specifically, this research focuses on the prediction of 

active sites by data mining recurrent spatial residue configurations (3D pattern) in proteins. 

Contextual features of residues are extracted from biomedical literature to provide 

functional annotations. The results from both datasets are then combined to verify predicted 

functional sites with evidence of biological function. While existing approaches in

protein structure data mining and biomedical literature mining have been used to generate

data for each research domain, the combination of the datasets is a novel approach in 

protein bioinformatics research. 
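To make the notion of a recurrent spatial residue configuration concrete, the following sketch enumerates residue triplets whose members all lie within a distance cutoff of each other and counts how many structures share each residue-type composition. It is a simplified illustration, not the algorithm of Chapter 3: the use of one representative coordinate per residue, the cutoff value, the composition-only pattern key, and the names `close_triplets` and `pattern_counts` are all assumptions.

```python
from collections import Counter
from itertools import combinations
from math import dist

# (3-letter residue name, one representative coordinate); both the
# representation and the single-point simplification are assumptions.
Residue = tuple[str, tuple[float, float, float]]


def close_triplets(residues: list[Residue], cutoff: float) -> list[tuple[str, ...]]:
    """Return the sorted residue-name triplets whose three pairwise
    distances are all within the cutoff."""
    triplets = []
    for a, b, c in combinations(residues, 3):
        pairs = ((a, b), (a, c), (b, c))
        if all(dist(p[1], q[1]) <= cutoff for p, q in pairs):
            triplets.append(tuple(sorted((a[0], b[0], c[0]))))
    return triplets


def pattern_counts(structures: list[list[Residue]], cutoff: float) -> Counter:
    """Count, per triplet composition, the number of structures containing it."""
    counts: Counter = Counter()
    for residues in structures:
        # A set ensures each structure contributes at most once per pattern.
        counts.update(set(close_triplets(residues, cutoff)))
    return counts


# Toy structure: three residues close together and one far away.
toy = [
    ("ASP", (0.0, 0.0, 0.0)),
    ("HIS", (3.0, 0.0, 0.0)),
    ("SER", (0.0, 3.0, 0.0)),
    ("GLY", (20.0, 20.0, 20.0)),
]
```

Enumerating all triplets is cubic in the number of residues, so a sketch like this is only practical for toy inputs; it also ignores geometry beyond pairwise distance.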

1.4 Related works 

To verify a predicted protein function with functional annotations extracted from biomedical 

literature, two different levels have to be considered: (1) the protein level, and (2) the residue

level (i.e. groups of residues forming a functional site).

The recent publication of [JGLRS08] is one example for case (1): The prediction of 

protein function is based on the search for a conserved and connected subgraph (CCS) in 

protein-protein interaction graphs, generated from several biological databases. Within 

the set of CCS, all available functional annotations of a protein in a database are transferred 

to homologous proteins. The annotations consist of Gene Ontology (GO) terminologies 

and the transfer is the prediction of protein function. The verification of a predicted 

function was done by identifying GO terms in abstract texts of the corresponding protein. 

The approach of this thesis has some similarities to this report [JGLRS08], e.g. in both approaches, results from data mining were verified by information extracted from

biomedical literature. However, there are crucial differences between the two that need 

to be considered when assessing the result of this thesis. First, in contrast to the CCS 

identification, the data mining part in this work does not aim to identify known patterns, 

but wants to discover new structural features that may represent a novel functional site. 

Secondly, in [JGLRS08] the prediction of protein function utilises terminologies of a well-developed public resource, the Gene Ontology, while the same resource is not suitable for the annotation of protein residues. This is because GO is designed to describe the function of genes and gene products. From a conceptual point of view, terminologies in GO describe a high level of biological function, while descriptions of residue function are of a lower level. For example, descriptions of protein-protein interactions are found in the context of metabolomics, signal transduction or other cellular processes. In contrast, the function of a protein residue can be explained in the light of molecular interactions or chemical reaction mechanisms. Finally, the distribution of information on biological function is expected to be different in biomedical publications. Because protein function is conceptually a high level of biological function, it is likely that abstract texts of biomedical articles contain information on this level. Conversely, the interaction of protein residues is a detailed description of protein function, and key information is expected to be mentioned in the results or discussion sections of full-text articles.

To my knowledge, the most closely related work in terms of functional annotation of protein residues (case (2)) is the system called Mutation extraction and STRucture Annotation Pipeline (mSTRAP) [KCRB07]. The key feature of mSTRAP is the visualisation of mutation annotations, which are projected onto the structure of a protein of interest. The advantage of mSTRAP is that the impact of a mutation can be interpreted in the context of the protein structure. However, the prediction of functional sites is done by visual analysis of the protein structure. The provided annotations are sets of complete sentences extracted from MEDLINE, which means that the interpretation of the information requires expert knowledge.

The system developed in this work differs from mSTRAP in that the extracted information is not used exclusively to annotate point mutations; other functional descriptions of wild-type residues are also collected. Another distinction from mSTRAP is that the mined information is represented in a so-called predicate-argument structure (PAS) format: only those text segments are extracted from sentences that describe a biological function or a biological context of a mentioned residue. To some extent, the structured format allows queries for specific information in the extracted annotation dataset.
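To make the idea of a queryable predicate-argument structure more concrete, the following is a minimal sketch. The class name, the field names and the way the two example sentences (taken from figure 2.4) are split into predicate and argument are assumptions made for illustration only; they are not the schema used in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidueAnnotation:
    """One predicate-argument structure for a residue mention.

    All field names are illustrative assumptions, not the thesis schema.
    """
    residue: str    # residue mention, e.g. "Cys325"
    predicate: str  # relation linking the residue to its context
    argument: str   # text segment describing function or context
    pmid: str       # source document identifier as given in the text


def find_by_predicate(annotations, predicate):
    """Return all annotations whose predicate matches (case-insensitive)."""
    wanted = predicate.lower()
    return [a for a in annotations if a.predicate.lower() == wanted]


# Hand-segmented examples based on the sentences shown in figure 2.4.
annotations = [
    ResidueAnnotation("Cys325", "required for",
                      "L-cysteine desulfurization activity", "81615929"),
    ResidueAnnotation("His48", "implying a role in",
                      "protection of the complex from hydroxide attack",
                      "2690955"),
]

hits = find_by_predicate(annotations, "Required for")
print([a.residue for a in hits])
```

Because each annotation is a small structured record rather than a full sentence, simple filters such as the one above become possible without re-reading the source text.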

In conclusion, only a few related works have been reported that describe an automated

system to verify a predicted protein function by using functional annotations extracted 

from the literature. This work retains its originality, because it aims to find novel functional 

sites in proteins by mining the PDB, and by extracting functional annotations from 

a wide range of biomedical literature data. 

1.5 Challenges 

Is it possible to identify a functional site, e.g. an active site, on the basis of mining PDB and the literature, and then combining the information from both? We can expect that a significant population of similarly arranged residues in a protein can be identified from a non-redundant protein set, if this evolutionarily conserved interaction provides a functional or structural advantage. We can also expect that residues are mentioned in conjunction with their corresponding protein, and that the biological role of a protein residue is reported in the context of gain or loss of function of the overall protein in biomedical literature.

One task presented in this thesis is the identification of textual features as functional annotation. The problem differs from other information extraction tasks, e.g. the annotation of proteins, because the target is to provide knowledge on the biological role of a residue. For example, to extract protein-protein interactions from text, a list of protein names is used, and the task is reduced to finding only associations between listed proteins. In contrast, extracting a protein residue and its corresponding biological function is difficult, because an adequate dictionary of terms is not available.

1.6 Guide to remaining chapters 

Chapter 2 presents background knowledge that is important for this work. Four different

data resources are reviewed and their limitations discussed in context of this 

thesis. Then follows an explanation of methods in the field of protein structure data 

mining and biomedical literature mining. Some of the introduced methodologies are 

reused in this work, while ideas and approaches of others were adopted to develop 

task specific extraction systems. 

Chapter 3 describes the developed protein structure data mining system for the identification 

of 3D patterns in PDB. Algorithms for the identification of conserved 

spatial residue configurations are explained and the effects of algorithm-related and 

data-related parameters are discussed. 

Chapter 4 demonstrates the biological implication of the mined 3D patterns from chapter 

3. Two examples of rediscovered functional sites in proteins are shown to justify 

the presented data mining approach. The first biological validation is the identification 

of metal binding sites, while the second validation is the rediscovery of the catalytic triad from the mined data.

Chapter 5 is the first of three text mining chapters in this thesis. It explains the developed 

protein residue identification system, which consists of two main modules: 

biological entity recognition of residue, protein, and organism, and association detection 

of the entity triplet. 

Chapter 6 describes the approach to detect contextual features of a mentioned residue in 

text. An automatic method is introduced to assign semantic labels to the extracted textual features.

Chapter 7 is the third of the three text mining chapters. Both text mining

modules from the previous chapters (protein residue identification, and contextual 

feature extraction) are combined to form the functional annotation extraction system. 

The overall performance of this information extraction system is studied. The 

validity of the extracted information as functional annotation is demonstrated by 

manual analysis on two example proteins (p53 and Jak2), and by cross-validation 

of identified catalytic or binding residues with two reference databases: CSA and 

MSDsite. 

Chapter 8 presents results on combining protein structure data with literature data. 

The validity is studied by examining the correlation of predicted active site residues 

with enzyme-related functional annotations. 

Chapter 9 summarises the thesis and presents limitations and open questions for follow-up research.

Chapter 2 

Background 

In the previous chapter, I have presented the motivation and objective of this thesis. The 

purpose of this chapter is to familiarise the reader with relevant concepts in protein science, 

data mining, and literature mining. The limitations of each reviewed data resource or 

methodology are discussed in context of this research work. 

2.1 Protein related data resources 

Proteins are both building blocks of cellular structures and the major machinery in cells. 

In order to perform their functions, proteins need to fold into their three-dimensional 

structures and thereby form functional sites. The prediction of a structural pattern associated 

with a biological function is an important aspect in protein bioinformatics. To 

interpret the multiple functions of proteins, annotations are linked with results from 

bioinformatics analysis tools. In addition, data are collected from generic and specific databases, from biological knowledge accumulated in the literature, and from genome-wide experiments such as transcriptomics and proteomics. One major goal is to

describe protein function within biological context by using a standardised hierarchical 

classification scheme and controlled vocabulary. 

The biological community has developed databases and functional annotation schemes that are not only used to archive protein data, but also to describe protein function on a molecular, cellular and phenotypical level. Figure 2.1 shows some of the most popular and relevant databases in the field of protein bioinformatics. These protein-related data resources are hyperlinked in order to foster bioinformatics research. Statistics for three example databanks and their hyperlinked references are given in figure 2.2.

2.1.1 Protein Data Bank 

The Protein Data Bank (PDB) is an archive of 3D structures of large biological molecules, 

such as proteins and nucleic acids. Currently, PDB lists 43,099 proteins determined by 

crystallography (version November 2008). 

Despite the large amount of structure data available for a range of proteins, the information in the PDB has three significant limitations. First of all, the structure data cover only a small part of the known sequence data: compared with the sequence data in UniProtKB (cf. section 2.1.2), the sequence space is much larger than the structure space. Therefore, the information derived from PDB is only applicable to a limited set of proteins.

The second limitation is the coverage of annotation available for proteins. In the PDB, there are some facilities to annotate proteins; for example, the SITE record is used

to annotate protein residues that are part of active sites. However, annotations are not 

mandatory and many other sites are not updated, although new evidence of the biological functionality of these residues has been found. An automatically derived database called PDBSite [IPGK05] stores the SITE record information and makes the search for these data

accessible. Another, rather predictive, database of functional sites in protein structures is 

the MSDmotif [GH08], which provides information about ligands, sequence and structure 

motifs, their relative position, and their neighbour environment. Another database of predicted 

functional sites is MSDtemplate [Old02], which contains small fragments generated 

by data mining on a structurally unique protein set from PDB. Examples of biologically 

relevant fragments were identified in this data collection, such as the catalytic triad and various metal binding sites. The Catalytic Site Atlas (CSA) [PBT04] is another database documenting enzyme active sites in 3D structures. The data are either manually curated or predicted, based on searches for homologous proteins.

Figure 2.1: Data banks in the protein universe. This figure shows my interpretation of how our knowledge about proteins can be categorised. A selection of the most relevant data resources and web services is reproduced in this figure. UniProtKB = Universal Protein Knowledge base [WAB+06]; PIR = Protein Information Resource [BGH+00]; PDB SELECT = representative list of PDB chain identifiers [HSSS92]; PISCES = Protein Sequence Culling Server [WD03]; UniqueProt = web-service to create representative protein sequence sets [MR03]; MEROPS = the Peptidase Database [RMK+07]; CAZy = Carbohydrate-Active enZYmes [CCR+08]; TC-DB = Membrane Transport Protein Classification Database [STB06]; PMD = Protein Mutant Database [KON99]; Phospho.ELM = a database of S/T/Y phosphorylation sites [DCG+04]; PROSITE = Database of protein domains, families and functional sites [HBB+08]; PRINTS = Protein Motif Fingerprint Database [Att02]; BMC = BioMed Central [BMC08]; PMC = PubMed Central [PMC08]; PDB = Protein Data Bank [BWF+00]; SCOP = Structural Classification of Proteins [HMBC97]; CATH = Class, Architecture, Topology, Homologous superfamily - Protein structure classification [OMJ+97]; Relibase = database of protein-ligand complexes [HBGK03]; CSA = Catalytic Site Atlas [PBT04]; MSDmotif = an integrated resource of protein structure motifs.

Figure 2.2: Three hyperlinked protein data banks. Illustrated are the sizes of three databanks, PDB, UniProtKB, and MEDLINE, along with their cross-references. For example, the PDB contains in total 42,943 PDB identifiers (version November 2008) with cross-references to 42,085 out of 333,445 UniProt identifiers, which in turn point to 10,466 biomedical journal articles (PMIDs). Note that PDB also holds a small number of primary citations for each record; however, these are mainly pointers to crystallographic publications and provide few hints about the biological function of the protein or the annotation of functional sites.
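As a small illustration of the SITE record discussed in this section, the sketch below reads one such record using the fixed column positions of the PDB file format (site identifier in columns 12-14, up to four residues per line). The function name and the example line are invented for illustration; the line is not copied from a specific PDB entry, and records that continue over several lines are not merged here.

```python
def parse_site_record(line):
    """Parse one fixed-column PDB SITE record into (site_id, residues)."""
    if not line.startswith("SITE"):
        raise ValueError("not a SITE record")
    site_id = line[11:14].strip()
    residues = []
    # Each record holds up to four residues in 11-character slots,
    # the first slot starting at column 19 (index 18).
    for slot in range(4):
        start = 18 + 11 * slot
        res_name = line[start:start + 3].strip()
        if not res_name:
            continue
        chain = line[start + 4:start + 5].strip()
        seq_num = int(line[start + 5:start + 9])
        residues.append((res_name, chain, seq_num))
    return site_id, residues


# Illustrative record with three histidines assigned to site "AC1".
example = "SITE     1 AC1  3 HIS A  94  HIS A  96  HIS A 119"
print(parse_site_record(example))
```

Since the SITE record is optional, a parser of this kind returns nothing for many entries, which is exactly the coverage problem described above.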

Another serious limitation of PDB concerns its use for statistical analysis of structure data. The PDB represents a redundant and biased snapshot of the protein universe. Redundancy is due to the fact that many highly similar structures or identical folds are deposited in the database, leading to an over-representation of some proteins. In the past, structure determination has been guided by hypothesis-driven experiments, by short-listed target proteins in the medical or commercial field, and by the methodological tractability of small proteins for crystallisation. Consequently, the fold space has not been fully explored yet. Although techniques in protein crystallography are improving, there are still other underrepresented proteins, e.g. membrane proteins or large proteins, which limit the representativeness of the structure data.

While there is little we can do about exploring the complete ensemble of folds from a bioinformatics point of view, the over-representation can be filtered. For example, protein sequence based clustering [AGM+90] [AMS+97] is the principal method used to produce the following datasets: PDB SELECT [HSSS92], PISCES [WD03], and UniqueProt [MR03]. However, this approach is limited by the uncertain sequence-structure relationship in the so-called twilight zone, i.e. below 30 per cent sequence identity proteins may or may not have similar folds [Ros99]. Another critical issue with sequence based clustering is that whole protein chain sequences are compared, rather than segments defined by protein domain boundaries.
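The principle of filtering over-representation by sequence identity can be sketched as a greedy selection. This is only an illustration under simplifying assumptions: the sequences are taken to be already aligned to equal length (gaps written as "-"), the 30 per cent threshold is used as an example value, and the function names are hypothetical. The cited tools rely on proper alignment procedures and differ in detail.

```python
def percent_identity(a, b):
    """Identity over aligned columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def cull(aligned, threshold=30.0):
    """Greedily keep chains below `threshold` identity to all kept chains."""
    kept = []
    for name, seq in aligned:
        if all(percent_identity(seq, other) < threshold for _, other in kept):
            kept.append((name, seq))
    return [name for name, _ in kept]


# Toy, invented sequences: chainB differs from chainA in one position,
# chainC shares no identical column with chainA.
chains = [
    ("chainA", "MKTAYIAKQR"),
    ("chainB", "MKTAYIAKQL"),
    ("chainC", "GSHWLPEDVN"),
]
print(cull(chains))
```

The result of a greedy selection depends on the input order, which is one reason why different culling services can return different representative sets.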

Structure based approaches cluster the data on the basis of domain structures. Several databases of domain based structure clustering have been created; the most prominent range from entirely manual work (SCOP [HMBC97]), through a semi-automatic approach (CATH [OMJ+97]), to entirely unsupervised methods (FSSP-Dali [HS94]). Differences between these classifications were studied by [HJ99] and [DBAD03].

2.1.2 Universal Protein Knowledge base 

The major repository of protein sequence data is the Universal Protein Knowledge base 

(UniProtKB). Alongside the sequence data, it lists protein names and synonyms, taxonomic data, citation references, and other information manually curated from literature surveys. One important aspect of UniProtKB when evaluating structure-function relationships is the annotation of protein residues. In the feature table

the biological function of a residue site is described along with several other key categories 

(cf. figure 2.3). Currently, UniProtKB lists 333,445 entries with 2,088,573 site-specific 

annotations (version from January 2008). 

Despite the high quality of the data contained in UniProtKB, the process of extracting functional annotations from literature remains laborious work for human expert curators. The curator surveys the biomedical literature, represents the experimentally determined functional information, and formulates the precise functional role by utilising standardised semantic resources (cf. section 2.1.3). Although manual curation is highly reliable, this approach is evidently inefficient considering the amount of full-text publications curators have to distil. According to Frishman, if we assume

“[...] that one needs on average roughly 30 min to assess published fact and bioinformatics evidence for one protein, one thousand annotators would have to work 1 year long, 8 h a day, to annotate all 5 million sequences that are currently known. However, since the size of the protein database has been consistently doubling every 18 months, the moving target of annotating all proteins will never be achieved.” [Fri07]
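The arithmetic behind this estimate can be checked in a few lines. The figures are those given in the quotation; the variable names, and the assumption that annotators work every day of the year, are mine.

```python
sequences = 5_000_000      # sequences known at the time of the quotation
hours_per_protein = 0.5    # "roughly 30 min"
annotators = 1_000
hours_per_day = 8

total_hours = sequences * hours_per_protein
days_needed = total_hours / (annotators * hours_per_day)
print(days_needed)         # 312.5 days, i.e. close to one year

# After one 18-month doubling period the number of sequences, and
# therefore the effort for annotating all of them, has doubled too.
days_after_doubling = 2 * days_needed
print(days_after_doubling)
```

The estimate of about one year therefore holds only for a fixed database size; with a doubling time of 18 months the workload grows while the first pass is still under way.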

Considering that the estimated total number of proteins is in excess of 10^10 [CK06], an automatic or semi-automatic solution is needed to facilitate the laborious human expert work. Currently, methods for the automatic expansion of citation sets [YLPV07] [HLC04] [LHC07] and the automatic annotation of protein function with GO terminologies [CSL+06] [GJYLRS08] [RSKA+07] are being developed in the field of text mining.

Key        Description
INIT_MET   Initiator methionine.
SIGNAL     Extent of a signal sequence (prepeptide).
PROPEP     Extent of a propeptide.
TRANSIT    Extent of a transit peptide (mitochondrion, chloroplast, thylakoid, cyanelle, peroxisome etc.).
CHAIN      Extent of a polypeptide chain in the mature protein.
PEPTIDE    Extent of a released active peptide.
TOPO_DOM   Topological domain.
TRANSMEM   Extent of a transmembrane region.
DOMAIN     Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure or fold.
REPEAT     Extent of an internal sequence repetition.
CA_BIND    Extent of a calcium-binding region.
ZN_FING    Extent of a zinc finger region.
DNA_BIND   Extent of a DNA-binding region.
NP_BIND    Extent of a nucleotide phosphate-binding region.
REGION     Extent of a region of interest in the sequence.
COILED     Extent of a coiled-coil region.
MOTIF      Short (up to 20 amino acids) sequence motif of biological interest.
COMPBIAS   Extent of a compositionally biased region.
ACT_SITE   Amino acid(s) involved in the activity of an enzyme.
METAL      Binding site for a metal ion.
BINDING    Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
SITE       Any interesting single amino-acid site on the sequence, that is not defined by another feature key. It can also apply to an amino acid bond which is represented by the positions of the two flanking amino acids.
NON_STD    Non-standard amino acid.
MOD_RES    Posttranslational modification of a residue.
LIPID      Covalent binding of a lipid moiety.
CARBOHYD   Glycosylation site.
DISULFID   Disulfide bond.
CROSSLNK   Posttranslationally formed amino acid bonds.
VAR_SEQ    Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting.
VARIANT    Authors report that sequence variants exist.
MUTAGEN    Site which has been experimentally altered by mutagenesis.
CONFLICT   Different sources report differing sequences.

Figure 2.3: Categories for protein sequence annotation in UniProtKB. Key categories used to describe regions or sites of interest in a protein sequence are listed. The key and the corresponding information (value) are stored in the feature table (FT line) in UniProtKB. The definitions of the listed categories are presented alongside them.
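To illustrate how residue-level keys such as ACT_SITE, METAL and BINDING could be pulled out of a feature table, the following sketch parses FT lines of the form "FT key start end description". This layout is a simplification of the flat-file format and is stated here as an assumption; continuation lines and uncertain positions are simply skipped. The three sample lines are invented and do not come from a real entry.

```python
def parse_ft_lines(lines, wanted_keys=("ACT_SITE", "BINDING", "METAL")):
    """Collect (key, start, end, description) tuples from FT lines."""
    features = []
    for line in lines:
        parts = line.split(None, 4)
        if len(parts) < 4 or parts[0] != "FT":
            continue
        key, start, end = parts[1], parts[2], parts[3]
        # Skip other keys, continuation lines and non-numeric positions.
        if key not in wanted_keys or not (start.isdigit() and end.isdigit()):
            continue
        description = parts[4].strip() if len(parts) == 5 else ""
        features.append((key, int(start), int(end), description))
    return features


# Invented sample lines for illustration only.
sample = [
    "FT   ACT_SITE     57     57       Charge relay system.",
    "FT   METAL        94     94       Zinc.",
    "FT   CHAIN         1    245       Example protein.",
]
print(parse_ft_lines(sample))
```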

Clearly, the annotation for a whole protein cannot be transferred to residue site annotation, 

because different groups of residues in the protein structure have different functions.

In this respect, the biological community is missing an information extraction system for 

the annotation of proteins at residue level. 

2.1.3 Gene Ontology 

The Gene Ontology (GO) [AL02] [GOC06] is one of the most widely used functional classification schemes, including all of the most important criteria for annotations of biological data [PKS06]. Currently, the ontology lists a total of 26,302 terms with 15,643 biological process terms, 2,233 cellular component terms, and 8,426 molecular function terms (version November 2008). The UniProtKB/InterPro group at the European Bioinformatics Institute (EBI) belongs to the Gene Ontology Consortium and uses its standard vocabulary for the annotation of protein function. The vocabulary is meant to describe the biological phenomenology of genes and gene products (proteins). This is the reason why terminologies in GO are not suitable to describe the function and properties of a protein residue. Figure 2.4 lists some examples where the identification of GO terms [GJYLRS08] did not find the most relevant keywords for the annotation of residues. At the moment, an ontology dedicated solely to the functional annotation of protein residues has not been developed. However, terminologies can in general be collected from other substantial resources, such as the Open Biomedical Ontologies [SAR+07], which contain, for example, REX (an ontology of physico-chemical processes) and PSI-MOD (an ontology describing protein chemical modifications).

Sentence: “The catalytic mechanism of the non-phosphorylating glyceraldehyde-3-phosphate dehydrogenase and the other aldehyde dehydrogenases resembles a thioester mechanism involving the universally conserved cysteine 298 (pea GAPN).” (PMID:9461340)
Manual annotation: thioester mechanism, conserved cysteine
GO annotation: glyceraldehyde-3-phosphate dehydrogenase (NADP+) (phosphorylating activity), glyceraldehyde-3-phosphate biosynthesis, glyceraldehyde-3-phosphate catabolism, phosphoglycerate dehydrogenase activity

Sentence: “However, mutations of a key residue, His48, show significant deviation from the relationship, implying a role for the side chain in protection of the complex from hydroxide attack.” (PMID:2690955)
Manual annotation: protection of the complex from hydroxide attack
GO annotation: AT DNA binding, tRNA, tyrosine tRNA ligase activity

Sentence: “Second, this reactive cysteinyl residue, which is required for L-cysteine desulfurization activity, was identified as Cys325 by the specific alkylation of that residue and by site-directed mutagenesis experiments.” (PMID:81615929)
Manual annotation: L-cysteine desulfurization activity
GO annotation: pyridoxal biosynthesis, phosphate binding, mutagenesis, nitrogenase activity, L-alanine biosynthesis, pyridoxal phosphate binding

Figure 2.4: GO terms are not suitable for protein residue annotation. The presented examples demonstrate that predicted GO terms are not always suitable for protein residue annotation. The prediction of GO terms was done with an information theory based parser [GJYLRS08].

2.1.4 Biomedical literature

Biomedical research tackles biological questions from a number of perspectives and the published experimental data are always heterogeneous. The sum of the descriptions of biological phenomena enables scientists to understand mechanisms in biology within various contexts. This summary of text has also been compared with an “unstructured knowledge

database”, where information is present, but difficult to retrieve due to the complexity of 

natural language. According to Sidhu, 

“[...] it is generally acknowledged that only 20 per cent of biological knowledge and data is available in a structured format or a database. The remaining 80 per cent of biological information is hidden in the unstructured, free text of scientific publications.” [SDC06]

In the context of information extraction, the data to be extracted from an article are words (keywords) regarding biological concepts that could summarise the key message of the article. At first glance, abstract texts have a high density of keywords but a low coverage of information, while full texts cover a larger but more dispersed quantity of data [FKY+01] [YHF+02] [SPIBA03] [SWS+04] [NBD+06].

Another key distinction between abstract texts and full texts is the availability of data resources. Biomedical abstract texts can be publicly downloaded from MEDLINE without restriction, while full texts from various journals are only available to subscribers. Although some full-text articles are accessible through various initiatives [BMC08] [Plo08] [PMC08], the extraction of information from a whole document is expected to be much more complex than from an abstract text. For example, a biological feature of a residue may be expressed over several sentences, requiring co-reference resolution of the residue and the feature.

2.2 Protein structure data mining 

Data mining is an analytic method to identify valid and novel patterns in data. A general data mining solution does not exist. Instead, human data mining expertise and human domain expertise are required to solve each specific data mining problem. A data mining process consists of the following main steps: data selection, feature extraction, and correlation analysis.

With respect to protein structure data mining, data selection means the identification of a non-redundant set of protein structures from PDB (cf. section 2.1.1). Although a protein structure contains only geometrical information, it is important to distinguish the types of structural features to be analysed. The following structural features can serve as targets: the configuration of amino acids represented by their Cα atoms, the configuration of backbone atoms, the spatial arrangement of chemical groups [JIDG03] [YEC+07] [Rus98] [SSR03] [Old02], and the physicochemical environments [OCR01] [YEC+07]. In order to discover new information from the data, a data mining algorithm must not contain any biochemical knowledge. The target should be a mathematical model and not a biological template.
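A purely geometric target of this kind can be sketched as follows: residues are reduced to their Cα coordinates, and every residue triplet whose three pairwise distances fall within a cutoff is reported. The cutoff of 8 Å, the function name and the coordinates are assumptions chosen for illustration; this is not the algorithm developed in chapter 3.

```python
from itertools import combinations
from math import dist


def residue_triplets(ca_atoms, cutoff=8.0):
    """Yield residue triplets whose pairwise C-alpha distances are all
    within `cutoff` (in angstrom). No biochemical knowledge is used."""
    for trio in combinations(ca_atoms, 3):
        coords = [xyz for _, xyz in trio]
        if all(dist(a, b) <= cutoff for a, b in combinations(coords, 2)):
            yield tuple(name for name, _ in trio)


# Invented coordinates: three residues close together, one far away.
ca_atoms = [
    ("HIS57", (0.0, 0.0, 0.0)),
    ("ASP102", (5.0, 0.0, 0.0)),
    ("SER195", (0.0, 6.0, 0.0)),
    ("GLY300", (40.0, 40.0, 40.0)),
]
print(list(residue_triplets(ca_atoms)))
```

The residue labels play no role in the selection itself; only distances decide which triplets are kept, in line with the requirement that the target be a mathematical model.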

2.2.1 Hypothesis-driven data analysis 

“Within the field of bioinformatics research, the term data mining is used very loosely to describe any type of data analysis” (T. Oldfield, pers. comm.). Hypothesis-driven data analysis consists of defining a biological target (hypothesis), and searching for the target. Consequently, the result of a hypothesis-driven data analysis is not the discovery of new information.

A number of methods have been published that predict a known protein function on the basis of protein structure information. Initially, research focused on global fold recognition [HS96] [WR97] [MB99] [KH04] [HPS+03] [AZP+05] to identify evolutionarily distant, but structurally conserved homologues. Once a match is found, functional annotations are transferred from the target to the query. Another, more specific approach focuses on the search for matching local substructures in the proteins. The rationale is that a biological function can be mapped to a particular residue configuration in the protein, which is independent in function from the global fold of the structure. One obvious approach was to design structure templates, which contain all the essential residues for a biological function. Several specific types of sites or motifs have been studied in detail to capture metal binding sites [Glu91], the catalytic triad of the serine proteases [FWLN94] [WBT97], and binding sites for anions such as sulphate and phosphate [Cha93] [CB94]. Computer assisted methods were subsequently developed to help experts design templates by analysing motifs over large sets of proteins corresponding to active sites [APG+94] [Rus98] [SSR03] [Kle99] [FS98] [FGS98] [WBT97] [BT03] [PB06], surface patches or clefts [Las95] [KJ94] [LEW98] [SPNW04] [BFL04] or structural binding site locations [GPP+03] [KN03].

2.2.2 Discovery-driven data mining 

The key feature of discovery-driven data mining is the search for common characteristics (patterns) in the data, without providing any domain knowledge. More specifically, the target is mathematically defined and the system aims to identify over-representations, data variations, or singularities in the dataset. Hence discovery-driven data mining can deliver novel information, although establishing the biological significance of the result is not trivial.
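One simple way to define over-representation mathematically is to compare the observed frequency of a pattern with the frequency expected if its components occurred independently. The sketch below does this for unordered residue-type triplets. The background model, the function names and all numbers are assumptions for illustration and are not the statistics used later in this thesis.

```python
from collections import Counter
from math import factorial, prod


def expected_probability(triplet, freq):
    """Probability of an unordered residue-type triplet if the three
    types were drawn independently with background frequencies `freq`."""
    counts = Counter(triplet)
    orderings = factorial(3) // prod(factorial(c) for c in counts.values())
    return orderings * prod(freq[t] ** c for t, c in counts.items())


def enrichment(observed, freq):
    """Observed/expected frequency ratio for each triplet type."""
    total = sum(observed.values())
    return {
        triplet: (n / total) / expected_probability(triplet, freq)
        for triplet, n in observed.items()
    }


# Invented toy data: uniform background over four residue types.
freq = {"H": 0.25, "D": 0.25, "S": 0.25, "G": 0.25}
observed = Counter({("D", "H", "S"): 30, ("G", "G", "G"): 10})
ratios = enrichment(observed, freq)
print(ratios)
```

A ratio well above 1 marks a candidate pattern, but as noted above, whether such a pattern is biologically meaningful has to be established separately.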

One important aspect in identifying residue interactions in protein structures is the consideration of contextual information, such as interaction distance, chemical environment, and evolutionary conservation, in the data mining algorithm. The systems called ET/MA [CFK+05] and ConSurf use evolutionary information in combination with structural and chemical data, in order to highlight regions of local structure with functional importance. In contrast, the systems PINTS [Rus98] [SSR03] and SIDEMINE [Old02] find patterns within the distribution of a non-redundant structure set by using solely mathematical models of interactions. One critical issue in the development of these data mining methods was the improvement of the signal-to-noise ratio. In order to boost the signal frequency, two structural features are merged if one is biologically equivalent to the other. While the analysis showed that the mined output contained biologically valid data, the result actually incurs some bias, because biological knowledge was introduced.

2.3 Biomedical literature mining 

Biomedical text mining extracts information from text for integration into biological databases. Due to the complexity of natural language, text processing involves structuring the text input by means of parsing and the annotation of some linguistic features, e.g. part-of-speech tags. The majority of biological text analysis is concerned with the extraction of explicitly stated facts from text, a task referred to as biological information extraction [Hob02]. Biomedical text mining processes typically consist of two main analysis steps: biological entity recognition, and biological relation extraction.

The vast amount of published biomedical articles contains phenomenological data on proteins, such as their molecular function. The information is encoded in unstructured text, and mining it involves tasks of differing complexity. There are several levels of text mining challenges in extracting functional annotation: the identification of mutations [LHC07] [WK07] [BW05] [RSMA+04] [HLC04] or genetic sequences [MG03], the identification of gene or protein names [RSAG+08] [PJYLRS08] [TMA08] [Fuk98] and chemical entities [CMR06], the extraction of annotations of molecular function [GJYLRS08] [RSKA+07] [DS05] [KNT05] [GDAW03] [HNR+05], and the identification of semantic relations between the biological entities [BLK+08] [LCM03] [SB06].

2.3.1 Biological entity recognition 

The process of entity recognition (ER) can be split into three parts: location of the mentioned entity in text, classification of the entity into a predefined category, and normalisation of the entity by reference to an entry in a database.

Biological entities are often ambiguous in terms of their boundaries and categories. 

Probably the most challenging task is the correct identification of protein or gene names. 


For example, "hunchback" is a protein in Drosophila, while it is also a general English

term. Furthermore, protein names consist mostly of multiple words, e.g. "Rho-like protein"

or "HIV-1 envelope glycoprotein gp120". An ER system needs to identify all the

constituents of a protein name in order to relate the detected entity to its reference entry 

in a database. The BioCreAtIvE challenge addressed this problem with the 1B subtask; 

the target is the identification of protein/gene names in text, and the annotation of their 

correct gene identifier. Various solutions were published ranging from rule-based methods 

[HFM+05] [TW02] [Fuk98] to machine learning approaches [CMP05]. The developed

methods are, in general, reusable for any other biological entity recognition or terminology 

identification problem. 

Work has also been published that focuses on the extraction of protein point mutations

[RSMA+04] [HLC04] [BW05] [LHC07] [YLPV07], which is one category of protein

residue terminology. Other categories are residue sequences and residue interaction pairs.

The most widely adopted method to identify these terminologies is the design of regular

expression patterns.

2.3.2 Biological relation extraction 

Relation extraction (RE) aims to find associations between entities, or between an entity

and a terminology within a text phrase. One objective in biomedical information

extraction is the mining of biological facts from text. An example of a biological fact is

the semantic relation between two biological entities, such as a protein-protein interaction

[TOT04].

Until now, three strategies have been investigated for biological relation extraction:

co-occurrence based analysis [LC05] [SB05], pattern-based approaches [HZH+04] [LCM03],

and machine learning based methods [BM05] [BM06]. The common limitation of all of

these extraction systems is that only the relation targets, e.g. the proteins within a

protein-protein interaction, are extracted. Contextual information that would describe or

explain the association of the entities is not considered in the extraction. Within the

information extraction community, a consensus has been reached that deeper analysis of

sentence structures is required in order to adequately acquire biomolecular relations from

text [WSC04].

With respect to biological relation extraction, two classes of syntactic parsers have been studied.

The first is the shallow parsing technique, which aims at detecting the main constituents

of a sentence without determining the complete syntactic structure. Results have been published

in which protein-protein interactions [KNT05] and general biological entity relations

[LCM03] were extracted based on shallow parsing. The second class of syntactic parser

is the full parser, which attempts a deep analysis of the syntactic structure of a sentence.

Several systems have been reported [NED03] [FKY+01] that utilise full parsing

for relation extraction from biomedical literature. One interesting full parser is ENJU

[YMTT05] [MT05], a so-called head-driven phrase structure grammar (HPSG) parser,

which identifies predicate-argument structures (PAS) in a sentence.

The use of PAS as a template for biomolecular relation extraction was first reported

in [TOT04] [YMTT05]. Recently, two proposition banks were reported that are designed

to capture relations in molecular biology: PASBio [WSC04] and BioProp [TCS+07].

Within this work, there are two types of semantic relations to be extracted. The

first is the residue-protein association. The system called MEMA [RSMA+04] uses a

word distance metric to associate a list of residue-protein pairs with the smallest word

distance. Another approach is to look up valid associations between a residue and a

protein in the context of a predetermined association of a protein and an organism. Three

systems have been reported that adopt this approach: MuteXt [HLC04], MutationMiner

[BW05], and MutationGraB [LHC07].

The other semantic relation to be extracted in this work is the association between 

a residue entity and its description of function. The systems MuteXt [HLC04], MEMA

[RSMA+04], MutationMiner [BW05], and MutationGraB [LHC07] are all dedicated to

the extraction of point mutations, but provide no extraction of functional annotation. In

a recent publication [WK07], an ontological model was proposed that should hold information

extracted from MutationMiner as well as point mutation annotations. However,

the author did not provide any results of feature extraction, nor was a strategy proposed.

2.4 Conclusion 

In this chapter, I have reviewed some of the most relevant data resources and research 

works in the field of protein structure data mining and text mining. Some of the data 

resources are used in this thesis. In the following, I will present the extraction systems I 

have developed during my PhD. 


Chapter 3 

Mining residue interactions as triads from PDB

In this chapter, I present a novel approach to mining 3D patterns from protein structures.

More specifically, a pattern is defined as the irreducible interaction of a chemical and

spatial configuration of residues. The goal is to identify new information from a non-redundant

dataset using solely mathematical targets. The mined 3D

patterns represent predictions of functional sites in proteins.

3.1 Algorithms 

The novelty of the presented 3D pattern mining approach lies in the classification of

residue triplets into one of four interaction classes. The idea of analysing side chain interactions

within a residue triplet is based on the work of [Old02], while the classification of 

residue interaction relies on the methodology developed by [JB04]. The developed data 

mining method consists of three processing steps: structural feature extraction, detection 

of significant configurations as interactions, and grouping and selection of frequent configurations. 

Figure 3.1 illustrates the procedures of the entire protein structure data mining 

system developed in this thesis. 


Figure 3.1: Overview of processes and evaluation methods of the developed 3D pattern identification 

system. 


3.1.1 Structural feature extraction 

Theory 

Residue triplet as spatial pattern unit. The presented protein structure data mining

algorithm aims to identify significant interactions of residues within a triplet configuration.

The rationale for analysing residue triplets is as follows. In order to

form a functional site in a protein structure, residues need to be in close physical contact.

In other words, there exists a mutual dependency or interaction among the residues.

The interaction can be studied on a two-residue basis (doublet 3D pattern). However,

given the size of the structure data, the probability of observing any two-residue configuration is

too high for it to be detected as specific. Hence, the signal/noise ratio is the reason why

a two-residue 3D pattern is not the target of protein structure data mining [Old02].

A two residue contact is completely defined by a scalar property, while a three residue 

contact is defined by vectors. Consequently, a three residue constellation encodes much 

more information. This makes information theory based methods tractable to find conserved 

residue interactions as signals. 

In reality, functional sites can be composed of more than three residues, e.g. various

metal binding sites use four coordinating cysteine residues. However, data sparseness

and the mathematical complexity [CL64] [Sin04] of modelling interactions of four or more

residues make this infeasible. In principle, the more variables are introduced in modelling

residue interactions, the more specific the data mining. It should be noted that the identification

of N-body interactions of residues can be approached combinatorially.

Two triplets are combined if two out of the three residues of each

triplet are identical [Old02]. This approach was adopted in this study to demonstrate that larger

interaction configurations are extractable. However, this investigation concentrates mainly

on the identification of three-residue interactions. The assumption is that if the output

of the data mining provides valid results, the approach is justified and more complex residue

configurations may inherit this property.
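The combination rule can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis; the function name `combine_triplets` and the identification of a residue by a (name, sequence position) pair are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

# Hypothetical residue identifier: (residue name, sequence position).
Residue = Tuple[str, int]


def combine_triplets(
    t1: Iterable[Residue], t2: Iterable[Residue]
) -> Optional[FrozenSet[Residue]]:
    """Merge two residue triplets into a 4-body if they share exactly two residues."""
    s1, s2 = frozenset(t1), frozenset(t2)
    if len(s1) == 3 and len(s2) == 3 and len(s1 & s2) == 2:
        return s1 | s2  # union of the two triplets has four distinct residues
    return None  # no merge: fewer or more than two residues are shared


t1 = [("HIS", 57), ("ASP", 102), ("SER", 195)]
t2 = [("ASP", 102), ("SER", 195), ("GLY", 193)]
four_body = combine_triplets(t1, t2)
```

Applying the same rule repeatedly to the merged sets would, under these assumptions, grow larger N-body configurations one residue at a time.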


Side chain interaction model. 

The determination of residue interactions requires a 

transformation of a full atom model into a simpler representation. This is because the 

mathematical model, that needs to describe all combinations of atom interactions of two 

residues, would be too complex. The solution is to replace the all-atom structure model 

with a coarse grained model, by reducing each residue to a single point. In principle, 

a residue point can be calculated either by the centre of mass, or the geometric centre 

(centroid). Each representation can be calculated from main chain atoms, main and side 

chain atoms, or side chain atoms only. 

The focus in this study is on side chain interactions within residue triplet configurations.

For this reason, a protein structure is represented as a point spread of side chain

centroids.
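A minimal Python sketch of the centroid reduction is given below. It is illustrative only: the function name, the dictionary input, and the fallback to the Cα atom for glycine (which has no side chain heavy atoms) are assumptions and are not stated in the text.

```python
from typing import Dict, Tuple

Point = Tuple[float, float, float]

# Backbone atom names in PDB nomenclature; all other atoms count as side chain.
BACKBONE_ATOMS = {"N", "CA", "C", "O", "OXT"}


def side_chain_centroid(atoms: Dict[str, Point]) -> Point:
    """Return the geometric centre of the side chain atoms of one residue.

    `atoms` maps PDB atom names to (x, y, z) coordinates in Angstrom.
    Assumption: if no side chain atom is present (glycine), CA is used.
    """
    side = [xyz for name, xyz in atoms.items() if name not in BACKBONE_ATOMS]
    if not side:
        side = [atoms["CA"]]
    n = len(side)
    return tuple(sum(p[axis] for p in side) / n for axis in range(3))


serine = {"N": (0, 0, 0), "CA": (1, 0, 0), "C": (2, 0, 0), "O": (2, 1, 0),
          "CB": (1, 1, 0), "OG": (3, 1, 0)}
glycine = {"N": (0, 0, 0), "CA": (1.5, 0, 0), "C": (2, 0, 0), "O": (2, 1, 0)}
```

A centre of mass would differ only in weighting each atom by its mass; the geometric centre weights all side chain atoms equally.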

Protein structure triangulation. 

The extraction of residue triplets from a protein is 

based on triangulation of structures. Here structures are triangulated on the basis of three 

criteria. The first is the compositional constraint. Each residue in a triplet must be an 

element of the 20 natural amino acids, while hetero atoms are excluded. One prominent 

reason is that there are not many examples of residue-hetero atom interactions in the 

dataset that would support a statistical analysis. 

The second condition of triplet extraction requires that none of the residues are direct

neighbours in the protein sequence. The assumption made here is that two covalently

bonded residues are more likely to be next to each other in space than any two residues

that are not bonded. Similarly, the probability of finding three connected residues close

together in space is higher than that of finding three unconnected residues. Consequently,

the distribution of interacting residues in space would be over-represented. The

definition of residue neighbourhood affects the data mining result, e.g. by requiring a

pair interaction in the triplet to have a distance of more than one residue, patches of

residues on one side of a beta-sheet may not be discovered. While tuning this parameter

can modify the result of the data mining, the objective here is to discover new knowledge

from the input data set by providing as little biological information as possible.

The last criterion in triplet extraction is concerned with the geometrical property of a 

triplet. The Euclidean distances between the residues must fulfil the triangle inequality,

while only two interaction distances of less than 6 Å were allowed. Although the interaction

distance threshold is based on an empirical study of a number of protein structures, this

value may not be adequate, because it would favour close contacts of large side chains

in residue pairs. For example, in the pair interaction of two tryptophans the centroids may

lie near the maximal allowed interaction distance, while the contacting

atoms are actually very close. The alternative is to set up a threshold system for residue

pairs or triplets, which depends on the types of residues. Although this approach was not 

studied in this thesis, future work could improve the developed algorithm. Yet another 

approach in selecting residue interactions from a protein structure is based on the analysis 

of surface contacts of the side chain groups. While not all functional sites require their 

constituents to be in physicochemical contact (e.g. a metal binding site consists of metal 

ion coordinating residues without physical contacts), a protein binding site is an example 

where residues of two different proteins are in non-covalent interaction. 

However, the

presented data mining approach aims at an unbiased search for residue interactions in

a dataset of monomeric protein structure domains, and therefore a surface-based selection

criterion would bias the analysis.

Implementation 

A coarse grained representation is used in this protein structure analysis. From a full atom 

model of a protein structure, centroid positions of each protein residue were calculated 

on the basis of their side chain atoms. The resulting simplified structure model is then 

triangulated based on three criteria: (1) each residue in a triplet must be one of

the 20 natural amino acids; (2) pairs of residues in the triplet must not be sequential

neighbours with respect to their protein sequence position; and (3) only two pairs of residues

can have a maximal interaction distance of 6 Å, and only one pair an interaction

distance of less than 12 Å.
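The three criteria can be sketched in Python as a brute-force enumeration. This is an illustration under stated assumptions and not the implementation used in this thesis: criterion (3) is read here as "the two shortest centroid distances are at most 6 Å and the longest is below 12 Å", the input is assumed to be a single chain so that sequence positions can be compared directly, and the names `Residue` and `extract_triplets` are hypothetical.

```python
from itertools import combinations
from math import dist
from typing import List, NamedTuple, Tuple

STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


class Residue(NamedTuple):
    position: int                         # position in the protein sequence
    name: str                             # three-letter residue name
    centroid: Tuple[float, float, float]  # side chain centroid in Angstrom


def extract_triplets(residues: List[Residue], d_short: float = 6.0,
                     d_long: float = 12.0) -> List[Tuple[Residue, ...]]:
    """Enumerate residue triplets that satisfy the three extraction criteria."""
    # Criterion 1: only the 20 natural amino acids, no hetero groups.
    candidates = [r for r in residues if r.name in STANDARD_RESIDUES]
    triplets = []
    for trio in combinations(candidates, 3):
        pairs = list(combinations(trio, 2))
        # Criterion 2: no two residues may be direct sequence neighbours.
        if any(abs(x.position - y.position) <= 1 for x, y in pairs):
            continue
        # Criterion 3 (assumed reading): two distances <= d_short, third < d_long.
        d = sorted(dist(x.centroid, y.centroid) for x, y in pairs)
        if d[1] <= d_short and d[2] < d_long:
            triplets.append(trio)
    return triplets


chain = [
    Residue(1, "HIS", (0.0, 0.0, 0.0)),
    Residue(5, "ASP", (4.0, 0.0, 0.0)),
    Residue(9, "SER", (8.0, 0.0, 0.0)),
    Residue(10, "GLY", (9.0, 0.0, 0.0)),
    Residue(20, "HOH", (4.0, 1.0, 0.0)),  # hetero group, excluded by criterion 1
]
found = extract_triplets(chain)
```

The enumeration is cubic in the number of residues; a spatial index over the centroids would be the obvious refinement for a full dataset.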

For the interaction analysis it is necessary to define a hash table, based on integer

values of centroid distances and the names of the residues. The integer value of a distance is

calculated by dividing the measured distance by a precision value (hash precision), which

was set at ±0.5 Å. Given a 3-body with

\[ \mathit{trip} = (A, B, C), \tag{3.1} \]

a three-dimensional hash table is defined as

\[ HT(A, B, C) = \text{3D hash bin}[i][j][k], \tag{3.2} \]

where i, j, and k are the integer values of the measured distances between the spatial

coordinates of two residues. The integer values are given by the equations

\[ \begin{aligned} i &= \mathrm{INT}(\mathrm{dist}(A, B)/\text{hash precision}) \\ j &= \mathrm{INT}(\mathrm{dist}(B, C)/\text{hash precision}) \\ k &= \mathrm{INT}(\mathrm{dist}(A, C)/\text{hash precision}). \end{aligned} \tag{3.3} \]

For a detailed definition of the implemented hash table cf. [Old01].
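Equations 3.1 to 3.3 can be sketched in Python as follows. The sketch is illustrative: the function names are hypothetical, the value 0.5 is assumed to be the divisor used as hash precision, and the residues are kept in the order in which they are supplied, since no canonical ordering is described here (cf. [Old01] for the actual definition).

```python
from collections import Counter
from math import dist


def hash_key(names, coords, precision=0.5):
    """Map a residue triplet to (residue names, (i, j, k)) following eq. 3.3."""
    a, b, c = coords
    i = int(dist(a, b) / precision)  # binned A-B centroid distance
    j = int(dist(b, c) / precision)  # binned B-C centroid distance
    k = int(dist(a, c) / precision)  # binned A-C centroid distance
    return tuple(names), (i, j, k)


def build_hash_table(triplets, precision=0.5):
    """Count how often each (residue names, distance bin) combination occurs."""
    table = Counter()
    for names, coords in triplets:
        table[hash_key(names, coords, precision)] += 1
    return table


names = ("HIS", "ASP", "SER")
observations = [
    (names, ((0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0))),
    (names, ((0.0, 0.0, 0.0), (3.2, 0.0, 0.0), (3.2, 4.1, 0.0))),
]
table = build_hash_table(observations)
```

Both observations fall into the same bin, which is the intended effect of the integer conversion: small geometrical variations of one configuration are counted together.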

3.1.2 Detection of significant configurations as interactions 

Theory 

The method for residue interaction detection relies on the comparison of two probabilistic 

models: the reductionistic part-to-whole approximation model, and the holistic reference 

model. Part-to-whole approximation is modelled with a collection of marginal distributions 

defined by subsets of the variables. Formally, a 3-body consists of three variables

(cf. equation 3.1). To verify whether the probability of a triplet, P(A, B, C), can be

factorised, we attempt to approximate it by using all attainable marginals

\[ M = \{P(A, B), P(A, C), P(B, C), P(A), P(B), P(C)\}. \tag{3.4} \]

If the approximation fits the data, i.e. the probability of finding a particular triplet

is explained by the approximation model, then there is no evidence for an interaction.

In other words, a significant interaction is given when the two models are significantly

different.

The difference between two joint probability density functions O and M is

measured by the Kullback-Leibler divergence

\[ D(O\|M) = \sum_{i} O(i) \log\left(\frac{O(i)}{M(i)}\right). \tag{3.5} \]

In this context O usually refers to the observed probability or the reference model,

while M is the approximation model. The null hypothesis in testing the interaction model

is that the part-to-whole approximation matches the observed data. The alternative hypothesis

is that the approximation does not fit and that there is an interaction. Three cases can

be listed:

\[ \begin{aligned} D(O\|M) > 0 &: \text{there is a pattern among } k \text{ attributes} \\ D(O\|M) = 0 &: \text{there is no pattern of order } k \\ D(O\|M) < 0 &: \text{there is redundancy among the parts.} \end{aligned} \tag{3.6} \]
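Equation 3.5 translates directly into a few lines of Python. The sketch below is illustrative; the base-2 logarithm (result in bits) is an assumption, as the base is not specified above, and the model is assumed to assign a non-zero value to every observed state.

```python
from math import log2


def kl_divergence(observed, model):
    """Kullback-Leibler divergence D(O||M) of eq. 3.5, in bits.

    `observed` and `model` map states to probabilities. States with
    O(i) = 0 contribute nothing; M(i) must be > 0 wherever O(i) > 0.
    """
    total = 0.0
    for state, o in observed.items():
        if o > 0.0:
            total += o * log2(o / model[state])
    return total


O = {"a": 0.5, "b": 0.5}
M = {"a": 0.25, "b": 0.75}
divergence = kl_divergence(O, M)
```

Because the function does not require `model` to sum to one, it can also be applied to part-to-whole approximations that are not normalised, which is the situation in which negative values can arise.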

Within a 3-body system, four different configurations of interactions can be defined

(cf. figure 3.2): no-interaction, one-pair interaction, two-pair interactions, and three-pair

interactions. For each of these configurations it is possible to formulate a part-to-whole

approximation model, i.e. the interaction can be factorised. In the case of no-interaction,

the probability of the observable is expected to be estimated by its singlet probabilities

\[ k = 0: \quad \hat{P}_{0}(A, B, C) = P(A)P(B)P(C), \tag{3.7} \]

whereas in a system with one-pair interaction, two variables are dependent on each other.

Consequently, within a 3-body state there are three isoforms of one-pair interactions:

\[ k = 1: \quad \begin{cases} \hat{P}_{1,1}(A, B, C) = \dfrac{P(A,B)P(C)}{P(A)P(B)} \\[2ex] \hat{P}_{1,2}(A, B, C) = \dfrac{P(A,C)P(B)}{P(A)P(C)} \\[2ex] \hat{P}_{1,3}(A, B, C) = \dfrac{P(B,C)P(A)}{P(B)P(C)} \end{cases} \tag{3.8} \]

Figure 3.2: Four classes of interactions within a 3-body. A circle represents a protein residue, and an

intersection represents an interaction between two residues. k=0: no-interaction; k=1: one-way or

one-pair interaction; k=2: two-way or two-pair interactions; k=3: three-way or three-pair interactions.

There are two forms of three variable interactions, but with different dependencies: 

two-pair interaction (k=2) and three-pair interaction (k=3). These interactions represent 

the target of 3D pattern mining. In a two-pair interaction, two pairs of variables are dependent 

on each other, while sharing a common attribute. For example, given A interacts 

with B, and B interacts with C, there is no clear observation that A also interacts with 

C. Three isoforms are formulated for this interaction: 

\[ k = 2: \quad \begin{cases} \hat{P}_{2,1}(A, B, C) = \dfrac{P(A,B)P(B,C)}{P(B)} \\[2ex] \hat{P}_{2,2}(A, B, C) = \dfrac{P(A,C)P(A,B)}{P(A)} \\[2ex] \hat{P}_{2,3}(A, B, C) = \dfrac{P(B,C)P(A,C)}{P(C)} \end{cases} \tag{3.9} \]

In case of a three-pair interaction, all three variables are dependent on each other, and 

the approximation model is defined as 

\[ k = 3: \quad \hat{P}_{3,1}(A, B, C) = \frac{P(A,B)P(B,C)P(A,C)}{P(A)P(B)P(C)}. \tag{3.10} \]

If the state is disturbed, e.g. by exchanging one variable, a partial interaction will 

not be observed. In respect of protein biology, this could mean that a residue mutation 

abolishes an intramolecular stabilising network. However, as this does not provide an 

evolutionary advantage the conservation of this residue is likely to be promoted and can 

be detected as a recurrent structural feature. 

The determined sets of two-way (k=2) and three-way (k=3) interactions are the targets 

in this data mining. 
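To make the factorised models concrete, the following Python sketch computes the no-interaction model (equation 3.7), the first two-pair isoform (equation 3.9) and the three-pair model (equation 3.10) from a joint distribution over three variables. It is illustrative only: the function names and the dictionary representation of the joint distribution are assumptions, and the approximations are evaluated only at states with non-zero observed probability.

```python
from collections import defaultdict


def marginal(joint, axes):
    """Sum a joint distribution {(a, b, c): p} down to the given variable axes."""
    out = defaultdict(float)
    for state, p in joint.items():
        out[tuple(state[i] for i in axes)] += p
    return out


def part_to_whole(joint):
    """Return the k=0, k=2 (first isoform) and k=3 approximations of P(A, B, C)."""
    pa, pb, pc = marginal(joint, (0,)), marginal(joint, (1,)), marginal(joint, (2,))
    pab, pbc, pac = marginal(joint, (0, 1)), marginal(joint, (1, 2)), marginal(joint, (0, 2))
    models = {"k0": {}, "k2_1": {}, "k3": {}}
    for (a, b, c), p in joint.items():
        if p <= 0.0:
            continue  # skip unobserved states, whose marginals may be zero
        singles = pa[(a,)] * pb[(b,)] * pc[(c,)]
        models["k0"][(a, b, c)] = singles                                    # eq. 3.7
        models["k2_1"][(a, b, c)] = pab[(a, b)] * pbc[(b, c)] / pb[(b,)]     # eq. 3.9
        models["k3"][(a, b, c)] = (
            pab[(a, b)] * pbc[(b, c)] * pac[(a, c)] / singles                # eq. 3.10
        )
    return models


# Three perfectly correlated binary variables.
correlated = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
approx = part_to_whole(correlated)
```

For the perfectly correlated example the no-interaction model assigns 0.125 to each observed state, the two-pair model recovers 0.5, and the three-pair model yields 1.0, which shows that the three-pair approximation is in general not normalised.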



Triplets of residues are classified into one of the four defined interaction configurations.

The classification is based on a non-parametric cross-validation sampling method described

by [JB04]. A significant interaction is given when the two models O and M are

significantly different. Because the data can be regarded as a sample of a multinomial distribution,

the representativeness of the approximation model can be tested by the self-loss

function D(P′||P). Here, P′ and P are the probability distributions from two samples of equal

size. The weight of evidence for accepting the null hypothesis, i.e. the approximation

model, can be estimated by $p_{cv}$-values from a 2-fold cross-validation. For each random

sampling the dataset is partitioned into two equally sized subsets: one training set and

one test set. From these subsets two joint probability distribution functions, P′ and P,

are determined from the training and test set, respectively. The marginal distributions,

singlets and doublets, are determined from P′ to construct the part-to-whole approximation

$\hat{P}'$. The $p_{cv}$-value is defined as the probability that the self-loss is greater than or equal

to the approximation loss

\[ p_{cv} = \Pr\{D(P\|P') \geq D(P\|\hat{P}')\}. \tag{3.11} \]

On the basis of $p_{cv}$-values, an interaction is discovered if $p_{cv} \leq \alpha$, and an interaction

is rejected when $p_{cv} > \alpha$. High threshold values of α, e.g. 0.95, bias towards an

interaction and risk overfitting, while lower values, e.g. 0.05, move the bias towards the

no-interaction model and risk underfitting. In this study, a reductionistic bias approach was

chosen, to prefer a simpler no-interaction model, by selecting α = 0.05. The value

of α used is based on the research work of [JB04].
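The resampling scheme behind equation 3.11 can be sketched as follows. This is a simplified illustration and not the procedure of [JB04] in detail: the function names are hypothetical, the part-to-whole approximation is passed in as a callable `approximate`, the natural logarithm is used, and add-one smoothing is an assumption introduced here so that every divergence remains finite.

```python
import random
from collections import Counter
from math import log


def distribution(samples, states, pseudo=1.0):
    """Smoothed relative frequencies (add-`pseudo` smoothing is an assumption)."""
    counts = Counter(samples)
    total = len(samples) + pseudo * len(states)
    return {s: (counts[s] + pseudo) / total for s in states}


def kl(p, q):
    """Kullback-Leibler divergence D(p||q) using the natural logarithm."""
    return sum(p[s] * log(p[s] / q[s]) for s in p if p[s] > 0.0)


def p_cv(samples, approximate, n_iter=100, seed=0):
    """Estimate Pr{D(P||P') >= D(P||approx(P'))} by repeated 2-fold splits."""
    rng = random.Random(seed)
    states = sorted(set(samples))
    hits = 0
    for _ in range(n_iter):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        p_train = distribution(shuffled[:half], states)          # P'
        p_test = distribution(shuffled[half:2 * half], states)   # P
        self_loss = kl(p_test, p_train)
        approx_loss = kl(p_test, approximate(p_train))
        if self_loss >= approx_loss:
            hits += 1
    return hits / n_iter


data = ["x"] * 900 + ["y"] * 100
perfect = p_cv(data, lambda p: p)                             # approximation equals P'
poor = p_cv(data, lambda p: {"x": p["y"], "y": p["x"]})       # deliberately wrong model
```

With an approximation identical to P′ the estimate is 1.0 and no interaction is reported; with the deliberately wrong model the estimate is 0.0, which lies below α = 0.05 and would be reported as an interaction.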


3.1.3 Grouping and selecting frequent configurations 

Theory 

The result of data mining protein structures can be a large set of 3D patterns. The

data need to be clustered in order to select the most frequent patterns. The assumption

behind data clustering is that residue configurations in protein structures are unlikely

to be absolute and static. By grouping spatially similar configurations, the geometrical

variation of patterns can be compensated for and their frequencies improved.


The objective in this section is to identify frequent groups of geometrically similar triplets

with identical chemical configurations. Data clustering was done in two steps. For each

residue triplet combination, the initial step is to group geometrically similar patterns,

and then count the combined frequencies

\[ G(HT(i, j, k)) = \sum_{a=i-1}^{i+1} \sum_{b=j-1}^{j+1} \sum_{c=k-1}^{k+1} HT(a, b, c), \tag{3.12} \]

where HT is a hash table of the residue triplets (cf. equation 3.2). Then local geometrical

peaks were searched by comparing the frequencies of the grouped triplets

\[ \arg\max\, G(HT(a, b, c)) < G(HT(i, j, k)), \tag{3.13} \]

where HT(a, b, c) ≠ HT(i, j, k) with a ∈ {i − 1, i, i + 1}, b ∈ {j − 1, j, j + 1} and

c ∈ {k − 1, k, k + 1}.
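The grouping and peak search of equations 3.12 and 3.13 can be sketched in Python as below. The sketch is illustrative: the function names are hypothetical, the table is assumed to hold the bin counts of a single residue triplet combination keyed by (i, j, k), and only non-empty bins are tested as peak candidates, which is an assumption.

```python
from itertools import product

# The 27 offsets of a 3 x 3 x 3 neighbourhood, including the centre (0, 0, 0).
OFFSETS = list(product((-1, 0, 1), repeat=3))


def grouped_count(table, i, j, k):
    """Combined frequency G(HT(i, j, k)) of eq. 3.12 over the 27 adjacent bins."""
    return sum(table.get((i + di, j + dj, k + dk), 0) for di, dj, dk in OFFSETS)


def local_peaks(table):
    """Return bins whose grouped count strictly exceeds that of all 26 neighbours."""
    peaks = []
    for (i, j, k) in table:
        centre = grouped_count(table, i, j, k)
        neighbours = (
            grouped_count(table, i + di, j + dj, k + dk)
            for di, dj, dk in OFFSETS
            if (di, dj, dk) != (0, 0, 0)
        )
        if max(neighbours) < centre:  # strict inequality as in eq. 3.13
            peaks.append(((i, j, k), centre))
    return peaks


# A frequent bin surrounded by six sparsely populated face neighbours.
bins = {
    (8, 8, 8): 5,
    (7, 8, 8): 1, (9, 8, 8): 1,
    (8, 7, 8): 1, (8, 9, 8): 1,
    (8, 8, 7): 1, (8, 8, 9): 1,
}
peaks = local_peaks(bins)
```

Read literally, the strict inequality means that a plateau of equal grouped frequencies, for example around a single isolated bin, yields no peak; how such ties were resolved is not stated above.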

The second step in data clustering finds subgroups of triplets from a local peak, based

on an all-atom structure alignment. The determined clusters are ranked by their

probability scores, which are defined as:

\[ P(\mathit{cluster}) = \frac{\#\,\text{cluster members}}{\#\,\text{peak members}}. \tag{3.14} \]

On the basis of P(cluster), a cluster of residue interactions is selected if P(cluster) ≥

τ. In this study, the threshold τ for selecting a cluster was set to 0.66.

Dataset   PDBIDs  Domains  Domain definition  Data selection        Properties
OLDFIELD  1,442   2,320    mathematical       Sequence alignment    Homologous structural features of divergent proteins.
SCOP40    3,449   4,734    human expert       Structure comparison  Convergent structural features of divergent proteins.

Figure 3.3: Non-redundant structure sets for 3D pattern mining. The dataset OLDFIELD is based on

the publication of [Old02], and SCOP40 was obtained from the ASTRAL Compendium [BKL00]. The size

of the datasets, the method for data selection, and key properties are summarised.

3.2 Analysing available non-redundant protein structure sets

The significance of this data mining result is greatly dependent on the representativeness 

of the data. For the frequencies of structural features to be true, they would have to be 

taken from protein structures of all of the naturally occurring protein folds. However, 

such a data resource is not available at present (cf. section 2.1.1). This effectively means 

that protein structure data mining is bound by the availability of fold examples. While 

from a bioinformatical point of view, little can be done to improve the coverage of the 

fold space, a number of efforts have been dedicated to the compilation of non-redundant 

datasets from PDB. 

The results in this thesis are based on the study of two non-redundant protein structure 

sets: OLDFIELD [Old02] and SCOP40 [HMBC97] [BKL00]. Figure 3.3 summarises

key features of each dataset. The major distinction between the two datasets lies in the

definition of a non-redundant dataset. The purpose in compiling OLDFIELD is to create 

a dataset that allows the detection of interesting structural equivalence from the 

non-specific structural features. The primary data selection is in sequence space. The 

resulting dataset contains only sequentially dissimilar protein fragments, while common 

fold motifs are preserved. This allows the detection of homologous structural components 

of divergent proteins. In contrast, SCOP represents a biased view of protein data by defining 

classes in structure space. The assignment of a novel protein to a class is based on

structure and sequence comparisons. SCOP40 is the data subset of SCOP, where sequentially 

divergent proteins with convergent structural features are retained. Because the 

classification contains structurally divergent proteins, any identified recurrent structural 

feature in SCOP40 is an indication of convergent evolution. 

Another distinction between OLDFIELD and SCOP40 is the method of identifying 

domain structures. In OLDFIELD, protein fragmentation was done mathematically by 

analysing Cα distances [Old01], while in SCOP40 human experts were recruited to process

a batch of protein structures. Both approaches have their advantages and caveats. On the one

hand, an automatic structure domain identification system can deliver reproducible data,

while the results may not be justified in some cases. On the other hand, expert curated

data represent a single precision view, but the information is difficult to reproduce

as new data become available.

The difference between automatic and manual data selection is also reflected in the size of

the datasets. In 2002, the compiled non-degenerate domain structure set from OLDFIELD

listed 2,320 domain structures, corresponding to 1,442 PDB identifiers. In contrast,

SCOP40 contained 4,734 domain structures determined from 3,449 PDB identifiers

in the same year.


3.3 Evaluation methods 

The presented 3D pattern identification system is a discovery-driven data mining solution. 

The assessment of performance is done on two levels: the study of parameter dependency 

(presented in this chapter), and the validation of biological significance of the data (cf. 

chapter 4). 

The effect of data-related parameters was studied by comparing the mined results from 

OLDFIELD and SCOP. In the first part of the analysis, the distributions of extracted 

residue triplets were compared. Then the determined sets of k=2 and k=3 interactions 

were studied. 

The developed data mining method is a three-step process, and the effects of algorithm-related

parameters were studied on two levels. Although the developed data mining

method is controlled by many different parameters, the following key parameters were

studied: residue interaction distance, and size of cross-validation to compute p-values.

The effect of the interaction distance parameter was studied by varying the maximal 

distance between the centroids of residues. Three different distance settings were tested: 

4Å, 6Å, and 8Å. 

Repeated cross-validation sampling was used to determine confidence values for residue 

triplet classification. Various iterations were tested (from 100 to 1,500 in steps of 100) to 

study the effect on the size of interaction datasets. 

3.4 Results 

3.4.1 Identification of residue interactions is dependent on data selection

The result of a data mining analysis is greatly dependent on the input dataset. The

objective in this section is to study the effect of data-related parameters by comparing

results from data mining on OLDFIELD and SCOP40.

With 590,255 unique triplet configurations in SCOP40 and 429,471 in OLDFIELD, 

the common set of triangulated triplets is 381,578 (cf. figure 3.4). Due to the difference 

in the probability distributions of both datasets, the classification of residue interactions 

resulted in different sizes of interaction classes. A set analysis on the classification data

shows that the classes have different sizes of overlaps (cf. figure 3.5). For example,

OLDFIELD/k=3 and SCOP40/k=3 have a large common set of residue configurations of

around 89 per cent for OLDFIELD and 44 per cent for SCOP40. In contrast, the common

set of k=2 interaction is much lower, i.e. 21 per cent for OLDFIELD and 13 per cent for 

SCOP40. The analysis also found two proportions of non-agreed classifications (k2/k3 

between OLDFIELD/SCOP40). 

These results highlight the effect of data selection on the data mining result. A different 

probability distribution of residue triplets, singlets and doublets is the reason, why certain 

residue configurations were classified as k=2 in one dataset, and k=3 in another dataset. 

3.4.2 The interaction distance correlates with the distribution of residue triads

The extraction of residue configurations is controlled by the data representation, feature 

extraction, and by the feature selection method. Structural features were extracted by 

triangulation of a protein structure, which was modelled by a point spread of side chain 

centroids. The goal in this section is to study the effect of varying the interaction distance 

parameter. For this analysis the dataset OLDFIELD was used. 

Table 3.1 summarises the determined set of residue triplets by using three different 

maximal interaction distances. With the change of the distance threshold, the amount 

of extracted triplets, and the probability distributions of the singlets and doublets are 

changed (data not shown). Consequently, the testing of significance of residue interactions

returns different results. It must be noted that a complete analysis with an 8 Å interaction

distance was not done in this study.

Figure 3.4: Distribution analysis of extracted residue triplets. The determined residue triplet distribution

from OLDFIELD is compared with SCOP40. The upper panel shows a set analysis of the extracted

residue triplets (numbers are the unique counts of the residue configuration). The middle panel illustrates

the frequency of each triplet (t) (represented as information, I(t)) from the set of triplets (T). For

a better visualisation the difference of the distributions is measured by the Kullback-Leibler divergence

(lower panel).

Figure 3.5: Comparison of extracted residue triplets based on their interaction type. The determined

k=2 and k=3 classification sets from OLDFIELD and SCOP40 are compared by a set analysis. Due to

the interaction classification (k=2, and k=3) there is no intersection of all four datasets.

                        Triplets
Distance (Å)      Total     Unique    k=2      k=3
4                 2,938      1,799     16      165
6             1,379,545    429,471  9,681  134,465
8             7,128,886  2,016,306    N/A      N/A

Table 3.1: Study on the effect of varying the interaction distance threshold in structure triangulation.

The different determined sets of residue triplet configurations in OLDFIELD were achieved by using the

interaction distance thresholds: 4 Å, 6 Å, and 8 Å.

In conclusion, the effect of varying the interaction distance on the triangulation output

is in agreement with the expected result. While the frequencies of "small" triplet

configurations remain the same as the interaction distance threshold is increased, the calculated

probabilities are different, because of the different distributions. This also affects

the result of interaction classification.

3.4.3 Interaction classification is sensitive to the size of cross-validation

Significance testing of residue interactions is a method for assigning confidence values to

the classification of residue triplets. The p-values were calculated from a two-fold cross-validation

with n iterations of random data sampling. Here, the effect of varying the number

of iterations is studied. OLDFIELD is used as the dataset for this analysis.

Figure 3.6 shows the logarithmic dependency between iteration size and determined

classification sets. Regression analysis indicates that a final classification set had

not been reached after 1,500 iterations. The study of classified residue interactions from each

iteration revealed that the set from iteration i is always a subset of the set from iteration j

with i < j.

In conclusion, the result of varying the iteration sizes indicates that the classification

sets are stable and reproducible. With the increase of iteration size, the determined sets are

not altered but only extended, meaning that the classification result is reliable while additional

elements are identified.

3.5 Discussion 

3D pattern identification is the result of a data mining method that finds recurrent structural features within a protein dataset. The developed analysis method consists of three major modules: triangulation of a protein structure, significance testing of residue interaction, and data clustering of the determined residue interactions.

Figure 3.6: The effect of varying the cross-validation sample size on significance testing of residue interaction. The diagram shows the increasing but converging number of determined residue triplet configurations with one-way, two-way, and three-way interactions at various iteration steps (from 100 to 1,500 in steps of 100) of a non-parametric cross-validation sampling.

Protein structure triangulation is the basis of collecting spatial configurations of residues. 

The definition of residue interaction is a complex task, because an amino acid consists 

of many atoms. Many of them are candidates of interaction partners. A coarse grained 

model was used to overcome this problem, however, with the cost of redefining the interaction 

distance. Instead of measuring interaction distances between atoms of two different 

amino acids, the distance between the side chain centroids is used. The theoretical physicochemical 

interaction distance between two atoms cannot be transferred to measure the 

centroid based side chain interactions. The upper bound of interaction distance of 6Å was

determined from several visual inspections and measurements of residue configurations. 

The analysis shows that with d = 6Å, various side chain rotamer configurations are captured, 

which may represent a physicochemical interaction. By reducing the interaction 

distance threshold, a bias towards tightly inert residue configurations is observed. Conversely, 

the increase in d results in a huge set of triplet combinations. Some of the larger 

triplets do not capture a 3-body interaction, but may be part of a four-body interaction, 

where the fourth residue is situated between all three residues. Although larger interaction 

states may reflect a complete picture of a structural unit, the primary aim here is to 

find local and adjacent interactions of residues. 
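The coarse grained interaction test described above can be illustrated as follows. This is a minimal sketch, assuming that a side chain is given as a list of (x, y, z) atom coordinates in Å and that the threshold is inclusive; the function names and the example coordinates are hypothetical.

```python
import math


def centroid(atoms):
    """Mean position of a list of (x, y, z) atom coordinates."""
    n = len(atoms)
    return tuple(sum(atom[i] for atom in atoms) / n for i in range(3))


def interacts(side_chain_a, side_chain_b, d=6.0):
    """True if the two side chain centroids are at most d angstroms apart."""
    return math.dist(centroid(side_chain_a), centroid(side_chain_b)) <= d


# Hypothetical side chain coordinates, for illustration only.
side_chain_1 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
side_chain_2 = [(4.0, 0.0, 0.0), (5.0, 1.0, 0.0), (6.0, 0.0, 0.0)]
print(interacts(side_chain_1, side_chain_2))          # threshold d = 6
print(interacts(side_chain_1, side_chain_2, d=4.0))   # stricter threshold
```

Because only one point per residue is compared, the threshold has to be larger than a typical atom-to-atom contact distance, which is the cost of the coarse grained model mentioned above.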

The performance of correlation analysis based on hash tables is sensitive to positional errors, which typically translate into the computation of "wrong" hash bin indices. Consider the sample values a = 3.99, b = 4.01, and c = 4.99, where a is assigned to hash bin index i(a) = 1, while b and c are assigned to i(b) = i(c) = 2. The difference between a and b is actually smaller than that between b and c. A correlation analysis with these hashed data therefore seems inadequate, although the "correct" hash bin is in the neighbourhood. A solution to this problem is to consider adjacent hash bins, i.e. a rectangular region of the table [LW91].
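The binning problem and the neighbourhood lookup can be sketched in a few lines. The bin width of 1.0 and the function names are assumptions made for illustration, so the absolute bin indices differ from those in the example above; only the relation between the bins matters.

```python
import math


def bin_index(value, width=1.0):
    """Index of the hash bin that contains value (bin width is an assumption)."""
    return math.floor(value / width)


def neighbourhood(index):
    """The bin itself plus its two adjacent bins (one-dimensional case)."""
    return {index - 1, index, index + 1}


a, b, c = 3.99, 4.01, 4.99

# a and b are close in value but fall into different bins,
# whereas b and c are further apart but share a bin.
print(bin_index(a), bin_index(b), bin_index(c))

# Looking up the adjacent bins as well recovers a as a neighbour of b.
print(bin_index(a) in neighbourhood(bin_index(b)))
```

For a multi-dimensional hash key, the same idea extends to all combinations of adjacent indices per dimension, which yields the rectangular region mentioned above.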

The identification of an interaction class, e.g. a two-way interaction, is based on a probabilistic classification approach. Confidence values were assigned to the classification result by calculating p-values from non-parametric cross-validation sampling. Theoretically, the more sampling iterations are used, the more stable the calculated p-values become. At a certain point, the number of determined interacting residues should converge to some value. The implication of determining a stable p-value is the identification of a finite set of residue interactions. Within this study, the final set was not determined and, for practical reasons, the set obtained after 100 iterations was used.

The output of extracted patterns depends on the distribution of structural features 

in the input dataset. The introduced algorithm is based on the assumption that there 

are significant trends of residue configurations in proteins, if these interactions provide 

a significant functional or structural advantage. Obviously, we cannot expect that data 

mining on two differently defined data selections would deliver the same mining output.

From a mathematical point of view, the results are still correct, because the algorithm is 

detecting recurrent residue configurations in the data. 


In this chapter, I have presented a novel data mining approach for the discovery of 3D patterns in protein structures. A pattern is a residue triplet with two- or three-way interaction of residues. The extraction of 3D patterns is not only dependent on algorithm-related parameters, but also on the data selection. The validity of the data mining approach is justified on the basis of knowing the limits and effects of data and parameters. In the following chapter, I will present the biological significance of the mined result.


Chapter 4 

Prediction of functions for mined 

residue triads 

In the previous chapter, a data mining approach was introduced that identifies recurrent interacting residues as triplets in protein structures. Assuming that a certain residue configuration is conserved in evolution if it provides a structural or functional advantage, the mined 3D pattern may represent a functional site in the protein. The objective in this chapter is to demonstrate the biological validity of the data mined results by cross-validation with a reference database. I present two example cases of validated residue interactions. The first example represents the validation of a metal binding site, where the mined patterns represent either a homologous or a convergent structural feature. The second validation identifies the catalytic triad from the mined data. The analysis includes the search for a 4-body configuration of the catalytic triad (quartet), in order to find a previously reported conserved serine residue. The results presented in this chapter demonstrate the biological significance of the mined data and justify the data mining approach.



The biological significance of the mined 3D patterns is demonstrated by the rediscovery of 

known residue interactions. A systematic performance analysis, in terms of coverage and 

accuracy is not possible, because a test set with complete functional annotations of local 

residue interactions with biological function is not available. Therefore, various protein 

databases were used as references for cross-validations. 

The automatic cross-validation of metal binding sites is based on the comparison of the mined 3D patterns with a metal binding site database. Two reference databases were used and the results compared with each other: MSDsite [GDO+05] and MDB [CHR+02]. The identification of available metal binding sites in the input dataset considered only configurations with more than two residues. A hit was found if all residues of a metal binding site were present in a protein structure. Likewise, a mined 3D pattern was identified as a metal binding site if all residues of the pattern form a subset of a metal binding site. However, because a metal binding site can contain more than three residues, and the mined patterns can include two overlapping triplets, only identified metal binding sites were counted and not every matched pattern. The coverage is computed as:

\begin{equation}
c_{\mathrm{coverage}} = \frac{\#\,\text{unique sites matched by all residues in a 3D pattern}}{\#\,\text{available sites in protein structure set}}. \tag{4.1}
\end{equation}
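The counting rule behind equation 4.1 can be sketched as follows: a site counts once, no matter how many overlapping patterns match it. This is a minimal sketch, assuming that sites and patterns are sets of residue identifiers from the same protein structure; the function names and the example data are hypothetical.

```python
def matched_sites(patterns, sites):
    """Identifiers of sites that contain all residues of at least one pattern."""
    return {
        site_id
        for site_id, site_residues in sites.items()
        if any(pattern <= site_residues for pattern in patterns)
    }


def coverage(patterns, sites):
    """Fraction of available sites matched by at least one pattern."""
    return len(matched_sites(patterns, sites)) / len(sites)


# Hypothetical reference sites and mined triplets, for illustration only.
sites = {
    "site1": frozenset({"CYS10", "CYS13", "CYS30", "CYS33"}),
    "site2": frozenset({"HIS5", "HIS9", "ASP40"}),
}
patterns = [
    frozenset({"CYS10", "CYS13", "CYS30"}),  # subset of site1
    frozenset({"CYS13", "CYS30", "CYS33"}),  # overlapping triplet, also site1
    frozenset({"HIS5", "HIS9", "TRP20"}),    # TRP20 is not part of site2
]
print(coverage(patterns, sites))
```

In this example, two overlapping triplets match the same site, which is counted only once, and the third triplet matches no site.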

The result of the metal binding site cross-validation is compared with the performance of the SIDEMINE [Old02] extraction. Because a comparable experiment had not been performed for SIDEMINE before, it was carried out here. The cross-validation of a metal binding site is analogous to the identification of active sites in the dataset (cf. above).

The identification of a convergent metal binding site was done by a manual search in 

the mined output from SCOP40. The protein structures of a found metal binding site 

pattern were analysed with respect to their SCOP classification identifiers.


OLDFIELD

#Triangulated   Interaction   #Classified    #Clustered   #Pattern
triplet         type          interactions   patterns     frequencies
429,471         k=2           9,681          925          5,697
                k=3           134,465        1,007        11,957

SCOP40

#Triangulated   Interaction   #Classified    #Clustered   #Pattern
triplet         type          interactions   patterns     frequencies
590,255         k=2           15,455         765          927
                k=3           269,683        2,019        2,361

Table 4.1: Summary of extracted data at each protein structure data mining step. The data mining was performed on OLDFIELD and SCOP40. The number of identified residue triplet interactions is given in "#Classified interactions", while the column "#Clustered patterns" indicates the number of unique residue interaction configurations after data clustering, and "#Pattern frequencies" is the total number of examples of the found residue interactions in the dataset.

The automatic cross-validation of catalytic residues was done by comparing residues 

from active site templates in CSA [PBT04]. The validation of a catalytic active site for 

all example protein structures was based on manual analysis. 

To test whether the mined result contains a second conserved serine residue in the 

catalytic triad (quartet) (Asp-His-Ser/Ser), larger residue configurations were constructed. 

The method for finding N-bodies is based on the algorithm of [Old02]: two 3D patterns 

(triplets) from the same protein structure were combined, if they share two common 

residues. The analysis considered only the search for 4-, 5-, and 6-bodies. 

4.2 Results 

In the following sections, the biological significance of the mined 3D patterns is evaluated. 

Data mining was performed on the datasets OLDFIELD and SCOP40 with the following 

parameters: interaction distance d = 6Å, cross-validation iteration = 100, and selection 

of cluster based on τ = 0.66 (cf. section 3.4). Table 4.1 summarises the extracted data 

at each processing step. 


MSDsite

Reference   Dataset    Determined   Validated   Coverage
            OLDFIELD   567          85          0.15
SIDEMINE    OLDFIELD   567          60          0.11

MDB

Reference   Dataset    Determined   Validated   Coverage
            OLDFIELD   302          36          0.12
SIDEMINE    OLDFIELD   302          18          0.06

Table 4.2: Identification of metal binding sites in OLDFIELD. The available metal binding sites in the protein domain structures in OLDFIELD (input dataset) were determined by two reference databases (MSDsite and MDB). The figures were compared with the cross-validated metal binding sites in the mined 3D pattern dataset. A hit was found in the pattern data if all three residues of a pattern are a subset of the residues of a metal binding site. The performance was measured in terms of coverage.

4.2.1 Identification of homologous metal binding sites 

Metal binding proteins play a vital role in a wide range of biological processes, such as structural stability and complex formation. The identification of metal binding proteins is therefore crucial. The objective in this section is to identify metal binding sites within the mined 3D patterns from OLDFIELD by cross-validation with the reference databases MSDsite [GDO+05] and MDB [CHR+02].

Table 4.2 lists the number of determined metal binding sites in the input dataset and the validated 3D patterns. The analysis shows that the determined coverage is quite similar for both references (0.15 and 0.12), providing some confidence in the determined value. While the mined result covers only a small fraction of the available metal binding sites, the performance is comparable with that of SIDEMINE (0.11 and 0.06).

A manual analysis shows that some of the annotated metal binding sites can be partially recovered by merging two 3-bodies into a single 4-body. For example, MSDsite lists the iron binding site, Asp-3His, for the PDB entry 1ar5 with the residues ASP161, HIS27, HIS75, and HIS165. The mined result from OLDFIELD contains the patterns 2His-Trp and Asp-His-Trp, with the residues HIS27, HIS75, TRP126, and ASP161, HIS75, TRP126, respectively. Both triplets can be merged into the 4-body Asp-2His-Trp.
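The merge of the two 1ar5 triplets can be reproduced with a short sketch of the combination rule (two triplets from the same structure are joined if they share two residues). The function name and the set-based representation are assumptions for illustration and do not reproduce the implementation of [Old02].

```python
from itertools import combinations


def merge_triplets(triplets):
    """Combine pairs of triplets that share exactly two residues into 4-bodies."""
    four_bodies = set()
    for first, second in combinations(triplets, 2):
        if len(first & second) == 2:
            four_bodies.add(first | second)
    return four_bodies


# The two mined triplets for PDB entry 1ar5 named in the text.
triplets_1ar5 = [
    frozenset({"HIS27", "HIS75", "TRP126"}),   # 2His-Trp
    frozenset({"ASP161", "HIS75", "TRP126"}),  # Asp-His-Trp
]
print(merge_triplets(triplets_1ar5))
```

The two triplets share HIS75 and TRP126, so the union contains the four residues ASP161, HIS27, HIS75, and TRP126, i.e. the 4-body Asp-2His-Trp.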

A systematic analysis of false negatives is beyond the scope of this work. However, preliminary studies indicate that the selection of the interaction distance plays an important role in discovering 3D patterns. For example, by setting the interaction distance d to 8Å, various triplet configurations can be extracted that contain the missing histidine, HIS165, from the example above.

The validity of a mined 3D pattern as a metal binding site is demonstrated by manual analysis of several example structures. The examples show that the residues of a metal binding site have a strong conservation of the side chain groups, indicating a high energy bond in the formation of a coordinative tetrahedral site. Figure 4.1 illustrates an example configuration with three cysteines from six structure examples. The listed proteins are heterogeneous in nature but have the 3Cys-mediated ion binding site in common. Except for one entry, all structures coordinate a zinc ion in a tetrahedral configuration.

Another metal binding site with the configuration Cys-2His is shown in figure 4.2. The cluster lists 11 proteins, the majority being electron transfer proteins.

In conclusion, the mined 3D pattern data contain validated metal coordinating residue configurations. The result indicates that the presented data mining system is able to identify homologous structural features that are recurrent in the dataset.

4.2.2 Validation of convergent metal binding sites 

Proteins with different folds can share a common structural feature. For example, various 

metal binding sites share a common residue arrangement, while the global fold of the 

metal binding proteins is quite different. In this case, the common pattern represents 

a convergent structural feature. 

The objective in this section is to test whether the 

developed data mining algorithm is able to find patterns of convergent structural features. 

For this analysis, the data mining was performed on SCOP40. 


PDBID Description Bound metal 

1h2r periplasmic hydrogenase nickel-iron 

1lat glucocorticoid receptor zinc 

2nll retinoic acid receptor zinc 

1ptq protein kinase c zinc 

2ohx alcohol dehydrogenase zinc 

4mt2 metallothionein isoform II zinc 

Figure 4.1: A metal binding site with the 3Cys pattern. Cross-validation of metal binding sites with 3D 

pattern from OLDFIELD identified the 3Cys configuration (top panel). List of protein structures with 

the common 3Cys residue configuration (bottom panel). 



1kdi plastocyanin cu 

1aoz ascorbate oxidase cu

6paz pseudoazurin cu 

1jer stellacyanin cu 

2azu azurin cu 

1bqk pseudoazurin cu 

1aac amicyanin cu 

1byo plastocyanin cu 

1as7 nitrite reductase cu 

1nic nitrite reductase cu 

1rcy rusticyanin cu 

Figure 4.2: A metal binding site with the Cys-2His pattern. Cross-validation of metal binding sites with 

3D pattern from OLDFIELD identified the Cys-2His configuration (top panel). List of protein structures 

with the common Cys-2His residue configuration (bottom panel). 



1iml metal-binding protein zn 

1zin phosphotransferase zn 

1kk1 translation zn 

1ibi metal-binding protein zn 

1dgs ligase zn 

1hc7 aminoacyl-trna synthetase zn 

1gax ligase/rna zn 

1dsv virus/virus protein zn 

1i50 transcription zn 

1ptq phosphotransferase zn 

1zbd complex (gtp-binding/effector) zn 

1kb4 transcription/dna zn 

1dcq metal binding protein zn 

1jj2 ribosome cd 

1vfy transport protein zn 

1ffy ligase/rna zn 

1dcq metal binding protein zn 

1dsz transcription/dna zn 

1d66 transcription regulation cd 

2alc dna binding protein zn 

1tfi transcription regulation zn 

4mt2 metallothionein zn 

1jr3 transferase zn 

1a5t zinc finger zn 

1jjd metal binding protein zn 

1bor transcription regulation zn 

1zbd complex (gtp-binding/effector) zn 

1g25 metal binding protein zn 

1pyi complex (dna-binding protein/dna) zn 

1hwt complex (activator/dna) zn 

1het oxidoreductase zn

Figure 4.3: A metal binding site with the 3Cys pattern. Cross-validation of metal binding sites with 

3D pattern from SCOP40 identified the 3Cys configuration (top panel). List of protein structures with 

the common 3Cys residue configuration (bottom panel). 



1ncs transcription regulation zn 

1rmd dna-binding protein zn 

2drp complex (transcription regulation/dna) zn 

1yuj complex (dna-binding protein/dna) zn 

1a1i complex (zinc finger/dna) zn 

1ubd complex (transcription regulation/dna) zn 

5znf zinc finger dna binding domain zn 

2gli complex (dna-binding protein/dna) co

1tf3 complex (transcription regulation/dna) zn 

1bhi dna-binding regulatory protein n/a 

1e53 transcription zn 

1g2a hydrolase ni 

1jym hydrolase co

Figure 4.4: A metal binding site with the Cys-2His pattern. Cross-validation of metal binding sites with 

3D pattern from SCOP40 identified the Cys-2His configuration (top panel). List of protein structures 

with the common Cys-2His residue configuration (bottom panel). 


3Cys

SCOP classification   SCOP domain identifiers
a.4.11.1              1i50j
a.27.1.1              1ffya1
a.60.2.2              1dgsa1
b.35.1.2              1heta1
c.26.1.1              1gaxa3
c.37.1.8              1kk1a3
c.37.1.13             1jr3a2, 1a5t 2
g.38.1.1              1d66a1, 2alca, 1pyia1, 1hwtc1
g.39.1.2              1kb4b, 1dsza
g.39.1.3              1iml 2, 1ibia1, 1ibia2
g.39.1.6              1jj2t
g.40.1.1              1dsva
g.41.2.1              1zin 2
g.41.3.1              1tfi
g.44.1.1              1bor, 1g25a
g.45.1.1              1dcqa2
g.46.1.1              4mt2, 1jjda
g.49.1.1              1ptq
g.50.1.1              1vfya, 1zbdb
g.56.1.1              1hc7a3

Cys2His

SCOP classification   SCOP domain identifiers
g.37.1.1              d1ncs, d1rmd 1, d2drpa1, d2drpa2, d1yuja, d1a1ia1, d1ubdc1, d5znf,
                      d1ubdc2, d2glia4, d2glia2, d2glia3, d1tf3a1, d1bhi
g.49.1.2              d1e53a
d.167.1.1             d1g2aa, d1jyma

Table 4.3: Convergent metal binding sites identified in SCOP40. The determined metal binding sites from the 3D patterns in SCOP40 belong to different fold classes of unrelated proteins (convergent structural feature).

Two patterns were identified in this study that represent metal binding sites. The 3Cys configuration is the first example with 31 structure examples (cf. figure 4.3). The second metal binding configuration is the Cys-2His pattern with 17 structure examples (cf. figure 4.4). A visual analysis determined that the identified metal binding sites from SCOP40 are similar to the mined result from OLDFIELD (cf. previous section). According to the SCOP classification scheme, groups of protein structures can be determined that have different domain structures but share the same metal binding site (cf. table 4.3). This indicates that the pattern was found as a recurrent structural feature in evolutionarily distant proteins.
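The grouping by SCOP classification can be sketched as follows. A SCOP classification string has the form class.fold.superfamily.family, so the first two fields identify the fold. The function names are hypothetical, and the manual analysis in this study is not claimed to have used such a script; the input is the Cys2His column of table 4.3.

```python
def fold_of(classification):
    """Class and fold part of a SCOP classification, e.g. 'g.37' from 'g.37.1.1'."""
    return ".".join(classification.split(".")[:2])


def distinct_folds(classifications):
    """Set of folds in which a pattern was observed."""
    return {fold_of(c) for c in classifications}


# SCOP classifications listed for the Cys2His pattern in table 4.3.
cys2his = ["g.37.1.1", "g.49.1.2", "d.167.1.1"]
folds = distinct_folds(cys2his)
print(sorted(folds))

# A pattern observed in more than one fold is a candidate convergent feature.
print(len(folds) > 1)
```

The Cys2His pattern occurs in three different folds from two SCOP classes, which is consistent with its interpretation as a convergent structural feature.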


The result of this analysis suggests that the developed data mining algorithm is able 

to find recurrent and convergent structural features in a non-redundant structure set. 

4.2.3 Recovering active sites and catalytic triads from the dataset 

The catalytic triad is one of the best characterised non-metal active sites of serine proteases. The enzymatic reaction is based on the conserved residues serine, aspartate, and histidine, which work together in a specific spatial arrangement. Previously, the identification of the catalytic triad has been described as the key evaluation analysis in protein structure data mining, because the occurrence of this pattern is just above the noise level in a dataset of analogous proteins [Old02]. The objective in this section is the search for active sites, and the catalytic triad in particular, by cross-validation with CSA. The mined result from OLDFIELD was analysed in this study.

Within OLDFIELD, 235 active sites were determined, while the number of cross-validated active sites from the mined output was 27. Table 4.4 lists the validated protein residues. The majority of these residues are found in the Asp-His-Ser pattern, which was validated as the catalytic triad by manual analysis. The identified catalytic triad configuration lists 22 structure examples, with the majority belonging to the enzyme class hydrolase and only a few belonging to the class oxidoreductase. In comparison, [Old02] identified 9 proteins, 7 of which were rediscovered in this analysis. The remaining 15 out of 22 are additional and approved solutions. Figure 4.5 shows the superimposed structures for the Asp-His-Ser configuration.

This study shows that the presented data mining system is able to find the catalytic triad in OLDFIELD. The mined result contains 15 additional valid solutions that were not discovered in [Old02].


3D pattern (k=2) 

Cross-validated 

Pattern PDBID RID CSA SIDEMINE EC UID 

Ala-Arg-Asn 1qgj A ALA 71, A ARG 38, A ASN67 + 1.11.1.7 PER59 ARATH 

7atj A ALA 74, A ARG 38, A ASN 70 1.11.1.7 PER1A ARMRU 

His-2Ser 1elt A HIS 57, A SER 195, A SER 214 + 3.4.21.36 ELA1 SALSA 

1ppf E HIS 57, E SER 195, E SER 214 3.4.21.37 ELNE HUMAN 

1bma A HIS 60, A SER 203, A SER 222 + 3.4.21.36 ELA1 PIG 

1avw A HIS 57, A SER 195, A SER 214 + 3.4.21.4 N/A 

1hyl A HIS 57, A SER 195, A SER 214 + 3.4.21.- COGS HYPLI 

1bit A HIS 57, A SER 195, A SER 214 3.4.21.4 TRY1 SALSA 

1jrt A HIS 57, A SER 195, A SER 214 + 3.4.21.4 TRY1 BOVIN 

1try A HIS 57, A SER 195, A SER 214 + 3.4.21.4 TRYP FUSOX 

1au8 A HIS 57, A SER 195, A SER 214 3.4.21.20 CATG HUMAN 

1ct0 E HIS 57, E SER 195, E SER 214 + N/A N/A 

Asp-His-Ser 1a8q A ASP 223, A HIS 252, A SER 94 + 1.11.1.10 BPA1 STRAU 

1a7u A ASP 228, A HIS 257, A SER 98 + 1.11.1.10 PRXC STRAU 

1a88 A ASP 226, A HIS 255, A SER 96 + 1.11.1.10 PRXC STRLI 

1a8s A ASP 224, A HIS 253, A SER 94 + 1.11.1.10 PRXC PSEFL 

1tib A ASP 201, A HIS 258, A SER 146 3.1.1.3 LIP THELA 

3tgl A ASP 203, A HIS 257, A SER 144 3.1.1.3 LIP RHIMI 

1bs9 A ASP 175, A HIS 187, A SER 90 + 3.1.1.6 AXE2 PENPU 

1avw A ASP 102, A HIS 57, A SER 195 + + 3.4.21.4 N/A 

1acb E ASP 102, E HIS 57, E SER 195 + + 3.4.21.4 CTRA BOVIN 

1taw A ASP 102, A HIS 57, A SER 195 + 3.4.21.4 N/A 

1au8 A ASP 102, A HIS 57, A SER 195 + + 3.4.21.20 CATH HUMAN 

1elt A ASP 102, A HIS 57, A SER 195 + 3.4.21.36 ELA1 SALSA 

3tgi E ASP 102, E HIS 57, E SER 195 + 3.4.21.4 TRY2 RAT 

1agj A ASP 120, A HIS 72, A SER 195 + 3.4.21.- ETA STAAU 

1auo A ASP 168, A HIS 199, A SER 114 + 3.4.22.38 CATK HUMAN 

1arb A ASP 113, A HIS 57, A SER 194 3.4.21.50 API ACHLY 

1jrt A ASP 102, A HIS 57, A SER 195 + 3.4.21.4 TRY1 BOVIN 

1try A ASP 102, A HIS 57, A SER 195 3.4.21.4 TRYP FUSOX 

2tec E ASP 38, E HIS 71, E SER 225 + 3.4.21.66 THET THEVU 

1ppf E ASP 102, E HIS 57, E SER 195 + + 3.4.21.37 ELNE HUMAN 

1jfr A ASP 177, A HIS 209, A SER 131 N/A P83850 STREX 

1ct0 E ASP 102, E HIS 57, E SER 195 + + N/A N/A 

3D pattern (k=3) 

Cross-validated 

Pattern PDBID RID CSA SIDEMINE EC UID 

Ala-Asp-Ser 1brt A ALA 123, A ASP 228, A SER 98 + 1.11.1.10 BPOA2 STRAU 

1onr A ALA 225, A ASP 17, A SER 176 2.2.1.2 TALB ECOLI 

Asp-Cys-Lys 1nba A ASP 51, A CYS 177, A LYS 144 + 3.5.1.59 CSH ARTSP 

Table 4.4: List of cross-validated active site residues. The catalytic residues in the mined k=2 or k=3 

residue triplets were compared against active site templates in CSA. RID = a Residue identifier consisting 

of a chain identifier + a residue name + a residue sequence position. 


Figure 4.5: Re-discovery of the catalytic triad in OLDFIELD. Examples of protein structures with the 

Asp-His-Ser pattern were cross-validated by CSA. 

4.2.4 Discovering the conserved serine residue in the catalytic 

triad (quartet) 

The catalytic triad template (Asp-His-Ser) has been reported as a four residue configuration (Asp-His-Ser/Ser) [WBT97] [BFW+94]. Based on the identified catalytic triad pattern in OLDFIELD (cf. previous section), the objective in this section is to test whether a 4-body or even larger residue configurations can be generated from the mined 3D patterns. In addition, the analysis searches for the conserved serine residue in these extended configurations.

The result of extending the catalytic triad is summarised in table 4.5. With 10 out of 

22 structure examples having a single residue extension, only 7 out of the 10 determined 

4-bodies contain the conserved serine residue (Asp-His-2Ser). 

Other 4-bodies were also found with an additional alanine or cysteine residue. Preliminary 

studies indicate that even larger configurations can be obtained, by combining the 

determined 4-bodies into a 5- or 6-body. However, the biological validity of the additional 


PDBID Asp-His-Ser His-2Ser Ala-His-Ser Cys-His-Ser Ala-Asp-His 

1jrt + + + 

1au8 + + + 

1ppf + + + 

1avw + + + + 

1ct0 + + + + 

1elt + + + + 

1try + + + + 

3tgi + + + + 

1acb + + + + 

1arb + + + 

2tec + 

1agj + 

1taw + 

1a8s + 

1jfr + 

1a7u + 

1auo + 

1a88 + 

1a8q + 

1tib + 

3tgl + 

Table 4.5: Extending the catalytic triad into 4-bodies. Two residue triplets from the same protein structure are merged if two of their residues are identical. The first column indicates the catalytic triad configuration, while the second column represents an extension with a previously reported conserved serine residue. The remaining columns show other solutions of 3-body extensions with the catalytic triad.

alanine or cysteine in a 4-body, or even other amino acids in larger residue configurations, 

needs to be determined. 

In conclusion, the presented algorithm is able to find the catalytic triad (quartet), 

i.e. the second conserved serine residue was rediscovered from data mining. While other 

residue configurations of 4-bodies were also found, the biological role of these residues is 

being investigated further. 


The biological cross-validation of the mined 3D patterns requires an adequate knowledge base as reference. A precision score cannot be estimated from cross-validation studies, because the result is the solution of discovery-driven data mining, and current knowledge bases have an incomplete coverage of functional sites. In this respect, the mined 3D patterns may contain known biological motifs, which are the detectable true positives, or unknown functional sites, which cannot be confirmed yet. In addition, the result may contain noise, i.e. false positives that are impossible to detect as such. The biological significance of the presented data mining was evaluated by examples of known biological functional sites: the metal binding site and the catalytic triad. In particular, only known functional sites for proteins in the input structure set were used as benchmark. An alternative to this stringent evaluation is to transfer functional sites from homologous proteins, e.g. based on the Homology-derived Secondary Structure of Proteins (HSSP) database [SS96], and to consider this information as true positive reference.

About one third of the data in the PDB are protein structures co-crystallised with 

metal ions, which allows the study of metal binding sites [BW03]. Within the analysis, 

only a small fraction of proteins with metal binding sites were rediscovered. A systematic 

optimisation of the developed data mining algorithm was not pursued, e.g. by modification 

of feature selection criteria, because this would have exceeded the limit of this thesis. 

Preliminary studies on the source of the false negative rate indicate that the interaction distance threshold is the first parameter to be optimised. However, with the change of this parameter, the probability distribution of triangulated structural features is also modified and the effect cannot be estimated easily.

The datasets OLDFIELD and SCOP40 are quite different (cf. section 3.2). OLDFIELD consists of sequentially dissimilar protein structures, while the proteins may still share structural similarity. This property allows the mining of homologous structural features of divergent proteins, such as metal binding sites or the catalytic triad. The developed data mining method was also tested for its ability to extract convergent structural features, by analysing SCOP40. This dataset consists only of divergent proteins with no global structural similarities. As a consequence, structural components are mainly represented by convergent features, and these residue configurations might fall below the detection level. That is, the occurrences of convergent structural features are similar to the background level. However, metal binding sites are examples of convergent patterns that were found in this study. The coordination of metal ions is greatly dependent on the distances and orientations of the conjugating residues. For that reason, data mining can detect these convergent structural features in structurally unrelated proteins.

The presented data mining system identifies local three residue interactions with respect to their spatial and chemical configuration. In addition, examples of 4- and 5-body interactions were shown as a solution in extending the catalytic triad pattern. The analysis shows that larger residue configurations can be found with the presented combinatorial approach. However, the search for larger structural patterns might deliver only protein stabilising features or other biological units in protein structures that are difficult to interpret.


The solution of the developed data mining algorithm is justified by the cross-validation of biologically relevant structure motifs provided in this study. The mining system is able to detect recurrent homologous or convergent structural features in the dataset. More importantly, two biological motifs, the metal binding site and the catalytic triad, were rediscovered, indicating that the mined output contains biologically valid solutions. While the prediction of functional sites is an important task in structural biology, the biological interpretation of a 3D pattern requires evidence of biological significance. The combination with published biochemical and experimental data can provide evidence and a biological context for data interpretation. In the next chapter, I will present a biomedical literature mining system for the extraction of functional annotation of protein residues.


Chapter 5 

Identification of protein residues in 

MEDLINE 

In this chapter, I present a text mining method to identify protein residues in biomedical texts. In the first step, the algorithm identifies the biological entities of residue, protein, and organism, and then determines the association of entity triplets. As a result, a residue is linked to its source protein, and the protein is mapped to its host organism. Because the developed text mining solution relies on information from UniProtKB, an identified protein residue is directly linked to a unique UniProtKB entry. One application of this method is to search for abstract texts in MEDLINE that mention protein residues, and then to use the result for updating citations in UniProtKB. The identification of protein residues in biomedical texts is a prerequisite for the extraction of functional annotation of residues.


The developed protein residue identification system is based on the algorithm of [HLC04]. Basically, the developed method is a four-step procedure: biological entity recognition of organism, protein, and residue, followed by the association of the entity triplet. Figure 5.1 illustrates the procedures of this text mining system.


Figure 5.1: Overview of processes and evaluation methods for the developed protein residue identification 

system. 


5.1.1 Protein and organism entity recognition 

Theory 

The recognition of protein and organism entities in text is based on a dictionary lookup
approach. Names of proteins, their synonyms, and their gene names are collected
from UniProtKB to populate a protein terminology dictionary. The lookup in the
protein dictionary also matches morphological variants. The dictionary is
not expanded with syntactical variants of its entries, such as structural or formal
variants or the addition of a modifier or head word, because a lookup over the
vast number of permutations would require considerably more memory. The
alternative is to use a probabilistic approach.

A similar method is used to populate the organism terminology dictionary with
names and synonyms from the NCBI Taxonomy database [WBB+06]. The lookup of
organism terms also matches morphological variants.


The recognition of protein entities was based on an approach that combined dictionary
lookup with basic disambiguation [RSKA+07]. All protein names and synonyms were
collected from UniProtKB.

Names of species were extracted from the NCBI Taxonomy references in UniProtKB,
and their scientific and common names were collected. The dictionary was complemented with
terms describing only the referenced genus. Full organism names were augmented
with abbreviated genus forms, i.e. the first letter of the genus followed by the species name.

Texts were annotated with protein and organism names using Whatizit [RSAG+08], a
publicly available web service that provides fast and efficient annotation. The result is
an annotation of protein and organism names in text with references to UniProtKB and
NCBI Taxonomy.
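The dictionary construction and lookup described above can be illustrated with a minimal Python sketch. It is not the Whatizit implementation: the function names, the in-memory dictionary, and the two-entry taxonomy are assumptions made for illustration, and morphological variants are reduced to case-insensitive matching.

```python
import re

def organism_variants(scientific_name):
    """Return the full name, the genus alone, and the abbreviated-genus form."""
    parts = scientific_name.split()
    variants = {scientific_name}
    if len(parts) >= 2:
        genus, species = parts[0], parts[1]
        variants.add(genus)                      # genus-only term
        variants.add(f"{genus[0]}. {species}")   # first letter of genus + species
    return variants

def build_dictionary(taxonomy):
    """Map each lower-cased surface form to the taxonomy IDs it may denote."""
    lookup = {}
    for tax_id, name in taxonomy.items():
        for variant in organism_variants(name):
            lookup.setdefault(variant.lower(), set()).add(tax_id)
    return lookup

def annotate(text, lookup):
    """Return (start, end, surface form, taxonomy IDs) for every dictionary hit."""
    hits = []
    # Longer terms are tried first so that a full name wins over its genus.
    for term in sorted(lookup, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            overlaps = any(m.start() < end and start < m.end()
                           for start, end, _, _ in hits)
            if not overlaps:
                hits.append((m.start(), m.end(), m.group(0), lookup[term]))
    return sorted(hits, key=lambda hit: hit[0])

# Toy taxonomy (illustrative subset, keyed by NCBI Taxonomy ID).
taxonomy = {562: "Escherichia coli", 9606: "Homo sapiens"}
lookup = build_dictionary(taxonomy)
hits = annotate("The enzyme from E. coli differs from the Homo sapiens form.", lookup)
```

A genus-only term such as "Escherichia" maps to every species of that genus in the dictionary, which is why the lookup returns a set of identifiers rather than a single one.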


5.1.2 Entity recognition of protein residue 

Theory 

The identification of residue entities is based on a re-implementation of previously published
regular expression patterns for point mutations [HLC04] [RSMA+04]. Here, the
patterns are extended to capture three types of residue mention in total: wild-type residues,
point mutations, and ranges or pairs of residues. Although amino acid sequences could also
be considered in residue entity identification, the lack of information about sequence
position prevents their precise association with proteins.

The first basic type of residue mention is the reference to a single residue in a protein
sequence, which consists of the name of an amino acid followed by the sequence position number,
e.g. "Gly-12", "arginine 4", "Tyr74", "Arg(53)". A point mutation is the second type of
residue mention, where the description details the exchange of an amino acid at a given
position. The common notation is the name of the amino acid, its sequence position
number, and then the replacing amino acid. The following are examples of point mutations found
in text: "W77R", "Cys560Arg", "ser-52->ala", "ala2-methionine". Finally, the third
type of residue mention describes either a range of residues or an interaction pair, e.g.
"Tyr 85 to Ser 85", "Trp27–Cys29". The common notation is the string sequence: amino acid
name, sequence position, a connecting symbol or word, amino acid name, and sequence
position. The correct identification of this type of residue mention requires the
consideration of contextual information, which is not handled in this version.

In addition to the abbreviated notation, protein residues can be expressed in syntactical
form, e.g. "isoleucine at position 3", "substitution of Ala at position 4 to Gly",
"Ser472 to glutamic acid". Additional patterns were developed to accommodate these
and other less precisely defined residue mentions in syntactical form, e.g. "residue at
position 22, 34, and 40". Although the entity triplet association algorithm does not utilise
the latter residue mentions, annotation can generally be extracted for these
underspecified residues to increase the recall in information extraction.


The extraction of residue mentions reuses the idea of designing regular expressions to find
residue entities in text [RSMA+04] [HLC04]. Some of the previously published regular
expression patterns were adopted, while other patterns were created to cover other types
of residue mentions, such as basic abbreviated point mutation patterns. In this thesis,
sets of regular expressions were developed and implemented as finite state transducers to
identify three types of residue entities (cf. table 5.1): wild-type, point mutation, and
range or pair of residues. The result is an annotation of residue mentions in text with
normalised expressions.

5.1.3 Association identification of the entity triplet organism, 

protein, and residue 

Theory 

The association of the entities organism, protein, and residue is a difficult text mining
task. Unlike the association of two proteins, e.g. the physical interaction of two proteins
(protein-protein interaction), the binary semantic relationships organism-protein and
protein-residue are not necessarily stated explicitly in biomedical texts. For example, a
protein may be mentioned at the beginning of a paragraph, while a site-directed mutation
of the same protein is described in later sections. This is one reason why approaches
relying only on language patterns or word distance metrics are not sufficient to find
protein-residue associations. The association task becomes more complex when multiple proteins
are mentioned in the text. Usually a residue has a one-to-one relationship with a protein;
however, two proteins can have the same residue at the same sequence position. While
this ambiguity cannot be resolved without deeper natural language processing techniques,
the problem can be tackled with a knowledge-based approach.


RANGE-TO   = ("-"+ ("to" "-+")? | "to");
CONVERT-TO = ("to" | "-"+ ">"?);
XAA        = ( "X" | "XAA" | "xaa" );
POS        = (1-9)(0-9)*;
RESN1      = [ARNDCQEGHILKMFPSTWYVOUBZX];
RESN3      = ( [aA]la|ALA | [aA]rg|ARG | [aA]sn|ASN | [aA]sp|ASP | [cC]ys|CYS
             | [gG]ln|GLN | [gG]lu|GLU | [gG]ly|GLY | [hH]is|HIS | [iI]le|ILE
             | [lL]eu|LEU | [lL]ys|LYS | [mM]et|MET | [pP]he|PHE | [pP]ro|PRO
             | [sS]er|SER | [tT]hr|THR | [tT]rp|TRP | [tT]yr|TYR | [vV]al|VAL
             | [pP]yl|PYL | [sS]ec|SEC | [aA]sx|ASX | [gG]lx|GLX | [xX]aa|XAA);
RESNF      = ( [aA]lanine | [aA]rginine | [aA]sparagine | [aA]spart(ate|ic acid) | [cC]ysteine
             | [gG]lutamine | [gG]lutam(ate|ic acid) | [gG]lycine | [hH]istidine | [iI]soleucine
             | [lL]eucine | [lL]ysine | [mM]ethionine | [pP]henylalanine | [pP]roline
             | [sS]erine | [tT]hreonine | [tT]ryptophan | [tT]yrosine | [vV]aline
             | [pP]yrrolysine | [sS]elenocysteine | [aA]spartic acid or [aA]sparagine
             | [gG]lutamic acid or [gG]lutamine);
SITE       = ( (RESN3 | RESNF) POS "residue"?
             | (RESN3 | RESNF) "-"+ POS "residue"?
             | (RESN3 | RESNF) "residue"? "at position"? POS "residue"?
             | (RESN3 | RESNF) "(" POS ")" "residue"?
             | "amino acid"? "residue" "at position"? POS
             | "amino acid" "residue"? "at position"? POS
             | RESNF "residue" POS);
SITES      = ( RESNF"s" (("," | "and" | "or") RESNF"s")*
             | RESNF"s"? ("at position""s"?)? ("," | "and" | "or") (("at position""s"?)? ("," | "and" | "or") POS)+
             | RESNF "residue""s"?
             | RESN3 "residue""s"? ("at position""s"?)? POS (("at position""s"?)? ("," | "and" | "or") POS)+
             | RESN3 "residue""s"?
             | "residue""s"? ("at position""s"?)? POS ("," | "and" | "or") POS)+
             | (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS ("," | "and" | "or") POS)+
             | RESNF ("," | "and" | "or") POS)* "residue""s"?);
RANGE/PAIR = ( "residue""s"? ("," | "and" | "or") RANGE-TO POS)+
             | "amino acid" "residue"? "s"? ("," | "and" | "or") RANGE-TO POS)+
             | ("resiude""s"?)? "at position""s"? ("," | "and" | "or") RANGE-TO POS)+
             | RESI RANGE-TO RESI);
MUTATION   = ( RESN1 POS RESN1
             | RESN1 "-" POS "-" RESN1
             | RESN1 "(" POS ")" RESN1
             | RESI CONVERT-TO (RESN3 | RESNF)
             | RESI RESN3
             | "from" (RESNF | RESN3) CONVERT-TO (RESNF | RESN3) "at position" POS
             | (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS
             | RESI ("-"+ | CONVERT-TO) RESI "substitution");

Table 5.1: Regular expression patterns for the detection of residue mentions in text. The patterns 

recognise single (SITE) or multiple wild-type residue sites (SITES), a sequence range or residue pair 

(RANGE/PAIR), and point mutation (MUTATION). The set covers abbreviated notations of residues 

as well as grammatical expressions found in text. 
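To make the pattern families of table 5.1 more concrete, the following Python sketch covers a small subset of the abbreviated notations: three-letter and one-letter point mutations, and single wild-type sites. It is an illustration only; the thesis implementation is a finite state transducer, whereas this sketch uses Python's `re` module, and the names `MUTATION3`, `MUTATION1`, `SITE`, and `extract_residues` are assumptions introduced here.

```python
import re

ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val",
}
ONE = "".join(ONE_TO_THREE)              # one-letter codes
RESN3 = "|".join(ONE_TO_THREE.values())  # three-letter codes
POS = r"[1-9][0-9]*"                     # sequence position without leading zero

# Three-letter point mutation, e.g. "Cys560Arg" or "ser-52->ala".
MUTATION3 = re.compile(
    rf"\b({RESN3})[- ]?({POS})(?:\s+to\s+|-*>?)({RESN3})\b", re.IGNORECASE)
# One-letter point mutation, e.g. "W77R" (case-sensitive on purpose).
MUTATION1 = re.compile(rf"\b([{ONE}])({POS})([{ONE}])\b")
# Single wild-type site, e.g. "Gly-12", "Tyr74", "Arg(53)".
SITE = re.compile(rf"\b({RESN3})[-\s(]?({POS})\)?(?![0-9A-Za-z])", re.IGNORECASE)

def _norm(name):
    """Normalise a one- or three-letter residue name to 'Xxx' form."""
    return ONE_TO_THREE.get(name, name.capitalize())

def extract_residues(text):
    """Return (type, normalised mention) pairs in order of appearance."""
    found, taken = [], []

    def free(match):
        # A span is free if it does not overlap an already accepted mention.
        return all(match.end() <= s or match.start() >= e for s, e in taken)

    # Mutations are matched first so that "ser-52" in "ser-52->ala"
    # is not reported a second time as a wild-type site.
    for pattern in (MUTATION3, MUTATION1):
        for m in pattern.finditer(text):
            if free(m):
                taken.append(m.span())
                wild, pos, mutant = m.groups()
                found.append((m.start(), "mutation",
                              f"{_norm(wild)}{pos}{_norm(mutant)}"))
    for m in SITE.finditer(text):
        if free(m):
            taken.append(m.span())
            name, pos = m.groups()
            found.append((m.start(), "site", f"{_norm(name)}{pos}"))
    return [(kind, value) for _, kind, value in sorted(found)]

mentions = extract_residues(
    "Mutants W77R and Cys560Arg were compared with ser-52->ala; "
    "Gly-12, Tyr74 and Arg(53) are conserved.")
```

The sketch normalises every mention to the three-letter form plus position, which mirrors the normalised expressions produced by the annotation step. Full amino acid names, grammatical expressions, and ranges are deliberately left out.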


The method developed in this work is based on the algorithm of [HLC04]. The
identification of a protein residue can only be validated if the residue is part of the protein
sequence as recorded in a reference database, e.g. UniProtKB. This requires that the
protein mentioned in the text is further supported by evidence for the organism under
scrutiny, so that the appropriate protein sequence can be selected from the database; this
avoids the risk of using the sequence of an orthologous protein.


In this study, the system developed to identify the entity triplet association of organism,
protein, and residue was based on the algorithm described by [HLC04], with some modifications.
In the first step, proteins were associated with their host organisms. Given a
protein, all protein-organism (species) pairs were determined from the text and ranked
according to a word distance measure. The word distance between two entities was defined
as the smallest number of words between them. The identification of the protein-organism
association began with the pair with the smallest word distance. An association was
valid if the corresponding semantic relation was specified in UniProtKB. If an association
was validated, the search was terminated and the protein was annotated with the corresponding
Uniprot identifier; otherwise the next entity pair from the list was tested. If no match
between protein and organism (species) was found, the search was relaxed to genus
matching. This relaxed matching is an extension of the [HLC04] algorithm. Because
entries in UniProtKB are species-specific, a protein-organism (genus) association
results in a list of Uniprot identifiers as annotation of the protein.
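The ranking and validation procedure for protein-organism pairs can be sketched in Python as follows. This is a simplified illustration under stated assumptions: mentions are given as token index spans, the knowledge base is a small in-memory list standing in for UniProtKB, and the identifiers "U1" to "U4" are hypothetical.

```python
def word_distance(a, b):
    """Number of words between two mentions given as (first, last) token indices."""
    if a[1] < b[0]:
        return b[0] - a[1] - 1
    if b[1] < a[0]:
        return a[0] - b[1] - 1
    return 0  # overlapping mentions

def associate_protein(protein, organisms, knowledge_base):
    """Return the identifiers of the closest organism validated by the knowledge base.

    The species level is tried first; only if no species pair can be
    validated is the search relaxed to the genus level, which may
    return several identifiers.
    """
    ranked = sorted(organisms,
                    key=lambda org: word_distance(protein["span"], org["span"]))
    for level in ("species", "genus"):
        for org in ranked:
            if org[level] is None:
                continue
            uids = [entry["uid"] for entry in knowledge_base
                    if entry["name"] == protein["name"]
                    and entry[level] == org[level]]
            if uids:
                return uids  # the first validated pair terminates the search
    return []

# Hypothetical knowledge base entries (stand-ins for UniProtKB records).
kb = [
    {"uid": "U1", "name": "lysozyme", "species": "Gallus gallus", "genus": "Gallus"},
    {"uid": "U2", "name": "lysozyme", "species": "Homo sapiens", "genus": "Homo"},
    {"uid": "U3", "name": "xylanase", "species": "Bacillus subtilis", "genus": "Bacillus"},
    {"uid": "U4", "name": "xylanase", "species": "Bacillus circulans", "genus": "Bacillus"},
]

# Species-level match: the nearer organism mention wins.
species_hit = associate_protein(
    {"span": (5, 5), "name": "lysozyme"},
    [{"span": (0, 1), "species": "Homo sapiens", "genus": "Homo"},
     {"span": (7, 8), "species": "Gallus gallus", "genus": "Gallus"}],
    kb)

# Genus-only mention: the relaxed search returns a list of candidates.
genus_hit = associate_protein(
    {"span": (3, 3), "name": "xylanase"},
    [{"span": (0, 0), "species": None, "genus": "Bacillus"}],
    kb)
```

Returning a list in both cases keeps the interface uniform: a species-level match normally yields one identifier, a genus-level match yields all species entries of that genus.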

The second step of this algorithm was the association of residues with their source
proteins. The procedure of selecting and ranking the residue-protein pairs was similar
to that of the protein-organism association. For each pair to be tested,
the annotated Uniprot identifier of the protein was used to retrieve the protein sequence
from the database. Three outcomes can be distinguished: (1) the residue
matches the protein sequence; (2) the residue matches several alternative sequences from
a list of proteins; and (3) no match can be found for the residue in the available protein
sequences. If a match was found, the residue was annotated with references to the
protein; otherwise the search continued with the next pair from the ranked list.
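The sequence check behind the three outcomes can be sketched in Python as follows. The sketch assumes that the position reported in the text uses the same numbering as the database sequence; the sequences and identifiers are invented toy data, and the function names are hypothetical.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C", "Gln": "Q",
    "Glu": "E", "Gly": "G", "His": "H", "Ile": "I", "Leu": "L", "Lys": "K",
    "Met": "M", "Phe": "F", "Pro": "P", "Ser": "S", "Thr": "T", "Trp": "W",
    "Tyr": "Y", "Val": "V",
}

def residue_matches(sequence, residue_name, position):
    """True if the 1-based position of the sequence holds the named residue."""
    code = THREE_TO_ONE[residue_name.capitalize()]
    return 1 <= position <= len(sequence) and sequence[position - 1] == code

def ground_residue(residue, ranked_proteins, sequences):
    """Return the identifiers of the first ranked protein mention that validates the residue.

    ranked_proteins holds one list of candidate identifiers per protein
    mention, ordered by increasing word distance to the residue mention.
    """
    name, position = residue
    for uids in ranked_proteins:
        matching = [uid for uid in uids
                    if residue_matches(sequences[uid], name, position)]
        if matching:
            return matching  # outcome 1 (one sequence) or outcome 2 (several)
    return []                # outcome 3: no sequence supports the residue

# Toy sequences keyed by hypothetical identifiers.
sequences = {"U1": "MKGLA", "U2": "MKSLA", "U3": "MAGWA"}

unique = ground_residue(("Trp", 4), [["U1", "U3"]], sequences)
ambiguous = ground_residue(("Gly", 3), [["U2"], ["U1", "U3"]], sequences)
unmatched = ground_residue(("Cys", 9), [["U1", "U2", "U3"]], sequences)
```

In the second call the nearest protein mention is rejected because its sequence has serine at position 3, so the search moves on to the next pair in the ranked list.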

5.2 The construction of evaluation test corpora 

UniProtKB is one of the most comprehensive protein knowledge bases (cf. section 2.1.2). 

It contains manually curated functional annotations on three levels: protein, protein sequence, 

and protein residue. Information is derived from surveys of biomedical articles, 

and entries are annotated with citation references (PMIDs; PubMed identifiers). However, 

the precise association of a citation and a protein residue in context of functional 

annotation is generally not available. 

The test dataset for the developed functional annotation extraction is based on the
citation references from UniProtKB. A Uniprot corpus was generated by retrieving the
abstract texts from MEDLINE that are indexed by the knowledge base. Of the 136,566
citations listed in UniProtKB, a virtually complete set of 136,559 abstract texts was retrieved
from MEDLINE. Although not all information presented in UniProtKB is
necessarily available in the Uniprot corpus, the Uniprot corpus is a starting point for the
evaluation of the developed text mining modules. In particular, three test corpora
were derived from the Uniprot corpus: the gold standard corpus with manual annotation
(GC), and the two cross-validation corpora with annotated information derived from
UniProtKB (XC1 and XC2). Figure 5.2 summarises the key features of these test corpora.

For the automatic evaluation of extracted data, a cross-validation corpus (XC) was
derived from the Uniprot corpus. This test set was used to analyse the performance of
protein-organism (XC1) and residue-protein (XC2) associations. The test set was annotated
automatically, i.e. the biological entities were detected with the same ER systems. The
documents in the Uniprot corpus were scanned for tri-occurrences of organism, protein,


Dataset | Gold standard corpus (GC) | Cross-validation corpus (XC1) | Cross-validation corpus (XC2)
Abstracts count | 100 | 55,998 | 4,503
Method of annotation | manual | automatic | automatic
Total/unique residues | 362/262 (with 262/191 having residue name + residue sequence position) | N/A | N/A
Total/unique proteins | 990/511 | N/A | N/A
Total/unique organisms | 323/123 | N/A | N/A
Total/unique associations | 240/172 residue-protein-organism associations | NA/70,401 protein-organism as UTP | NA/10,152 protein-residue as URP
Application | Test the type, amount and reliability of the extracted information (reproduction of manually annotated information). | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database. | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database.

Figure 5.2: Test corpora for information extraction evaluation. Based on the citation references from
UniProtKB, a base corpus was generated by retrieving abstract texts from MEDLINE. Two kinds of test
corpora were derived from this corpus: (1) the gold standard corpus (GC), which is a manually annotated
test set; and (2) the cross-validation corpora (XC1, XC2), which contain automatically assigned
annotations based on information from UniProtKB.

and residue in text, and a document was retained if a combination of the identifier triplet
(UID+TID+PMID) for that document could be found in the database. UID is the Uniprot
ID, TID is the NCBI Taxonomy ID, and PMID is the PubMed identifier. A document was
selected if at least one match was found. For the non-matching combinations, the
corresponding annotations were removed from the text. This results in the test set XC1 with
the associated set of triple identifier combinations UTP = (UID+TID+PMID). XC2
is a subselection of XC1, obtained by filtering for documents in which the identifier combination
URP = (UID+RID+PMID) was validated by entries in UniProtKB. RID is a residue
identifier which consists of a residue name + sequence position. 70,401 UTPs from 55,998
abstract texts were determined for XC1, and correspondingly 10,152 URPs were derived
from 4,503 MEDLINE articles in XC2.
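The filtering that yields XC1 and XC2 amounts to set intersections on identifier triples. The Python sketch below illustrates this under simplifying assumptions: the detected annotations and the database content are small in-memory sets, the identifiers are toy values, and the function name is hypothetical.

```python
def build_cross_validation_sets(detected, db_utp, db_urp):
    """Keep only identifier combinations that the database confirms.

    detected maps a PMID to the (UID, TID) and (UID, RID) pairs found in
    that abstract; db_utp and db_urp are the sets of (UID, TID, PMID) and
    (UID, RID, PMID) triples derived from the database.
    """
    xc1, xc2 = {}, {}
    for pmid, found in detected.items():
        utp = {(uid, tid, pmid) for uid, tid in found["utp"]} & db_utp
        if not utp:
            continue                 # no confirmed UTP: document is dropped
        xc1[pmid] = utp              # non-matching combinations are discarded
        urp = {(uid, rid, pmid) for uid, rid in found["urp"]} & db_urp
        if urp:
            xc2[pmid] = urp          # XC2 is a subselection of XC1
    return xc1, xc2

# Toy input: three abstracts with hypothetical identifiers.
detected = {
    101: {"utp": {("U1", 9031), ("U7", 562)}, "urp": {("U1", "Gly12")}},
    102: {"utp": {("U2", 9606)}, "urp": {("U2", "Ser5")}},
    103: {"utp": {("U9", 562)}, "urp": set()},
}
db_utp = {("U1", 9031, 101), ("U2", 9606, 102)}
db_urp = {("U1", "Gly12", 101)}

xc1, xc2 = build_cross_validation_sets(detected, db_utp, db_urp)
```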

The gold standard corpus (GC) was created through manual curation, since no suitable
annotated corpora are available for this study. A random sample of 100 MEDLINE
abstract texts was drawn from the Uniprot corpus, where every abstract text had to contain
the tri-occurrence of organism, protein, and residue. Notice that the detection of the
entities was based on the entity recognition (ER) systems described in the previous section.
The ER systems are not expected to perform perfectly, and therefore a certain
proportion of the filtered abstract texts contains false positive entities.

From this set of 100 abstract texts, manual analysis provided four types of annotations.
The first type is the annotation of the biological entities organism, protein, and residue,
while the second is the annotation of entity triplet associations, i.e. organism-protein-residue.
Notice that this process did not include the grounding of protein or organism
entities to entries in the specialised databases, i.e. UniProtKB and NCBI Taxonomy. In
addition, text segments of sentences with a residue entity were annotated if they represent
keywords for functional annotation. Finally, the association of a keyword and a residue
was also annotated in GC.

Notice that the set of documents in GC is only partially contained in XC2; 26 abstracts
are shared between the two datasets. For these shared abstracts, manual annotation determined
38 entity triplet associations, while the corresponding number from XC2 was 58. The total number
of manually annotated triplet associations in GC is 172 (cf. figure 5.2).

The major difference between the two kinds of evaluation corpora is that GC contains manually
confirmed biological entities and their associations. In contrast, the same annotations
in XC1 and XC2 were derived from UniProtKB, based on the assumption that the same
database information is present in the abstract texts. The interpretation of the performance
analysis has to consider the properties of these evaluation test corpora.


The performance of each process of the developed protein residue identification system
was scored against a manually annotated gold standard corpus. Proteins for which the
protein entity recognition system and manual curation assigned the same entity (full
term matching) were counted as true positives (TP). The same rule applied to
counting TPs for the detection of residue and organism entities.

The evaluation of the entity triplet association detection counted an association
as TP only if both pair relations, organism-protein and protein-residue, were determined
correctly. If one of the relations was incorrect, the association was counted as a false
positive (FP).

In contrast, the automatic evaluation of the entity recognition and entity association
detection systems was performed on XC. An annotated entity within
an abstract text was counted as a true positive if UniProtKB lists the same entity in the
context of the given PMID. For example, if organism X in text Y is also indexed in UniProtKB
as a combination of TID+PMID, then a TP was counted.

A protein-organism association was counted as correct if the determined identifier combination
UTP was found in XC. Similarly, a residue-protein association was counted as correct
if the derived identifier combination URP was found in the test corpus.

The effectiveness of the ER and the association detection systems was measured in 

terms of precision, recall and the balanced F-measure (F1): 

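\[
\text{precision} = \frac{\#\,\text{true positive}}{\#\,\text{true positive} + \#\,\text{false positive}}, \tag{5.1}
\]

\[
\text{recall} = \frac{\#\,\text{true positive}}{\#\,\text{true positive} + \#\,\text{false negative}}, \tag{5.2}
\]

\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{5.3}
\]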
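The three measures can be computed directly from the counts that the result tables report as "Available" (annotated targets), "Extracted" (system output), and "Common" (true positives). In this reading, the false positives are the extracted items that are not common, and the false negatives are the available items that were not extracted. The following Python sketch assumes that interpretation of the table columns; the function name is illustrative.

```python
def evaluate(available, extracted, common):
    """Return (precision, recall, F1) from reference, system, and shared counts."""
    precision = common / extracted   # TP / (TP + FP)
    recall = common / available      # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reported for residue entity recognition on the gold standard corpus.
precision, recall, f1 = evaluate(available=191, extracted=203, common=187)
rounded = tuple(round(value, 2) for value in (precision, recall, f1))
# rounded == (0.92, 0.98, 0.95), in agreement with table 5.2
```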

5.4 Results 

The protein residue identification system developed in this study consists of four modules.
The following sections first assess the performance of biological entity recognition, and then


Unique residue entities 

Reference Dataset Available Extracted Common Precision Recall F1 

Gold standard corpus 191 203 187 0.92 0.98 0.95 

MutationGraB GPCR corpus N/A N/A N/A 0.98 0.77 0.86 

MutationMiner Xylanase corpus N/A N/A N/A 1.00 0.85 0.92 

MEMA Mutation corpus N/A N/A N/A 0.98 0.75 0.85 

Table 5.2: Performance evaluation of residue entity recognition. The performance is compared with other 

published residue entity recognition systems: MutationGraB (GPCR corpus) [LHC07]; MutationMiner 

(Xylanase corpus) [BW05]; and MEMA (Mutation corpus) [RSMA+04]. Performance was measured in

terms of precision, recall, and F1 measure. 

the association of the entity triplet organism, protein, and residue. 

The final section 

presents an application of the presented text mining solution that can be used to update 

the citation set of UniProtKB or any other derived databases. 

5.4.1 Evaluation of organism, protein, and residue entity recognition 

The goal of biological entity recognition, in this study, is to detect the mentions of residue, 

protein, and organism in biomedical abstract texts. In order to evaluate the performance 

of the developed ER systems, the detections were compared against the results from 

manual curated test set, the gold standard corpus (GC). 

The evaluation shows that the developed regular expression patterns are well suited
to the detection of residue mentions in biomedical texts. ER for residue mentions yields
a precision of 0.92 and a recall of 0.98. With an F1 measure of 0.95, the performance
of this ER system is within the range of previous reports on point mutation identification
[LHC07] [BW05] [RSMA+04] (cf. table 5.2).

Protein mention identification achieves 65% precision and
60% recall (62% F1 measure). The result is difficult to compare to previously reported
systems, e.g. ProMiner and MutationMiner (cf. table 5.3), due to the different experimental
setups. ProMiner was evaluated on the BioCreAtIvE corpus (80% F1 measure)


Unique protein entities 



ProMiner BioCreAtIvE corpus N/A N/A N/A 0.8 0.8 0.8 


Table 5.3: Performance evaluation of protein entity recognition. The performance is compared with the 

other published protein entity recognition systems: ProMiner (BioCreAtIvE corpus, Task 1B, protein 

and gene name identification) [HFM+05]; and MutationMiner (Xylanase corpus) [BW05]. Performance

was measured in terms of precision, recall, and F1 measure. 

Unique organism entities 




Table 5.4: Performance evaluation of organism entity recognition. The performance is compared with 

the NER system of MutationMiner (Xylanase corpus) [BW05]. Performance was measured in terms of 

precision, recall, and F1 measure. 

which links the contained protein mentions to only a small set of organisms. However,
we repeated the experiment on the BioCreAtIvE dataset, and the result suggests
that our method yields a comparable performance (76% F1 measure). Conversely, the
evaluation of MutationMiner considers not only abstract texts but also the content of the
full-text articles, which should improve the results (79% F1 measure).

Although the developed organism entity recognition system relies on a similar dictionary
lookup approach to protein entity recognition, its performance is higher (precision
of 0.81 and recall of 0.72; cf. table 5.4). This indicates that the list of terms is
precise and covers a wide range of expressions.

In conclusion, with F1 measures of 0.95, 0.62, and 0.76 for the entity recognition of
residue, protein, and organism, the developed text mining system is able to detect these
three biological entities in biomedical abstract texts.


Unique resi.-prot.-org.-associations 



MutationGraB Mutation corpus N/A N/A N/A 0.85 0.69 0.76 

MEMA Mutation corpus N/A N/A N/A 0.93 0.35 0.51 

MuteXt tinyGRAP N/A N/A N/A 0.88 0.83 0.85 

Table 5.5: Performance evaluation of residue-protein-organism entity association detection. The performance 

is compared with the other published point mutation detection systems: MutationGraB (Mutation 

corpus1) [LHC07]; and MEMA (Mutation corpus2) [RSMA+04]. Notice that MEMA identified only associations

but without grounding. Performance was measured in terms of precision, recall, and F1 measure. 

5.4.2 Performance study on the entity triplet association 

The objective of the developed association detection system is to identify the entity triplet 

of organism, protein, and residue. 

In this section, the performance of this detection 

system is studied by comparing the predicted association with the manually annotated 

associations in the gold standard corpus (GC). 

With a precision of 0.82 and a recall of 0.38, the developed detection system is reliable
for the associations it reports, and its precision is comparable to other related reports
(cf. table 5.5). In comparison to the systems MutationGraB and MuteXt, the low recall
can be explained by the differences in the test corpora; both systems were evaluated on
protein family specific full-text articles. The precision reported for MEMA is not directly
comparable to this study, because MEMA identifies only associations without grounding to Uniprot
entries.

Manual analysis isolated two main reasons for the low recall. First, the association of
all three entities failed in several cases because the system did not find an association
between protein and organism. Second, cases were encountered where a protein-organism
association was correctly identified, but a protein-residue association could not
be found. A detailed explanation is given in the discussion section.

Despite the low recall of this text mining module, the evaluation indicates that the 

developed method is able to detect associations of residue, protein, and organism. More 


UTP 

Dataset Available Extracted Common Precision Recall F1 

XC1 70,401 77,407 62,068 0.82 0.88 0.85 

URP 


XC2 10,152 10,876 9,325 0.86 0.92 0.89 

Table 5.6: Performance evaluation of protein-organism and protein-residue entity association detection.
A cross-validation corpus (XC) was obtained by first retrieving the abstract texts cited in UniProtKB
from MEDLINE, searching for tri-occurrences of the named entities residue, protein, and organism,
and then retaining only those entries for which the identifier combination UTP (Uniprot identifier
+ NCBI Taxonomy identifier + PubMed identifier) was found in UniProtKB. The result is the test set
XC1 for the protein-organism association study. XC2 is a subset of XC1, obtained by scanning for documents
where the identifier combination URP (Uniprot identifier + Residue identifier + PubMed
identifier) was validated by UniProtKB. Performance was measured in terms of precision, recall, and F1
measure.

importantly, the detected associations are in accordance with manually identified semantic
relations between the three biological entities. With a precision of 0.82, the developed
method is able to identify protein residues in biomedical texts precisely.

5.4.3 Cross-validation of identified residues with UniProtKB 

In the previous section the system for the association of the entity triplet organism, 

protein, and residue, was evaluated manually on the gold standard corpus. The objective 

in this section is to perform an analysis on a larger test set by cross-validation with 

UniProtKB. For this task, the cross-validation corpora XC1 and XC2 were used. The 

analysis consists of a two-step association study, i.e. the association of protein-organism 

and residue-protein were evaluated individually. Table 5.6 summarises the results. 

With a precision of 0.82 and a recall of 0.88, the result for organism-protein association 

indicates that the system is able to extract correct semantic relations from XC1. The second 

step of the evaluation determines the performance of the residue-protein association 

detection. A similar precision score of 0.86 was determined, while the recall (0.92) was 


triplet association/UTRP 

Resource Available Extracted Common Precision Recall F1 

GC 38 61 29 0.48 0.76 0.59 

XC2 58 61 52 0.84 0.90 0.87 

Table 5.7: A specialised performance evaluation between GC and XC2. The test set consists of the 26 

common documents between GC and XC2. A comparison of the annotated entity triplet associations 

from both resources shows that the list of targets are different. 

more than twice as high as the recall of the entity triplet association determined with GC
(0.38; cf. table 5.5). This can be explained by the differences between the annotation methods
used for the two test corpora. The entities and their associations in GC were determined
manually and did not include a grounding step.

To better compare the performance on GC and XC2, the 26 abstract texts common to
both corpora were studied (cf. section 5.2). When the URP information from the
cross-validation corpus is reused, the performance is similar to the
one measured on the whole XC2 dataset (compare table 5.7 with table 5.6).

However, this result differs from the evaluation based on manual annotation. A
detailed analysis shows that manual annotation determined 38 entity triplets, whereas
XC2 lists 58 associations, and only 25 of these are common to both data sets (data
not shown). This indicates that the annotated targets in GC and XC2 are different and
cannot be compared directly.

The results indicate that the developed method is able to detect correct associations 

of residue, protein, and organism. 

5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins 

The developed text mining system annotates an identified protein residue in a text passage 

with references to its source protein and its hosting organism. Therefore, each MEDLINE 


Figure 5.3: Identified protein residues in MEDLINE. From a MEDLINE extraction, a subset of 2,884 

Uniprot proteins were identified, with cross-references to 14,007 PDB entries, and a corresponding set of 

18,427 MEDLINE records. In comparison, the citation set of the corresponding entries in UniProtKB 

has only 4,652 PMIDs. Only 657 out of 18,427 PMIDs are cross-validated by UniProtKB data. Dashed 

line = MEDLINE based extraction; solid line = database values. 

record with an identified protein residue can be used to update the citation set of the
corresponding protein entry in UniProtKB, or in any other hyperlinked database, e.g. PDB
(UniProtKB/PDB). In this study, the whole of MEDLINE was scanned with the developed
protein residue identification method, and the resulting set of PMIDs was compared with the
citation sets in UniProtKB/PDB (cf. figure 5.3; for an overview of database hyperlinks
and citation references cf. section 2.1).

The protein residue identification system found a total of 40,750 MEDLINE records
in which residues were associated with co-mentioned proteins. The unique count of Uniprot
proteins within the entity triplet associations is 9,354, of which 2,884 proteins
have hyperlinks to 14,007 PDB entries. These 2,884 Uniprot proteins correspond to
a set of 18,427 of the 40,750 PMIDs. In comparison, UniProtKB indexes a set of 4,652 PMIDs
for these 2,884 Uniprot entries. A set analysis determined that the two datasets
have 657 PMIDs in common. This means that only 3.6 per cent of the identified PMIDs
can be cross-validated with UniProtKB (cf. figure 5.4).

The low rediscovery rate can be explained by the fact that most of the annotations
in UniProtKB are derived from sections that are only available in full-text articles. Although the
analysis was based on MEDLINE abstracts, the extraction was already able to find a large number
of relevant abstract texts for citation expansion. With a precision of 0.82 (determined
by gold standard evaluation), the estimated number of true positives in the PMID set is
15,110. In the context of the 4,652 citations from the database for the 2,884 Uniprot proteins,
and considering the 657 rediscovered abstract texts, the result of the MEDLINE
analysis expands the citation set roughly threefold.
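The figures in this estimate can be retraced with a few lines of Python. The computation of the expansion factor is one plausible reading of the text, namely the estimated number of correct new citations relative to the existing citation set; the variable names are illustrative.

```python
identified_pmids = 18_427   # PMIDs found for the 2,884 Uniprot/PDB proteins
database_pmids = 4_652      # PMIDs already cited in UniProtKB for these proteins
shared_pmids = 657          # PMIDs present in both sets
precision = 0.82            # from the gold standard evaluation

# Share of identified PMIDs that UniProtKB can confirm (in per cent).
cross_validated = 100 * shared_pmids / identified_pmids

# Expected number of correct records among the identified PMIDs.
estimated_true_positives = round(precision * identified_pmids)

# Correct records not yet cited, relative to the existing citation set.
new_citations = estimated_true_positives - shared_pmids
expansion_factor = new_citations / database_pmids
```

Under these assumptions the cross-validated share is about 3.6 per cent, the estimated number of true positives is 15,110, and the expansion factor is about 3.1.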

In conclusion, the presented text mining system can be used to determine relevant 

literature data for the update of the citation sets in UniProtKB/PDB. 

The extracted abstract texts for those proteins provide the basis for functional annotation 

extraction. 


The presented text mining method identifies protein residues in biomedical texts. The
first step is the recognition of the entities residue, protein, and organism in texts. The
language expressions of the three biological entities are quite different. A residue entity,
for example, is generally mentioned in the text by its three-letter abbreviation +
protein sequence position. The regular expression patterns were designed specifically for
these and other derived expressions, which explains the high precision and recall of the
residue entity recognition system. However, a residue can also be expressed by its one-letter
abbreviation or in syntactical form. While the latter expression is considered and
implemented in this thesis, it was suggested that these expressions represent only a small

96

Figure 5.4: Cross-validation of citations from identified protein residues with UniProtKB/PDB. For a 

subset of UniProtKB/PDB proteins (i.e. proteins with UID and PDBID) the determined PMIDs can be 

cross-validated with the relevant citation set from UniProtKB. Dashed line = the number of common 

PMIDs; uni = UniProtKB/PDB based citations; med = protein residue identification based citations; 

comm = common set of citations between uni and med. 

97

fraction [LHC07] in biomedical texts. 

The implementation of one-letter abbreviation 

would increase the recall, but the method would become less precise. For example the 

matched string ”C4” could be a nucleotide, a gene, an atom in a chemical compound, or 

any other acronym. 
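The trade-off between three-letter and one-letter residue patterns can be illustrated with two regular expressions. This is a minimal sketch and not the pattern set used in this work: both patterns and the function name `find_residues` are assumptions made for illustration, and they cover only the plain forms such as “Asn59” or “Ser-77”.

```python
import re

# Hypothetical pattern: three-letter amino acid code followed by a sequence position.
THREE_LETTER_CODES = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)
THREE_LETTER = re.compile(rf"\b({THREE_LETTER_CODES})[- ]?(\d+)\b")

# Hypothetical pattern: one-letter code followed by a position (far more ambiguous).
ONE_LETTER = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)\b")


def find_residues(text, pattern=THREE_LETTER):
    """Return (amino acid code, position) pairs matched by the given pattern."""
    return [(m.group(1), int(m.group(2))) for m in pattern.finditer(text)]


# The three-letter pattern finds the residues in a thesis example sentence ...
print(find_residues("binding residues include Asn59, Ala179, Ala180, and Asp183"))
# ... and ignores "C4", whereas the one-letter pattern tags it as a residue.
print(find_residues("the C4 atom of the ribose"))
print(find_residues("the C4 atom of the ribose", ONE_LETTER))
```

The last call shows why one-letter matching lowers precision: the string “C4” is accepted although the context refers to an atom.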

The identification of protein terminology in text is a major challenge in the biomedical
text mining community. Protein names are not standardised, and the use of alternative
names is common, e.g. abbreviations, pet names, or synonyms. In addition, there is no
guideline for the construction of names; a name can therefore be short or long in terms of
word count, e.g. “MAP kinase kinase” and “MAP kinase kinase kinase”. The developed
protein entity recognition system is based on a lookup of names and synonyms in a
dictionary. Because the entries are finite, syntactical variants of protein names cannot be
detected if they are not covered by the dictionary. This explains the low recall of this ER
system. Precision, in turn, is reduced by partial matches within a whole protein name and
by the tagging of ambiguous protein names. For example, “SNF” could be a protein in yeast
or the funding agency “Swiss National Science Foundation”.

The principal method for organism entity recognition in this investigation is the same as
for protein name identification. A list of terms from the NCBI Taxonomy was used to
generate an organism name dictionary. Although the method is the same as for protein
entity recognition, the system yielded a higher performance. One explanation is that the
dictionary contains predominantly unambiguous terms. However, some ambiguous terms
can also be found, e.g. “RAT” could be a protein, an organism, or a method. To my
knowledge, dedicated research on organism entity recognition has not been published, nor
is a gold standard for performance evaluation available.

Based on the residue, protein, and organism entities found in a text, the developed system
identifies semantic relations between these biological entities. The approach is based
on the idea of reusing explicitly stated relations contained in UniProtKB. The correct
association between protein and residue relies on several factors: the ER performance;
the correct protein sequence retrieval, which depends on the correct organism-protein
association; and the correct alignment of a residue with a protein sequence at the specified
position. On the one hand, a low recall in residue-protein association can be explained by
a missing protein sequence variant in the repository. On the other hand, an incorrect
protein-organism association leads to the retrieval of a wrong protein sequence. Another
consideration is that the protein sequence in the database could deviate from the author's
data, because the two sides may have used different indexing rules. Conversely, the
true positive rate can be distorted for the same reason: a residue whose sequence index
does not correspond to the database numbering may still match a protein sequence by
chance. One solution to this specific problem is to consider all residues of the same protein
in the sequence alignment. However, this method may only be applicable to full-text
analysis, as abstract texts rarely mention multiple residues of the same protein.
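The alignment check described above, and the proposed remedy of considering all residues of the same protein, can be sketched as follows. This is an illustration under stated assumptions: residues are given as one-letter codes with 1-based positions, the function names are hypothetical, and the sequence in the usage lines is a toy string rather than a real UniProtKB entry.

```python
def matches_at(sequence, residue, position):
    """True if the 1-based position of the sequence holds the given one-letter residue."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == residue


def supported_fraction(sequence, mentions):
    """Fraction of (residue, position) mentions that agree with the sequence.

    A single mention can agree by chance; a high fraction over several
    mentions of the same protein is stronger evidence for the association.
    """
    if not mentions:
        return 0.0
    hits = sum(matches_at(sequence, residue, position) for residue, position in mentions)
    return hits / len(mentions)


toy_sequence = "MKSAC"  # hypothetical five-residue sequence
print(matches_at(toy_sequence, "S", 3))
print(supported_fraction(toy_sequence, [("S", 3), ("C", 5), ("A", 1)]))
```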

The entity recognition and association detection systems were evaluated by a manual
analysis of the gold standard corpus and by an automatic cross-validation study, for the
following reasons. Protein annotations in UniProtKB are primarily derived from manual
information extraction from full-text articles. Although a considerable amount of this
information may not be present in MEDLINE, the combination X+PMID, where X is either
a UID or a TID, can be used to estimate the information extraction performance. However,
the false positive rate cannot be determined in this cross-validation study, because the
knowledge base is incomplete, even for the indexed citations. Manual evaluation on a gold
standard test set therefore has the advantage that both the false positive and the false
negative rate can be studied.

An identified protein residue is annotated with references to its source protein (Uniprot
identifier) and the host organism (NCBI Taxonomy identifier). Based on these annotations,
a link can be made between MEDLINE and biological knowledge bases. One immediate
application is to scan MEDLINE for protein residues and use the Uniprot identifier
annotations in combination with the MEDLINE identifier (or PubMed identifier; PMID) to
update the citation sets of the corresponding Uniprot entries. The significance of this
approach was studied by automatic cross-validation analysis. The results indicate that only
a small proportion of Uniprot proteins can be found and associated with residues in the
MEDLINE analysis, and that the identified set of PMIDs has only a small overlap with the
corresponding citation sets. One explanation is that annotations were extracted from
full-text articles, where the same information is not present in the abstract texts; such
citations represent a true negative fraction, in the sense that the information cannot be
identified from abstract sections. Another explanation is that curators provide only a list
of relevant citations from a batch of processed biomedical articles. In other words, neither
the irrelevant citations (false positives) nor the complete list of true positive citations
from the sample of reviewed biomedical articles is available in UniProtKB, although this
information would have allowed a more precise evaluation.


The developed text mining solution identifies protein residues in text and annotates them 

with references to UniProtKB and NCBI Taxonomy. Based on these references, a link 

between MEDLINE and UniProtKB is created. Although the identification of protein 

residues in MEDLINE does not necessarily mean that functional annotations are present 

in abstract texts, the analysis is a prerequisite for the mining of functional annotation. 

The extraction of contextual features as annotations of a protein residue is the topic of the

following chapter.


Chapter 6 

Information extraction from the 

context of a residue in text 

In the previous chapter, I have introduced a method for the identification of protein 

residues in biomedical texts. The objective, in this chapter, is to extract textual features 

from the context of protein residues that can be used as functional annotation. Because a 

terminological resource is not utilised, the developed method can discover new information 

from text. 

The extracted contextual features are then enriched with semantic labels 

according to a categorisation scheme. The design of this scheme was data-driven, and it

contains concepts of biological interest. The overall result of this text mining solution

is the annotation of protein residues with text segments that are classified by a set of 

biological categories. 


The developed information extraction system can be divided into two parts: extraction 

of contextual features associated with protein residues, and classification of the extracted 

textual features. Figure 6.1 illustrates the procedures involved in the developed information 

extraction system. 


Figure 6.1: Overview of processes and evaluation methods of the developed contextual feature extraction 

system. 


6.1.1 Extraction of contextual features 

Theory 

Finding functional annotations of protein residues in biomedical text. In this
study, several assumptions were made for the extraction of functional annotations
from biomedical texts; they are explained in the following. The first assumption is that
noun phrases in a text are semantically rich, in the sense that they are able to represent
subject content (keywords) [JK95]. Consequently, they are good candidates for textual
features for the functional annotation of protein residues.

The second assumption is that a biological function of a protein residue can be found
as a verbal or nominal expression in natural language. In other words, a syntactical relation
between a residue and a term can capture their semantic relation. Therefore, a syntactical
analysis of a sentence enables the identification of an explicitly stated biological function.

For example, from the phrase 

”A inhibits B by phosphorylation of C”, 

the relations 

A—inhibits—by-phosphorylation-of-C 

A—inhibits—B-by-phosphorylation 

A—inhibits—B 

UNK—phosphorylate—C, 

can be identified. Although the identification of a residue-keyword association can be
attempted with co-occurrence analysis, the aim is to extract reliable associations together
with contextual information about them. In other words, the type of association expressed
by a verb or a preposition, and the context expressed by a prepositional phrase, are
important pieces of information that make up a justifiable functional annotation. A
discussion of semantic relation and syntactical relation extraction can be found in section
2.3.2.

In general, the terminology of GO can be reused to identify descriptions of biological
function in text. However, this ontology is not specialised for protein residues; for example,
the term “active site” does not even appear as a stand-alone term in the repository.
Descriptions of protein function generally refer to a higher level of biological function,
e.g. metabolomics or cell signalling. In contrast, the annotation of protein residues requires
a different set of terms that describe molecular interactions or chemical reactions.

Because a suitable terminological resource is not available, the extraction of syntactical
relations focuses on semantic relations between two elements: a residue entity and a
contextual feature (keyword). The following demonstrates how a description of function can
be identified from a parsed sentence. Given the example sentence from MEDLINE

”Parathyroid hormone inhibits renal phosphate transport by phosphorylation 

of serine 77 of sodium-hydrogen exchanger regulatory factor-1.” 

(PMID:17975671), 

a syntactical analysis produces the following phrase structure representation 


[Parathyroid hormone]/NP 

[inhibits]/V 

[renal phosphate transport]/NP 

[by]/P 

[phosphorylation]/NP 

[of]/P 

[serine 77]/NP 

[of]/P 

[sodium-hydrogen exchanger regulatory factor-1]/NP, 

where NP is a noun phrase, P a preposition, and V a verb. From this parsed sentence, 

the following semantic relations can be determined: 

Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation-of-serine 77

Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation

Parathyroid hormone—inhibits—renal phosphate transport

UNK—phosphorylate—serine 77.

In the next section, a template for storing the extracted relation information is discussed. 

Semantic representation of extracted relations. The objective of syntactical relation
extraction is to identify biological relations in a sentence, i.e. a semantic relation
between a residue entity and a term. Because the result is a set of syntactical relations
with different contextual specifications (cf. the example in the previous section), a suitable
data collation method is necessary to avoid data redundancy. That is, the set of relations
determined within a given syntactic frame contains relations that are specifications
of other relations. For example, the relation

A—inhibits—B-by-phosphorylation

is a specification of the relation

A—inhibits—B.

Here, the predicate-argument structure (PAS) is proposed as a semantic representation 

of extracted syntactical relations. A PAS is a template for information extraction, 

where the predicate and the arguments represent the slots to be filled. In this study, the 

predicate (pred) of a PAS is defined as the verb, while the arguments of the verb are 

the numerically labelled arguments arg1 and arg2, or even higher numerically labelled 

arguments. The arg1 label is assigned to arguments, which are understood as agents, 

causers, or experiencers, i.e. the semantic subject. Conversely, the arg2 label is usually 

assigned to the patient argument, i.e. the argument which undergoes the change of state 

or is being affected by the action. 

The transformation of the extracted relations into PAS data does not include an
analysis of the semantic roles of the verb arguments, i.e. argument modifiers such as
location, time, cause, etc. Noun phrases of the extracted relations can have prepositional
attachments, and the prepositions are often indicators of the thematic roles of the verb
arguments. Therefore, prepositional phrases are listed as modifiers of arguments with the
following label notation: main argument label + preposition, e.g. arg1-of and arg2-by.
The following illustrates the transformation of relations into a PAS for the previous
example:


pred = inhibit 

arg1 = Parathyroid hormone 

arg2 = renal phosphate transport 

arg2-by = phosphorylation 

arg2-of = serine 77, 

which corresponds to the following verb frame set: 

inhibit sub-arg1 obj-arg2 P by-arg2 P of-arg2. 
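The slot filling shown above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this work: the function name `chunks_to_pas`, the lemma lookup table, and the rule that only the first phrase per preposition is kept are assumptions chosen so that the sketch reproduces the example PAS.

```python
def chunks_to_pas(chunks, lemmas):
    """Fill a PAS template from chunks of the form NP V NP (P NP)*.

    chunks: list of (text, tag) pairs with the tags NP, V and P.
    lemmas: hypothetical lookup table mapping a verb form to its lemma.
    """
    tags = [tag for _, tag in chunks]
    if tags[:3] != ["NP", "V", "NP"]:
        raise ValueError("expected a sentence of the form NP V NP (P NP)*")
    verb = chunks[1][0]
    pas = {
        "pred": lemmas.get(verb, verb),
        "arg1": chunks[0][0],
        "arg2": chunks[2][0],
    }
    attachments = chunks[3:]
    for (preposition, _), (phrase, _) in zip(attachments[0::2], attachments[1::2]):
        # Assumption: a repeated preposition keeps only its first phrase.
        pas.setdefault(f"arg2-{preposition}", phrase)
    return pas


sentence = [
    ("Parathyroid hormone", "NP"), ("inhibits", "V"),
    ("renal phosphate transport", "NP"),
    ("by", "P"), ("phosphorylation", "NP"),
    ("of", "P"), ("serine 77", "NP"),
    ("of", "P"), ("sodium-hydrogen exchanger regulatory factor-1", "NP"),
]
pas = chunks_to_pas(sentence, {"inhibits": "inhibit"})
for slot, value in pas.items():
    print(f"{slot} = {value}")
```

Storing the most specific relation once, with the prepositional attachments as separate slots, avoids keeping each shorter relation as a redundant record.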

Note that the defined PAS does not conform to the PAS schemes of some proposition
banks, e.g. PropBank or PASBio. For example, for the verb “inhibit” PropBank lists the
following frame set:

inhibit sub-ARG0 obj-ARG1

inhibit sub-ARG0 S-ARG1,

while additional arguments are not defined (note that ARG0 in PropBank is equivalent
to arg1 in this definition, and ARG1 corresponds to arg2). Although verb frame sets from
publicly available proposition banks could have been considered in this study, the listed
verbs have a low coverage of the verbs co-occurring with residue mentions in MEDLINE.
The low coverage and the lack of domain-specific verb frame sets are the main reasons why
these resources were not reused.


The extraction of contextual features is based on a syntactical analysis of natural language 

sentences. Two approaches were developed in this work and compared in the performance 


evaluation study: shallow parser based relation extraction, and full parser based relation 

extraction. 

Shallow parser based relation extraction. The first approach was to develop a
shallow parser, which aims to find the boundaries of major constituents in a sentence,
such as noun phrases. The design is based on heuristics and on the idea of finding general
relations between closed-class English words [LCM03]. The parser reported there finds
verbal relations between noun phrases, and prepositional relations for a set of the most
frequent prepositions, i.e. “of”, “in”, and “by”. Here, the parser is implemented as a general
relation extraction method, in which the list of prepositions is not limited to these three.
The purpose is to find more contextual features, and thereby discover more information.

Initially, an abstract text was split into sentences, which were then annotated with
part-of-speech (POS) tags using the CISTAGGER. The tagger was trained on the CISLEX
lexical resource, which contains a rich terminological set of the biomedical domain [Gue96].
Based on a rule set and the POS information, the developed shallow parser identified noun
phrases, verb groups, verb phrases, and prepositional phrases in the analysed sentences:

NP = Det? (Adj|Adv|N)* N 

PP = P NP 

VG = (Adv|Aux|V|InfTo)* V 

VP = VG NP PP*. 

N is a noun, Det a determiner, Adj an adjective, Adv an adverb, P a preposition, PP a
prepositional phrase, VP a verb phrase, and VG a verb group. Note that the grammar
does not consider coordination, e.g. with “and”, “or” and “,”. The grammar can easily be
extended to capture coordination by

NPx = NP (CC NP)*,

where

CC = (“and” | “or” | “,”){1,2}.

However, the pattern would then also find false positives as illustrated in the following 

example. The sentence 

”Highly conserved phosphopantothenate binding residues include Asn59, 

Ala179, Ala180, and Asp183 from one monomer and Arg55’ from the 

adjacent monomer.” (PMID:12906824), 

contains the noun phrases 

NP1 = ”Asn59, Ala179, Ala180, and Asp183 from one monomer” 

NP2 = ”Arg55’ from the adjacent monomer”. 

The extended patterns would have extracted a single noun phrase, from which the identification 

of the correct post-nominal prepositional phrase attachment cannot be done 

easily: 

NPx = 

”Asn59, Ala179, Ala180, and Asp183 from one monomer and 

Arg55’ from the adjacent monomer”. 

Based on the determined phrase structure, the parser then extracts verbal relations of
noun phrases or prepositional phrases. A condition of the extraction is that at least one
relation element must contain one or more residue mentions:

REL = NP PP* VP.

The extracted relation is then transformed to fill the slots of the predefined PAS template.
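The grammar rules above translate almost directly into regular expressions over a sequence of POS tags. The following is a minimal sketch of that idea and not the parser developed in this work: the POS tags in the usage example are assigned by hand instead of by a tagger, and the names `extract_relations` and `is_residue` are hypothetical.

```python
import re

# Grammar rules over space-terminated POS tags (one "TAG " per token).
NP = r"(?:Det )?(?:(?:Adj|Adv|N) )*N "
PP = rf"P {NP}"
VG = r"(?:(?:Adv|Aux|V|InfTo) )*V "
VP = rf"{VG}{NP}(?:{PP})*"
REL = re.compile(rf"(?<!\S)(?P<np>{NP})(?P<pps>(?:{PP})*)(?P<vp>{VP})")


def extract_relations(tagged, is_residue):
    """Find NP PP* VP spans that contain at least one residue mention.

    tagged: list of (word, tag) pairs for one sentence.
    is_residue: callable deciding whether a word is a residue mention.
    """
    tag_string = "".join(f"{tag} " for _, tag in tagged)
    words = [word for word, _ in tagged]

    def token_index(char_offset):
        # Every tag ends with one space, so spaces count the preceding tokens.
        return tag_string[:char_offset].count(" ")

    relations = []
    for match in REL.finditer(tag_string):
        parts = {}
        for name in ("np", "pps", "vp"):
            start, end = match.span(name)
            parts[name] = words[token_index(start):token_index(end)]
        if any(is_residue(word) for part in parts.values() for word in part):
            relations.append(parts)
    return relations


tagged_sentence = [
    ("Mutation", "N"), ("K241Q", "N"), ("completely", "Adv"), ("abolishes", "V"),
    ("DNA", "N"), ("glycosylase", "N"), ("activity", "N"),
]
print(extract_relations(tagged_sentence, lambda word: word == "K241Q"))
```

The residue test is passed in as a function so that the chunking rules stay independent of how residue mentions are recognised.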

Full parser based relation extraction. The second approach to contextual feature
extraction utilises the full parser ENJU [MT05] (version 2.3), which generates a so-called
head-driven parse tree from a sentence. The advantage of this parser is that it uses a
parsing model adapted to biomedical text. The parser generates predicate-argument
relations between words. Because the generated output contains a lot of information,
different interpretations are possible. In this study, a wrapper was developed that converts
the parser's output into the presented PAS data format. The assumption is that the phrase
structure of a verb argument can be found by following the direct links of a verb to its
arguments in the tree and then collecting all the sub-branches of each argument.
The identified NP PP* VP structures are then decomposed to fill the PAS template.

6.1.2 Categorisation of contextual features 

Theory 

A PAS captures a verb frame within a sentence, where the arguments may represent
subject content. In order to evaluate the relevance of these arguments, a semantic
interpretation is needed. Here, a classification method was developed that automatically
assigns semantic labels to the arguments of a PAS. For this task, categories have to be
defined that serve as suitable labels for information interpretation. Although an ontological
model of protein residue function is not available, there are two approaches to this problem.
The first is to adopt annotation schemes from protein databases, e.g. UniProtKB. This
represents a top-down approach. One motivation for reusing the categorisation scheme of
UniProtKB is that information classified with this scheme can be used directly to update
the relevant fields in the database.

Alternatively, a bottom-up approach can propose new categories. In this study, text
segments from MEDLINE were analysed to determine whether they represent suitable
functional annotations for residues. The result is an overview of the information
distribution in MEDLINE, which led to the proposal of a categorisation scheme. The
categories defined in both schemes are compared in table 6.1. Both categorisation schemes
reflect concepts of biological interest. However, the bottom-up approach has the advantage
that the proposed categories are data-driven, whereas in a top-down approach examples of
the listed categories may not be present in natural language text, or other categories may
be missing from the scheme.

The assignment of categories to contextual features is based on the endogenous
classification approach [Cer00]. In contrast, the exogenous, i.e. corpus-based, approach
requires large amounts of contextual cues, which are difficult to obtain. According to the
author, the endogenous approach produces more reliable results even under conditions of
sparse data.

From a reference set of terms with manually assigned labels according to a categorisation 

scheme, the algorithm computes the mutual information of the lexical constituents of 

terms and their assigned categories. These scores are then used to calculate and select the 

highest scoring association of a term and a category. The algorithm was re-implemented 

and used in this study. 


The semantic interpretation of contextual features, which are the arguments of the extracted 

PAS, relies on the endogenous classification approach described by [Cer00]. The 

method was re-implemented in this study. The algorithm relies only on the mutual information 

of the lexical constituents of terms and their assigned categories. 

During the training phase, lexical constituents of multi-word terms were extracted 

from a labelled reference set. They represent the features of the predefined categories. 


MAN: STR COMP. Structure component. Class denoting concepts that represent pieces and parts of the protein structure.
  FEAT: DOMAIN. Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure or fold.
  FEAT: MOTIF. Short (up to 20 amino acids) sequence motif of biological interest.
  FEAT: TOPO DOM. Topological domain.
  FEAT: CHAIN. Extent of a polypeptide chain in the mature protein.
  FEAT: TRANSMEM. Extent of a transmembrane region.
  FEAT: COILED. Extent of a coiled-coil region.

MAN: CHEM MOD. Chemical modification. Class denoting changes to the protein sequence and the chemical composition.
  FEAT: VARIANT. Authors report that sequence variants exist.
  FEAT: MOD RES. Posttranslational modification of a residue.
  FEAT: PEPTIDE. Extent of a released active peptide.
  FEAT: VAR SEQ. Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting.
  FEAT: LIPID. Covalent binding of a lipid moiety.
  FEAT: CARBOHYD. Glycosylation site.

MAN: STR MOD. Structural modification. Class denoting the changes to the protein structure without changes to the chemical composition.
  FEAT: REGION. Extent of a region of interest in the sequence.
  FEAT: SITE. Any interesting single amino-acid site on the sequence, that is not defined by another feature key.

MAN: BINDING. Binding type. Class denoting different physico-chemical forces leading to a bond formation between a protein structure component and a chemical entity.
  FEAT: BINDING. Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
  FEAT: METAL. Binding site for a metal ion.
  FEAT: DISULFID. Disulfide bond.
  FEAT: CROSSLNK. Posttranslationally formed amino acid bonds.
  FEAT: DNA BIND. Extent of a DNA-binding region.
  FEAT: NP BIND. Extent of a nucleotide phosphate-binding region.
  FEAT: ZN FING. Extent of a zinc finger region.
  FEAT: CA BIND. Extent of a calcium-binding region.

MAN: ENZ ACT. Enzymatic activity. Types of enzymatic reactions as a subpart to protein functions.
  FEAT: ACT SITE. Amino acid(s) involved in the activity of an enzyme.

MAN: CELL. Cellular phenotype. Class denoting different cellular phenotypes that can be affected by structural or compositional changes of a protein.
  FEAT: N/A

Table 6.1: Biological categories for the classification of protein residue related information. Two sets
of schemes were used: a text data motivated definition of categories (MAN) determined from manual
analysis of sentences with annotations for protein residues from MEDLINE, and key categories from the
feature table of UniProtKB (FEAT). Each MAN category is listed with its definition, followed by the
corresponding FEAT categories and their definitions.


The association between a feature (w) and a category (c) was estimated based on
their mutual information score

\[ I(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}. \tag{6.1} \]

The association between the multi-word term $T = \{w_i\}_{i=1}^{n}$ and a category c was
computed by the sum of the associations of its words

\[ A(T, c) = P^{*}(c) \sum_{i=1}^{n} I(w_i, c), \tag{6.2} \]

where $P^{*}(c)$ is the probability of a category associated with a term. The categorisation
of a multi-word term into one of the categories amounts to the identification of the best
fitting category $c^{*}$ for a term, based on the words in the term

\[ c^{*} = \arg\max_{c} A(T, c). \tag{6.3} \]
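Equations 6.1 to 6.3 can be illustrated with a small sketch. It is not the re-implementation used in this study: the probability estimates (relative frequencies over word tokens for P(w), P(c) and P(w, c), and relative frequencies over terms for P*(c)), the treatment of unseen word-category pairs as contributing zero, and the toy reference set are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(labelled_terms):
    """Count word/category statistics from (term, category) pairs."""
    word_cat, word, cat_tokens, cat_terms = Counter(), Counter(), Counter(), Counter()
    for term, category in labelled_terms:
        cat_terms[category] += 1
        for w in term.lower().split():
            word_cat[w, category] += 1
            word[w] += 1
            cat_tokens[category] += 1
    return word_cat, word, cat_tokens, cat_terms


def mutual_information(w, c, model):
    """I(w, c) as in equation 6.1, estimated from relative frequencies."""
    word_cat, word, cat_tokens, _ = model
    n = sum(word.values())
    if word_cat[w, c] == 0:
        return 0.0  # assumption: unseen pairs contribute nothing to the score
    return math.log2((word_cat[w, c] / n) / ((word[w] / n) * (cat_tokens[c] / n)))


def classify(term, model):
    """Return the category maximising A(T, c) (equations 6.2 and 6.3)."""
    cat_terms = model[3]
    total_terms = sum(cat_terms.values())

    def association(c):
        prior = cat_terms[c] / total_terms  # assumed estimate of P*(c)
        return prior * sum(mutual_information(w, c, model) for w in term.lower().split())

    return max(cat_terms, key=association)


# Toy reference set; the labels follow the MAN scheme but the terms are illustrative.
reference = [
    ("complex formation", "BINDING"), ("hydrogen bond", "BINDING"),
    ("point mutation", "CHEM MOD"), ("phosphorylation site", "CHEM MOD"),
]
model = train(reference)
print(classify("covalent complex formation", model))
```

Because only the lexical constituents of a term are used, a longer unseen term such as “covalent complex formation” can still be categorised through the words it shares with labelled terms.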

The reference set was generated using maximal length noun phrase (MLNP) analysis.
The assumption of this approach is that textual features co-occurring with a residue
within a noun phrase (NP_r) are good candidate terms for functional annotation. In
order to identify the boundaries of these candidate terms, the MLNP algorithm relies on
a lookup in a predetermined set of noun phrases without nested residue entities (NP_¬r). In
other words, the algorithm assumes that terms nested in NP_r are also expressed as
stand-alone noun phrases, which can be identified by a broad syntactical analysis of MEDLINE.

The following is an example for illustration. Consider the term 

”complex formation”, 

which is identified as a stand-alone noun phrase NP_¬r in the sentence

“The GlyNH2 was removed and the reactive-site peptide bond X18-Glu19
was synthesized by complex formation with proteinase K.”
(PMID:9047374).

The same term co-occurs with a residue entity within another noun phrase (NP_r)

”Rb-E2F-DNA complex formation” 

in the sentence 

”MDM2 also interacts with Rb through its central acidic domain and inhibits 

Rb function in part by blocking Rb-E2F-DNA complex formation.” 

(PMID:16337594). 

The determined MLNP in this example is ”complex formation”. 
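The lookup behind this example can be sketched as a search for the longest contiguous word sequence of NP_r that also occurs in the set of stand-alone noun phrases. The exact matching procedure is not specified here, so the function below is an assumption made for illustration, and the set of stand-alone noun phrases is a toy sample.

```python
def maximal_length_np(np_with_residue, standalone_nps):
    """Return the longest contiguous sub-phrase that is a known stand-alone NP.

    np_with_residue: a noun phrase containing a residue entity (NP_r).
    standalone_nps: set of noun phrases without residue entities (NP_not_r).
    Returns None if no sub-phrase is found in the set.
    """
    words = np_with_residue.split()
    # Try the longest candidates first so that the first hit is maximal.
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            candidate = " ".join(words[start:start + length])
            if candidate in standalone_nps:
                return candidate
    return None


standalone = {"complex formation", "formation", "proteinase K"}
print(maximal_length_np("Rb-E2F-DNA complex formation", standalone))
```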

Once the set of MLNPs was extracted, each item (NP) was manually labelled according
to a categorisation scheme. Within this study, two categorisation schemes (cf. table 6.1)
were used independently and studied: the categories defined by manual analysis of
MEDLINE sentences (bottom-up approach), and the categories defined as keys in the feature
table of UniProtKB (top-down approach). The sets of categories from the bottom-up
approach and from the top-down approach are referred to as MAN and FEAT in this study.
Table 6.2 compares the distribution of labels within the reference set.

The following example illustrates how a determined MLNP can be used to find relevant
information in the contextual features of a protein residue. From the sentence


MAN                     FEAT
Category   Frequency    Category   Frequency
STR COMP   433          DOMAIN     28
                        MOTIF      8
                        TOPO DOM   4
                        CHAIN      2
                        TRANSMEM   2
                        COILED     1
CHEM MOD   361          VARIANT    275
                        MOD RES    59
                        PEPTIDE    13
                        VAR SEQ    6
                        LIPID      3
                        CARBOHYD   1
STR MOD    25           REGION     100
                        SITE       246
BINDING    195          BINDING    139
                        METAL      25
                        DISULFID   11
                        CROSSLNK   10
                        DNA BIND   6
                        NP BIND    5
                        ZN FING    2
                        CA BIND    1
ENZ ACT    90           ACT SITE   110
CELL       161          N/A
GEN BIOL   2,172        GEN BIOL   2,372
GEN ENG    643          GEN ENG    651

Table 6.2: Category distribution in the text feature reference set. The text feature reference set was
compiled from maximal length noun phrase analysis (MLNP) from two sets of noun phrases: one without
residue mentions and the other with identified protein residue entities. The features in the reference set
were manually assigned with labels of the categorisation schemes MAN and FEAT. GEN BIOL = general
biological terminologies; GEN ENG = general English words.


”Mutation K241Q completely abolishes DNA glycosylase activity and 

covalent complex formation in the presence of NaBH4.” (PMID:9241232), 

the following relation can be identified 

mutation K241Q—abolish—covalent complex formation. 

A semantic label can be assigned to the relation argument ”covalent complex formation” 

because the term ”complex formation” is labelled in the reference set. 


The extraction of contextual features of residues results in a set of syntactical relations,
which are represented as PAS. The performance of this extraction module was evaluated
by comparing the returned PAS data with the manual annotations in the gold standard test
corpus (cf. section 5.2). A true positive was counted if the syntactical relations in a PAS
were correct and the arguments in the PAS contained the annotated residue entity and
the marked keyword(s) in the test corpus. If either of these conditions was not met, a
false positive was registered. The performance was measured in terms of precision, recall
and F1-measure, as described earlier in section 5.3.

The performance of the developed classification method was evaluated by 100 repetitions
of 5-fold cross-validation. In each iteration, the terms in the reference set were shuffled and
partitioned into a test set (1/5 of the data) and a training set (4/5 of the data). The
average precision, recall and F1-measure (cf. section 5.3) were calculated for each classifier
from the resulting confusion matrix.
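The partitioning scheme can be sketched as a generator of training and test splits. This is a minimal illustration; it assumes that each of the 100 repetitions reshuffles the reference set and then uses every fifth of the data once as the test set, and the function name and the fixed random seed are hypothetical.

```python
import random


def repeated_kfold(items, k=5, repeats=100, seed=0):
    """Yield (training set, test set) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    for _ in range(repeats):
        shuffled = items[:]
        rng.shuffle(shuffled)
        for fold in range(k):
            # Every k-th item, starting at the fold index, forms the test set.
            test = shuffled[fold::k]
            training = [x for i, x in enumerate(shuffled) if i % k != fold]
            yield training, test


splits = list(repeated_kfold(list(range(10)), k=5, repeats=2))
print(len(splits))
```

Each yielded pair would be used to train the classifier on the training part and to add its predictions on the test part to the confusion matrix.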


                 PAS
Method           Available  Extracted  Common  Precision  Recall  F1
Shallow parsing  117        82         56      0.68       0.48    0.56
Full parsing     117        86         32      0.37       0.27    0.31

Table 6.3: Evaluation of syntactical language parser performance. The performance of the two language
parsers (shallow and full parsing) was evaluated on the basis of precision, recall and F1 measures by
comparing the annotated PAS data in the test set with the returned PAS output from the parsers.

6.3 Results 

In this section, the performances of contextual feature extraction and categorisation are 

studied. The test dataset is the gold standard corpus. 

6.3.1 Contextual feature extraction evaluated 

The objective in contextual feature extraction is to find textual features that are suitable 

as functional annotations for protein residues. 

In this section, the performance of this extraction system is studied by comparing 

the results produced with two different language parsers: the shallow parser, and the full 

parser. Sentences from the gold standard corpus (GC) were used as test dataset for this 

analysis. 

Within this study, the analysis determined that the developed shallow parser performs
better than the full parser ENJU. The shallow parser yielded an F1 measure of 0.56
(precision of 0.68 and recall of 0.48), while the full parser ENJU yielded an F1 measure
of 0.31 (precision of 0.37 and recall of 0.27) (cf. table 6.3).

The results suggest that contextual information on a residue entity can be extracted
by syntactical analysis with an F1 measure of 0.56 for shallow parsing and 0.31 for
full parsing.
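The measures in table 6.3 follow from the three PAS counts per parser. The sketch below recomputes them; the function name is illustrative. As an observation from the arithmetic only, the unrounded counts for full parsing give an F1 of about 0.315, whereas the reported value of 0.31 is obtained when F1 is computed from the rounded precision and recall.

```python
def precision_recall_f1(available, extracted, common):
    """Compute precision, recall and F1 from PAS counts.

    available: PAS annotated in the gold standard,
    extracted: PAS returned by the parser,
    common: returned PAS that agree with the gold standard.
    """
    precision = common / extracted
    recall = common / available
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for method, counts in {"Shallow parsing": (117, 82, 56), "Full parsing": (117, 86, 32)}.items():
    p, r, f1 = precision_recall_f1(*counts)
    print(f"{method}: precision {p:.3f}, recall {r:.3f}, F1 {f1:.3f}")
```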


6.3.2 Performance analysis of the classifiers 

One problem in functional annotation extraction is the semantic interpretation of the
extracted text data. The solution proposed in this work is based on a classification
approach. Two different categorisation schemes were tested in this study: MAN and
FEAT. The performance of the developed classification method was evaluated by repeated
cross-validation studies. Table 6.5 summarises the results derived from the confusion
matrix (cf. table 6.4).

For MAN, the three best-performing classifiers, with F1 measures of 0.62, 0.57, and 0.57,
are STR COMP (precision of 0.56, recall of 0.69), CHEM MOD (precision of 0.54, recall
of 0.59) and BINDING (precision of 0.63, recall of 0.52). Averaged over the whole
classification system, this categorisation scheme yielded a precision of 0.48 and a recall
of 0.42. In comparison, the classification based on FEAT has a much lower average
performance: an average precision of 0.24 and an average recall of 0.18. The weak
performance of the FEAT classifiers is explained by the distribution of examples across
the categories; for some categories, the number of corresponding features or examples
is low (cf. table 6.2). A discussion is presented in section 6.4.

Examining the false positives in the confusion matrix of MAN reveals that the classifiers
most often confuse categories with GEN BIOL (general biological terms) or GEN ENG
(general English terms). This is not surprising, considering that English terms are ambiguous.
In addition, some categories are confused with others, e.g. STR COMP with
CHEM MOD, and ENZ ACT with STR COMP. One explanation is that some terms
can be assigned to more than one category. For example, "mutant structure" refers to
an altered protein structure state, which is based on a chemical change in the protein
sequence.

Despite the moderate performance of some classifiers, the presented method can be
used to assign categories to textual features. However, the performance of some
classifiers has to improve significantly before the system can be used automatically.


                                             Prediction
Actual       BINDING  GEN BIOL      CELL  CHEM MOD   GEN ENG   ENZ ACT  STR COMP   STR MOD
BINDING        1,772       762        28        93       165        26       546         0
GEN BIOL         560    15,815       525     1,496     4,514       159     1,714        65
CELL              96     1,167       836       150       325        91        67         0
CHEM MOD          38     1,103        12     3,742       761        79       546        25
GEN ENG          144     2,556       126       510     1,820        46       480        35
ENZ ACT           33       338        80       201       226       324       457         0
STR COMP         160       783        64       551       592        35     4,914        11
STR MOD            1        91         1       129       125         0        21        43

Table 6.4: Performance analysis of the classifiers (confusion matrix). Classification with categories
from MAN was analysed by cross-validation with 100 iterations. The result is represented as a
confusion matrix (rows: actual category; columns: predicted category).
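The per-category values for MAN follow directly from the confusion matrix in table 6.4: precision is the diagonal count divided by the column sum, and recall is the diagonal count divided by the row sum. The following is a minimal sketch of that calculation; the names `LABELS`, `CONFUSION` and `per_class_metrics` are illustrative assumptions, and the counts are copied from table 6.4.

```python
LABELS = ["BINDING", "GEN BIOL", "CELL", "CHEM MOD",
          "GEN ENG", "ENZ ACT", "STR COMP", "STR MOD"]

# Rows: actual category; columns: predicted category (counts from table 6.4).
CONFUSION = [
    [1772,   762,  28,   93,  165,  26,  546,  0],
    [560,  15815, 525, 1496, 4514, 159, 1714, 65],
    [96,    1167, 836,  150,  325,  91,   67,  0],
    [38,    1103,  12, 3742,  761,  79,  546, 25],
    [144,   2556, 126,  510, 1820,  46,  480, 35],
    [33,     338,  80,  201,  226, 324,  457,  0],
    [160,    783,  64,  551,  592,  35, 4914, 11],
    [1,       91,   1,  129,  125,   0,   21, 43],
]


def per_class_metrics(labels, matrix):
    """Return {label: (precision, recall, f1)} for a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum
        precision = true_positives / predicted_total if predicted_total else 0.0
        recall = true_positives / actual_total if actual_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


METRICS = per_class_metrics(LABELS, CONFUSION)

for label in ("BINDING", "STR COMP"):
    p, r, f = METRICS[label]
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

For BINDING and STR COMP, for example, the rounded results match the precision, recall and F1 values listed for MAN in table 6.5.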


MAN                                   FEAT
Category   Precision  Recall  F1      Category   Precision  Recall  F1
STR COMP   0.56       0.69    0.62    DOMAIN     0.50       0.24    0.32
                                      MOTIF      0.98       0.36    0.53
                                      TOPO DOM   0          0       0
                                      CHAIN      0          0       0
                                      TRANSMEM   0          0       0
                                      COIL       0          0       0
CHEM MOD   0.54       0.59    0.57    VARIANT    0.50       0.69    0.58
                                      MOD RES    0.40       0.23    0.29
                                      PEPTIDE    0.05       0.06    0.05
                                      VAR SEQ    0          0       0
                                      LIPID      1          0.32    0.48
                                      CARBOHYD   0          0       0
STR MOD    0.24       0.10    0.15    REGION     0.44       0.44    0.44
                                      SITE       0.40       0.55    0.46
BINDING    0.63       0.52    0.57    BINDING    0.41       0.45    0.43
                                      METAL      0.05       0.02    0.03
                                      DISULFID   0.53       0.15    0.23
                                      CROSSLNK   0          0       0
                                      DNA BIND   0          0       0
                                      NP BIND    0          0.06    0
                                      ZN FING    0          0       0
                                      CA BIND    0          0       0
ENZ ACT    0.43       0.20    0.27    ACT SITE   0.45       0.31    0.36
CELL       0.50       0.31    0.38    N/A
GEN BIOL   0.70       0.64    0.67    GEN BIOL   0.76       0.65    0.70
GEN ENG    0.21       0.32    0.26    GEN ENG    0.23       0.32    0.27
Average    0.48       0.42    0.43    Average    0.25       0.18    0.19

Table 6.5: Performance evaluation of the classifiers (precision, recall, F1 measure). Evaluation of the
classification of textual features (noun phrases). Classification with categories from MAN and FEAT was
analysed by cross-validation with 100 iterations. The performance was measured in terms of
precision, recall, and F1 measure.


One option is to increase the amount of training data, or the number of features for each
classifier. Another alternative is to modify the definition of the classes. The results suggest
that the algorithm is, in general, suitable for classification.


The presented text mining solution extracts textual features from the context of residue
entities. The identification of the contextual features, and their association with the residue
entity, is based on the syntactical analysis of the sentence. More specifically, only a subset
of semantic relations, namely those found in verbal and prepositional relations, is extracted
from text. The advantage of this approach is that not only the semantic relation partners
and the semantic relation type are found, but contextual information is extracted as well.

Within this study two approaches to syntactical analysis were compared, i.e. shallow
parsing and full parsing. The results indicate that the ENJU parser performed worse than
the developed shallow parser. Manual analysis of the false positives indicates that
incorrectly determined syntactical structures originate from incorrect part-of-speech (POS)
tagging. For example, in the sentence

"Conversely, K382Q displays a highly altered responsiveness to the activator,
suggesting that Lys(382) is involved in both activator binding and
allosteric transition mechanism." (PMID:10751408),

both parsers identified "altered" as a verb in the past tense, although the correct POS is a
noun modifier. The performance of the POS tagger is critical for the detection of phrase
boundaries. However, the two parsers rely on different methods for POS tagging, so the
performance of the POS tagger has to be considered as well when comparing the shallow
and the full parser. Table A.1 lists some examples where a parser failed to extract the
annotated PAS data from GC.


The extracted information is difficult to normalise, because there is no gold standard
for how to represent the association and how to qualify the contextual information. In
this work, the predicate-argument structure is used as a template for the extracted information.
Although verb frame sets from PropBank or PASBio can be used to normalise
the extracted data, they are not designed to capture descriptions of protein residue function.
On the other hand, this gives the extraction method the advantage of being able to discover
new knowledge. Because the extracted information is not normalised, the performance can
only be measured in terms of sensitivity.

The evaluation of the classification method indicates that the presented approach can
provide an automatic solution for text interpretation. However, some of the categories
have only a few examples, which is reflected in the weak performance of the corresponding
classifiers. One solution to this problem is to balance the example sets of the categories,
for example by collecting more terminology from MEDLINE. Alternatively, other categories
may be defined to balance the ratio between a category and its associated set of examples.
Yet another approach is not to classify the arguments of a PAS, but to cluster them based
on, for example, their contextual usage. The advantage here is that more similarities among
the PAS data can be found, because clustering does not depend on the representativeness
of a training (reference) set.

Although semantic labels can be assigned to the arguments in a PAS,
the developed method is not able to interpret the meaning of the whole extracted text
segment. For example, from the sentence

"Specific binding of the WT and mutant receptors Cys14Ala and
Cys199Ala was inhibited in the presence of the disulfide bond reducing
agent, DTT, implying that disulfide bonds are formed and can be
reduced in these mutant receptors." (PMID:9202220),

the following information was extracted, and semantic categories were assigned to the
arguments of the PAS:

pred    = inhibited
arg1    = Specific binding
arg1-of = [the WT and mutant receptors CYS14 ALA and CYS199 ALA]/CHEM MOD
arg2-in = the presence
arg2-of = the disulfide bond reducing agent.

Although one part of the information in the example has been correctly assigned the
label CHEM MOD, the entire text phrase should be labelled BINDING. A solution
to this problem is not trivial and requires several levels of linguistic analysis.
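To make the limitation concrete, the extracted PAS from this example can be written down as a small data structure in which a semantic category is attached to individual arguments only. This is a hypothetical sketch: the class names `Argument` and `PAS` and their fields are assumptions chosen for illustration and do not describe the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Argument:
    role: str                       # e.g. "arg1", "arg1-of", "arg2-in"
    text: str                       # the extracted phrase
    category: Optional[str] = None  # semantic label, if one was assigned


@dataclass
class PAS:
    predicate: str
    arguments: List[Argument] = field(default_factory=list)
    pmid: Optional[str] = None

    def labelled(self) -> List[Argument]:
        """Arguments that carry a semantic category."""
        return [a for a in self.arguments if a.category is not None]


# The example from the text (PMID:9202220).
example = PAS(
    predicate="inhibited",
    pmid="9202220",
    arguments=[
        Argument("arg1", "Specific binding"),
        Argument("arg1-of",
                 "the WT and mutant receptors CYS14 ALA and CYS199 ALA",
                 category="CHEM MOD"),
        Argument("arg2-in", "the presence"),
        Argument("arg2-of", "the disulfide bond reducing agent"),
    ],
)

print([(a.role, a.category) for a in example.labelled()])
```

In such a representation the label is a property of a single argument, so nothing in the structure expresses that the statement as a whole describes binding.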


In this chapter, I have presented the developed contextual feature extraction system for 

the annotation of residue entities. Because a suitable terminological resource is not available, 

the identification of functional annotation is based on the extraction of syntactical 

relations between a residue entity and a noun phrase. The developed method allows the 

discovery of novel information that can provide key information for functional annotation. 

In the next chapter, I will demonstrate the validity of the extracted information as 

functional annotation of protein residues. 


Chapter 7 

Extraction of functional annotation 

for protein residues from MEDLINE 

In the previous two chapters, two fundamental text mining components for functional
annotation extraction were presented. In this chapter, I provide the results of the combined
extraction and assess the performance of the combined system. The objective in
this study is to determine the qualitative and quantitative distribution of information in
MEDLINE. Because the information is derived solely from biomedical abstract texts, it
is necessary to examine the data in terms of validity, novelty, and biological significance.

In the first part of the evaluation, the performance of the functional annotation extraction
is studied on the gold standard corpus. Then the biological significance of the
data extracted from MEDLINE is studied on two example proteins, the tumour suppressor
protein p53 and the Janus kinase 2 protein. Finally, the distribution of information is examined
by two specific analyses: the cross-validation of identified active site residues with CSA,
and the cross-validation of binding residues with MSDsite.



The evaluation of the functional annotation extraction system was based on the performance 

analysis of its extraction components: protein residue identification, and contextual 

feature extraction (cf. section 5.3 and section 6.2). 

The biological validity of the mined functional annotations was assessed by
manual analysis. For each protein residue, the set of extracted annotations was reviewed
and grouped by similar topics. Because the set of annotations for a protein
residue can be very large, random samples were drawn from a list of annotations sorted
by residue name and position. The result is a set of sample annotations for each extracted
residue of a protein. The information was compared with the corresponding annotations
in UniProtKB.

The validation of catalytic residues was done by cross-validation with CSA [PBT04].
The analysis was performed on three levels: the comparison of protein residues identified
in MEDLINE with CSA, the comparison of residues with extracted functional annotations,
and the comparison of residues with extracted annotations classified as ENZ ACT
(cf. section 6.1.2). The residues were compared by using the combination of the identifiers
RID+UID (cf. section 5.3).

The validation of binding residues from MEDLINE extraction was done accordingly,
by cross-validation with MSDsite. The third level of validation compared residues with
extracted annotations classified as BINDING.
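The three levels of comparison can be pictured as successive set intersections over RID+UID pairs. The sketch below only illustrates this idea under stated assumptions: the function name `three_level_overlap` is hypothetical, and the residue and protein identifiers are invented toy values, not data from CSA, MSDsite or MEDLINE.

```python
from typing import Dict, Set, Tuple

Residue = Tuple[str, str]  # (RID, UID)


def three_level_overlap(reference: Set[Residue],
                        identified: Set[Residue],
                        annotated: Set[Residue],
                        classified: Set[Residue]) -> Dict[str, Set[Residue]]:
    """Intersect a reference set with three increasingly specific subsets."""
    level1 = identified & reference  # residues found in text and in the reference
    level2 = level1 & annotated      # ... that also have extracted annotations
    level3 = level2 & classified     # ... whose annotations carry the target label
    return {"identified": level1, "annotated": level2, "classified": level3}


# Toy data: identifiers are made up for illustration only.
reference = {("HIS57", "UID_A"), ("ASP102", "UID_A"),
             ("SER195", "UID_A"), ("CYS25", "UID_B")}
identified = {("HIS57", "UID_A"), ("SER195", "UID_A"),
              ("CYS25", "UID_B"), ("LYS10", "UID_C")}
annotated = {("HIS57", "UID_A"), ("SER195", "UID_A"), ("LYS10", "UID_C")}
classified = {("SER195", "UID_A")}

levels = three_level_overlap(reference, identified, annotated, classified)
counts = {name: len(pairs) for name, pairs in levels.items()}
print(counts)
```

The same function applies to both validations: only the reference set and the target label (ENZ ACT or BINDING) change.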


7.2 Results 

7.2.1 Evaluation of the developed functional annotation extraction 

system 

The presented functional annotation extraction system consists of two basic modules: 

identification of protein residues, and contextual feature extraction. The following describes 

an analysis of the overall performance of the combined text mining system. The 

test set is the gold standard corpus (GC; cf. section 5.2). The evaluation was done 

in two respects: manual validation of extracted information, and cross-validation with 

UniProtKB annotations. 

Manual validation of extracted information. 

The gold standard corpus consists
of 100 abstract texts with tri-occurrences of the triplet protein, residue and organism.
However, manual analysis identified only 51 abstract texts with residue entities that can
be associated with their proteins and host organisms. The number of associations
(OPR) is 172. This represents the target for protein residue identification.

Corresponding to these OPRs is the set of functional annotations (PAS data). For 109 

out of 172 OPRs, keywords were co-mentioned in verbal relations. The number of PAS 

associated with the 109 OPRs is 117. This represents the target of functional annotation 

extraction. 

Figure 7.1 summarises the performance of the functional annotation extraction. With
a previously determined precision of 0.82 and a recall of 0.38, the protein residue identification
module detects 79 OPRs, 65 of which are correct. Contextual
feature extraction for these 65 protein residues resulted in 35 PAS data. In comparison
with the 117 annotated PAS of the 109 OPRs, only 16 out of 35 extracted PAS are true
positives. However, the total number of extracted PAS is 46, which results in a precision
of 0.35 and a recall of 0.13. A systematic analysis revealed that the false positives
have the following sources: a false positive OPR with extracted PAS, a true positive
OPR with no annotated PAS, and a true positive OPR with a false positive PAS.

PAS data (GC): 117 annotated, 46 extracted, 16 true positives; precision 0.35, recall 0.13, F1 0.25.

Figure 7.1: Performance evaluation of the functional annotation extraction system. The performance
depends on the two combined text mining modules: protein residue identification and contextual
feature extraction. The performance was measured in terms of precision, recall, and F1 measure.
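The relation between the counts and the reported scores can be traced with a short calculation. This is a sketch only: the helper `precision_recall_f1` is a hypothetical name, and the counts are taken from the text (65 correct out of 79 detected OPRs against 172 annotated; 16 correct out of 46 extracted PAS against 117 annotated).

```python
from typing import Tuple


def precision_recall_f1(true_positives: int,
                        n_extracted: int,
                        n_annotated: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 where undefined)."""
    precision = true_positives / n_extracted if n_extracted else 0.0
    recall = true_positives / n_annotated if n_annotated else 0.0
    total = n_extracted + n_annotated
    f1 = 2 * true_positives / total if total else 0.0
    return precision, recall, f1


# Protein residue identification: 65 correct of 79 detected, 172 annotated OPRs.
opr_scores = precision_recall_f1(65, 79, 172)

# Functional annotation extraction: 16 correct of 46 extracted, 117 annotated PAS.
pas_scores = precision_recall_f1(16, 46, 117)

print("OPR: precision=%.2f recall=%.2f F1=%.2f" % opr_scores)
print("PAS: precision=%.2f recall=%.2f F1=%.2f" % pas_scores)
```

The computed precision values (0.82 and 0.35) and the OPR recall (0.38) agree with the reported ones. For the PAS data, the counts give a recall of about 0.14 and an F1 of about 0.20, so the reported values of 0.13 and 0.25 may reflect truncation or a different way of computing the scores; the sketch does not attempt to resolve this.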

In comparison, if the system had identified all protein residues correctly, the
whole extraction would have yielded a precision of 0.68 and a
recall of 0.48 (cf. section 6.3). Considering that the presented text mining solution is a pilot
approach to extracting functional annotations for the validation of predicted functional sites,
the result is good for this area and comparable to the first studies in BioCreAtIvE or the Critical
Assessment of Techniques for Protein Structure Prediction (CASP). The recall can be
explained by the performance of the contextual feature extraction module.

The result indicates that the extracted functional annotations have a reasonable precision
in this first attempt at functional annotation extraction, but a low coverage.
This can be explained by the combined performance of the text mining modules. On
the one hand, an incorrectly determined protein residue leads to a false positive PAS. On
the other hand, a failed entity recognition contributes to the false negative rate. In addition,
language complexity and incorrectly parsed sentences are further reasons for the
false positive and false negative rates of functional annotation extraction.

In conclusion, the presented functional annotation extraction system delivers precise
information, but has a low coverage. However, in the context of the bioinformatics
work of this thesis, a precision-driven extraction system is preferred over a
recall-oriented text mining solution.

Cross-validation with UniProtKB functional annotations. 

Despite the low coverage
of the functional annotation extraction system, the extracted information is correct
and reusable for the annotation of protein residues. Table B.1 lists the 16 verified PAS
data, corresponding to 17 verified protein residues. A comparison with UniProtKB shows
that 5 out of 16 are rediscovered knowledge. The remaining 11 contain novel
information that can be used to update the protein knowledge base.

The extraction of functional annotations is a multi-step process. Although the performance
of each module may not be at an optimal level, the results demonstrate that
functional annotations are available in, and extractable from, MEDLINE.

7.2.2 Studying mined functional annotations for the proteins 

p53 and Jak2 

UniProtKB curates functional annotations for proteins on three levels: protein level,
protein domain level, and protein residue level. The objective in this section is to study the
validity and novelty of functional annotations mined from the whole of MEDLINE.
The result provides an indication of the biological significance of automatic extraction
from MEDLINE. The annotations of two example proteins, p53 and Jak2, are analysed
and compared with relevant information from UniProtKB.

Tumour suppressor protein p53. 

p53 plays a critical role in preventing human cancer
formation. In the native state, the protein assembles into a tetrameric phosphoprotein.
It consists of four functional domains: (1) the proline-rich, acidic N-terminus, which is
involved in transcriptional activation, e.g. Mdm2 binding; (2) the central core, which
binds DNA; (3) the oligomerisation domain with nuclear localisation signals, which allows
the transfer into the nucleus; and (4) the C-terminus, which regulates DNA binding
[SYH+03].

The extraction of functional annotations from MEDLINE for the human tumour protein
p53 resulted in 1,665 PAS data. A manual analysis of samples of the mined functional
annotations indicates that there are two main topics: regulatory post-translational
modification, and the binding activity of residues, where in some cases the interaction
partner is also stated. Table C.1 lists example annotations grouped by similar topics. For 5
out of 6 of the identified residues with post-translational modification, i.e. THR18, SER46,
SER15, THR55, and SER315, the extracted information is similar to the annotations in
the UniProtKB entry. The remaining residue, SER6, has no annotation in UniProtKB.


The knowledge base does not provide further information on the biological implication
of these residues, while the extracted data contain more contextual information. For
example:

"[...]ATM-mediated phosphorylation of the ser15 site of p53[...]"
(PMID:14757188),

"[...]Ser46 phosphorylation activates p53-dependent apoptosis[...]"
(PMID:17172844).

The analysis also found annotations for some critical residues that are not recorded in
UniProtKB. For example:

"[...]the amino acid change C135R generates the loss of TP53 DNA-binding
activity[...]" (PMID:17914575),

"[...]R248W abolish the association with p63[...]" (PMID:11172034).

The activity of p53 is thought to be regulated through a number of post-translational 

modifications at the N- and C-terminal regions. Review articles report that seven serines 

(SER6, SER9, SER15, SER20, SER33, SER37, and SER46) and two threonines (THR18, 

and THR81) in the N-terminal domain are modified by kinases upon exposure of cells to 

ionising radiation or UV light. The analysis shows that MEDLINE extraction can recover 

this information for the residues SER6, SER15, SER46, and THR18. 

Janus Kinase 2 (Jak2). 

Jak2 plays a crucial part in various growth factor and cytokine
signalling pathways. Like other protein tyrosine kinases of the Janus kinase
family, Jak2 consists of a tyrosine kinase domain and a tyrosine kinase-like domain. It is
thought that the kinase-like domain can negatively regulate the kinase domain.


The set of extracted functional annotations for Jak2 comprises 624 PAS data and
contains information on only seven residues: L539 (1 annotation), W515 (1 annotation),
K607 (2 annotations), V617 (630 annotations), F617 (5 annotations; a reported variant
associated with Budd-Chiari syndrome), V678 (3 annotations), and D816 (1 annotation).
A comparison with UniProtKB data shows that the extracted information for F617, K607,
and L539 is similar to the annotations in the database. These and the other annotations for
D816, V678, and W515 describe mutation events (data not shown).

In order to assess the extracted information on V617, random samples were selected
and studied manually. The analysis indicates that the set of annotations
contains a lot of redundant information. The data can be grouped into two main topics:
disease, and genetic origin. Table D.1 lists some examples of extracted functional
annotations.

The effect of mutating residue 617 on cellular function, and its association with particular
diseases, has already been reported, but none of the extracted annotations provides a
molecular explanation. A survey of research publications on Jak2 revealed that myeloid
and lymphoid malignancies are associated with Jak2 V617F. It is proposed that the
mutated residue 617 destabilises the interaction between the kinase and kinase-like domains,
and thereby promotes activation of kinase activity [POHS05]. These results suggest that the
extracted information reflects pieces of evidence; however, their biological relations may
not be available in the mined output or even in MEDLINE.

In summary, the study of the mined functional annotations of residues for the two proteins
presented here indicates that MEDLINE contains information that recurs
in a number of abstract texts. Despite the data redundancy, some functional annotations
are not contained in UniProtKB, indicating that MEDLINE extraction can contribute
original information.


7.2.3 Cross-validation of mined catalytic residues with CSA 

In the previous section, functional annotations were extracted from MEDLINE, and for a
range of annotations the contained information was analysed for its biological validity and
novelty. This section focuses on enzyme-related information in the extracted annotations.
The objective is to study how reliable the extracted information is for the validation of
catalytic residues. The identified residues with these associated annotations are compared
with CSA. Figure 7.2 summarises the result of this analysis.

The CSA lists 12,971 protein residues (RID+UID), of which 799 were identified in
MEDLINE. The 12,172 CSA residues that were not found can be explained by the performance
of the identification system (cf. section 5.4). Another explanation is that CSA
is curated from full-text publications, and the same information may not be
available in MEDLINE abstracts.

By selecting residues with extracted functional annotations from MEDLINE, 691 out
of 799 protein residues were retained. This result indicates that many functional descriptions
are available as contextual features of the identified protein residues. The result
is consistent with previous performance evaluation studies (cf. section 6.4). With a precision
of 0.43 and a recall of 0.20, the classifier for the category ENZ ACT (cf. section 6.3)
identified enzyme-related functional annotations for 77 out of 691 protein residues. Manual
analysis shows that this reduction can be explained by the classifier's performance.
Another explanation is the absence of relevant contextual cues in the extracted text.

A search for the term "catalytic triad" in the sentences of the identified protein residues
yielded a sub-selection of 221 out of 46,750 residues. A comparison with CSA shows
that 44 out of 221 are rediscoveries of active site residues. The annotations for the
remaining 177 may contain supporting evidence to identify the residues as catalytic. A
systematic analysis of these predicted catalytic residues should start with the 27 out of
177 residues that have annotations classified as ENZ ACT.

In conclusion, the developed text mining system rediscovers active site residues by
solely mining abstract text from MEDLINE. While the rate of false positives is not known,
the extraction identified 1,391 protein residues with enzyme-related functional annotations.
The significance of these potentially new CSA residues is being studied further in
ongoing work.

Figure 7.2: Cross-validation of text mined catalytic residues with CSA. The analysis was based
on the comparison of the determined RID+UID pairs. The numbers reflect the determined RID+UID
pairs. RID = Residue identifier; UID = UniProt identifier.

Figure 7.3: Cross-validation of text mined binding residues with MSDsite. Annotation was studied
on three levels: the mentioned protein residue alone, the residue with PAS data, and the residue with
information on binding. The numbers indicate the counted RID+UID pairs in the data. RID = Residue
identifier; UID = UniProt identifier.

7.2.4 Annotation of protein residues in MSDsite 

The MSDsite [GDO+05] holds a number of predicted ligand binding sites, obtained by automatically
analysing ligand-contacting residues in the PDB. The objective in this section is to analyse
how many of these binding residues can be annotated by mining MEDLINE.


The analysis shows that 512 out of the 46,750 protein residues identified in MEDLINE
are also contained in MSDsite (cf. figure 7.3). A large proportion of these residues are
associated with PAS data (429 out of 512), while only a small subset of 12 have information
classified as BINDING. Manual analysis shows that all of these 12 annotations are
correct. They can be used to validate the predicted ligand binding residues in MSDsite
(table E.1).

For the remaining 417 residues with PAS data, the associated PAS data may still contain
valid information for the annotation. However, a systematic analysis was not performed
at this stage of the study.

In summary, a relatively small set of protein residues recovered from MEDLINE extraction
can be used for the annotation of MSDsite entries.


The extraction of functional annotation is a multi-step process, and the quality of the
result has to be interpreted in the context of each subprocess's performance. Although the
performance of each extraction module may not be at an optimal level, the evaluation results
indicate that the mined output contains biologically meaningful data. Considering that the
validation of a predicted function requires evidence of biological function, the developed
text mining system can become a valuable tool, for example for the protein function
prediction assessment in the Critical Assessment of Techniques for Protein Structure Prediction
(CASP) [LRTV07]. With the improvement of the information extraction modules,
the mined functional annotations are expected to become more reliable.

The biological relevance of the extracted functional annotation was demonstrated on
two different proteins, p53 and Jak2. The results show that not only can information in
UniProtKB be rediscovered from MEDLINE, but novel information can be extracted
as well. These functional annotations can be considered to complement existing
annotations in UniProtKB. However, manual analysis of subsets of the extracted annotations
indicates that the information is represented redundantly in MEDLINE. One major
reason is that biological facts are expressed repeatedly within the biological community.

The study of identifying catalytic residues and binding residues from the mined functional
annotations, and the cross-validation with CSA and MSDsite, shows that the developed
text mining solution is able to find relevant data in MEDLINE. Although the
developed classifiers have a weak performance, it is not clear whether this completely
explains the cross-validation results. It is possible that key information that would identify
the biological role of the protein residues is not mentioned in abstract texts. Another
explanation is the protein residue identification performance, which had been
evaluated with a low recall score.

Although abstract texts cover only a subset of the information in full-text articles, and
information is represented repeatedly in MEDLINE, this study shows that the text mined
information is biologically valid and contains snippets of additional information that are
relevant for UniProtKB. For example, the extracted annotations complement existing
information in UniProtKB and provide first data on functional sites in proteins that have
not yet been curated.


In this chapter, two text mining components were combined to form the functional annotation
extraction system. Performance analysis shows that the system is precise, but
has a low coverage. However, the low recall is compensated by the fact that information
is distributed redundantly. The extracted information is biologically valid and contains
some novel data, which can be used to update UniProtKB. So far, functional annotations
of residues have been evaluated in isolation, i.e. independently of the structural context in
proteins. In the following chapter a biological context is created by combining functional
annotations with protein structure data (cf. chapter 3 and chapter 4).


Chapter 8 

Combining active site prediction 

with mined functional annotations 

The goal of this thesis is to combine information from two disjoint information resources.
To this end, various methodologies were developed for the prediction of functional sites
in proteins, and for the extraction of relevant information for the functional annotation of
protein residues from scientific articles. More specifically, a predicted functional site
can be validated by a set of functional annotations of protein residues. Conversely, a
set of functional annotations requires a structural context to understand the molecular
mechanism of a protein function.

In the previous chapters, I have presented the results on 3D pattern mining from PDB
(cf. chapter 3) and functional annotation extraction from MEDLINE (cf. chapters 5, 6,
and 7). Here, the produced datasets are combined and analysed. The objective in this
chapter is to validate predicted active sites that the data mining output may contain,
by combining them with specific functional annotations extracted from MEDLINE. The
result is compared with data from CSA.


Figure 8.1: Overview of processes and evaluation methods of combining the protein structure dataset 

and literature dataset. 


8.1.1 Combining protein structure data with literature data 

Theory 

The method to combine PDB with MEDLINE data, i.e. the functional annotation of a
residue from a protein structure, is based on the combination of two identifiers: RID+UID
(cf. section 5.3). Combining the datasets involves two major subtasks (cf. figure 8.1):
linking PDB entries to a UniProtKB entry, and associating a residue with its co-mentioned
protein in text.

Mapping residues in PDB to UniProtKB. 

The mapping between PDB and UniProtKB,
and the inherited mapping of a protein residue from a PDB entry to its UniProtKB sequence
index, is a non-trivial task. One problem is that the author of a determined protein
structure may have used an arbitrary residue index system that is not in accordance with the
wild-type protein sequence. Furthermore, residues in a protein deletion mutant may have
been numbered sequentially, irrespective of sequence gaps. Another problem is that
UniProtKB may not have the corresponding protein sequence for a crystallised protein,
which may be, for example, a novel splice variant.

In some cases, cross-links from PDB to UniProtKB, or from UniProtKB to PDB, are available.
However, over time the links may have become outdated. In order to find the correct
mapping between the protein residue indices in both databases, an exhaustive sequence
alignment is required. Various solutions and services have been provided for the periodic
update of UniProtKB-PDB mappings [VMMR+05] [Mar05] [VZHC05] [MSD08].

Here, I reuse a previously published lookup table file [Mar05] for the mapping of
protein residues in PDB to UniProtKB. Note that the lookup table is based on the
alignment analysis work of the Macromolecular Structure Database (MSD) group at the
European Bioinformatics Institute [MSD08].

Mapping protein residue in text to UniProtKB. 

The mapping of a residue entity 

in text to its co-mentioned protein, and ultimately the mapping to UniProtKB, is 

explained in section 5.1. 


The correct sequence index mapping of a PDB entry to its corresponding UniProtKB entry
was based on the lookup table produced by [Mar05] (version October 2008). An example
of the lookup table data is shown in figure 8.2. The combination of the following keys was
used to unambiguously map a residue from PDB to its native UniProtKB sequence position:
PDBID + chainID + RID.


PDB                                           UniProtKB
PDBID   chainID   serial   resName   resSeq   UID           resName   seqIndex
11gs    B         1        PRO       2        GSTP1_HUMAN   P         3
11gs    B         2        TYR       3        GSTP1_HUMAN   Y         4
11gs    B         3        THR       4        GSTP1_HUMAN   T         5
11gs    B         4        VAL       5        GSTP1_HUMAN   V         6
11gs    B         6        TYR       7        GSTP1_HUMAN   Y         8
11gs    B         7        PHE       8        GSTP1_HUMAN   F         9
11gs    B         8        PRO       9        GSTP1_HUMAN   P         10
11gs    B         10       ARG       11       GSTP1_HUMAN   R         12

Figure 8.2: Lookup table for PDB/UniProtKB mapping. Excerpt of the lookup table to map protein
residues from a PDB entry to the corresponding UniProtKB entry.
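The keyed lookup described above can be illustrated with a minimal sketch. It assumes that the lookup table has already been loaded into memory, that the residue part of the key is the PDB resSeq number, and that UniProtKB entry names are written with an underscore; the names UniProtPosition, LOOKUP, map_residue and rid_uid are hypothetical and only serve the illustration. The rows are the ones shown in figure 8.2.

```python
from typing import NamedTuple, Optional


class UniProtPosition(NamedTuple):
    uid: str        # UniProtKB entry name
    res_name: str   # one-letter residue code in the UniProtKB sequence
    seq_index: int  # position in the UniProtKB sequence


# Rows taken from the excerpt in figure 8.2.
# Key: (PDBID, chainID, resSeq); the choice of resSeq as residue key is an assumption.
LOOKUP = {
    ("11gs", "B", 2): UniProtPosition("GSTP1_HUMAN", "P", 3),
    ("11gs", "B", 3): UniProtPosition("GSTP1_HUMAN", "Y", 4),
    ("11gs", "B", 4): UniProtPosition("GSTP1_HUMAN", "T", 5),
    ("11gs", "B", 5): UniProtPosition("GSTP1_HUMAN", "V", 6),
    ("11gs", "B", 7): UniProtPosition("GSTP1_HUMAN", "Y", 8),
    ("11gs", "B", 8): UniProtPosition("GSTP1_HUMAN", "F", 9),
    ("11gs", "B", 9): UniProtPosition("GSTP1_HUMAN", "P", 10),
    ("11gs", "B", 11): UniProtPosition("GSTP1_HUMAN", "R", 12),
}


def map_residue(pdb_id: str, chain_id: str, res_seq: int) -> Optional[UniProtPosition]:
    """Return the UniProtKB position of a PDB residue, or None if it is unmapped."""
    return LOOKUP.get((pdb_id.lower(), chain_id, res_seq))


def rid_uid(position: UniProtPosition) -> str:
    """Format a mapped residue as the combined RID+UID identifier."""
    return f"{position.res_name}{position.seq_index} {position.uid}"


mapped = map_residue("11GS", "B", 3)
if mapped is not None:
    print(rid_uid(mapped))
```

Residues without an entry in the table simply return None, which corresponds to the residues of OLDFIELD that cannot be carried over to a UniProtKB sequence index.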


The validation of identified catalytic residues was done by manual examination of the
functional descriptions of annotated protein residues. Six datasets were used in this
analysis (cf. section 7.2): CSA is the set of active site residues from the Catalytic Site
Atlas [PBT04]; OLDFIELD is the set of residues in the non-redundant structure set from
[Old02]; PATTERN is the set of residues from the data-mined 3D patterns; OPR is the
set of protein residues identified from MEDLINE extraction; FA is the subset of OPR
that has functional annotations extracted from MEDLINE; and ENZ is the subset of
FA whose annotations are classified as ENZ_ACT, i.e. the information is
enzyme-related.

8.3 Results 

8.3.1 Protein residue mapping between three data resources 

This section gives an overview of the analysed datasets. Figure 8.3 summarises the data. 

OLDFIELD contains in total 341,365 protein residues, counted as RID+PDBID.
Of these, 328,796 residues are found in the lookup table, which corresponds to
280,521 RID+UID. In parallel, the residues from the mined 3D pattern set (PATTERN) were
mapped to 24,500 RID+UID. The identification of protein residues in MEDLINE found
a total of 132,476 RID+UID, of which 46,750 are unique. This dataset is
referred to as OPR. Of the 46,750 protein residues, 36,569 have functional annotations (FA),
and a subset of 1,467 out of these 36,569 have annotations classified as ENZ_ACT
(ENZ). A set analysis between OLDFIELD and OPR determined 2,402 common protein
residues, 197 of which are also listed in CSA.

Figure 8.3: Overview of the combined datasets from protein structure data and biomedical literature
data. The combined dataset is analysed to identify active site residues. CSA = active site database; OPR
= identified protein residues; PAS = contextual feature assigned to a protein residue; ENZ = contextual
feature with enzyme-related information; OLDFIELD = protein structure subset from PDB; PATTERN
= data mined structural features from OLDFIELD.

In summary, for a large fraction of protein residues in OLDFIELD, a mapping to
UniProtKB sequence indices is available. However, only 2,402 of these residues are
recovered by MEDLINE extraction and can be used for validation.
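Once all residues are expressed as RID+UID identifiers, the set analysis reduces to plain set operations. The following is a minimal sketch under that assumption; the function name partition_residues and the toy identifiers are hypothetical and do not correspond to real entries.

```python
def partition_residues(oldfield: set, opr: set, csa: set) -> dict:
    """Split the residues shared by structure and literature data by CSA membership."""
    common = oldfield & opr          # residues found in both OLDFIELD and OPR
    return {
        "common": common,
        "known": common & csa,       # also listed in CSA
        "candidates": common - csa,  # shared, but not listed in CSA
    }


# Toy RID+UID identifiers (hypothetical, for illustration only).
oldfield = {"A10 PROT1_TOY", "C25 PROT1_TOY", "H57 PROT2_TOY", "S90 PROT3_TOY"}
opr = {"C25 PROT1_TOY", "H57 PROT2_TOY", "K12 PROT4_TOY"}
csa = {"H57 PROT2_TOY"}

result = partition_residues(oldfield, opr, csa)
print(sorted(result["common"]))
print(sorted(result["candidates"]))
```

With the counts reported above, 2,402 common residues of which 197 are listed in CSA leave 2,402 - 197 = 2,205 residues that are shared by OLDFIELD and OPR but absent from CSA.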

8.3.2 Rediscovery of active sites and catalytic residues 

The identification of catalytic residues from protein structure data mining, and from 

biomedical literature mining was studied previously (cf. sections 4.2 and 7.2). Each 

result was evaluated by cross-validation with CSA. This section studies the validation of 

predicted active sites from the combined datasets. 

Previously, three structural patterns were identified as active sites by cross-validation
with CSA (cf. chapter 4). One of the patterns represents the well-known catalytic triad.
This pattern was found in 19 proteins within the dataset (cf. section 4.2). Associated
with these 19 proteins is a set of 57 protein residues. The analysis shows that only 3 out
of 57 residues were identified in MEDLINE. The 3 residues identified in text belong
to the same protein, bovine chymotrypsinogen (cf. table 8.1). The associated functional
annotations for the residues ASP102 and HIS57 were not classified as ENZ_ACT. The
information contained in these annotations only indirectly indicates the catalytic property
of these residues; the annotations do not mention them as part of the catalytic triad. In
conclusion, this structure-based prediction of an active site could not be validated by
literature data.

The intersection of PATTERN, OPR, and CSA results in a set of 15 protein residues. 


RID+UID:   S195 CTRA_BOVIN; D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence:  "These include the NH2-terminal four residues, the sequences near histidine-57
           (chymotrypsinogen A numbering system), aspartic acid-102, aspartic acid-189, and
           serine-195, the regions of the three disulfide bridges, and the COOH-terminal end
           (residues 225-229) of the proteins. When aligned to maximize homology the identity
           of residues is 34%." (PMID:804314)
PAS:       N/A

RID+UID:   D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence:  "In bovine chymotrypsinogen A in 2H2O at 31 degrees C, histidine-57 has a pK’ of 7.3
           and aspartate-102 a pK’ of 1.4, and the histidine-40-aspartate-194 system exhibits
           inflections at pH 4.6 and 2.3." (PMID:31898)
PAS:       pred = has
           arg1 = HIS57
           arg2 = a pK
           arg2-of = 7.3 and ASP102 a pK
           arg2-of = 1.4

RID+UID:   D102 CTRA_BOVIN
Sentence:  "In bovine chymotrypsin Aalpha under the same conditions, the
           histidine-57-aspartate-102 system has pK’ values of 6.1 and 2.8, and histidine-40
           has a pK’ of 7.2." (PMID:31898)
PAS:       pred = have
           arg1 = the HIS57 ASP102 system
           arg2 = pK values
           arg2-of = 6.1 and 2.8

RID+UID:   D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence:  "The results suggest that the pK’ of histidine-57 is higher than the pK’ of
           aspartate-102 in both zymogen and enzyme." (PMID:31898)
PAS:       pred = is
           arg1 = that the pK
           arg1-of = HIS57
           arg2 = higher than the pK
           arg2-of = ASP102
           arg2-in = both zymogen and enzyme

RID+UID:   H57 CTRA_BOVIN
Sentence:  "The 1H NMR chemical shift of the Cepsilon1 H of histidine-57 in the chymotrypsin
           Aalpha-pancreatic trypsin inhibitor (Kunitz) complex is constant between pH 3 and 9
           at a value similar to that of histidine-57 in the porcine trypsin-pancreatic trypsin
           inhibitor complex [Markley, J.L., and Porubcan, M. A. (1976), J. Mol. Biol. 102,
           487–509], suggesting that the mechanisms of interaction are similar in the two
           complexes." (PMID:31898)
PAS:       pred = is
           arg1 = complex
           arg2 = constant
           arg2-between = pH 3 and 9
           arg2-at = a value similar
           arg2-to = that
           arg2-of = HIS57
           arg2-in = the porcine trypsin-pancreatic trypsin inhibitor complex

Table 8.1: Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen.
Owing to the performance of the functional annotation extraction system and the availability of
information in MEDLINE, only little information was extracted. The mined information on the
active site residues mentions their catalytic properties only indirectly.


RID+UID:   C32 THIO_HUMAN; C35 THIO_HUMAN
Sentence:  "A hydrogen bond between the sulfhydryls of Cys32 and Cys35 may reduce the pKa of
           Cys32 and this pKa depression probably results in increased nucleophilicity of the
           Cys32 thiolate group." (PMID:8805557)
PAS:       pred = reduce
           arg1 = A hydrogen bond
           arg1-between = the sulfhydryls
           arg1-of = CYS32 and CYS35
           arg2 = the pKa
           arg2-of = [CYS32 and this pKa depression]/ENZ_ACT

RID+UID:   C215 PTN1_HUMAN
Sentence:  "The structure of the catalytically inactive mutant (C215S) of the human
           protein-tyrosine phosphatase 1B (PTP1B) has been solved to high resolution in two
           complexes." (PMID:9391040)
PAS:       pred = solved
           arg1 = [inactive mutant (C215S)]/ENZ_ACT
           arg1-of = the human protein-tyrosine phosphatase 1B (PTP1B)
           arg2 = unk
           arg2-to = to high resolution
           arg2-in = in two complexes

Table 8.2: Identified catalytic residues from MEDLINE extraction. The mined functional annotations
were classified as enzyme-related, suggesting that the corresponding protein residue has some
catalytic properties. The identified residues were also cross-validated with CSA; however, the
mined 3D patterns containing these residues were not validated as active sites by the database.

The analysis shows that only 3 out of 15 protein residues have enzyme-related annotations.
2 out of 3 residues belong to the protein human thioredoxin (cf. table 8.2). However,
none of the mined 3D patterns can provide a structural context for the identified catalytic
residues. A manual analysis of the remaining 12 residues shows that some of the associated
annotations were not correctly classified as enzyme-related, which can be explained by
the performance of the classifier (cf. section 6.3).

For 16 out of 197 protein residues, i.e. the intersection between OLDFIELD, OPR,
and CSA, the term "catalytic triad" is co-mentioned within the same sentence. While none
of the 16 residues is associated with a mined 3D pattern, 6 out of 16 residues have
enzyme-related functional annotations (cf. table 8.3).
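The term search used here can be sketched as a simple sentence filter. The sketch assumes that each extracted mention is available as a pair of RID+UID identifier and source sentence; the function name residues_comentioned_with_term is hypothetical, and the two sentences are taken from tables 8.3 and 8.2.

```python
import re

# Case-insensitive match that tolerates variable whitespace between the two words.
CATALYTIC_TRIAD = re.compile(r"catalytic\s+triad", re.IGNORECASE)


def residues_comentioned_with_term(mentions, term=CATALYTIC_TRIAD):
    """Return the RID+UID identifiers whose sentence contains the search term."""
    return {rid_uid for rid_uid, sentence in mentions if term.search(sentence)}


hnl_sentence = (
    "Our results yielded further support for an enzymatic mechanism involving "
    "the catalytic triad Ser80, His235, and Asp207 as a general acid/base."
)
ptp1b_sentence = (
    "The structure of the catalytically inactive mutant (C215S) of the human "
    "protein-tyrosine phosphatase 1B (PTP1B) has been solved to high resolution "
    "in two complexes."
)

mentions = [
    ("S80 HNL_HEVBR", hnl_sentence),
    ("D207 HNL_HEVBR", hnl_sentence),
    ("H235 HNL_HEVBR", hnl_sentence),
    ("C215 PTN1_HUMAN", ptp1b_sentence),
]

hits = residues_comentioned_with_term(mentions)
print(sorted(hits))
```

The second sentence contains "catalytically" but not the term itself, so C215 is not selected; a plain term search of this kind therefore only recovers residues that are explicitly described as members of a triad.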

In conclusion, the results of this study indicate that the coverage of relevant information
for validating predicted active sites is too low. However, some of the enzyme-related
annotations are biologically valid, but have no correlation with a 3D pattern.


RID+UID:   S80 HNL_HEVBR; D207 HNL_HEVBR; H235 HNL_HEVBR
Sentence:  "Our results yielded further support for an enzymatic mechanism involving the
           catalytic triad Ser80, His235, and Asp207 as a general acid/base." (PMID:11354003)
PAS:       pred = involving
           arg1 = further support
           arg1-for = for an enzymatic mechanism
           arg2 = [the catalytic triad SER80, HIS235, and ASP207]/ENZ_ACT

RID+UID:   E132 LINB_PSEPA; D108 LINB_PSEPA; H272 LINB_PSEPA
Sentence:  "The enzyme belongs to the alpha/beta hydrolase family and contains a catalytic triad
           (Asp108, His272, and Glu132) in the lipase-like topological arrangement previously
           proposed from mutagenesis experiments." (PMID:11087355)
PAS:       pred = contains
           arg1 = unk
           arg1-to = the alpha/beta hydrolase family and
           arg2 = [a catalytic triad (ASP108, HIS272, and GLU132)]/ENZ_ACT

Table 8.3: Catalytic triad residues available from the mined functional annotations. The active site
residues were identified by a search for the term "catalytic triad" in the mined functional annotation
data. The validity was also confirmed by comparison with CSA.

8.3.3 Search for novel catalytic residues 

In the previous section, the combined dataset was evaluated by cross-validation with CSA. 

Thus the identified catalytic residues represent only re-discoveries of known data. The 

goal in this section is to search for novel catalytic residues by combining enzyme-related 

annotations with mined 3D pattern. 

A set analysis between CSA, OLDFIELD, and OPR revealed that 2,205 residues
are included in OLDFIELD and OPR, but not in CSA (cf. figure 8.3). A search for
the term "catalytic triad" in the sentences of these 2,205 residues resulted in a
subselection of 24 residues. The analysis shows that none of the 24 residues was found in
the mined 3D patterns. However, 15 out of 24 residues have enzyme-related annotations
(cf. table F.1), suggesting that they are catalytic residues. A manual analysis determined
that the annotations contain valid evidence to identify the residues as catalytic.

The result of this study indicates that MEDLINE extraction can find some additional
catalytic residues that are not represented in CSA. However, a correlation with the mined
3D patterns was not found, and the functional annotations could not be interpreted in a
structural context.


8.3.4 General correlation found between predicted functional sites and extracted functional annotations

Previously, the validation of predicted active sites was studied by cross-validation with known
catalytic residues. In this section a more general correlation analysis between structure
and function data is performed. Because the coverage of extracted functional annotations
of protein residues is too low to annotate every residue of a prediction,
we cannot expect that all residues in one prediction are annotated with a description of
biological function. However, if a predicted functional site has some features which point
to a common concept of function, then this can be used to prioritise the prediction.

Table 8.4 (left panel) shows the top 25 mined structural patterns, ranked
by the number of distinct residues with PAS data. In total, 168 patterns have annotations,
ranging from one residue to a maximum of nine distinct residues with annotations. Another
view is to consider the number of annotated residues relative to the total
number of residues in a prediction (cf. table 8.4, right panel). This gives an indication of
how frequent a pattern is and how much is known about each residue from the text-mined
data.
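The two rankings can be reproduced with a few lines of code. The following minimal sketch uses five rows copied from table 8.4; the class name PatternStats and its field names are hypothetical.

```python
from typing import NamedTuple


class PatternStats(NamedTuple):
    pattern: str    # pattern identifier as printed in table 8.4
    annotated: int  # number of residues with PAS data (A)
    total: int      # number of residues in all structure examples (B)

    @property
    def ratio(self) -> float:
        """Fraction of residues of the pattern that carry PAS data (A/B)."""
        return self.annotated / self.total


rows = [
    PatternStats("9 10 16 CYS CYS PHE-1", 6, 12),
    PatternStats("10 15 11 ASP HIS TRP-2", 4, 18),
    PatternStats("10 11 11 ALA HIS HIS-1", 4, 6),
    PatternStats("17 11 9 ALA LEU VAL-1", 3, 102),
    PatternStats("10 10 20 HIS PRO TYR-1", 1, 3),
]

# Left panel: rank by the absolute number of annotated residues (A).
by_count = sorted(rows, key=lambda r: r.annotated, reverse=True)

# Right panel: rank by the annotated fraction (A/B).
by_ratio = sorted(rows, key=lambda r: r.ratio, reverse=True)

for row in by_ratio:
    print(f"{row.pattern:26s} {row.annotated:3d} {row.total:4d} {row.ratio:.4f}")
```

The second ranking favours rare patterns: a pattern with a single annotated residue out of three ranks above a frequent pattern with three annotated residues out of 102.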

The extraction of biological features from text for protein residues matches a variety
of proteins, including homologous proteins. So far, the annotation of residues
in a predicted functional site considered only first-level information (annotations for the
exact protein); however, the correlation analysis can also exploit information from
homologous proteins (second-level information). Based on the information from the
Homology-derived Secondary Structure of Proteins (HSSP) database [SS96], the annotation of
the predictions was expanded with information extracted for homologues. The result of this
study shows that the number of residue annotations is increased by 10% (cf. table 8.5). A
control analysis of how many residues in the non-redundant protein dataset OLDFIELD are
identified in MEDLINE, and how many of these are associated with PAS data, indicates that
the low recall of the developed text mining system is the reason for the weak annotation
expansion.

Left panel (ranked by A)
#residues with PAS (A)   Pattern                    #residues in pattern (B)   A/B
6                        9 10 16 CYS CYS PHE-1      12                         0.5
4                        10 15 11 ASP HIS TRP-2     18                         0.2222
4                        10 11 20 HIS MET PHE-1     12                         0.3333
4                        9 18 11 GLY MET TYR-1      12                         0.3333
4                        9 11 17 ALA LEU VAL-1      30                         0.1333
4                        8 9 10 CYS CYS HIS-1       12                         0.3333
4                        11 8 18 HIS HIS SER-1      12                         0.3333
4                        11 18 9 CYS ILE PHE-1      12                         0.3333
4                        11 11 12 HIS HIS MET-1     21                         0.1905
4                        9 15 11 GLN LEU TRP-2      6                          0.6667
4                        10 15 11 ASP HIS TRP-1     15                         0.2667
4                        10 11 11 ALA HIS HIS-1     6                          0.6667
4                        20 9 11 ASP GLY MET-1      12                         0.3333
4                        18 10 10 ASP CYS PHE-1     12                         0.3333
4                        19 11 10 ASP CYS ILE-1     12                         0.3333
4                        11 14 7 ASP MET SER-1      18                         0.2222
4                        9 17 10 ALA ILE PHE-1      18                         0.2222
3                        9 10 8 CYS HIS MET-1       9                          0.3333
3                        10 13 10 CYS PHE TYR-1     6                          0.5
3                        21 11 10 CYS GLY VAL-1     21                         0.1429
3                        11 9 9 ASP MET SER-1       15                         0.2
3                        17 11 9 ALA LEU VAL-1      102                        0.0294
3                        10 10 19 ALA HIS MET-1     18                         0.1667
3                        8 8 15 ASP HIS SER-1       33                         0.0909
3                        10 9 11 CYS VAL VAL-1      33                         0.0909

Right panel (ranked by A/B)
#residues with PAS (A)   Pattern                    #residues in pattern (B)   A/B
4                        10 11 11 ALA HIS HIS-1     6                          0.6667
4                        9 15 11 GLN LEU TRP-2      6                          0.6667
6                        9 10 16 CYS CYS PHE-1      12                         0.5
3                        10 13 10 CYS PHE TYR-1     6                          0.5
4                        10 11 20 HIS MET PHE-1     12                         0.3333
4                        11 18 9 CYS ILE PHE-1      12                         0.3333
4                        11 8 18 HIS HIS SER-1      12                         0.3333
4                        18 10 10 ASP CYS PHE-1     12                         0.3333
4                        19 11 10 ASP CYS ILE-1     12                         0.3333
4                        20 9 11 ASP GLY MET-1      12                         0.3333
4                        8 9 10 CYS CYS HIS-1       12                         0.3333
4                        9 18 11 GLY MET TYR-1      12                         0.3333
3                        9 10 8 CYS HIS MET-1       9                          0.3333
2                        11 13 9 ASN LYS SER-1      6                          0.3333
2                        11 14 8 ALA ARG ASN-2      6                          0.3333
2                        11 17 10 CYS PHE PRO-1     6                          0.3333
2                        18 10 11 ARG GLU PRO-1     6                          0.3333
2                        19 9 11 ALA PRO TYR-1      6                          0.3333
2                        9 11 9 ASP CYS LYS-1       6                          0.3333
1                        10 10 20 HIS PRO TYR-1     3                          0.3333
1                        10 12 11 ILE LEU PHE-1     3                          0.3333
1                        14 8 7 ASP HIS SER-1       3                          0.3333
1                        8 11 17 GLU THR THR-1      3                          0.3333
4                        10 15 11 ASP HIS TRP-1     15                         0.2667
4                        10 15 11 ASP HIS TRP-2     18                         0.2222

Table 8.4: Functional annotations of protein residues in predicted functional sites. A functional site is
predicted as a structure pattern that is recurrent among a non-redundant set of proteins. The left panel
lists the top 25 patterns ranked by the total number of annotated protein residues for each pattern (A),
while the right panel ranks the patterns by the number of annotated protein residues relative to the
total number of residues found in all structure examples (A/B).

Residue annotations
              -HSSP               +HSSP
              OPR       FA        OPR      FA
OLDFIELD      2,402     1,963     243      192
PATTERN       168       132       16       19

Table 8.5: Homology-based transfer of extracted functional annotations for protein residues in the
mined pattern data. Based on the HSSP information, the identified protein residues and their associated
functional annotations were transferred from homologous proteins to the target proteins and residues in
the mined structure pattern data.
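The homology-based transfer of annotations can be sketched as follows. The sketch assumes that the residue equivalences between a target protein and its homologues have already been derived from the HSSP alignments and are available as a mapping; parsing of HSSP files is not shown. The function name transfer_annotations and all identifiers and annotation texts are hypothetical toy values.

```python
def transfer_annotations(annotations, equivalences):
    """Collect second-level annotations for target residues from homologous residues.

    annotations:  {(uid, seq_index): [annotation text, ...]} extracted from text
    equivalences: {(target_uid, target_index): [(homolog_uid, homolog_index), ...]}
    """
    second_level = {}
    for target, homolog_residues in equivalences.items():
        inherited = []
        for homolog in homolog_residues:
            inherited.extend(annotations.get(homolog, []))
        if inherited:  # keep only targets that actually gain an annotation
            second_level[target] = inherited
    return second_level


# Toy data (hypothetical identifiers and annotation texts).
annotations = {
    ("HOMO1_TOY", 57): ["acts as general base"],
    ("TARG1_TOY", 10): ["is phosphorylated"],
}
equivalences = {
    ("TARG1_TOY", 60): [("HOMO1_TOY", 57)],  # aligned to an annotated residue
    ("TARG1_TOY", 10): [("HOMO1_TOY", 8)],   # aligned to an unannotated residue
}

second_level = transfer_annotations(annotations, equivalences)
print(second_level)
```

The sketch keeps first-level and second-level information separate, so that annotations inherited from a homologue can be weighted differently from annotations extracted for the exact protein.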

In conclusion, a general correlation between protein structure and function data is 

found in this study. The set of available annotations for protein residues is an indication 

of biological function for a predicted functional site. The biological significance of this 

result is being investigated further. 


The distribution of information in the combined data was studied by a search for active 

site residues. Another approach in sampling the dataset is the identification of ligand 

binding residues. A search can be done from the protein structure data, by selecting only 

residues of an identified metal binding site, and then consulting the literature for relevant 

annotations. 

The validation of a predicted active site in this study demonstrates that the amount
of extracted functional annotations was not sufficient for this task. Considering that
the catalytic triad is a well characterised structural feature, the information should be
available in MEDLINE. In fact, by searching for the term "catalytic triad" in the text-mined
data, several associations between the term and residues can be found. A close
examination reveals that some are annotations for homologous proteins with the
Asp-His-Ser catalytic triad motif (data not shown). However, the results of the presented
studies indicate that the recall of the text mining system is too low to capture sufficient
annotations for protein homologues.

Despite the identification of some catalytic residues in this analysis, it must be noted
that literature-based verification of predicted active sites cannot rule out the detection of
false positives. The absence of biological evidence in the literature does not mean that
the prediction is wrong, but simply that no knowledge is currently available. Biological
research is hypothesis-driven, and therefore predicted active site residues
are not expected to be reported in the literature if they have not been a biological research
target.


In this chapter I performed a correlation analysis between the datasets from protein structure
data mining and literature mining. The result of this study suggests that the
combined data show little correlation. For example, a structure-based prediction of an
active site had no functional annotations with biological evidence, although the result was
cross-validated with CSA. Conversely, literature-based identification of catalytic residues
could not be interpreted in an evolutionarily conserved structure context, because data
mining did not find a suitable recurrent structure pattern.


Chapter 9 

Conclusions and future work 

9.1 Summary of main contributions 

The goal of this thesis was to identify functional sites in proteins. For this purpose a 

novel approach that combines protein structure data mining and literature mining was 

used. Below is a summary of contributions. 

Significance testing of residue interaction is a novel approach to identify statistically
significant spatial and chemical configurations of residues. The developed
method relies solely on mathematical models, and the analysis shows that recurrent
homologous or convergent structural features can be extracted. More importantly,
the mined result contains biologically valid data. For example, 22 proteins with the
catalytic triad were identified from cross-validation studies. Altogether, the developed
data mining method can be used to discover novel information; the result is a
prediction of functional sites.

Identification of protein residues is an important text mining component developed
in this study for the extraction of functional annotations. The implemented solution
utilises regular expression patterns, and lists of terminologies from UniProtKB and
NCBI Taxonomy, in order to find and associate biological entities. Ultimately, an
identified protein residue is mapped to a UniProtKB protein, which means other extracted
information can be integrated into UniProtKB. With a precision of 0.82 and
a recall of 0.38, residues can be identified and associated precisely with their UniProtKB
proteins. From a whole MEDLINE analysis, 15,110 abstract texts were found that
can be used for information extraction of 2,884 UniProtKB/PDB proteins.

Contextual feature extraction is a discovery-driven information extraction approach
to find descriptions of function associated with a residue entity in the text. The developed
method extracts from a parsed sentence the verbal and prepositional relations
of a residue and its contextual features. The Gene Ontology was not used, because
it does not contain suitable terminologies for the identification of functional descriptions
of residues. With a precision of 0.68 and a recall of 0.48, the language parser
found 46,750 annotations for the identified protein residues from MEDLINE. Manual
analysis indicates that some of the extracted annotations are valid, and contain
novel information that can be used to update the feature table in UniProtKB.

Annotation of protein structures is the main objective of this thesis. The goal is to
create a synthesis between protein structure data and protein function data. The
hypothesis is that the intersection of information from both datasets can lead to
the discovery of new biological information. For example, a predicted active site can
be validated with evidence from the set of functional annotations. Although
cross-validations demonstrate that the mined information from PDB and literature contains
correct results, no correlation was found between both datasets. Nevertheless, the
text-mined information is valid, and 1,391 catalytic residues were found that can
be used to update CSA.


9.2 Limitations and future works 

During the work on this thesis, various research techniques and three major analysis
components have been developed. Their algorithms and implementations were explained,
their performance analysed, and suggestions for improvement have been made. The
following is a discussion of improvements for the combined dataset analysis.

To biologically validate a predicted functional site with published experimental
results, it has to be assumed that the functional annotations extracted from the literature
provide sufficient supporting evidence for a biological function. This has been shown to
be partly correct for some examples. However, it will probably not work in all cases. My
results suggest that other factors have to be considered in order to achieve one of the
following: (1) a standardised description of function of protein residues; (2) identification
of a representative functional concept of a structural feature; and (3) verification of the
validity of the pattern as a consensus functional site, where other protein
examples share the same annotations. Although the verification approach uses the vast
and broad information from MEDLINE, the analysis indicates that this might
not be sufficient for this task.

Another serious limitation of the literature-based verification of functional sites is
that our knowledge of the protein function space could be incomplete or
even incorrect. Protein structure data mining aims to deliver biologically unbiased results,
since 3D pattern mining relies on mathematical models and no biological knowledge is
used. The result is a prediction of functional sites. However, the input is biologically
biased. Currently, we do not have complete knowledge of the fold space, which
means the actual distribution of structural features may be skewed. As a consequence,
the prediction may contain a large fraction of false positives. In the long run, various
structural genomics initiatives may expand our knowledge of the fold space.

In the meantime, the literature is the main resource of biological evidence to validate
predictions. Yet, our knowledge of protein residue function, and even the spectrum of
biological function, has still to be determined. This can lead to four scenarios: (1) a
true functional site is fully supported by evidence (true positive); (2) a true functional
site is partly supported by evidence (incomplete knowledge); (3) a falsely predicted
functional site is partly supported by evidence (incomplete knowledge); and (4) a falsely
predicted functional site is fully supported by contradictory evidence (false positive).
While, from a bioinformatics point of view, there is little we can do about this problem,
the identification of cases (2), (3), and (4) can suggest further biological experiments
to find the missing data.


Bibliography 

[AGM + 90] 

SF Altschul, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local 

alignment search tool. Journal of Molecular Biology, 215(3):403–10, 1990.

[AL02] 

M Ashburner and SE Lewis. On ontologies for biologists: the gene ontology 

- uncoupling the web. Novartis Foundation Symposium, 2002. 

[AMS + 97] 

SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and 

DJ Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein 

database search programs. Nucleic Acids Research, 25(17):3389–402, 1997. 

[APG + 94] PJ Artymiuk, AR Poirrette, HM Grindley, DW Rice, and P Willett. A 

graph-theoretic approach to the identification of three-dimensional patterns 

of amino acid side-chains in protein structures. Journal of Molecular Biology,

243(2):327–44, 1994. 

[Att02] 

TK Attwood. The PRINTS database: a resource for identification of protein 

families. Brief Bioinform, 3(3):252–63, 2002. 

[AZP + 05] 

G Ausiello, A Zanzoni, D Peluso, A Via, and M Helmer-Citterich. pdbFun: 

mass selection and fast comparison of annotated PDB residues. 

Nucleic Acids Research, 33:W133–137, Jul 2005.

[BFL04] 

T Binkowski, P Freeman, and J Liang. pvSOAR: detecting similar surface 

patterns of pocket and void surfaces of amino acid residues on proteins. 

Nucleic Acids Research, 32:555–558, 2004. 

[BFW + 94] 

A Barth, K Frost, M Wahab, W Brandt, HD Schadler, and R Franke. Classification 

of serine proteases derived from steric comparisons of their active 

sites, part ii: ”ser, his, asp arrangements in proteolytic and nonproteolytic 

proteins”. Drug Design Discovery, 2:89–111, November 1994. 

[BGH + 00] 

WC Barker, JS Garavelli, H Huang, PB Mcgarvey, BC Orcutt, GY Srinivasarao, 

C Xiao, LL Yeh, RS Ledley, JF Janda, F Pfeiffer, HW Mewes, 

A Tsugita, and C Wu. The protein information resource (pir). Nucleic 

Acids Research, 28(1):41–44, January 2000. 

[BKL00] 

SE Brenner, P Koehl, and M Levitt. The astral compendium for protein 

structure and sequence analysis. Nucleic Acids Research, 28(1):254–256, 

January 2000. 

[BLK + 08] 

E Beisswanger, V Lee, JJ Kim, D Rebholz-Schuhmann, A Splendiani, 

O Dameron, S Schulz, and U Hahn. Gene regulation ontology (gro): design 

principles and use cases. Studies in health technology and informatics, 

136:9–14, 2008. 

[BM05] 

R Bunescu and RJ Mooney. A shortest path dependency kernel for relation 

extraction. 

In Proceedings of the Joint Conference on Human Language 

Technology / Empirical Methods in Natural Language Processing 

(HLT/EMNLP’05), 2005. 

[BM06] 

R Bunescu and RJ Mooney. Subsequence kernels for relation extraction. In 

Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information 

Processing Systems 18, pages 171–178. MIT Press, 2006. 


[BMC08] BMC. Biomed central. http://www.biomedcentral.com/, November 2008. 

[BT03] 

JA Barker and JM Thornton. An algorithm for constraint-based structural 

template matching: application to 3D templates with statistical analysis. 

Bioinformatics, 19(13):1644–1649, September 2003. 

[BW03] 

PE Bourne and H Weissig. Structural Bioinformatics (Methods of Biochemical 

Analysis, V. 44). Wiley-Liss, 1 edition, February 2003. 

[BW05] 

CJO Baker and R Witte. Mutation miner - textual annotation of protein 

structures. CERMM Symposium, 2005. 

[BWF + 00] 

HM Berman, J Westbrook, Z Feng, G Gilliland, TN Bhat, H Weissig, 

IN Shindyalov, and PE Bourne. The protein data bank. Nucleic Acids 

Research, 28(1):235–242, January 2000. 

[CB94] 

RR Copley and GJ Barton. A structural analysis of phosphate and sulphate 

binding sites in proteins. Estimation of propensities for binding and conservation 

of phosphate binding sites. Journal of Molecular Biology, 242:321– 

329, Sep 1994. 

[CCR + 08] 

BL Cantarel, PM Coutinho, C Rancurel, T Bernard, V Lombard, and 

B Henrissat. The Carbohydrate-Active EnZymes database (CAZy): an 

expert resource for Glycogenomics. Nucleic Acids Research, Oct 2008. 

[Cer00] 

F Cerbah. Exogenous and endogenous approaches to semantic categorization
of unknown technical terms. In Proceedings of the 18th International
Conference on Computational Linguistics (COLING), pages 145–151,
2000.

[CFK + 05] 

BY Chen, VY Fofanov, DM Kristensen, M Kimmel, O Lichtarge, and
LE Kavraki. Algorithms for structural comparison and statistical analysis
of 3D protein motifs. Pacific Symposium on Biocomputing, pages 334–345,
2005.

[Cha93] 

P Chakrabarti. Anion binding sites in protein structures. Journal of Molecular
Biology, 234:463–482, Nov 1993.

[CHR + 02] 

JM Castagnetto, SW Hennessy, VA Roberts, ED Getzoff, JA Tainer, and 

ME Pique. Mdb: the metalloprotein database and browser at the scripps 

research institute. Nucleic Acids Research, 30(1):379–382, January 2002. 

[CK06] IG Choi and SH Kim. Evolution of protein structural classes and protein 

sequence families. Proceedings of the National Academy of Sciences, 

September 2006. 

[CL64] 

RV Cochran and LH Lund. On the kirkwood superposition approximation. 

Journal of Physical Chemistry, 1964. 

[CMP05] 

J Crim, R McDonald, and F Pereira. Automatically annotating documents 

with normalized gene lists. BMC Bioinformatics, 6 Suppl 1, 2005. 

[CMR06] 

P Corbett and P Murray-Rust. High-throughput identification of chemistry 

in life science texts. In Computational Life Sciences II, pages 107–118. 

Springer, 2006. 

[CSL + 06] 

FM Couto, MJ Silva, V Lee, E Dimmer, E Camon, R Apweiler, H Kirsch, 

and D Rebholz-Schuhmann. Goannotator: linking protein go annotations 

to evidence text. Journal of Biomedical Discovery and Collaboration, 1:19+, 

December 2006. 

[DBAD03] 

R Day, DA Beck, RS Armen, and V Daggett. A consensus view of fold 

space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein 

Science, 12:2150–2160, Oct 2003. 


[DCG + 04] 

F Diella, S Cameron, C Gemuend, R Linding, A Via, B Kuster, ST Ponten, 

N Blom, and TJ Gibson. Phospho.elm: a database of experimentally verified 

phosphorylation sites in eukaryotic proteins. BMC Bioinformatics, 5, June 

2004. 

[DS05] 

A Doms and M Schroeder. Gopubmed: exploring pubmed with the gene 

ontology. Nucleic Acids Research, 33(Web Server issue), July 2005. 

[FGS98] 

JS Fetrow, A Godzik, and J Skolnick. Functional analysis of the escherichia 

coli genome using the sequence-to-structure-to-function paradigm: identification 

of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase 

activity. Journal of Molecular Biology, 282(4):703–711, October 

1998. 

[FKY + 01] 

C Friedman, P Kra, H Yu, M Krauthammer, and A Rzhetsky. Genies: a 

natural-language processing system for the extraction of molecular pathways 

from journal articles. Bioinformatics, 17 Suppl 1, 2001. 

[Fri07] 

D Frishman. Protein annotation at genomic scale: the current status. Chem 

Rev, 107(8):3448–3466, August 2007. 

[FS98] 

JS Fetrow and J Skolnick. Method for prediction of protein function from sequence 

using the sequence-to-structure-to-function paradigm with application 

to glutaredoxins/thioredoxins and T1 ribonucleases. Journal of Molecular 

Biology, 281(5), September 1998. 

[Fuk98] 

K Fukuda. Toward information extraction: identifying protein names from 

biological papers, 1998. 

[FWLN94] D Fischer, H Wolfson, SL Lin, and R Nussinov. Three-dimensional, sequence 

order-independent structural comparison of a serine protease against 

the crystallographic database reveals active site similarities: potential implications 

to evolution and to protein folding. Protein Science, 3(5):769–778, 

May 1994. 

[GDAW03] 

R Gaizauskas, G Demetriou, PJ Artymiuk, and P Willett. Protein structures 

and information extraction from biological texts: the pasta system. 

Bioinformatics, 19(1):135–143, January 2003. 

[GDO + 05] 

A Golovin, D Dimitropoulos, TJ Oldfield, A Rachedi, and K Henrick. 

Msdsite: A database search and retrieval system for the analysis and viewing 

of bound ligands and active sites. Proteins: Structure, Function, and 

Bioinformatics, 58(1):190–199, 2005. 

[GH08] 

A Golovin and K Henrick. Msdmotif: exploring protein sites and motifs. 

BMC Bioinformatics, 9(1), 2008. 

[GJYLRS08] S Gaudan, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Combining 

evidence, specificity, and proximity towards the normalization of gene ontology 

terms in text. EURASIP journal on bioinformatics & systems biology, 

2008. 

[Glu91] 

JP Glusker. Structural aspects of metal liganding to functional groups in 

proteins. Advances in Protein Chemistry, 42:1–76, 1991. 

[GOC06] GOConsortium. The gene ontology (go) project in 2006. Nucleic Acids 

Research, 34(Database issue), January 2006. 

[GPP + 03] 

F Glaser, T Pupko, I Paz, RE Bell, D Bechor-Shental, E Martz, and N Ben- 

Tal. 

ConSurf: identification of functional regions in proteins by surface-mapping 

of phylogenetic information. Bioinformatics, 19(1):163–164, January 

2003. 

[Gue96] 

F Guenthner. Electronic lexica and corpora research at cis. CIS Bericht- 

96-100, 1996. 

[HBB + 08] 

N Hulo, A Bairoch, V Bulliard, L Cerutti, BA Cuche, E de Castro, 

C Lachaize, PS Langendijk-Genevaux, and CJ Sigrist. 

The 20 years of 

PROSITE. Nucleic Acids Research, 36:D245–249, Jan 2008. 

[HBGK03] 

M Hendlich, A Bergner, J Günther, and G Klebe. Relibase: design and 

development of a database for comprehensive analysis of protein-ligand interactions. 

Journal of Molecular Biology, 326(2):607–620, February 2003. 

[HFM + 05] 

D Hanisch, K Fundel, HT Mevissen, R Zimmer, and J Fluck. Prominer: 

rule-based protein and gene entity recognition. BMC Bioinformatics, 6 

Suppl 1, 2005. 

[HJ99] C Hadley and DT Jones. A systematic comparison of protein structure 

classifications: SCOP, CATH and FSSP. Structure, 7:1099–1112, Sep 1999. 

[HLC04] 

F Horn, AL Lau, and FE Cohen. Automated extraction of mutation data 

from the literature: application of mutext to g protein-coupled receptors and 

nuclear hormone receptors. Bioinformatics, 20(4):557–568, March 2004. 

[HMBC97] 

TJ Hubbard, AG Murzin, SE Brenner, and C Chothia. SCOP: a structural 

classification of proteins database. Nucleic Acids Research, 25:236–239, Jan 

1997. 

[HNR + 05] 

ZZ Hu, M Narayanaswamy, KE Ravikumar, K Vijay-Shanker, and CH Wu. 

Literature mining and database annotation of protein phosphorylation using 

a rule-based system. Bioinformatics, 21(11):2759–2765, June 2005. 

[Hob02] 

JR Hobbs. Information extraction from biomedical text. Journal of Biomedical 

Informatics, 35(4):260–264, August 2002. 

[HPS + 03] 

A Harrison, F Pearl, I Sillitoe, T Slidel, R Mott, JM Thornton, and 

CA Orengo. Recognizing the fold of a protein structure. Bioinformatics, 

19(14):1748–1759, September 2003. 

[HS94] 

L Holm and C Sander. The fssp database of structurally aligned protein 

fold families. Nucleic Acids Research, 22(17):3600–3609, September 1994. 

[HS96] L Holm and C Sander. Mapping the protein universe. Science, 

273(5275):595–603, August 1996. 

[HSSS92] 

U Hobohm, M Scharf, R Schneider, and C Sander. Selection of representative 

protein data sets. Protein Science, 1(3):409–417, March 1992. 

[HZH + 04] 

M Huang, X Zhu, Y Hao, DG Payan, K Qu, and M Li. Discovering patterns 

to extract protein-protein interactions from full texts. Bioinformatics, 

20(18):3604–3612, December 2004. 

[IPGK05] 

VA Ivanisenko, SS Pintus, DA Grigorovich, and NA Kolchanov. PDBSite: 

a database of the 3D structure of protein functional sites. Nucleic Acids 

Research, 33:D183–187, Jan 2005. 

[JB04] 

A Jakulin and I Bratko. Testing the significance of attribute interactions. 

In ICML, pages 409–416. ACM Press, 2004. 

[JGLRS08] S Jaeger, S Gaudan, U Leser, and D Rebholz-Schuhmann. Integrating 

protein-protein interactions and text mining for protein function prediction. 

BMC Bioinformatics, 9(Suppl 8), 2008. 

[JIDG03] 

M Jambon, A Imberty, G Deléage, and C Geourjon. A new bioinformatic 

approach to detect common 3d sites in protein structures. Proteins: 

Structure, Function, and Genetics, 52:137–145, 2003. 

[JK95] 

J Justeson and S Katz. Technical terminology: some linguistic properties 

and an algorithm for identification in text. Natural Language Engineering, 

pages 9–27, 1995. 

[KCRB07] 

R Kanagasabai, KH Choo, S Ranganathan, and CJ Baker. A workflow for 

mutation extraction and structure annotation. Journal of Bioinformatics 

and Computational Biology, 5(6):1319–1337, December 2007. 

[KH04] 

E Krissinel and K Henrick. Secondary-structure matching (ssm), a new tool 

for fast protein structure alignment in three dimensions. Acta Crystallographica 

Section D: Biological Crystallography, 60(1):2256–2268, December 

2004. 

[KJ94] 

GJ Kleywegt and TA Jones. Detection, delineation, measurement and display 

of cavities in macromolecular structures. Acta Crystallographica Section 

D: Biological Crystallography, 50(Pt 2):178–185, March 1994. 

[Kle99] 

GJ Kleywegt. Recognition of spatial motifs in protein structures. Journal 

of Molecular Biology, 285(4):1887–1897, January 1999. 

[KN03] 

K Kinoshita and H Nakamura. Identification of protein biochemical functions 

by similarity search using the molecular surface database ef-site. Protein 

Science, 12(8):1589–1595, August 2003. 

[KNT05] A Koike, Y Niwa, and T Takagi. Automatic extraction of gene/protein 

biological functions from biomedical text. Bioinformatics, 21(7):1227–1236, 

April 2005. 

[KON99] T Kawabata, M Ota, and K Nishikawa. The protein mutant database. 

Nucleic Acids Research, 27(1):355–357, January 1999. 

[Las95] 

RA Laskowski. Surfnet: a program for visualizing molecular surfaces, cavities, 

and intermolecular interactions. Journal of Molecular Graphics, 13(5), 

October 1995. 

[LC05] G Leroy and H Chen. Genescene: An ontology-enhanced integration of 

linguistic and co-occurrence based relations in biomedical texts: Research 

articles. Journal of the American Society for Information Science and Technology, 

56(5):457–468, March 2005. 

[LCM03] G Leroy, H Chen, and JD Martinez. A shallow parser based on closed-class 

words to capture relations in biomedical text. Journal of Biomedical 

Informatics, pages 145–158, June 2003. 

[LEW98] 

J Liang, H Edelsbrunner, and C Woodward. Anatomy of protein pockets 

and cavities: measurement of binding site geometry and implications for 

ligand design. Protein Science, 7(9):1884–1897, September 1998. 

[LHC07] LC Lee, F Horn, and FE Cohen. Automatic extraction of protein point 

mutations using a graph bigram association. PLoS Computational Biology, 

3(2):e16+, February 2007. 

[LRTV07] 

Gonzalo Lopez, Ana Rojas, Michael Tress, and Alfonso Valencia. Assessment 

of predictions submitted for the CASP7 function prediction category. 

Proteins, 69 Suppl 8:165–74, 2007. 

[LW91] Y Lamdan and HJ Wolfson. On the error analysis of 'geometric hashing'. Computer Vision and Pattern 

Recognition, 1991. Proceedings CVPR ’91., IEEE Computer Society Conference 

on, pages 22–27, June 1991. 

[Mar05] AC Martin. Mapping pdb chains to uniprotkb entries. Bioinformatics, 

21(23):4297–4301, December 2005. 

[MB99] Y Matsuo and SH Bryant. Identification of homologous core structures. 

Proteins, 35:70–79, Apr 1999. 

[MG03] J McCallum and S Ganesh. Text mining of DNA sequence homology 

searches. Applied Bioinformatics, 2:59–63, 2003. 

[MR03] 

S Mika and B Rost. UniqueProt: Creating representative protein sequence 

sets. Nucleic Acids Research, 31:3789–3791, Jul 2003. 

[MSD08] MSDmapping. Msdmapping. http://www.ebi.ac.uk/msd-as/ 

MSDMapping/, November 2008. 

[MT05] Y Miyao and J Tsujii. Probabilistic disambiguation models for wide-coverage 

hpsg parsing. In ACL ’05: Proceedings of the 43rd Annual Meeting 

on Association for Computational Linguistics, pages 83–90. Association for 

Computational Linguistics, 2005. 

[NBD + 06] 

J Natarajan, D Berrar, W Dubitzky, C Hack, Y Zhang, C Desesa, 

JR Van Brocklyn, and EG Bremer. 

Text mining of full-text journal articles 

combined with gene expression analysis reveals a relationship between 

sphingosine-1-phosphate and invasiveness of a glioblastoma cell line. BMC 

Bioinformatics, 7:373+, August 2006. 

[NED03] 

S Novichkova, S Egorov, and N Daraselia. Medscan, a natural language 

processing engine for medline abstracts. Bioinformatics, 19(13):1699–1706, 

September 2003. 

[OCR01] 

MJ Ondrechen, JG Clifton, and D Ringe. Thematics: A simple computational 

predictor of enzyme function from structure. 

Proceedings of the 

National Academy of Sciences, 98(22):12473–12478, October 2001. 

[Old01] 

TJ Oldfield. Creating structure features by data mining the PDB to use as 

molecular-replacement models. Acta Crystallographica Section D: Biological 

Crystallography, 57:1421–1427, Oct 2001. 

[Old02] 

TJ Oldfield. Data mining the protein data bank: residue interactions. Proteins, 

49(4):510–528, December 2002. 

[OMJ + 97] 

CA Orengo, AD Michie, S Jones, DT Jones, MB Swindells, and JM Thornton. 

CATH-a hierarchic classification of protein domain structures. Structure, 

5:1093–1108, Aug 1997. 

[PB06] 

BJ Polacco and PC Babbitt. Automated discovery of 3d motifs for protein 

function annotation. Bioinformatics, 22(6):723–730, March 2006. 

[PBT04] 

CT Porter, GJ Bartlett, and JM Thornton. The Catalytic Site Atlas: a 

resource of catalytic sites and residues identified in enzymes using structural 

data. Nucleic Acids Research, 32(Database issue), January 2004. 

[PJYLRS08] P Pezik, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Static dictionary 

features for term polysemy identification. Building and evaluating 

resources for biomedical text mining, LREC Workshop, 2008. 

[PKS06] G Pandey, V Kumar, and M Steinbach. Computational approaches for 

protein function prediction: A survey. Technical Report 06-028, Department 

of Computer Science and Engineering, University of Minnesota, Twin Cities, 

2006. 

[Plo08] PloS. Public library of science. http://www.plos.org/, November 2008. 

[PMC08] 

PMC. Pubmed central. http://www.pubmedcentral.nih.gov/, November 

2008. 

[POHS05] 

M Pesu, J O’Shea, L Hennighausen, and O Silvennoinen. Identification of an 

acquired mutation in Jak2 provides molecular insights into the pathogenesis 

of myeloproliferative disorders. 

Molecular Interventions, 5:211–215, Aug 

2005. 

[RMK + 07] 

ND Rawlings, FR Morton, CY Kok, J Kong, and AJ Barrett. Merops: the 

peptidase database. 

Nucleic Acids Research, pages gkm954+, November 

2007. 

[Ros99] 

B Rost. Twilight zone of protein sequence alignments. Protein Engineering 

Design and Selection, 12(2):85–94, February 1999. 

[RSAG + 08] 

D Rebholz-Schuhmann, M Arregui, S Gaudan, H Kirsch, and A Jimeno 

Yepes. Text processing through web services: Calling whatizit. Bioinformatics, 

2008. 

[RSKA + 07] 

D Rebholz-Schuhmann, H Kirsch, M Arregui, S Gaudan, M Riethoven, and 

P Stoehr. Ebimed-text crunching to gather facts for proteins from medline. 

Bioinformatics, 23(2), January 2007. 

[RSMA + 04] 

D Rebholz-Schuhmann, S Marcel, S Albert, R Tolle, G Casari, and H Kirsch. 

Automatic extraction of mutations from medline and cross-validation with 

omim. Nucleic Acids Research, 2004. 

[Rus98] RB Russell. Detection of protein three-dimensional side-chain patterns: 

new examples of convergent evolution. 

Journal of Molecular Biology, 

279(5):1211–1227, June 1998. 

[SAR + 07] 

B Smith, M Ashburner, C Rosse, K Bard, W Bug, W Ceusters, LJ Goldberg, 

K Eilbeck, A Ireland, CJ Mungall, N Leontis, P Rocca-Serra, A Ruttenberg, 

SA Sansone, RH Scheuermann, N Shah, PL Whetzel, and S Lewis. The 

OBO Foundry: coordinated evolution of ontologies to support biomedical 

data integration. Nature Biotechnology, 25(11):1251–5, 2007. 

[SB05] 

A Schutz and P Buitelaar. Relext: A tool for relation extraction from text 

in ontology extension. The Semantic Web - ISWC 2005, pages 593–606, 

2005. 

[SB06] 

J Schuman and S Bergler. Postnominal prepositional phrase attachment 

in proteomics. In Proceedings of the HLT-NAACL BioNLP Workshop on 

Linking Natural Language and Biology. Association for Computational Linguistics, 

2006. 

[SDC06] 

A Sidhu, T Dillon, and E Chang. Unification of protein data and knowledge 

sources. Knowledge-Based Intelligent Information and Engineering Systems, 

pages 728–737, 2006. 

[Sin04] A Singer. Maximum entropy formulation of the Kirkwood superposition 

approximation. Journal of Chemical Physics, 121:3657–3666, Aug 2004. 

[SPIBA03] 

PK Shah, C Perez-Iratxeta, P Bork, and MA Andrade. Information extraction 

from full text scientific articles: where are the keywords? 

BMC 

Bioinformatics, 4(1), May 2003. 

[SPNW04] 

A Shulman-Peleg, R Nussinov, and HJ Wolfson. Recognition of functional 

sites in protein structures. Journal of Molecular Biology, 339(3):607–633, 

June 2004. 

[SS96] R Schneider and C Sander. The HSSP database of protein structure-sequence 

alignments. Nucleic Acids Research, 24(1):201–5, 1996. 

[SSR03] 

A Stark, S Sunyaev, and RB Russell. A model for statistical significance of 

local similarities in structure. Journal of Molecular Biology, 326(5):1307– 

1316, March 2003. 

[STB06] 

MH Saier, CV Tran, and RD Barabote. Tcdb: the transporter classification 

database for membrane transport protein analyses and information. Nucleic 

Acids Research, 34(Database issue), January 2006. 

[SWS + 04] 

MJ Schuemie, M Weeber, BJ Schijvenaars, EM van Mulligen, CC van der 

Eijk, R Jelier, B Mons, and JA Kors. Distribution of information in biomedical 

abstracts and full-text publications. Bioinformatics, 20(16):2597–2604, 

November 2004. 

[SYH + 03] 

S Saito, H Yamaguchi, Y Higashimoto, C Chao, Y Xu, AJ Fornace, E Appella, 

and CW Anderson. Phosphorylation site interdependence of human 

p53 post-translational modifications in response to stress. Journal of Biological 

Chemistry, 278:37536–37544, Sep 2003. 

[TCS + 07] 

RT Tsai, WC Chou, YS Su, YC Lin, CL Sung, HJ Dai, IT Yeh, W Ku, 

TY Sung, and WL Hsu. 

Biosmile: A semantic role labeling system for 

biomedical verbs using a maximum-entropy model with automatically generated 

template features. BMC Bioinformatics, 8:325+, September 2007. 

[TMA08] 

Y Tsuruoka, J Mcnaught, and S Ananiadou. Normalizing biomedical terms 

by minimizing ambiguity and variability. BMC Bioinformatics, 9(Suppl 3), 

2008. 

[TOT04] 

Y Tateisi, T Ohta, and J Tsujii. Annotation of predicate-argument structure 

on molecular biology text. In First International Joint Conference on Natural 

Language Processing In the IJCNLP-04 workshop on Beyond Shallow 

Analyses, March 2004. 

[TW02] 

L Tanabe and WJ Wilbur. Tagging gene and protein names in biomedical 

text. Bioinformatics, 18(8):1124–1132, August 2002. 

[VMMR + 05] S Velankar, P McNeil, V Mittard-Runte, A Suarez, D Barrell, R Apweiler, 

and K Henrick. 

E-msd: an integrated data resource for bioinformatics. 

Nucleic Acids Research, 33(Database issue), January 2005. 

[VZHC05] A Via, A Zanzoni, and M Helmer-Citterich. Seq2Struct: a resource for 

establishing sequence-structure links. Bioinformatics, 21(4):551–3, 2005. 

[WAB + 06] 

CH Wu, R Apweiler, A Bairoch, DA Natale, WC Barker, B Boeckmann, 

S Ferro, E Gasteiger, H Huang, R Lopez, M Magrane, MJ Martin, 

R Mazumder, C O’Donovan, N Redaschi, and B Suzek. The universal 

protein resource (uniprot): an expanding universe of protein information. 

Nucleic Acids Research, 34(Database issue), January 2006. 

[WBB + 06] 

DL Wheeler, T Barrett, DA Benson, SH Bryant, K Canese, V Chetvernin, 

DM Church, M Dicuccio, R Edgar, S Federhen, LY Geer, W Helmberg, 

Y Kapustin, DL Kenton, O Khovayko, DJ Lipman, TL Madden, DR Maglott, 

J Ostell, KD Pruitt, GD Schuler, LM Schriml, E Sequeira, ST Sherry, 

K Sirotkin, A Souvorov, G Starchenko, TO Suzek, R Tatusov, TA Tatusova, 

L Wagner, and E Yaschenko. 

Database resources of the national center 

for biotechnology information. Nucleic Acids Research, 34(Database issue), 

January 2006. 

[WBT97] AC Wallace, N Borkakoti, and JM Thornton. Tess: a geometric hashing 

algorithm for deriving 3d coordinate templates for searching structural 

databases. application to enzyme active sites. Protein Science, 6(11):2308– 

2323, November 1997. 

[WD03] 

G Wang and RL Dunbrack. Pisces: a protein sequence culling server. Bioinformatics, 

19(12):1589–1591, August 2003. 

[WK07] 

R Witte and T Kappler. Enhanced semantic access to the protein engineering 

literature using ontologies populated by text mining. International 

Journal of Bioinformatics Research and Applications, 2007. 

[WR97] HJ Wolfson and I Rigoutsos. Geometric hashing: an overview. Computational 

Science and Engineering, IEEE [see also Computing in Science & 

Engineering], 4(4):10–21, 1997. 

[WSC04] T Wattarujeekrit, PK Shah, and N Collier. Pasbio: predicate-argument 

structures for event extraction in molecular biology. BMC Bioinformatics, 

5, October 2004. 

[YEC + 07] 

S Yoon, JC Ebert, EY Chung, G De Micheli, and RB Altman. Clustering 

protein environments for function prediction: finding prosite motifs in 3d. 

BMC Bioinformatics, 8 Suppl 4, 2007. 

[YHF + 02] 

H Yu, V Hatzivassiloglou, C Friedman, A Rzhetsky, and WJ Wilbur. Automatic 

extraction of gene and protein synonyms from medline and journal 

articles. Proceedings of the AMIA Symposium, pages 919–923, 2002. 

[YLPV07] YL Yip, N Lachenal, V Pillet, and AL Veuthey. Retrieving mutation-specific 

information for human proteins in UniProt/Swiss-Prot Knowledgebase. 

Journal of Bioinformatics and Computational Biology, 5:1215–1231, 

Dec 2007. 

[YMTT05] A Yakushiji, Y Miyao, Y Tateisi, and J Tsujii. Biomedical information 

extraction with predicate-argument structure patterns. In SMBM, 2005. 

Appendix A 

Examples of errors in relation 

extraction. 

Table A.1: Examples of errors in the relation extraction for the detection of contextual features.

Sentence: "This observation provides a rationale for the reduced electron-transfer efficiency displayed by the E92K mutant." (PMID:10089511)

Annotated residue: GLU92

Annotated keywords: reduced electron-transfer efficiency

Annotated PAS:
pred = displayed
arg1 = the reduced electron-transfer efficiency
arg2-by = the E92K mutant

TP shallow parsing:
pred = displayed
arg1 = a rationale
arg1-for = the reduced electron-transfer efficiency
arg2-by = the GLU92 LYS mutant

FP full parsing:
arg1-by = the GLU92 LYS mutant


Sentence: "An apparent 'acceptor consensus overlap' at Ser474 suggests that the mechanism behind the glycosaminoglycan split of TM may involve a competition for substrate between xylosyltransferase and N-acetylgalactosaminyltransferase." (PMID:8216207)

Annotated residue: SER474

Annotated keywords: acceptor consensus overlap

Annotated PAS:
pred = suggests
arg1 = An apparent 'acceptor consensus overlap'
arg1-at = SER474
arg2 = the mechanism behind the glycosaminoglycan split
arg2-of = TM

FP shallow parsing:
arg2 = that the mechanism
arg2-behind = the glycosaminoglycan split
arg2-of =

TP full parsing:
arg1 = An apparent 'acceptor consensus overlap'
arg2 = that the mechanism
arg2-behind = the glycosaminoglycan split
arg2-of = TM


Sentence: "Using this approach, coupled with Edman degradation of the 32PO4-labeled tryptic peptides, and comparison with tryptic peptides analyzed after labeling normal human colonic tissues, we identified ser-52 as the major K18 physiologic phosphorylation site." (PMID:7523419)

Annotated residue: SER52

Annotated keywords: physiologic phosphorylation site

Annotated PAS:
pred = identified
arg1 = unk
arg2 = SER52
arg2-as = the major K18 physiologic phosphorylation site

FP shallow parsing:
arg2 = SER52
arg2-as = the major

FP full parsing:
arg1 = we
arg2 = SER52
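The rows of Table A.1 can be read as small records: a predicate plus a set of labelled arguments. The following is a minimal Python sketch of such a record and of one possible check for whether an extracted PAS still carries both the annotated residue and the annotated keywords. The class name `PAS`, the function `covers`, and the matching criterion itself are assumptions made for illustration; the appendix does not state how true and false positives were actually decided.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PAS:
    """One predicate-argument structure as listed in Table A.1 (hypothetical layout)."""
    pred: Optional[str] = None                           # predicate; None if not shown
    args: Dict[str, str] = field(default_factory=dict)   # e.g. {"arg1": ..., "arg2-by": ...}

    def text(self) -> str:
        """Return all argument strings joined and lower-cased for matching."""
        return " ".join(self.args.values()).lower()


def covers(pas: PAS, residue: str, keywords: str) -> bool:
    """Assumed criterion: residue and keywords must both occur in the arguments."""
    body = pas.text()
    return residue.lower() in body and keywords.lower() in body


# Shallow-parsing output for the first sentence of Table A.1.
shallow_e92k = PAS("displayed", {
    "arg1": "a rationale",
    "arg1-for": "the reduced electron-transfer efficiency",
    "arg2-by": "the GLU92 LYS mutant",
})

# Shallow-parsing output for the third sentence: the keyword phrase is cut off.
shallow_ser52 = PAS(None, {"arg2": "SER52", "arg2-as": "the major"})

ok_e92k = covers(shallow_e92k, "GLU92", "reduced electron-transfer efficiency")
ok_ser52 = covers(shallow_ser52, "SER52", "physiologic phosphorylation site")
```

Under this assumed criterion the first structure is accepted and the second is rejected, which agrees with the TP and FP labels given for those two rows. A substring test is deliberately crude; it only illustrates why a truncated argument such as "the major" loses the functional keyword.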

Appendix B 

Examples of extracted functional 

annotations compared with 

UniProtKB 

Table B.1: Comparison of extracted protein residue annotations from GC with UniProtKB. Mined functional annotations are listed as PAS, while relevant information from UniProtKB is reproduced from the feature table (FT) entry lines.

RID+UID: SER15 P53_HUMAN

Sentence: "Previous studies have demonstrated that phosphorylation of human p53 on serine 15 contributes to protein stabilization after DNA damage and that this is mediated by the ATM family of kinases." (PMID:11865061)

UniProtKB/FT:
SER15 MOD_RES: Phosphoserine; by PRPK
SER15 VARIANT: S->R in a sporadic cancer; somatic mutation.

PAS:
pred = contributes
arg1 =
arg1-on = SER15
arg2 =
arg2-to = protein stabilization
arg2-after = DNA damage and that


RID+UID: GLU189 CP27B_HUMAN, LEU343 CP27B_HUMAN

Sentence: "The R389G mutant was totally inactive, but mutant L343F retained 2.3% of wild-type activity, and mutant E189G retained 22% of wild-type activity." (PMID:12050193)

UniProtKB/FT:
GLU189 VARIANT: E->K in VDDR I; 11% of wild-type activity.
LEU343 VARIANT: L->F in VDDR I; 2.3% of wild-type activity.

PAS:
pred = retained
arg1 = but mutant LEU343 PHE
arg2 = 2.3 %
arg2-of = wild-type activity

PAS:
pred = retained
arg1 = and mutant GLU189 GLY
arg2 = 22 %
arg2-of = wild-type activity


RID+UID: CYS260 TGA1_ARATH, CYS266 TGA1_ARATH

Sentence: "Furthermore, site-directed mutagenesis of TGA1 Cys-260 and Cys-266 enables the interaction with NPR1 in yeast and Arabidopsis." (PMID:12953119)

UniProtKB/FT:
C260/C266 DISULFID: (potential).
C260 MUTAGEN: C->N; Gain of interaction with NPR1; when associated with S-266.
C266 MUTAGEN: C->S; Gain of interaction with NPR1; when associated with S-260.

PAS:
pred = enables
arg1 = site-directed mutagenesis
arg1-of = TGA1 CYS260 and CYS266
arg2 = the interaction
arg2-with = NPR1
arg2-in = yeast and Arabidopsis


RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO

Sentence: "Direct in vitro kinase assay using GST-fusion proteins of wild-type as well as various mutants of p25(rum1) demonstrated that MAPK phosphorylates the N-terminal portion of p25(rum1) and residues Thr13 and Ser19 are major phosphorylation sites for MAPK." (PMID:12135491)

UniProtKB/FT:
THR13 MOD_RES: Phosphothreonine; by MAPK
SER19 MOD_RES: Phosphoserine; by MAPK
SER19 MUTAGEN: S->E: reduces activity as a cdc2 inhibitor; when associated with E-13

PAS:
pred = are
arg1 = the N-terminal portion
arg1-of = p25(rum1) and residues THR13 and SER19
arg2 = major phosphorylation sites
arg2-for = MAPK


Sentence: "Together with the fact that replacement of both Thr13 and Ser19 with Glu, which mimics the phosphorylated state of these residues, also significantly reduces the activity of p25(rum1) as a Cdc2 inhibitor, it was suggested that the phosphorylation of Thr13 and Ser19 negatively regulates the function of p25(rum1)." (PMID:12135491)

UniProtKB/FT:
THR13 N/A
SER19 N/A

PAS:
pred = suggested
arg2 = that the phosphorylation
arg2-of = THR13 and SER19

PAS:
pred = regulates
arg1 = that the phosphorylation
arg2 = the function
arg2-of = p25(rum1)


Sentence: "Further evidence indicates that phosphorylation of Thr13 and Ser19 may retain a negative effect on the function of p25(rum1) even in vivo." (PMID:12135491)

UniProtKB/FT:
THR13 N/A
SER19 N/A

PAS:
pred = retain
arg1 = that
arg2 = a negative effect
arg2-on = the function
arg2-of = p25(rum1)


RID+UID: GLU55 DHMA_MYCAV, ASP123 DHMA_MYCAV, TRP124 DHMA_MYCAV

Sentence: "Many residues essential for the dehalogenation reaction are conserved in DhmA; the putative catalytic triad consists of Asp123, His279, and Asp250, and the putative oxyanion hole consists of Glu55 and Trp124." (PMID:12147465)

UniProtKB/FT:
GLU55 N/A
ASP123 ACT_SITE: Nucleophile (by similarity).
TRP124 N/A

PAS:
pred = consists
arg1 = the putative catalytic triad
arg2 =
arg2-of = ASP123

PAS:
pred = consists
arg1 = and the putative oxyanion hole
arg2 =
arg2-of = GLU55 and TRP124


RID+UID: CYS48 THIO_RAT, CYS152 THIO_RAT, CYS73 THIO_RAT

Sentence: "Thus, PrxV mutants lacking Cys(48) or Cys(152) showed no detectable thioredoxin-dependent peroxidase activity, whereas mutation of Cys(73) had no effect on activity." (PMID:10751410)

UniProtKB/FT:
N/A

PAS:
pred = showed
arg1 = CYS48 or CYS152
arg2 = no detectable thioredoxin-dependent peroxidase activity

PAS:
pred = had
arg1 = whereas mutation
arg1-of = CYS73
arg2 = no effect on activity


RID+UID: GLY43 PPCS_HUMAN

Sentence: "Highly conserved ATP binding residues include Gly43, Ser61, Gly63, Gly66, Phe230, and Asn258." (PMID:12906824)

UniProtKB/FT:
N/A

PAS:
pred = include
arg1 = conserved ATP binding residues
arg2 = GLY43


RID+UID: ASN59 PPCS_HUMAN

Sentence: "Highly conserved phosphopantothenate binding residues include Asn59, Ala179, Ala180, and Asp183 from one monomer and Arg55' from the adjacent monomer." (PMID:12906824)

UniProtKB/FT:
N/A

PAS:
pred = include
arg1 = conserved phosphopantothenate binding residues
arg2 = ASN59


RID+UID: GLU50 SHD_HUMAN, GLU51 SHD_HUMAN

Sentence: "Rab3A binding-defective mutants of rabphilin (E50A) and Noc2(E51A) were still localized in the distal portion of the neurites (where dense-core vesicles had accumulated) in nerve growth factor-differentiated PC12 cells, the same as the wild-type proteins, whereas Rab27A binding-defective mutants of rabphilin (E50A/I54A) and Noc2(E51A/I55A) were present throughout the cytosol." (PMID:14722103)

UniProtKB/FT:
N/A

PAS:
pred = localized
arg1 = Rab3A binding-defective mutants
arg1-of = rabphilin ( GLU50 ALA ) and Noc2 ( GLU51 ALA )
arg2 =
arg2-in = the distal portion
arg2-of = the neurites ( where dense-core vesicles


RID+UID: TRP124 DHMA_MYCAV

Sentence: "Trp124 should be involved in substrate binding and product (halide) stabilization, while the second halide-stabilizing residue cannot be identified from a comparison of the DhmA sequence with the sequences of three dehalogenases with known tertiary structures." (PMID:12147465)

UniProtKB/FT:
N/A

PAS:
pred = involved
arg1 = TRP124
arg2 =
arg2-in = substrate binding and product (halide) stabilization
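The PAS rows in Table B.1 show residue mentions in a normalized form: "L343F" appears as "LEU343 PHE", "Cys(48)" as "CYS48", and "Thr13" as "THR13". The sketch below shows one way such a normalization could be written in Python with regular expressions. It is an illustration under stated assumptions, not the method used in the thesis: the pattern names, the function `normalize_residues`, and the choice of regular expressions are all hypothetical, and spelled-out mentions such as "serine 15" are not covered.

```python
import re

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

# Point mutations in one-letter notation, e.g. "E92K" (case-sensitive on purpose).
POINT_MUTATION = re.compile(
    r"\b([ARNDCQEGHILKMFPSTWYV])(\d+)([ARNDCQEGHILKMFPSTWYV])\b"
)

# Three-letter residue mentions: "Thr13", "Cys-260", "ser-52", "Cys(48)".
_NAMES = "|".join(ONE_TO_THREE.values())
THREE_LETTER = re.compile(
    r"\b(" + _NAMES + r")(?:\((\d+)\)|-?(\d+))", re.IGNORECASE
)


def normalize_residues(text: str) -> str:
    """Rewrite residue mentions in the upper-case style seen in the PAS rows."""
    # "E92K" -> "GLU92 LYS"
    text = POINT_MUTATION.sub(
        lambda m: f"{ONE_TO_THREE[m.group(1)]}{m.group(2)} {ONE_TO_THREE[m.group(3)]}",
        text,
    )
    # "Cys(48)" / "ser-52" / "Thr13" -> "CYS48" / "SER52" / "THR13"
    text = THREE_LETTER.sub(
        lambda m: f"{m.group(1).upper()}{m.group(2) or m.group(3)}",
        text,
    )
    return text


example = normalize_residues("mutant L343F retained activity; Cys(48) or Thr13")
```

The point-mutation pattern is kept case-sensitive so that tokens such as "p53" or "K18" are left alone; it requires a letter, digits, and a second letter. Real text contains many more surface forms, so a production system would need a broader pattern set and a check against the protein sequence.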

Appendix C 


annotations for the protein p53 

Table C.1: Examples of literature mined annotations of protein residues in 

p53. The listed data are grouped by topics. 

regulatory PTM 

RID+UID 

PMID 10930428 

PAS 

RID+UID 


pred = creased 

arg1 = a background 

arg1-of = constitutive phosphorylation 

arg1-at = SER6 that 

arg2 = 10-fold 

arg2-upon = upon exposure 

arg2-to = either ionizing radiation or UV light 

pred = exhibited 

arg1 = Untreated A549 cells 

arg2 = a background 

arg2-of = constitutive phosphorylation 

arg2-at = SER6 that 

pred = is 

arg1 = The relative phosphorylation 

arg1-of = THR18 

arg1-by = VRK2B 

arg2 = similar 

arg2-in = magnitude 

arg2-to = that induced 

arg2-by = taxol 

PMID 12487430 

PAS 

RID+UID 

THR18 P53_HUMAN 

pred = compared 

arg1 = that phosphorylation 

arg1-at = THR18 decreased binding 

arg1-to = recombinant Mdm2 protein 

arg2 = 

arg2-with = the unphosphorylated and the two other single phosphorylated analogues 

PMID 11030628 

PAS 

RID+UID 


pred = regulates 

arg1 = and phosphorylation 

arg1-of = SER46 

arg2 = the transcriptional activation 

arg2-of = this apoptosis-inducing gene 

PMID 11875057 

PAS 

RID+UID 


pred = hibited 

arg1 = IR-induced phosphorylation 


arg2 = 

arg2-by = wortmannin 

PMID 14757188 

PAS 


pred = duce 

arg1 = 

arg1-in = synergy 

arg2 = ATM-mediated phosphorylation 

arg2-of = the SER15 site 

RID+UID 


PMID 17292432 

PAS 

RID+UID 


pred = suppressed 

arg2 = both NaVO(3)-induced SER15 phosphorylation and accumulation 


PMID 11850826 

PAS 

RID+UID 


pred = observed 

arg1 = Increased phosphorylation 


arg2 = 

arg2-in = heat shocked GM638 

PMID 10933801 

PAS 

RID+UID 


pred = define 

arg1 = These data 

arg2 = THR55 

arg2-as = a novel phosphorylation site and 

arg2-for = the first time show threonine phosphorylation 

arg2-of = human 

PMID 15116093 

PAS 

RID+UID 

PMID 9246643 

PAS 


pred = clarify 

arg1 = This study 

arg2 = the biological significance 

arg2-of = doxorubicin-induced THR55 phosphorylation 

pred = reduced 

arg1 = phosphorylation 


arg2 = and phosphorylation 



pred = reversed 

arg1 = but SER315 

arg2 = the effect 

arg2-of = phosphorylation 


RID+UID 

PMID 7926727 

PAS 

RID+UID 

PHE19 P53_HUMAN 

pred = are 

arg1 = PHE19 

arg2 = crucial 

arg2-for = the interactions 

arg2-between = 


binding activity 


PMID 11323395 

PAS 

RID+UID 

pred = play 

arg1 = 


arg2 = a key role 

arg2-in = the dissociation 

arg2-of = mdm2 

arg2-in = response 

arg2-to = Cr(VI) 

PMID 17914575 

PAS 

RID+UID 

CYS135 P53_HUMAN 

pred = generates 

arg1 = that the amino acid change CYS135˜ARG 

arg1-in = the human TP53 

arg2 = the loss 

arg2-of = TP53 DNA-binding activity 

PMID 16784539 

PAS 


pred = dephosphorylates 

arg1 = both 

arg1-in = vitro and 

arg1-in = vivo and 

arg2 = the SER315 site 


RID+UID 

PMID 10432310 

PAS 

RID+UID 


protein-protein-interaction 

pred = containing 

arg2 = phosphate 

arg2-at = SER20 inhibited DO-1 binding 

PMID 11960368 

PAS 

RID+UID 


pred = mutated 

arg1 = analysis 

arg1-of = HDM2 proteins 

arg2 = 

arg2-at = the consensus Akt recognition sites 


PMID 11172034 

PAS 

RID+UID 

PMID 7624134 

PAS 

ARG175 P53_HUMAN 

pred = abolish 

arg1 = mutations ARG175˜HIS or ARG248˜TRP 

arg2 = the association 



pred = abolished 

arg1 = 

arg1-to = alanine ( p53- SER315˜ALA ) 


arg2 = phosphorylation 

arg2-by = cdk2 kinase 

RID+UID 

PMID 7624134 

PAS 

RID+UID 


pred = required 

arg1 = SER315 

arg1-of = wtp53 

arg2 = 

arg2-for = transcriptional activity 

arg2-in = vivo 

PMID 16818505 

PAS 

RID+UID 

CYS238 P53_HUMAN

pred = retains 

arg1 = ( CYS238˜TYR ) mutant 

arg2 = functional wild-type 

PMID 16707427 

PAS 


biological activity 


arg1 = the ARG175˜LEU mutant 

arg2 = an attenuated tumor suppressor activity 

arg2-in = the regulation 

arg2-of = transcription 

RID+UID 

PMID 10616523 

PAS 

RID+UID 


disease 


arg1 = The acquisition 

arg1-of = both mutations ( GLY245˜VAL and ARG72˜PRO ) 

arg1-in = the transformation 

arg1-from = transient leukemia 

arg1-to = overt acute megakaryoblastic leukemia 

arg2 = a functional role 

arg2-of = mutant 

PMID 18181044 

PAS 


pred = sociated 

arg1 = the development 

arg1-of = lung carcinoma and that ARG72˜PRO genotype 

arg2 = 

arg2-with = a poorer prognosis 

arg2-of = lung cancer 



RID+UID 

PMID 7761089 

PAS 

RID+UID 

VAL138 P53_HUMAN

molecular stability 

pred = showed 

arg1 = The human VAL138 mutant 

arg2 = temperature-sensitive transformation 

arg2-of = rat embryo fibroblasts ( REFs ) 

arg2-in = collaboration assay 

arg2-with = activated 

PMID 15703170 

PAS 


pred = duce 

arg1 = oncogenic mutations HIS168˜ARG and z:resi ty

ARG249˜SER 

arg2 = substantial structural perturbation 

arg2-around = the mutation site 

arg2-in = the L2 and L3 loops 


Appendix D 


annotations for the protein Jak2 


Table D.1: Examples of literature-mined annotations of protein residues in Jak2. The listed data are grouped by topic.

disease 

PMID 16896569 

RID+UID 

VAL617 JAK2_HUMAN

pred = improved 

arg1 = The improved knowledge 

arg1-of = the molecular basis 

arg1-of = the disease because 

arg1-of = the discovery 

arg1-of = the VAL617˜PHE mutation 

arg1-in = the JAK2 gene 

arg2 = the molecular diagnosis and 

PMID 16503548 

RID+UID 

PAS 


pred = is 

arg1 = that the JAK2 VAL617˜PHE mutation 

arg2 = rare 

arg2-in = patients 

arg2-with = idiopathic erythrocytosis 

PMID 16247455 

RID+UID 

PAS 


pred = reported 

arg1 = A missense somatic mutation 

arg1-in = JAK2 gene ( JAK2 VAL617˜PHE ) 

arg2 = 

arg2-in = chronic myeloproliferative disorders 

PMID 18024388 

RID+UID 

PAS 


pred = is 

arg1 = The JAK2 VAL617˜PHE point mutation 

arg2 = rare 

arg2-in = hypereosinophilic syndrome and/or chronic eosinophilic leukemia 

PMID 15858187 

genetic 

RID+UID 

PAS 


pred = had 

arg1 = All 51 patients 

arg1-with = 9pLOH 

arg2 = the VAL617˜PHE mutation 

pred = is 

arg1 = VAL617˜PHE 

arg2 = a somatic mutation present 

arg2-in = hematopoietic cells 

molecular function 


PMID 15970705 

RID+UID 

PAS 



arg1 = JAK2 ( VAL617˜PHE ) 

arg2 = 

arg2-with = constitutive phosphorylation 

arg2-of = JAK2 and its downstream effectors 

arg2-as = 

PMID 16239216 

RID+UID 

PAS 


pred = duces 

arg1 = that the homologous VAL617˜PHE mutation 

arg2 = activation 

arg2-of = JAK1 and Tyk2 

PMID 16384930 

RID+UID 

PAS 


pred = link 

arg1 = the presence 

arg1-in = PV erythroblasts 

arg1-of = proliferative and antiapoptotic signals that 

arg2 = the JAK2 VAL617˜PHE mutation 

arg2-with = the inhibition 

arg2-of = death receptor signaling 

PMID 16442619 

RID+UID 

PAS 


pred = does 

arg1 = crease 

arg1-of = expression and kinase activity 

arg1-of = JAK2 

arg1-in = CML cells 

arg2 = result 

arg2-from = the JAK2 VAL617˜PHE activation mutation and that transformation 

arg2-into = to blast crisis 

PMID 16461300 

RID+UID 

PAS 



arg1 = the presence 

arg1-of = the JAK2 VAL617˜PHE mutation 

arg2 = 

arg2-with = higher platelet activation 

PMID 16904848 

RID+UID 

PAS 


pred = transmit 

arg1 = that JAK2 VAL617˜PHE 

arg2 = signals 

arg2-from = ligand-activated TpoR or EpoR 

PMID 15863514 

RID+UID 

PAS 


pred = changes 

arg2 = conserved VAL617˜PHE 

arg2-in = the pseudokinase domain 

arg2-of = JAK2 that 
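The PAS entries in the tables above are printed as one `slot = value` line per predicate or argument. As a minimal sketch of how such a listing could be read back into a data structure, the following Python snippet parses those lines. The class name `PAS`, the function `parse_pas`, and the choice to store repeated slots as lists are assumptions made for illustration; they are not taken from the thesis implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PAS:
    """One predicate-argument structure as printed in the tables (hypothetical type)."""
    pred: str = ""
    # Maps slot names such as "arg1" or "arg2-of" to their text values.
    # A slot may repeat (e.g. several "arg1-of" lines), so values are lists.
    args: dict = field(default_factory=dict)


def parse_pas(lines):
    """Parse 'slot = value' lines into a PAS; lines without '=' are skipped."""
    pas = PAS()
    for line in lines:
        slot, sep, value = line.partition("=")
        if not sep:
            continue  # e.g. a bare "PAS" or "RID+UID" label line
        slot, value = slot.strip(), value.strip()
        if slot == "pred":
            pas.pred = value
        else:
            pas.args.setdefault(slot, []).append(value)
    return pas


# Entry taken from Table D.1 (topic "genetic").
example = parse_pas([
    "pred = had",
    "arg1 = All 51 patients",
    "arg1-with = 9pLOH",
    "arg2 = the VAL617˜PHE mutation",
])
print(example.pred, example.args)
```

Empty slots such as `arg2 =` are kept as empty strings, so that an argument position that is filled only through its prepositional slots (for example `arg2-in`) remains visible.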


Appendix E 


annotations of the category binding event


Table E.1: Mined functional annotations of protein residues with information on binding events. The mined information corresponds to 17 protein residues listed in MSDsite. The extracted information can be used for functional annotation and validation of predicted binding sites in the database.

RID+UID / Sentence / PAS

T199 CAH2_HUMAN

“The three-dimensional structures of azide-bound and sulfate-bound T199V CAIIs were determined by x-ray crystallographic methods at 2.25 and 2.4 A, respectively (final crystallographic R factors are 0.173 and 0.174, respectively).” (PMID:8262987)

pred = determined 

arg1 = The three-dimensional structures 

arg1-of = [azide-bound and sulfate-bound THR199 VAL CAIIs]/BINDING 

arg2 = 

arg2-by = x-prot:ray crystallographic methods 

arg2-at = at 2.25 and 2.4 A ,respectively ( final crystallographic 

R55 PPIA_HUMAN

“On the basis of the structure, it is proposed that Arg55 hydrogen-bonds to the nitrogen to deconjugate the resonance of the prolyl amide bond and thus facilitates the cis-trans rotation.” (PMID:8652511)

pred = proposed 

arg2 = [that ARG55 hydrogen-bonds]/BINDING 

arg2-to = the nitrogen 

pred = deconjugate 

arg1 = [that ARG55 hydrogen-bonds]/BINDING 

arg1-to = the nitrogen 

arg2 = the resonance 

arg2-of = the prolyl amide bond and 

L255 PH4H_HUMAN

“Only for the R252Q and L255V mutants were catalytically active tetramer and dimer recovered and for R252G some dimer, i.e. 20% (R252Q, tetramer), 44% (L255V, tetramer) and 4.4% (R252G, dimer) of the activity for the respective wild-type (wt) forms.” (PMID:9799096)

pred = recovered 

arg1 = active tetramer and dimer 

arg2 = and 

arg2-for = [ARG252 GLY some dimer]/BINDING 

Y156 HGXR_TRIFO

“But the forces involved in recognizing the exocyclic C2-substituents of the purine ring, which involve the Tyr156 hydroxyl, Ile157 backbone carbonyl, and Asp163 side-chain carboxyl, may be weakened by the shifted conformation of the peptide backbone resulted from loss of the Glu11-Arg155 salt bridge.” (PMID:9843428)

pred = resulted 

arg1 = 

arg1-by = the shifted conformation 

arg1-of = the peptide backbone 

arg2 = 

arg2-from = loss 

arg2-of = [the GLU11 ARG155 salt bridge]/BINDING 

K79 HGXR_TOXGO

“The Leu78-Lys79 peptide bond in the active site adopts the cis configuration, which it must to bind PRPP or pyrophosphate.” (PMID:10545171)

pred = adopts 

arg1 = [The LEU78 LYS79 peptide bond]/BINDING 

arg1-in = the active site 

arg2 = the 

G57 FLAV_CLOBE


“In the Clostridium beijerinckii flavodoxin, the reduction of the flavin mononucleotide (FMN) cofactor is accompanied by a local conformation change in which the Gly57-Asp58 peptide bond “flips” from primarily the unusual cis O-down conformation in the oxidized state to the trans O-up conformation such that a new hydrogen bond can be formed between the carbonyl group of Gly57 and the proton on N(5) of the neutral FMN semiquinone radical [Ludwig, M. L., Pattridge, K. A., Metzger, A. L., Dixon, M. M., Eren, M., Feng, Y., and Swenson, R. P. (1997) Biochemistry 36, 1259-1280].” (PMID:10353827)

pred = accompanied 

arg1 = ) cofactor 

arg2 = 

arg2-by = a local conformation change 

arg2-in = [which the GLY57 ASP58 peptide bond]/BINDING 

D160 APX_STRGR; M161 APX_STRGR; G201 APX_STRGR; R202 APX_STRGR; F219 APX_STRGR

“These studies allowed the tracing of the previously disordered region of the enzyme (Glu196-Arg202) and the identification of some of the functional groups of the enzyme that are involved in enzyme-substrate interactions (Asp160, Met161, Gly201, Arg202 and Phe219).” (PMID:10771423)

pred = involved 

arg1 = disordered region 

arg1-of = the enzyme ( GLU196 ARG202 ) and the identification 

arg1-of = some 

arg1-of = the functional groups 

arg1-of = the enzyme that 

arg2 = 

arg2-in = [enzyme-substrate interactions ( ASP160, MET161, GLY201, ARG202, PHE219)]/BINDING

I209 FIXL_RHIME

“Interaction between the iron-bound O(2) and Ile209 was also observed in the resonance Raman spectra of RmFixLH as evidenced by the fact that the Fe-O(2) and Fe-CN stretching frequencies were shifted from 575 to 570 cm(-1) (Fe-O(2)), and 504 to 499 cm(-1), respectively, as the result of the replacement of Ile209 with an Ala residue.” (PMID:10926518)

pred = observed 

arg1 = Interaction 

arg1-between = [the iron-bound O(2) and ILE209]/BINDING 

arg2 = 

arg2-in = the resonance Raman spectra 

arg2-of = RmFixLH as 
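In Table E.1 the argument that carries the binding information is marked as a bracketed span followed by a category label, e.g. `[the GLU11 ARG155 salt bridge]/BINDING`. The sketch below shows one way to pull the labelled spans and the residue mentions inside them out of a PAS value. The function name `tagged_residues`, the regular expressions, and the assumption that a label is a single upper-case token are illustrative choices, not the thesis implementation.

```python
import re

# The 20 standard amino acids in the upper-case three-letter form used in the PAS column.
AMINO_ACIDS = ("ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE "
               "LEU LYS MET PHE PRO SER THR TRP TYR VAL").split()

# A bracketed span followed by "/LABEL", e.g. "[...]/BINDING".
# Assumption: the label is one token of upper-case letters or underscores.
TAGGED_SPAN = re.compile(r"\[([^\]]+)\]/([A-Z_]+)")

# A residue mention such as "ILE209": three-letter code directly followed by a number.
RESIDUE = re.compile(r"\b(" + "|".join(AMINO_ACIDS) + r")(\d+)\b")


def tagged_residues(value):
    """Return (label, [(residue name, position), ...]) for each tagged span in value."""
    results = []
    for span, label in TAGGED_SPAN.findall(value):
        residues = [(name, int(pos)) for name, pos in RESIDUE.findall(span)]
        results.append((label, residues))
    return results


print(tagged_residues("[the GLU11 ARG155 salt bridge]/BINDING"))
print(tagged_residues("[the iron-bound O(2) and ILE209]/BINDING"))
```

A value without a tagged span, such as `the nitrogen`, yields an empty list, so untagged arguments can be skipped when collecting annotations for a residue.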


Appendix F 


annotations of active site residues 


Table F.1: Catalytic triad residues identified by MEDLINE extraction. The listed sentences describe the mentioned protein residues as catalytic (co-mention with the term “catalytic triad”); none of them is recorded in CSA, so the identified information is novel.

RID+UID / Sentence / PAS

D44 TPP2_HUMAN, H264 TPP2_HUMAN, S449 EPHA3_HUMAN

“The amino acids forming the putative catalytic triad (Asp-44, His-264, Ser-449) as well as the conserved Asn-362, potentially stabilizing the transition state, were replaced by alanine and the mutated cDNAs were transfected into human embryonic kidney (HEK) 293 cells.” (PMID:12445476)

pred = forming 

arg1 = The amino acids 

arg2 = [the putative catalytic triad ( ASP44, HIS264, SER449)]/ENZ ACT 

C25 CYSP1_CARCN, H159 CYSP1_CARCN, D175 CYSP1_CARCN

“The seven cysteine residues are aligned with those of papain and the catalytic triad (Cys25, His159, Asn175) of all cysteine peptidases of the papain family is conserved.” (PMID:10355634)

pred = aligned 

arg1 = The seven CYS+ 

arg2 = 

arg2-with = with those 

arg2-of = of papain and the catalytic triad ( CYS25 

C176 NADE_MYCTU, E52 NADE_MYCTU, K121 NADE_MYCTU

“The residues forming the putative catalytic triad (Cys176, Glu52 and Lys121) were replaced by alanine; the mutated enzymes were expressed in the Escherichia coli Origami (DE3) strain and purified.” (PMID:15748981)

pred = forming 

arg1 = The residues 

arg2 = [the putative catalytic triad ( CYS176, GLU52, and LYS121)]/ENZ ACT 

S1752 POLG_BVDVS

“Our study provides experimental evidence that histidine at position 1658 and aspartic acid at position 1686 constitute together with the previously identified serine at position 1752 (S1752) the catalytic triad of the pestiviral NS3 serine protease.” (PMID:10915606)


arg1 = 

arg1-with = the 

arg2 = [SER1752 ( S1752 ) the catalytic triad]/ENZ ACT 

arg2-of = the pestiviral NS3 serine protease. 

D167 POLS_SFV, H145 POLS_SFV, S219 POLS_SFV

“After this autoproteolytic cleavage, the free carboxylic group of Trp267 interacts with the catalytic triad (His145, Asp167 and Ser219) and inactivates the enzyme.” (PMID:18177892)

pred = interacts 

arg1 = the free carboxylic group 

arg1-of = TRP267 

arg2 = 

arg2-with = [the catalytic triad ( HIS145, ASP167, and SER219)]/ENZ ACT 

D122 ARY2_RAT

“Substitution of the catalytic triad Asp-122 with either alanine or asparagine resulted in the complete loss of protein structural integrity and catalytic activity.” (PMID:15209520)

pred = resulted 

arg1 = Substitution 

arg1-of = the catalytic triad ASP122 

arg1-with = either alanine or asparagine 

arg2 = 

arg2-in = the complete loss 

arg2-of = [protein structural integrity and catalytic activity]/ENZ ACT 


D156 LYPA1_HUMAN

“To investigate whether this bridging function occurs in vivo, two transgenic mouse lines were established expressing a muscle creatine kinase promoter-driven human LPL (hLPL) minigene mutated in the catalytic triad (Asp156 to Asn).” (PMID:9811888)

pred = mutated 

arg1 = ( hLPL ) minigene 

arg2 = 

arg2-in = [the catalytic triad (ASP156 ASN)]/ENZ ACT 
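The caption of Table F.1 states that residues were selected by co-mention with the term “catalytic triad”. A minimal sketch of such a sentence-level co-mention check is given below. The function name `comentioned_residues`, the mention pattern (title-case three-letter code, optional hyphen, number), and the output format are assumptions for illustration and do not reproduce the entity recognition used in the thesis.

```python
import re

# Title-case three-letter codes as they appear in the abstract sentences.
THREE_LETTER = ("Ala Arg Asn Asp Cys Gln Glu Gly His Ile "
                "Leu Lys Met Phe Pro Ser Thr Trp Tyr Val").split()

# Matches mentions such as "Cys176" or "Asp-44" (optional hyphen before the number).
MENTION = re.compile(r"\b(" + "|".join(THREE_LETTER) + r")-?(\d+)\b")


def comentioned_residues(sentence, term="catalytic triad"):
    """Return residue mentions (e.g. 'CYS176') if the sentence also contains the term."""
    if term.lower() not in sentence.lower():
        return []
    return [f"{name.upper()}{pos}" for name, pos in MENTION.findall(sentence)]


sentence = ("The residues forming the putative catalytic triad "
            "(Cys176, Glu52 and Lys121) were replaced by alanine")
print(comentioned_residues(sentence))
```

A check at this level only establishes that a residue and the term occur in the same sentence. For the TPP2 sentence in Table F.1 it would also return ASN362, which the sentence mentions outside the triad, so a further step such as the PAS analysis is still needed to decide which residues the term actually refers to.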


Appendix G 

Glossary 

3D pattern – a recurrent residue triplet configuration (with k=2 or k=3 interaction of residues) within a dataset of protein structures.

arg – the argument of a PAS 

BIND – the set of binding-related functional annotations of extracted protein residues, i.e. annotations are labelled as BINDING.

BINDING – a category in MAN, describing binding events of a protein residue. 

CSA – a database of manually curated active sites with structure templates derived from PDB. 

Contextual feature . 

EC – Enzyme classification identifier. 

ER – entity recognition. 

ENZ – the set of enzyme-related functional annotations of extracted protein residues, i.e. annotations are labelled as ENZ ACT.

ENZ ACT – a category in MAN, describing enzyme-related information. 

FA – a functional annotation; or the set of extracted protein residues with functional annotations. 

FEAT – a categorisation scheme based on UniProtKB. 

FN – a false negative. 

FP – a false positive. 

FT – a record in the UniProt data file with functional annotation.

Functional annotation – Information on biological function assigned to a protein residue. 

GC – a manually annotated test set with abstract texts drawn from a random selection of UniProtKB citations. 

GO – Gene Ontology. 

MAN – a categorisation scheme based on manual analysis on MEDLINE. 

MEDLINE – a database of citations and abstract texts from biomedical publications. 

NP – a noun phrase is defined as a nominal sequence. 

OLDFIELD – a non-redundant structure dataset of protein domains selected from PDB by sequence alignments. 

OPR – a semantic relation between a residue, its source protein, and hosting organism; or the set of mined protein residues. 


PAS – a data structure to accommodate the semantic relation between a predicate and its arguments.

PDBID – PDB identifier. 

PDB – the primary database of protein structure with spatial coordinates. 

PMID – a PubMed identifier. 

POS – a class of words, e.g. noun, verb, adjective, used for linguistic analysis. 

PP – a prepositional phrase is defined as preposition + noun phrase. 

pred – the predicate of a PAS. 

Protein residue – a residue with known association to its source protein within a hosting organism (OPR). 

RE – Relation extraction. 

RID – a residue identifier: residue name + residue position in the protein sequence.

SCOP40 – a non-redundant protein structure dataset derived from SCOP. 

SCOP – a derived protein structure database with manual classification of proteins based on structure similarities. 

SITE – a record in the PDB data file denoting residues of a functional site. 

Structure pattern – cf. 3D pattern. 

TN – a true negative. 

TP – a true positive. 

TID – a Taxonomy identifier based on the NCBI Taxonomy guideline. 

UID – a Protein identifier based on the UniProtKB guideline. 

UniProtKB – a protein sequence database with manual annotations on protein residues. 

VG – a verb group is a sequence of verbs, auxiliaries, or verb modifiers.

VP – a verb phrase, consisting of a verb group + noun phrase. 

XC – a cross-validation corpus based on references from UniProtKB. 

chainID – a protein chain identifier in a PDB entry. 

k=2, k=3 – a residue triplet configuration with two-way or three-way interaction. 

resName – a residue name. 

resSeq – a protein residue sequence identifier from a PDB entry. 

seqIndex – a protein residue sequence identifier from a UniProtKB entry. 
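The tables in the preceding appendices give the RID either with a one-letter residue name (e.g. T199 in the RID+UID column) or with a three-letter name (e.g. THR199 in the PAS column). As a minimal sketch of how the two forms could be brought to a common representation together with the UID, consider the snippet below. The function name `parse_rid_uid`, the returned tuple layout, and the tolerance for a space in place of the underscore in the UID are assumptions for illustration.

```python
import re

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

# RID: a one-letter or three-letter residue name followed by the sequence position.
RID_PATTERN = re.compile(r"^([A-Z]|[A-Z]{3})(\d+)$")


def parse_rid_uid(text):
    """Split e.g. 'T199 CAH2_HUMAN' into (three-letter name, position, UID)."""
    # Treat '_' and ' ' alike so that 'CAH2 HUMAN' and 'CAH2_HUMAN' both work.
    rid, *uid_parts = text.replace("_", " ").split()
    match = RID_PATTERN.match(rid)
    if match is None:
        raise ValueError(f"not a residue identifier: {rid!r}")
    name, position = match.group(1), int(match.group(2))
    if len(name) == 1:
        name = ONE_TO_THREE[name]  # KeyError for non-standard letters
    return name, position, "_".join(uid_parts)


print(parse_rid_uid("T199 CAH2 HUMAN"))
print(parse_rid_uid("VAL617 JAK2_HUMAN"))
```

Normalising to the three-letter form makes RIDs from the RID+UID column directly comparable with the residue mentions inside the PAS arguments.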
