Automatic functional annotation of predicted active sites - European ...
Automatic functional annotation<br />
of predicted active sites:<br />
combining PDB and literature mining<br />
Kevin Nagel<br />
Wolfson College<br />
A dissertation submitted to the University of Cambridge<br />
for the degree of Doctor of Philosophy<br />
European Molecular Biology Laboratory,<br />
European Bioinformatics Institute,<br />
Wellcome Trust Genome Campus, Hinxton,<br />
Cambridge CB10 1SD, United Kingdom.<br />
Email: kevin5jan@googlemail.com<br />
January 2009
Declaration<br />
This dissertation is the result of my own work, and includes nothing which is the outcome<br />
of work done in collaboration, except where specifically indicated in the text. The dissertation<br />
does not exceed the specified length limit of 300 pages as defined by the Biology<br />
Degree Committee. This thesis has been typeset in 12pt font using LaTeX 2ε according<br />
to the specifications defined by the Board of Graduate Studies and the Biology Degree<br />
Committee.<br />
Summary<br />
Kevin Nagel<br />
European Bioinformatics Institute<br />
University of Cambridge<br />
Dissertation title: Automatic functional annotation of predicted active sites:<br />
combining PDB and literature mining.<br />
Proteins are essential to cell functions, which are mainly identified in biological experiments.<br />
The structural models for proteins help to explain their function, but are not direct<br />
evidence for their function. Nonetheless, we can mine structural databases, such as the Protein<br />
Data Bank (PDB), to filter out shared structural components that are meaningful with<br />
regard to protein function.<br />
This thesis applied mining techniques to PDB to identify evolutionarily conserved structural<br />
patterns, e.g. active sites. This analysis retrieved 3- and 4-bodies with assumed two- and<br />
three-way residue interactions that were selected from a distribution analysis of<br />
residue triplets. A subset of the mined patterns is assumed to represent active sites,<br />
which should be confirmed by annotations gathered through automatic literature analysis.<br />
Literature analysis for the functional annotation of proteins relies on the extraction<br />
of Gene Ontology (GO) terms from the context of a protein mention. The annotation of protein residues<br />
requires the identification of chemical functions, which could be found in the context<br />
of residue mentions. MEDLINE abstracts have been processed to identify protein mentions<br />
in combination with species and residues (F1-measure 0.52; the F1-measure is the<br />
harmonic mean of precision and recall and summarises a test’s accuracy). The<br />
identified protein-species-residue triplets have been validated and benchmarked against<br />
reference data resources. Then, contextual features were extracted through shallow and<br />
deep parsing and the features have been classified into predefined categories (F1-measure<br />
ranges from 0.15 to 0.67). Furthermore, the feature sets have been aligned with annotation<br />
types in UniProtKB to assess the relevance of the annotations for ongoing curation<br />
projects.<br />
Altogether, the annotations have been assessed automatically and manually<br />
against reference data resources.<br />
All of MEDLINE has been processed to filter out annotations for residues. A subset of<br />
identified catalytic sites could be cross-validated against the Catalytic Site Atlas (CSA;<br />
44 out of 221). 429 out of 512 protein residues from MSDsite were then annotated with<br />
contextual data. Altogether, MEDLINE does not provide sufficient data to fully annotate<br />
the content from PDB. Conversely, residue annotation is achieved with a different feature<br />
set than that provided by GO, and incomplete annotations in the reference datasets can be<br />
filled from public literature.<br />
Acknowledgements<br />
This thesis would not have been possible without the support, direction, and love of a multitude<br />
of people. First, I would like to thank my supervisor Dietrich Rebholz-Schuhmann<br />
for his trust, encouragement, and for all his unconditional support and guidance. Throughout, Dietrich<br />
has given me opportunities and a sound research methodology. Working<br />
with him I have learned the value of vision, and persistence in achieving it.<br />
I am blessed to have had Tom Oldfield for my second supervisor. Ever since I was<br />
interviewed by Tom, he has been inspiring, helpful and most of all patient. I will look back<br />
fondly on our discussions, the "insights" in protein science he gave me, and the cheerful<br />
and motivational chats. I am deeply indebted for his belief in me.<br />
I would like to thank my thesis committee members for their valuable and constructive<br />
comments and criticism: Michael Ashburner, Kim Henrick, and Rob Russell.<br />
They all seemed to find time for me despite their busy schedules.<br />
A special thank you must go to Kim Henrick; had he not encouraged me to pursue a<br />
research position I would not be a scientist now.<br />
I would also like to acknowledge Antonio Jimeno for his time, patience, and suggestions<br />
and especially for reminding me to keep my focus always. But most of all I will remember<br />
the great times we had cycling to and from work.<br />
I would like to thank the past and present members of the Rebholz Group (Text<br />
Mining). During my years of research, the group has expanded and I have had the chance<br />
to learn from them as well as to have fun with them within the group.<br />
I am also thankful to the European Molecular Biology Laboratory (EMBL) for the scholarship<br />
and the organised EMBL International PhD programme, throughout which I have<br />
had the chance to meet many talented and cheerful PhD students from the EMBL/EBI<br />
Hinxton.<br />
A special thank you to Christina Granroth and Dagmar Harzheim, who have done the<br />
proofreading of this thesis. Thank you, Dagmar, for helping me make clearer what I want to say.<br />
Finally, I would like to acknowledge my wife Almut Nagel and my daughter Juli Nagel.<br />
Without Almut I would have become a working maniac with no joy in life; she helped me<br />
to maintain balance during my PhD research and also for the future. My special thanks<br />
and love will go to Juli, aged one, from whom I have learned so much.<br />
Contents<br />
1 Introduction<br />
1.1 Proteins and functional sites<br />
1.2 Motivation<br />
1.3 Objective<br />
1.4 Related works<br />
1.5 Challenges<br />
1.6 Guide to remaining chapters<br />
2 Background<br />
2.1 Protein related data resources<br />
2.1.1 Protein Data Bank<br />
2.1.2 Universal Protein Knowledge base<br />
2.1.3 Gene Ontology<br />
2.1.4 Biomedical literature<br />
2.2 Protein structure data mining<br />
2.2.1 Hypothesis-driven data analysis<br />
2.2.2 Discovery-driven data mining<br />
2.3 Biomedical literature mining<br />
2.3.1 Biological entity recognition<br />
2.3.2 Biological relation extraction<br />
2.4 Conclusion<br />
3 Mining residue interactions as triads from PDB<br />
3.1 Algorithms<br />
3.1.1 Structural feature extraction<br />
3.1.2 Detection of significant configurations as interactions<br />
3.1.3 Grouping and selecting frequent configurations<br />
3.2 Analysing available non-redundant protein structure sets<br />
3.3 Evaluation methods<br />
3.4 Results<br />
3.4.1 Identification of residue interactions is dependent on data selection<br />
3.4.2 The interaction distance correlates with the distribution of residue triads<br />
3.4.3 Interaction classification is sensitive to the size of cross-validation<br />
3.5 Discussion<br />
3.6 Conclusion<br />
4 Prediction of functions for mined residue triads<br />
4.1 Evaluation methods<br />
4.2 Results<br />
4.2.1 Identification of homologous metal binding sites<br />
4.2.2 Validation of convergent metal binding sites<br />
4.2.3 Recovering active sites and catalytic triads from the dataset<br />
4.2.4 Discovering the conserved serine residue in the catalytic triad (quartet)<br />
4.3 Discussion<br />
4.4 Conclusion<br />
5 Identification of protein residues in MEDLINE<br />
5.1 Algorithms<br />
5.1.1 Protein and organism entity recognition<br />
5.1.2 Entity recognition of protein residue<br />
5.1.3 Association identification of the entity triplet organism, protein, and residue<br />
5.2 The construction of evaluation test corpora<br />
5.3 Evaluation methods<br />
5.4 Results<br />
5.4.1 Evaluation of organism, protein, and residue entity recognition<br />
5.4.2 Performance study on the entity triplet association<br />
5.4.3 Cross-validation of identified residues with UniProtKB<br />
5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins<br />
5.5 Discussion<br />
5.6 Conclusion<br />
6 Information extraction from the context of a residue in text<br />
6.1 Algorithms<br />
6.1.1 Extraction of contextual features<br />
6.1.2 Categorisation of contextual features<br />
6.2 Evaluation methods<br />
6.3 Results<br />
6.3.1 Contextual feature extraction evaluated<br />
6.3.2 Performance analysis of the classifiers<br />
6.4 Discussion<br />
6.5 Conclusion<br />
7 Extraction of functional annotation for protein residues from MEDLINE<br />
7.1 Evaluation methods<br />
7.2 Results<br />
7.2.1 Evaluation of the developed functional annotation extraction system<br />
7.2.2 Studying mined functional annotations for the proteins p53 and Jak2<br />
7.2.3 Cross-validation of mined catalytic residues with CSA<br />
7.2.4 Annotation of protein residues in MSDsite<br />
7.3 Discussion<br />
7.4 Conclusion<br />
8 Combining active site prediction with mined functional annotations<br />
8.1 Algorithms<br />
8.1.1 Combining protein structure data with literature data<br />
8.2 Evaluation methods<br />
8.3 Results<br />
8.3.1 Protein residue mapping between three data resources<br />
8.3.2 Rediscovery of active sites and catalytic residues<br />
8.3.3 Search for novel catalytic residues<br />
8.3.4 General correlation found between predicted functional sites and extracted functional annotations<br />
8.4 Discussion<br />
8.5 Conclusion<br />
9 Conclusions and future work<br />
9.1 Summary of main contributions<br />
9.2 Limitations and future works<br />
A Examples of errors in relation extraction<br />
B Examples of extracted functional annotations compared with UniProtKB<br />
C Examples of extracted functional annotations for the protein p53<br />
D Examples of extracted functional annotations for the protein Jak2<br />
E Examples of extracted functional annotations of the category binding event<br />
F Examples of extracted functional annotations of active site residues<br />
G Glossary<br />
List of Figures<br />
1.1 The standard amino acids<br />
1.2 Examples of functional sites in proteins<br />
1.3 The protein universe and its knowledge representation<br />
2.1 Data banks in the protein universe<br />
2.2 Three hyperlinked protein data banks<br />
2.3 Categories for protein sequence annotation UniProtKB<br />
2.4 GO terms are not suitable for protein residue annotation<br />
3.1 Overview of processes and evaluation methods of the developed 3D pattern identification system<br />
3.2 Four classes of interactions within a 3-body<br />
3.3 Non-redundant structure set for 3D pattern mining<br />
3.4 Distribution analysis of extracted residue triplets<br />
3.5 Comparison of extracted residue triplets based on their interaction type<br />
3.6 The effect of varying the cross-validation sample size on significance testing of residue interaction<br />
4.1 A metal binding site with the 3Cys pattern in OLDFIELD<br />
4.2 A metal binding site with the Cys-2His pattern in OLDFIELD<br />
4.3 A metal binding site with the 3Cys pattern in SCOP40<br />
4.4 A metal binding site with the Cys-2His pattern in SCOP40<br />
4.5 Re-discovery of the catalytic triad as Asp-His-Ser pattern in OLDFIELD<br />
5.1 Overview of processes and evaluation methods for the developed protein residue identification system<br />
5.2 Test corpora for information extraction evaluation<br />
5.3 Identified protein residues in MEDLINE<br />
5.4 Cross-validation of citations from identified protein residues with UniProtKB/PDB<br />
6.1 Overview of processes and evaluation methods of the developed contextual feature extraction system<br />
7.1 Performance evaluation of the functional annotation extraction system<br />
7.2 Cross-validation of text mined catalytic residues with CSA<br />
7.3 Cross-validation of text mined binding residues with MSDsite<br />
8.1 Overview of processes and evaluation methods of combining the protein structure dataset and literature dataset<br />
8.2 Lookup table for PDB/UniProtKB mapping<br />
8.3 Overview of the combined datasets from protein structure data and biomedical literature data<br />
List of Tables<br />
3.1 Study on the effect of varying the interaction distance threshold in structure triangulation<br />
4.1 Summary of extracted data at each protein structure data mining step<br />
4.2 Identification of metal binding sites in OLDFIELD<br />
4.3 Convergent metal binding sites identified in SCOP40<br />
4.4 List of cross-validated active site residues<br />
4.5 Extending the catalytic triad into 4-bodies<br />
5.1 Regular expression patterns for the detection of residue mentions in text<br />
5.2 Performance evaluation of residue entity recognition<br />
5.3 Performance evaluation of protein entity recognition<br />
5.4 Performance evaluation of organism entity recognition<br />
5.5 Performance evaluation of residue-protein-organism entity association detection<br />
5.6 Performance evaluation of protein-organism and protein-residue entity association detections<br />
5.7 A specialised performance evaluation between GC and XC2<br />
6.1 Biological categories for the classification of protein residue related information<br />
6.2 Category distribution in the text feature reference set<br />
6.3 Evaluation of syntactical language parser performance<br />
6.4 Performance analysis of the classifiers (confusion matrix)<br />
6.5 Performance evaluation of the classifiers (precision, recall, F1 measure)<br />
8.1 Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen<br />
8.2 Identified catalytic residues from MEDLINE extraction<br />
8.3 Catalytic triad residues available from the mined functional annotations<br />
8.4 Functional annotations of protein residues in predicted functional sites<br />
8.5 Homology-based transfer of extracted functional annotations for protein residues in the mined pattern data<br />
A.1 Examples of errors in the relation extraction for the detection of contextual features<br />
B.1 Comparison of extracted functional annotations from GC with UniProtKB<br />
C.1 Examples of literature mined annotations of protein residues in p53<br />
D.1 Examples of literature mined annotations of protein residues in Jak2<br />
E.1 Mined functional annotations of protein residues with information on binding events<br />
F.1 Identified catalytic triad residues from MEDLINE extraction<br />
Chapter 1<br />
Introduction<br />
1.1 Proteins and functional sites<br />
Genomic information encodes the blueprint to build an organism. The decoding and<br />
implementation of genetic information depends on the functions of proteins. Each protein<br />
is the result of transcribing a gene into mRNA, which is translated into a polypeptide.<br />
Hence, a protein is a gene product. The elementary units of a protein are the 20 natural<br />
standard amino acids, each with four invariant parts, namely a central chiral alpha carbon<br />
(Cα), an amine group (NH2), a carboxylic acid group (COOH), and a hydrogen atom (H), plus a<br />
characteristic side chain (R). Apart from the invariant amine and carboxylic acid groups,<br />
which give every amino acid the property of a zwitterion, distinctive physicochemical<br />
properties are defined by the side chain group. Side chains can be polar, acidic/basic, aromatic,<br />
bulky, or conformationally flexible, and can provide cross-linking ability, hydrogen-bond<br />
capability, or chemical reactivity. Figure 1.1 lists all the standard amino acids and their<br />
common classification on the basis of the nature of the side chain group.<br />
During biosynthesis, ribosomes catalyse the polymerisation of amino acids through<br />
condensation and form peptide bonds between the NH2 and COOH groups of two consecutive<br />
amino acids. The backbone (main chain) of the resulting polypeptide is the repeating<br />
sequence NH2-C-CO-[NH-C-CO]n-NH-C-CO. This is the primary structure of a protein<br />
Amino Acid 3-Letter 1-Letter Side-chain polarity<br />
Alanine Ala A nonpolar<br />
Arginine Arg R polar<br />
Asparagine Asn N polar<br />
Aspartic acid Asp D polar<br />
Cysteine Cys C nonpolar<br />
Glutamic acid Glu E polar<br />
Glutamine Gln Q polar<br />
Glycine Gly G nonpolar<br />
Histidine His H polar<br />
Isoleucine Ile I nonpolar<br />
Leucine Leu L nonpolar<br />
Lysine Lys K polar<br />
Methionine Met M nonpolar<br />
Phenylalanine Phe F nonpolar<br />
Proline Pro P nonpolar<br />
Serine Ser S polar<br />
Threonine Thr T polar<br />
Tryptophan Trp W nonpolar<br />
Tyrosine Tyr Y polar<br />
Valine Val V nonpolar<br />
Figure 1.1: The standard amino acids. The trivial names, 3-letter and 1-letter abbreviations are listed<br />
along with the physicochemical properties of their side chains.<br />
and it will fold spontaneously due to different interactions of its amino acid composition<br />
with environmental factors, e.g. solvent, salt, chaperones. The most prominent formation<br />
during the folding process is the hydrophobic core, which stabilises the protein structure.<br />
Amino acids such as alanine, valine, leucine, isoleucine, phenylalanine, and methionine<br />
are clustered in the interior of a protein, while charged or polar side chains are turned to<br />
the solvent-exposed surface and interact with surrounding water molecules. Minimising<br />
the exposure of hydrophobic side chains to water is the principal driving force of folding.<br />
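As an illustration only, the side-chain classification of Figure 1.1 can be written down as a small lookup table. The sketch below uses Python, which is a choice made for this example and not something the thesis prescribes; the names AMINO_ACIDS and nonpolar_fraction are hypothetical. It computes the share of nonpolar residues in a 1-letter sequence, which is at best a crude indicator of how much of a chain could contribute to a hydrophobic core.

```python
# Side-chain polarity exactly as listed in Figure 1.1:
# 1-letter code -> (3-letter code, polarity). Names are illustrative.
AMINO_ACIDS = {
    "A": ("Ala", "nonpolar"), "R": ("Arg", "polar"), "N": ("Asn", "polar"),
    "D": ("Asp", "polar"), "C": ("Cys", "nonpolar"), "E": ("Glu", "polar"),
    "Q": ("Gln", "polar"), "G": ("Gly", "nonpolar"), "H": ("His", "polar"),
    "I": ("Ile", "nonpolar"), "L": ("Leu", "nonpolar"), "K": ("Lys", "polar"),
    "M": ("Met", "nonpolar"), "F": ("Phe", "nonpolar"), "P": ("Pro", "nonpolar"),
    "S": ("Ser", "polar"), "T": ("Thr", "polar"), "W": ("Trp", "nonpolar"),
    "Y": ("Tyr", "polar"), "V": ("Val", "nonpolar"),
}


def nonpolar_fraction(sequence: str) -> float:
    """Return the fraction of residues whose side chain is nonpolar."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    # A KeyError is raised for any symbol outside the 20 standard amino acids.
    nonpolar = sum(
        1 for aa in sequence.upper() if AMINO_ACIDS[aa][1] == "nonpolar"
    )
    return nonpolar / len(sequence)
```

For example, the residues named in the paragraph above as typical core residues (A, V, L, I, F, M) are all classified as nonpolar in this table, whereas a sequence such as "ADKL" is half polar and half nonpolar.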
The process of protein folding involves the formation of regular secondary structure<br />
elements (SSE), such as alpha helix and beta strand, which are stabilised by intramolecular<br />
hydrogen bonds and contacts between side chain atoms (van der Waals interaction). By<br />
following a helical path, the carbonyl group of residue i and the amide group of residue i+4<br />
of the main chain are arranged in alignment and stabilise the local structure by hydrogen-bond<br />
formation. The side chains protrude out from the helically coiled backbone and<br />
define the surface of the helix. In contrast, beta strands are formed by hydrogen bonds<br />
between distant regions on the peptide. Depending on the relative direction of the peptide regions,<br />
two adjacent strands can be characterised as parallel or antiparallel. Because the backbone<br />
adopts an almost fully extended conformation, the side chains of every second residue (i, i + 2, ...) face<br />
the same direction. A set of interacting strands is called a sheet. Within the process of<br />
intramolecular stabilisation of the main chain, the regions between secondary structure<br />
elements adopt a loosely defined conformation such as turns and random coils or loops.<br />
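The two regularities described above, the i to i+4 hydrogen-bond pattern of the helix and the alternating side-chain orientation along a strand, can be made concrete with a minimal sketch. It assumes an idealised helix and strand with residues numbered from 1; the function names and the labels "up" and "down" are hypothetical and chosen only for this illustration.

```python
from __future__ import annotations


def helix_hbond_pairs(n_residues: int) -> list[tuple[int, int]]:
    """Return the (i, i + 4) backbone hydrogen-bond pairs of an ideal helix.

    Residues are numbered from 1; helices shorter than five residues
    cannot form any i -> i + 4 pair and yield an empty list.
    """
    return [(i, i + 4) for i in range(1, n_residues - 3)]


def strand_side(position: int) -> str:
    """Return an arbitrary side label for a residue in an extended strand.

    Residues i and i + 2 receive the same label, and neighbouring
    residues receive opposite labels.
    """
    return "up" if position % 2 == 1 else "down"
```

For an eight-residue helix this lists the pairs (1, 5), (2, 6), (3, 7) and (4, 8); for a strand, positions 1 and 3 share a side while position 2 points the other way.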
The attractive and repulsive forces (e.g. ionic or van der Waals interactions between<br />
residues) among the SSEs balance each other during the folding process and lead to a<br />
relatively stable and complex three-dimensional structure. Stabilisation of the conformation<br />
may involve covalent bonding, e.g. disulphide bridges between two cysteine residues,<br />
or the formation of metal-binding motifs. The spatial arrangement of sequentially proximate<br />
or distant residues allows the generation of biochemical functional sites. Identifying<br />
these and other novel biologically functional regions in proteins is one of the greatest<br />
research interests in the protein bioinformatics community, because they explain phenological<br />
data, e.g. cellular processes. Figure 1.2 lists some of the well-known functional<br />
sites in various proteins, classified according to a categorisation scheme of my own design.<br />
Finally, the formation of quaternary structure is the assembly of tertiary structures<br />
within a multi-chain protein. In this respect, each polypeptide chain is regarded as an<br />
individual functional unit (subunit or domain). Within the interfaces of the subunits,<br />
a multi-domain functional site can be formed, which is not present or functional<br />
in the individual domains. For example, the proteins cAMP-dependent protein kinase<br />
(PDBID:1rdq), hexokinase (PDBID:1bdq), and maltodextrin phosphorylase (PDBID:1l5w)<br />
contain ligand binding sites consisting of more than one protein structure domain (A.<br />
Kahraman, pers. comm.). The identification of these multi-domain functional sites is<br />
another great challenge in protein bioinformatics.<br />
First, the prediction system has to<br />
find the correct assembly <strong>of</strong> tertiary structures (a crystal structure <strong>of</strong> a protein does<br />
not necessarily reflect the biological state <strong>of</strong> assembly).<br />
Second, the structure models<br />
have to be adjusted (proteins are not rigid molecules and have flexible parts), and finally<br />
17
site<br />
1. evolutionary site<br />
1.1. conserved site<br />
2. <strong>functional</strong> site<br />
2.1. interaction site<br />
2.1.1. <strong>active</strong> site<br />
2.1.1.1. catalytic site / re<strong>active</strong> site<br />
2.1.1.1.1. catalytic residue<br />
2.1.1.1.2. donor site<br />
2.1.1.1.3. acceptor site<br />
2.1.1.2. binding site / contact site / substrate binding site / ligand binding site /<br />
binding site / recognition site<br />
2.1.1.2.1. specificity residue / specific site<br />
2.1.1.2.1.1. high affinity binding site<br />
2.1.1.2.1.2. low affinity binding site<br />
2.1.1.2.2. peptide binding site<br />
2.1.1.2.3. protein binding / receptor site<br />
2.1.1.2.3.1. nf kappab site<br />
2.1.1.2.3.2. antibody binding site<br />
2.1.1.2.3.3. antigen binding site<br />
2.1.1.2.3.4. actin binding site<br />
2.1.1.2.4. sugar binding<br />
2.1.1.2.5. lipid binding<br />
2.1.1.2.6. nucleic acid binding<br />
2.1.1.2.6.1. atp binding site<br />
2.1.1.2.7. metal binding site<br />
2.1.1.2.7.1. calcium binding site / ca(2+) binding site<br />
2.1.1.2.7.2. copper site<br />
2.1.2. passive site / target site<br />
2.1.2.1. cleavage site / lesion site / processing site / proteolytic cleavage site<br />
2.1.2.2. PTM site<br />
2.1.2.2.1. phosphorylation site<br />
2.1.2.2.1.1. tyrosine phosphorylation site<br />
2.1.2.2.2. glycosylation site<br />
2.1.2.2.3. regulatory site<br />
2.1.2.2.4. inhibitory site<br />
2.1.2.2.5. activation site<br />
2.2. structural site<br />
2.2.1 hydrophobic site<br />
2.2.1.1 hydrophobic core<br />
2.2.1.2. hydrophobic patch<br />
2.2.2. n terminal site<br />
2.2.3. c terminal site<br />
2.2.4. transmembrane site<br />
2.2.5. intracellular site / cellular site<br />
2.2.6. extracellular site<br />
2.2.7. anionic site<br />
2.2.8. cationic site<br />
2.2.9. nucleation site<br />
Figure 1.2: Examples <strong>of</strong> <strong>functional</strong> <strong>sites</strong> in proteins. A proposition <strong>of</strong> a classification scheme (excerpt)<br />
is represented based on my own perspective <strong>of</strong> biomolecular function <strong>of</strong> specific residue configurations in<br />
protein structures.<br />
18
co-factors, e.g. metal ions, have to be considered.<br />
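
The hierarchical numbering used in figure 1.2 already encodes a tree: the parent of an entry is obtained by dropping the last component of its number. The following minimal Python sketch illustrates this on a small excerpt of the scheme. The names `SCHEME`, `lineage` and `is_a` are hypothetical and only serve the illustration, and each entry keeps just the first synonym of its label.

```python
# Excerpt of the classification scheme in figure 1.2, keyed by hierarchical number.
# Only the first synonym of each entry is kept (an assumption made for brevity).
SCHEME = {
    "1": "evolutionary site",
    "1.1": "conserved site",
    "2": "functional site",
    "2.1": "interaction site",
    "2.1.1": "active site",
    "2.1.1.1": "catalytic site",
    "2.1.1.1.1": "catalytic residue",
    "2.1.1.2": "binding site",
    "2.1.1.2.7": "metal binding site",
    "2.1.1.2.7.1": "calcium binding site",
    "2.1.2": "passive site",
    "2.1.2.2": "PTM site",
    "2.1.2.2.1": "phosphorylation site",
    "2.2": "structural site",
}


def lineage(code):
    """Return the labels from the top-level category down to the given entry."""
    parts = code.split(".")
    return [SCHEME[".".join(parts[:i])] for i in range(1, len(parts) + 1)]


def is_a(code, ancestor_label):
    """True if the entry is the named category or one of its descendants."""
    return ancestor_label in lineage(code)


if __name__ == "__main__":
    print(" > ".join(lineage("2.1.1.2.7.1")))
    print(is_a("2.1.2.2.1", "active site"))
```

With this representation, a question such as "is a calcium binding site a kind of active site?" reduces to a lookup along the numbering, without storing explicit parent pointers.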
1.2 Motivation
The understanding of the biological function of proteins remains a central challenge in
biology. Our knowledge of the protein universe can be partitioned into at least three
knowledge spaces (cf. figure 1.3): protein sequence space, protein structure space, and
protein function space. Each space represents a specific view of proteins. For example, the
protein structure space contains information about the number of biological conformations
of protein structures (cf. figure 1.3, top panel), whereas the function space describes the
spectrum of protein function. Although information from each space partially overlaps,
few data are available to explain their relationship. For example, site-directed
mutational analysis is often reported in the context of gain or loss of a protein function,
while the biological correlation between sequence and function is not understood. This is
because the mechanism of protein function is not explained by information within sequence
space. In contrast, structural data are more expressive than sequence data, because a
protein structure provides the spatial context of residues. Proteins are physical entities and,
as such, they interact with other proteins or ligands. The shape of a protein,
or more precisely, the spatial configuration of a set of residues in a functional site, is
one explanation for protein function. While protein structure data mining is concerned
with the prediction of novel functional sites in proteins, a mined structural pattern
carries no evidence of biological function. In contrast, biomedical literature reports a range
of biological functions of protein residues without a structural context or an explanation of
the molecular mechanism (cf. figure 1.3, middle panel). Combining information from
protein structure space and protein function space is therefore an obvious approach to
gaining new knowledge on protein function.
Figure 1.3: The protein universe and its knowledge representation. Information on a protein can be collected
from at least three different knowledge domains: crystallography provides the spatial coordinates of
a protein, protein sequencing determines the linear composition of amino acids in a protein, and biochemical
experiments characterise the biological function (top panel). In principle, protein function prediction
can be based on information from any single knowledge space; however, combining them
can overcome some domain-specific limitations (middle panel).
1.3 Objective
This thesis aims to discover hypothetical functional sites in the Protein Data Bank (PDB)
and to annotate them with functional information from biomedical literature. The main
idea is to combine the information from two currently detached data resources: protein
structure information from PDB and functional annotations of residues from MEDLINE
(cf. figure 1.3, lower panel). More specifically, this research focuses on the prediction of
active sites by data mining recurrent spatial residue configurations (3D patterns) in proteins.
Contextual features of residues are extracted from biomedical literature to provide
functional annotations. The results from both datasets are then combined to verify predicted
functional sites with evidence of biological function. While existing approaches in
protein structure data mining and biomedical literature mining have been used to generate
data for each research domain, the combination of the datasets is a novel approach in
protein bioinformatics research.
1.4 Related works
To verify a predicted protein function with functional annotations extracted from biomedical
literature, two different levels have to be considered: (1) the protein level, and (2) the residue
level (i.e. groups of residues forming a functional site).
The recent publication [JGLRS08] is one example of case (1): the prediction of
protein function is based on the search for a conserved and connected subgraph (CCS) in
protein-protein interaction graphs generated from several biological databases. Within
the set of CCS, all available functional annotations of a protein in a database are transferred
to homologous proteins. The annotations consist of Gene Ontology (GO) terms,
and the transfer constitutes the prediction of protein function. A predicted
function was verified by identifying GO terms in abstract texts on the corresponding protein.
The approach of this thesis has some similarities to that report [JGLRS08], e.g. in
both approaches, results from data mining are verified by information extracted from
biomedical literature. However, there are crucial differences between the two that need
to be considered when assessing the results of this thesis. First, in contrast to the CCS
identification, the data mining part of this work does not aim to identify known patterns,
but to discover new structural features that may represent a novel functional site.
Secondly, in [JGLRS08] the prediction of protein function utilises the terminology of a well-developed
public resource, the Gene Ontology, whereas the same resource is not suitable
for the annotation of protein residues. This is because GO is designed to describe the function
of genes and gene products. From a conceptual point of view, GO terms describe
a high level of biological function, while descriptions of residue function are of a
lower level. For example, descriptions of protein-protein interactions are found in the context of
metabolomics, signal transduction or other cellular processes. In contrast, the function of
a protein residue can be explained in light of molecular interactions or chemical reaction
mechanisms. Finally, the distribution of information on biological function is expected to
differ within biomedical publications. Because protein function is conceptually a high
level of biological function, abstract texts of biomedical articles are likely to contain
information on this level. Conversely, the interaction of protein residues is a detailed description
of protein function, and key information is expected to be mentioned in the results
or discussion sections of full-text articles.
To my knowledge, the most closely related
work in terms of functional annotation of protein residues (case (2)) is the system called
Mutation extraction and STRucture Annotation Pipeline (mSTRAP) [KCRB07]. The key
feature of mSTRAP is the visualisation of mutation annotations, which are projected onto
the structure of a protein of interest. The advantage of mSTRAP is that the impact of
a mutation can be interpreted in the context of the protein structure. However, the prediction
of functional sites is done by visual analysis of the protein structure. The provided annotations are sets
of complete sentences extracted from MEDLINE, which means that interpreting
the information requires expert knowledge.
The system developed in this work differs from mSTRAP in that the extracted information
is not used exclusively to annotate point mutations; other functional
descriptions of wild-type residues are also collected. Another distinction from mSTRAP is that
the mined information is represented in a so-called predicate-argument structure (PAS)
format: only those text segments of a sentence are extracted that describe a biological
function or a biological context of a mentioned residue. The structured format allows, to
some extent, queries for specific information in the extracted annotation dataset.
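
To make the benefit of a structured format concrete, the following Python sketch shows how annotations stored as predicate-argument records could be queried by residue or by predicate. It is only an illustration under stated assumptions: the record layout (`ResidueAnnotation` with the fields `residue`, `protein`, `predicate`, `argument`), the `query` helper, and the three example records are invented here and are not the format or the output of the system described in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidueAnnotation:
    """Hypothetical predicate-argument record for one residue mention."""
    residue: str    # residue mention, e.g. "Ser195"
    protein: str    # protein the residue belongs to
    predicate: str  # verb expressing the role of the residue
    argument: str   # text segment completing the predicate


# Invented example records for illustration only.
ANNOTATIONS = [
    ResidueAnnotation("Ser195", "chymotrypsin", "attacks",
                      "the carbonyl carbon of the substrate"),
    ResidueAnnotation("His57", "chymotrypsin", "accepts",
                      "a proton from Ser195"),
    ResidueAnnotation("Asp102", "chymotrypsin", "stabilises",
                      "the protonated His57"),
]


def query(annotations, residue=None, predicate=None):
    """Return the records matching every criterion that is not None."""
    return [
        a for a in annotations
        if (residue is None or a.residue == residue)
        and (predicate is None or a.predicate == predicate)
    ]


if __name__ == "__main__":
    for hit in query(ANNOTATIONS, residue="His57"):
        print(hit.residue, hit.predicate, hit.argument)
```

A set of complete sentences, by contrast, could only be searched as free text, which is the limitation noted above for sentence-level annotations.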
In conclusion, only a few related works have been reported that describe an automated
system to verify a predicted protein function using functional annotations extracted
from the literature. This work is original in that it aims to find novel functional
sites in proteins by mining the PDB and by extracting functional annotations from
a wide range of biomedical literature data.
1.5 Challenges
Is it possible to identify a functional site, e.g. an active site, on the basis of mining
PDB and the literature, and then to combine the information from both? We can expect
that a significant population of similarly arranged residues can be identified
in a non-redundant protein set, if this evolutionarily conserved interaction provides a
functional or structural advantage. We can also expect that residues are mentioned in
conjunction with their corresponding protein, and that the biological role of a protein
residue is reported in the context of gain or loss of function of the overall protein in biomedical
literature.
One task presented in this thesis is the identification of textual features as functional
annotation. The problem differs from other information extraction tasks, e.g. the annotation
of proteins, because the target is to provide knowledge on the biological role of a
residue. For example, to extract protein-protein interactions from text, a list of protein
names is used, and the task is reduced to finding associations between the listed proteins.
In contrast, extracting a protein residue and its corresponding biological function
is difficult, because an adequate dictionary of terms is not available.
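
As an illustration of why residue mentions cannot simply be looked up in a dictionary, the following Python sketch recognises one common surface form, a three-letter amino acid code followed by a sequence position, with a regular expression. This is a generic pattern-matching sketch and not necessarily the method developed in this thesis. The names `RESIDUE_MENTION` and `find_residue_mentions` are hypothetical, and the pattern deliberately ignores other forms such as full residue names or one-letter codes.

```python
import re

# Three-letter codes of the 20 standard amino acids.
THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)

# A code, an optional hyphen or space, then the sequence position.
RESIDUE_MENTION = re.compile(rf"\b({THREE_LETTER})[- ]?(\d+)\b")


def find_residue_mentions(text):
    """Return (amino acid code, position) pairs in order of appearance."""
    return [(m.group(1), int(m.group(2))) for m in RESIDUE_MENTION.finditer(text)]


if __name__ == "__main__":
    sentence = "The catalytic triad is formed by His-57, Asp 102 and Ser195."
    print(find_residue_mentions(sentence))
```

The pattern finds the mention itself, but it says nothing about the protein the residue belongs to or about its biological function, which is the harder part of the task described above.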
1.6 Guide to remaining chapters
Chapter 2 presents background knowledge that is important for this work. Four different
data resources are reviewed and their limitations discussed in the context of this
thesis. An explanation of methods in the field of protein structure data
mining and biomedical literature mining follows. Some of the introduced methodologies are
reused in this work, while ideas and approaches from others were adopted to develop
task-specific extraction systems.
Chapter 3 describes the protein structure data mining system developed for the identification
of 3D patterns in PDB. Algorithms for the identification of conserved
spatial residue configurations are explained and the effects of algorithm-related and
data-related parameters are discussed.
Chapter 4 demonstrates the biological implications of the 3D patterns mined in chapter
3. Two examples of rediscovered functional sites in proteins are shown to justify
the presented data mining approach. The first biological validation is the identification
of metal binding sites, while the second validation is the rediscovery of the catalytic
triad from the mined data.
Chapter 5 is the first of three text mining chapters in this thesis. It explains the
protein residue identification system developed here, which consists of two main modules:
biological entity recognition of residue, protein, and organism, and association detection
for the entity triplet.
Chapter 6 describes the approach to detecting contextual features of a mentioned residue in
text. An automatic method is introduced to assign semantic labels to the extracted
textual features.
Chapter 7 is the third of the three text mining chapters. Both text mining
modules from the previous chapters (protein residue identification and contextual
feature extraction) are combined to form the functional annotation extraction system.
The overall performance of this information extraction system is studied. The
validity of the extracted information as functional annotation is demonstrated by
manual analysis of two example proteins (p53 and Jak2), and by cross-validation
of identified catalytic or binding residues with two reference databases: CSA and
MSDsite.
Chapter 8 presents results on combining protein structure data with literature data.
The validity is studied by examining the correlation of predicted active site residues
with enzyme-related functional annotations.
Chapter 9 summarises the thesis and presents limitations and open questions for follow-up
research.
Chapter 2
Background
In the previous chapter, I presented the motivation and objective of this thesis. The
purpose of this chapter is to familiarise the reader with relevant concepts in protein science,
data mining, and literature mining. The limitations of each reviewed data resource or
methodology are discussed in the context of this research work.
2.1 Protein related data resources
Proteins are both building blocks of cellular structures and the major machinery in cells.
In order to perform their functions, proteins need to fold into their three-dimensional
structures and thereby form functional sites. The prediction of a structural pattern associated
with a biological function is an important aspect of protein bioinformatics. To
interpret the multiple functions of proteins, annotations are linked with results from
bioinformatics analysis tools. In addition, data are collected from generic and specific
databases, from biological knowledge accumulated in the literature, and from genome-wide
experiments such as transcriptomics and proteomics. One major goal is to
describe protein function within its biological context by using a standardised hierarchical
classification scheme and controlled vocabulary.
The biological community has developed databases and functional annotation schemes
that are used not only to archive protein data, but also to describe protein function at
the molecular, cellular and phenotypic level. Figure 2.1 shows some of the most popular
and relevant databases in the field of protein bioinformatics. These protein-related data
resources are hyperlinked in order to foster bioinformatics research. Statistics for
three example databanks and their hyperlinked references are given in figure 2.2.
Figure 2.1: Data banks in the protein universe. This figure shows my interpretation of how our knowledge
about proteins can be categorised. A selection of the most relevant data resources and web services
is reproduced in this figure. UniProtKB = Universal Protein Knowledge base [WAB+06]; PIR = Protein
Information Resource [BGH+00]; PDB SELECT = representative list of PDB chain identifiers [HSSS92];
PISCES = Protein Sequence Culling Server [WD03]; UniqueProt = web-service to create representative
protein sequence sets [MR03]; MEROPS = the Peptidase Database [RMK+07]; CAZy = Carbohydrate-Active
enZYmes [CCR+08]; TC-DB = Membrane Transport Protein Classification Database [STB06];
PMD = Protein Mutant Database [KON99]; Phospho.ELM = a database of S/T/Y phosphorylation sites
[DCG+04]; PROSITE = Database of protein domains, families and functional sites [HBB+08]; PRINTS
= Protein Motif Fingerprint Database [Att02]; BMC = BioMed Central [BMC08]; PMC = PubMed
Central [PMC08]; PDB = Protein Data Bank [BWF+00]; SCOP = Structural Classification of Proteins
[HMBC97]; CATH = Class, Architecture, Topology, Homologous superfamily - Protein structure classification
[OMJ+97]; Relibase = database of protein-ligand complexes [HBGK03]; CSA = Catalytic Site
Atlas [PBT04]; MSDmotif = an integrated resource of protein structure motifs.
Figure 2.2: Three hyperlinked protein data banks. Illustrated is the size of three databanks, PDB,
UniProtKB, and MEDLINE, along with their cross-references. For example, the PDB contains in total
42,943 PDB identifiers (version November 2008) with cross-references to 42,085 out of 333,445 UniProt
identifiers, which in turn point to 10,466 biomedical journal articles (PMIDs). Note that PDB
also holds a small number of primary citations for each record; however, these are mainly pointers to
crystallographic publications and provide few hints about the biological function of the protein or the
annotation of functional sites.
2.1.1 Protein Data Bank
The Protein Data Bank (PDB) is an archive of 3D structures of large biological molecules,
such as proteins and nucleic acids. Currently, PDB lists 43,099 proteins determined by
crystallography (version November 2008). Despite the large amount of structure data
available for a range of proteins, the information in the PDB has three significant limitations.
First, the structure data cover only a small part of the known sequence data: compared
with UniProtKB (cf. section 2.1.2), sequence space is covered far more extensively
than structure space. Therefore, the information derived
from PDB is only applicable to a limited set of proteins.
The second limitation is the coverage of annotation available for proteins. In the
PDB, there are some facilities to annotate proteins; for example, the SITE record is used
to annotate protein residues that are part of active sites. However, annotations are not
mandatory, and many sites are not updated even when new evidence of the biological
functionality of their residues has been found. An automatically derived database called PDBSITE
[IPGK05] stores the SITE record information and makes these data
searchable. Another, more predictive, database of functional sites in protein structures is
MSDmotif [GH08], which provides information about ligands, sequence and structure
motifs, their relative position, and their neighbour environment. A further database of predicted
functional sites is MSDtemplate [Old02], which contains small fragments generated
by data mining on a structurally unique protein set from PDB. Examples of biologically
relevant fragments were identified in this data collection, such as the catalytic triad and
various metal binding sites. The Catalytic Site Atlas (CSA) [PBT04] is another database
documenting active sites in 3D structures of enzymes. Its data are either manually
curated or predicted, based on searches for homologous proteins.
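
The SITE record mentioned above is a fixed-column line in the PDB file format, so the annotated residues can be read by slicing each line at the documented column positions. The following Python sketch parses one such line. The function names `parse_site_record` and `format_site_record` are hypothetical, the sample record is constructed for illustration rather than copied from a PDB entry, and continuation lines of the same site are not merged here.

```python
def parse_site_record(line):
    """Parse one SITE line into (site identifier, [(residue, chain, number), ...]).

    Column positions follow the PDB file format: site identifier in columns
    12-14 and up to four residues starting at columns 19, 30, 41 and 52.
    """
    if not line.startswith("SITE"):
        return None
    line = line.ljust(80)               # pad so that slicing never runs short
    site_id = line[11:14].strip()
    residues = []
    for start in (18, 29, 40, 51):      # 0-based offsets of the residue slots
        name = line[start:start + 3].strip()
        if name:
            chain = line[start + 4]
            number = int(line[start + 5:start + 9])
            residues.append((name, chain, number))
    return site_id, residues


def format_site_record(seq_num, site_id, residues):
    """Build a single SITE line (at most four residues) with explicit field widths."""
    body = "".join(f"{name:>3} {chain}{number:>4}  " for name, chain, number in residues)
    return f"{'SITE':<7}{seq_num:>3} {site_id:>3} {len(residues):>2} {body}".rstrip()


if __name__ == "__main__":
    sample = format_site_record(1, "AC1", [("HIS", "A", 94), ("HIS", "A", 96), ("HIS", "A", 119)])
    print(sample)
    print(parse_site_record(sample))
```

The helper that builds the sample line is included only so that the column positions are explicit in the code instead of depending on hand-typed spacing.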
The third serious limitation of PDB concerns its use for statistical analysis of structure data.
The PDB represents a redundant and biased snapshot of the protein universe. Redundancy
arises because many highly similar structures or identical folds are deposited
in the database, leading to an over-representation of some proteins. In the past, structure
determination has been guided by hypothesis-driven experiments, by short-listed target
proteins in the medical or commercial field, and by small proteins that are methodologically
tractable for crystallisation. Consequently, the fold space has not yet been fully explored.
Although techniques in protein crystallography are improving, there are still
underrepresented proteins, e.g. membrane proteins or large proteins, which limit the
representativeness of the structure data.
While there is little we can do about exploring the complete ensemble of folds from
a bioinformatics point of view, the over-representation can be filtered. For example,
protein sequence based clustering [AGM+90] [AMS+97] is the principal method used to produce
the following datasets: PDB SELECT [HSSS92], PISCES [WD03], UniqueProt [MR03].
However, this approach is limited by the uncertainty of the sequence-structure relation in the
so-called twilight zone, i.e. below 30 per cent sequence identity proteins may or may not
have similar folds [Ros99]. Another critical issue with sequence based clustering is that it
compares whole protein chain sequences rather than aligning segments defined by
protein domain boundaries.
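
The idea of filtering over-represented proteins by sequence similarity can be sketched as a greedy selection: a chain is kept as a representative only if it is not too similar to any chain already kept. The Python sketch below illustrates this principle only. It is not the procedure used by PDB SELECT, PISCES or UniqueProt, the function names are hypothetical, the toy sequences are invented, and `difflib.SequenceMatcher.ratio` is used as a crude stand-in for alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def select_representatives(sequences, max_identity=0.3):
    """Greedily keep chains that are at most max_identity similar to every kept chain.

    Longer chains are considered first, so they become the representatives.
    """
    kept = []
    for name, seq in sorted(sequences.items(), key=lambda item: -len(item[1])):
        if all(identity(seq, sequences[other]) <= max_identity for other in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Invented toy sequences: "b" differs from "a" in one position, "c" is unrelated.
    chains = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSRE",
        "c": "GGPGGPGGPGGPGGPGGP",
    }
    print(select_representatives(chains))
```

The default threshold of 0.3 mirrors the 30 per cent boundary of the twilight zone. Whether two chains below this threshold share a fold cannot be decided from the sequence comparison alone, which is the limitation described above.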
Structure based approaches cluster the data on the basis of domain structures. Several
databases of domain based structure clustering were created, with the most prominent
ranging from entirely manual work (SCOP [HMBC97]), through a semi-automatic approach (CATH
[OMJ+97]), to entirely unsupervised methods (FSSP-Dali [HS94]). Differences between these
classifications were studied by [HJ99] and [DBAD03].
2.1.2 Universal Protein Knowledge base
The major repository of protein sequence data is the Universal Protein Knowledge base
(UniProtKB). Along with the sequence data, it lists protein names
and synonyms, taxonomic data, citation references, and other information manually curated
from literature surveys. One important aspect of UniProtKB when evaluating
structure-function relationships is the annotation of protein residues. In the feature table,
the biological function of a residue site is described along with several other key categories
(cf. figure 2.3). Currently, UniProtKB lists 333,445 entries with 2,088,573 site-specific
annotations (version from January 2008).
Despite the high quality of the data contained in UniProtKB, extracting functional
annotations from literature remains laborious work for human expert curators. The
curator surveys the biomedical literature, represents the experimentally determined functional
information, and formulates the precise functional role by utilising standardised
semantic resources (cf. section 2.1.3). Although manual curation is highly reliable,
this approach is evidently inefficient considering the number of full-text publications
curators have to distil. According to Frishman, if we assume
“[...] that one needs on average roughly 30 min to assess published fact
and bioinformatics evidence for one protein, one thousand annotators would
have to work 1 year long, 8 h a day, to annotate all 5 million sequences that
are currently known. However, since the size of the protein database has been
consistently doubling every 18 months, the moving target of annotating all
proteins will never be achieved.” [Fri07]
Considering that the estimated total number of proteins is in excess of 10^10 [CK06],
an automatic or semi-automatic solution is needed to facilitate the laborious human expert
work. Currently, methods for the automatic expansion of citation sets [YLPV07]
[HLC04] [LHC07] and the automatic annotation of protein function with GO terms
[CSL+06] [GJYLRS08] [RSKA+07] are being developed in the field of text mining.
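
The estimate quoted from [Fri07] can be reproduced with a few lines of arithmetic. The Python sketch below uses only the figures given in the quotation (30 minutes per protein, 5 million sequences, one thousand annotators, 8 hours a day, doubling every 18 months). The variable and function names are hypothetical, and the exponential growth model is a simplifying assumption that takes the stated doubling time literally.

```python
PROTEINS = 5_000_000          # sequences currently known, as quoted
MINUTES_PER_PROTEIN = 30      # curation time per protein, as quoted
ANNOTATORS = 1000
HOURS_PER_DAY = 8
DOUBLING_MONTHS = 18          # database doubling time, as quoted

# Total curation effort in person-hours.
total_hours = PROTEINS * MINUTES_PER_PROTEIN / 60

# Eight-hour working days needed per annotator.
days_per_annotator = total_hours / ANNOTATORS / HOURS_PER_DAY


def database_size(months):
    """Number of sequences after the given time, assuming steady doubling."""
    return PROTEINS * 2 ** (months / DOUBLING_MONTHS)


if __name__ == "__main__":
    print(f"{total_hours:,.0f} person-hours in total")
    print(f"{days_per_annotator} working days per annotator")
    print(f"{database_size(18):,.0f} sequences after 18 months")
```

The result of 312.5 eight-hour days per annotator corresponds to roughly one year of work. Over the following 18 months the database would double to 10 million sequences, so the backlog grows at least as fast as it is cleared.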
Key        Description
INIT_MET   Initiator methionine.
SIGNAL     Extent of a signal sequence (prepeptide).
PROPEP     Extent of a propeptide.
TRANSIT    Extent of a transit peptide (mitochondrion, chloroplast, thylakoid, cyanelle, peroxisome etc.).
CHAIN      Extent of a polypeptide chain in the mature protein.
PEPTIDE    Extent of a released active peptide.
TOPO_DOM   Topological domain.
TRANSMEM   Extent of a transmembrane region.
DOMAIN     Extent of a domain, which is defined as a specific combination of secondary structures organised
           into a characteristic three-dimensional structure or fold.
REPEAT     Extent of an internal sequence repetition.
CA_BIND    Extent of a calcium-binding region.
ZN_FING    Extent of a zinc finger region.
DNA_BIND   Extent of a DNA-binding region.
NP_BIND    Extent of a nucleotide phosphate-binding region.
REGION     Extent of a region of interest in the sequence.
COILED     Extent of a coiled-coil region.
MOTIF      Short (up to 20 amino acids) sequence motif of biological interest.
COMPBIAS   Extent of a compositionally biased region.
ACT_SITE   Amino acid(s) involved in the activity of an enzyme.
METAL      Binding site for a metal ion.
BINDING    Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
SITE       Any interesting single amino-acid site on the sequence, that is not defined by another feature
           key. It can also apply to an amino acid bond which is represented by the positions of the
           two flanking amino acids.
NON_STD    Non-standard amino acid.
MOD_RES    Posttranslational modification of a residue.
LIPID      Covalent binding of a lipid moiety.
CARBOHYD   Glycosylation site.
DISULFID   Disulfide bond.
CROSSLNK   Posttranslationally formed amino acid bonds.
VAR_SEQ    Description of sequence variants produced by alternative splicing, alternative promoter usage,
           alternative initiation and ribosomal frameshifting.
VARIANT    Authors report that sequence variants exist.
MUTAGEN    Site which has been experimentally altered by mutagenesis.
CONFLICT   Different sources report differing sequences.
Figure 2.3: Categories for protein sequence annotation in UniProtKB. Key categories used to describe
regions or sites of interest in a protein sequence are listed. The key and the corresponding information
(value) are stored in the feature table (FT line) in UniProtKB. The definition of each category is
given alongside its key.
Clearly, the annotation for a whole protein cannot be transferred to residue site annotation,<br />
because different groups of residues in the protein structure have different functions.<br />
In this respect, the biological community is missing an information extraction system for<br />
the annotation of proteins at residue level.<br />
2.1.3 Gene Ontology<br />
The Gene Ontology (GO) [AL02] [GOC06] is one of the most widely used functional<br />
classification schemes and includes all of the most important criteria for annotations of biological<br />
data [PKS06]. Currently, the ontology lists a total of 26,302 terms with 15,643<br />
biological process terms, 2,233 cellular component terms, and 8,426 molecular function<br />
terms (version November 2008). The UniProtKB/InterPro group at the European Bioinformatics<br />
Institute (EBI) belongs to the Gene Ontology Consortium, and uses its standard<br />
vocabulary for the annotation of protein function. The vocabulary is meant to describe the<br />
biological phenomenology of genes and gene products (proteins). This is the reason why<br />
terms in GO are not suitable to describe the function and properties of a protein<br />
residue. Figure 2.4 lists some examples where the identification of GO terms [GJYLRS08]<br />
did not find the most relevant keywords for the annotation of residues. At the moment,<br />
an ontology dedicated solely to the functional annotation of protein residues has not been<br />
developed. However, terms can in general be collected from other substantial resources,<br />
such as the Open Biomedical Ontologies [SAR+07], which contain, for example,<br />
REX (an ontology of physico-chemical processes), and PSI-MOD (an ontology describing<br />
protein chemical modifications).<br />
Example 1<br />
Sentence: "The catalytic mechanism of the non-phosphorylating glyceraldehyde-3-phosphate dehydrogenase and the other aldehyde dehydrogenases resembles a thioester mechanism involving the universally conserved cysteine 298 (pea GAPN)." (PMID:9461340)<br />
Manual annotation: thioester mechanism, conserved cysteine<br />
GO annotation: glyceraldehyde-3-phosphate dehydrogenase (NADP+) (phosphorylating activity), glyceraldehyde-3-phosphate biosynthesis, glyceraldehyde-3-phosphate catabolism, phosphoglycerate dehydrogenase activity<br />
Example 2<br />
Sentence: "However, mutations of a key residue, His48, show significant deviation from the relationship, implying a role for the side chain in protection of the complex from hydroxide attack." (PMID:2690955)<br />
Manual annotation: protection of the complex from hydroxide attack<br />
GO annotation: AT DNA binding, tRNA, tyrosine tRNA ligase activity<br />
Example 3<br />
Sentence: "Second, this reactive cysteinyl residue, which is required for L-cysteine desulfurization activity, was identified as Cys325 by the specific alkylation of that residue and by site-directed mutagenesis experiments." (PMID:81615929)<br />
Manual annotation: L-cysteine desulfurization activity<br />
GO annotation: pyridoxal biosynthesis, phosphate binding, mutagenesis, nitrogenase activity, L-alanine biosynthesis, pyridoxal phosphate binding<br />
Figure 2.4: GO terms are not suitable for protein residue annotation. The presented examples demonstrate<br />
that predicted GO terms are not always suitable for protein residue annotation. The GO terms were<br />
predicted with an information-theory-based parser [GJYLRS08].<br />
2.1.4 Biomedical literature<br />
Biomedical research tackles biological questions from a number of perspectives, and the<br />
published experimental data are always heterogeneous. Taken together, the descriptions of biological<br />
phenomena enable scientists to understand mechanisms in biology within various<br />
contexts. This body of text has also been compared with an "unstructured knowledge<br />
database", where information is present, but difficult to retrieve due to the complexity of<br />
natural language. According to Sidhu,<br />
"[...] it is generally acknowledged that only 20 per cent of biological knowledge<br />
and data is available in a structured format or a database. The remaining<br />
80 per cent of biological information is hidden in the unstructured, free text<br />
of scientific publications." [SDC06]<br />
In the context of information extraction, the data to be extracted from an article are<br />
words (keywords) regarding biological concepts that could summarise the key message<br />
of the article.<br />
At first glance, abstracts have a high density of keywords but a<br />
low coverage of information, while full texts cover a larger but more dispersed quantity of data<br />
[FKY+01] [YHF+02] [SPIBA03] [SWS+04] [NBD+06].<br />
Another key distinction between abstracts and full texts is the availability of<br />
data resources. Biomedical abstracts can be publicly downloaded from MEDLINE<br />
without restriction, while full texts from various journals are only available to<br />
subscribers.<br />
Although some full-text articles are accessible through various initiatives<br />
[BMC08] [Plo08] [PMC08], the extraction of information from a whole document is expected<br />
to be much more complex than from an abstract. For example, a biological<br />
feature of a residue may be expressed over several sentences, requiring co-reference<br />
resolution of the residue and the feature.<br />
2.2 Protein structure data mining<br />
Data mining is an analytic method to identify valid and novel patterns in data. A general<br />
data mining solution does not exist. Instead, human data mining expertise and human<br />
domain expertise are required to solve each specific data mining problem. A data mining<br />
process consists of the following main steps: data selection, feature extraction, and<br />
correlation analysis.<br />
With respect to protein structure data mining, data selection means the identification<br />
of a non-redundant set of protein structures from PDB (cf. section 2.1.1). Although a<br />
protein structure contains only geometrical information, it is important to distinguish<br />
the types of structural features to be analysed. The following structural features can serve<br />
as targets: the configuration of amino acids as Cα atoms, the configuration of backbone<br />
atoms, the spatial arrangement of chemical groups [JIDG03] [YEC+07] [Rus98] [SSR03]<br />
[Old02], and the physicochemical environments [OCR01] [YEC+07]. In order to discover<br />
new information from the data, a data mining algorithm must not contain any<br />
biochemical knowledge. The target should be a mathematical model and not a biological<br />
template.<br />
2.2.1 Hypothesis-driven data analysis<br />
"Within the field of bioinformatics research, the term data mining is used very loosely to<br />
describe any type of data analysis" (T. Oldfield, pers. comm.). Hypothesis-driven data<br />
analysis consists of defining a biological target (hypothesis) and searching for the target.<br />
Consequently, the result of a hypothesis-driven data analysis is not the discovery of new<br />
information.<br />
A number of methods have been published that predict a known protein function on the<br />
basis of protein structure information. Initially, research focused on global fold<br />
recognition [HS96] [WR97] [MB99] [KH04] [HPS+03] [AZP+05] to identify evolutionarily<br />
distant, but structurally conserved homologues. Once a match is found, functional annotations<br />
are transferred from the target to the query. Another, more specific approach<br />
focuses on the search for matching local substructures in the proteins. The rationale is<br />
that a biological function can be mapped to a particular residue configuration in the<br />
protein, which is independent in function from the global fold of the structure. One obvious<br />
approach was to design structure templates that contain all the essential residues<br />
for a biological function. Several specific types of sites or motifs have been studied in<br />
detail to capture metal binding sites [Glu91], the catalytic triad of the serine proteases<br />
[FWLN94] [WBT97], and binding sites for anions such as sulphate and phosphate [Cha93]<br />
[CB94]. Computer-assisted methods were subsequently developed to help experts<br />
design templates by analysing motifs over large sets of proteins corresponding to active<br />
sites [APG+94] [Rus98] [SSR03] [Kle99] [FS98] [FGS98] [WBT97] [BT03] [PB06], surface<br />
patches or clefts [Las95] [KJ94] [LEW98] [SPNW04] [BFL04], or structural binding site<br />
locations [GPP+03] [KN03].<br />
2.2.2 Discovery-driven data mining<br />
The key feature of discovery-driven data mining is the search for common characteristics<br />
(patterns) in the data, without providing any domain knowledge. More specifically, the<br />
target is mathematically defined and the system aims to identify over-representations,<br />
data variations, or singularities in the dataset. Hence discovery-driven data mining can<br />
deliver novel information, while the biological significance of the result is not trivial.<br />
One important aspect in identifying residue interactions in protein structures is the<br />
consideration of contextual information, such as interaction distance, chemical environment,<br />
and evolutionary conservation, in the data mining algorithm. The systems called<br />
ET/MA [CFK+05] and ConSurf use evolutionary information in combination with structural<br />
and chemical data, in order to highlight regions of local structure with functional<br />
importance. In contrast, the systems PINTS [Rus98] [SSR03] and SIDEMINE [Old02] find<br />
patterns within the distribution of a non-redundant structure set, by using solely a mathematical<br />
model of interactions. One critical issue in the development of these data mining<br />
methods was the improvement of the signal/noise ratio. In order to boost the signal frequency,<br />
two structural features are merged if one is biologically equivalent to the other.<br />
While the analysis showed that the mined output contained biologically valid data, the<br />
result actually incurs some bias, because biological knowledge was introduced.<br />
2.3 Biomedical literature mining<br />
Biomedical text mining extracts information from text for integration into biological<br />
databases. Due to the complexity of natural language, text processing involves structuring<br />
the text input by means of parsing and the annotation of some linguistic features,<br />
e.g. part-of-speech tags. The majority of biological text analysis is concerned with the<br />
extraction of explicitly stated facts from text, a task referred to as biological information extraction<br />
[Hob02]. Biomedical text mining processes typically consist of two main analysis<br />
steps: biological entity recognition, and biological relation extraction.<br />
The vast amount of published biomedical articles contains phenomenological data on<br />
proteins, such as their molecular function. The information is encoded in unstructured<br />
text, and mining it requires analyses at different levels of complexity. There are several levels<br />
of text mining challenges in extracting functional annotation: the identification of mutations<br />
[LHC07] [WK07] [BW05] [RSMA+04] [HLC04] or genetic sequences [MG03], the identification<br />
of gene or protein names [RSAG+08] [PJYLRS08] [TMA08] [Fuk98] and chemical entities<br />
[CMR06], the extraction of annotations of molecular function [GJYLRS08] [RSKA+07]<br />
[DS05] [KNT05] [GDAW03] [HNR+05], and the identification of semantic relations between<br />
the biological entities [BLK+08] [LCM03] [SB06].<br />
2.3.1 Biological entity recognition<br />
The process of entity recognition (ER) can be split into three parts: locating the mentioned<br />
entity in text, classifying the entity into a predefined category, and normalising<br />
the entity by reference to an entry in a database.<br />
Biological entities are often ambiguous in terms of their boundaries and categories.<br />
Probably the most challenging task is the correct identification of protein or gene names.<br />
For example, "hunchback" is a protein in Drosophila, while it is also a general English<br />
term. Furthermore, protein names consist mostly of multiple words, e.g. "Rho-like protein"<br />
or "HIV-1 envelope glycoprotein gp120". An ER system needs to identify all the<br />
constituents of a protein name in order to relate the detected entity to its reference entry<br />
in a database. The BioCreAtIvE challenge addressed this problem with the 1B subtask;<br />
the target is the identification of protein/gene names in text, and the annotation of their<br />
correct gene identifier. Various solutions were published, ranging from rule-based methods<br />
[HFM+05] [TW02] [Fuk98] to machine learning approaches [CMP05]. The developed<br />
methods are, in general, reusable for any other biological entity recognition or terminology<br />
identification problem.<br />
Work has also been published that focuses on the extraction of protein point mutations<br />
[RSMA+04] [HLC04] [BW05] [LHC07] [YLPV07], which is one category of protein<br />
residue terminology. Other categories are residue sequences and residue interaction pairs.<br />
The most widely adopted method to identify these terms is the design of regular<br />
expression patterns.<br />
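The regular-expression approach can be illustrated with a minimal sketch. The two patterns below are hypothetical and are not taken from any of the cited systems; they only assume that residues are written as a residue name followed by a position (e.g. "His48", "cysteine 298") and point mutations as wild-type letter, position, mutant letter (e.g. "H48N"). Such patterns will produce false positives, for instance on the pronoun "his" followed by a number.<br />

```python
import re

# Illustrative vocabulary of residue names (three-letter codes and full names).
THREE_LETTER = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
                "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]
FULL_NAME = ["alanine", "arginine", "asparagine", "aspartate", "cysteine", "glutamine",
             "glutamate", "glycine", "histidine", "isoleucine", "leucine", "lysine",
             "methionine", "phenylalanine", "proline", "serine", "threonine",
             "tryptophan", "tyrosine", "valine"]

# Residue mention: a residue name, an optional space or hyphen, then a position.
RESIDUE = re.compile(
    r"\b(?P<name>" + "|".join(FULL_NAME + THREE_LETTER) + r")[ -]?(?P<pos>\d+)\b",
    re.IGNORECASE,
)

# Point mutation in one-letter notation: wild type, position, mutant.
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"
MUTATION = re.compile(rf"\b([{ONE_LETTER}])(\d+)([{ONE_LETTER}])\b")


def find_residue_mentions(text):
    """Return (name, position) for every residue mention found in the text."""
    return [(m.group("name"), int(m.group("pos"))) for m in RESIDUE.finditer(text)]


def find_point_mutations(text):
    """Return (wild_type, position, mutant) for every one-letter point mutation."""
    return [(m.group(1), int(m.group(2)), m.group(3)) for m in MUTATION.finditer(text)]
```

Applied to the sentences of Figure 2.4, `find_residue_mentions` would pick up mentions such as "His48", "Cys325" and "cysteine 298".<br />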
2.3.2 Biological relation extraction<br />
Relation extraction (RD) aims to find associations between entities, or between an entity<br />
and a term, within a text phrase. One objective in biomedical information<br />
extraction is the mining of biological facts from text. An example of a biological fact is<br />
the semantic relation between two biological entities, such as a protein-protein interaction<br />
[TOT04].<br />
Until now, three strategies have been investigated for biological relation extraction:<br />
co-occurrence-based analysis [LC05] [SB05], pattern-based approaches [HZH+04] [LCM03],<br />
and machine-learning-based methods [BM05] [BM06]. The common limitation of all of<br />
these extraction systems is that only the relation targets, e.g. the proteins within a protein-protein<br />
interaction, are extracted. Contextual information that would describe or explain<br />
the association of the entities is not considered in the extraction. Within the<br />
information extraction community, a consensus has been reached that a deeper analysis of<br />
sentence structures is required in order to adequately acquire biomolecular relations from<br />
text [WSC04].<br />
With respect to biological relation extraction, two classes of syntactic parsers have been studied.<br />
The first is the shallow parsing technique, which aims to detect the main constituents<br />
of a sentence without determining the complete syntactic structure. Results were published<br />
in which protein-protein interactions [KNT05] and general biological entity relations<br />
[LCM03] were extracted based on shallow parsing. The second class of syntactic parser<br />
is the full parser, which attempts a deep analysis of the syntactic structure of a sentence.<br />
Several systems have been reported [NED03] [FKY+01] that utilise full parsing<br />
for relation extraction from biomedical literature. One interesting full parser is ENJU<br />
[YMTT05] [MT05], a so-called head-driven phrase structure grammar (HPSG) parser,<br />
which identifies the predicate-argument structure (PAS) of a sentence.<br />
The use of PAS as a template for biomolecular relation extraction was first reported<br />
in [TOT04] [YMTT05]. Recently, two proposition banks were reported that are designed<br />
to capture relations in molecular biology: PASBio [WSC04] and BioProp [TCS+07].<br />
Within this work, there are two types of semantic relations to be extracted. The<br />
first is the residue-protein association. The system called MEMA [RSMA+04] uses a<br />
word distance metric to associate a list of residue-protein pairs with the smallest word<br />
distance. Another approach is to look up valid associations between a residue and a<br />
protein in the context of a predetermined association of a protein and an organism. Three<br />
systems have been reported that adopt this approach: MuteXt [HLC04], MutationMiner<br />
[BW05], and MutationGraB [LHC07].<br />
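The idea of associating residues and proteins by word distance can be sketched as follows. This is not the MEMA implementation; it is a minimal illustration which assumes that the sentence is already tokenised and that the residue and protein mentions have been recognised beforehand. The function name and the example sentence are invented for illustration.<br />

```python
def associate_by_word_distance(tokens, residues, proteins):
    """Pair each residue mention with the protein mention closest in token count.

    tokens   -- list of word tokens of one sentence
    residues -- set of tokens already recognised as residue mentions (assumed input)
    proteins -- set of tokens already recognised as protein mentions (assumed input)
    Returns a list of (residue, protein, word_distance) tuples.
    """
    residue_positions = [i for i, tok in enumerate(tokens) if tok in residues]
    protein_positions = [i for i, tok in enumerate(tokens) if tok in proteins]
    pairs = []
    if not protein_positions:
        return pairs
    for r in residue_positions:
        # Choose the protein mention with the smallest word distance;
        # on a tie, the first mention in the sentence wins.
        p = min(protein_positions, key=lambda q: abs(q - r))
        pairs.append((tokens[r], tokens[p], abs(p - r)))
    return pairs


# Constructed example sentence (not taken from the literature).
sentence = "Mutation of His48 in TyrRS abolished binding , whereas Cys325 of NifS was alkylated"
pairs = associate_by_word_distance(sentence.split(), {"His48", "Cys325"}, {"TyrRS", "NifS"})
```

In this constructed sentence each residue is paired with the protein name two tokens away, which shows both the appeal of the heuristic and its dependence on the sentence layout.<br />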
The other semantic relation to be extracted in this work is the association between<br />
a residue entity and the description of its function. The systems MuteXt [HLC04], MEMA<br />
[RSMA+04], MutationMiner [BW05], and MutationGraB [LHC07] are all dedicated to<br />
the extraction of point mutations, but provide no extraction of functional annotation. In<br />
a recent publication [WK07], an ontological model was proposed that should hold information<br />
extracted from MutationMiner as well as point mutation annotations. However,<br />
no results of feature extraction were provided, nor was an extraction strategy proposed.<br />
2.4 Conclusion<br />
In this chapter, I have reviewed some of the most relevant data resources and research<br />
work in the field of protein structure data mining and text mining. Some of the data<br />
resources are used in this thesis. In the following chapters, I will present the extraction systems I<br />
have developed during my PhD.<br />
Chapter 3<br />
Mining residue interactions as triads<br />
from PDB<br />
In this chapter, I present a novel approach to mining 3D patterns from protein structures.<br />
More specifically, a pattern is defined as the irreducible interaction of a chemical and<br />
spatial configuration of residues. The goal is to identify new information from a non-redundant<br />
dataset using solely mathematical targets. The mined 3D<br />
patterns represent predictions of functional sites in proteins.<br />
3.1 Algorithms<br />
The novelty of the presented 3D pattern mining approach lies in the classification of<br />
residue triplets into one of four interaction classes. The idea of analysing side chain interactions<br />
within a residue triplet is based on the work of [Old02], while the classification of<br />
residue interactions relies on the methodology developed by [JB04]. The developed data<br />
mining method consists of three processing steps: structural feature extraction, detection<br />
of significant configurations as interactions, and grouping and selection of frequent configurations.<br />
Figure 3.1 illustrates the procedures of the entire protein structure data mining<br />
system developed in this thesis.<br />
Figure 3.1: Overview of processes and evaluation methods of the developed 3D pattern identification<br />
system.<br />
3.1.1 Structural feature extraction<br />
Theory<br />
Residue triplet as spatial pattern unit.<br />
The presented protein structure data mining<br />
algorithm aims to identify significant interactions of residues within a triplet configuration.<br />
The rationale for analysing residue triplets is described in the following. In order to<br />
form a functional site in a protein structure, residues need to be in close physical contact.<br />
In other words, there exists a mutual dependency or interaction among the residues.<br />
The interaction can be studied on a two-residue basis (doublet 3D pattern). However,<br />
given the size of the structure data, the probability of any two-residue configuration is<br />
too high for it to be detected as specific. Hence, the signal/noise ratio is the reason why<br />
a two-residue 3D pattern is not the target of protein structure data mining [Old02].<br />
A two-residue contact is completely defined by a scalar property, while a three-residue<br />
contact is defined by vectors. Consequently, a three-residue constellation encodes much<br />
more information. This makes information-theory-based methods tractable for finding conserved<br />
residue interactions as signals.<br />
In reality, functional sites can be composed of more than three residues, e.g. various<br />
metal binding sites use four coordinating cysteine residues. However, data sparseness<br />
and the mathematical complexity [CL64] [Sin04] of modelling interactions of four or more<br />
residues make this infeasible. In principle, the more variables are introduced in modelling<br />
residue interactions, the more specific the data mining. It should be noted that the identification<br />
of N-body interactions of residues can be approached combinatorially.<br />
Two triplets are combined if two out of the three residues of each<br />
triplet are identical [Old02]. This approach was adopted in this study to demonstrate that larger interaction<br />
configurations are extractable. However, this investigation concentrates mainly<br />
on the identification of three-residue interactions. The assumption is that if the output<br />
of the data mining provides valid results, the approach is justified and more complex residue<br />
configurations may inherit this property.<br />
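The combinatorial step, in which two triplets sharing two residues are merged into a four-residue configuration, can be sketched as follows. This is only an illustration of the rule stated above, not the implementation used in this work; the residue identifiers (such as "C10") and the function name are hypothetical.<br />

```python
from itertools import combinations


def merge_triplets(triplets):
    """Combine every pair of triplets that share exactly two residues.

    triplets -- iterable of 3-tuples of residue identifiers (hypothetical labels)
    Returns a set of frozensets, each holding the four residues of a merged pair.
    """
    merged = set()
    for first, second in combinations(triplets, 2):
        if len(set(first) & set(second)) == 2:
            merged.add(frozenset(first) | frozenset(second))
    return merged


# Two triplets around a hypothetical four-cysteine site, plus an unrelated triplet.
triplets = [("C10", "C13", "C30"), ("C10", "C13", "C33"), ("H50", "D80", "S120")]
four_body = merge_triplets(triplets)
```

Here the first two triplets share "C10" and "C13" and are merged into one four-residue set, while the third triplet is left alone.<br />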
Side chain interaction model.<br />
The determination of residue interactions requires a<br />
transformation of a full-atom model into a simpler representation. This is because a<br />
mathematical model that describes all combinations of atom interactions of two<br />
residues would be too complex. The solution is to replace the all-atom structure model<br />
with a coarse-grained model, by reducing each residue to a single point. In principle,<br />
a residue point can be calculated either as the centre of mass or as the geometric centre<br />
(centroid). Each representation can be calculated from main chain atoms, main and side<br />
chain atoms, or side chain atoms only.<br />
The focus of this study is on side chain interactions within residue triplet configurations.<br />
For this reason, a protein structure is represented as a point spread of side chain<br />
centroids.<br />
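A side chain centroid can be computed as in the following minimal sketch. It assumes that the atoms of one residue are given as (atom name, x, y, z) tuples using PDB atom names, and it treats N, CA, C, O and the terminal OXT as main chain atoms. How residues without side chain atoms (glycine) are handled is not stated here, so the sketch simply returns None for them; that choice and the function name are assumptions.<br />

```python
# PDB atom names treated as main chain (assumption: OXT is counted as main chain).
MAIN_CHAIN = {"N", "CA", "C", "O", "OXT"}


def side_chain_centroid(atoms):
    """Geometric centre of the side chain atoms of one residue.

    atoms -- list of (atom_name, x, y, z) tuples for a single residue
    Returns an (x, y, z) tuple, or None if there are no side chain atoms.
    """
    side_chain = [(x, y, z) for name, x, y, z in atoms if name not in MAIN_CHAIN]
    if not side_chain:
        return None
    n = len(side_chain)
    return tuple(sum(coord[axis] for coord in side_chain) / n for axis in range(3))


# Serine-like residue with made-up coordinates: side chain atoms are CB and OG.
serine = [("N", 0.0, 0.0, 0.0), ("CA", 1.0, 0.0, 0.0), ("C", 2.0, 0.0, 0.0),
          ("O", 2.0, 1.0, 0.0), ("CB", 1.0, 2.0, 3.0), ("OG", 3.0, 4.0, 5.0)]
centroid = side_chain_centroid(serine)
```

A centre-of-mass variant would weight each atom by its atomic mass instead of averaging the coordinates uniformly.<br />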
Protein structure triangulation.<br />
The extraction of residue triplets from a protein is<br />
based on triangulation of structures. Here structures are triangulated on the basis of three<br />
criteria. The first is the compositional constraint. Each residue in a triplet must be one<br />
of the 20 natural amino acids, while hetero atoms are excluded. One prominent<br />
reason is that the dataset contains too few examples of residue-hetero atom interactions<br />
to support a statistical analysis.<br />
The second condition of triplet extraction requires that none of the residues are direct<br />
neighbours in the protein sequence. The assumption made here is that two covalently<br />
bonded residues are more likely to be next to each other in space than two residues<br />
that are not bonded. Similarly, the probability of finding three connected residues close<br />
together in space is higher than that of finding three unconnected residues. Consequently,<br />
sequence neighbours would be over-represented in the distribution of interacting residues in space. The<br />
definition of residue neighbourhood affects the data mining result, e.g. by requiring a<br />
pair interaction in the triplet to have a sequence distance of more than one residue, patches of<br />
residues at one side of a beta-sheet may not be discovered. While tuning this parameter<br />
can modify the result of the data mining, the objective here is to discover new knowledge<br />
from the input data set by providing as little biological information as possible.<br />
The last criterion in triplet extraction concerns the geometrical property of a<br />
triplet. The Euclidean distances between the residues must fulfil the triangle inequality,<br />
while only two interaction distances of less than 6 Å were allowed. Although the interaction<br />
distance threshold is based on an empirical study of a number of protein structures, this<br />
value may not be adequate, because it would favour close contacts of large side chains<br />
of residue pairs. For example, the pair interaction of two tryptophans may have a centroid<br />
distance near the maximal allowed interaction distance, while the contacting<br />
atoms are actually very close. The alternative is to set up a threshold system for residue<br />
pairs or triplets that depends on the types of residues. Although this approach was not<br />
studied in this thesis, future work could use it to improve the developed algorithm. Yet another<br />
approach to selecting residue interactions from a protein structure is based on the analysis<br />
of surface contacts of the side chain groups. While not all functional sites require their<br />
constituents to be in physicochemical contact (e.g. a metal binding site consists of metal<br />
ion coordinating residues without physical contacts), a protein binding site is an example<br />
where residues of two different proteins are in non-covalent interaction. However, the<br />
presented data mining approach aims at an unbiased search for residue interactions in<br />
a dataset of monomeric protein structure domains, and therefore a surface-based selection<br />
criterion would bias the analysis.<br />
Implementation<br />
A coarse-grained representation is used in this protein structure analysis. From a full-atom<br />
model of a protein structure, the centroid position of each protein residue was calculated<br />
on the basis of its side chain atoms. The resulting simplified structure model is then<br />
triangulated based on three criteria: (1) each residue in a triplet must be one of<br />
the 20 natural amino acids; (2) pairs of residues in the triplet must not be sequential<br />
neighbours in the protein sequence; and (3) only two pairs of residues<br />
can have a maximal interaction distance of 6 Å, and only one pair an interaction<br />
distance of less than 12 Å.<br />
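The three criteria can be sketched as a brute-force filter over all residue triples. This is an illustration, not the implementation used in this work. It rests on one reading of criterion (3), namely that the two shortest centroid distances must be at most 6 Å and the longest must be below 12 Å; it also assumes a single chain with integer sequence positions, and the input layout and function name are hypothetical.<br />

```python
from itertools import combinations
from math import dist

STANDARD = {"ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
            "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"}


def extract_triplets(residues, short=6.0, long=12.0):
    """Return all residue triplets that satisfy the three criteria.

    residues -- list of (sequence_position, residue_name, centroid) tuples
    short    -- maximal distance for the two closest pairs (6 Å in the text)
    long     -- upper bound for the remaining pair (12 Å in the text)
    """
    # Criterion 1: only the 20 natural amino acids; hetero groups are dropped.
    usable = [r for r in residues if r[1] in STANDARD]
    triplets = []
    for a, b, c in combinations(usable, 3):
        # Criterion 2: no two residues are direct neighbours in the sequence.
        pos = sorted((a[0], b[0], c[0]))
        if pos[1] - pos[0] <= 1 or pos[2] - pos[1] <= 1:
            continue
        # Criterion 3 (assumed reading): two distances <= short, the third < long.
        d = sorted((dist(a[2], b[2]), dist(b[2], c[2]), dist(a[2], c[2])))
        if d[1] <= short and d[2] < long:
            triplets.append((a, b, c))
    return triplets


# Made-up coordinates: one compact triplet, one distant residue, one water molecule.
residues = [(1, "HIS", (0.0, 0.0, 0.0)), (3, "ASP", (4.0, 0.0, 0.0)),
            (5, "SER", (0.0, 4.0, 0.0)), (6, "GLY", (30.0, 30.0, 30.0)),
            (8, "HOH", (1.0, 1.0, 1.0))]
found = extract_triplets(residues)
```

Enumerating all triples scales cubically with the number of residues, so a practical implementation would first restrict the search to residues within the distance cut-off, for instance with a spatial grid.<br />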
For the interaction analysis it is necessary to define a hash table that is keyed on the<br />
residue names and on integer values of the centroid distances. The integer value of a distance is<br />
calculated by dividing the measured distance by a precision value (hash precision), which<br />
was set to ±0.5Å. Given a 3-body with<br />
\[ \mathit{trip} = (A, B, C), \tag{3.1} \]<br />
a three-dimensional hash table is defined as<br />
\[ HT(A, B, C) = \text{3D hash bin}[i][j][k], \tag{3.2} \]<br />
where i, j, and k are the integer values of the measured distances between the spatial coordinates<br />
of two residues. The integer values are given by<br />
\[
\begin{aligned}
i &= \mathrm{INT}(\mathrm{dist}(A, B)/\text{hash precision}) \\
j &= \mathrm{INT}(\mathrm{dist}(B, C)/\text{hash precision}) \\
k &= \mathrm{INT}(\mathrm{dist}(A, C)/\text{hash precision}).
\end{aligned} \tag{3.3}
\]<br />
For a detailed definition of the implemented hash table, cf. [Old01].<br />
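Equations 3.1 to 3.3 can be sketched as follows. This is an illustration only: the key layout, the use of a counter as the table, and the function name are assumptions, and the sketch does not address how the order of A, B and C is made canonical (for that, see the cited definition).

```python
from collections import Counter
from math import dist

HASH_PRECISION = 0.5  # Å, the precision value quoted in the text

def hash_key(trip, precision=HASH_PRECISION):
    """Map a 3-body ((name, centroid), ...) to (names..., i, j, k) as in equation 3.3."""
    (name_a, ca), (name_b, cb), (name_c, cc) = trip
    i = int(dist(ca, cb) / precision)
    j = int(dist(cb, cc) / precision)
    k = int(dist(ca, cc) / precision)
    return (name_a, name_b, name_c, i, j, k)

# The table simply counts how often each (residues, i, j, k) bin is observed.
table = Counter()
example_trip = (
    ("HIS", (0.0, 0.0, 0.0)),
    ("ASP", (4.2, 0.0, 0.0)),
    ("SER", (0.0, 3.1, 0.0)),
)
example_key = hash_key(example_trip)
table[example_key] += 1
```

Here the three centroid distances are about 4.2Å, 5.2Å and 3.1Å, which fall into the bins i = 8, j = 10 and k = 6.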
3.1.2 Detection of significant configurations as interactions<br />
Theory<br />
The method for residue interaction detection relies on the comparison of two probabilistic<br />
models: the reductionistic part-to-whole approximation model, and the holistic reference<br />
model. Part-to-whole approximation is modelled with a collection of marginal distributions<br />
defined by subsets of the variables. Formally, a 3-body consists of three variables<br />
(cf. equation 3.1). To verify whether the probability of a triplet, P(A, B, C), can be<br />
factorised, we attempt to approximate it by using all attainable marginals<br />
\[ M = \{P(A, B), P(A, C), P(B, C), P(A), P(B), P(C)\}. \tag{3.4} \]<br />
If the approximation fits the data, i.e. the probability of finding a particular triplet<br />
is explained by the approximation model, then there is no evidence for an interaction.<br />
In other words, a significant interaction is given when the two models are significantly<br />
different.<br />
The difference between two joint probability density functions O and M is<br />
measured by the Kullback-Leibler divergence<br />
\[ D(O\|M) = \sum_{i} O(i) \log \frac{O(i)}{M(i)}. \tag{3.5} \]<br />
In this context O usually refers to the observed probability or the reference model,<br />
while M is the approximation model. The null hypothesis in testing the interaction model<br />
is that the part-to-whole approximation matches the observed data. The alternative one<br />
is that the approximation does not fit and that there is an interaction. Three cases can<br />
be listed:<br />
\[
\begin{aligned}
D(O\|M) > 0 &: \text{there is a pattern among } k \text{ attributes} \\
D(O\|M) = 0 &: \text{there is no pattern of order } k \\
D(O\|M) < 0 &: \text{there is redundancy among the parts.}
\end{aligned} \tag{3.6}
\]<br />
Within a 3-body system, four different configurations <strong>of</strong> interactions can be defined<br />
(cf. figure 3.2): no-interaction, one-pair interaction, two-pair interactions, and three-pair<br />
interactions. For each <strong>of</strong> these configurations it is possible to formulate a part-to-whole<br />
approximation model, i.e. the interaction can be factorised. In the case <strong>of</strong> no-interaction,<br />
the probability of the observable is expected to be estimated by its singlet probabilities<br />
\[ k = 0: \quad \hat{P}_{0}(A, B, C) = P(A)\,P(B)\,P(C), \tag{3.7} \]<br />
whereas in a system with one-pair interaction, two variables are dependent on each other.<br />
Consequently, within a 3-body state there are three isoforms of one-pair interactions:<br />
\[
k = 1: \quad
\begin{cases}
\hat{P}_{1,1}(A, B, C) = \dfrac{P(A,B)\,P(C)}{P(A)\,P(B)} \\[2ex]
\hat{P}_{1,2}(A, B, C) = \dfrac{P(A,C)\,P(B)}{P(A)\,P(C)} \\[2ex]
\hat{P}_{1,3}(A, B, C) = \dfrac{P(B,C)\,P(A)}{P(B)\,P(C)}
\end{cases} \tag{3.8}
\]<br />
Figure 3.2: Four classes of interactions within a 3-body. A circle represents a protein residue, and an<br />
intersection resembles an interaction between two residues. k=0: no-interaction; k=1: one-way or one-pair<br />
interaction; k=2: two-way or two-pair interactions; k=3: three-way or three-pair interactions.<br />
There are two forms of three-variable interactions, with different dependencies:<br />
two-pair interaction (k=2) and three-pair interaction (k=3). These interactions represent<br />
the target of 3D pattern mining. In a two-pair interaction, two pairs of variables are dependent<br />
on each other, while sharing a common attribute. For example, given A interacts<br />
with B, and B interacts with C, there is no clear observation that A also interacts with<br />
C. Three isoforms are formulated for this interaction:<br />
\[
k = 2: \quad
\begin{cases}
\hat{P}_{2,1}(A, B, C) = \dfrac{P(A,B)\,P(B,C)}{P(B)} \\[2ex]
\hat{P}_{2,2}(A, B, C) = \dfrac{P(A,C)\,P(A,B)}{P(A)} \\[2ex]
\hat{P}_{2,3}(A, B, C) = \dfrac{P(B,C)\,P(A,C)}{P(C)}
\end{cases} \tag{3.9}
\]<br />
In the case of a three-pair interaction, all three variables are dependent on each other, and<br />
the approximation model is defined as<br />
\[ k = 3: \quad \hat{P}_{3,1}(A, B, C) = \frac{P(A,B)\,P(B,C)\,P(A,C)}{P(A)\,P(B)\,P(C)}. \tag{3.10} \]<br />
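The part-to-whole models and their divergence from the observed distribution can be sketched as below. The sketch covers equations 3.5, 3.7, the first isoform of 3.9, and 3.10 only; the function names and the dictionary representation of the joint distribution are assumptions made for illustration.

```python
from collections import defaultdict
from math import log

def marginals(joint):
    """Singlet and doublet marginals of a joint distribution {(a, b, c): p}."""
    m = defaultdict(float)
    for (a, b, c), p in joint.items():
        m["A", a] += p
        m["B", b] += p
        m["C", c] += p
        m["AB", a, b] += p
        m["BC", b, c] += p
        m["AC", a, c] += p
    return m

def approximations(joint):
    """Part-to-whole models for k=0 (3.7), k=2 via B (3.9) and k=3 (3.10)."""
    m = marginals(joint)
    k0, k2, k3 = {}, {}, {}
    for (a, b, c) in joint:
        pa, pb, pc = m["A", a], m["B", b], m["C", c]
        pab, pbc, pac = m["AB", a, b], m["BC", b, c], m["AC", a, c]
        k0[a, b, c] = pa * pb * pc
        k2[a, b, c] = pab * pbc / pb
        k3[a, b, c] = pab * pbc * pac / (pa * pb * pc)
    return {"k0": k0, "k2": k2, "k3": k3}

def kl_divergence(o, m):
    """D(O||M) as in equation 3.5; m need not be normalised."""
    return sum(p * log(p / m[x]) for x, p in o.items() if p > 0)

# Toy distribution in which A, B and C are perfectly correlated.
observed = {("x", "x", "x"): 0.5, ("y", "y", "y"): 0.5}
divergences = {
    name: kl_divergence(observed, model)
    for name, model in approximations(observed).items()
}
```

For this toy distribution the no-interaction model gives a positive divergence (log 4), the two-pair model reproduces the data exactly (divergence 0), and the three-pair model gives a negative value because that approximation is not normalised, which corresponds to the third case of equation 3.6.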
If the state is disturbed, e.g. by exchanging one variable, a partial interaction will<br />
not be observed. With respect to protein biology, this could mean that a residue mutation<br />
abolishes an intramolecular stabilising network. However, because such a mutation does not<br />
provide an evolutionary advantage, the conservation of this residue is likely to be promoted<br />
and can be detected as a recurrent structural feature.<br />
The determined sets of two-way (k=2) and three-way (k=3) interactions are the targets<br />
in this data mining.<br />
Implementation<br />
Triplets of residues are classified into one of the four defined interaction configurations.<br />
The classification is based on the non-parametric cross-validation sampling method described<br />
by [JB04]. A significant interaction is given when the two models O and M are<br />
significantly different. Because the data can be regarded as a sample of a multinomial distribution,<br />
the representativeness of the approximation model can be tested against the self-loss<br />
function D(P||P′). Here, P′ and P are the probability distributions of two samples of equal<br />
size. The weight of evidence for accepting the null hypothesis, i.e. the approximation<br />
model, can be estimated by p_cv-values from a 2-fold cross-validation. For each random<br />
sampling the dataset is partitioned into two equally sized subsets: one training set and<br />
one test set. The joint probability distribution functions P′ and P<br />
are determined from the training and test set, respectively. The marginal distributions,<br />
singlets and doublets, are determined from P′ to construct the part-to-whole approximation<br />
P̂′. The p_cv-value is defined as the probability that the self-loss is greater than or equal<br />
to the approximation loss:<br />
\[ p_{cv} = \Pr\{D(P\|P') \geq D(P\|\hat{P}')\}. \tag{3.11} \]<br />
On the basis of p_cv-values, an interaction is accepted if p_cv ≤ α, and an interaction<br />
is rejected if p_cv > α. High threshold values of α, e.g. 0.95, bias the classification towards an<br />
interaction and risk overfitting, while lower values, e.g. 0.05, move the bias towards the<br />
no-interaction model and risk underfitting. In this study, a reductionistic bias was<br />
chosen, preferring the simpler no-interaction model, by selecting α = 0.05. The value<br />
of α follows the work of [JB04].<br />
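The estimation of p_cv in equation 3.11 can be sketched as follows. This is a simplified illustration and not the procedure of [JB04] in detail: the pseudo-count smoothing, the fixed random seed, the function names and the restriction to the no-interaction model of equation 3.7 are all assumptions.

```python
import random
from collections import Counter
from math import log

def distribution(sample, support, pseudo=1.0):
    """Smoothed relative frequencies (the pseudo-count is an assumption to avoid zeros)."""
    counts = Counter(sample)
    total = len(sample) + pseudo * len(support)
    return {x: (counts[x] + pseudo) / total for x in support}

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) over the support of p."""
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)

def no_interaction_model(p):
    """Part-to-whole approximation P(A)P(B)P(C) built from the singlets of p."""
    pa, pb, pc = Counter(), Counter(), Counter()
    for (a, b, c), value in p.items():
        pa[a] += value
        pb[b] += value
        pc[c] += value
    return {(a, b, c): pa[a] * pb[b] * pc[c] for (a, b, c) in p}

def p_cv(data, approximate, iterations=100, seed=0):
    """Fraction of random 2-fold splits in which self-loss >= approximation loss."""
    rng = random.Random(seed)
    support = sorted(set(data))
    hits = 0
    for _ in range(iterations):
        shuffled = list(data)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        p_train = distribution(shuffled[:half], support)  # P'
        p_test = distribution(shuffled[half:], support)   # P
        if kl(p_test, p_train) >= kl(p_test, approximate(p_train)):
            hits += 1
    return hits / iterations

# Toy data: three perfectly correlated variables versus three independent ones.
correlated = [(0, 0, 0)] * 200 + [(1, 1, 1)] * 200
independent = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 50
alpha = 0.05
correlated_is_interaction = p_cv(correlated, no_interaction_model) <= alpha
independent_is_interaction = p_cv(independent, no_interaction_model) <= alpha
```

With these toy inputs the correlated sample is classified as an interaction, because the no-interaction model never fits better than the training distribution itself, whereas the independent sample is not.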
3.1.3 Grouping and selecting frequent configurations<br />
Theory<br />
The result of data mining protein structures can be a large set of 3D patterns. The<br />
data need to be clustered in order to select the most frequent patterns. The assumption<br />
behind data clustering is that residue configurations in protein structures are unlikely<br />
to be absolute and static. By grouping spatially similar configurations, the geometrical<br />
variation of patterns can be compensated for and their frequencies improved.<br />
Implementation<br />
The objective in this section is to identify frequent groups of geometrically similar triplets<br />
with identical chemical configurations. Data clustering was done in two steps. For each<br />
residue triplet combination, the initial step is to group geometrically similar patterns<br />
and then count the combined frequencies<br />
\[ G(HT(i, j, k)) = \sum_{a=i-1}^{i+1} \sum_{b=j-1}^{j+1} \sum_{c=k-1}^{k+1} HT(a, b, c), \tag{3.12} \]<br />
where HT is a hash table of the residue triplets (cf. equation 3.2). Local geometrical<br />
peaks were then searched by comparing the frequencies of the grouped triplets<br />
\[ \max_{(a,b,c)} G(HT(a, b, c)) < G(HT(i, j, k)), \tag{3.13} \]<br />
where HT(a, b, c) ≠ HT(i, j, k) with a ∈ {i − 1, i, i + 1}, b ∈ {j − 1, j, j + 1} and<br />
c ∈ {k − 1, k, k + 1}.<br />
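The grouping and peak search of equations 3.12 and 3.13 can be sketched for the distance bins of a single residue triplet combination. The sketch rests on two assumptions that the text does not state: a bin is compared only with populated neighbouring bins, and ties are not counted as peaks. The function names are hypothetical.

```python
from itertools import product

# All 27 offsets of the 3 x 3 x 3 neighbourhood, including (0, 0, 0).
OFFSETS = list(product((-1, 0, 1), repeat=3))

def grouped_frequency(ht, cell):
    """G(HT(i, j, k)): summed counts of the bin and its adjacent bins (equation 3.12)."""
    i, j, k = cell
    return sum(ht.get((i + di, j + dj, k + dk), 0) for di, dj, dk in OFFSETS)

def local_peaks(ht):
    """Bins whose grouped frequency exceeds that of every populated neighbour."""
    grouped = {cell: grouped_frequency(ht, cell) for cell in ht}
    peaks = []
    for (i, j, k), value in grouped.items():
        neighbours = [
            (i + di, j + dj, k + dk)
            for di, dj, dk in OFFSETS
            if (di, dj, dk) != (0, 0, 0) and (i + di, j + dj, k + dk) in grouped
        ]
        if all(grouped[n] < value for n in neighbours):
            peaks.append((i, j, k))
    return sorted(peaks)

# Toy counts per (i, j, k) distance bin.
example_ht = {(8, 10, 6): 5, (9, 10, 6): 2, (7, 10, 6): 1, (20, 20, 20): 3}
example_peaks = local_peaks(example_ht)
```

In the toy table the bin (8, 10, 6) collects the counts of its two neighbours (grouped frequency 8) and is reported as a peak, as is the isolated bin (20, 20, 20).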
The second step in data clustering finds subgroups of triplets from a local peak, based<br />
on an all-atom structure alignment. The determined clusters are ranked by their probability<br />
scores, which are defined as:<br />
\[ P(\text{cluster}) = \frac{\#\text{cluster member}}{\#\text{peak member}}. \tag{3.14} \]<br />
On the basis of P(cluster), a cluster of residue interactions is selected if P(cluster) ≥<br />
τ. In this study, the threshold τ for selecting a cluster was set to 0.66.<br />
Dataset | PDBIDs | Domains | Domain definition | Data selection | Properties<br />
OLDFIELD | 1,442 | 2,320 | mathematical | Sequence alignment | Homologous structural features of divergent proteins.<br />
SCOP40 | 3,449 | 4,734 | human expert | Structure comparison | Convergent structural features of divergent proteins.<br />
Figure 3.3: Non-redundant structure set for 3D pattern mining. The dataset OLDFIELD is based on<br />
the publication of [Old02], and SCOP40 was obtained from ASTRAL Compendium [BKL00]. The size<br />
of the datasets, the method for data selection, and key properties are summarised.<br />
3.2 Analysing available non-redundant protein structure<br />
sets<br />
The significance of this data mining result depends strongly on the representativeness<br />
of the data. For the frequencies of structural features to be true, they would have to be<br />
taken from protein structures of all of the naturally occurring protein folds. However,<br />
such a data resource is not available at present (cf. section 2.1.1). This effectively means<br />
that protein structure data mining is bound by the availability of fold examples. While<br />
from a bioinformatical point of view little can be done to improve the coverage of the<br />
fold space, a number of efforts have been dedicated to the compilation of non-redundant<br />
datasets from PDB.<br />
The results in this thesis are based on the study of two non-redundant protein structure<br />
sets: OLDFIELD [Old02] and SCOP40 [HMBC97] [BKL00]. Figure 3.3 summarises<br />
key features of each dataset. The major distinction between the two datasets lies in the<br />
definition of a non-redundant dataset. The purpose in compiling OLDFIELD is to create<br />
a dataset that allows the detection of interesting structural equivalence from the<br />
non-specific structural features. The primary data selection is in sequence space. The<br />
resulting dataset contains only sequentially dissimilar protein fragments, while common<br />
fold motifs are preserved. This allows the detection of homologous structural components<br />
of divergent proteins. In contrast, SCOP represents a biased view of protein data by defining<br />
classes in structure space. The assignment of a novel protein to a class is based on<br />
structure and sequence comparisons. SCOP40 is the data subset of SCOP in which sequentially<br />
divergent proteins with convergent structural features are retained. Because the<br />
classification contains structurally divergent proteins, any identified recurrent structural<br />
feature in SCOP40 is an indication of convergent evolution.<br />
Another distinction between OLDFIELD and SCOP40 is the method of identifying<br />
domain structures. In OLDFIELD, protein fragmentation was done mathematically by<br />
analysing Cα distances [Old01], while in SCOP40 human experts were recruited to process<br />
a batch of protein structures. Both approaches have their advantages and caveats. On the one<br />
hand, an automatic structure domain identification system can deliver reproducible data,<br />
while the results may not be justified in some cases. On the other hand, expert curated<br />
data represent a single precision view, but the information is difficult to reproduce<br />
as new data become available.<br />
The difference between automatic and manual data selection is also reflected in the size of<br />
the datasets. In 2002, the compiled non-degenerated domain structure set from OLDFIELD<br />
listed 2,320 domain structures, corresponding to 1,442 PDB identifiers. In contrast,<br />
SCOP40 contained 4,734 domain structures determined from 3,449 PDB identifiers<br />
in the same year.<br />
3.3 Evaluation methods<br />
The presented 3D pattern identification system is a discovery-driven data mining solution.<br />
The assessment of performance is done on two levels: the study of parameter dependency<br />
(presented in this chapter), and the validation of the biological significance of the data (cf.<br />
chapter 4).<br />
The effect of data-related parameters was studied by comparing the mined results from<br />
OLDFIELD and SCOP40. In the first part of the analysis, the distributions of extracted<br />
residue triplets were compared. Then the determined sets of k=2 and k=3 interactions<br />
were studied.<br />
The developed data mining method is a three-step process, and the effect of<br />
algorithm-related parameters was studied on two levels. Although the developed data mining<br />
method is controlled by many different parameters, only the following key parameters were<br />
studied: the residue interaction distance, and the size of the cross-validation used to compute p-values.<br />
The effect of the interaction distance parameter was studied by varying the maximal<br />
distance between the centroids of residues. Three different distance settings were tested:<br />
4Å, 6Å, and 8Å.<br />
Repeated cross-validation sampling was used to determine confidence values for residue<br />
triplet classification. Various numbers of iterations were tested (from 100 to 1,500 in steps of 100) to<br />
study the effect on the size of the interaction datasets.<br />
3.4 Results<br />
3.4.1 Identification of residue interactions is dependent on data<br />
selection<br />
The result of a data mining analysis depends strongly on the input dataset. The<br />
objective in this section is to study the effect of data-related parameters by comparing<br />
results from data mining on OLDFIELD and SCOP40.<br />
With 590,255 unique triplet configurations in SCOP40 and 429,471 in OLDFIELD,<br />
the common set of triangulated triplets is 381,578 (cf. figure 3.4). Due to the difference<br />
in the probability distributions of both datasets, the classification of residue interactions<br />
resulted in different sizes of interaction classes. A set analysis on the classification data<br />
shows that the classes have different sizes of overlaps (cf. figure 3.5). For example,<br />
OLDFIELD/k=3 and SCOP40/k=3 have a large common set of residue configurations of<br />
around 89 per cent for OLDFIELD and 44 per cent for SCOP40. In contrast, the common<br />
set of k=2 interactions is much smaller, i.e. 21 per cent for OLDFIELD and 13 per cent for<br />
SCOP40. The analysis also found two proportions of non-agreed classifications (k2/k3<br />
between OLDFIELD/SCOP40).<br />
These results highlight the effect of data selection on the data mining result. A different<br />
probability distribution of residue triplets, singlets and doublets is the reason why certain<br />
residue configurations were classified as k=2 in one dataset, and k=3 in another dataset.<br />
3.4.2 The interaction distance correlates with the distribution<br />
of residue triads<br />
The extraction of residue configurations is controlled by the data representation, the feature<br />
extraction, and the feature selection method. Structural features were extracted by<br />
triangulation of a protein structure, which was modelled as a point spread of side chain<br />
centroids. The goal in this section is to study the effect of varying the interaction distance<br />
parameter. For this analysis the dataset OLDFIELD was used.<br />
Table 3.1 summarises the sets of residue triplets determined with three different<br />
maximal interaction distances. A change of the distance threshold changes the number<br />
of extracted triplets and the probability distributions of the singlets and doublets<br />
(data not shown). Consequently, the significance testing of residue interactions<br />
returns different results. It must be noted that a complete analysis with an 8Å interaction<br />
distance was not done in this study.<br />
In conclusion, the effect of varying the interaction distance on the triangulation output<br />
is in agreement with the expected result. While the frequencies of "small" triplet<br />
configurations stay the same as the interaction distance threshold is increased, the calculated<br />
probabilities differ because of the different distributions. This also affects<br />
the result of interaction classification.<br />
Figure 3.4: Distribution analysis of extracted residue triplets. The determined residue triplet distribution<br />
from OLDFIELD is compared with SCOP40. The upper panel shows a set analysis of the extracted<br />
residue triplets (numbers are the unique counts of the residue configuration). The middle panel illustrates<br />
the frequency of each triplet (t) (represented as information, I(t)) from the set of triplets (T). For<br />
a better visualisation the difference of the distributions is measured by the Kullback-Leibler divergence<br />
(lower panel).<br />
Figure 3.5: Comparison of extracted residue triplets based on their interaction type. The determined<br />
k=2 and k=3 classification sets from OLDFIELD and SCOP40 are compared by a set analysis. Due to<br />
the interaction classification (k=2, and k=3) there is no intersection of all four datasets.<br />
Distance (Å) | Total triplets | Unique triplets | k=2 | k=3<br />
4 | 2,938 | 1,799 | 16 | 165<br />
6 | 1,379,545 | 429,471 | 9,681 | 134,465<br />
8 | 7,128,886 | 2,016,306 | N/A | N/A<br />
Table 3.1: Study on the effect of varying the interaction distance threshold in structure triangulation.<br />
The different determined sets of residue triplet configurations in OLDFIELD were achieved by using the<br />
interaction distance thresholds: 4Å, 6Å, and 8Å.<br />
3.4.3 Interaction classification is sensitive to the size of cross-validation<br />
Significance testing of residue interactions is a method for assigning confidence values to<br />
the classification of residue triplets. The p-values were calculated from a two-fold<br />
cross-validation with n iterations of random data sampling. Here, the effect of varying the number<br />
of iterations is studied. OLDFIELD is used as the dataset for this analysis.<br />
Figure 3.6 shows the logarithmic dependency between the number of iterations and the size of the<br />
determined classification sets. Regression analysis indicates that a finite classification set was<br />
not reached after 1,500 iterations. The study of classified residue interactions from each<br />
iteration revealed that the set from iteration i is always a subset of the set from iteration j<br />
with i < j.<br />
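The subset relation between iteration sizes can be checked with a few lines. The sketch below is illustrative only; the function name, the dictionary layout and the toy contents are assumptions and do not reproduce data from this study.

```python
def is_nested(sets_by_iterations):
    """True if each classification set is contained in the set of the next larger run.

    sets_by_iterations: {number_of_iterations: set_of_classified_triplets}
    (a hypothetical layout). Checking consecutive runs is sufficient because
    the subset relation is transitive.
    """
    ordered = [sets_by_iterations[n] for n in sorted(sets_by_iterations)]
    return all(smaller <= larger for smaller, larger in zip(ordered, ordered[1:]))

# Toy example with made-up triplet labels.
nested_example = {
    100: {"t1", "t2"},
    200: {"t1", "t2", "t3"},
    300: {"t1", "t2", "t3", "t4"},
}
broken_example = {100: {"t1", "t2"}, 200: {"t1", "t3"}}
```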
In conclusion, the result of varying the number of iterations indicates that the classification<br />
sets are stable and reproducible. With an increasing number of iterations, the previously determined<br />
sets are not altered; the classification result is therefore reliable, although additional elements are identified.<br />
3.5 Discussion<br />
3D pattern identification is the result of a data mining method that finds recurrent structural<br />
features within a protein dataset. The developed analysis method consists of three<br />
major modules: triangulation of a protein structure, significance testing of residue<br />
interaction, and data clustering of the determined residue interactions.<br />
Figure 3.6: The effect of varying the cross-validation sample size on significance testing of residue<br />
interaction. The diagram shows the increasing but converging number of determined residue triplet<br />
configurations with one-way, two-way, and three-way interactions at various iteration steps (from 100 to<br />
1,500 in steps of 100) of a non-parametric cross-validation sampling.<br />
Protein structure triangulation is the basis for collecting spatial configurations of residues.<br />
The definition of a residue interaction is a complex task, because an amino acid consists<br />
of many atoms, many of which are candidate interaction partners. A coarse-grained<br />
model was used to overcome this problem, at the cost, however, of redefining the interaction<br />
distance. Instead of measuring interaction distances between atoms of two different<br />
amino acids, the distance between the side chain centroids is used. The theoretical physicochemical<br />
interaction distance between two atoms cannot be transferred to<br />
centroid-based side chain interactions. The upper bound of the interaction distance of 6Å was<br />
determined from several visual inspections and measurements of residue configurations.<br />
The analysis shows that with d = 6Å, various side chain rotamer configurations are captured,<br />
which may represent a physicochemical interaction. By reducing the interaction<br />
distance threshold, a bias towards tightly inert residue configurations is observed. Conversely,<br />
an increase in d results in a huge set of triplet combinations. Some of the larger<br />
triplets do not capture a 3-body interaction, but may be part of a four-body interaction,<br />
where the fourth residue is situated between all three residues. Although larger interaction<br />
states may reflect a complete picture of a structural unit, the primary aim here is to<br />
find local and adjacent interactions of residues.<br />
The performance of correlation analysis based on hash tables is sensitive to positional<br />
errors, which typically translate into the computation of "wrong" hash bin indices.<br />
Consider the sample values a = 3.99, b = 4.01, and c = 4.99, where a is assigned to hash<br />
bin index i(a) = 1, while b and c are assigned to i(b) = i(c) = 2. The difference between a<br />
and b is actually smaller than that between b and c. The correlation analysis with these hashed data seems<br />
to be inadequate, although the "correct" hash bin is in the neighbourhood. A solution<br />
to this problem is to consider adjacent hash bins, i.e. a rectangular region, of the table<br />
[LW91].<br />
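The boundary effect and the neighbourhood lookup can be reproduced with a small sketch. The bin width of 2.0 is an assumption chosen only because it yields the indices quoted in the example above; it is not the hash precision used in this study, and the function names are hypothetical.

```python
def bin_index(value, width):
    """Integer hash bin of a value for a given bin width."""
    return int(value / width)

def neighbouring_bins(index):
    """The bin itself and its two adjacent bins along one axis."""
    return (index - 1, index, index + 1)

WIDTH = 2.0  # assumed width that reproduces i(a) = 1 and i(b) = i(c) = 2
a, b, c = 3.99, 4.01, 4.99
indices = [bin_index(v, WIDTH) for v in (a, b, c)]

# a and b differ by only 0.02 but land in different bins; a lookup that also
# inspects the adjacent bins of b still finds a.
a_found_near_b = bin_index(a, WIDTH) in neighbouring_bins(bin_index(b, WIDTH))
```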
The identification of an interaction class, e.g. a two-way interaction, is based on a<br />
probabilistic classification approach. Confidence values were assigned to the classification<br />
result by calculating p-values from non-parametric cross-validation sampling. Theoretically,<br />
the more sampling iterations are used, the more stable the calculated p-values<br />
become. At a certain point, the number of determined interacting residues should converge<br />
to some value. The implication of determining a stable p-value is the identification of a<br />
finite set of residue interactions. Within this study, the final set was not determined and,<br />
for practical reasons, the set obtained after 100 iterations was used.<br />
The output of extracted patterns depends on the distribution of structural features<br />
in the input dataset. The introduced algorithm is based on the assumption that there<br />
are significant trends of residue configurations in proteins, if these interactions provide<br />
a significant functional or structural advantage. Obviously, we cannot expect that data<br />
mining on two differently defined data selections would deliver the same mining output.<br />
From a mathematical point of view, the results are still correct, because the algorithm<br />
detects recurrent residue configurations in the data.<br />
3.6 Conclusion<br />
In this chapter, I have presented a novel data mining approach for the discovery of 3D<br />
patterns in protein structures. A pattern is a residue triplet with two- or three-way<br />
interaction of residues. The extraction of 3D patterns is not only dependent on<br />
algorithm-related parameters, but also on the data selection. The validity of the data mining<br />
approach is justified on the basis of knowing the limits and effects of data and parameters.<br />
In the following chapter, I will present the biological significance of the mined result.<br />
Chapter 4<br />
Prediction of functions for mined<br />
residue triads<br />
In the previous chapter, a data mining approach was introduced that identifies recurrent<br />
interacting residues as triplets in protein structures. Assuming that a certain residue<br />
configuration is conserved in evolution if it provides a structural or functional advantage,<br />
the mined 3D pattern may represent a functional site in the protein. The objective in<br />
this chapter is to demonstrate the biological validity of the data mined results by<br />
cross-validation with a reference database. I present two example cases of validated residue<br />
interactions. The first example represents the validation of a metal binding site, where<br />
the mined patterns represent either a homologous or a convergent structural feature.<br />
The second validation identifies the catalytic triad from the mined data. The analysis<br />
includes the search for a 4-body configuration of the catalytic triad (quartet), in order to<br />
find a previously reported conserved serine residue. The result presented in this chapter<br />
demonstrates the biological significance of the mined data, and justifies the data mining<br />
approach.<br />
4.1 Evaluation methods<br />
The biological significance of the mined 3D patterns is demonstrated by the rediscovery of<br />
known residue interactions. A systematic performance analysis in terms of coverage and<br />
accuracy is not possible, because a test set with complete functional annotations of local<br />
residue interactions with biological function is not available. Therefore, various protein<br />
databases were used as references for cross-validations.<br />
The automatic cross-validation of metal binding sites is based on the comparison of<br />
the mined 3D patterns with a metal binding site database. Two reference databases were<br />
used and the results compared with each other: MSDsite [GDO + 05] and MDB [CHR + 02].<br />
The identification of available metal binding sites in the input dataset considered only<br />
configurations with more than 2 residues. A hit was found if all residues of a metal binding<br />
site were present in a protein structure. Likewise, a mined 3D pattern was identified as<br />
a metal binding site if all residues of the pattern resemble a subset of a metal binding<br />
site. However, because a metal binding site can contain more than three residues, and<br />
the mined patterns can have two overlapping triplets, only identified metal binding sites<br />
were counted and not every matched pattern. The coverage is computed as:<br />
\[ c_{\text{coverage}} = \frac{\#\text{unique sites matched by all residues in a 3D pattern}}{\#\text{available sites in protein structure set}}. \tag{4.1} \]<br />
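Equation 4.1 can be sketched as a set comparison. The sketch is illustrative: the residue identifiers, the dictionary layout of the reference sites and the function name are assumptions, and the lookup against MSDsite or MDB itself is not shown.

```python
def coverage(patterns, sites, min_site_size=3):
    """Fraction of reference sites matched by at least one mined 3D pattern.

    patterns: iterable of frozensets of residue identifiers (hypothetical ids).
    sites:    dict mapping a site id to the set of its residue identifiers.
    Only sites with more than 2 residues are considered, and each site is
    counted once even if several overlapping patterns match it.
    """
    eligible = {sid: res for sid, res in sites.items() if len(res) >= min_site_size}
    if not eligible:
        return 0.0
    matched = {
        sid for sid, residues in eligible.items()
        if any(pattern <= residues for pattern in patterns)
    }
    return len(matched) / len(eligible)

# Toy data: site s1 is hit by two overlapping triplets, s2 by none,
# and s3 is ignored because it has only two residues.
example_sites = {"s1": {1, 2, 3, 4}, "s2": {7, 8, 9}, "s3": {5, 6}}
example_patterns = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({10, 11, 12})]
example_coverage = coverage(example_patterns, example_sites)
```

In this toy case one of the two eligible sites is matched, so the coverage is 0.5 even though two patterns hit the same site.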
The result of metal binding site cross-validation is compared with the performance of<br />
SIDEMINE [Old02] extraction. Because a similar experiment had not been performed before,<br />
the extraction was repeated here. The cross-validation of a metal binding site is analogous to the<br />
identification of active sites in the dataset (cf. above).<br />
The identification of a convergent metal binding site was done by a manual search in<br />
the mined output from SCOP40. The protein structures of a found metal binding site<br />
pattern were analysed with respect to their SCOP classification identifiers.<br />
Dataset | #Triangulated triplets | Interaction type | #Classified interactions | #Clustered patterns | #Pattern frequencies<br />
OLDFIELD | 429,471 | k=2 | 9,681 | 925 | 5,697<br />
OLDFIELD | 429,471 | k=3 | 134,465 | 1,007 | 11,957<br />
SCOP40 | 590,255 | k=2 | 15,455 | 765 | 927<br />
SCOP40 | 590,255 | k=3 | 269,683 | 2,019 | 2,361<br />
Table 4.1: Summary of extracted data at each protein structure data mining step. The data mining<br />
was performed on OLDFIELD and SCOP40. The number of identified residue triplet interactions is<br />
given in "#Classified interactions", while the column "#Clustered patterns" indicates the number of unique<br />
residue interaction configurations after data clustering, and "#Pattern frequencies" is the total number<br />
of examples of the found residue interactions in the dataset.<br />
The automatic cross-validation <strong>of</strong> catalytic residues was done by comparing residues<br />
from <strong>active</strong> site templates in CSA [PBT04]. The validation <strong>of</strong> a catalytic <strong>active</strong> site for<br />
all example protein structures was based on manual analysis.<br />
To test whether the mined result contains a second conserved serine residue in the<br />
catalytic triad (quartet) (Asp-His-Ser/Ser), larger residue configurations were constructed.<br />
The method for finding N-bodies is based on the algorithm <strong>of</strong> [Old02]: two 3D patterns<br />
(triplets) from the same protein structure were combined, if they share two common<br />
residues. The analysis considered only the search for 4-, 5-, and 6-bodies.<br />
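The merging rule can be sketched as follows, under the assumption that each triplet is represented as a set of residue identifiers from one protein structure. The function name `merge_triplets` is hypothetical, and the sketch covers only the 4-body step; larger N-bodies would follow by applying the same idea to the merged sets.<br />

```python
from itertools import combinations


def merge_triplets(triplets):
    """Merge every pair of triplets sharing exactly two residues into a 4-body."""
    four_bodies = set()
    for first, second in combinations(map(frozenset, triplets), 2):
        if len(first & second) == 2:
            four_bodies.add(first | second)
    return four_bodies


# Triplets for PDB entry 1elt, chain A (residues as listed in Table 4.4).
triplets_1elt = [
    {"ASP102", "HIS57", "SER195"},   # Asp-His-Ser
    {"HIS57", "SER195", "SER214"},   # His-2Ser
]
quartets = merge_triplets(triplets_1elt)
print(sorted(sorted(q) for q in quartets))
```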
4.2 Results<br />
In the following sections, the biological significance <strong>of</strong> the mined 3D patterns is evaluated.<br />
Data mining was performed on the datasets OLDFIELD and SCOP40 with the following<br />
parameters: interaction distance d = 6Å, cross-validation iteration = 100, and selection<br />
<strong>of</strong> cluster based on τ = 0.66 (cf. section 3.4). Table 4.1 summarises the extracted data<br />
at each processing step.<br />
Reference | Method | Dataset | Determined | Validated | Coverage<br />
MSDsite | mined 3D patterns | OLDFIELD | 567 | 85 | 0.15<br />
MSDsite | SIDEMINE | OLDFIELD | 567 | 60 | 0.11<br />
MDB | mined 3D patterns | OLDFIELD | 302 | 36 | 0.12<br />
MDB | SIDEMINE | OLDFIELD | 302 | 18 | 0.06<br />
Table 4.2: Identification of metal binding sites in OLDFIELD. The available metal binding sites in the<br />
protein domain structures in OLDFIELD (input dataset) were determined by two reference databases<br />
(MSDsite and MDB). The figures were compared with the cross-validated metal binding sites in the<br />
mined 3D pattern dataset. A hit was found in the pattern data if all three residues of a pattern are a<br />
subset of the residues of a metal binding site. The performance was measured in terms of coverage.<br />
4.2.1 Identification <strong>of</strong> homologous metal binding <strong>sites</strong><br />
Metal binding proteins play a vital role in a wide range <strong>of</strong> biological processes, such as<br />
structural stability and complex formation. The identification <strong>of</strong> metal binding proteins<br />
is therefore crucial. The objective in this section is to identify metal binding <strong>sites</strong> within<br />
the mined 3D patterns from OLDFIELD by cross-validation with the reference databases<br />
MSDsite [GDO+05] and MDB [CHR+02].<br />
Table 4.2 lists the number <strong>of</strong> determined metal binding <strong>sites</strong> in the input dataset and<br />
the validated 3D patterns. The analysis shows that the determined coverage for both<br />
references is quite similar, which provides some confidence in the determined values. While<br />
the mined result covers only a small fraction of the available metal binding sites, the<br />
performance is comparable with SIDEMINE.<br />
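As a quick arithmetic check, the coverage values in Table 4.2 follow from equation 4.1 as validated/determined. The sketch below only recomputes the published figures; the label "mined" for the rows that carry no SIDEMINE label is an assumption made for this example.<br />

```python
# (reference database, method, determined sites, validated sites) from Table 4.2
rows = [
    ("MSDsite", "mined", 567, 85),
    ("MSDsite", "SIDEMINE", 567, 60),
    ("MDB", "mined", 302, 36),
    ("MDB", "SIDEMINE", 302, 18),
]

# Coverage = validated / determined, rounded to two decimals as in the table.
coverages = {
    (reference, method): round(validated / determined, 2)
    for reference, method, determined, validated in rows
}

for key, value in coverages.items():
    print(key, value)
```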
A manual analysis shows that some of the annotated metal binding sites can be partially<br />
recovered by merging two 3-bodies into a single 4-body. For example, MSDsite<br />
lists the iron binding site, Asp-3His, for the PDB entry 1ar5 with the residues ASP161,<br />
HIS27, HIS75, and HIS165. The mined result from OLDFIELD contains the patterns<br />
2His-Trp and Asp-His-Trp, with the residues HIS27, HIS75, TRP126, and ASP161, HIS75,<br />
TRP126, respectively. Both triplets can be merged into the 4-body Asp-2His-Trp.<br />
A systematic analysis of false negatives is beyond the scope of this work. However,<br />
preliminary studies indicate that the choice of interaction distance plays an important<br />
role in discovering 3D patterns. For example, by setting the interaction distance d to 8Å,<br />
various triplet configurations can be extracted that contain the missing histidine, HIS165,<br />
from the example above.<br />
The validity of a mined 3D pattern as a metal binding site is demonstrated by manual<br />
analysis of several example structures. The examples show that the residues of a metal<br />
binding site have a strong conservation of the side chain groups, indicating a high energy<br />
bond in the formation of a coordinative tetrahedral site. Figure 4.1 illustrates an example<br />
configuration with three cysteines from six structure examples. The listed proteins are<br />
heterogeneous in nature but share the 3Cys-mediated ion binding site. Except<br />
for one entry, all structures coordinate a zinc ion in a tetrahedral configuration.<br />
Another metal binding site with the configuration Cys-2His is shown in figure 4.2.<br />
The cluster lists 11 proteins with the majority being electron transfer proteins.<br />
In conclusion, the mined 3D pattern data contain validated metal coordinating residue<br />
configurations. The result indicates that the presented data mining system is able to<br />
identify homologous structural features that are recurrent in the dataset.<br />
4.2.2 Validation <strong>of</strong> convergent metal binding <strong>sites</strong><br />
Proteins with different folds can share a common structural feature. For example, various<br />
metal binding <strong>sites</strong> share a common residue arrangement, while the global fold <strong>of</strong> the<br />
metal binding proteins is quite different. In this case, the common pattern represents<br />
a convergent structural feature. The objective in this section is to test whether the<br />
developed data mining algorithm is able to find patterns <strong>of</strong> convergent structural features.<br />
For this analysis, the data mining was performed on SCOP40.<br />
PDBID Description Bound metal<br />
1h2r periplasmic hydrogenase nickel-iron<br />
1lat glucocorticoid receptor zinc<br />
2nll retinoic acid receptor zinc<br />
1ptq protein kinase c zinc<br />
2ohx alcohol dehydrogenase zinc<br />
4mt2 metallothionein isoform II zinc<br />
Figure 4.1: A metal binding site with the 3Cys pattern. Cross-validation <strong>of</strong> metal binding <strong>sites</strong> with 3D<br />
pattern from OLDFIELD identified the 3Cys configuration (top panel). List <strong>of</strong> protein structures with<br />
the common 3Cys residue configuration (bottom panel).<br />
PDBID Description Bound metal<br />
1kdi plastocyanin cu<br />
1aoz ascorbate oxidase cu<br />
6paz pseudoazurin cu<br />
1jer stellacyanin cu<br />
2azu azurin cu<br />
1bqk pseudoazurin cu<br />
1aac amicyanin cu<br />
1byo plastocyanin cu<br />
1as7 nitrite reductase cu<br />
1nic nitrite reductase cu<br />
1rcy rusticyanin cu<br />
Figure 4.2: A metal binding site with the Cys-2His pattern. Cross-validation <strong>of</strong> metal binding <strong>sites</strong> with<br />
3D pattern from OLDFIELD identified the Cys-2His configuration (top panel). List <strong>of</strong> protein structures<br />
with the common Cys-2His residue configuration (bottom panel).<br />
PDBID Description Bound metal<br />
1iml metal-binding protein zn<br />
1zin phosphotransferase zn<br />
1kk1 translation zn<br />
1ibi metal-binding protein zn<br />
1dgs ligase zn<br />
1hc7 aminoacyl-trna synthetase zn<br />
1gax ligase/rna zn<br />
1dsv virus/virus protein zn<br />
1i50 transcription zn<br />
1ptq phosphotransferase zn<br />
1zbd complex (gtp-binding/effector) zn<br />
1kb4 transcription/dna zn<br />
1dcq metal binding protein zn<br />
1jj2 ribosome cd<br />
1vfy transport protein zn<br />
1ffy ligase/rna zn<br />
1dcq metal binding protein zn<br />
1dsz transcription/dna zn<br />
1d66 transcription regulation cd<br />
2alc dna binding protein zn<br />
1tfi transcription regulation zn<br />
4mt2 metallothionein zn<br />
1jr3 transferase zn<br />
1a5t zinc finger zn<br />
1jjd metal binding protein zn<br />
1bor transcription regulation zn<br />
1zbd complex (gtp-binding/effector) zn<br />
1g25 metal binding protein zn<br />
1pyi complex (dna-binding protein/dna) zn<br />
1hwt complex (activator/dna) zn<br />
1het oxidoreductase zn<br />
Figure 4.3: A metal binding site with the 3Cys pattern. Cross-validation <strong>of</strong> metal binding <strong>sites</strong> with<br />
3D pattern from SCOP40 identified the 3Cys configuration (top panel). List <strong>of</strong> protein structures with<br />
the common 3Cys residue configuration (bottom panel).<br />
PDBID Description Bound metal<br />
1ncs transcription regulation zn<br />
1rmd dna-binding protein zn<br />
2drp complex (transcription regulation/dna) zn<br />
1yuj complex (dna-binding protein/dna) zn<br />
1a1i complex (zinc finger/dna) zn<br />
1ubd complex (transcription regulation/dna) zn<br />
5znf zinc finger dna binding domain zn<br />
2gli complex (dna-binding protein/dna) co<br />
1tf3 complex (transcription regulation/dna) zn<br />
1bhi dna-binding regulatory protein n/a<br />
1e53 transcription zn<br />
1g2a hydrolase ni<br />
1jym hydrolase co<br />
Figure 4.4: A metal binding site with the Cys-2His pattern. Cross-validation <strong>of</strong> metal binding <strong>sites</strong> with<br />
3D pattern from SCOP40 identified the Cys-2His configuration (top panel). List <strong>of</strong> protein structures<br />
with the common Cys-2His residue configuration (bottom panel).<br />
3Cys<br />
SCOP classification<br />
SCOP domain identifiers<br />
a.4.11.1 1i50j<br />
a.27.1.1 1ffya1<br />
a.60.2.2 1dgsa1<br />
b.35.1.2 1heta1<br />
c.26.1.1 1gaxa3<br />
c.37.1.8 1kk1a3<br />
c.37.1.13 1jr3a2, 1a5t 2<br />
g.38.1.1 1d66a1, 2alca , 1pyia1, 1hwtc1<br />
g.39.1.2 1kb4b , 1dsza<br />
g.39.1.3 1iml 2, 1ibia1, 1ibia2<br />
g.39.1.6 1jj2t<br />
g.40.1.1 1dsva<br />
g.41.2.1 1zin 2<br />
g.41.3.1 1tfi<br />
g.44.1.1 1bor , 1g25a<br />
g.45.1.1 1dcqa2<br />
g.46.1.1 4mt2 , 1jjda<br />
g.49.1.1 1ptq<br />
g.50.1.1 1vfya , 1zbdb<br />
g.56.1.1 1hc7a3<br />
Cys2His<br />
SCOP classification<br />
SCOP domain identifiers<br />
g.37.1.1 d1ncs , d1rmd 1, d2drpa1, d2drpa2, d1yuja , d1a1ia1, d1ubdc1, d5znf ,<br />
d1ubdc2, d2glia4, d2glia2, d2glia3, d1tf3a1, d1bhi<br />
g.49.1.2 d1e53a<br />
d.167.1.1 d1g2aa , d1jyma<br />
Table 4.3: Convergent metal binding <strong>sites</strong> identified in SCOP40. The determined metal binding <strong>sites</strong><br />
from the 3D patterns in SCOP40 belong to different fold classes <strong>of</strong> unrelated proteins (convergent structural<br />
feature).<br />
Two patterns were identified in this study that represent metal binding sites. The<br />
first is the 3Cys configuration with 31 structure examples (cf. figure 4.3). The<br />
second is the Cys-2His pattern with 17 structure examples<br />
(cf. figure 4.4). A visual analysis determined that the identified metal binding sites from<br />
SCOP40 are similar to the mined result from OLDFIELD (cf. previous section). According<br />
to the SCOP classification scheme, groups of protein structures can be determined<br />
that have different domain structures but share the same metal binding site (cf. table 4.3).<br />
This indicates that the pattern was found as a recurrent structural feature in<br />
evolutionarily distant proteins.<br />
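The grouping by SCOP classification can be sketched as follows. This is an illustrative assumption about how such a check could be scripted, not the manual procedure used in this study. It relies on the fact that the first two fields of a SCOP classification identifier (class and fold) denote the fold; the function name `folds_by_pattern` is hypothetical.<br />

```python
from collections import defaultdict


def folds_by_pattern(assignments):
    """Collect the distinct SCOP folds (class.fold) observed for each pattern.

    assignments -- iterable of (pattern name, SCOP classification id) pairs
    """
    folds = defaultdict(set)
    for pattern, sccs in assignments:
        scop_class, fold = sccs.split(".")[:2]
        folds[pattern].add(f"{scop_class}.{fold}")
    return dict(folds)


# SCOP classifications listed for the Cys-2His pattern in Table 4.3.
cys2his = [
    ("Cys-2His", "g.37.1.1"),
    ("Cys-2His", "g.49.1.2"),
    ("Cys-2His", "d.167.1.1"),
]
folds = folds_by_pattern(cys2his)
print(sorted(folds["Cys-2His"]))
```

A pattern that occurs in more than one fold is a candidate convergent feature and would still need the manual inspection described above.<br />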
The result <strong>of</strong> this analysis suggests that the developed data mining algorithm is able<br />
to find recurrent and convergent structural features in a non-redundant structure set.<br />
4.2.3 Recovering <strong>active</strong> <strong>sites</strong> and catalytic triads from the dataset<br />
The catalytic triad is one of the best characterised non-metal active sites of serine proteases.<br />
The enzymatic reaction is based on the conserved residues serine, aspartate, and<br />
histidine that work together in a specific spatial arrangement. Previously, the identification<br />
<strong>of</strong> the catalytic triad has been described as the key evaluation analysis in protein<br />
structure data mining, because the occurrence <strong>of</strong> this pattern is just above the noise level<br />
in a dataset <strong>of</strong> analogous proteins [Old02]. The objective in this section is the search<br />
for <strong>active</strong> <strong>sites</strong>, and the catalytic triad in particular, by cross-validation with CSA. The<br />
mined result from OLDFIELD was analysed in this study.<br />
Within OLDFIELD, 235 active sites were determined, while the number of cross-validated<br />
active sites from the mined output was 27. Table 4.4 lists the validated protein<br />
residues. The majority of these residues are found in the Asp-His-Ser pattern, which<br />
was validated as the catalytic triad by manual analysis. The identified catalytic triad<br />
configuration lists 22 structure examples, with the majority belonging to the enzyme class<br />
hydrolase, and only a few belonging to the class oxidoreductase. In comparison, [Old02]<br />
identified 9 proteins, 7 of which were rediscovered in this analysis. The remaining<br />
15 out of 22 are additional, validated solutions. Figure 4.5 shows the superimposed<br />
structures for the Asp-His-Ser configuration.<br />
This study shows that the presented data mining system is able to find the catalytic<br />
triad in OLDFIELD. The mined result contains 15 additional valid solutions that were<br />
not discovered in [Old02].<br />
3D pattern (k=2)<br />
Cross-validated<br />
Pattern PDBID RID CSA SIDEMINE EC UID<br />
Ala-Arg-Asn 1qgj A ALA 71, A ARG 38, A ASN 67 + 1.11.1.7 PER59 ARATH<br />
7atj A ALA 74, A ARG 38, A ASN 70 1.11.1.7 PER1A ARMRU<br />
His-2Ser 1elt A HIS 57, A SER 195, A SER 214 + 3.4.21.36 ELA1 SALSA<br />
1ppf E HIS 57, E SER 195, E SER 214 3.4.21.37 ELNE HUMAN<br />
1bma A HIS 60, A SER 203, A SER 222 + 3.4.21.36 ELA1 PIG<br />
1avw A HIS 57, A SER 195, A SER 214 + 3.4.21.4 N/A<br />
1hyl A HIS 57, A SER 195, A SER 214 + 3.4.21.- COGS HYPLI<br />
1bit A HIS 57, A SER 195, A SER 214 3.4.21.4 TRY1 SALSA<br />
1jrt A HIS 57, A SER 195, A SER 214 + 3.4.21.4 TRY1 BOVIN<br />
1try A HIS 57, A SER 195, A SER 214 + 3.4.21.4 TRYP FUSOX<br />
1au8 A HIS 57, A SER 195, A SER 214 3.4.21.20 CATG HUMAN<br />
1ct0 E HIS 57, E SER 195, E SER 214 + N/A N/A<br />
Asp-His-Ser 1a8q A ASP 223, A HIS 252, A SER 94 + 1.11.1.10 BPA1 STRAU<br />
1a7u A ASP 228, A HIS 257, A SER 98 + 1.11.1.10 PRXC STRAU<br />
1a88 A ASP 226, A HIS 255, A SER 96 + 1.11.1.10 PRXC STRLI<br />
1a8s A ASP 224, A HIS 253, A SER 94 + 1.11.1.10 PRXC PSEFL<br />
1tib A ASP 201, A HIS 258, A SER 146 3.1.1.3 LIP THELA<br />
3tgl A ASP 203, A HIS 257, A SER 144 3.1.1.3 LIP RHIMI<br />
1bs9 A ASP 175, A HIS 187, A SER 90 + 3.1.1.6 AXE2 PENPU<br />
1avw A ASP 102, A HIS 57, A SER 195 + + 3.4.21.4 N/A<br />
1acb E ASP 102, E HIS 57, E SER 195 + + 3.4.21.4 CTRA BOVIN<br />
1taw A ASP 102, A HIS 57, A SER 195 + 3.4.21.4 N/A<br />
1au8 A ASP 102, A HIS 57, A SER 195 + + 3.4.21.20 CATG HUMAN<br />
1elt A ASP 102, A HIS 57, A SER 195 + 3.4.21.36 ELA1 SALSA<br />
3tgi E ASP 102, E HIS 57, E SER 195 + 3.4.21.4 TRY2 RAT<br />
1agj A ASP 120, A HIS 72, A SER 195 + 3.4.21.- ETA STAAU<br />
1auo A ASP 168, A HIS 199, A SER 114 + 3.4.22.38 CATK HUMAN<br />
1arb A ASP 113, A HIS 57, A SER 194 3.4.21.50 API ACHLY<br />
1jrt A ASP 102, A HIS 57, A SER 195 + 3.4.21.4 TRY1 BOVIN<br />
1try A ASP 102, A HIS 57, A SER 195 3.4.21.4 TRYP FUSOX<br />
2tec E ASP 38, E HIS 71, E SER 225 + 3.4.21.66 THET THEVU<br />
1ppf E ASP 102, E HIS 57, E SER 195 + + 3.4.21.37 ELNE HUMAN<br />
1jfr A ASP 177, A HIS 209, A SER 131 N/A P83850 STREX<br />
1ct0 E ASP 102, E HIS 57, E SER 195 + + N/A N/A<br />
3D pattern (k=3)<br />
Cross-validated<br />
Pattern PDBID RID CSA SIDEMINE EC UID<br />
Ala-Asp-Ser 1brt A ALA 123, A ASP 228, A SER 98 + 1.11.1.10 BPOA2 STRAU<br />
1onr A ALA 225, A ASP 17, A SER 176 2.2.1.2 TALB ECOLI<br />
Asp-Cys-Lys 1nba A ASP 51, A CYS 177, A LYS 144 + 3.5.1.59 CSH ARTSP<br />
Table 4.4: List <strong>of</strong> cross-validated <strong>active</strong> site residues. The catalytic residues in the mined k=2 or k=3<br />
residue triplets were compared against <strong>active</strong> site templates in CSA. RID = a Residue identifier consisting<br />
<strong>of</strong> a chain identifier + a residue name + a residue sequence position.<br />
Figure 4.5: Re-discovery <strong>of</strong> the catalytic triad in OLDFIELD. Examples <strong>of</strong> protein structures with the<br />
Asp-His-Ser pattern were cross-validated by CSA.<br />
4.2.4 Discovering the conserved serine residue in the catalytic<br />
triad (quartet)<br />
The catalytic triad template (Asp-His-Ser) has been reported as a four residue configuration<br />
(Asp-His-Ser/Ser) [WBT97] [BFW+94]. Based on the identified catalytic triad<br />
pattern in OLDFIELD (cf. previous section), the objective in this section is to test<br />
whether 4-bodies or even larger residue configurations can be generated from the<br />
mined 3D patterns. In addition, the analysis searches for the conserved serine residue in these<br />
extended configurations.<br />
The result of extending the catalytic triad is summarised in table 4.5. Of the<br />
22 structure examples, 10 have a single residue extension, and 7 of these 10<br />
4-bodies contain the conserved serine residue (Asp-His-2Ser).<br />
Other 4-bodies were also found with an additional alanine or cysteine residue. Preliminary<br />
studies indicate that even larger configurations can be obtained, by combining the<br />
determined 4-bodies into a 5- or 6-body. However, the biological validity <strong>of</strong> the additional<br />
PDBID Asp-His-Ser His-2Ser Ala-His-Ser Cys-His-Ser Ala-Asp-His<br />
1jrt + + +<br />
1au8 + + +<br />
1ppf + + +<br />
1avw + + + +<br />
1ct0 + + + +<br />
1elt + + + +<br />
1try + + + +<br />
3tgi + + + +<br />
1acb + + + +<br />
1arb + + +<br />
2tec +<br />
1agj +<br />
1taw +<br />
1a8s +<br />
1jfr +<br />
1a7u +<br />
1auo +<br />
1a88 +<br />
1a8q +<br />
1tib +<br />
3tgl +<br />
Table 4.5: Extending the catalytic triad into 4-bodies. Two residue triplets from the same<br />
protein structure are merged if two of their residues are identical. The first column indicates the<br />
catalytic triad configuration, while the second column represents an extension with a previously reported<br />
conserved serine residue. The remaining columns show other 3-bodies that extend the<br />
catalytic triad.<br />
alanine or cysteine in a 4-body, or even other amino acids in larger residue configurations,<br />
needs to be determined.<br />
In conclusion, the presented algorithm is able to find the catalytic triad (quartet),<br />
i.e. the second conserved serine residue was rediscovered from data mining. While other<br />
residue configurations <strong>of</strong> 4-bodies were also found, the biological role <strong>of</strong> these residues is<br />
being investigated further.<br />
4.3 Discussion<br />
The biological cross-validation <strong>of</strong> the mined 3D patterns requires an adequate knowledge<br />
base as reference. A precision score cannot be estimated from cross-validation studies,<br />
because the result is the solution <strong>of</strong> discovery-driven data mining, and current knowledge<br />
bases have an incomplete coverage <strong>of</strong> <strong>functional</strong> <strong>sites</strong>.<br />
In this respect, the mined 3D<br />
patterns may contain known biological motifs, which are the detectable true positives,<br />
or unknown <strong>functional</strong> <strong>sites</strong>, which cannot be confirmed yet. In addition, the result may<br />
contain noise, which is impossible to detect as false positives. The biological significance<br />
<strong>of</strong> the presented data mining was evaluated by examples <strong>of</strong> known biological <strong>functional</strong><br />
<strong>sites</strong>: the metal binding site, and the catalytic triad. In particular, only known <strong>functional</strong><br />
<strong>sites</strong> for proteins in the input structure set were used as benchmark. An alternative to this<br />
stringent evaluation is to transfer functional sites from homologous proteins, e.g. based<br />
on the Homology-derived Secondary Structure of Proteins (HSSP) database [SS96], and<br />
to consider this information as the true positive reference.<br />
About one third <strong>of</strong> the data in the PDB are protein structures co-crystallised with<br />
metal ions, which allows the study <strong>of</strong> metal binding <strong>sites</strong> [BW03]. Within the analysis,<br />
only a small fraction of proteins with metal binding sites was rediscovered. A systematic<br />
optimisation of the developed data mining algorithm, e.g. by modification<br />
of feature selection criteria, was not pursued because this would have exceeded the scope of this thesis.<br />
Preliminary studies on the source of false negatives indicate that the interaction<br />
distance threshold is the first parameter to be optimised. However, changing this<br />
parameter also modifies the probability distribution of triangulated structural features,<br />
and the effect cannot be estimated easily.<br />
The datasets OLDFIELD and SCOP40 are quite different (cf. section 3.2). OLDFIELD<br />
consists of sequentially dissimilar protein structures, while the proteins may still<br />
share structure similarity. This property allows the mining of homologous structural<br />
features of divergent proteins, such as metal binding sites or the catalytic triad. The developed<br />
data mining method was also tested for its ability to extract convergent structural<br />
features by analysing SCOP40. This dataset consists only of divergent proteins with no<br />
global structural similarities. As a consequence, structural components are mainly represented<br />
by convergent features, and the occurrence of these residue configurations might be<br />
below detection level. That is, the occurrences of convergent structural features are similar<br />
to background level. However, metal binding sites are examples of convergent patterns<br />
that were found in this study. The coordination of metal ions is greatly dependent on the<br />
distances and orientations of the coordinating residues. For that reason, data mining can<br />
detect these convergent structural features in structurally unrelated proteins.<br />
The presented data mining system identifies local three residue interactions with respect<br />
to their spatial and chemical configuration. In addition, examples of 4- and 5-body<br />
interactions were shown as a way of extending the catalytic triad pattern. The analysis<br />
shows that larger residue configurations can be found with the presented combinatorial<br />
approach. However, the search for larger structural patterns might deliver only protein<br />
stabilising features or other biological units in protein structures that are difficult to<br />
interpret.<br />
4.4 Conclusion<br />
The output of the developed data mining algorithm is supported by the cross-validation<br />
of biologically relevant structure motifs provided in this study. The mining system is<br />
able to detect recurrent homologous or convergent structural features in the dataset.<br />
More importantly, two biological motifs, the metal binding site and the catalytic triad,<br />
were rediscovered, indicating that the mined output contains biologically valid solutions.<br />
While the prediction of functional sites is an important task in structural biology, the<br />
biological interpretation of a 3D pattern requires evidence of biological significance. The<br />
combination with published biochemical and experimental data can provide evidence and<br />
a biological context for data interpretation. In the next chapter, I will present a biomedical<br />
literature mining system for the extraction of functional annotation of protein residues.<br />
Chapter 5<br />
Identification <strong>of</strong> protein residues in<br />
MEDLINE<br />
In this chapter, I present a text mining method to identify protein residues in biomedical<br />
texts. In the first step, the algorithm identifies the biological entities <strong>of</strong> residue, protein,<br />
and organism, and then determines the association <strong>of</strong> entity triplets. As a result a residue<br />
is linked to its source protein, and the protein is mapped to its hosting organism. Because<br />
the developed text mining solution relies on information from UniProtKB, an identified<br />
protein residue is directly linked to a unique UniProt entry. One application of this<br />
method is the search for abstract texts in MEDLINE with protein residues, and the use of<br />
the result for the update of citations in UniProtKB. The identification of protein residues<br />
in biomedical texts is a prerequisite for the extraction of functional annotation of residues.<br />
5.1 Algorithms<br />
The developed protein residue identification system is based on the algorithm <strong>of</strong> [HLC04].<br />
Basically, the developed method is a four step procedure: biological entity recognition<br />
<strong>of</strong> organism, protein, and residue, and the association <strong>of</strong> the entity triplet. Figure 5.1<br />
illustrates the procedures <strong>of</strong> this text mining system.<br />
Figure 5.1: Overview <strong>of</strong> processes and evaluation methods for the developed protein residue identification<br />
system.<br />
5.1.1 Protein and organism entity recognition<br />
Theory<br />
The recognition <strong>of</strong> protein and organism entities in text is based on a dictionary lookup<br />
approach. Basically, names <strong>of</strong> proteins, their synonyms, and their gene names are collected<br />
from UniProtKB to populate a protein terminology dictionary. The lookup <strong>of</strong> the<br />
protein dictionary considers the matching of morphological variants. The dictionary is<br />
not expanded by syntactical variants of terminological entries, such as structural or formal<br />
variants or the addition of a modifier or head word, because a lookup over the<br />
vast number of permutations would require considerably more memory. The<br />
alternative is to use a probabilistic approach.<br />
A similar method is also used to populate the organism terminology dictionary with<br />
names and synonyms from the NCBI Taxonomy database [WBB+06]. The lookup of<br />
terminologies also considers the matching of morphological variants.<br />
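A dictionary lookup that tolerates simple morphological variation can be sketched as follows. The normalisation rules (case folding and removal of spaces, hyphens, underscores and slashes), the function names and the identifier `ID0001` are assumptions made for illustration; they are not the rules of the actual system.<br />

```python
import re


def normalise(term):
    """Collapse simple morphological variation (assumed rules, for illustration)."""
    return re.sub(r"[\s\-_/]+", "", term.lower())


def build_dictionary(entries):
    """Index every name and synonym under its normalised form.

    entries -- iterable of (identifier, list of names and synonyms)
    """
    index = {}
    for identifier, names in entries:
        for name in names:
            index.setdefault(normalise(name), set()).add(identifier)
    return index


def lookup(index, mention):
    """Return the identifiers whose names match the mention after normalisation."""
    return index.get(normalise(mention), set())


# Hypothetical dictionary entry; the identifier is a placeholder.
index = build_dictionary([("ID0001", ["Interleukin-2", "IL-2"])])
print(lookup(index, "interleukin 2"))
print(lookup(index, "IL2"))
```

Storing a set of identifiers per normalised form keeps ambiguous names visible, so that a later disambiguation step can choose among them.<br />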
Implementation<br />
The recognition of protein entities was based on an approach that combined dictionary<br />
lookup with basic disambiguation [RSKA+07]. All protein names and synonyms were<br />
collected from UniProtKB.<br />
Names of species were extracted from the NCBI Taxonomy references in UniProtKB,<br />
and their scientific and common names collected. The dictionary was complemented with<br />
terminologies describing only the referenced genus. Full organism names were augmented<br />
with abbreviated genus forms, i.e. first letter abbreviation of genus + species.<br />
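The abbreviated genus form can be generated with a few lines of code. The sketch below assumes binomial scientific names separated by whitespace; the function name `abbreviate_genus` is hypothetical.<br />

```python
def abbreviate_genus(scientific_name):
    """Return 'G. species' for a binomial name, or None for genus-only names."""
    parts = scientific_name.split()
    if len(parts) < 2:
        return None
    genus, species = parts[0], parts[1]
    return f"{genus[0].upper()}. {species}"


for name in ("Escherichia coli", "Homo sapiens", "Escherichia"):
    print(name, "->", abbreviate_genus(name))
```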
The fast and efficient method for annotating texts with protein and organism names<br />
was based on the publicly available web service called Whatizit [RSAG+08]. The result is<br />
an <strong>annotation</strong> <strong>of</strong> protein and organism names in text with references to UniProtKB and<br />
NCBI Taxonomy.<br />
5.1.2 Entity recognition <strong>of</strong> protein residue<br />
Theory<br />
The identification of residue entities is based on the re-implementation of previously published<br />
regular expression patterns for point mutations [HLC04] [RSMA+04]. Here, the<br />
patterns are extended to capture in total three types of residues: wild-type, point mutation,<br />
and range of residues or pair of residues. Although amino acid sequences can<br />
be considered in the residue entity identification, the lack of information about sequence<br />
position prevents a precise association with proteins.<br />
The first basic type of residue mention is the single protein residue sequence reference,<br />
which consists of the name of an amino acid, followed by the sequence position number,<br />
e.g. "Gly-12", "arginine 4", "Tyr74", "Arg(53)". A point mutation is the second type of<br />
residue mention, where the description details the exchange of an amino acid at a given<br />
position. The common notation is the name of the amino acid, its sequence position<br />
number, followed by the exchange. The following are examples of point mutations found<br />
in text: "W77R", "Cys560Arg", "ser-52->ala", "ala2-methionine". Finally, the third<br />
type of residue mention describes either a range of residues or an interaction pair, e.g.<br />
"Tyr 85 to Ser 85", "Trp27–Cys29". The common notation is the string sequence: amino acid name, sequence<br />
position, a connection symbol or connection word, amino acid name, and then sequence<br />
position. The correct identification of this type of residue<br />
mention requires the consideration of contextual information, which is not handled in<br />
this version.<br />
In addition to the abbreviated notation, protein residues can be expressed in syntactical<br />
form, e.g. ”isoleucine at position 3”, ”substitution <strong>of</strong> Ala at position 4 to Gly”,<br />
”Ser472 to glutamic acid”. Additional patterns were developed to accommodate these<br />
and other less precisely defined residue mentions in syntactical form, e.g. ”residue at position<br />
22, 34, and 40”. Although the entity triplet association algorithm does not utilise<br />
the latter identified residue mentions, <strong>annotation</strong> can generally be extracted for these<br />
underspecified residues to increase the recall in information extraction.<br />
Implementation<br />
The extraction <strong>of</strong> residue mentions reuses the idea <strong>of</strong> designing regular expressions to find<br />
residue entities in text [RSMA+04] [HLC04]. Some <strong>of</strong> the previously published regular<br />
expression patterns were adopted, while new patterns were created to cover other types<br />
<strong>of</strong> residue mentions, such as basic abbreviated point mutation notations. In this thesis,<br />
sets <strong>of</strong> regular expressions were developed and implemented as finite state transducers to<br />
identify three types <strong>of</strong> residue entities (cf. table 5.1): wild-type, point mutation, and<br />
range or pair <strong>of</strong> residues. The result is an <strong>annotation</strong> <strong>of</strong> residue mentions in text with<br />
normalised expressions.<br />
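The following is a minimal Python sketch of the regular-expression idea described above, not the finite state transducer used in the thesis. It covers only a simplified subset of the SITE and MUTATION notations from table 5.1 (the 20 standard amino acids, no syntactical forms, no ranges). The names `ONE`, `SITE`, `MUT_SHORT`, `MUT_LONG` and `find_residues`, and the normalised output form such as `G12` or `C560R`, are assumptions made for this illustration.

```python
import re

# One-letter code, three-letter code and full name of the 20 standard amino acids.
_TABLE = (
    "A Ala alanine", "R Arg arginine", "N Asn asparagine", "D Asp aspartate",
    "C Cys cysteine", "Q Gln glutamine", "E Glu glutamate", "G Gly glycine",
    "H His histidine", "I Ile isoleucine", "L Leu leucine", "K Lys lysine",
    "M Met methionine", "F Phe phenylalanine", "P Pro proline", "S Ser serine",
    "T Thr threonine", "W Trp tryptophan", "Y Tyr tyrosine", "V Val valine",
)
ONE = {}  # lower-cased name -> one-letter code
for row in _TABLE:
    one, three, full = row.split()
    ONE[three.lower()] = one
    ONE[full] = one

# Alternation of all names, longest first so that full names win over abbreviations.
NAME = "|".join(sorted(ONE, key=len, reverse=True))
POS = r"[1-9]\d*"

# Wild-type site: "Gly-12", "arginine 4", "Tyr74", "Arg(53)".
SITE = re.compile(rf"\b({NAME})\s?-?\s?\(?({POS})\)?", re.IGNORECASE)
# Point mutation in one-letter notation: "W77R".
MUT_SHORT = re.compile(rf"\b([ACDEFGHIKLMNPQRSTVWY])({POS})([ACDEFGHIKLMNPQRSTVWY])\b")
# Point mutation with names: "Cys560Arg", "ser-52->ala", "ala2-methionine".
MUT_LONG = re.compile(
    rf"\b({NAME})\s?-?\s?\(?({POS})\)?\s?(?:->|-+>?|to)?\s?({NAME})\b", re.IGNORECASE
)

def find_residues(text):
    """Return (type, normalised mention) pairs in order of appearance."""
    found, taken = [], []
    for m in MUT_LONG.finditer(text):
        found.append((m.start(), "mutation", ONE[m[1].lower()] + m[2] + ONE[m[3].lower()]))
        taken.append(m.span())
    for m in MUT_SHORT.finditer(text):
        found.append((m.start(), "mutation", m[1] + m[2] + m[3]))
        taken.append(m.span())
    for m in SITE.finditer(text):
        # Skip sites that are merely the left half of an already found mutation.
        if not any(start <= m.start() < end for start, end in taken):
            found.append((m.start(), "site", ONE[m[1].lower()] + m[2]))
    return [(kind, norm) for _, kind, norm in sorted(found)]

example = "Gly-12 and Arg(53) were conserved, whereas W77R and Cys560Arg abolished binding."
mentions = find_residues(example)
```

Mutations are scanned before wild-type sites so that "Cys560Arg" is not additionally reported as the site "Cys560"; this ordering is a design choice of the sketch, not a documented property of the original system.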
5.1.3 Association identification <strong>of</strong> the entity triplet organism,<br />
protein, and residue<br />
Theory<br />
The association <strong>of</strong> the entities organism, protein, and residue is a difficult text mining<br />
task. Unlike the association <strong>of</strong> two proteins, e.g. the physical interactions <strong>of</strong> two proteins<br />
(protein-protein interaction), the binary semantic relationships <strong>of</strong> organism-protein and<br />
protein-residue are not necessarily explicitly stated in biomedical texts. For example, a<br />
protein may be mentioned at the beginning <strong>of</strong> a paragraph, while a site-directed mutation<br />
on the same protein is described in later sections. This is one reason why approaches<br />
relying only on language patterns or word distance metrics are not sufficient to find<br />
protein-residue associations. The association task becomes more complex when multiple proteins<br />
are mentioned in the text. Usually a residue has a one-to-one relationship with a protein;<br />
however, two proteins can have the same residue at the same sequence position. While<br />
this ambiguity cannot be solved without deeper natural language processing techniques,<br />
the problem can be tackled with a knowledge-based approach.<br />
RANGE-TO = ("-"+ ("to" "-+")? | "to");<br />
CONVERT-TO = ("to" | "-"+ ">"?);<br />
XAA = ( "X" | "XAA" | "xaa" );<br />
POS = (1-9)(0-9)*;<br />
RESN1 = [ARNDCQEGHILKMFPSTWYVOUBZX];<br />
RESN3 = ( [aA]la|ALA | [aA]rg|ARG | [aA]sn|ASN | [aA]sp|ASP | [cC]ys|CYS<br />
| [gG]ln|GLN | [gG]lu|GLU | [gG]ly|GLY | [hH]is|HIS | [iI]le|ILE<br />
| [lL]eu|LEU | [lL]ys|LYS | [mM]et|MET | [pP]he|PHE | [pP]ro|PRO<br />
| [sS]er|SER | [tT]hr|THR | [tT]rp|TRP | [tT]yr|TYR | [vV]al|VAL<br />
| [pP]yl|PYL | [sS]ec|SEC | [aA]sx|ASX | [gG]lx|GLX | [xX]aa|XAA);<br />
RESNF = ( [aA]lanine | [aA]rginine | [aA]sparagine | [aA]spart(ate|ic acid) | [cC]ysteine<br />
| [gG]lutamine | [gG]lutam(ate|ic acid) | [gG]lycine | [hH]istidine | [iI]soleucine<br />
| [lL]eucine | [lL]ysine | [mM]ethionine | [pP]henylalanine | [pP]roline<br />
| [sS]erine | [tT]hreonine | [tT]ryptophan | [tT]yrosine | [vV]aline<br />
| [pP]yrrolysine | [sS]elenocysteine | [aA]spartic acid or [aA]sparagine<br />
| [gG]lutamic acid or [gG]lutamine);<br />
SITE = ( (RESN3 | RESNF) POS "residue"?<br />
| (RESN3 | RESNF) "-"+ POS "residue"?<br />
| (RESN3 | RESNF) "residue"? "at position"? POS "residue"?<br />
| (RESN3 | RESNF) "(" POS ")" "residue"?<br />
| "amino acid"? "residue" "at position"? POS<br />
| "amino acid" "residue"? "at position"? POS<br />
| RESNF "residue" POS);<br />
SITES = ( RESNF"s" (("," | "and" | "or") RESNF"s")*<br />
| RESNF"s"? ("at position""s"?)? ("," | "and" | "or") (("at position""s"?)? ("," | "and" | "or") POS)+<br />
| RESNF "residue""s"?<br />
| RESN3 "residue""s"? ("at position""s"?)? POS (("at position""s"?)? ("," | "and" | "or") POS)+<br />
| RESN3 "residue""s"?<br />
| "residue""s"? ("at position""s"?)? POS ("," | "and" | "or") POS)+<br />
| (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS ("," | "and" | "or") POS)+<br />
| RESNF ("," | "and" | "or") POS)* "residue""s"?);<br />
RANGE/PAIR = ( "residue""s"? ("," | "and" | "or") RANGE-TO POS)+<br />
| "amino acid" "residue"? "s"? ("," | "and" | "or") RANGE-TO POS)+<br />
| ("residue""s"?)? "at position""s"? ("," | "and" | "or") RANGE-TO POS)+<br />
| RESI RANGE-TO RESI);<br />
MUTATION = ( RESN1 POS RESN1<br />
| RESN1 "-" POS "-" RESN1<br />
| RESN1 "(" POS ")" RESN1<br />
| RESI CONVERT-TO (RESN3 | RESNF)<br />
| RESI RESN3<br />
| "from" (RESNF | RESN3) CONVERT-TO (RESNF | RESN3) "at position" POS<br />
| (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS<br />
| RESI ("-"+ | CONVERT-TO) RESI "substitution");<br />
Table 5.1: Regular expression patterns for the detection <strong>of</strong> residue mentions in text. The patterns<br />
recognise single (SITE) or multiple wild-type residue <strong>sites</strong> (SITES), a sequence range or residue pair<br />
(RANGE/PAIR), and point mutation (MUTATION). The set covers abbreviated notations <strong>of</strong> residues<br />
as well as grammatical expressions found in text.<br />
The developed method in this work is based on the algorithm <strong>of</strong> [HLC04]. Basically,<br />
the identification <strong>of</strong> a protein residue can only be validated, if it is part <strong>of</strong> the protein<br />
sequence, as it is denoted in a reference database, e.g. UniProtKB. This requires that the<br />
protein mentioned in the text is further supported by evidence for the organisms under<br />
scrutiny to select the appropriate protein sequence from the bioinformatics database; that<br />
excludes the risk <strong>of</strong> using orthologous protein sequences.<br />
Implementation<br />
In this study, the developed system to identify the entity triplet association <strong>of</strong> organism,<br />
protein, and residue, was based on the algorithm described by [HLC04] with some modifications.<br />
In the first step proteins were associated with their hosting organisms. Given a<br />
protein, all protein-organism (species) pairs were determined from text and ranked according<br />
to a word distance measure. The word distance between two entities was defined<br />
as the smallest number <strong>of</strong> words between them. The identification <strong>of</strong> protein-organism<br />
associations began with the pair with the smallest word distance. A valid association was<br />
found if the corresponding relation was specified in UniProtKB. If an association was validated,<br />
the search was terminated and the protein was annotated with the corresponding<br />
Uniprot identifier; otherwise the next entity pair from the list was tested. If no match<br />
between protein and organism (species) was found, the search was relaxed to genus<br />
matching. This relaxed matching is an extension <strong>of</strong> the [HLC04] algorithm. Because<br />
entries in UniProtKB are species-specific, a protein-organism (genus) association<br />
results in a list <strong>of</strong> Uniprot identifiers as the <strong>annotation</strong> <strong>of</strong> the protein.<br />
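The first step can be sketched in Python as follows. This is an illustrative reading of the procedure, not the thesis implementation: the knowledge base is reduced to a hypothetical dictionary `kb` that maps (protein name, species name) to a made-up identifier, entity positions are token indices, and the genus is assumed to be the first word of the species name.

```python
def word_distance(i, j):
    """Number of words strictly between two token positions."""
    return max(abs(i - j) - 1, 0)

def associate_protein(protein_pos, protein, organisms, kb):
    """Return the identifiers assigned to one protein mention.

    organisms: list of (token position, species name) found in the same text.
    kb: hypothetical stand-in for UniProtKB, {(protein, species): identifier}.
    """
    ranked = sorted(organisms, key=lambda o: word_distance(protein_pos, o[0]))
    # Pass 1: exact species match, nearest organism first; stop at the first validated pair.
    for _, species in ranked:
        uid = kb.get((protein, species))
        if uid is not None:
            return [uid]
    # Pass 2: relaxed genus match, which may yield several identifiers.
    for _, species in ranked:
        genus = species.split()[0]
        uids = sorted(u for (p, s), u in kb.items()
                      if p == protein and s.split()[0] == genus)
        if uids:
            return uids
    return []

# Made-up identifiers, for illustration only.
kb = {
    ("xylanase", "Bacillus subtilis"): "UID1",
    ("xylanase", "Bacillus circulans"): "UID2",
    ("lysozyme", "Gallus gallus"): "UID3",
}
species_hit = associate_protein(5, "xylanase", [(2, "Homo sapiens"), (20, "Bacillus subtilis")], kb)
genus_hit = associate_protein(5, "xylanase", [(8, "Bacillus cereus")], kb)
no_hit = associate_protein(5, "lysozyme", [(8, "Homo sapiens")], kb)
```

In the first call the nearest organism is rejected because the knowledge base does not support it, so the search moves on to the more distant one; in the second call only the genus matches, which returns a list of candidate identifiers.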
The second step <strong>of</strong> this algorithm was the association <strong>of</strong> residues with their source<br />
proteins. The procedure <strong>of</strong> selecting and ranking the residue-protein pairs was similar<br />
to the protein-organism association identification. For each pair that was to be tested<br />
the annotated Uniprot identifier <strong>of</strong> the protein was used to retrieve the protein sequence<br />
from the database. Three cases <strong>of</strong> results can be distinguished: (1) the residue correctly<br />
matches the protein sequence; (2) several alternative sequences are matching from a list<br />
<strong>of</strong> proteins; and (3) no match can be found for the residue with the available protein<br />
sequences. If a match was found, then the residue was annotated with references to the<br />
protein, otherwise the search continued with the next pair from the ranked list.<br />
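The sequence check in the second step can be illustrated with a short Python sketch. The function names, the toy sequences and the identifiers are assumptions for this example; the proteins are assumed to be already ranked by word distance and annotated with one or more identifiers from the first step.

```python
def residue_matches(sequence, residue, position):
    """True if the one-letter residue occurs at the 1-based position of the sequence."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == residue

def associate_residue(residue, position, ranked_proteins, sequences):
    """Return the identifiers whose sequence supports the residue mention.

    ranked_proteins: list of identifier lists, nearest protein mention first.
    sequences: hypothetical lookup {identifier: amino acid sequence}.
    """
    for uids in ranked_proteins:
        hits = [u for u in uids if residue_matches(sequences.get(u, ""), residue, position)]
        if hits:
            return hits  # one hit: case (1); several hits: case (2)
    return []  # case (3): no available sequence supports the residue

# Toy sequences with made-up identifiers, for illustration only.
sequences = {"UID1": "MKTAY", "UID2": "MKSAY", "UID3": "GGGGG"}
unique_match = associate_residue("T", 3, [["UID3"], ["UID1", "UID2"]], sequences)
ambiguous_match = associate_residue("A", 4, [["UID1", "UID2"]], sequences)
no_match = associate_residue("W", 2, [["UID1", "UID2"], ["UID3"]], sequences)
```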
5.2 The construction <strong>of</strong> evaluation test corpora<br />
UniProtKB is one <strong>of</strong> the most comprehensive protein knowledge bases (cf. section 2.1.2).<br />
It contains manually curated <strong>functional</strong> <strong>annotation</strong>s on three levels: protein, protein sequence,<br />
and protein residue. Information is derived from surveys <strong>of</strong> biomedical articles,<br />
and entries are annotated with citation references (PMIDs; PubMed identifiers). However,<br />
the precise association <strong>of</strong> a citation and a protein residue in context <strong>of</strong> <strong>functional</strong><br />
<strong>annotation</strong> is generally not available.<br />
The test dataset for the developed <strong>functional</strong> <strong>annotation</strong> extraction is based on the<br />
citation references from UniProtKB. A Uniprot corpus was generated by retrieving abstract<br />
texts from MEDLINE that are indexed by the knowledge base. From the 136,566<br />
citations listed in UniProtKB, a virtually complete set <strong>of</strong> 136,559 abstract texts was retrieved<br />
from MEDLINE. Although not all information presented in the UniProtKB are<br />
necessarily available in the Uniprot corpus, the Uniprot corpus is a starting point for the<br />
evaluation <strong>of</strong> the developed text mining modules. In particular three derived test corpora<br />
were generated from the Uniprot corpus: the gold standard corpus with manual <strong>annotation</strong><br />
(GC), and the two cross-validation corpora with annotated information derived from<br />
UniProtKB (XC1 and XC2). Figure 5.2 summarises the key features <strong>of</strong> these test corpora.<br />
For the automatic evaluation <strong>of</strong> extracted data, a cross-validation corpus (XC) was<br />
derived from the Uniprot corpus. This test set was used to analyse the performance <strong>of</strong><br />
protein-organism (XC1) and residue-protein (XC2) associations.<br />
The test set was annotated<br />
automatically, i.e. the biological entities were detected with the same ER systems. The<br />
documents in the Uniprot corpus were scanned for tri-occurrences <strong>of</strong> organism, protein,<br />
Dataset | Gold standard corpus (GC) | Cross-validation corpus (XC1) | Cross-validation corpus (XC2)<br />
Abstracts count | 100 | 55,998 | 4,503<br />
Method <strong>of</strong> <strong>annotation</strong> | manual | automatic | automatic<br />
total/unique residues | 362/262 (with 262/191 having residue name + residue sequence position) | N/A | N/A<br />
total/unique proteins | 990/511 | N/A | N/A<br />
total/unique organisms | 323/123 | N/A | N/A<br />
total/unique associations | 240/172 residue-protein-organism associations | NA/70,401 protein-organism as UTP | NA/10,152 protein-residue as URP<br />
Application | Test the type, amount and reliability <strong>of</strong> the extracted information (reproduction <strong>of</strong> manually annotated information). | Test set is assumed to contain the same type <strong>of</strong> information as GC, but certainty is not clear. Study the reproduction <strong>of</strong> information contained in the database. | Test set is assumed to contain the same type <strong>of</strong> information as GC, but certainty is not clear. Study the reproduction <strong>of</strong> information contained in the database.<br />
Figure 5.2: Test corpora for information extraction evaluation. Based on the citation references from<br />
UniProtKB a base corpus was generated by retrieving abstract texts from MEDLINE. Two kinds <strong>of</strong> test corpora<br />
were derived from this corpus: (1) the gold standard corpus (GC), which is a manually annotated<br />
test set; and (2) the cross-validation corpora (XC1, XC2), which contain automatically assigned<br />
<strong>annotation</strong>s based on information from UniProtKB.<br />
and residue in text and a subset was retained if the combinations <strong>of</strong> the identifier triplet<br />
(UID+TID+PMID) for each document can be found in the database. UID is the Uniprot<br />
ID, TID is the NCBI Taxonomy ID, and PMID is the PubMed identifier. If at least a single<br />
match was found, then a document was selected. For the non-matching combinations the<br />
corresponding <strong>annotation</strong>s were removed from text. This results in the test set XC1 with<br />
the associated set <strong>of</strong> the triple identifier combinations UTP = (UID+TID+PMID). XC2<br />
is a subselection from XC1 by filtering for documents where the identifier combination<br />
URP=(UID+RID+PMID) were validated by entries in UniProtKB. RID is a residue<br />
identifier which consists <strong>of</strong> a residue name + sequence position. 70,401 UTPs from 55,998<br />
abstract texts were determined for XC1, and correspondingly 10,152 URPs were derived<br />
from 4,503 MEDLINE articles in XC2.<br />
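The filtering that produces XC1 and XC2 amounts to a set membership test per document. A minimal Python sketch, assuming the detected identifier pairs are grouped by PMID and the database content is available as a set of identifier triples (both data structures and all identifiers are hypothetical):

```python
def build_cross_validation_corpus(candidates, database):
    """Keep documents with at least one identifier combination confirmed by the database.

    candidates: {pmid: set of (uid, other_id)} detected in each abstract, where
                other_id is a Taxonomy ID (for XC1) or a residue identifier (for XC2).
    database:   set of (uid, other_id, pmid) triples taken from the knowledge base.
    """
    corpus = {}
    for pmid, pairs in candidates.items():
        # Non-matching combinations are dropped; matching ones are retained.
        kept = {(uid, other) for uid, other in pairs if (uid, other, pmid) in database}
        if kept:  # a single match is enough to select the document
            corpus[pmid] = kept
    return corpus

# Made-up identifiers, for illustration only.
candidates = {
    "PMID1": {("UID1", "TID1"), ("UID2", "TID2")},
    "PMID2": {("UID3", "TID3")},
}
database = {("UID1", "TID1", "PMID1")}
xc1 = build_cross_validation_corpus(candidates, database)
```

Applying the same function to the documents of XC1 with residue identifiers in place of Taxonomy IDs would give the XC2 subset under these assumptions.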
The gold standard corpus (GC) was created through manual curation, since no suitable<br />
annotated corpora are available for this study.<br />
A random sample <strong>of</strong> 100 MEDLINE<br />
abstract texts was drawn from the Uniprot corpus, where every abstract text must contain<br />
the tri-occurrences <strong>of</strong> organism, protein and residue. Notice that the detection <strong>of</strong> the<br />
entities was based on the entity recognition (ER) systems described in the previous section.<br />
It is not expected that the ER systems are performing at top level, and therefore a certain<br />
proportion <strong>of</strong> the filtered abstract texts contains false positives <strong>of</strong> identified entities.<br />
From this set <strong>of</strong> 100 abstract texts, manual analysis provided four types <strong>of</strong> <strong>annotation</strong>s.<br />
The first type is the <strong>annotation</strong> <strong>of</strong> the biological entities <strong>of</strong> organism, protein, and residue,<br />
while the second is the <strong>annotation</strong> <strong>of</strong> entity triplet associations, i.e. organism-protein-residue.<br />
Notice that this process did not include the grounding <strong>of</strong> protein or organism<br />
entities to entries in the specialised databases, i.e. UniProtKB and NCBI Taxonomy. In<br />
addition, text segments <strong>of</strong> sentences with a residue entity were annotated, if they represent<br />
keywords for <strong>functional</strong> <strong>annotation</strong>. Finally, the association <strong>of</strong> a keyword and a residue<br />
was also annotated in GC.<br />
Note that the set <strong>of</strong> documents in GC is only partially contained in XC2; only 26 abstracts<br />
are shared between the two datasets. For these shared abstracts, manual <strong>annotation</strong> determined 38 entity triplet associations,<br />
while the corresponding number from XC2 was 58. The total number<br />
<strong>of</strong> unique manually annotated triplet associations in GC is 172 (cf. figure 5.2).<br />
The major difference between the two kinds <strong>of</strong> evaluation corpora is that GC contains manually<br />
confirmed biological entities and their associations. In contrast, the same <strong>annotation</strong>s<br />
in XC1 and XC2 were derived from UniProtKB, based on the assumption that the same<br />
database information is present in abstract texts.<br />
The interpretation <strong>of</strong> performance<br />
analysis has to consider the properties <strong>of</strong> these evaluation test corpora.<br />
5.3 Evaluation methods<br />
The performance <strong>of</strong> each process <strong>of</strong> the developed protein residue identification system<br />
was scored against a manually annotated gold standard corpus.<br />
Proteins, where the<br />
protein entity recognition system and manual curation assigned the same entity (full<br />
term matching) were considered as true positives (TP). The same rule also applied for<br />
counting TP for the detection <strong>of</strong> residue and organism entities.<br />
The evaluation <strong>of</strong> the entity triplet association detections considered only associations<br />
as TP, if both pair relations organism-protein and protein-residue were determined correctly.<br />
If one <strong>of</strong> the relations was incorrect, a found association was counted as false<br />
positive (FP).<br />
In contrast, the automatic evaluation <strong>of</strong> the entity recognition and entity association<br />
detection systems was performed on XC. A true positive <strong>of</strong> an annotated entity within<br />
an abstract text was identified, if UniProtKB lists the same entity in context <strong>of</strong> the<br />
given PMID. For example, if organism X in text Y is also indexed in UniProtKB as a<br />
combination <strong>of</strong> TID+PMID, then a TP was counted.<br />
A correct protein-organism association was detected, if the determined identifier combination<br />
UTP was found in XC. Similarly, a correct residue-protein association was found,<br />
if the derived identifier combination URP was found in the test corpus.<br />
The effectiveness <strong>of</strong> the ER and the association detection systems was measured in<br />
terms <strong>of</strong> precision, recall and the balanced F-measure (F1):<br />
\[ \text{precision} = \frac{\#\text{true positive}}{\#\text{true positive} + \#\text{false positive}}, \tag{5.1} \]<br />
\[ \text{recall} = \frac{\#\text{true positive}}{\#\text{true positive} + \#\text{false negative}}, \tag{5.2} \]<br />
\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{5.3} \]<br />
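As a worked example, the three measures can be computed from the counts reported in the result tables. The sketch assumes that "Available" is the number of gold annotations (true positives plus false negatives), "Extracted" is the number of system predictions (true positives plus false positives) and "Common" is the number of true positives; this reading of the column headers is an assumption, and the function name is hypothetical.

```python
def precision_recall_f1(available, extracted, common):
    """Compute precision, recall and balanced F-measure from table counts.

    available: gold annotations (TP + FN); extracted: predictions (TP + FP);
    common: annotations found in both sets (TP).
    """
    precision = common / extracted if extracted else 0.0
    recall = common / available if available else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts for unique residue entities on the gold standard corpus (table 5.2).
p, r, f1 = precision_recall_f1(available=191, extracted=203, common=187)
scores = (round(p, 2), round(r, 2), round(f1, 2))
```

With these counts the rounded scores agree with the gold standard row of table 5.2.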
5.4 Results<br />
The developed protein residue identification system in this study consists <strong>of</strong> four modules.<br />
The following sections assess first performances <strong>of</strong> biological entity recognition, and then<br />
Unique residue entities<br />
Reference Dataset Available Extracted Common Precision Recall F1<br />
Gold standard corpus 191 203 187 0.92 0.98 0.95<br />
MutationGraB GPCR corpus N/A N/A N/A 0.98 0.77 0.86<br />
MutationMiner Xylanase corpus N/A N/A N/A 1.00 0.85 0.92<br />
MEMA Mutation corpus N/A N/A N/A 0.98 0.75 0.85<br />
Table 5.2: Performance evaluation <strong>of</strong> residue entity recognition. The performance is compared with other<br />
published residue entity recognition systems: MutationGraB (GPCR corpus) [LHC07]; MutationMiner<br />
(Xylanase corpus) [BW05]; and MEMA (Mutation corpus) [RSMA+04]. Performance was measured in<br />
terms <strong>of</strong> precision, recall, and F1 measure.<br />
the association <strong>of</strong> the entity triplet organism, protein, and residue.<br />
The final section<br />
presents an application <strong>of</strong> the presented text mining solution that can be used to update<br />
the citation set <strong>of</strong> UniProtKB or any other derived databases.<br />
5.4.1 Evaluation <strong>of</strong> organism, protein, and residue entity recognition<br />
The goal <strong>of</strong> biological entity recognition, in this study, is to detect the mentions <strong>of</strong> residue,<br />
protein, and organism in biomedical abstract texts. In order to evaluate the performance<br />
<strong>of</strong> the developed ER systems, the detections were compared against the results from the<br />
manually curated test set, the gold standard corpus (GC).<br />
The evaluation shows that the developed regular expression patterns are well suited<br />
for the detection <strong>of</strong> residue mentions in biomedical texts. ER for residue mentions yields<br />
a precision <strong>of</strong> 0.92 and a recall <strong>of</strong> 0.98. With an F1 measure <strong>of</strong> 0.95 the performance<br />
<strong>of</strong> this ER system is within range <strong>of</strong> previous reports on point mutation identification<br />
[LHC07] [BW05] [RSMA+04] (cf. table 5.2).<br />
The performance for protein mention identification is evaluated with 65% precision and<br />
60% recall (62% F1 measure). The result is difficult to compare to previously reported<br />
systems, e.g. ProMiner and MutationMiner (cf. table 5.3), due to the different experimental<br />
setup. ProMiner was evaluated on the BioCreAtIvE corpus (80% F1 measure)<br />
Unique protein entities<br />
Reference Dataset Available Extracted Common Precision Recall F1<br />
Gold standard corpus 511 471 305 0.65 0.60 0.62<br />
ProMiner BioCreAtIvE corpus N/A N/A N/A 0.8 0.8 0.8<br />
MutationMiner Xylanase corpus N/A N/A N/A 0.88 0.71 0.79<br />
Table 5.3: Performance evaluation <strong>of</strong> protein entity recognition. The performance is compared with the<br />
other published protein entity recognition systems: ProMiner (BioCreAtIvE corpus, Task 1B, protein<br />
and gene name identification) [HFM+05]; and MutationMiner (Xylanase corpus) [BW05]. Performance<br />
was measured in terms <strong>of</strong> precision, recall, and F1 measure.<br />
Unique organism entities<br />
Reference Dataset Available Extracted Common Precision Recall F1<br />
Gold standard corpus 123 109 88 0.81 0.72 0.76<br />
MutationMiner Xylanase corpus N/A N/A N/A 0.88 0.71 0.79<br />
Table 5.4: Performance evaluation <strong>of</strong> organism entity recognition. The performance is compared with<br />
the NER system <strong>of</strong> MutationMiner (Xylanase corpus) [BW05]. Performance was measured in terms <strong>of</strong><br />
precision, recall, and F1 measure.<br />
which links the contained protein mentions to only a small set <strong>of</strong> organisms. However,<br />
we have repeated the experiment on the BioCreAtIvE dataset and the result suggests<br />
that our method yields a comparable performance (76% F1 measure). Conversely, the<br />
evaluation <strong>of</strong> MutationMiner not only considers abstract texts but also the content <strong>of</strong> the<br />
full-text articles which should improve the results (79% F1 measure).<br />
Although the developed organism entity recognition system relies on a similar dictionary<br />
lookup approach as protein entity recognition, the performance is higher (precision<br />
<strong>of</strong> 0.81 and recall <strong>of</strong> 0.72; cf. table 5.4). This indicates that the terminology list is<br />
precise and covers a wide range <strong>of</strong> expressions.<br />
In conclusion, with F1 measures <strong>of</strong> 0.95, 0.62, and 0.76 for the entity recognition <strong>of</strong><br />
residue, protein, and organism, the developed text mining system is able to detect these<br />
three biological entities in biomedical abstract texts.<br />
Unique resi.-prot.-org.-associations<br />
Reference Dataset Available Extracted Common Precision Recall F1<br />
Gold standard corpus 172 79 65 0.82 0.38 0.52<br />
MutationGraB Mutation corpus N/A N/A N/A 0.85 0.69 0.76<br />
MEMA Mutation corpus N/A N/A N/A 0.93 0.35 0.51<br />
MuteXt tinyGRAP N/A N/A N/A 0.88 0.83 0.85<br />
Table 5.5: Performance evaluation <strong>of</strong> residue-protein-organism entity association detection. The performance<br />
is compared with the other published point mutation detection systems: MutationGraB (Mutation<br />
corpus1) [LHC07]; and MEMA (Mutation corpus2) [RSMA+04]. Notice that MEMA identified associations<br />
without grounding. Performance was measured in terms <strong>of</strong> precision, recall, and F1 measure.<br />
5.4.2 Performance study on the entity triplet association<br />
The objective <strong>of</strong> the developed association detection system is to identify the entity triplet<br />
<strong>of</strong> organism, protein, and residue.<br />
In this section, the performance <strong>of</strong> this detection<br />
system is studied by comparing the <strong>predicted</strong> association with the manually annotated<br />
associations in the gold standard corpus (GC).<br />
With a precision <strong>of</strong> 0.82 and a recall <strong>of</strong> 0.38 the developed detection system is a reliable<br />
method for association detection, and the precision is comparable to other related reports<br />
(cf. table 5.5). In comparison to the systems, MutationGraB and MuteXt, the low recall<br />
can be explained by the differences in the test corpora; both systems were evaluated on<br />
protein family specific full-text articles. The evaluated precision <strong>of</strong> MEMA is different<br />
from this study, because MEMA identifies only associations without grounding to Uniprot<br />
entries.<br />
Manual analysis isolated two main reasons for the low recall. First, the association <strong>of</strong><br />
all the three entities failed in several cases, because the system did not find an association<br />
between protein and organism.<br />
Second, cases were encountered where a protein-organism<br />
association was correctly identified, but a protein-residue association could not<br />
be found. A detailed explanation is given in the discussion section.<br />
Despite the low recall <strong>of</strong> this text mining module, the evaluation indicates that the<br />
developed method is able to detect associations <strong>of</strong> residue, protein, and organism. More<br />
UTP<br />
Dataset Available Extracted Common Precision Recall F1<br />
XC1 70,401 77,407 62,068 0.82 0.88 0.85<br />
URP<br />
Dataset Available Extracted Common Precision Recall F1<br />
XC2 10,152 10,876 9,325 0.86 0.92 0.89<br />
Table 5.6: Performance evaluation <strong>of</strong> protein-organism and protein-residue entity association detection.<br />
A cross-validation corpus (XC) was obtained by first retrieving the abstract texts cited in UniProtKB<br />
from MEDLINE, searching for tri-occurrences <strong>of</strong> the named entities residue, protein, and organism,<br />
and then retaining only those entries for which the identifier combination UTP (Uniprot identifier<br />
+ NCBI Taxonomy identifier + PubMed identifier) was found in UniProtKB. The result is the test set<br />
XC1 for the protein-organism association study. XC2 is a subset <strong>of</strong> XC1 obtained by scanning for documents where<br />
the identifier combination URP (Uniprot identifier + Residue identifier + PubMed<br />
identifier) was validated by UniProtKB. Performance was measured in terms <strong>of</strong> precision, recall, and F1<br />
measure.<br />
importantly, the detected associations are in accordance with manually identified semantic<br />
relations between the three biological entities. With a precision <strong>of</strong> 0.82 the developed<br />
method is able to identify precisely protein residues in biomedical texts.<br />
5.4.3 Cross-validation <strong>of</strong> identified residues with UniProtKB<br />
In the previous section the system for the association <strong>of</strong> the entity triplet organism,<br />
protein, and residue, was evaluated manually on the gold standard corpus. The objective<br />
in this section is to perform an analysis on a larger test set by cross-validation with<br />
UniProtKB. For this task, the cross-validation corpora XC1 and XC2 were used. The<br />
analysis consists <strong>of</strong> a two-step association study, i.e. the association <strong>of</strong> protein-organism<br />
and residue-protein were evaluated individually. Table 5.6 summarises the results.<br />
With a precision <strong>of</strong> 0.82 and a recall <strong>of</strong> 0.88, the result for organism-protein association<br />
indicates that the system is able to extract correct semantic relations from XC1. The second<br />
step <strong>of</strong> the evaluation determines the performance <strong>of</strong> the residue-protein association<br />
detection. A similar precision score <strong>of</strong> 0.86 was determined, while the recall (0.92) was<br />
triplet association/UTRP<br />
Resource Available Extracted Common Precision Recall F1<br />
GC 38 61 29 0.48 0.76 0.59<br />
XC2 58 61 52 0.84 0.90 0.87<br />
Table 5.7: A specialised performance evaluation between GC and XC2. The test set consists <strong>of</strong> the 26<br />
common documents between GC and XC2. A comparison <strong>of</strong> the annotated entity triplet associations<br />
from both resources shows that the list <strong>of</strong> targets are different.<br />
almost twice as high as the triple entity association determined with GC (cf. table 5.5).<br />
This can be explained by the differences <strong>of</strong> the used <strong>annotation</strong> methods for both test<br />
corpora. The entities and their associations in GC were determined manually and did not<br />
include a grounding step.<br />
To better compare the performance between the GC and XC2 data, the common set <strong>of</strong><br />
26 abstract texts from both corpora was studied (cf. section 5.2). By reusing the URP<br />
information from the cross-validation corpus the determined performance is similar to the<br />
one evaluated on the whole XC2 dataset (compare table 5.7 with table 5.6).<br />
However, this result differs from the evaluation based on manual <strong>annotation</strong>. A<br />
detailed analysis shows that manual <strong>annotation</strong> determined 38 entity triplets, whereas<br />
XC2 lists 58 associations and only 25 <strong>of</strong> these are common among both data sets (data<br />
not shown). This indicates that the annotated targets in GC and XC2 are different and<br />
cannot be compared directly.<br />
The results indicate that the developed method is able to detect correct associations<br />
<strong>of</strong> residue, protein, and organism.<br />
5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins<br />
The developed text mining system annotates an identified protein residue in a text passage<br />
with references to its source protein and its hosting organism. Therefore, each MEDLINE<br />
Figure 5.3: Identified protein residues in MEDLINE. From a MEDLINE extraction, a subset <strong>of</strong> 2,884<br />
Uniprot proteins were identified, with cross-references to 14,007 PDB entries, and a corresponding set <strong>of</strong><br />
18,427 MEDLINE records. In comparison, the citation set <strong>of</strong> the corresponding entries in UniProtKB<br />
has only 4,652 PMIDs. Only 657 out <strong>of</strong> 18,427 PMIDs are cross-validated by UniProtKB data. Dashed<br />
line = MEDLINE based extraction; solid line = database values.<br />
record with an identified protein residue can be used to update the citation set <strong>of</strong> the<br />
corresponding protein entry in UniProtKB, or any other hyperlinked database, e.g. PDB<br />
(UniProtKB/PDB). In this study, the whole <strong>of</strong> MEDLINE was scanned with the developed<br />
protein residue identification method, and the determined set <strong>of</strong> PMIDs was compared with the<br />
citation sets in UniProtKB/PDB (cf. figure 5.3; for an overview <strong>of</strong> databank hyperlinks<br />
and citation references cf. section 2.1).<br />
The protein residue identification system found a total <strong>of</strong> 40,750 MEDLINE records<br />
where residues were associated with co-mentioned proteins. The unique count <strong>of</strong> Uniprot<br />
proteins within the entity triplet associations is 9,354, where 2,884 out <strong>of</strong> 9,364 proteins<br />
have hyperlinks to 14,007 PDB entries. Corresponding to these 2,884 Uniprot proteins<br />
is the set of 18,427 out of 40,750 PMIDs. In comparison, UniProtKB indexes a set of<br />
4,652 PMIDs for these 2,884 Uniprot entries. A set analysis determined that the two datasets<br />
share 657 PMIDs. This means that only 3.6 per cent of the identified PMIDs<br />
can be cross-validated with UniProtKB (cf. figure 5.4).<br />
The low rediscovery rate can be explained by the fact that most of the annotations<br />
in UniProtKB are derived from sections only available in full-text articles. Although the<br />
analysis was based on MEDLINE abstracts alone, the extraction was able to find a large number<br />
of relevant abstract texts for citation expansion. With a precision of 0.82 (determined<br />
by gold standard evaluation), the estimated number of true positives in the PMID set is<br />
15,110. Compared with the 4,652 citations in the database for the 2,884 Uniprot proteins,<br />
and taking into account the 657 re-discovered abstract texts, the MEDLINE<br />
analysis expands the citation set approximately threefold.<br />
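The figures in this paragraph can be retraced with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative, and the way the expansion factor is derived (estimated true positives minus the re-discovered PMIDs, relative to the existing citation set) is an assumption about how the reported numbers combine, not a calculation stated in the text.<br />

```python
# Counts reported in section 5.4.4.
identified_pmids = 18_427  # PMIDs found by residue identification for the 2,884 proteins
uniprot_pmids = 4_652      # PMIDs already cited in UniProtKB for the same proteins
common_pmids = 657         # PMIDs present in both sets
precision = 0.82           # precision from the gold standard evaluation

# Share of identified PMIDs that UniProtKB can confirm.
overlap_pct = 100 * common_pmids / identified_pmids

# Expected number of correct PMIDs among the identified ones.
estimated_true_positives = round(precision * identified_pmids)

# Assumption: citations new to UniProtKB are the estimated true positives
# minus those that UniProtKB already lists.
estimated_new = estimated_true_positives - common_pmids
expansion_factor = estimated_new / uniprot_pmids

print(f"cross-validated: {overlap_pct:.1f} per cent")
print(f"estimated true positives: {estimated_true_positives}")
print(f"expansion factor: {expansion_factor:.1f}")
```

Under this reading, the expansion factor comes out slightly above three, which is consistent with the statement above.<br />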
In conclusion, the presented text mining system can be used to determine relevant<br />
literature data for the update <strong>of</strong> the citation sets in UniProtKB/PDB.<br />
The extracted abstract texts for those proteins provide the basis for <strong>functional</strong> <strong>annotation</strong><br />
extraction.<br />
5.5 Discussion<br />
The presented text mining method identifies protein residues in biomedical texts. The<br />
first step is the recognition <strong>of</strong> the entities residue, protein, and organism in texts. The<br />
language expressions of all three biological entities are quite different. A residue entity,<br />
for example, is generally mentioned in the text by its three-letter abbreviation followed by<br />
its protein sequence position. The regular expression patterns were designed specifically for<br />
these and other derived expressions, which explains the high precision and recall of the<br />
residue entity recognition system. However, a residue can also be expressed by its one-letter<br />
abbreviation or in a syntactical form.<br />
While the latter expression is considered and<br />
implemented in this thesis, it has been suggested that these expressions represent only a small<br />
Figure 5.4: Cross-validation <strong>of</strong> citations from identified protein residues with UniProtKB/PDB. For a<br />
subset <strong>of</strong> UniProtKB/PDB proteins (i.e. proteins with UID and PDBID) the determined PMIDs can be<br />
cross-validated with the relevant citation set from UniProtKB. Dashed line = the number <strong>of</strong> common<br />
PMIDs; uni = UniProtKB/PDB based citations; med = protein residue identification based citations;<br />
comm = common set <strong>of</strong> citations between uni and med.<br />
fraction of residue mentions in biomedical texts [LHC07].<br />
The implementation of one-letter abbreviations<br />
would increase the recall, but the method would become less precise. For example, the<br />
matched string ”C4” could be a nucleotide, a gene, an atom in a chemical compound, or<br />
any other acronym.<br />
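As an illustration of the pattern-based residue recognition discussed above, the following Python sketch matches three-letter abbreviations and full residue names followed by a sequence position. It is not the pattern set used in this work: the names FULL_TO_CODE, PATTERN and find_residues are hypothetical, the acidic residues are listed only under their "-ate" names, and one-letter codes are deliberately left out for the reason given in the text.<br />

```python
import re

# Full residue names mapped to three-letter codes (illustrative list).
FULL_TO_CODE = {
    "alanine": "Ala", "arginine": "Arg", "asparagine": "Asn", "aspartate": "Asp",
    "cysteine": "Cys", "glutamine": "Gln", "glutamate": "Glu", "glycine": "Gly",
    "histidine": "His", "isoleucine": "Ile", "leucine": "Leu", "lysine": "Lys",
    "methionine": "Met", "phenylalanine": "Phe", "proline": "Pro", "serine": "Ser",
    "threonine": "Thr", "tryptophan": "Trp", "tyrosine": "Tyr", "valine": "Val",
}

_full = "|".join(sorted(FULL_TO_CODE, key=len, reverse=True))
_abbr = "|".join(sorted(set(FULL_TO_CODE.values())))

# Full names are matched case-insensitively, abbreviations case-sensitively
# (so that "his 3" in ordinary prose is not tagged). An optional hyphen or
# space may separate the name from the sequence position.
PATTERN = re.compile(rf"\b(?P<name>(?i:{_full})|{_abbr})[- ]?(?P<pos>\d+)\b")


def find_residues(text):
    """Return (three-letter code, position) pairs for residue mentions in text."""
    hits = []
    for match in PATTERN.finditer(text):
        name = match.group("name")
        code = FULL_TO_CODE.get(name.lower(), name)  # abbreviations map to themselves
        hits.append((code, int(match.group("pos"))))
    return hits


print(find_residues("phosphorylation of serine 77"))
print(find_residues("the C4 atom"))  # one-letter style strings are not matched
```

Adding a one-letter alternative such as a single capital followed by digits would raise recall, but it would also match strings like ”C4”, which is the loss of precision described above.<br />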
The identification of protein names in text is a great challenge for the biomedical<br />
text mining community. Protein names are not standardised,<br />
and the use of many alternative names is common, e.g. abbreviations, pet names,<br />
or synonymous names. In addition, there is no guideline for the construction of names;<br />
a name can therefore be short or long in terms of word count, e.g. ”MAP kinase kinase”<br />
and ”MAP kinase kinase kinase”.<br />
The developed protein entity recognition system is<br />
based on a lookup of names and synonyms in a dictionary. Because the set of entries is finite,<br />
syntactical variants of protein names cannot be detected if they are not covered by the<br />
dictionary. This explains the low recall of this ER system. Conversely, partial matching of<br />
a longer protein name and the tagging of ambiguous protein names reduce the precision<br />
of the method. For example, ”SNF” could be a protein in yeast or the funding agency<br />
”Swiss National Science Foundation”.<br />
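The recall and precision effects of a dictionary lookup can be made concrete with a small example. The sketch below implements a longest-match lookup over whitespace tokens; it is an illustration only, not the system developed here. The identifiers in ENTRIES are placeholders rather than real database accessions, and build_dictionary and tag_proteins are hypothetical helper names.<br />

```python
# Placeholder identifiers mapped to names and synonyms (illustrative only).
ENTRIES = {
    "ID_MAPK": ["MAP kinase"],
    "ID_MAP2K": ["MAP kinase kinase", "MAPKK"],
    "ID_MAP3K": ["MAP kinase kinase kinase"],
}


def build_dictionary(entries):
    """Map each lower-cased name (as a tuple of tokens) to its identifier."""
    return {
        tuple(name.lower().split()): ident
        for ident, names in entries.items()
        for name in names
    }


def tag_proteins(text, dictionary):
    """Return (surface form, identifier) pairs, preferring the longest match."""
    tokens = [tok.strip(".,;()") for tok in text.split()]
    max_len = max(len(key) for key in dictionary)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first to avoid matching only part of a name.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + n])
            if key in dictionary:
                hits.append((" ".join(tokens[i:i + n]), dictionary[key]))
                i += n
                break
        else:
            i += 1
    return hits


DICTIONARY = build_dictionary(ENTRIES)
print(tag_proteins("Activation of MAP kinase kinase kinase precedes MAP kinase kinase.", DICTIONARY))
print(tag_proteins("MAPK kinase is active", DICTIONARY))  # variant not in the dictionary
```

Trying the longest candidate first avoids tagging only ”MAP kinase” inside a longer name, while the second call shows how a variant that is absent from the dictionary is missed entirely.<br />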
The principal method for organism entity recognition is the same as for protein name<br />
identification in this investigation. A list of terms from NCBI Taxonomy was utilised to<br />
generate an organism name dictionary. Although the developed method is the same as for<br />
protein entity recognition, the system yielded a higher performance. One explanation is<br />
that the dictionary contains predominantly unambiguous terms. However, some<br />
ambiguous terms can also be found, e.g. ”RAT” could be a protein, an organism, or a<br />
method. To my knowledge, dedicated research on organism entity recognition has not<br />
been published, nor is a gold standard for performance evaluation available.<br />
Based on the residue, protein, and organism entities found in a text, the developed system<br />
identifies semantic relations between these biological entities. The approach is based<br />
on the idea of reusing explicitly stated relations contained in UniProtKB. The correct<br />
association between protein and residue relies on several factors: the ER performance,<br />
the correct protein sequence retrieval, which is dependent on the correct organism-protein<br />
association, and the correct alignment <strong>of</strong> a residue with a protein sequence at the specified<br />
position. On the one hand, a low recall in residue-protein association can be explained by<br />
a missing protein sequence variant in the repository. On the other hand, an incorrect<br />
protein-organism association leads to the retrieval of a wrong protein sequence. Another<br />
consideration is that the protein sequence in the database could deviate from the author’s<br />
data, because the two may have used different indexing rules. Conversely, the<br />
true positive rate can be distorted for the same reason: a residue position that does not<br />
correspond to the database numbering may match the protein sequence by chance. One solution to<br />
this specific problem is to consider all residues of the same protein in the sequence alignment.<br />
However, this method may only be applicable to full-text analysis, as abstract<br />
texts rarely mention multiple residues of the same protein.<br />
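The alignment check and the proposed remedy can be sketched as follows. This is a simplified illustration, not the implementation used in this work: the two sequences are invented toy strings, and residue_matches and consistent_sequences are hypothetical helper names. A single mention matches both candidates by chance, whereas requiring all mentions of the same protein to align leaves only one candidate.<br />

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def residue_matches(sequence, residue, position):
    """True if the sequence has the given residue at the 1-based position."""
    index = position - 1
    if not 0 <= index < len(sequence):
        return False
    return sequence[index] == THREE_TO_ONE[residue]


def consistent_sequences(candidates, mentions):
    """Names of candidate sequences in which every residue mention aligns."""
    return [
        name for name, sequence in candidates.items()
        if all(residue_matches(sequence, res, pos) for res, pos in mentions)
    ]


# Invented toy sequences that differ only at position 3.
candidates = {"toy_A": "MSAKDE", "toy_B": "MSGKDE"}

print(consistent_sequences(candidates, [("Ser", 2)]))              # both match
print(consistent_sequences(candidates, [("Ser", 2), ("Ala", 3)]))  # only toy_A
```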
The evaluation of the entity recognition and the association detection systems was<br />
done by a manual analysis of the gold standard corpus, and by an automatic cross-validation<br />
study, for the following reasons. Protein annotations in UniProtKB are<br />
primarily derived from manual information extraction from full-text articles. Although a<br />
considerable amount of this information may not be present in MEDLINE, the combination<br />
of X+PMID, where X is either UID or TID, can be used to estimate the information<br />
extraction performance. However, the false positive rate in this cross-validation study<br />
cannot be determined, because the knowledge base is incomplete,<br />
even for the indexed citations. Therefore, manual evaluation on a gold standard test set<br />
has the advantage that both the false positive and the false negative rates can be studied.<br />
An identified protein residue is annotated with references to its source protein (Uniprot<br />
identifier) and the hosting organism (NCBI Taxonomy identifier). Based on these <strong>annotation</strong>s<br />
a link can be made between MEDLINE and biological knowledge bases.<br />
One<br />
immediate application is to scan MEDLINE for protein residues and use the Uniprot<br />
identifier <strong>annotation</strong>s in combination with the MEDLINE identifier (or PubMed identifier;<br />
PMID) to update the citation sets of corresponding Uniprot entries. The significance<br />
of this approach was studied by automatic cross-validation analysis. Although the results<br />
indicate that only a small proportion of Uniprot proteins can be found and associated with<br />
residues from MEDLINE analysis, the identified set of PMIDs has only a small overlap<br />
with the corresponding citation sets. One explanation is that annotations were extracted<br />
from full-text articles, where the same information is not present in the abstract texts;<br />
they represent the true negative fraction in the sense that the information cannot be identified<br />
from abstract sections. Another explanation is that curators provide<br />
only a list of relevant citations from a batch of processed biomedical articles. In other<br />
words, neither the irrelevant citations (false positives) nor the complete list of true<br />
positive citations from the sample of reviewed biomedical articles is available in<br />
UniProtKB; either would have allowed a more precise evaluation.<br />
5.6 Conclusion<br />
The developed text mining solution identifies protein residues in text and annotates them<br />
with references to UniProtKB and NCBI Taxonomy. Based on these references, a link<br />
between MEDLINE and UniProtKB is created. Although the identification <strong>of</strong> protein<br />
residues in MEDLINE does not necessarily mean that <strong>functional</strong> <strong>annotation</strong>s are present<br />
in abstract texts, the analysis is a prerequisite for the mining <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>.<br />
The extraction of contextual features as annotations of a protein residue is the topic of the<br />
following chapter.<br />
Chapter 6<br />
Information extraction from the<br />
context <strong>of</strong> a residue in text<br />
In the previous chapter, I have introduced a method for the identification <strong>of</strong> protein<br />
residues in biomedical texts. The objective, in this chapter, is to extract textual features<br />
from the context <strong>of</strong> protein residues that can be used as <strong>functional</strong> <strong>annotation</strong>. Because a<br />
terminological resource is not utilised, the developed method can discover new information<br />
from text.<br />
The extracted contextual features are then enriched with semantic labels<br />
according to a categorisation scheme. The design of this scheme was data-driven, and it<br />
contains concepts of biological interest. The overall result of this text mining solution<br />
is the <strong>annotation</strong> <strong>of</strong> protein residues with text segments that are classified by a set <strong>of</strong><br />
biological categories.<br />
6.1 Algorithms<br />
The developed information extraction system can be divided into two parts: extraction<br />
<strong>of</strong> contextual features associated with protein residues, and classification <strong>of</strong> the extracted<br />
textual features. Figure 6.1 illustrates the procedures involved in the developed information<br />
extraction system.<br />
Figure 6.1: Overview <strong>of</strong> processes and evaluation methods <strong>of</strong> the developed contextual feature extraction<br />
system.<br />
6.1.1 Extraction <strong>of</strong> contextual features<br />
Theory<br />
Finding <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> protein residues in biomedical text.<br />
In this<br />
study, several assumptions have been made for the extraction of functional annotations<br />
from biomedical texts. The first assumption is that<br />
noun phrases in a text are semantically rich in the sense that they can represent<br />
subject content (keywords) [JK95]. Consequently, they are good candidate textual<br />
features for the functional annotation of protein residues.<br />
The second assumption is that a biological function of a protein residue can be found<br />
as a verbal or nominal expression in natural language. In other words, a syntactical relation<br />
between a residue and a term can capture their semantic relation. Therefore, a syntactical<br />
analysis of a sentence enables the identification of an explicitly stated biological function.<br />
For example, from the phrase<br />
”A inhibits B by phosphorylation <strong>of</strong> C”,<br />
the relations<br />
A—inhibits—B-by-phosphorylation-of-C<br />
A—inhibits—B-by-phosphorylation<br />
A—inhibits—B<br />
UNK—phosphorylate—C,<br />
can be identified. Although the identification of a residue-keyword association can be<br />
attempted with co-occurrence analysis, the aim here is to extract reliable associations together with<br />
contextual information on the association. In other words, the type of association expressed<br />
by a verb or by a preposition, and the context expressed by a prepositional phrase,<br />
are important pieces of information that make up a justifiable functional annotation. A<br />
discussion on semantic relation and syntactical relation extraction can be found in section<br />
2.3.2.<br />
In principle, the terminology of GO can be reused to identify descriptions of biological<br />
function in text. However, this ontology is not specialised for protein residues;<br />
for example, the term ”active site” does not even appear as a stand-alone term in the<br />
repository. Descriptions of protein function generally refer to a higher level of biological<br />
function, e.g. metabolomics or cell signalling. In contrast, the annotation of protein<br />
residues requires a different terminology that describes molecular interactions or<br />
chemical reactions.<br />
Because a suitable terminological resource is not available, the extraction of syntactical<br />
relations focuses on semantic relations between two elements: a residue entity and a contextual<br />
feature (keyword). The following demonstrates how a description of function can<br />
be identified from a parsed sentence. Given the example sentence from MEDLINE<br />
”Parathyroid hormone inhibits renal phosphate transport by phosphorylation<br />
<strong>of</strong> serine 77 <strong>of</strong> sodium-hydrogen exchanger regulatory factor-1.”<br />
(PMID:17975671),<br />
a syntactical analysis produces the following phrase structure representation<br />
[Parathyroid hormone]/NP<br />
[inhibits]/V<br />
[renal phosphate transport]/NP<br />
[by]/P<br />
[phosphorylation]/NP<br />
[<strong>of</strong>]/P<br />
[serine 77]/NP<br />
[<strong>of</strong>]/P<br />
[sodium-hydrogen exchanger regulatory factor-1]/NP,<br />
where NP is a noun phrase, P a preposition, and V a verb. From this parsed sentence,<br />
the following semantic relations can be determined:<br />
Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation-of-serine 77<br />
Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation<br />
Parathyroid hormone—inhibits—renal phosphate transport<br />
UNK—phosphorylate—serine 77.<br />
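The verbal relations listed above can be derived mechanically from the chunked sentence. The following Python sketch is one possible way to do this and makes simplifying assumptions that are not taken from the text: the noun phrase directly before the verb is treated as the subject, every trailing preposition and noun phrase pair is attached to the object in turn (so it also emits the longest relation including the final prepositional phrase), and the nominalisation-based relation (UNK, phosphorylate, serine 77) is not derived. The function name verbal_relations is hypothetical.<br />

```python
# Chunked example sentence as (text, phrase tag) pairs.
chunks = [
    ("Parathyroid hormone", "NP"), ("inhibits", "V"),
    ("renal phosphate transport", "NP"), ("by", "P"),
    ("phosphorylation", "NP"), ("of", "P"), ("serine 77", "NP"),
    ("of", "P"), ("sodium-hydrogen exchanger regulatory factor-1", "NP"),
]


def verbal_relations(chunks):
    """Return (subject, verb, object) triples with increasingly specific objects."""
    tags = [tag for _, tag in chunks]
    v = tags.index("V")
    subject, verb = chunks[v - 1][0], chunks[v][0]  # assumption: NP before V is the subject
    obj = chunks[v + 1][0]
    relations = [(subject, verb, obj)]
    i = v + 2
    # Each following "P NP" pair specifies the previous relation further.
    while i + 1 < len(chunks) and chunks[i][1] == "P" and chunks[i + 1][1] == "NP":
        obj = f"{obj}-{chunks[i][0]}-{chunks[i + 1][0]}"
        relations.append((subject, verb, obj))
        i += 2
    return relations


for relation in verbal_relations(chunks):
    print(relation)
```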
In the next section, a template for storing the extracted relation information is discussed.<br />
Semantic representation <strong>of</strong> extracted relations.<br />
The objective <strong>of</strong> syntactical relation<br />
extraction is to identify biological relations in a sentence, i.e. a semantic relation<br />
between a residue entity and a term. Because the result is a set of syntactical relations<br />
with different levels of contextual specification (cf. example in previous section), a suitable<br />
data collation method is necessary to avoid data redundancy. That is, the set of relations<br />
determined within a given syntactic frame contains relations that are specifications<br />
of other ones. For example, the relation<br />
A—inhibits—B-by-phosphorylation,<br />
is a specification <strong>of</strong> the relation<br />
A—inhibits—B.<br />
Here, the predicate-argument structure (PAS) is proposed as a semantic representation<br />
<strong>of</strong> extracted syntactical relations. A PAS is a template for information extraction,<br />
where the predicate and the arguments represent the slots to be filled. In this study, the<br />
predicate (pred) <strong>of</strong> a PAS is defined as the verb, while the arguments <strong>of</strong> the verb are<br />
the numerically labelled arguments arg1 and arg2, or even higher numerically labelled<br />
arguments. The arg1 label is assigned to arguments, which are understood as agents,<br />
causers, or experiencers, i.e. the semantic subject. Conversely, the arg2 label is usually<br />
assigned to the patient argument, i.e. the argument which undergoes the change <strong>of</strong> state<br />
or is being affected by the action.<br />
The transformation of the extracted relations into PAS data does not consider the<br />
analysis of the semantic roles of the verb arguments, i.e. argument modifiers such as<br />
location, time, cause, etc. Noun phrases of the extracted relations can have prepositional<br />
attachments, and the prepositions are often indicators of the thematic roles of the verb arguments.<br />
Therefore, prepositional phrases are listed as modifiers of arguments with the<br />
following label notation: main argument label + preposition, e.g. arg1-of and arg2-by.<br />
The following illustrates the transformation of relations into a PAS for the previous<br />
example:<br />
pred = inhibit<br />
arg1 = Parathyroid hormone<br />
arg2 = renal phosphate transport<br />
arg2-by = phosphorylation<br />
arg2-<strong>of</strong> = serine 77,<br />
which corresponds to the following verb frame set:<br />
inhibit sub-arg1 obj-arg2 P by-arg2 P <strong>of</strong>-arg2.<br />
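The slot filling shown above can be sketched in a few lines of Python. The sketch is illustrative rather than the implementation used in this work: to_pas and the small LEMMA lookup are hypothetical, the noun phrase before the verb is assumed to be arg1, and when a preposition occurs more than once only its first attachment is kept. The last choice happens to reproduce the example above; how repeated prepositions are handled in the developed system is not stated here.<br />

```python
# Chunked example sentence as (text, phrase tag) pairs.
chunks = [
    ("Parathyroid hormone", "NP"), ("inhibits", "V"),
    ("renal phosphate transport", "NP"), ("by", "P"),
    ("phosphorylation", "NP"), ("of", "P"), ("serine 77", "NP"),
    ("of", "P"), ("sodium-hydrogen exchanger regulatory factor-1", "NP"),
]

# Hypothetical lemma lookup; a real system would use a lemmatiser.
LEMMA = {"inhibits": "inhibit"}


def to_pas(chunks, lemma):
    """Fill the pred/arg1/arg2 slots and list prepositional modifiers of arg2."""
    tags = [tag for _, tag in chunks]
    v = tags.index("V")
    pas = {
        "pred": lemma.get(chunks[v][0], chunks[v][0]),
        "arg1": chunks[v - 1][0],  # assumption: NP before the verb
        "arg2": chunks[v + 1][0],  # assumption: NP after the verb
    }
    i = v + 2
    while i + 1 < len(chunks) and chunks[i][1] == "P" and chunks[i + 1][1] == "NP":
        # Label notation: main argument label + preposition; first occurrence wins.
        pas.setdefault(f"arg2-{chunks[i][0]}", chunks[i + 1][0])
        i += 2
    return pas


for slot, value in to_pas(chunks, LEMMA).items():
    print(f"{slot} = {value}")
```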
Note that the defined PAS does not conform to the PAS schemes of some proposition<br />
banks, e.g. PropBank or PASBio. For example, for the verb ”inhibit” PropBank lists the<br />
following frame set:<br />
inhibit sub-ARG0 obj-ARG1<br />
inhibit sub-ARG0 S-ARG1,<br />
while additional arguments are not defined (note that the definition of ARG0 in PropBank<br />
is equivalent to arg1 in this definition, and ARG1 corresponds to arg2). Although<br />
verb frame sets from publicly available proposition banks could be considered in this study,<br />
the listed verbs have a low coverage of the set of verbs co-occurring with residue<br />
mentions in MEDLINE. The low coverage and the non-domain-specific verb frame sets<br />
are the main reasons why these resources were not reused.<br />
Implementation<br />
The extraction <strong>of</strong> contextual features is based on a syntactical analysis <strong>of</strong> natural language<br />
sentences. Two approaches were developed in this work and compared in the performance<br />
evaluation study: shallow parser based relation extraction, and full parser based relation<br />
extraction.<br />
Shallow parser based relation extraction.<br />
The first approach was to develop a<br />
shallow parser, which aims to find the boundaries <strong>of</strong> major constituents in a sentence,<br />
such as noun phrases. The design is based on heuristics and the idea <strong>of</strong> finding general<br />
relations between closed-class English words [LCM03]. The reported parser finds verbal<br />
relations between noun phrases, and prepositional relations for a set of the most frequent<br />
prepositions, i.e. ”of”, ”in”, and ”by”. Here, the parser is implemented as a general<br />
relation extraction method, where the list of prepositions is not limited to the three<br />
mentioned ones. The purpose is to find more contextual features, and thereby discover<br />
more information.<br />
Initially, an abstract text was split into sentences, and then annotated with part-of-speech<br />
(POS) tags using the CISTAGGER. The tagger was trained on the CISLEX<br />
lexical resource, which contains a rich terminological set for the biomedical domain [Gue96].<br />
Based on a rule set and the POS information, the developed shallow parser identified noun<br />
phrases, verb groups, verb phrases, and prepositional phrases in the analysed sentences:<br />
NP = Det? (Adj|Adv|N)* N<br />
PP = P NP<br />
VG = (Adv|Aux|V|InfTo)* V<br />
VP = VG NP PP*.<br />
N is a noun, Det a determiner, Adj an adjective, Adv an adverb, V a verb, Aux an auxiliary<br />
verb, InfTo the infinitive marker ”to”, P a preposition, NP a noun phrase, PP a<br />
prepositional phrase, VP a verb phrase, and VG a verb group. Note that the grammar<br />
does not consider coordinating conjunctions, e.g. ”and”, ”or” and ”,”. The grammar<br />
can easily be extended to capture conjunctions by<br />
NPx = NP (CC NP)*,<br />
where<br />
CC = (”and” | ”or” | ”,”){1,2}.<br />
However, the pattern would then also find false positives as illustrated in the following<br />
example. The sentence<br />
”Highly conserved phosphopantothenate binding residues include Asn59,<br />
Ala179, Ala180, and Asp183 from one monomer and Arg55’ from the<br />
adjacent monomer.” (PMID:12906824),<br />
contains the noun phrases<br />
NP1 = ”Asn59, Ala179, Ala180, and Asp183 from one monomer”<br />
NP2 = ”Arg55’ from the adjacent monomer”.<br />
The extended pattern would extract a single noun phrase, from which the correct<br />
post-nominal prepositional phrase attachment cannot easily be<br />
identified:<br />
NPx =<br />
”Asn59, Ala179, Ala180, and Asp183 from one monomer and<br />
Arg55’ from the adjacent monomer”.<br />
Based on the determined phrase structure, the parser then extracts verbal relations of<br />
noun phrases or prepositional phrases. A condition of the extraction is that at least one<br />
relation element must contain one or more residue mentions:<br />
REL = NP PP* VP.<br />
The extracted relation is then transformed to fill the slots <strong>of</strong> the predefined PAS template.<br />
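The grammar rules and the REL pattern can be expressed as a regular expression over a string of POS tag codes. The Python sketch below illustrates this idea under several assumptions: the single-letter tag codes, the hand-assigned tags of the example tokens, and the function names are all hypothetical; the residue mention ”serine 77” is assumed to have been merged into one noun token by the preceding entity recognition; and the condition that a relation must contain a residue mention is omitted for brevity. It is not the CISTAGGER-based implementation described above.<br />

```python
import re

# One-character code per POS tag so that phrases become regular expressions.
CODE = {"Det": "D", "Adj": "J", "Adv": "R", "N": "N",
        "P": "P", "V": "V", "Aux": "X", "InfTo": "T"}

NP = r"D?[JRN]*N"   # NP = Det? (Adj|Adv|N)* N
PP = rf"P{NP}"      # PP = P NP
VG = r"[RXVT]*V"    # VG = (Adv|Aux|V|InfTo)* V
# REL = NP PP* VP, with VP = VG NP PP*
REL = re.compile(
    rf"(?P<subj>{NP})(?P<subj_pp>(?:{PP})*)"
    rf"(?P<vg>{VG})(?P<obj>{NP})(?P<obj_pp>(?:{PP})*)"
)


def phrase(tokens, start, end):
    """Join the token texts in the half-open range [start, end)."""
    return " ".join(text for text, _ in tokens[start:end])


def extract_relation(tokens):
    """Return the first NP PP* VP relation in a POS-tagged sentence, or None."""
    tag_string = "".join(CODE.get(tag, "?") for _, tag in tokens)
    m = REL.search(tag_string)
    if m is None:
        return None
    offset = m.start("obj_pp")
    obj_pps = [phrase(tokens, offset + p.start(), offset + p.end())
               for p in re.finditer(PP, m.group("obj_pp"))]
    return {"subj": phrase(tokens, *m.span("subj")),
            "verb": phrase(tokens, *m.span("vg")),
            "obj": phrase(tokens, *m.span("obj")),
            "obj_pps": obj_pps}


# Hand-tagged example sentence (tags assigned for illustration).
tokens = [("Parathyroid", "N"), ("hormone", "N"), ("inhibits", "V"),
          ("renal", "Adj"), ("phosphate", "N"), ("transport", "N"),
          ("by", "P"), ("phosphorylation", "N"), ("of", "P"),
          ("serine 77", "N"), ("of", "P"), ("sodium-hydrogen", "Adj"),
          ("exchanger", "N"), ("regulatory", "Adj"), ("factor-1", "N")]

print(extract_relation(tokens))
```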
Full parser based relation extraction.<br />
The second approach in contextual feature<br />
extraction utilises the full parser ENJU [MT05] (version 2.3), which generates a so-called<br />
head-driven parse tree from a sentence. The advantage of this parser is that it uses a parsing<br />
model adapted to biomedical text. The parser generates predicate-argument<br />
relations between words.<br />
Because the generated output contains a lot of information,<br />
different interpretations are possible. In this study, a wrapper was developed that converts<br />
the parser’s output into the presented PAS data format.<br />
The assumption is that by<br />
following the direct links of a verb to its arguments in the tree, and then collecting all the<br />
sub-branches of each argument, the phrase structure of a verb argument can be found.<br />
The identified NP PP* VP structures are then decomposed to fill the PAS template.<br />
6.1.2 Categorisation <strong>of</strong> contextual features<br />
Theory<br />
A PAS captures a verb frame within a text sentence, where the arguments may represent<br />
subject content. In order to evaluate the relevance of these arguments, a semantic interpretation<br />
is needed. Here, a classification method was developed that automatically assigns<br />
semantic labels to the arguments of a PAS. For this task, categories have to be defined<br />
as suitable labels for information interpretation. Although an ontological model of protein<br />
residue function is not available, there are two approaches to this problem. The first is<br />
to adopt annotation schemes from various protein databases, e.g. UniProtKB. This<br />
represents a top-down approach. One motivation for reusing the categorisation scheme of<br />
UniProtKB is that information classified with this scheme can be directly used to update<br />
the relevant fields in the database.<br />
Alternatively, a bottom-up approach can propose new categories. In this study,<br />
text segments from MEDLINE were analysed to determine whether they represent suitable functional<br />
annotations for residues. The result is an overview of the information distribution in MEDLINE,<br />
which has led to the proposition of a categorisation scheme. The defined categories<br />
of both schemes are compared in table 6.1. Both categorisation schemes reflect concepts<br />
of biological interest. However, the bottom-up approach has the advantage that the proposed<br />
categories are data-driven, while in a top-down approach examples of listed categories may<br />
not be present in natural language text, or other categories may be missing from the scheme.<br />
The assignment <strong>of</strong> categories to contextual features is based on the endogenous classification<br />
approach [Cer00]. In contrast, the exogenous, i.e. corpus-based, approach requires<br />
large amounts of contextual cues, which are difficult to obtain. According to the author,<br />
the endogenous approach produces results more reliably, even under conditions of<br />
sparse data.<br />
From a reference set <strong>of</strong> terms with manually assigned labels according to a categorisation<br />
scheme, the algorithm computes the mutual information <strong>of</strong> the lexical constituents <strong>of</strong><br />
terms and their assigned categories. These scores are then used to calculate and select the<br />
highest scoring association <strong>of</strong> a term and a category. The algorithm was re-implemented<br />
and used in this study.<br />
Implementation<br />
The semantic interpretation <strong>of</strong> contextual features, which are the arguments <strong>of</strong> the extracted<br />
PAS, relies on the endogenous classification approach described by [Cer00]. The<br />
method was re-implemented in this study. The algorithm relies only on the mutual information<br />
<strong>of</strong> the lexical constituents <strong>of</strong> terms and their assigned categories.<br />
During the training phase, lexical constituents <strong>of</strong> multi-word terms were extracted<br />
from a labelled reference set. They represent the features <strong>of</strong> the predefined categories.<br />
MAN category | MAN definition | FEAT category | FEAT definition<br />
STR COMP | Structure component. Class denoting concepts that represent pieces and parts of the protein structure. | DOMAIN | Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure or fold.<br />
 | | MOTIF | Short (up to 20 amino acids) sequence motif of biological interest.<br />
 | | TOPO DOM | Topological domain.<br />
 | | CHAIN | Extent of a polypeptide chain in the mature protein.<br />
 | | TRANSMEM | Extent of a transmembrane region.<br />
 | | COILED | Extent of a coiled-coil region.<br />
CHEM MOD | Chemical modification. Class denoting changes to the protein sequence and the chemical composition. | VARIANT | Authors report that sequence variants exist.<br />
 | | MOD RES | Posttranslational modification of a residue.<br />
 | | PEPTIDE | Extent of a released active peptide.<br />
 | | VAR SEQ | Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting.<br />
 | | LIPID | Covalent binding of a lipid moiety.<br />
 | | CARBOHYD | Glycosylation site.<br />
STR MOD | Structural modification. Class denoting the changes to the protein structure without changes to the chemical composition. | REGION | Extent of a region of interest in the sequence.<br />
 | | SITE | Any interesting single amino-acid site on the sequence, that is not defined by another feature key.<br />
BINDING | Binding type. Class denoting different physico-chemical forces leading to a bond formation between a protein structure component and a chemical entity. | BINDING | Binding site for any chemical group (co-enzyme, prosthetic group, etc.).<br />
 | | METAL | Binding site for a metal ion.<br />
 | | DISULFID | Disulfide bond.<br />
 | | CROSSLNK | Posttranslationally formed amino acid bonds.<br />
 | | DNA BIND | Extent of a DNA-binding region.<br />
 | | NP BIND | Extent of a nucleotide phosphate-binding region.<br />
 | | ZN FING | Extent of a zinc finger region.<br />
 | | CA BIND | Extent of a calcium-binding region.<br />
ENZ ACT | Enzymatic activity. Types of enzymatic reactions as a subpart to protein functions. | ACT SITE | Amino acid(s) involved in the activity of an enzyme.<br />
CELL | Cellular phenotype. Class denoting different cellular phenotypes that can be affected by structural or compositional changes of a protein. | N/A | <br />
Table 6.1: Biological categories for the classification of protein residue related information. Two sets<br />
of schemes were used: a text data motivated definition of categories (MAN) determined from manual<br />
analysis of sentences with annotations for protein residues from MEDLINE, and key categories from the<br />
feature table of UniProtKB (FEAT).<br />
The association between a feature (w) and a category (c) was estimated from<br />
their mutual information score<br />
\[ I(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}. \tag{6.1} \]<br />
The association between a multi-word term $T = \{w_i\}_{i=1}^{n}$ and a category $c$ was<br />
computed as the sum of the associations of its words<br />
\[ A(T, c) = P^{*}(c) \sum_{i=1}^{n} I(w_i, c), \tag{6.2} \]<br />
where $P^{*}(c)$ is the probability of a category associated with a term. Categorising<br />
a multi-word term amounts to identifying the best-fitting category $c^{*}$ for the term,<br />
based on the words in the term<br />
\[ c^{*} = \arg\max_{c} A(T, c). \tag{6.3} \]<br />
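
The scoring in equations 6.1 to 6.3 can be illustrated with a minimal Python sketch. It is not the implementation used in this work. Several points are assumptions made for illustration only: probabilities are estimated by relative word frequencies, word-category pairs never seen in training contribute zero, and P*(c) is taken as the fraction of training terms labelled c. All function names and the toy training terms are hypothetical.<br />

```python
import math
from collections import Counter

def train(labelled_terms):
    """Collect counts from (term, category) pairs of the reference set."""
    word_cat, word, cat, term_cat = Counter(), Counter(), Counter(), Counter()
    for term, c in labelled_terms:
        term_cat[c] += 1                      # terms per category, for P*(c)
        for w in term.lower().split():
            word_cat[(w, c)] += 1             # joint word/category counts
            word[w] += 1
            cat[c] += 1
    return word_cat, word, cat, term_cat

def mutual_information(w, c, model):
    """Equation 6.1 with relative-frequency estimates (assumption)."""
    word_cat, word, cat, _ = model
    n = sum(word.values())
    joint = word_cat[(w, c)]
    if joint == 0:
        return 0.0                            # assumption: unseen pairs add nothing
    return math.log2((joint / n) / ((word[w] / n) * (cat[c] / n)))

def association(term, c, model):
    """Equation 6.2: P*(c) times the summed word associations."""
    term_cat = model[3]
    prior = term_cat[c] / sum(term_cat.values())   # assumed estimate of P*(c)
    return prior * sum(mutual_information(w, c, model)
                       for w in term.lower().split())

def classify(term, model):
    """Equation 6.3: category with the highest association."""
    return max(model[3], key=lambda c: association(term, c, model))

# Hypothetical toy reference set
model = train([
    ("complex formation", "BINDING"),
    ("dna binding", "BINDING"),
    ("phosphorylation site", "CHEM MOD"),
    ("serine phosphorylation", "CHEM MOD"),
])
best = classify("covalent complex formation", model)
```

With this toy reference set, the unseen word "covalent" is ignored and the term is assigned through the words "complex" and "formation".<br />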
The reference set was generated using maximal length noun phrase (MLNP) analysis.<br />
The assumption of this approach is that textual features co-occurring with a residue<br />
within a noun phrase ($\mathrm{NP}_r$) are good candidate terms for functional annotation. To<br />
identify the boundaries of these candidate terms, the MLNP algorithm looks them up in<br />
a predetermined set of noun phrases without nested residue entities ($\mathrm{NP}_{\neg r}$). In<br />
other words, the algorithm assumes that terms nested in $\mathrm{NP}_r$ are also expressed as stand-alone<br />
noun phrases, which can be identified by a broad syntactical analysis of MEDLINE.<br />
The following is an example for illustration. Consider the term<br />
”complex formation”,<br />
which is identified as a stand-alone noun phrase $\mathrm{NP}_{\neg r}$ in the sentence<br />
”The GlyNH2 was removed and the reactive-site peptide bond X18-<br />
Glu19 was synthesized by complex formation with proteinase K.”<br />
(PMID:9047374).<br />
The same term co-occurs with a residue entity within another noun phrase ($\mathrm{NP}_r$)<br />
”Rb-E2F-DNA complex formation”<br />
in the sentence<br />
”MDM2 also interacts with Rb through its central acidic domain and inhibits<br />
Rb function in part by blocking Rb-E2F-DNA complex formation.”<br />
(PMID:16337594).<br />
The determined MLNP in this example is ”complex formation”.<br />
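
The lookup described above can be sketched in Python as follows. This is an illustrative reading of the MLNP idea, not the algorithm as implemented in this work: it assumes whitespace tokenisation, case-insensitive matching, and that the longest contiguous word span found in the stand-alone set wins. The function name and the small stand-alone set are hypothetical.<br />

```python
def maximal_length_noun_phrase(residue_np, standalone_nps):
    """Return the longest contiguous word span of residue_np that also
    occurs as a stand-alone noun phrase, or None if there is none."""
    tokens = residue_np.split()
    known = {phrase.lower() for phrase in standalone_nps}
    # Try the longest spans first so that the first hit is maximal in length.
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            candidate = " ".join(tokens[start:start + length])
            if candidate.lower() in known:
                return candidate
    return None

# Hypothetical set of stand-alone noun phrases without residue mentions
standalone = {"complex formation", "formation", "peptide bond"}
mlnp = maximal_length_noun_phrase("Rb-E2F-DNA complex formation", standalone)
```

For the example above, the span "complex formation" is preferred over the shorter "formation" because longer spans are tested first.<br />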
Once the set of MLNPs had been extracted, each item (NP) was manually labelled according<br />
to a categorisation scheme. In this study, two categorisation schemes (cf. table 6.1)<br />
were used and studied independently: the categories defined by manual analysis of MEDLINE<br />
sentences (bottom-up approach), and the categories defined as keys in the feature<br />
table of UniProtKB (top-down approach). The sets of categories from the bottom-up<br />
and top-down approaches are referred to as MAN and FEAT, respectively.<br />
Table 6.2 compares the distribution of labels within the reference set.<br />
The following example illustrates how a determined MLNP can be used to find relevant<br />
information among the contextual features of a protein residue. From the sentence<br />
Category (MAN) | Frequency | Category (FEAT) | Frequency<br />
STR COMP | 433 | DOMAIN | 28<br />
 | | MOTIF | 8<br />
 | | TOPO DOM | 4<br />
 | | CHAIN | 2<br />
 | | TRANSMEM | 2<br />
 | | COIL | 1<br />
CHEM MOD | 361 | VARIANT | 275<br />
 | | MOD RES | 59<br />
 | | PEPTIDE | 13<br />
 | | VAR SEQ | 6<br />
 | | LIPID | 3<br />
 | | CARBOHYD | 1<br />
STR MOD | 25 | REGION | 100<br />
 | | SITE | 246<br />
BINDING | 195 | BINDING | 139<br />
 | | METAL | 25<br />
 | | DISULFID | 11<br />
 | | CROSSLNK | 10<br />
 | | DNA BIND | 6<br />
 | | NP BIND | 5<br />
 | | ZN FING | 2<br />
 | | CA BIND | 1<br />
ENZ ACT | 90 | ACT SITE | 110<br />
CELL | 161 | N/A | <br />
GEN BIOL | 2,172 | GEN BIOL | 2,372<br />
GEN ENG | 643 | GEN ENG | 651<br />
Table 6.2: Category distribution in the text feature reference set. The reference set was<br />
compiled by maximal length noun phrase (MLNP) analysis of two sets of noun phrases: one without<br />
residue mentions and the other with identified protein residue entities. The features in the reference set<br />
were manually assigned labels from the categorisation schemes MAN and FEAT. GEN BIOL = general<br />
biological terminology; GEN ENG = general English words.<br />
”Mutation K241Q completely abolishes DNA glycosylase activity and<br />
covalent complex formation in the presence of NaBH4.” (PMID:9241232),<br />
the following relation can be identified<br />
mutation K241Q—abolish—covalent complex formation.<br />
A semantic label can be assigned to the relation argument ”covalent complex formation”<br />
because the term ”complex formation” is labelled in the reference set.<br />
6.2 Evaluation methods<br />
The extraction of contextual features of residues results in a set of syntactical relations,<br />
which are represented as predicate-argument structures (PAS). The performance of this extraction module was evaluated<br />
by comparing the returned PAS data with the manual annotations in the gold standard test<br />
corpus (cf. section 5.2). A true positive was counted if the syntactical relations in a PAS<br />
were correct and the arguments in the PAS contained the annotated residue entity and<br />
the marked keyword(s) in the test corpus. If either condition was not met, a<br />
false positive was registered. The performance was measured in terms of precision, recall<br />
and F1-measure, as described earlier in section 5.3.<br />
The performance of the developed classification method was evaluated by 100 iterations of<br />
5-fold cross-validation. In each iteration, the terms in the reference set were shuffled and<br />
partitioned into a test set (1/5 of the data) and a training set (4/5 of the data). The<br />
average precision, recall and F1-measure (cf. section 5.3) were calculated for each classifier<br />
from the resulting confusion matrix.<br />
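
The repeated shuffling and partitioning can be sketched in Python as below. The text does not state whether all five folds are evaluated in each iteration; this sketch assumes a single held-out fifth per iteration. The function name, the fixed seed and the toy data are hypothetical.<br />

```python
import random

def repeated_splits(items, repeats=100, test_fraction=0.2, seed=0):
    """Yield (train, test) lists; every iteration reshuffles the items
    and holds out one fifth of them as the test set."""
    rng = random.Random(seed)        # fixed seed only to make the sketch reproducible
    pool = list(items)
    n_test = max(1, round(len(pool) * test_fraction))
    for _ in range(repeats):
        shuffled = pool[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]

# Toy usage with ten placeholder terms and three iterations
splits = list(repeated_splits(range(10), repeats=3))
```

Each iteration would then train on the first list, classify the second, and add the outcomes to one accumulated confusion matrix.<br />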
Method | PAS available | PAS extracted | PAS common | Precision | Recall | F1<br />
Shallow parsing | 117 | 82 | 56 | 0.68 | 0.48 | 0.56<br />
Full parsing | 117 | 86 | 32 | 0.37 | 0.27 | 0.31<br />
Table 6.3: Evaluation of syntactical language parser performance. The performance of the two language<br />
parsers (shallow and full parsing) was evaluated in terms of precision, recall and F1 measure by<br />
comparing the annotated PAS data in the test set with the PAS output returned by the parsers.<br />
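
The measures in table 6.3 follow from the three PAS counts. A minimal Python sketch, assuming the standard definitions (precision = common / extracted, recall = common / available, F1 = harmonic mean of the two); the function name is hypothetical.<br />

```python
def precision_recall_f1(available, extracted, common):
    """Compute precision, recall and F1 from PAS counts."""
    precision = common / extracted if extracted else 0.0
    recall = common / available if available else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Shallow-parsing row of table 6.3
shallow = precision_recall_f1(available=117, extracted=82, common=56)
```

For the shallow-parsing row, the returned values round to 0.68, 0.48 and 0.56.<br />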
6.3 Results<br />
In this section, the performances <strong>of</strong> contextual feature extraction and categorisation are<br />
studied. The test dataset is the gold standard corpus.<br />
6.3.1 Contextual feature extraction evaluated<br />
The objective of contextual feature extraction is to find textual features that are suitable<br />
as functional annotations for protein residues.<br />
In this section, the performance of the extraction system is studied by comparing<br />
the results produced with two different language parsers: the shallow parser and the full<br />
parser. Sentences from the gold standard corpus (GC) were used as the test dataset for this<br />
analysis.<br />
The analysis determined that the developed shallow parser performs<br />
better than the full parser ENJU. The shallow parser yielded an F1 measure<br />
of 0.56 (precision of 0.68 and recall of 0.48), while the full parser ENJU yielded an F1 measure<br />
of 0.31 (precision of 0.37 and recall of 0.27) (cf. table 6.3).<br />
The results suggest that contextual information on a residue entity can be extracted<br />
by syntactical analysis with an F1 measure of 0.56 and 0.31 for shallow parsing and<br />
full parsing, respectively.<br />
6.3.2 Performance analysis <strong>of</strong> the classifiers<br />
One problem in functional annotation extraction is the semantic interpretation of the<br />
extracted text data. The solution proposed in this work is based on a classification<br />
approach. Two different categorisation schemes were tested in this study: MAN and<br />
FEAT. The performance of the developed classification method was evaluated by repeated<br />
cross-validation studies. Table 6.5 summarises the results derived from the confusion<br />
matrix (cf. table 6.4).<br />
For MAN, the three best-performing classifiers for specific biological categories, with F1 measures of 0.62, 0.57, and 0.57,<br />
are STR COMP (precision of 0.56, recall of 0.69), CHEM MOD (precision of 0.54, recall<br />
of 0.59) and BINDING (precision of 0.63, recall of 0.52). The<br />
whole classification system for this categorisation scheme yielded an average precision<br />
of 0.48 and an average recall of 0.42. In comparison, the classification based on FEAT has<br />
a much lower average performance: average precision of 0.24, average recall of 0.18. The<br />
weak performance of the FEAT classifiers is explained by the distribution of examples<br />
across the categories; for some categories the number of corresponding examples<br />
is low (cf. table 6.2). A discussion is presented in section 6.4.<br />
Examining the false positives in the confusion matrix of MAN reveals that the classifiers<br />
most often confuse their categories with GEN BIOL (general biological terms) or GEN ENG<br />
(general English terms). This is not surprising, considering that English terms are ambiguous.<br />
In addition, some categories are confused with others, e.g. STR COMP with<br />
CHEM MOD, and ENZ ACT with STR COMP. One explanation is that some terms<br />
can be assigned to more than one category. For example, ”mutant structure” refers to<br />
an altered protein structure state, which is based on a chemical change in the protein<br />
sequence.<br />
Despite the moderate performance of some classifiers, the presented method can be<br />
used to assign categories to textual features. However, significant improvements in the<br />
performance of some classifiers are necessary before the system can be used automatically.<br />
Actual \ Predicted | BINDING | GEN BIOL | CELL | CHEM MOD | GEN ENG | ENZ ACT | STR COMP | STR MOD<br />
BINDING | 1,772 | 762 | 28 | 93 | 165 | 26 | 546 | 0<br />
GEN BIOL | 560 | 15,815 | 525 | 1,496 | 4,514 | 159 | 1,714 | 65<br />
CELL | 96 | 1,167 | 836 | 150 | 325 | 91 | 67 | 0<br />
CHEM MOD | 38 | 1,103 | 12 | 3,742 | 761 | 79 | 546 | 25<br />
GEN ENG | 144 | 2,556 | 126 | 510 | 1,820 | 46 | 480 | 35<br />
ENZ ACT | 33 | 338 | 80 | 201 | 226 | 324 | 457 | 0<br />
STR COMP | 160 | 783 | 64 | 551 | 592 | 35 | 4,914 | 11<br />
STR MOD | 1 | 91 | 1 | 129 | 125 | 0 | 21 | 43<br />
Table 6.4: Performance analysis of the classifiers (confusion matrix). Classification with categories<br />
from MAN was analysed by cross-validation studies with 100 iterations. The result is represented as a<br />
confusion matrix; rows give the actual category and columns the predicted category.<br />
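
The per-category values reported for MAN in table 6.5 can be recomputed from table 6.4. A minimal Python sketch, assuming rows are actual categories and columns are predicted categories, so that precision divides the diagonal count by its column sum and recall by its row sum. The names `LABELS`, `CONFUSION` and `per_class_scores` are hypothetical.<br />

```python
LABELS = ["BINDING", "GEN BIOL", "CELL", "CHEM MOD",
          "GEN ENG", "ENZ ACT", "STR COMP", "STR MOD"]

# Counts copied from table 6.4 (rows = actual, columns = predicted)
CONFUSION = [
    [1772,   762,  28,   93,  165,  26,  546,  0],
    [ 560, 15815, 525, 1496, 4514, 159, 1714, 65],
    [  96,  1167, 836,  150,  325,  91,   67,  0],
    [  38,  1103,  12, 3742,  761,  79,  546, 25],
    [ 144,  2556, 126,  510, 1820,  46,  480, 35],
    [  33,   338,  80,  201,  226, 324,  457,  0],
    [ 160,   783,  64,  551,  592,  35, 4914, 11],
    [   1,    91,   1,  129,  125,   0,   21, 43],
]

def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f1)} from a confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual_total = sum(matrix[i])                    # row sum
        predicted_total = sum(row[i] for row in matrix)  # column sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

scores = per_class_scores(CONFUSION, LABELS)
```

Rounded to two decimals, this gives a precision and recall of 0.63 and 0.52 for BINDING and of 0.56 and 0.69 for STR COMP, in agreement with table 6.5.<br />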
Category (MAN) | Precision | Recall | F1 | Category (FEAT) | Precision | Recall | F1<br />
STR COMP | 0.56 | 0.69 | 0.62 | DOMAIN | 0.50 | 0.24 | 0.32<br />
 | | | | MOTIF | 0.98 | 0.36 | 0.53<br />
 | | | | TOPO DOM | 0 | 0 | 0<br />
 | | | | CHAIN | 0 | 0 | 0<br />
 | | | | TRANSMEM | 0 | 0 | 0<br />
 | | | | COIL | 0 | 0 | 0<br />
CHEM MOD | 0.54 | 0.59 | 0.57 | VARIANT | 0.50 | 0.69 | 0.58<br />
 | | | | MOD RES | 0.40 | 0.23 | 0.29<br />
 | | | | PEPTIDE | 0.05 | 0.06 | 0.05<br />
 | | | | VAR SEQ | 0 | 0 | 0<br />
 | | | | LIPID | 1 | 0.32 | 0.48<br />
 | | | | CARBOHYD | 0 | 0 | 0<br />
STR MOD | 0.24 | 0.10 | 0.15 | REGION | 0.44 | 0.44 | 0.44<br />
 | | | | SITE | 0.40 | 0.55 | 0.46<br />
BINDING | 0.63 | 0.52 | 0.57 | BINDING | 0.41 | 0.45 | 0.43<br />
 | | | | METAL | 0.05 | 0.02 | 0.03<br />
 | | | | DISULFID | 0.53 | 0.15 | 0.23<br />
 | | | | CROSSLNK | 0 | 0 | 0<br />
 | | | | DNA BIND | 0 | 0 | 0<br />
 | | | | NP BIND | 0 | 0.06 | 0<br />
 | | | | ZN FING | 0 | 0 | 0<br />
 | | | | CA BIND | 0 | 0 | 0<br />
ENZ ACT | 0.43 | 0.20 | 0.27 | ACT SITE | 0.45 | 0.31 | 0.36<br />
CELL | 0.50 | 0.31 | 0.38 | N/A | | | <br />
GEN BIOL | 0.70 | 0.64 | 0.67 | GEN BIOL | 0.76 | 0.65 | 0.70<br />
GEN ENG | 0.21 | 0.32 | 0.26 | GEN ENG | 0.23 | 0.32 | 0.27<br />
Average | 0.48 | 0.42 | 0.43 | Average | 0.25 | 0.18 | 0.19<br />
Table 6.5: Performance evaluation of the classifiers (precision, recall, F1 measure). Evaluation of the classification<br />
of textual features (noun phrases). Classification with categories from MAN and FEAT was<br />
analysed by cross-validation studies with 100 iterations. The performance was measured in terms of<br />
precision, recall, and F1 measure.<br />
One option is to increase the amount of training data, or the number of features for each<br />
classifier. Another is to modify the definition of the classes. The results suggest<br />
that the algorithm is, in general, suitable for classification.<br />
6.4 Discussion<br />
The presented text mining solution extracts textual features from the context of residue<br />
entities. The identification of the contextual features, and their association with the residue<br />
entity, is based on the syntactical analysis of the sentence. More specifically, only the subset<br />
of semantic relations found in verbal and prepositional relations is extracted<br />
from text. The advantage of this approach is that not only the semantic relation partners<br />
and the semantic relation type are found, but contextual information is extracted as well.<br />
Within this study, two approaches to syntactical analysis were compared, i.e. shallow<br />
parsing and full parsing; the results indicate that the ENJU parser performed worse<br />
than the developed shallow parser. Manual analysis of the false positives<br />
indicates that incorrectly determined syntactical structures originate from<br />
incorrect part-of-speech (POS) tagging. For example, in the sentence<br />
”Conversely, K382Q displays a highly altered responsiveness to the activator,<br />
suggesting that Lys(382) is involved in both activator binding and<br />
allosteric transition mechanism.” (PMID:10751408),<br />
both parsers identified ”altered” as a verb in past tense, although the correct POS is a<br />
noun modifier. The performance of the POS tagger is critical for the detection of phrase<br />
boundaries. However, the two parsers rely on different methods for POS tagging, so the<br />
performance of the POS tagger has to be taken into account when comparing the shallow<br />
and full parser. Table A.1 lists some examples where a parser failed to extract the<br />
annotated PAS data from GC.<br />
The extracted information is difficult to normalise, because there is no gold standard<br />
for how to represent the association and how to qualify the contextual information. In<br />
this work, the predicate-argument structure is used as a template for the extracted information.<br />
Although verb frame sets from PropBank or PASBio can be used to normalise<br />
the extracted data, they are not designed to capture descriptions of protein residue function.<br />
On the other hand, this gives the extraction method the advantage of being able to discover new<br />
knowledge. Because the extracted information is not normalised, the performance can<br />
only be measured in terms of sensitivity.<br />
The evaluation of the classification method indicates that the presented approach can<br />
provide an automatic solution for text interpretation. However, some of the categories<br />
have only a few examples, which is reflected in the weak performance of the corresponding classifiers. One<br />
solution to this problem is to balance the example sets of the categories, for example<br />
by collecting more terminology from MEDLINE. Alternatively, other categories may<br />
be defined to balance the ratio between a category and its associated set of examples.<br />
Yet another approach is not to classify the arguments of a PAS, but to cluster them based on,<br />
for example, their contextual usage. The advantage here is that more<br />
similarities among the PAS data can be found, because clustering does not depend on the representativeness of a<br />
training (reference) set.<br />
Although semantic labels can be assigned to the arguments in a PAS,<br />
the developed method is not able to interpret the meaning of the whole extracted text<br />
segment. For example, from the sentence<br />
”Specific binding of the WT and mutant receptors Cys14Ala and<br />
Cys199Ala was inhibited in the presence of the disulfide bond reducing<br />
agent, DTT, implying that disulfide bonds are formed and can be<br />
reduced in these mutant receptors.” (PMID:9202220),<br />
the following information was extracted, and semantic categories were assigned to the<br />
arguments of the PAS<br />
pred = inhibited<br />
arg1 = Specific binding<br />
arg1-of = [the WT and mutant receptors CYS14 ALA and CYS199 ALA]/CHEM MOD<br />
arg2-in = the presence<br />
arg2-of = the disulfide bond reducing agent.<br />
Although one part of the information in the example has been correctly assigned the<br />
label CHEM MOD, the entire text phrase should be labelled BINDING. A solution<br />
to this problem is not trivial and requires several levels of linguistic analysis.<br />
6.5 Conclusion<br />
In this chapter, I have presented the developed contextual feature extraction system for<br />
the annotation of residue entities. Because a suitable terminological resource is not available,<br />
the identification of functional annotation is based on the extraction of syntactical<br />
relations between a residue entity and a noun phrase. The developed method allows the<br />
discovery of novel information that can provide key information for functional annotation.<br />
In the next chapter, I will demonstrate the validity of the extracted information as<br />
functional annotation of protein residues.<br />
Chapter 7<br />
Extraction of functional annotation<br />
for protein residues from MEDLINE<br />
In the previous two chapters, two fundamental text mining components for functional<br />
annotation extraction were presented. In this chapter, I present the results of the combined<br />
extraction and assess the performance of the combined system. The objective of<br />
this study is to determine the qualitative and quantitative distribution of information in<br />
MEDLINE. Because the information is derived solely from biomedical abstract texts, it<br />
is necessary to examine the data in terms of validity, novelty, and biological significance.<br />
In the first part of the evaluation, the performance of the functional annotation extraction<br />
is studied on the gold standard corpus. Then the biological significance of the<br />
data extracted from MEDLINE is studied on two example proteins: the tumour suppressor protein<br />
p53 and the Janus kinase 2 protein. Finally, the distribution of information is examined<br />
by two specific analyses: the cross-validation of identified active site residues with CSA,<br />
and the cross-validation of binding residues with MSDsite.<br />
7.1 Evaluation methods<br />
The evaluation of the functional annotation extraction system was based on the performance<br />
analysis of its extraction components: protein residue identification, and contextual<br />
feature extraction (cf. section 5.3 and section 6.2).<br />
The biological validity of the mined functional annotations was assessed by<br />
manual analysis. For each protein residue, the set of extracted annotations was reviewed<br />
and grouped by similar topics. Because the set of annotations for a protein<br />
residue can be very large, random samples were drawn from a list of annotations sorted<br />
by residue name and position. The result is a set of sample annotations for each extracted<br />
residue of a protein. The information was compared with the corresponding annotations<br />
in UniProtKB.<br />
The validation of catalytic residues was done by cross-validation with CSA [PBT04].<br />
The analysis was performed on three levels: the comparison of protein<br />
residues identified from MEDLINE with CSA, the comparison of residues with extracted functional annotations,<br />
and the comparison of residues with extracted annotations classified as ENZ ACT<br />
(cf. section 6.1.2). The residues were compared using the combination of the identifiers<br />
RID+UID (cf. section 5.3).<br />
The validation of binding residues from MEDLINE extraction was done accordingly, by cross-validation with MSDsite.<br />
The third level of validation compared residues with extracted annotations classified as<br />
BINDING.<br />
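
The comparison by identifier pairs can be sketched with Python sets. This is an illustration only: it assumes that each residue is represented as a (UID, RID) tuple, and the identifiers shown are placeholders rather than entries from CSA, MSDsite or the mined data. The function name is hypothetical.<br />

```python
def compare_residue_keys(mined, reference):
    """Compare two collections of (UID, RID) pairs and report the overlap."""
    mined_keys = set(mined)
    reference_keys = set(reference)
    return {
        "shared": mined_keys & reference_keys,          # found in both sources
        "mined_only": mined_keys - reference_keys,      # only in the text-mined set
        "reference_only": reference_keys - mined_keys,  # only in the reference resource
    }

# Placeholder identifiers, not real database entries
mined = [("UID_A", "SER15"), ("UID_A", "HIS57"), ("UID_B", "LYS241")]
reference = [("UID_A", "HIS57"), ("UID_B", "ASP102")]
result = compare_residue_keys(mined, reference)
```

The same function can be applied at each of the three levels by restricting the mined set, for example to residues whose annotations were classified as ENZ ACT or BINDING.<br />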
7.2 Results<br />
7.2.1 Evaluation of the developed functional annotation extraction system<br />
The presented functional annotation extraction system consists of two basic modules:<br />
identification of protein residues, and contextual feature extraction. The following describes<br />
an analysis of the overall performance of the combined text mining system. The<br />
test set is the gold standard corpus (GC; cf. section 5.2). The evaluation was done<br />
in two respects: manual validation of extracted information, and cross-validation with<br />
UniProtKB annotations.<br />
Manual validation of extracted information.<br />
The gold standard corpus consists of 100 abstract texts in which a protein, a residue and an organism<br />
co-occur. However, manual analysis identified only 51 abstract texts with residue entities that can<br />
be associated with their proteins and host organisms. The number of associations<br />
(OPR) is 172. This represents the target for protein residue identification.<br />
Corresponding to these OPRs is the set of functional annotations (PAS data). For 109<br />
out of 172 OPRs, keywords were co-mentioned in verbal relations. The number of PAS<br />
associated with the 109 OPRs is 117. This represents the target of functional annotation<br />
extraction.<br />
Figure 7.1 summarises the performance of the functional annotation extraction. With<br />
a previously determined precision of 0.82 and a recall of 0.38, the protein residue identification<br />
module detects 79 OPRs, 65 of which are correct. Contextual<br />
feature extraction for these 65 protein residues resulted in 35 PAS data. In comparison<br />
with the 117 annotated PAS of the 109 OPRs, only 16 of the 35 extracted PAS are true<br />
positives. However, the total number of extracted PAS is 46, which results in a precision<br />
of 0.35 and a recall of 0.13. A systematic analysis revealed that the false positives<br />
have the following sources: a false positive OPR with an extracted PAS, a true positive<br />
OPR with no annotated PAS, and a true positive OPR with a false positive PAS.<br />
Dataset | PAS available | PAS extracted | PAS common | Precision | Recall | F1<br />
GC | 117 | 46 | 16 | 0.35 | 0.13 | 0.25<br />
Figure 7.1: Performance evaluation of the functional annotation extraction system. The performance<br />
is dependent on the two combined text mining modules: protein residue identification and contextual<br />
feature extraction. The performance was measured in terms of precision, recall, and F1 measure.<br />
In comparison, had the system identified all protein residues correctly, the<br />
extraction as a whole would have yielded a precision of 0.68 and a<br />
recall of 0.48 (cf. section 6.3). Considering that the presented text mining solution is a pilot<br />
approach to extracting functional annotations for the validation of predicted functional sites,<br />
the result is good for this area and comparable to the first studies in BioCreAtIvE or the Critical<br />
Assessment of Techniques for Protein Structure Prediction (CASP). The low recall can be<br />
explained by the performance of the contextual feature extraction module.<br />
The result indicates that the extracted functional annotations have a reasonable precision<br />
in this first attempt at functional annotation extraction, but low coverage.<br />
This can be explained by the combined performance of the text mining modules. On<br />
the one hand, an incorrectly determined protein residue leads to a false positive PAS. On<br />
the other hand, a failed entity recognition contributes to the false negative rate. In addition,<br />
language complexity and incorrectly parsed sentences contribute to both the<br />
false positive and false negative rates of functional annotation extraction.<br />
In conclusion, the presented functional annotation extraction system delivers precise<br />
information, but has a low coverage. However, in the context of the bioinformatics<br />
work of this thesis, a precision-driven extraction system is preferred over a recall-oriented<br />
text mining solution.<br />
Cross-validation with UniProtKB functional annotations.<br />
Despite the low coverage of the functional annotation extraction system, the extracted information is correct<br />
and reusable for the annotation of protein residues. Table B.1 lists the 16 verified PAS<br />
data, corresponding to 17 verified protein residues. A comparison with UniProtKB shows<br />
that 5 out of 16 are rediscovered knowledge. The remaining 11 out of 16 contain novel<br />
information that can be used to update the protein knowledge base.<br />
The extraction of functional annotations is a multi-step system. Although the performance<br />
of each module may not be at an optimal level, the results demonstrate that<br />
functional annotations are available in and extractable from MEDLINE.<br />
7.2.2 Studying mined functional annotations for the proteins p53 and Jak2<br />
UniProtKB curates functional annotations for proteins on three levels: protein level,<br />
protein domain level, and protein residue level. The objective in this section is to study the<br />
validity and novelty of functional annotations mined from the whole of MEDLINE.<br />
The result provides an indication of the biological significance of automatic extraction<br />
from MEDLINE. The annotations of two example proteins, p53 and Jak2, are analysed<br />
and compared with relevant information from UniProtKB.<br />
Tumour suppressor protein p53.<br />
p53 plays a critical role in preventing human cancer<br />
formation. In the native state, the protein assembles into a tetrameric phosphoprotein.<br />
It consists of four functional domains: (1) the proline-rich, acidic N-terminus, which is<br />
involved in transcriptional activation, e.g. Mdm2 binding; (2) the central core, which<br />
binds DNA; (3) the oligomerisation domain with nuclear localisation signals, which allows<br />
the transfer into the nucleus; and (4) the C-terminus, which regulates DNA-binding<br />
[SYH+03].<br />
The extraction of functional annotations from MEDLINE for the human tumour protein<br />
p53 resulted in 1,665 PAS data. A manual analysis of samples of the mined functional<br />
annotations indicates that there are two main topics: regulatory post-translational<br />
modification, and the binding activity of residues, where in some cases the interaction<br />
partner is also stated. Table C.1 lists example annotations grouped by similar topics. For 5<br />
of the 6 identified residues with post-translational modification, i.e. THR18, SER46,<br />
SER15, THR55, and SER315, the extracted information is similar to the annotations in<br />
the UniProtKB entry. The remaining residue, SER6, has no annotation in UniProtKB.<br />
The knowledge base does not provide further information on the biological implications<br />
of these residues, while the extracted data contain more contextual information. For<br />
example:<br />
”[...]ATM-mediated phosphorylation of the ser15 site of p53[...]”<br />
(PMID:14757188),<br />
”[...]Ser46 phosphorylation activates p53-dependent apoptosis[...]”<br />
(PMID:17172844).<br />
The analysis also found <strong>annotation</strong>s for some critical residues that are not recorded in<br />
UniProtKB. For example:<br />
”[...]the amino acid change C135R generates the loss of TP53 DNA-binding<br />
activity[...]” (PMID:17914575),<br />
”[...]R248W abolish the association with p63[...]” (PMID:11172034).<br />
The activity <strong>of</strong> p53 is thought to be regulated through a number <strong>of</strong> post-translational<br />
modifications at the N- and C-terminal regions. Review articles report that seven serines<br />
(SER6, SER9, SER15, SER20, SER33, SER37, and SER46) and two threonines (THR18,<br />
and THR81) in the N-terminal domain are modified by kinases upon exposure <strong>of</strong> cells to<br />
ionising radiation or UV light. The analysis shows that MEDLINE extraction can recover<br />
this information for the residues SER6, SER15, SER46, and THR18.<br />
Janus Kinase 2 (Jak2).<br />
Jak2 plays a crucial part in various growth factor and cytokine<br />
signalling pathways. Similar to other protein tyrosine kinases of the Janus kinase<br />
family, Jak2 consists of a tyrosine kinase domain and a tyrosine kinase-like domain. It is<br />
thought that the kinase-like domain can negatively regulate the kinase domain.<br />
The set of extracted functional annotations for Jak2 comprises 624 PAS data entries and<br />
contains information on only seven residues: L539 (1 annotation), W515 (1 annotation),<br />
K607 (2 annotations), V617 (630 annotations), F617 (5 annotations; a reported variant<br />
associated with Budd-Chiari syndrome), V678 (3 annotations), and D816 (1 annotation).<br />
A comparison with UniProtKB data shows that the extracted information for F617, K607,<br />
and L539 is similar to the annotations in the database. These and other annotations for<br />
D816, V678, and W515 describe mutation events (data not shown).<br />
In order to assess the extracted information on V617, random samples were selected<br />
and studied manually. The result of the analysis indicates that the set of annotations<br />
contains much redundant information. The data can be grouped into two main topics:<br />
disease and genetic origin. Table D.1 lists some examples of extracted functional<br />
annotations.<br />
The effect of mutating residue 617 on cellular function and its association with particular<br />
diseases have already been reported, but none of the extracted annotations provides any<br />
molecular explanation. A survey of research publications on Jak2 revealed that myeloid<br />
and lymphoid malignancies are associated with Jak2 V617F. It is proposed that the<br />
mutation at residue 617 destabilises the interaction between the kinase and kinase-like<br />
domains, and thereby promotes activation of kinase activity [POHS05]. These results suggest<br />
that the extracted information reflects pieces of evidence; however, the biological relations<br />
between them may not be available in the mined output or even in MEDLINE.<br />
In summary, the study of the mined functional annotations of residues for the two proteins<br />
presented here indicates that MEDLINE contains information that recurs<br />
in a number of abstracts. Despite this redundancy, some functional annotations<br />
are not contained in UniProtKB, indicating that MEDLINE extraction still yields<br />
novel information.<br />
7.2.3 Cross-validation <strong>of</strong> mined catalytic residues with CSA<br />
In the previous section, functional annotations were extracted from MEDLINE, and for a<br />
range of annotations, the contained information was analysed for its biological validity and<br />
novelty. This section focuses on enzyme-related information in the extracted annotations.<br />
The objective is to study how reliable the extracted information is for the validation of<br />
catalytic residues. The identified residues with these associated annotations are compared<br />
with CSA. Figure 7.2 summarises the result of this analysis.<br />
The CSA lists 12,971 protein residues (RID+UID), of which 799 were identified in<br />
MEDLINE. The 12,172 CSA residues that were not found can be explained by the performance<br />
of the identification system (cf. section 5.4). Another explanation is that CSA<br />
is curated from full-text publications, and the same information may not be<br />
available in MEDLINE.<br />
By selecting residues with extracted functional annotations from MEDLINE, 691 out<br />
of 799 protein residues were retained. This result indicates that many functional descriptions<br />
are available as contextual features of the identified protein residues. The result<br />
is consistent with previous performance evaluation studies (cf. section 6.4). With a precision<br />
of 0.43 and recall of 0.20, the classifier for the category ENZ ACT (cf. section 6.3)<br />
identified enzyme-related functional annotations for 77 out of 691 protein residues. Manual<br />
analysis shows that this reduction can be explained by the classifier’s performance.<br />
Another explanation is the absence of relevant contextual cues in the extracted text.<br />
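The successive reductions reported above can be restated as simple proportions. The following minimal sketch only reuses the counts given in the text; the variable and function names are illustrative and not part of the original system.<br />

```python
# Counts of RID+UID pairs as reported in the text above.
CSA_TOTAL = 12_971        # protein residues listed in CSA
IN_MEDLINE = 799          # CSA residues also identified in MEDLINE
WITH_ANNOTATION = 691     # of those, residues with extracted functional annotations
WITH_ENZ_ACT = 77         # of those, residues with annotations classified as ENZ ACT


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# CSA residues that were not identified in MEDLINE.
not_found = CSA_TOTAL - IN_MEDLINE

# Each step is expressed relative to the set it was selected from.
funnel = [
    ("CSA residues identified in MEDLINE", percent(IN_MEDLINE, CSA_TOTAL)),
    ("identified residues with functional annotations", percent(WITH_ANNOTATION, IN_MEDLINE)),
    ("annotated residues classified as ENZ ACT", percent(WITH_ENZ_ACT, WITH_ANNOTATION)),
]

for label, value in funnel:
    print(f"{label}: {value}%")
```

Expressing each step relative to its parent set makes it visible that the largest losses occur at residue identification and at classification, not at annotation extraction.<br />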
A search for the term “catalytic triad” in the sentences of the identified protein residues<br />
yielded a sub-selection of 221 out of 46,750 residues. A comparison with CSA shows<br />
that 44 out of 221 are re-discoveries of active site residues.<br />
The annotations for the<br />
remaining 177 may contain supporting evidence to identify the residues as catalytic. A<br />
systematic analysis of these predicted catalytic residues should start with the 27 out of<br />
177 residues that have annotations classified as ENZ ACT.<br />
In conclusion, the developed text mining system rediscovers active site residues by<br />
solely mining abstract text from MEDLINE. While the rate of false positives is not known,<br />
the extraction identified 1,391 protein residues with enzyme-related functional annotations.<br />
The significance of these potentially new CSA residues is being studied further in<br />
ongoing work.<br />
Figure 7.2: Cross-validation of text mined catalytic residues with CSA. The analysis was<br />
based on the comparison of the determined RID+UID pairs, and the numbers reflect these<br />
pairs. RID = Residue identifier; UID = Uniprot identifier.<br />
Figure 7.3: Cross-validation of text mined binding residues with MSDsite. Annotation was studied<br />
at three levels: the mentioned protein residue alone, the residue with PAS data, and the residue with<br />
information on binding. The numbers indicate the RID+UID pairs counted in the data. RID = Residue<br />
identifier; UID = Uniprot identifier.<br />
7.2.4 Annotation <strong>of</strong> protein residues in MSDsite<br />
The MSDsite [GDO+05] holds a number of predicted ligand-binding sites, obtained by automatically<br />
analysing ligand-contacting residues in the PDB. The objective in this section is to analyse<br />
how many of these binding residues can be annotated from mining MEDLINE.<br />
The analysis shows that 512 out of the 46,750 protein residues identified in MEDLINE<br />
are also contained in MSDsite (cf. figure 7.3). A large proportion of these residues are<br />
associated with PAS data (429 out of 512), while only a small subset of 12 have information<br />
classified as BINDING. Manual analysis shows that all of these 12 annotations are<br />
correct. They can be used to validate the predicted ligand binding residues in MSDsite<br />
(table E.1).<br />
For the remaining 417 out <strong>of</strong> 512 residues, the associated PAS data may still contain<br />
valid information for the <strong>annotation</strong>. However, a systematic analysis was not performed<br />
at this stage <strong>of</strong> study.<br />
In summary, a relatively small set <strong>of</strong> protein residues recovered from MEDLINE extraction<br />
can be used for the <strong>annotation</strong> <strong>of</strong> MSDsite entries.<br />
7.3 Discussion<br />
The extraction of functional annotation is a multi-step process, and the quality of the<br />
result has to be interpreted in the context of each subprocess’s performance. Although the<br />
performance of each extraction module may not be optimal, the evaluation results<br />
indicate that the mined output contains biologically meaningful data. Considering that the<br />
validation of a predicted function requires any evidence of biological function, the developed<br />
text mining system can become a valuable tool, for example for the protein function<br />
prediction assessment in the Critical Assessment of Techniques for Protein Structure Prediction<br />
(CASP) [LRTV07]. With the improvement of the information extraction modules,<br />
the mined functional annotations are expected to become more reliable.<br />
The biological relevance of the extracted functional annotation was demonstrated on<br />
two different proteins, p53 and Jak2.<br />
The results show that information in<br />
UniProtKB can be rediscovered from MEDLINE and that novel information can be extracted<br />
as well. These functional annotations can be considered to complement existing<br />
annotations in UniProtKB. However, manual analysis of subsets of the extracted annotations<br />
indicates that the information is represented redundantly in MEDLINE. One major<br />
reason is that biological facts are expressed repeatedly within the biological community.<br />
The study of identifying catalytic residues and binding residues from the mined functional<br />
annotations, and the cross-validation with CSA and MSDsite, shows that the developed<br />
text mining solution is able to find relevant data from MEDLINE. Although the<br />
developed classifiers have a weak performance, it is not clear whether this completely explains<br />
the cross-validation results. It is possible that key information that would identify the<br />
biological role of the protein residues is not mentioned in abstract texts. Another<br />
explanation is the protein residue identification performance, which was<br />
evaluated with a low recall score.<br />
Although abstract texts cover only a subset of information from full-text articles, and<br />
information is represented repeatedly in MEDLINE, this study shows that the text mined<br />
information is biologically valid and contains snippets of additional information that are<br />
relevant for UniProtKB. For example, the extracted annotations complement existing<br />
information in UniProtKB and provide initial data on functional sites in<br />
proteins that have not yet been curated.<br />
7.4 Conclusion<br />
In this chapter, two text mining components were combined to form the functional annotation<br />
extraction system. Performance analysis shows that the system is precise but<br />
has a low coverage. However, the low recall is compensated by the fact that information<br />
is distributed redundantly. The extracted information is biologically valid and contains<br />
some novel data, which can be used to update UniProtKB. So far, functional annotations<br />
of residues have been evaluated in isolation, i.e. independent of their structural context in<br />
proteins. In the following chapter a biological context is created by combining functional<br />
annotations with protein structure data (cf. chapter 3 and chapter 4).<br />
Chapter 8<br />
Combining <strong>active</strong> site prediction<br />
with mined <strong>functional</strong> <strong>annotation</strong>s<br />
The goal of this thesis is to combine information from two disjoint information resources.<br />
To this end, various methodologies were developed for the prediction of functional sites<br />
in proteins, and for the extraction of relevant information for the functional annotation of<br />
protein residues from scientific articles.<br />
More specifically, a <strong>predicted</strong> <strong>functional</strong> site<br />
can be validated by a set <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> protein residues.<br />
Conversely, a<br />
set <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s requires a structural context to understand the molecular<br />
mechanism <strong>of</strong> a protein function.<br />
In the previous chapters, I have presented the results on 3D pattern mining from PDB<br />
(cf. chapter 3) and <strong>functional</strong> <strong>annotation</strong> extraction from MEDLINE (cf. chapters 5, 6,<br />
and 7). Here, the produced datasets are combined and analysed. The objective in this<br />
chapter is to validate <strong>predicted</strong> <strong>active</strong> <strong>sites</strong> that the data mining output may contain,<br />
by combining specific <strong>functional</strong> <strong>annotation</strong>s extracted from MEDLINE. The result is<br />
compared with data from CSA.<br />
Figure 8.1: Overview <strong>of</strong> processes and evaluation methods <strong>of</strong> combining the protein structure dataset<br />
and literature dataset.<br />
8.1 Algorithms<br />
8.1.1 Combining protein structure data with literature data<br />
Theory<br />
The method to combine PDB with MEDLINE data, i.e. the <strong>functional</strong> <strong>annotation</strong> <strong>of</strong> a<br />
residue from a protein structure, is based on the combination <strong>of</strong> two identifiers: RID+UID<br />
(cf. section 5.3). There are two major subtasks to combine the datasets (cf. figure 8.1):<br />
linking PDB entries to a Uniprot entry, and associating a residue with its co-mentioned<br />
protein in text.<br />
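The idea of joining both datasets on the RID+UID key can be sketched as follows. This is a minimal illustration under stated assumptions: residues are represented as (RID, UID) tuples, the function name `combine` is hypothetical, and the identifiers and annotation strings are invented toy data, not data from the thesis.<br />

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class ResidueKey(NamedTuple):
    rid: str  # residue identifier, e.g. one-letter code plus sequence index
    uid: str  # UniProtKB identifier of the protein


def combine(
    structure_residues: Set[ResidueKey],
    literature_annotations: Iterable[Tuple[ResidueKey, str]],
) -> Dict[ResidueKey, List[str]]:
    """Attach literature annotations to structure residues sharing the same RID+UID."""
    by_key: Dict[ResidueKey, List[str]] = defaultdict(list)
    for key, text in literature_annotations:
        by_key[key].append(text)
    # Keep only residues that occur in both the structure set and the literature set.
    return {key: by_key[key] for key in structure_residues if key in by_key}


# Toy data (hypothetical identifiers and annotation strings).
structure = {ResidueKey("H10", "TOY1_TEST"), ResidueKey("S20", "TOY1_TEST")}
literature = [
    (ResidueKey("H10", "TOY1_TEST"), "annotation A"),
    (ResidueKey("H10", "TOY1_TEST"), "annotation B"),
    (ResidueKey("C5", "TOY2_TEST"), "annotation C"),
]
combined = combine(structure, literature)
print(combined)
```

Only residues present on both sides survive the join, which is why the two mapping subtasks described above determine the size of the combined dataset.<br />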
Mapping residues in PDB to UniProtKB.<br />
The mapping between PDB and UniProtKB,<br />
and the inherited mapping of a protein residue from a PDB entry to its UniProtKB sequence<br />
index, is a non-trivial task. One problem is that the author of a determined protein<br />
structure may have used an arbitrary residue index system that is not in accordance with the<br />
wild-type protein sequence.<br />
Furthermore, residues in a protein deletion mutant may have<br />
been numbered sequentially, irrespective of sequence gaps. Another problem is that<br />
UniProtKB may not have the corresponding protein sequence for a crystallised protein,<br />
which may be, for example, a novel splice variant.<br />
In some cases, cross-links from PDB to UniProtKB, or from UniProtKB to PDB, are available.<br />
However, over time the links may have become outdated. In order to find the correct<br />
mapping between the protein residue indices in both databases, an exhaustive sequence<br />
alignment is required. Various solutions and services have been provided for the periodic<br />
update of UniProtKB-PDB mappings [VMMR+05] [Mar05] [VZHC05] [MSD08].<br />
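To illustrate why an alignment is needed when a deletion mutant has been numbered sequentially, the sketch below maps positions of a structure sequence onto a reference sequence. It is not the alignment procedure used by the cited services: Python's `difflib.SequenceMatcher` is used here only as a simple stand-in for a proper sequence alignment, and both sequences are invented toy data.<br />

```python
from difflib import SequenceMatcher
from typing import Dict


def map_indices(structure_seq: str, reference_seq: str) -> Dict[int, int]:
    """Map 1-based positions in structure_seq to 1-based positions in reference_seq."""
    matcher = SequenceMatcher(None, structure_seq, reference_seq, autojunk=False)
    mapping: Dict[int, int] = {}
    for block in matcher.get_matching_blocks():
        # Each block is a run of identical residues: structure_seq[a:a+size] == reference_seq[b:b+size].
        for offset in range(block.size):
            mapping[block.a + offset + 1] = block.b + offset + 1
    return mapping


# Toy reference sequence and a deletion mutant lacking reference positions 9-12,
# with its residues numbered sequentially from 1 (both sequences are hypothetical).
reference = "MKTAYIAKQRQISFVKSHFSRQ"
mutant = reference[:8] + reference[12:]

index_map = map_indices(mutant, reference)
print(index_map[8], index_map[9])
```

In this toy case the sequential index 9 of the mutant corresponds to position 13 of the reference, so using the structure numbering directly would assign annotations to the wrong residue.<br />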
Here, I reuse a previously published lookup table file [Mar05] for the mapping of<br />
protein residues in PDB to UniProtKB. Note that the lookup table is based on the<br />
alignment analysis work of the Macromolecular Structure Database (MSD) group at the<br />
European Bioinformatics Institute [MSD08].<br />
Mapping protein residue in text to UniProtKB.<br />
The mapping <strong>of</strong> a residue entity<br />
in text to its co-mentioned protein, and ultimately the mapping to UniProtKB, is<br />
explained in section 5.1.<br />
Implementation<br />
The sequence index mapping of a PDB entry to its corresponding UniProtKB entry<br />
was based on the lookup table produced by [Mar05] (version October 2008). An example<br />
of the lookup table data is shown in figure 8.2. The combination of the following keys was<br />
used to unambiguously map a residue from PDB to its UniProtKB native sequence position:<br />
PDBID + chainID + RID.<br />
PDB columns: PDBID, chainID, serial, resName, resSeq. UniProtKB columns: UID, resName, seqIndex.<br />
PDBID chainID serial resName resSeq UID resName seqIndex
11gs B 1 PRO 2 GSTP1_HUMAN P 3
11gs B 2 TYR 3 GSTP1_HUMAN Y 4
11gs B 3 THR 4 GSTP1_HUMAN T 5
11gs B 4 VAL 5 GSTP1_HUMAN V 6
11gs B 5 VAL 6 GSTP1_HUMAN V 7
11gs B 6 TYR 7 GSTP1_HUMAN Y 8
11gs B 7 PHE 8 GSTP1_HUMAN F 9
11gs B 8 PRO 9 GSTP1_HUMAN P 10
11gs B 9 VAL 10 GSTP1_HUMAN V 11
11gs B 10 ARG 11 GSTP1_HUMAN R 12
Figure 8.2: Lookup table for PDB/UniProtKB mapping. Excerpt of the lookup table to map protein<br />
residues from a PDB entry to the corresponding UniProtKB entry.<br />
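A lookup keyed on PDBID + chainID + RID could be implemented along the following lines. This is a sketch under assumptions: the real file format of the lookup table is not specified here, so whitespace-delimited rows as in the excerpt of figure 8.2 are assumed, the UID is assumed to be written with an underscore, the RID on the PDB side is taken to be the resSeq column, and the function names are hypothetical.<br />

```python
from typing import Dict, Tuple

# Rows copied from the excerpt in figure 8.2 (UID written with an underscore by assumption).
LOOKUP_EXCERPT = """\
11gs B 1 PRO 2 GSTP1_HUMAN P 3
11gs B 2 TYR 3 GSTP1_HUMAN Y 4
11gs B 3 THR 4 GSTP1_HUMAN T 5
"""

PdbKey = Tuple[str, str, int]      # (PDBID, chainID, resSeq)
UniProtPos = Tuple[str, str, int]  # (UID, one-letter residue name, seqIndex)


def load_lookup(text: str) -> Dict[PdbKey, UniProtPos]:
    """Parse whitespace-delimited lookup rows into a dictionary keyed on the PDB side."""
    table: Dict[PdbKey, UniProtPos] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        pdb_id, chain_id, _serial, _res3, res_seq, uid, res1, seq_index = line.split()
        table[(pdb_id, chain_id, int(res_seq))] = (uid, res1, int(seq_index))
    return table


def to_rid_uid(table: Dict[PdbKey, UniProtPos], pdb_id: str, chain_id: str, res_seq: int) -> Tuple[str, str]:
    """Return the RID+UID pair (e.g. 'Y4', 'GSTP1_HUMAN') for a residue of a PDB chain."""
    uid, res1, seq_index = table[(pdb_id, chain_id, res_seq)]
    return f"{res1}{seq_index}", uid


lookup = load_lookup(LOOKUP_EXCERPT)
print(to_rid_uid(lookup, "11gs", "B", 3))
```

The excerpt also shows why the lookup is needed: residue 3 of chain B in the PDB entry corresponds to position 4 of the UniProtKB sequence.<br />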
8.2 Evaluation methods<br />
The validation of identified catalytic residues was done by manual examination of the<br />
functional descriptions of annotated protein residues.<br />
Within this analysis, six datasets<br />
were used (cf. section 7.2): CSA is the set of active site residues from the Catalytic Site<br />
Atlas [PBT04]; OLDFIELD is the set of residues in the non-redundant structure set from<br />
[Old02]; PATTERN is the set of residues from the data mined 3D patterns; OPR is the<br />
set of protein residues identified from MEDLINE extraction; FA is the subset of OPR<br />
that has functional annotations extracted from MEDLINE; and ENZ is the subset of<br />
FA in which the contained information is classified as ENZ ACT, i.e. the information is<br />
enzyme-related.<br />
8.3 Results<br />
8.3.1 Protein residue mapping between three data resources<br />
This section gives an overview <strong>of</strong> the analysed datasets. Figure 8.3 summarises the data.<br />
OLDFIELD contains in total 341,365 protein residues, counted as RID+PDBID.<br />
328,796 out of 341,365 residues are found in the lookup table, which corresponds to<br />
280,521 RID+UID. In parallel, the residues from the mined 3D pattern set (PATTERN) were<br />
mapped to 24,500 RID+UID. The identification of protein residues in MEDLINE found<br />
a total of 132,476 RID+UID with a unique count of 46,750 RID+UID. This dataset is<br />
referred to as OPR. 36,569 out of 46,750 protein residues have functional annotations (FA),<br />
while a further subset of 1,467 out of 36,569 have annotations classified as ENZ ACT<br />
(ENZ). A set analysis between OLDFIELD and OPR determined 2,402 common protein<br />
residues, of which 197 are also listed in CSA.<br />
In summary, for a large fraction of protein residues in OLDFIELD, a mapping to<br />
UniProtKB sequence indices is available. However, only 2,402 are recovered from MEDLINE<br />
extraction and can be used for validation.<br />
Figure 8.3: Overview of the combined datasets from protein structure data and biomedical literature<br />
data. The combined dataset is analysed to identify active site residues. CSA = active site database; OPR<br />
= identified protein residues; PAS = contextual feature assigned to a protein residue; ENZ = contextual<br />
feature with enzyme-related information; OLDFIELD = protein structure subset from PDB; PATTERN<br />
= data mined structural features from OLDFIELD.<br />
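The set analysis described in this section amounts to ordinary set operations on RID+UID pairs. The sketch below uses invented toy sets (the identifiers and memberships are hypothetical) and then checks the arithmetic of the reported counts.<br />

```python
# Toy RID+UID sets; identifiers and memberships are invented for illustration only.
oldfield = {("A1", "TOY1"), ("B2", "TOY1"), ("C3", "TOY2"), ("D4", "TOY3")}
opr = {("A1", "TOY1"), ("B2", "TOY1"), ("C3", "TOY2"), ("E5", "TOY4")}
fa = {("A1", "TOY1"), ("C3", "TOY2"), ("E5", "TOY4")}  # OPR residues with annotations
enz = {("C3", "TOY2")}                                 # FA residues classified as ENZ ACT
csa = {("A1", "TOY1"), ("Z9", "TOY9")}                 # known catalytic residues

# The datasets are nested: ENZ is a subset of FA, which is a subset of OPR.
assert enz <= fa <= opr

common = oldfield & opr    # residues found in both structure and literature data
known = common & csa       # residues of the intersection that are already in CSA
not_in_csa = common - csa  # residues of the intersection that are not in CSA

print(len(common), len(known), len(not_in_csa))

# The same relation holds for the counts reported in the text:
# 2,402 common residues, 197 of them in CSA, leaving 2,205 outside CSA.
reported_remaining = 2_402 - 197
print(reported_remaining)
```

Because every set is keyed on the same RID+UID pair, each dataset in figure 8.3 can be compared with any other without further conversion.<br />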
8.3.2 Rediscovery <strong>of</strong> <strong>active</strong> <strong>sites</strong> and catalytic residues<br />
The identification <strong>of</strong> catalytic residues from protein structure data mining, and from<br />
biomedical literature mining was studied previously (cf. sections 4.2 and 7.2). Each<br />
result was evaluated by cross-validation with CSA. This section studies the validation <strong>of</strong><br />
<strong>predicted</strong> <strong>active</strong> <strong>sites</strong> from the combined datasets.<br />
Previously, three structural patterns were identified as active sites by cross-validation<br />
with CSA (cf. chapter 4). One of the patterns represents the well-known catalytic triad.<br />
This pattern was found in 19 proteins within the dataset (cf. section 4.2). Associated<br />
with these 19 proteins is a set of 57 protein residues. The analysis shows that only 3 out<br />
of 57 residues were identified in MEDLINE. The 3 residues identified in text belong<br />
to the same protein, bovine chymotrypsinogen (cf. table 8.1). The associated functional<br />
annotations for the residues ASP102 and HIS57 were not classified as ENZ ACT. The<br />
information contained in these annotations only indirectly indicates the catalytic property<br />
of these residues; the annotations do not mention them as part of the catalytic triad. In<br />
conclusion, the structure-based prediction of an active site was not validated by literature<br />
data.<br />
The intersection <strong>of</strong> PATTERN, OPR, and CSA results in a set <strong>of</strong> 15 protein residues.<br />
RID+UID: S195 CTRA_BOVIN; D102 CTRA_BOVIN; H57 CTRA_BOVIN<br />
Sentence: “These include the NH2-terminal four residues, the sequences near histidine-57 (chymotrypsinogen<br />
A numbering system), aspartic acid-102, aspartic acid-189, and serine-195,<br />
the regions of the three disulfide bridges, and the COOH-terminal end (residues 225-229)<br />
of the proteins. When aligned to maximize homology the identity of residues is<br />
34%.” (PMID:804314)<br />
PAS: N/A<br />
RID+UID: D102 CTRA_BOVIN; H57 CTRA_BOVIN<br />
Sentence: “In bovine chymotrypsinogen A in 2H2O at 31 degrees C, histidine-57 has a pK’ of 7.3 and<br />
aspartate-102 a pK’ of 1.4, and the histidine-40-aspartate-194 system exhibits inflections at<br />
pH 4.6 and 2.3.” (PMID:31898)<br />
PAS:<br />
pred = has<br />
arg1 = HIS57<br />
arg2 = a pK<br />
arg2-of = 7.3 and ASP102 a pK<br />
arg2-of = 1.4<br />
RID+UID: D102 CTRA_BOVIN<br />
Sentence: “In bovine chymotrypsin Aalpha under the same conditions, the histidine-57-aspartate-102<br />
system has pK’ values of 6.1 and 2.8, and histidine-40 has a pK’ of 7.2.” (PMID:31898)<br />
PAS:<br />
pred = have<br />
arg1 = the HIS57 ASP102 system<br />
arg2 = pK values<br />
arg2-of = 6.1 and 2.8<br />
RID+UID: D102 CTRA_BOVIN; H57 CTRA_BOVIN<br />
Sentence: “The results suggest that the pK’ of histidine-57 is higher than the pK’ of aspartate-102 in<br />
both zymogen and enzyme.” (PMID:31898)<br />
PAS:<br />
pred = is<br />
arg1 = that the pK<br />
arg1-of = HIS57<br />
arg2 = higher than the pK<br />
arg2-of = ASP102<br />
arg2-in = both zymogen and enzyme<br />
RID+UID: H57 CTRA_BOVIN<br />
Sentence: “The 1H NMR chemical shift of the Cepsilon1 H of histidine-57 in the chymotrypsin Aalpha-pancreatic<br />
trypsin inhibitor (Kunitz) complex is constant between pH 3 and 9 at a value<br />
similar to that of histidine-57 in the porcine trypsin-pancreatic trypsin inhibitor complex<br />
[Markley, J.L., and Porubcan, M. A. (1976), J. Mol. Biol. 102, 487–509], suggesting that the<br />
mechanisms of interaction are similar in the two complexes.” (PMID:31898)<br />
PAS:<br />
pred = is<br />
arg1 = complex<br />
arg2 = constant<br />
arg2-between = pH 3 and 9<br />
arg2-at = a value similar<br />
arg2-to = that<br />
arg2-of = HIS57<br />
arg2-in = the porcine trypsin-pancreatic trypsin inhibitor complex<br />
Table 8.1: Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen.<br />
Owing to the performance of the functional annotation extraction system and the availability of information<br />
in MEDLINE, only little information was extracted. The mined information on the active site<br />
residues mentions their catalytic properties only indirectly.<br />
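The PAS entries in table 8.1 consist of a predicate and a list of labelled arguments. A minimal way to hold such an entry in memory is sketched below; the class name and the `mentions` helper are hypothetical and not taken from the thesis, and the example values are copied from the fourth entry of the table.<br />

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PredicateArguments:
    """One PAS entry: a predicate plus an ordered list of (label, value) arguments."""
    pred: str
    # A list of pairs is used instead of a dict because a label such as
    # 'arg2-of' can occur more than once in a single entry.
    args: List[Tuple[str, str]] = field(default_factory=list)

    def mentions(self, residue: str) -> bool:
        """Return True if any argument value contains the normalised residue name."""
        return any(residue in value for _, value in self.args)


# Values copied from table 8.1 (sentence from PMID:31898).
entry = PredicateArguments(
    pred="is",
    args=[
        ("arg1", "that the pK"),
        ("arg1-of", "HIS57"),
        ("arg2", "higher than the pK"),
        ("arg2-of", "ASP102"),
        ("arg2-in", "both zymogen and enzyme"),
    ],
)
print(entry.mentions("HIS57"), entry.mentions("SER195"))
```

Keeping the arguments as ordered pairs preserves the entry exactly as listed, including repeated labels.<br />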
RID+UID: C32 THIO_HUMAN; C35 THIO_HUMAN<br />
Sentence: “A hydrogen bond between the sulfhydryls of Cys32 and Cys35 may reduce the pKa of Cys32<br />
and this pKa depression probably results in increased nucleophilicity of the Cys32 thiolate<br />
group.” (PMID:8805557)<br />
PAS:<br />
pred = reduce<br />
arg1 = A hydrogen bond<br />
arg1-between = the sulfhydryls<br />
arg1-of = CYS32 and CYS35<br />
arg2 = the pKa<br />
arg2-of = [CYS32 and this pKa depression]/ENZ ACT<br />
RID+UID: C215 PTN1_HUMAN<br />
Sentence: “The structure of the catalytically inactive mutant (C215S) of the human protein-tyrosine<br />
phosphatase 1B (PTP1B) has been solved to high resolution in two complexes.”<br />
(PMID:9391040)<br />
PAS:<br />
pred = solved<br />
arg1 = [inactive mutant (C215S)]/ENZ ACT<br />
arg1-of = the human protein-tyrosine phosphatase 1B (PTP1B)<br />
arg2 = unk<br />
arg2-to = to high resolution<br />
arg2-in = in two complexes<br />
Table 8.2: Identified catalytic residues from MEDLINE extraction. The mined functional annotations<br />
were classified as enzyme-related, suggesting that the corresponding protein residue has some catalytic properties.<br />
The identified residues were also cross-validated with CSA; however, the mined 3D patterns containing<br />
these residues were not validated as active sites by the database.<br />
The analysis shows that only 3 out of 15 protein residues have enzyme-related annotations.<br />
2 out of 3 residues belong to the protein human thioredoxin (cf. table 8.2). However,<br />
none of the mined 3D patterns can provide a structural context for the identified catalytic<br />
residues. A manual analysis of the remaining 12 out of 15 residues shows that some of the associated<br />
annotations were not correctly classified as enzyme-related, which can be explained by<br />
the performance of the classifier (cf. section 6.3).<br />
For 16 out of 197 protein residues, i.e. the intersection between OLDFIELD, OPR,<br />
and CSA, the term “catalytic triad” is co-mentioned within the same sentence. While none<br />
of the 16 residues is associated with a mined 3D pattern, 6 out of 16 residues have<br />
enzyme-related functional annotations (cf. table 8.3).<br />
In conclusion, the results of this study indicate that the coverage of relevant information<br />
is too low to validate predicted active sites. However, some of the enzyme-related<br />
annotations are biologically valid, but have no correlation with a 3D pattern.<br />
RID+UID: S80 HNL_HEVBR; D207 HNL_HEVBR; H235 HNL_HEVBR<br />
Sentence: “Our results yielded further support for an enzymatic mechanism involving the catalytic<br />
triad Ser80, His235, and Asp207 as a general acid/base.” (PMID:11354003)<br />
PAS:<br />
pred = involving<br />
arg1 = further support<br />
arg1-for = for an enzymatic mechanism<br />
arg2 = [the catalytic triad SER80, HIS235, and ASP207]/ENZ ACT<br />
RID+UID: E132 LINB_PSEPA; D108 LINB_PSEPA; H272 LINB_PSEPA<br />
Sentence: “The enzyme belongs to the alpha/beta hydrolase family and contains a catalytic triad<br />
(Asp108, His272, and Glu132) in the lipase-like topological arrangement previously proposed<br />
from mutagenesis experiments.” (PMID:11087355)<br />
PAS:<br />
pred = contains<br />
arg1 = unk<br />
arg1-to = the alpha/beta hydrolase family and<br />
arg2 = [a catalytic triad (ASP108, HIS272, and GLU132)]/ENZ ACT<br />
Table 8.3: Catalytic triad residues available from the mined functional annotations. The active site<br />
residues were identified by a search for the term “catalytic triad” in the mined functional annotation<br />
data. The validity was also confirmed by comparison with CSA.<br />
8.3.3 Search for novel catalytic residues<br />
In the previous section, the combined dataset was evaluated by cross-validation with CSA.<br />
Thus the identified catalytic residues represent only re-discoveries of known data. The<br />
goal in this section is to search for novel catalytic residues by combining enzyme-related<br />
annotations with mined 3D patterns.<br />
A set analysis <strong>of</strong> CSA, OLDFIELD, and OPR revealed that 2,205 residues<br />
are included in both OLDFIELD and OPR, but not in CSA (cf. figure 8.3). A search for<br />
the term “catalytic triad” in the sentences associated with these 2,205 residues yielded a<br />
subset <strong>of</strong> 24 residues. None <strong>of</strong> the 24 residues was found in<br />
the mined 3D patterns. However, 15 <strong>of</strong> the 24 residues have enzyme-related <strong>annotation</strong>s<br />
(cf. table F.1), suggesting that they are catalytic residues. A manual analysis confirmed<br />
that the <strong>annotation</strong>s contain valid evidence to identify these residues as catalytic.<br />
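The selection described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this study: the residue identifiers, the toy sentences, and the function names `novel_candidates` and `mentioning` are all hypothetical.<br />

```python
def novel_candidates(oldfield, opr, csa):
    """Residues found in both OLDFIELD and OPR but not listed in CSA."""
    return (oldfield & opr) - csa


def mentioning(residues, sentences, term="catalytic triad"):
    """Keep residues with at least one associated sentence containing the term."""
    needle = term.lower()
    return {
        residue
        for residue in residues
        if any(needle in s.lower() for s in sentences.get(residue, ()))
    }


# Hypothetical toy data: identifiers and sentences are illustrative only.
oldfield = {"P1:S80", "P1:H235", "P2:D108", "P3:K12"}
opr = {"P1:S80", "P2:D108", "P3:K12", "P4:E9"}
csa = {"P1:S80"}
sentences = {
    "P2:D108": ["The enzyme contains a catalytic triad (Asp108, His272, Glu132)."],
    "P3:K12": ["Lys12 is located at the dimer interface."],
}

candidates = novel_candidates(oldfield, opr, csa)
triad_hits = mentioning(candidates, sentences)
print(sorted(candidates))
print(sorted(triad_hits))
```

Representing each dataset as a set of residue identifiers keeps the comparison with CSA a single set expression. The term filter is a plain case-insensitive substring match, which is an assumption about how the search was carried out.<br />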
The results <strong>of</strong> this study indicate that MEDLINE extraction can find some additional<br />
catalytic residues that are not represented in CSA. However, no correlation with the mined<br />
3D patterns was found, and the <strong>functional</strong> <strong>annotation</strong>s were not interpreted in a structural<br />
context.<br />
8.3.4 General correlation between <strong>predicted</strong> <strong>functional</strong><br />
<strong>sites</strong> and extracted <strong>functional</strong> <strong>annotation</strong>s<br />
Previously, the validation <strong>of</strong> <strong>predicted</strong> <strong>active</strong> <strong>sites</strong> was studied by cross-validation against known<br />
catalytic residues. This section studies a more general correlation between structure<br />
and function data. Because the coverage <strong>of</strong> extracted <strong>functional</strong> <strong>annotation</strong>s<br />
<strong>of</strong> protein residues is too low to annotate every residue <strong>of</strong> a prediction,<br />
we cannot expect all residues in one prediction to be annotated with a description <strong>of</strong><br />
biological function. However, if a <strong>predicted</strong> <strong>functional</strong> site has some features which point<br />
to a common concept <strong>of</strong> function, then this can be used to prioritise the prediction.<br />
Table 8.4 (left panel) shows the top 25 mined structural patterns, ranked<br />
by the number <strong>of</strong> distinct residues with PAS data. In total, 168 patterns have <strong>annotation</strong>s,<br />
ranging from one to a maximum <strong>of</strong> nine distinct annotated residues. Another<br />
view relates the number <strong>of</strong> annotated residues to the total<br />
number <strong>of</strong> residues in a prediction (cf. table 8.4, right panel). This gives an indication <strong>of</strong><br />
how frequent a pattern is and how much we know about each residue from the text-mined<br />
data.<br />
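The two rankings can be illustrated with a short Python sketch that uses four rows copied from table 8.4. The names `PatternRow` and `coverage` are hypothetical and chosen for this example only; the sketch shows the ranking idea, not the original analysis code.<br />

```python
from typing import NamedTuple


class PatternRow(NamedTuple):
    pattern: str
    annotated: int  # A: number of residues with PAS data
    total: int      # B: number of residues in the pattern


# Four rows taken from table 8.4.
rows = [
    PatternRow("9 10 16 CYS CYS PHE-1", 6, 12),
    PatternRow("10 15 11 ASP HIS TRP-2", 4, 18),
    PatternRow("10 11 11 ALA HIS HIS-1", 4, 6),
    PatternRow("17 11 9 ALA LEU VAL-1", 3, 102),
]


def coverage(row):
    """Fraction of the pattern's residues that carry an annotation (A/B)."""
    return row.annotated / row.total


# Left panel: rank by the absolute number of annotated residues (A).
by_count = sorted(rows, key=lambda r: r.annotated, reverse=True)
# Right panel: rank by the annotated fraction (A/B).
by_coverage = sorted(rows, key=coverage, reverse=True)

for row in by_coverage:
    print(row.pattern, row.annotated, row.total, round(coverage(row), 4))
```

Ranking by A favours patterns with many annotated residues in absolute terms, while ranking by A/B favours patterns for which a large share of the residues is annotated, so a frequent pattern such as ALA LEU VAL-1 drops in the second view.<br />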
The extraction <strong>of</strong> biological features from text for protein residues yields matches to a number<br />
<strong>of</strong> different proteins, including homologous proteins. So far, the <strong>annotation</strong> <strong>of</strong> residues<br />
in a <strong>predicted</strong> <strong>functional</strong> site considered only first-level information (<strong>annotation</strong>s for the exact<br />
protein). However, the correlation analysis can also exploit information from homologous<br />
proteins (second-level information). Based on the information from the Homology-derived<br />
Secondary Structure <strong>of</strong> Proteins (HSSP) database [SS96], the <strong>annotation</strong> <strong>of</strong> the prediction<br />
was expanded with information extracted from homologues. The result <strong>of</strong> this study shows<br />
that the number <strong>of</strong> residue <strong>annotation</strong>s increases by 10% (cf. table 8.5). A control analysis<br />
<strong>of</strong> how many residues in the non-redundant protein dataset OLDFIELD are identified<br />
in MEDLINE, and how many <strong>of</strong> these are associated with PAS data, indicates that<br />
the low recall <strong>of</strong> the developed text mining system is the reason for the weak <strong>annotation</strong><br />
Left panel (ranked by A):<br />
#residues with PAS (A) | Pattern | #residues in pattern (B) | A/B<br />
6 | 9 10 16 CYS CYS PHE-1 | 12 | 0.5<br />
4 | 10 15 11 ASP HIS TRP-2 | 18 | 0.2222<br />
4 | 10 11 20 HIS MET PHE-1 | 12 | 0.3333<br />
4 | 9 18 11 GLY MET TYR-1 | 12 | 0.3333<br />
4 | 9 11 17 ALA LEU VAL-1 | 30 | 0.1333<br />
4 | 8 9 10 CYS CYS HIS-1 | 12 | 0.3333<br />
4 | 11 8 18 HIS HIS SER-1 | 12 | 0.3333<br />
4 | 11 18 9 CYS ILE PHE-1 | 12 | 0.3333<br />
4 | 11 11 12 HIS HIS MET-1 | 21 | 0.1905<br />
4 | 9 15 11 GLN LEU TRP-2 | 6 | 0.6667<br />
4 | 10 15 11 ASP HIS TRP-1 | 15 | 0.2667<br />
4 | 10 11 11 ALA HIS HIS-1 | 6 | 0.6667<br />
4 | 20 9 11 ASP GLY MET-1 | 12 | 0.3333<br />
4 | 18 10 10 ASP CYS PHE-1 | 12 | 0.3333<br />
4 | 19 11 10 ASP CYS ILE-1 | 12 | 0.3333<br />
4 | 11 14 7 ASP MET SER-1 | 18 | 0.2222<br />
4 | 9 17 10 ALA ILE PHE-1 | 18 | 0.2222<br />
3 | 9 10 8 CYS HIS MET-1 | 9 | 0.3333<br />
3 | 10 13 10 CYS PHE TYR-1 | 6 | 0.5<br />
3 | 21 11 10 CYS GLY VAL-1 | 21 | 0.1429<br />
3 | 11 9 9 ASP MET SER-1 | 15 | 0.2<br />
3 | 17 11 9 ALA LEU VAL-1 | 102 | 0.0294<br />
3 | 10 10 19 ALA HIS MET-1 | 18 | 0.1667<br />
3 | 8 8 15 ASP HIS SER-1 | 33 | 0.0909<br />
3 | 10 9 11 CYS VAL VAL-1 | 33 | 0.0909<br />
Right panel (ranked by A/B):<br />
#residues with PAS (A) | Pattern | #residues in pattern (B) | A/B<br />
4 | 10 11 11 ALA HIS HIS-1 | 6 | 0.6667<br />
4 | 9 15 11 GLN LEU TRP-2 | 6 | 0.6667<br />
6 | 9 10 16 CYS CYS PHE-1 | 12 | 0.5<br />
3 | 10 13 10 CYS PHE TYR-1 | 6 | 0.5<br />
4 | 10 11 20 HIS MET PHE-1 | 12 | 0.3333<br />
4 | 11 18 9 CYS ILE PHE-1 | 12 | 0.3333<br />
4 | 11 8 18 HIS HIS SER-1 | 12 | 0.3333<br />
4 | 18 10 10 ASP CYS PHE-1 | 12 | 0.3333<br />
4 | 19 11 10 ASP CYS ILE-1 | 12 | 0.3333<br />
4 | 20 9 11 ASP GLY MET-1 | 12 | 0.3333<br />
4 | 8 9 10 CYS CYS HIS-1 | 12 | 0.3333<br />
4 | 9 18 11 GLY MET TYR-1 | 12 | 0.3333<br />
3 | 9 10 8 CYS HIS MET-1 | 9 | 0.3333<br />
2 | 11 13 9 ASN LYS SER-1 | 6 | 0.3333<br />
2 | 11 14 8 ALA ARG ASN-2 | 6 | 0.3333<br />
2 | 11 17 10 CYS PHE PRO-1 | 6 | 0.3333<br />
2 | 18 10 11 ARG GLU PRO-1 | 6 | 0.3333<br />
2 | 19 9 11 ALA PRO TYR-1 | 6 | 0.3333<br />
2 | 9 11 9 ASP CYS LYS-1 | 6 | 0.3333<br />
1 | 10 10 20 HIS PRO TYR-1 | 3 | 0.3333<br />
1 | 10 12 11 ILE LEU PHE-1 | 3 | 0.3333<br />
1 | 14 8 7 ASP HIS SER-1 | 3 | 0.3333<br />
1 | 8 11 17 GLU THR THR-1 | 3 | 0.3333<br />
4 | 10 15 11 ASP HIS TRP-1 | 15 | 0.2667<br />
4 | 10 15 11 ASP HIS TRP-2 | 18 | 0.2222<br />
Table 8.4: Functional <strong>annotation</strong>s <strong>of</strong> protein residues in <strong>predicted</strong> <strong>functional</strong> <strong>sites</strong>. A <strong>functional</strong> site is<br />
<strong>predicted</strong> as a structure pattern that is recurrent among a non-redundant set <strong>of</strong> proteins. The<br />
left panel lists the top 25 patterns ranked by the total number <strong>of</strong> annotated protein residues for each<br />
pattern (A), while the right panel ranks the patterns by the number <strong>of</strong> annotated protein<br />
residues relative to the total number <strong>of</strong> residues found in all structure examples (A/B).<br />
Residue Annotations | -HSSP: OPR | -HSSP: FA | +HSSP: OPR | +HSSP: FA<br />
OLDFIELD | 2,402 | 1,963 | 243 | 192<br />
PATTERN | 168 | 132 | 16 | 19<br />
Table 8.5: Homology-based transfer <strong>of</strong> extracted <strong>functional</strong> <strong>annotation</strong>s for protein residues in the<br />
mined pattern data. Based on the HSSP information the identified protein residues and their associated<br />
<strong>functional</strong> <strong>annotation</strong>s were transferred from homologous proteins to the target proteins and residues in<br />
the mined structure pattern data.<br />
expansion.<br />
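The homology-based transfer can be pictured with the following Python sketch. It assumes that the HSSP alignments have already been reduced to a mapping from each target residue to its equivalent residues in homologous proteins; that mapping, the residue identifiers, and the function name `transfer_annotations` are hypothetical and stand in for the actual data handling, which is not shown here.<br />

```python
from collections import defaultdict


def transfer_annotations(direct, homologues):
    """Combine first-level and second-level annotations per residue.

    direct:     residue -> set of annotations extracted for that exact residue
    homologues: target residue -> equivalent residues in homologous proteins
    """
    expanded = defaultdict(set)
    # First level: annotations found for the exact protein residue.
    for residue, notes in direct.items():
        expanded[residue] |= notes
    # Second level: annotations inherited from aligned homologous residues.
    for target, sources in homologues.items():
        for source in sources:
            expanded[target] |= direct.get(source, set())
    # Residues that received no annotation at all are dropped.
    return {residue: notes for residue, notes in expanded.items() if notes}


# Hypothetical toy data: identifiers and annotations are illustrative only.
direct = {"P1:S80": {"catalytic triad"}, "P2:S77": {"nucleophile"}}
homologues = {"P3:S81": ["P1:S80", "P2:S77"], "P4:K10": ["P5:K11"]}

expanded = transfer_annotations(direct, homologues)
gained = sorted(set(expanded) - set(direct))
print(gained)
```

A residue gains annotations only if at least one aligned homologous residue was itself annotated, which is why the size of the expansion depends directly on the recall of the text mining step.<br />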
In conclusion, a general correlation between protein structure and function data is<br />
found in this study. The set <strong>of</strong> available <strong>annotation</strong>s for protein residues is an indication<br />
<strong>of</strong> biological function for a <strong>predicted</strong> <strong>functional</strong> site. The biological significance <strong>of</strong> this<br />
result is being investigated further.<br />
8.4 Discussion<br />
The distribution <strong>of</strong> information in the combined data was studied by a search for <strong>active</strong><br />
site residues. Another approach in sampling the dataset is the identification <strong>of</strong> ligand<br />
binding residues. A search can be done from the protein structure data, by selecting only<br />
residues <strong>of</strong> an identified metal binding site, and then consulting the literature for relevant<br />
<strong>annotation</strong>s.<br />
The validation <strong>of</strong> a <strong>predicted</strong> <strong>active</strong> site in this study demonstrates that the amount<br />
<strong>of</strong> extracted <strong>functional</strong> <strong>annotation</strong>s was not sufficient for this task. Considering that<br />
the catalytic triad is a well characterised structural feature, the information should be<br />
available in MEDLINE. In fact, by searching for the term “catalytic triad” in the text-mined<br />
data, several associations between the term and residues can be found. A close<br />
examination reveals that some are <strong>annotation</strong>s for homologous proteins with the<br />
Asp-His-Ser catalytic triad motif (data not shown). However, the results <strong>of</strong> the presented<br />
studies indicate that the recall <strong>of</strong> the text mining system is too low to capture sufficient<br />
<strong>annotation</strong>s for protein homologues.<br />
Despite the identification <strong>of</strong> some catalytic residues in this analysis, it must be noted<br />
that literature-based verification <strong>of</strong> <strong>predicted</strong> <strong>active</strong> <strong>sites</strong> cannot rule out<br />
false positives. The absence <strong>of</strong> biological evidence in the literature does not mean that<br />
the prediction is wrong, only that no knowledge is currently available. Biological<br />
research is hypothesis-driven, and therefore <strong>predicted</strong> <strong>active</strong> site residues<br />
that have not been a research target are not expected to be reported in the literature.<br />
8.5 Conclusion<br />
In this chapter I performed a correlation analysis between the datasets from protein structure<br />
data mining and literature mining. The results suggest that the<br />
combined data show little correlation. For example, a structure-based prediction <strong>of</strong> an<br />
<strong>active</strong> site had no <strong>functional</strong> <strong>annotation</strong>s with biological evidence, even though the prediction was<br />
cross-validated with CSA. Conversely, the literature-based identification <strong>of</strong> catalytic residues<br />
could not be interpreted in an evolutionarily conserved structural context, because data<br />
mining did not find a suitable recurrent structure pattern.<br />
Chapter 9<br />
Conclusions and future work<br />
9.1 Summary <strong>of</strong> main contributions<br />
The goal <strong>of</strong> this thesis was to identify <strong>functional</strong> <strong>sites</strong> in proteins. For this purpose a<br />
novel approach that combines protein structure data mining and literature mining was<br />
used. Below is a summary <strong>of</strong> contributions.<br />
Significance testing <strong>of</strong> residue interaction is a novel approach to identify statistically<br />
significant spatial and chemical configurations <strong>of</strong> residues. The developed<br />
method relies solely on mathematical models, and the analysis shows that recurrent<br />
homologous or convergent structural features can be extracted. More importantly,<br />
the mined result contains biologically valid data. For example, 22 proteins with the<br />
catalytic triad were identified in cross-validation studies. Altogether, the developed<br />
data mining method can be used to discover novel information; the result is a<br />
prediction <strong>of</strong> <strong>functional</strong> <strong>sites</strong>.<br />
Identification <strong>of</strong> protein residues is an important text mining component developed<br />
in this study for the extraction <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s. The implemented solution<br />
uses regular expression patterns and terminology lists from UniProtKB and<br />
NCBI Taxonomy to find and associate biological entities. Ultimately, an<br />
identified protein residue is mapped to a UniProt protein, which means that other extracted<br />
information can be integrated into UniProtKB. Residues are identified and associated<br />
with their UniProt proteins with a precision <strong>of</strong> 0.82 and a recall <strong>of</strong> 0.38.<br />
In an analysis <strong>of</strong> the whole <strong>of</strong> MEDLINE, 15,110 abstracts were found that<br />
can be used for information extraction for 2,884 UniProtKB/PDB proteins.<br />
Contextual feature extraction is a discovery-driven information extraction approach<br />
to find descriptions <strong>of</strong> function associated with a residue entity in the text. The developed<br />
method extracts, from a parsed sentence, the verbal and prepositional relations<br />
<strong>of</strong> a residue and its contextual features. The Gene Ontology was not used, because<br />
it does not contain suitable terminology for identifying <strong>functional</strong> descriptions<br />
<strong>of</strong> residues. With a precision <strong>of</strong> 0.68 and a recall <strong>of</strong> 0.48, the language parser<br />
found 46,750 <strong>annotation</strong>s for the identified protein residues from MEDLINE. Manual<br />
analysis indicates that some <strong>of</strong> the extracted <strong>annotation</strong>s are valid and contain<br />
novel information that can be used to update the feature table in UniProtKB.<br />
Annotation <strong>of</strong> protein structures is the main objective <strong>of</strong> this thesis. The goal is to<br />
create a synthesis between protein structure data and protein function data. The<br />
hypothesis is that the intersection <strong>of</strong> information from both datasets can lead to<br />
the discovery <strong>of</strong> new biological information. For example, a <strong>predicted</strong> <strong>active</strong> site can<br />
be validated with evidence from the set <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s. Although cross-validation<br />
demonstrates that the information mined from PDB and from the literature contains<br />
correct results, no correlation was found between the two datasets. Nevertheless, the<br />
text-mined information is valid, and 1,391 catalytic residues were found that can<br />
be used to update CSA.<br />
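The residue identification step summarised above relies on regular expression patterns. The following Python sketch shows the general idea for three-letter residue mentions such as “Ser80”. It is an illustration under stated assumptions and not the patterns developed in this thesis: the expression, the separator rule, and the function name `find_residues` are hypothetical, and the subsequent mapping to a UniProt protein is not shown.<br />

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Accept title-case ("Ser") and upper-case ("SER") spellings, followed by an
# optional hyphen or space and the sequence position.
_names = list(THREE_TO_ONE) + [name.upper() for name in THREE_TO_ONE]
RESIDUE = re.compile(r"\b(" + "|".join(_names) + r")[- ]?(\d+)\b")


def find_residues(text):
    """Return residue mentions in text order as one-letter code plus position."""
    return [
        THREE_TO_ONE[match.group(1).capitalize()] + match.group(2)
        for match in RESIDUE.finditer(text)
    ]


sentence = (
    "Our results yielded further support for an enzymatic mechanism involving "
    "the catalytic triad Ser80, His235, and Asp207 as a general acid/base."
)
print(find_residues(sentence))
```

Applied to the first sentence in table 8.3, the sketch yields the residue identifiers S80, H235, and D207. A motif name such as “Asp-His-Ser” is not matched, because no sequence position follows the amino acid name.<br />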
9.2 Limitations and future work<br />
During the work on this thesis, various research techniques and three major analysis<br />
components were developed. Their algorithms and implementations were explained,<br />
their performance was analysed, and suggestions for improvement were made. What<br />
follows is a discussion <strong>of</strong> possible improvements to the combined dataset analysis.<br />
To biologically validate a <strong>predicted</strong> <strong>functional</strong> site with published experimental<br />
results, it has to be assumed that the <strong>functional</strong> <strong>annotation</strong>s extracted from the literature<br />
provide sufficient supporting evidence for a biological function. This has been shown to<br />
be partly correct for some examples. However, it will probably not work in all cases. My<br />
results suggest that other factors have to be considered in order to achieve one <strong>of</strong> the<br />
following: (1) a standardised description <strong>of</strong> the function <strong>of</strong> protein residues; (2) identification<br />
<strong>of</strong> a representative <strong>functional</strong> concept <strong>of</strong> a structural feature; and (3) verification <strong>of</strong> the<br />
validity <strong>of</strong> the pattern as a consensus <strong>functional</strong> site, where the other protein<br />
examples share the same <strong>annotation</strong>s. Although the verification approach draws on the vast<br />
and broad coverage <strong>of</strong> MEDLINE, the analysis indicates that this might<br />
not be sufficient for this task.<br />
Another serious limitation <strong>of</strong> the literature-based verification <strong>of</strong> <strong>functional</strong> <strong>sites</strong> is<br />
that our knowledge <strong>of</strong> the protein function space could be incomplete or<br />
even incorrect. Protein structure data mining aims to deliver biologically unbiased results,<br />
since 3D pattern mining relies on mathematical models and uses no biological knowledge.<br />
The result is a prediction <strong>of</strong> <strong>functional</strong> <strong>sites</strong>. However, the input is biologically<br />
biased. Currently, we do not have complete knowledge <strong>of</strong> the fold space, which<br />
means the observed distribution <strong>of</strong> structural features may be skewed. As a consequence,<br />
the prediction may contain a large fraction <strong>of</strong> false positives. In the long run, various<br />
structural genomics initiatives may expand our knowledge <strong>of</strong> the fold space.<br />
In the meantime, the literature is the main source <strong>of</strong> biological evidence to validate<br />
predictions. Yet our knowledge <strong>of</strong> protein residue function, and even the spectrum <strong>of</strong><br />
biological function, has still to be determined. This can lead to four scenarios: (1) a<br />
true <strong>functional</strong> site is fully supported by evidence (true positive); (2) a true <strong>functional</strong><br />
site is partly supported by evidence (incomplete knowledge); (3) a falsely <strong>predicted</strong><br />
<strong>functional</strong> site is partly supported by evidence (incomplete knowledge); and (4) a falsely<br />
<strong>predicted</strong> <strong>functional</strong> site is fully supported by contradictory evidence (false positive).<br />
While, from a bioinformatics point <strong>of</strong> view, there is little we can do about this problem,<br />
the identification <strong>of</strong> cases (2), (3), and (4) can suggest further biological experiments<br />
to find the missing data.<br />
Bibliography<br />
[AGM + 90]<br />
SF Altschul, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local<br />
alignment search tool. Journal <strong>of</strong> Molecular Biology, 215(3):403–10, 1990.<br />
[AL02]<br />
M Ashburner and SE Lewis. On ontologies for biologists: the gene ontology<br />
- uncoupling the web. Novartis Foundation Symposium, 2002.<br />
[AMS + 97]<br />
SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and<br />
DJ Lipman. Gapped BLAST and PSI-BLAST: a new generation <strong>of</strong> protein<br />
database search programs. Nucleic Acids Research, 25(17):3389–402, 1997.<br />
[APG + 94] PJ Artymiuk, AR Poirrette, HM Grindley, DW Rice, and P Willett. A<br />
graph-theoretic approach to the identification <strong>of</strong> three-dimensional patterns<br />
<strong>of</strong> amino acid side-chains in protein structures. Journal <strong>of</strong> Molecular Biology,<br />
243(2):327–44, 1994.<br />
[Att02]<br />
TK Attwood. The PRINTS database: a resource for identification <strong>of</strong> protein<br />
families. Brief Bioinform, 3(3):252–63, 2002.<br />
[AZP + 05]<br />
G Ausiello, A Zanzoni, D Peluso, A Via, and M Helmer-Citterich. pdbFun:<br />
mass selection and fast comparison <strong>of</strong> annotated PDB residues. Nucleic<br />
Acids Research, 33:W133–137, Jul 2005.<br />
[BFL04]<br />
T Binkowski, P Freeman, and J Liang. pvSOAR: detecting similar surface<br />
patterns <strong>of</strong> pocket and void surfaces <strong>of</strong> amino acid residues on proteins.<br />
Nucleic Acids Research, 32:555–558, 2004.<br />
[BFW + 94]<br />
A Barth, K Frost, M Wahab, W Brandt, HD Schadler, and R Franke. Classification<br />
<strong>of</strong> serine proteases derived from steric comparisons <strong>of</strong> their <strong>active</strong><br />
<strong>sites</strong>, part ii: ”ser, his, asp arrangements in proteolytic and nonproteolytic<br />
proteins”. Drug Design Discovery, 2:89–111, November 1994.<br />
[BGH + 00]<br />
WC Barker, JS Garavelli, H Huang, PB Mcgarvey, BC Orcutt, GY Srinivasarao,<br />
C Xiao, LL Yeh, RS Ledley, JF Janda, F Pfeiffer, HW Mewes,<br />
A Tsugita, and C Wu. The protein information resource (pir). Nucleic<br />
Acids Research, 28(1):41–44, January 2000.<br />
[BKL00]<br />
SE Brenner, P Koehl, and M Levitt. The astral compendium for protein<br />
structure and sequence analysis. Nucleic Acids Research, 28(1):254–256,<br />
January 2000.<br />
[BLK + 08]<br />
E Beisswanger, V Lee, JJ Kim, D Rebholz-Schuhmann, A Splendiani,<br />
O Dameron, S Schulz, and U Hahn. Gene regulation ontology (gro): design<br />
principles and use cases. Studies in health technology and informatics,<br />
136:9–14, 2008.<br />
[BM05]<br />
R Bunescu and RJ Mooney. A shortest path dependency kernel for relation<br />
extraction. In Proceedings <strong>of</strong> the Joint Conference on Human Language<br />
Technology / Empirical Methods in Natural Language Processing<br />
(HLT/EMNLP’05), 2005.<br />
[BM06]<br />
R Bunescu and RJ Mooney. Subsequence kernels for relation extraction. In<br />
Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information<br />
Processing Systems 18, pages 171–178. MIT Press, 2006.<br />
[BMC08] BMC. Biomed central. http://www.biomedcentral.com/, November 2008.<br />
[BT03]<br />
JA Barker and JM Thornton. An algorithm for constraint-based structural<br />
template matching: application to 3D templates with statistical analysis.<br />
Bioinformatics, 19(13):1644–1649, September 2003.<br />
[BW03]<br />
PE Bourne and H Weissig. Structural Bioinformatics (Methods <strong>of</strong> Biochemical<br />
Analysis, V. 44). Wiley-Liss, 1 edition, February 2003.<br />
[BW05]<br />
CJO Baker and R Witte. Mutation miner - textual <strong>annotation</strong> <strong>of</strong> protein<br />
structures. CERMM Symposium, 2005.<br />
[BWF + 00]<br />
HM Berman, J Westbrook, Z Feng, G Gilliland, TN Bhat, H Weissig,<br />
IN Shindyalov, and PE Bourne. The protein data bank. Nucleic Acids<br />
Research, 28(1):235–242, January 2000.<br />
[CB94]<br />
RR Copley and GJ Barton. A structural analysis <strong>of</strong> phosphate and sulphate<br />
binding <strong>sites</strong> in proteins. Estimation <strong>of</strong> propensities for binding and conservation<br />
<strong>of</strong> phosphate binding <strong>sites</strong>. Journal <strong>of</strong> Molecular Biology, 242:321–<br />
329, Sep 1994.<br />
[CCR + 08]<br />
BL Cantarel, PM Coutinho, C Rancurel, T Bernard, V Lombard, and<br />
B Henrissat. The Carbohydrate-Active EnZymes database (CAZy): an<br />
expert resource for Glycogenomics. Nucleic Acids Research, Oct 2008.<br />
[Cer00]<br />
F Cerbah. Exogenous and endogenous approaches to semantic categorization<br />
<strong>of</strong> unknown technical terms. In Proceedings <strong>of</strong> the 18th International<br />
Conference on Computational Linguistics (COLING), pages 145–151,<br />
2000.<br />
[CFK + 05]<br />
BY Chen, VY F<strong>of</strong>anov, DM Kristensen, M Kimmel, O Lichtarge, and<br />
LE Kavraki. Algorithms for structural comparison and statistical analysis<br />
<strong>of</strong> 3D protein motifs. Pacific Symposium on Biocomputing, pages 334–345,<br />
2005.<br />
[Cha93]<br />
P Chakrabarti. Anion binding <strong>sites</strong> in protein structures. Journal <strong>of</strong> Molecular<br />
Biology, 234:463–482, Nov 1993.<br />
[CHR + 02]<br />
JM Castagnetto, SW Hennessy, VA Roberts, ED Getz<strong>of</strong>f, JA Tainer, and<br />
ME Pique. Mdb: the metalloprotein database and browser at the scripps<br />
research institute. Nucleic Acids Research, 30(1):379–382, January 2002.<br />
[CK06] IG Choi and SH Kim. Evolution <strong>of</strong> protein structural classes and protein<br />
sequence families. Proceedings <strong>of</strong> the National Academy <strong>of</strong> Sciences,<br />
September 2006.<br />
[CL64]<br />
RV Cochran and LH Lund. On the kirkwood superposition approximation.<br />
Journal <strong>of</strong> Physical Chemistry, 1964.<br />
[CMP05]<br />
J Crim, R McDonald, and F Pereira. <strong>Automatic</strong>ally annotating documents<br />
with normalized gene lists. BMC Bioinformatics, 6 Suppl 1, 2005.<br />
[CMR06]<br />
P Corbett and P Murray-Rust. High-throughput identification <strong>of</strong> chemistry<br />
in life science texts. In Computational Life Sciences II, pages 107–118.<br />
Springer, 2006.<br />
[CSL + 06]<br />
FM Couto, MJ Silva, V Lee, E Dimmer, E Camon, R Apweiler, H Kirsch,<br />
and D Rebholz-Schuhmann. Goannotator: linking protein go <strong>annotation</strong>s<br />
to evidence text. Journal <strong>of</strong> Biomedical Discovery and Collaboration, 1:19+,<br />
December 2006.<br />
[DBAD03]<br />
R Day, DA Beck, RS Armen, and V Daggett. A consensus view <strong>of</strong> fold<br />
space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein<br />
Science, 12:2150–2160, Oct 2003.<br />
[DCG + 04]<br />
F Diella, S Cameron, C Gemuend, R Linding, A Via, B Kuster, ST Ponten,<br />
N Blom, and TJ Gibson. Phospho.elm: a database <strong>of</strong> experimentally verified<br />
phosphorylation <strong>sites</strong> in eukaryotic proteins. BMC Bioinformatics, 5, June<br />
2004.<br />
[DS05]<br />
A Doms and M Schroeder. Gopubmed: exploring pubmed with the gene<br />
ontology. Nucleic Acids Research, 33(Web Server issue), July 2005.<br />
[FGS98]<br />
JS Fetrow, A Godzik, and J Skolnick. Functional analysis <strong>of</strong> the escherichia<br />
coli genome using the sequence-to-structure-to-function paradigm: identification<br />
<strong>of</strong> proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase<br />
activity. Journal <strong>of</strong> Molecular Biology, 282(4):703–711, October<br />
1998.<br />
[FKY + 01]<br />
C Friedman, P Kra, H Yu, M Krauthammer, and A Rzhetsky. Genies: a<br />
natural-language processing system for the extraction <strong>of</strong> molecular pathways<br />
from journal articles. Bioinformatics, 17 Suppl 1, 2001.<br />
[Fri07]<br />
D Frishman. Protein <strong>annotation</strong> at genomic scale: the current status. Chem<br />
Rev, 107(8):3448–3466, August 2007.<br />
[FS98]<br />
JS Fetrow and J Skolnick. Method for prediction <strong>of</strong> protein function from sequence<br />
using the sequence-to-structure-to-function paradigm with application<br />
to glutaredoxins/thioredoxins and T1 ribonucleases. Journal <strong>of</strong> Molecular<br />
Biology, 281(5), September 1998.<br />
[Fuk98]<br />
K Fukuda. Toward information extraction: identifying protein names from<br />
biological papers, 1998.<br />
[FWLN94] D Fischer, H Wolfson, SL Lin, and R Nussinov. Three-dimensional, sequence<br />
order-independent structural comparison <strong>of</strong> a serine protease against<br />
the crystallographic database reveals <strong>active</strong> site similarities: potential implications<br />
to evolution and to protein folding. Protein Science, 3(5):769–778,<br />
May 1994.<br />
[GDAW03]<br />
R Gaizauskas, G Demetriou, PJ Artymiuk, and P Willett. Protein structures<br />
and information extraction from biological texts: the pasta system.<br />
Bioinformatics, 19(1):135–143, January 2003.<br />
[GDO + 05]<br />
A Golovin, D Dimitropoulos, TJ Oldfield, A Rachedi, and K Henrick.<br />
Msdsite: A database search and retrieval system for the analysis and viewing<br />
<strong>of</strong> bound ligands and <strong>active</strong> <strong>sites</strong>. Proteins: Structure, Function, and<br />
Bioinformatics, 58(1):190–199, 2005.<br />
[GH08]<br />
A Golovin and K Henrick. Msdmotif: exploring protein <strong>sites</strong> and motifs.<br />
BMC Bioinformatics, 9(1), 2008.<br />
[GJYLRS08] S Gaudan, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Combining<br />
evidence, specificity, and proximity towards the normalization <strong>of</strong> gene ontology<br />
terms in text. EURASIP journal on bioinformatics & systems biology,<br />
2008.<br />
[Glu91]<br />
JP Glusker. Structural aspects <strong>of</strong> metal liganding to <strong>functional</strong> groups in<br />
proteins. Advances in Protein Chemistry, 42:1–76, 1991.<br />
[GOC06] GOConsortium. The gene ontology (go) project in 2006. Nucleic Acids<br />
Research, 34(Database issue), January 2006.<br />
[GPP + 03]<br />
F Glaser, T Pupko, I Paz, RE Bell, D Bechor-Shental, E Martz, and N Ben-Tal.<br />
ConSurf: identification <strong>of</strong> <strong>functional</strong> regions in proteins by surface-mapping<br />
<strong>of</strong> phylogenetic information. Bioinformatics, 19(1):163–164, January<br />
2003.<br />
[Gue96]<br />
F Guenthner. Electronic lexica and corpora research at cis. CIS Bericht-<br />
96-100, 1996.<br />
[HBB + 08]<br />
N Hulo, A Bairoch, V Bulliard, L Cerutti, BA Cuche, E de Castro,<br />
C Lachaize, PS Langendijk-Genevaux, and CJ Sigrist. The 20 years <strong>of</strong><br />
PROSITE. Nucleic Acids Research, 36:D245–249, Jan 2008.<br />
[HBGK03]<br />
M Hendlich, A Bergner, J Günther, and G Klebe. Relibase: design and<br />
development <strong>of</strong> a database for comprehensive analysis <strong>of</strong> protein-ligand interactions.<br />
Journal <strong>of</strong> Molecular Biology, 326(2):607–620, February 2003.<br />
[HFM + 05]<br />
D Hanisch, K Fundel, HT Mevissen, R Zimmer, and J Fluck. Prominer:<br />
rule-based protein and gene entity recognition. BMC Bioinformatics, 6<br />
Suppl 1, 2005.<br />
[HJ99] C Hadley and DT Jones. A systematic comparison <strong>of</strong> protein structure<br />
classifications: SCOP, CATH and FSSP. Structure, 7:1099–1112, Sep 1999.<br />
[HLC04]<br />
F Horn, AL Lau, and FE Cohen. Automated extraction <strong>of</strong> mutation data<br />
from the literature: application <strong>of</strong> mutext to g protein-coupled receptors and<br />
nuclear hormone receptors. Bioinformatics, 20(4):557–568, March 2004.<br />
[HMBC97]<br />
TJ Hubbard, AG Murzin, SE Brenner, and C Chothia. SCOP: a structural<br />
classification <strong>of</strong> proteins database. Nucleic Acids Research, 25:236–239, Jan<br />
1997.<br />
[HNR + 05]<br />
ZZ Hu, M Narayanaswamy, KE Ravikumar, K Vijay-Shanker, and CH Wu.<br />
Literature mining and database <strong>annotation</strong> <strong>of</strong> protein phosphorylation using<br />
a rule-based system. Bioinformatics, 21(11):2759–2765, June 2005.<br />
[Hob02]<br />
JR Hobbs. Information extraction from biomedical text. Journal <strong>of</strong> Biomedical<br />
Informatics, 35(4):260–264, August 2002.<br />
[HPS + 03]<br />
A Harrison, F Pearl, I Sillitoe, T Slidel, R Mott, JM Thornton, and<br />
CA Orengo. Recognizing the fold <strong>of</strong> a protein structure. Bioinformatics,<br />
19(14):1748–1759, September 2003.<br />
[HS94]<br />
L Holm and C Sander. The fssp database <strong>of</strong> structurally aligned protein<br />
fold families. Nucleic Acids Research, 22(17):3600–3609, September 1994.<br />
[HS96] L Holm and C Sander. Mapping the protein universe. Science,<br />
273(5275):595–603, August 1996.<br />
[HSSS92]<br />
U Hobohm, M Scharf, R Schneider, and C Sander. Selection <strong>of</strong> representative<br />
protein data sets. Protein Science, 1(3):409–417, March 1992.<br />
[HZH + 04]<br />
M Huang, X Zhu, Y Hao, DG Payan, K Qu, and M Li. Discovering patterns<br />
to extract protein-protein interactions from full texts. Bioinformatics,<br />
20(18):3604–3612, December 2004.<br />
[IPGK05]<br />
VA Ivanisenko, SS Pintus, DA Grigorovich, and NA Kolchanov. PDBSite:<br />
a database <strong>of</strong> the 3D structure <strong>of</strong> protein <strong>functional</strong> <strong>sites</strong>. Nucleic Acids<br />
Research, 33:D183–187, Jan 2005.<br />
[JB04]<br />
A Jakulin and I Bratko. Testing the significance <strong>of</strong> attribute interactions.<br />
In In ICML, pages 409–416. ACM Press, 2004.<br />
[JGLRS08] S Jaeger, S Gaudan, U Leser, and D Rebholz-Schuhmann. Integrating<br />
protein-protein interactions and text mining for protein function prediction.<br />
BMC Bioinformatics, 9(Suppl 8), 2008.<br />
[JIDG03]<br />
M Jambon, A Imberty, G Delà c○age, and C Geourjon. A new bioinformatic<br />
approach to detect common 3d <strong>sites</strong> in protein structures. Proteins:<br />
Structure, Function, and Genetics, 52:137–145, 2003.<br />
161
[JK95]<br />
J Justeson and S Katz. Technical terminology: some linguistic properties<br />
and an algorithm for identification in text. Natural Language Engineering,<br />
pages 9–27, 1995.<br />
[KCRB07]<br />
R Kanagasabai, KH Choo, S Ranganathan, and CJ Baker. A workflow for<br />
mutation extraction and structure <strong>annotation</strong>. Journal <strong>of</strong> Bioinformatics<br />
and Computational Biology, 5(6):1319–1337, December 2007.<br />
[KH04]<br />
E Krissinel and K Henrick. Secondary-structure matching (ssm), a new tool<br />
for fast protein structure alignment in three dimensions. Acta Crystallographica<br />
Section D: Biological Crystallography, 60(1):2256–2268, December<br />
2004.<br />
[KJ94]<br />
GJ Kleywegt and TA Jones. Detection, delineation, measurement and display<br />
<strong>of</strong> cavities in macromolecular structures. Acta Crystallographica Section<br />
D: Biological Crystallography, 50(Pt 2):178–185, March 1994.<br />
[Kle99]<br />
GJ Kleywegt. Recognition <strong>of</strong> spatial motifs in protein structures. Journal<br />
<strong>of</strong> Molecular Biololgy, 285(4):1887–1897, January 1999.<br />
[KN03]<br />
K Kinoshita and H Nakamura. Identification <strong>of</strong> protein biochemical functions<br />
by similarity search using the molecular surface database ef-site. Protein<br />
Science, 12(8):1589–1595, August 2003.<br />
[KNT05] A Koike, Y Niwa, and T Takagi. <strong>Automatic</strong> extraction <strong>of</strong> gene/protein<br />
biological functions from biomedical text. Bioinformatics, 21(7):1227–1236,<br />
April 2005.<br />
[KON99] T Kawabata, M Ota, and K Nishikawa. The protein mutant database.<br />
Nucleic Acids Research, 27(1):355–357, January 1999.<br />
162
[Las95]<br />
RA Laskowski. Surfnet: a program for visualizing molecular surfaces, cavities,<br />
and intermolecular interactions. Journal <strong>of</strong> Molecular Biololgy, 13(5),<br />
October 1995.<br />
[LC05] G Leroy and H Chen. Genescene: An ontology-enhanced integration <strong>of</strong><br />
linguistic and co-occurrence based relations in biomedical texts: Research<br />
articles. Journal <strong>of</strong> the American Society for Information Science and Technology,<br />
56(5):457–468, March 2005.<br />
[LCM03] G Leroy, H Chen, and JD Martinez. A shallow parser based on closedclass<br />
words to capture relations in biomedical text. Journal <strong>of</strong> Biomedical<br />
Informatics, pages 145–158, June 2003.<br />
[LEW98]<br />
J Liang, H Edelsbrunner, and C Woodward. Anatomy <strong>of</strong> protein pockets<br />
and cavities: measurement <strong>of</strong> binding site geometry and implications for<br />
ligand design. Protein Science, 7(9):1884–1897, September 1998.<br />
[LHC07] LC Lee, F Horn, and FE Cohen. <strong>Automatic</strong> extraction <strong>of</strong> protein point<br />
mutations using a graph bigram association. PLoS Computational Biology,<br />
3(2):e16+, February 2007.<br />
[LRTV07]<br />
Gonzalo Lopez, Ana Rojas, Michael Tress, and Alfonso Valencia. Assessment<br />
<strong>of</strong> predictions submitted for the CASP7 function prediction category.<br />
Proteins, 69 Suppl 8:165–74, 2007.<br />
[LW91] Y Lamdan and HJ Wolfson. Protein structures and information extraction<br />
from biological texts: the pasta system. Computer Vision and Pattern<br />
Recognition, 1991. Proceedings CVPR ’91., IEEE Computer Society Conference<br />
on, pages 22–27, June 1991.<br />
[Mar05] AC Martin. Mapping pdb chains to uniprotkb entries. Bioinformatics,<br />
21(23):4297–4301, December 2005.<br />
163
[MB99] Y Matsuo and SH Bryant. Identification <strong>of</strong> homologous core structures.<br />
Proteins, 35:70–79, Apr 1999.<br />
[MG03] J McCallum and S Ganesh. Text mining <strong>of</strong> DNA sequence homology<br />
searches. Applied Bioinformatics, 2:59–63, 2003.<br />
[MR03]<br />
S Mika and B Rost. UniqueProt: Creating representative protein sequence<br />
sets. Nucleic Acids Research, 31:3789–3791, Jul 2003.<br />
[MSD08] MSDmapping. Msdmapping. http://www.ebi.ac.uk/msd-as/<br />
MSDMapping/, November 2008.<br />
[MT05] Y Miyao and J Tsujii. Probabilistic disambiguation models for widecoverage<br />
hpsg parsing. In ACL ’05: Proceedings <strong>of</strong> the 43rd Annual Meeting<br />
on Association for Computational Linguistics, pages 83–90. Association for<br />
Computational Linguistics, 2005.<br />
[NBD + 06]<br />
J Natarajan, D Berrar, W Dubitzky, C Hack, Y Zhang, C Desesa,<br />
JR Van Brocklyn, and EG Bremer.<br />
Text mining <strong>of</strong> full-text journal articles<br />
combined with gene expression analysis reveals a relationship between<br />
sphingosine-1-phosphate and invasiveness <strong>of</strong> a glioblastoma cell line. BMC<br />
Bioinformatics, 7:373+, August 2006.<br />
[NED03]<br />
S Novichkova, S Egorov, and N Daraselia. Medscan, a natural language<br />
processing engine for medline abstracts. Bioinformatics, 19(13):1699–1706,<br />
September 2003.<br />
[OCR01]<br />
MJ Ondrechen, JG Clifton, and D Ringe. Thematics: A simple computational<br />
predictor <strong>of</strong> enzyme function from structure.<br />
Proceedings <strong>of</strong> the<br />
National Academy <strong>of</strong> Sciences, 98(22):12473–12478, October 2001.<br />
164
[Old01]<br />
TJ Oldfield. Creating structure features by data mining the PDB to use as<br />
molecular-replacement models. Acta Crystallographica Section D: Biological<br />
Crystallography, 57:1421–1427, Oct 2001.<br />
[Old02]<br />
TJ Oldfield. Data mining the protein data bank: residue interactions. Proteins,<br />
49(4):510–528, December 2002.<br />
[OMJ + 97]<br />
CA Orengo, AD Michie, S Jones, DT Jones, MB Swindells, and JM Thornton.<br />
CATH-a hierarchic classification <strong>of</strong> protein domain structures. Structure,<br />
5:1093–1108, Aug 1997.<br />
[PB06]<br />
BJ Polacco and PC Babbitt. Automated discovery <strong>of</strong> 3d motifs for protein<br />
function <strong>annotation</strong>. Bioinformatics, 22(6):723–730, March 2006.<br />
[PBT04]<br />
CT Porter, GJ Bartlett, and JM Thornton. The Catalytic Site Atlas: a<br />
resource <strong>of</strong> catalytic <strong>sites</strong> and residues identified in enzymes using structural<br />
data. Nucleic Acids Research, 32(Database issue), January 2004.<br />
[PJYLRS08] P Pezik, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Static dictionary<br />
features for term polysemy identification. Building and evaluating<br />
resources for biomedical text mining, LREC Workshop, 2008.<br />
[PKS06] G Pandey, V Kumar, and M Steinbach. Computational approaches for<br />
protein function prediction: A survey. Technical Report 06-028, Department<br />
<strong>of</strong> Computer Science and Engineering, University <strong>of</strong> Minnesota, Twin Cities,<br />
2006.<br />
[Plo08] PloS. Public library <strong>of</strong> science. http://www.plos.org/, November 2008.<br />
[PMC08]<br />
PMC. Pubmed central. http://www.pubmedcentral.nih.gov/, November<br />
2008.<br />
165
[POHS05]<br />
M Pesu, J O’Shea, L Hennighausen, and O Silvennoinen. Identification <strong>of</strong> an<br />
acquired mutation in Jak2 provides molecular insights into the pathogenesis<br />
<strong>of</strong> myeloproliferative disorders.<br />
Molecular Interventions, 5:211–215, Aug<br />
2005.<br />
[RMK + 07]<br />
ND Rawlings, FR Morton, CY Kok, J Kong, and AJ Barrett. Merops: the<br />
peptidase database.<br />
Nucleic Acids Research, pages gkm954+, November<br />
2007.<br />
[Ros99]<br />
B Rost. Twilight zone <strong>of</strong> protein sequence alignments. Protein Engineering<br />
Design and Selection, 12(2):85–94, February 1999.<br />
[RSAG + 08]<br />
D Rebholz-Schuhmann, M Arregui, S Gaudan, H Kirsch, and A Jimeno<br />
Yepes. Text processing through web services: Calling whatizit. Bioinformatics,<br />
2008.<br />
[RSKA + 07]<br />
D Rebholz-Schuhmann, H Kirsch, M Arregui, S Gaudan, M Riethoven, and<br />
P Stoehr. Ebimed-text crunching to gather facts for proteins from medline.<br />
Bioinformatics, 23(2), January 2007.<br />
[RSMA + 04]<br />
D Rebholz-Schuhmann, S Marcel, S Albert, R Tolle, G Casari, and H Kirsch.<br />
<strong>Automatic</strong> extraction <strong>of</strong> mutations from medline and cross-validation with<br />
omim. Nucleic Acids Research, 2004.<br />
[Rus98] RB Russell. Detection <strong>of</strong> protein three-dimensional side-chain patterns:<br />
new examples <strong>of</strong> convergent evolution.<br />
Journal <strong>of</strong> Molecular Biology,<br />
279(5):1211–1227, June 1998.<br />
[SAR + 07]<br />
B Smith, M Ashburner, C Rosse, K Bard, W Bug, W Ceusters, LJ Goldberg,<br />
K Eilbeck, A Ireland, CJ Mungall, N Leontis, P Rocca-Serra, A Ruttenberg,<br />
SA Sansone, RH Scheuermann, N Shah, PL Whetzel, and S Lewis. The<br />
166
OBO Foundry: coordinated evolution <strong>of</strong> ontologies to support biomedical<br />
data integration. Nature Biotechnology, 25(11):1251–5, 2007.<br />
[SB05]<br />
A Schutz and P Buitelaar. Relext: A tool for relation extraction from text<br />
in ontology extension. The Semantic Web - ISWC 2005, pages 593–606,<br />
2005.<br />
[SB06]<br />
J Schuman and S Bergler. Postnominal prepositional phrase attachment<br />
in proteomics. In Proceedings <strong>of</strong> the HLT-NAACL BioNLP Workshop on<br />
Linking Natural Language and Biology. Association for Computational Linguistics,<br />
2006.<br />
[SDC06]<br />
A Sidhu, T Dillon, and E Chang. Unification <strong>of</strong> protein data and knowledge<br />
sources. Knowledge-Based Intelligent Information and Engineering Systems,<br />
pages 728–737, 2006.<br />
[Sin04] A Singer. Maximum entropy formulation <strong>of</strong> the Kirkwood superposition<br />
approximation. Journal <strong>of</strong> Chemical Physics, 121:3657–3666, Aug 2004.<br />
[SPIBA03]<br />
PK Shah, C Perez-Iratxeta, P Bork, and MA Andrade. Information extraction<br />
from full text scientific articles: where are the keywords?<br />
BMC<br />
Bioinformatics, 4(1), May 2003.<br />
[SPNW04]<br />
A Shulman-Peleg, R Nussinov, and HJ Wolfson. Recognition <strong>of</strong> <strong>functional</strong><br />
<strong>sites</strong> in protein structures. Journal <strong>of</strong> Molecular Biololgy, 339(3):607–633,<br />
June 2004.<br />
[SS96] R Schneider and C Sander. The HSSP database <strong>of</strong> protein structuresequence<br />
alignments. Nucleic Acids Research, 24(1):201–5, 1996.<br />
[SSR03]<br />
A Stark, S Sunyaev, and RB Russell. A model for statistical significance <strong>of</strong><br />
local similarities in structure. Journal <strong>of</strong> Molecular Biology, 326(5):1307–<br />
1316, March 2003.<br />
167
[STB06]<br />
MH Saier, CV Tran, and RD Barabote. Tcdb: the transporter classification<br />
database for membrane transport protein analyses and information. Nucleic<br />
Acids Research, 34(Database issue), January 2006.<br />
[SWS + 04]<br />
MJ Schuemie, M Weeber, BJ Schijvenaars, EM van Mulligen, CC van der<br />
Eijk, R Jelier, B Mons, and JA Kors. Distribution <strong>of</strong> information in biomedical<br />
abstracts and full-text publications. Bioinformatics, 20(16):2597–2604,<br />
November 2004.<br />
[SYH + 03]<br />
S Saito, H Yamaguchi, Y Higashimoto, C Chao, Y Xu, AJ Fornace, E Appella,<br />
and CW Anderson. Phosphorylation site interdependence <strong>of</strong> human<br />
p53 post-translational modifications in response to stress. Journal <strong>of</strong> Biological<br />
Chemistry, 278:37536–37544, Sep 2003.<br />
[TCS + 07]<br />
RT Tsai, WC Chou, YS Su, YC Lin, CL Sung, HJ Dai, IT Yeh, W Ku,<br />
TY Sung, and WL Hsu.<br />
Biosmile: A semantic role labeling system for<br />
biomedical verbs using a maximum-entropy model with automatically generated<br />
template features. BMC Bioinformatics, 8:325+, September 2007.<br />
[TMA08]<br />
Y Tsuruoka, J Mcnaught, and S Ananiadou. Normalizing biomedical terms<br />
by minimizing ambiguity and variability. BMC Bioinformatics, 9(Suppl 3),<br />
2008.<br />
[TOT04]<br />
Y Tateisi, T Ohta, and J Tsujii. Annotation <strong>of</strong> predicate-argument structure<br />
on molecular biology text. In First International Joint Conference on Natural<br />
Language Processing In the IJCNLP-04 workshop on Beyond Shallow<br />
Analyses, March 2004.<br />
[TW02]<br />
L Tanabe and WJ Wilbur. Tagging gene and protein names in biomedical<br />
text. Bioinformatics, 18(8):1124–1132, August 2002.<br />
168
[VMMR + 05] S Velankar, P McNeil, V Mittard-Runte, A Suarez, D Barrell, R Apweiler,<br />
and K Henrick.<br />
E-msd: an integrated data resource for bioinformatics.<br />
Nucleic Acids Research, 33(Database issue), January 2005.<br />
[VZHC05] A Via, A Zanzoni, and M Helmer-Citterich. Seq2Struct: a resource for<br />
establishing sequence-structure links. Bioinformatics, 21(4):551–3, 2005.<br />
[WAB + 06]<br />
CH Wu, R Apweiler, A Bairoch, DA Natale, WC Barker, B Boeckmann,<br />
S Ferro, E Gasteiger, H Huang, R Lopez, M Magrane, MJ Martin,<br />
R Mazumder, C O’Donovan, N Redaschi, and B Suzek. The universal<br />
protein resource (uniprot): an expanding universe <strong>of</strong> protein information.<br />
Nucleic Acids Research, 34(Database issue), January 2006.<br />
[WBB + 06]<br />
DL Wheeler, T Barrett, DA Benson, SH Bryant, K Canese, V Chetvernin,<br />
DM Church, M Dicuccio, R Edgar, S Federhen, LY Geer, W Helmberg,<br />
Y Kapustin, DL Kenton, O Khovayko, DJ Lipman, TL Madden, DR Maglott,<br />
J Ostell, KD Pruitt, GD Schuler, LM Schriml, E Sequeira, ST Sherry,<br />
K Sirotkin, A Souvorov, G Starchenko, TO Suzek, R Tatusov, TA Tatusova,<br />
L Wagner, and E Yaschenko.<br />
Database resources <strong>of</strong> the national center<br />
for biotechnology information. Nucleic Acids Research, 34(Database issue),<br />
January 2006.<br />
[WBT97] AC Wallace, N Borkakoti, and JM Thornton. Tess: a geometric hashing<br />
algorithm for deriving 3d coordinate templates for searching structural<br />
databases. application to enzyme <strong>active</strong> <strong>sites</strong>. Protein Science, 6(11):2308–<br />
2323, November 1997.<br />
[WD03]<br />
G Wang and RL Dunbrack. Pisces: a protein sequence culling server. Bioinformatics,<br />
19(12):1589–1591, August 2003.<br />
169
[WK07]<br />
R Witte and T Kappler. Enhanced semantic access to the protein engineering<br />
literature using ontologies populated by text mining. International<br />
Journal <strong>of</strong> Bioinformatics Research and Applications, 2007.<br />
[WR97] HJ Wolfson and I Rigoutsos. Geometric hashing: an overview. Computational<br />
Science and Engineering, IEEE [see also Computing in Science &<br />
Engineering], 4(4):10–21, 1997.<br />
[WSC04] T Wattarujeekrit, PK Shah, and N Collier. Pasbio: predicate-argument<br />
structures for event extraction in molecular biology. BMC Bioinformatics,<br />
5, October 2004.<br />
[YEC + 07]<br />
S Yoon, JC Ebert, EY Chung, G De Micheli, and RB Altman. Clustering<br />
protein environments for function prediction: finding prosite motifs in 3d.<br />
BMC Bioinformatics, 8 Suppl 4, 2007.<br />
[YHF + 02]<br />
H Yu, V Hatzivassiloglou, C Friedman, A Rzhetsky, and WJ Wilbur. <strong>Automatic</strong><br />
extraction <strong>of</strong> gene and protein synonyms from medline and journal<br />
articles. Proceedings <strong>of</strong> the AMIA Symposium, pages 919–923, 2002.<br />
[YLPV07] YL Yip, N Lachenal, V Pillet, and AL Veuthey. Retrieving mutationspecific<br />
information for human proteins in UniProt/Swiss-Prot Knowledgebase.<br />
Journal <strong>of</strong> Bioinformatics and Computational Biology, 5:1215–1231,<br />
Dec 2007.<br />
[YMTT05] A Yakushiji, Y Miyao, Y Tateisi, and J Tsujii. Biomedical information<br />
extraction with predicate-argument structure patterns. In SMBM, 2005.<br />
170
Appendix A<br />
Examples of errors in relation extraction.<br />
Table A.1: Examples of errors in the relation extraction for the detection of<br />
contextual features.<br />
Sentence: “This observation provides a rationale for the reduced electron-transfer efficiency displayed<br />
by the E92K mutant.” (PMID:10089511)<br />
Annotated residue: GLU92<br />
Annotated keywords: reduced electron-transfer efficiency<br />
Annotated PAS:<br />
pred = displayed<br />
arg1 = the reduced electron-transfer efficiency<br />
arg2-by = the E92K mutant<br />
TP shallow parsing:<br />
pred = displayed<br />
arg1 = a rationale<br />
arg1-for = the reduced electron-transfer efficiency<br />
arg2-by = the GLU92 LYS mutant<br />
FP full parsing:<br />
pred = displayed<br />
arg1-by = the GLU92 LYS mutant<br />
Sentence: “An apparent ‘acceptor consensus overlap’ at Ser474 suggests that the mechanism behind<br />
the glycosaminoglycan split of TM may involve a competition for substrate between xylosyltransferase<br />
and N-acetylgalactosaminyltransferase.” (PMID:8216207)<br />
Annotated residue: SER474<br />
Annotated keywords: acceptor consensus overlap<br />
Annotated PAS:<br />
pred = suggests<br />
arg1 = An apparent ‘acceptor consensus overlap’<br />
arg1-at = SER474<br />
arg2 = the mechanism behind the glycosaminoglycan split<br />
arg2-of = TM<br />
FP shallow parsing:<br />
pred = suggests<br />
arg1-at = SER474<br />
arg2 = that the mechanism<br />
arg2-behind = the glycosaminoglycan split<br />
arg2-of =<br />
TP full parsing:<br />
pred = suggests<br />
arg1 = An apparent ‘acceptor consensus overlap’<br />
arg1-at = SER474<br />
arg2 = that the mechanism<br />
arg2-behind = the glycosaminoglycan split<br />
arg2-of = TM<br />
Sentence: “Using this approach, coupled with Edman degradation of the 32PO4-labeled tryptic<br />
peptides, and comparison with tryptic peptides analyzed after labeling normal human<br />
colonic tissues, we identified ser-52 as the major K18 physiologic phosphorylation site.”<br />
(PMID:7523419)<br />
Annotated residue: SER52<br />
Annotated keywords: physiologic phosphorylation site<br />
Annotated PAS:<br />
pred = identified<br />
arg1 = unk<br />
arg2 = SER52<br />
arg2-as = the major K18 phosphorylation phosphorylation site<br />
FP shallow parsing:<br />
pred = identified<br />
arg2 = SER52<br />
arg2-as = the major<br />
FP full parsing:<br />
pred = identified<br />
arg1 = we<br />
arg2 = SER52<br />
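The records in Table A.1 share a simple shape: one predicate plus a list of labelled arguments.<br />
As a minimal illustration (not the extraction system used in this work), the Python sketch below<br />
reads such "key = value" lines into a record and tests whether an annotated residue occurs in any<br />
argument. The names PAS, parse_pas and mentions are hypothetical and were chosen only for this example.<br />

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PAS:
    """One predicate-argument structure, as printed in the table rows."""
    pred: str
    # (role, text) pairs; a list keeps repeated roles and empty slots.
    args: List[Tuple[str, str]] = field(default_factory=list)

    def mentions(self, residue: str) -> bool:
        """Return True if any argument text contains the residue identifier."""
        return any(residue in text for _, text in self.args)


def parse_pas(lines: List[str]) -> PAS:
    """Parse 'key = value' lines; exactly one 'pred' line is assumed."""
    pred = ""
    args: List[Tuple[str, str]] = []
    for line in lines:
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "pred":
            pred = value
        else:
            args.append((key, value))
    return PAS(pred=pred, args=args)


# The shallow-parsing output for the Ser474 sentence in Table A.1.
shallow = parse_pas([
    "pred = suggests",
    "arg1-at = SER474",
    "arg2 = that the mechanism",
    "arg2-behind = the glycosaminoglycan split",
    "arg2-of =",
])
print(shallow.pred, shallow.mentions("SER474"))
```

Keeping the arguments as an ordered list rather than a dictionary is a design choice for this sketch: it preserves empty slots such as the missing arg2-of value, which is the kind of gap the table highlights.<br />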
Appendix B<br />
Examples of extracted functional annotations compared with UniProtKB<br />
Table B.1: Comparison of extracted protein residue annotations from GC with<br />
UniProtKB. Mined functional annotations are listed as PAS, while relevant<br />
information from UniProtKB is reproduced from the feature table (FT) entry<br />
line.<br />
RID+UID: SER15 P53_HUMAN<br />
Sentence: “Previous studies have demonstrated that phosphorylation of human<br />
p53 on serine 15 contributes to protein stabilization after<br />
DNA damage and that this is mediated by the ATM family of kinases.”<br />
(PMID:11865061)<br />
UniProtKB/FT:<br />
SER15 MOD_RES: Phosphoserine; by PRPK<br />
SER15 VARIANT: S->R in a sporadic cancer; somatic mutation.<br />
PAS:<br />
pred = contributes<br />
arg1 =<br />
arg1-on = SER15<br />
arg2 =<br />
arg2-to = protein stabilization<br />
arg2-after = DNA damage and that<br />
RID+UID: GLU189 CP27B_HUMAN, LEU343 CP27B_HUMAN<br />
Sentence: “The R389G mutant was totally inactive, but mutant L343F retained<br />
2.3% of wild-type activity, and mutant E189G retained 22% of wild-type<br />
activity.” (PMID:12050193)<br />
UniProtKB/FT:<br />
GLU189 VARIANT: E-K in VDDR I; 11% of wild-type activity.<br />
LEU343 VARIANT: L->F in VDDR I; 2.3% of wild-type activity.<br />
PAS:<br />
pred = retained<br />
arg1 = but mutant LEU343 PHE<br />
arg2 = 2.3 %<br />
arg2-of = wild-type activity<br />
PAS:<br />
pred = retained<br />
arg1 = and mutant GLU189 GLY<br />
arg2 = 22 %<br />
arg2-of = wild-type activity<br />
RID+UID: CYS260 TGA1_ARATH, CYS266 TGA1_ARATH<br />
Sentence: “Furthermore, site-directed mutagenesis of TGA1 Cys-260 and Cys-266<br />
enables the interaction with NPR1 in yeast and<br />
Arabidopsis.” (PMID:12953119)<br />
UniProtKB/FT:<br />
C260/C266 DISULFID: (potential).<br />
C260 MUTAGEN: C->N; Gain of interaction with NPR1; when associated with S-266.<br />
C266 MUTAGEN: C->S: Gain of interaction with NPR1; when associated with S-260.<br />
PAS:<br />
pred = enables<br />
arg1 = site-directed mutagenesis<br />
arg1-of = TGA1 CYS260 and CYS266<br />
arg2 = the interaction<br />
arg2-with = NPR1<br />
arg2-in = yeast and Arabidopsis<br />
RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO<br />
Sentence: “Direct in vitro kinase assay using GST-fusion proteins of wild-type as well as various mutants<br />
of p25(rum1) demonstrated that MAPK phosphorylates<br />
the N-terminal portion of p25(rum1) and residues Thr13<br />
and Ser19 are major phosphorylation sites for MAPK.”<br />
(PMID:12135491)<br />
UniProtKB/FT:<br />
THR13 MOD_RES: Phosphothreonine; by MAPK<br />
SER19 MOD_RES: Phosphoserine; by MAPK<br />
SER19 MUTAGEN: S->E:reduces activity as a cdc2 inhibitor; when associated with E-13<br />
PAS:<br />
pred = are<br />
arg1 = the N-terminal portion<br />
arg1-of = p25(rum1) and residues THR13 and SER19<br />
arg2 = major phosphorylation sites<br />
arg2-for = MAPK<br />
RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO<br />
Sentence: “Together with the fact that replacement of both Thr13 and Ser19 with<br />
Glu, which mimics the phosphorylated state of these residues, also significantly reduces the activity<br />
of p25(rum1) as a Cdc2 inhibitor, it was suggested that<br />
the phosphorylation of Thr13 and Ser19 negatively regulates<br />
the function of p25(rum1).” (PMID:12135491)<br />
UniProtKB/FT:<br />
THR13 N/A<br />
SER19 N/A<br />
PAS:<br />
pred = suggested<br />
arg2 = that the phosphorylation<br />
arg2-of = THR13 and SER19<br />
PAS:<br />
pred = regulates<br />
arg1 = that the phosphorylation<br />
arg1-of = THR13 and SER19<br />
arg2 = the function<br />
arg2-of = p25(rum1)<br />
RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO<br />
Sentence: “Further evidence indicates that phosphorylation of Thr13<br />
and Ser19 may retain a negative effect on the function of<br />
p25(rum1) even in vivo.” (PMID:12135491)<br />
UniProtKB/FT:<br />
THR13 N/A<br />
SER19 N/A<br />
PAS:<br />
pred = retain<br />
arg1 = that<br />
arg1-of = THR13 and SER19<br />
arg2 = a negative effect<br />
arg2-on = the function<br />
arg2-of = p25(rum1)<br />
RID+UID: GLU55 DHMA_MYCAV, ASP123 DHMA_MYCAV, TRP124 DHMA_MYCAV<br />
Sentence: “Many residues essential for the dehalogenation reaction are conserved<br />
in DhmA; the putative catalytic triad consists of<br />
Asp123, His279, and Asp250, and the putative oxyanion<br />
hole consists of Glu55 and Trp124.” (PMID:12147465)<br />
UniProtKB/FT:<br />
GLU55 N/A<br />
ASP123 ACT_SITE: Nucleophile (by similarity).<br />
TRP124 N/A<br />
PAS:<br />
pred = consists<br />
arg1 = the putative catalytic triad<br />
arg2 =<br />
arg2-of = ASP123<br />
PAS:<br />
pred = consists<br />
arg1 = and the putative oxyanion hole<br />
arg2 =<br />
arg2-of = GLU55 and TRP124<br />
RID+UID: CYS48 THIO_RAT, CYS152 THIO_RAT, CYS73 THIO_RAT<br />
Sentence: “Thus, PrxV mutants lacking Cys(48) or Cys(152) showed<br />
no detectable thioredoxin-dependent peroxidase activity, whereas mutation of<br />
Cys(73) had no effect on activity.” (PMID:10751410)<br />
UniProtKB/FT:<br />
N/A<br />
PAS:<br />
pred = showed<br />
arg1 = CYS48 or CYS152<br />
arg2 = no detectable thioredoxin-dependent peroxidase activity<br />
PAS:<br />
pred = had<br />
arg1 = whereas mutation<br />
arg1-of = CYS73<br />
arg2 = no effect on activity<br />
RID+UID: GLY43 PPCS_HUMAN<br />
Sentence: “Highly conserved ATP binding residues include<br />
Gly43, Ser61, Gly63, Gly66, Phe230, and<br />
Asn258.” (PMID:12906824)<br />
UniProtKB/FT:<br />
N/A<br />
PAS:<br />
pred = include<br />
arg1 = conserved ATP binding residues<br />
arg2 = GLY43<br />
RID+UID: ASN59 PPCS_HUMAN<br />
Sentence: “Highly conserved phosphopantothenate binding residues include<br />
Asn59, Ala179, Ala180, and Asp183 from one<br />
monomer and Arg55’ from the adjacent monomer.” (PMID:12906824)<br />
UniProtKB/FT:<br />
N/A<br />
PAS:<br />
pred = include<br />
arg1 = conserved phosphopantothenate binding residues<br />
arg2 = ASN59<br />
RID+UID: GLU50 SHD_HUMAN, GLU51 SHD_HUMAN<br />
Sentence: “Rab3A binding-defective mutants of rabphilin<br />
(E50A) and Noc2 (E51A) were still localized in the distal<br />
portion of the neurites (where dense-core vesicles had accumulated) in nerve growth factor-differentiated<br />
PC12 cells, the same as the wild-type proteins, whereas Rab27A<br />
binding-defective mutants of rabphilin (E50A/I54A) and<br />
Noc2 (E51A/I55A) were present throughout the cytosol.”<br />
(PMID:14722103)<br />
UniProtKB/FT:<br />
N/A<br />
PAS:<br />
pred = localized<br />
arg1 = Rab3A binding-defective mutants<br />
arg1-of = rabphilin ( GLU50 ALA ) and Noc2 ( GLU51 ALA )<br />
arg2 =<br />
arg2-in = the distal portion<br />
arg2-of = the neurites ( where dense-core vesicles<br />
RID+UID: TRP124 DHMA_MYCAV<br />
Sentence: “Trp124 should be involved in substrate binding and product<br />
(halide) stabilization, while the second halide-stabilizing residue cannot be identified<br />
from a comparison of the DhmA sequence with the sequences of three<br />
dehalogenases with known tertiary structures.” (PMID:12147465)<br />
UniProtKB/FT:<br />
N/A<br />
PAS:<br />
pred = involved<br />
arg1 = TRP124<br />
arg2 =<br />
arg2-in = substrate binding and product (halide) stabilization<br />
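The mined PAS in Table B.1 write point mutations in three-letter form (for example, E189G in the<br />
sentence appears as GLU189 GLY in the PAS). The Python sketch below shows one way such a rewrite<br />
could be done with a regular expression. It is an assumption made for illustration, not the<br />
normalisation procedure used in this work, and the names ONE_TO_THREE, POINT_MUTATION and<br />
normalize_mutations are hypothetical.<br />

```python
import re

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

# Wild-type letter, position, mutant letter, e.g. "E92K".
# Note: this simple pattern also matches names such as "H2A",
# so a real system would need further disambiguation.
POINT_MUTATION = re.compile(
    r"\b([ARNDCQEGHILKMFPSTWYV])(\d+)([ARNDCQEGHILKMFPSTWYV])\b"
)


def normalize_mutations(text: str) -> str:
    """Rewrite one-letter point mutations as 'GLU92 LYS' style tokens."""
    def repl(match):
        wild, position, mutant = match.groups()
        return f"{ONE_TO_THREE[wild]}{position} {ONE_TO_THREE[mutant]}"
    return POINT_MUTATION.sub(repl, text)


print(normalize_mutations("mutant L343F retained 2.3% of wild-type activity"))
```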
Appendix C<br />
Examples of extracted functional annotations for the protein p53<br />
Table C.1: Examples of literature mined annotations of protein residues in<br />
p53. The listed data are grouped by topics.<br />
regulatory PTM<br />
RID+UID<br />
PMID 10930428<br />
PAS<br />
RID+UID<br />
SER6 P53_HUMAN<br />
pred = creased<br />
arg1 = a background<br />
arg1-of = constitutive phosphorylation<br />
arg1-at = SER6 that<br />
arg2 = 10-fold<br />
arg2-upon = upon exposure<br />
arg2-to = either ionizing radiation or UV light<br />
pred = exhibited<br />
arg1 = Untreated A549 cells<br />
arg2 = a background<br />
arg2-of = constitutive phosphorylation<br />
arg2-at = SER6 that<br />
pred = is<br />
arg1 = The relative phosphorylation<br />
arg1-of = THR18<br />
arg1-by = VRK2B<br />
arg2 = similar<br />
arg2-in = magnitude<br />
arg2-to = that induced<br />
arg2-by = taxol<br />
PMID 12487430<br />
PAS<br />
RID+UID<br />
THR18 P53_HUMAN<br />
pred = compared<br />
arg1 = that phosphorylation<br />
arg1-at = THR18 decreased binding<br />
arg1-to = recombinant Mdm2 protein<br />
arg2 =<br />
arg2-with = the unphosphorylated and the two other single phosphorylated analogues<br />
PMID 11030628<br />
PAS<br />
RID+UID<br />
SER46 P53_HUMAN<br />
pred = regulates<br />
arg1 = and phosphorylation<br />
arg1-of = SER46<br />
arg2 = the transcriptional activation<br />
arg2-of = this apoptosis-inducing gene<br />
PMID 11875057<br />
PAS<br />
RID+UID<br />
SER46 P53_HUMAN<br />
pred = hibited<br />
arg1 = IR-induced phosphorylation<br />
arg1-at = SER46<br />
arg2 =<br />
arg2-by = wortmannin<br />
PMID 14757188<br />
PAS<br />
SER15 P53_HUMAN<br />
pred = duce<br />
arg1 =<br />
arg1-in = synergy<br />
arg2 = ATM-mediated phosphorylation<br />
arg2-of = the SER15 site<br />
RID+UID<br />
arg2-of =<br />
PMID 17292432<br />
PAS<br />
RID+UID<br />
SER15 P53_HUMAN<br />
pred = suppressed<br />
arg2 = both NaVO(3)-induced SER15 phosphorylation and accumulation<br />
arg2-of =<br />
PMID 11850826<br />
PAS<br />
RID+UID<br />
SER15 P53_HUMAN<br />
pred = observed<br />
arg1 = Increased phosphorylation<br />
arg1-of = SER15<br />
arg2 =<br />
arg2-in = heat shocked GM638<br />
PMID 10933801<br />
PAS<br />
RID+UID<br />
THR55 P53_HUMAN<br />
pred = define<br />
arg1 = These data<br />
arg2 = THR55<br />
arg2-as = a novel phosphorylation site and<br />
arg2-for = the first time show threonine phosphorylation<br />
arg2-of = human<br />
PMID 15116093<br />
PAS<br />
RID+UID<br />
PMID 9246643<br />
PAS<br />
THR55 P53_HUMAN<br />
pred = clarify<br />
arg1 = This study<br />
arg2 = the biological significance<br />
arg2-of = doxorubicin-induced THR55 phosphorylation<br />
pred = reduced<br />
arg1 = phosphorylation<br />
arg1-at = SER15<br />
arg2 = and phosphorylation<br />
arg2-at = SER392<br />
SER315 P53_HUMAN<br />
pred = reversed<br />
arg1 = but SER315<br />
arg2 = the effect<br />
arg2-of = phosphorylation<br />
arg2-at = SER392<br />
RID+UID<br />
PMID 7926727<br />
PAS<br />
RID+UID<br />
PHE19 P53_HUMAN<br />
pred = are<br />
arg1 = PHE19<br />
arg2 = crucial<br />
arg2-for = the interactions<br />
arg2-between =<br />
SER20 P53_HUMAN<br />
binding activity<br />
PMID 11323395<br />
PAS<br />
RID+UID<br />
pred = play<br />
arg1 =<br />
arg1-of = SER20<br />
arg2 = a key role<br />
arg2-in = the dissociation<br />
arg2-of = mdm2<br />
arg2-in = response<br />
arg2-to = Cr(VI)<br />
PMID 17914575<br />
PAS<br />
RID+UID<br />
CYS135 P53_HUMAN<br />
pred = generates<br />
arg1 = that the amino acid change CYS135˜ARG<br />
arg1-in = the human TP53<br />
arg2 = the loss<br />
arg2-of = TP53 DNA-binding activity<br />
PMID 16784539<br />
PAS<br />
SER315 P53 HUMAN<br />
pred = dephosphorylates<br />
arg1 = both<br />
arg1-in = vitro and<br />
arg1-in = vivo and<br />
arg2 = the SER315 site<br />
arg2-<strong>of</strong> =<br />
RID+UID<br />
PMID 10432310<br />
PAS<br />
RID+UID<br />
SER20 P53 HUMAN<br />
protein-protein-interaction<br />
pred = containing<br />
arg2 = phosphate<br />
arg2-at = SER20 inhibited DO-1 binding<br />
PMID 11960368<br />
PAS<br />
RID+UID<br />
SER166 P53 HUMAN<br />
pred = mutated<br />
arg1 = analysis<br />
arg1-<strong>of</strong> = HDM2 proteins<br />
arg2 =<br />
arg2-at = the consensus Akt recognition <strong>sites</strong><br />
arg2-at = SER166<br />
PMID 11172034<br />
PAS<br />
RID+UID<br />
PMID 7624134<br />
PAS<br />
ARG175 P53 HUMAN<br />
pred = abolish<br />
arg1 = mutations ARG175˜HIS or ARG248˜TRP<br />
arg2 = the association<br />
arg2-<strong>of</strong> =<br />
SER315 P53 HUMAN<br />
pred = abolished<br />
arg1 =<br />
arg1-to = alanine ( p53- SER315˜ALA )<br />
. . . continuation <strong>of</strong> table C.1<br />
arg2 = phosphorylation<br />
arg2-by = cdk2 kinase<br />
RID+UID<br />
PMID 7624134<br />
PAS<br />
RID+UID<br />
SER315 P53 HUMAN<br />
pred = required<br />
arg1 = SER315<br />
arg1-<strong>of</strong> = wtp53<br />
arg2 =<br />
arg2-for = transcriptional activity<br />
arg2-in = vivo<br />
PMID 16818505<br />
PAS<br />
RID+UID<br />
CYS238 P53 HUMAN<br />
pred = retains<br />
arg1 = ( CYS238˜TYR ) mutant<br />
arg2 = <strong>functional</strong> wild-type<br />
PMID 16707427<br />
PAS<br />
ARG175 P53 HUMAN<br />
biological activity<br />
pred = displayed<br />
arg1 = the ARG175˜LEU mutant<br />
arg2 = an attenuated tumor suppressor activity<br />
arg2-in = the regulation<br />
arg2-<strong>of</strong> = transcription<br />
RID+UID<br />
PMID 10616523<br />
PAS<br />
RID+UID<br />
ARG72 P53 HUMAN<br />
disease<br />
pred = suggests<br />
arg1 = The acquisition<br />
arg1-<strong>of</strong> = both mutations ( GLY245˜VAL and ARG72˜PRO )<br />
arg1-in = the transformation<br />
arg1-from = transient leukemia<br />
arg1-to = overt acute megakaryoblastic leukemia<br />
arg2 = a <strong>functional</strong> role<br />
arg2-<strong>of</strong> = mutant<br />
PMID 18181044<br />
PAS<br />
ARG72 P53 HUMAN<br />
pred = sociated<br />
arg1 = the development<br />
arg1-<strong>of</strong> = lung carcinoma and that ARG72˜PRO genotype<br />
arg2 =<br />
arg2-with = a poorer prognosis<br />
arg2-<strong>of</strong> = lung cancer<br />
. . . continuation <strong>of</strong> table C.1<br />
RID+UID<br />
PMID 7761089<br />
PAS<br />
RID+UID<br />
VAL138 P53 HUMAN<br />
molecular stability<br />
pred = showed<br />
arg1 = The human VAL138 mutant<br />
arg2 = temperature-sensitive transformation<br />
arg2-<strong>of</strong> = rat embryo fibroblasts ( REFs )<br />
arg2-in = collaboration assay<br />
arg2-with = activated<br />
PMID 15703170<br />
PAS<br />
ARG249 P53 HUMAN<br />
pred = duce<br />
arg1 = oncogenic mutations HIS168˜ARG and ARG249˜SER<br />
arg2 = substantial structural perturbation<br />
arg2-around = the mutation site<br />
arg2-in = the L2 and L3 loops<br />
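The PAS rows of Table C.1 follow a simple "slot = text" notation. As an illustration only, the sketch below reads such rows back into records; the class name `PAS` and the function `parse_pas` are hypothetical and not part of the described system.<br />

```python
from dataclasses import dataclass, field


@dataclass
class PAS:
    """One predicate-argument structure (hypothetical container)."""
    pred: str
    # (slot, text) pairs in table order; a slot such as "arg1-of" may repeat.
    args: list = field(default_factory=list)


def parse_pas(lines):
    """Group 'slot = text' lines into PAS records; each 'pred' starts a new one."""
    records = []
    for raw in lines:
        key, sep, value = raw.partition("=")
        if not sep:
            continue  # not a 'slot = text' line, e.g. a PMID or a topic label
        key, value = key.strip(), value.strip()
        if key == "pred":
            records.append(PAS(pred=value))
        elif records:
            records[-1].args.append((key, value))
    return records


# Rows copied from the SER15 entry for PMID 10933801 in Table C.1.
example = [
    "pred = observed",
    "arg1 = Increased phosphorylation",
    "arg1-of = SER15",
    "arg2 =",
    "arg2-in = heat shocked GM638",
]
records = parse_pas(example)
```

Arguments are kept as an ordered list of pairs rather than a dictionary because a slot can occur several times within one structure, and empty slots (such as `arg2 =`) are preserved as empty strings.<br />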
Appendix D<br />
Examples <strong>of</strong> extracted <strong>functional</strong><br />
<strong>annotation</strong>s for the protein Jak2<br />
Table D.1: Examples <strong>of</strong> literature-mined <strong>annotation</strong>s <strong>of</strong> protein residues in<br />
Jak2. The listed data are grouped by topic.<br />
disease<br />
PMID 16896569<br />
RID+UID<br />
VAL617 JAK2 HUMAN<br />
pred = improved<br />
arg1 = The improved knowledge<br />
arg1-<strong>of</strong> = the molecular basis<br />
arg1-<strong>of</strong> = the disease because<br />
arg1-<strong>of</strong> = the discovery<br />
arg1-<strong>of</strong> = the VAL617˜PHE mutation<br />
arg1-in = the JAK2 gene<br />
arg2 = the molecular diagnosis and<br />
PMID 16503548<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = is<br />
arg1 = that the JAK2 VAL617˜PHE mutation<br />
arg2 = rare<br />
arg2-in = patients<br />
arg2-with = idiopathic erythrocytosis<br />
PMID 16247455<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = reported<br />
arg1 = A missense somatic mutation<br />
arg1-in = JAK2 gene ( JAK2 VAL617˜PHE )<br />
arg2 =<br />
arg2-in = chronic myeloproliferative disorders<br />
PMID 18024388<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = is<br />
arg1 = The JAK2 VAL617˜PHE point mutation<br />
arg2 = rare<br />
arg2-in = hypereosinophilic syndrome and/or chronic eosinophilic leukemia<br />
PMID 15858187<br />
genetic<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = had<br />
arg1 = All 51 patients<br />
arg1-with = 9pLOH<br />
arg2 = the VAL617˜PHE mutation<br />
pred = is<br />
arg1 = VAL617˜PHE<br />
arg2 = a somatic mutation present<br />
arg2-in = hematopoietic cells<br />
molecular function<br />
. . . continuation <strong>of</strong> table D.1<br />
PMID 15970705<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = sociated<br />
arg1 = JAK2 ( VAL617˜PHE )<br />
arg2 =<br />
arg2-with = constitutive phosphorylation<br />
arg2-<strong>of</strong> = JAK2 and its downstream effectors<br />
arg2-as =<br />
PMID 16239216<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = duces<br />
arg1 = that the homologous VAL617˜PHE mutation<br />
arg2 = activation<br />
arg2-<strong>of</strong> = JAK1 and Tyk2<br />
PMID 16384930<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = link<br />
arg1 = the presence<br />
arg1-in = PV erythroblasts<br />
arg1-<strong>of</strong> = proliferative and antiapoptotic signals that<br />
arg2 = the JAK2 VAL617˜PHE mutation<br />
arg2-with = the inhibition<br />
arg2-<strong>of</strong> = death receptor signaling<br />
PMID 16442619<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = does<br />
arg1 = crease<br />
arg1-<strong>of</strong> = expression and kinase activity<br />
arg1-<strong>of</strong> = JAK2<br />
arg1-in = CML cells<br />
arg2 = result<br />
arg2-from = the JAK2 VAL617˜PHE activation mutation and that transformation<br />
arg2-into = to blast crisis<br />
PMID 16461300<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = sociated<br />
arg1 = the presence<br />
arg1-<strong>of</strong> = the JAK2 VAL617˜PHE mutation<br />
arg2 =<br />
arg2-with = higher platelet activation<br />
PMID 16904848<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = transmit<br />
arg1 = that JAK2 VAL617˜PHE<br />
arg2 = signals<br />
arg2-from = ligand-activated TpoR or EpoR<br />
PMID 15863514<br />
RID+UID<br />
PAS<br />
VAL617 JAK2 HUMAN<br />
pred = changes<br />
arg2 = conserved VAL617˜PHE<br />
arg2-in = the pseudokinase domain<br />
arg2-<strong>of</strong> = JAK2 that<br />
Appendix E<br />
Examples <strong>of</strong> extracted <strong>functional</strong><br />
<strong>annotation</strong>s <strong>of</strong> the category binding<br />
event<br />
Table E.1: Mined <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> protein residues with information<br />
on binding events. The mined information corresponds to 17 protein residues<br />
listed in MSDsite. The extracted information can be used for <strong>functional</strong> <strong>annotation</strong><br />
and validation <strong>of</strong> <strong>predicted</strong> binding <strong>sites</strong> in the database.<br />
Each entry lists the residue and protein identifiers (RID+UID), the source sentence, and the extracted PAS.<br />
T199 CAH2 HUMAN<br />
”The three-dimensional structures <strong>of</strong> azide-bound and sulfate-bound T199V CAIIs were determined<br />
by x-ray crystallographic methods at 2.25 and 2.4 A, respectively (final crystallographic<br />
R factors are 0.173 and 0.174, respectively).” (PMID:8262987)<br />
pred = determined<br />
arg1 = The three-dimensional structures<br />
arg1-<strong>of</strong> = [azide-bound and sulfate-bound THR199 VAL CAIIs]/BINDING<br />
arg2 =<br />
arg2-by = x-ray crystallographic methods<br />
arg2-at = at 2.25 and 2.4 A ,respectively ( final crystallographic<br />
R55 PPIA HUMAN<br />
”On the basis <strong>of</strong> the structure, it is proposed that Arg55 hydrogen-bonds to the nitrogen<br />
to deconjugate the resonance <strong>of</strong> the prolyl amide bond and thus facilitates the cis-trans<br />
rotation.” (PMID:8652511)<br />
pred = proposed<br />
arg2 = [that ARG55 hydrogen-bonds]/BINDING<br />
arg2-to = the nitrogen<br />
pred = deconjugate<br />
arg1 = [that ARG55 hydrogen-bonds]/BINDING<br />
arg1-to = the nitrogen<br />
arg2 = the resonance<br />
arg2-<strong>of</strong> = the prolyl amide bond and<br />
L255 PH4H HUMAN<br />
”Only for the R252Q and L255V mutants were catalytically <strong>active</strong> tetramer and dimer recovered<br />
and for R252G some dimer, i.e. 20% (R252Q, tetramer), 44% (L255V, tetramer)<br />
and 4.4% (R252G, dimer) <strong>of</strong> the activity for the respective wild-type (wt) forms.”<br />
(PMID:9799096)<br />
pred = recovered<br />
arg1 = <strong>active</strong> tetramer and dimer<br />
arg2 = and<br />
arg2-for = [ARG252 GLY some dimer]/BINDING<br />
Y156 HGXR TRIFO<br />
”But the forces involved in recognizing the exocyclic C2-substituents <strong>of</strong> the purine ring, which<br />
involve the Tyr156 hydroxyl, Ile157 backbone carbonyl, and Asp163 side-chain carboxyl, may<br />
be weakened by the shifted conformation <strong>of</strong> the peptide backbone resulted from loss <strong>of</strong> the<br />
Glu11-Arg155 salt bridge.” (PMID:9843428)<br />
pred = resulted<br />
arg1 =<br />
arg1-by = the shifted conformation<br />
arg1-<strong>of</strong> = the peptide backbone<br />
arg2 =<br />
arg2-from = loss<br />
arg2-<strong>of</strong> = [the GLU11 ARG155 salt bridge]/BINDING<br />
K79 HGXR TOXGO<br />
”The Leu78-Lys79 peptide bond in the <strong>active</strong> site adopts the cis configuration, which it must<br />
to bind PRPP or pyrophosphate.” (PMID:10545171)<br />
pred = adopts<br />
arg1 = [The LEU78 LYS79 peptide bond]/BINDING<br />
arg1-in = the <strong>active</strong> site<br />
arg2 = the<br />
G57 FLAV CLOBE<br />
. . . continuation <strong>of</strong> table E.1<br />
”In the Clostridium beijerinckii flavodoxin, the reduction <strong>of</strong> the flavin mononucleotide (FMN)<br />
cofactor is accompanied by a local conformation change in which the Gly57-Asp58 peptide<br />
bond ”flips” from primarily the unusual cis O-down conformation in the oxidized state to<br />
the trans O-up conformation such that a new hydrogen bond can be formed between the<br />
carbonyl group <strong>of</strong> Gly57 and the proton on N(5) <strong>of</strong> the neutral FMN semiquinone radical<br />
[Ludwig, M. L., Pattridge, K. A., Metzger, A. L., Dixon, M. M., Eren, M., Feng, Y., and<br />
Swenson, R. P. (1997) Biochemistry 36, 1259-1280].” (PMID:10353827)<br />
pred = accompanied<br />
arg1 = ) cofactor<br />
arg2 =<br />
arg2-by = a local conformation change<br />
arg2-in = [which the GLY57 ASP58 peptide bond]/BINDING<br />
D160 APX STRGR; M161 APX STRGR; G201 APX STRGR; R202 APX STRGR; F219<br />
APX STRGR<br />
”These studies allowed the tracing <strong>of</strong> the previously disordered region <strong>of</strong> the enzyme (Glu196-<br />
Arg202) and the identification <strong>of</strong> some <strong>of</strong> the <strong>functional</strong> groups <strong>of</strong> the enzyme that are<br />
involved in enzyme-substrate interactions (Asp160, Met161, Gly201, Arg202 and Phe219).”<br />
(PMID:10771423)<br />
pred = involved<br />
arg1 = disordered region<br />
arg1-<strong>of</strong> = the enzyme ( GLU196 ARG202 ) and the identification<br />
arg1-<strong>of</strong> = some<br />
arg1-<strong>of</strong> = the <strong>functional</strong> groups<br />
arg1-<strong>of</strong> = the enzyme that<br />
arg2 =<br />
arg2-in = [enzyme-substrate interactions ( ASP160, MET161, GLY201, ARG202,<br />
PHE219)]/BINDING<br />
I209 FIXL RHIME<br />
”Interaction between the iron-bound O(2) and Ile209 was also observed in the resonance<br />
Raman spectra <strong>of</strong> RmFixLH as evidenced by the fact that the Fe-O(2) and Fe-CN stretching<br />
frequencies were shifted from 575 to 570 cm(-1) (Fe-O(2)), and 504 to 499 cm(-1), respectively,<br />
as the result <strong>of</strong> the replacement <strong>of</strong> Ile209 with an Ala residue.” (PMID:10926518)<br />
pred = observed<br />
arg1 = Interaction<br />
arg1-between = [the iron-bound O(2) and ILE209]/BINDING<br />
arg2 =<br />
arg2-in = the resonance Raman spectra<br />
arg2-<strong>of</strong> = RmFixLH as<br />
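The entries of Table E.1 show the same residue in three spellings: the mention in the sentence (e.g. "Arg55"), the upper-case three-letter form in the PAS ("ARG55"), and the one-letter form in the RID ("R55"). The following is a minimal sketch of converting between these forms, assuming only the 20 standard amino acids; the function names `normalise` and `short_rid` are hypothetical and the sketch is not the normalisation procedure of the described system.<br />

```python
import re

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}

# "Arg55" or "Asp-44" (three-letter name) or "R55" (one-letter name).
MENTION = re.compile(r"([A-Za-z]{3}|[A-Z])-?(\d+)")


def normalise(mention):
    """Return the upper-case three-letter form (e.g. 'ARG55'), or None."""
    match = MENTION.fullmatch(mention)
    if match is None:
        return None
    name, position = match.group(1).upper(), match.group(2)
    if len(name) == 1:
        name = ONE_TO_THREE.get(name)
    elif name not in THREE_TO_ONE:
        name = None
    return f"{name}{position}" if name else None


def short_rid(long_form):
    """Convert 'THR199' to the one-letter RID form 'T199'."""
    return THREE_TO_ONE[long_form[:3]] + long_form[3:]
```

Mutation mentions such as "T199V" are deliberately not handled here and return `None`; they would need a separate pattern.<br />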
Appendix F<br />
Examples <strong>of</strong> extracted <strong>functional</strong><br />
<strong>annotation</strong>s <strong>of</strong> <strong>active</strong> site residues<br />
Table F.1: Identified catalytic triad residues from MEDLINE extraction. The<br />
listed sentences describe the mentioned protein residues as catalytic (co-mention<br />
with the term ”catalytic triad”); however, none <strong>of</strong> them is recorded<br />
in CSA, so the identified information is novel.<br />
Each entry lists the residue and protein identifiers (RID+UID), the source sentence, and the extracted PAS.<br />
D44 TPP2 HUMAN, H264 TPP2 HUMAN, S449 EPHA3 HUMAN<br />
”The amino acids forming the putative catalytic triad (Asp-44, His-264, Ser-449) as well as<br />
the conserved Asn-362, potentially stabilizing the transition state, were replaced by alanine<br />
and the mutated cDNAs were transfected into human embryonic kidney (HEK) 293 cells.”<br />
(PMID:12445476)<br />
pred = forming<br />
arg1 = The amino acids<br />
arg2 = [the putative catalytic triad ( ASP44, HIS264, SER449)]/ENZ ACT<br />
C25 CYSP1 CARCN, H159 CYSP1 CARCN, D175 CYSP1 CARCN<br />
”The seven cysteine residues are aligned with those <strong>of</strong> papain and the catalytic triad<br />
(Cys25, His159, Asn175) <strong>of</strong> all cysteine peptidases <strong>of</strong> the papain family is conserved.”<br />
(PMID:10355634)<br />
pred = aligned<br />
arg1 = The seven CYS+<br />
arg2 =<br />
arg2-with = with those<br />
arg2-<strong>of</strong> = <strong>of</strong> papain and the catalytic triad ( CYS25<br />
C176 NADE MYCTU, E52 NADE MYCTU, K121 NADE MYCTU<br />
”The residues forming the putative catalytic triad (Cys176, Glu52 and Lys121) were replaced<br />
by alanine; the mutated enzymes were expressed in the Escherichia coli Origami (DE3) strain<br />
and purified.” (PMID:15748981)<br />
pred = forming<br />
arg1 = The residues<br />
arg2 = [the putative catalytic triad ( CYS176, GLU52, and LYS121)]/ENZ ACT<br />
S1752 POLG BVDVS<br />
”Our study provides experimental evidence that histidine at position 1658 and aspartic acid<br />
at position 1686 constitute together with the previously identified serine at position 1752<br />
(S1752) the catalytic triad <strong>of</strong> the pestiviral NS3 serine protease.” (PMID:10915606)<br />
pred = identified<br />
arg1 =<br />
arg1-with = the<br />
arg2 = [SER1752 ( S1752 ) the catalytic triad]/ENZ ACT<br />
arg2-<strong>of</strong> = the pestiviral NS3 serine protease.<br />
D167 POLS SFV, H145 POLS SFV, S219 POLS SFV<br />
”After this autoproteolytic cleavage, the free carboxylic group <strong>of</strong> Trp267 interacts with the<br />
catalytic triad (His145, Asp167 and Ser219) and inactivates the enzyme.” (PMID:18177892)<br />
pred = interacts<br />
arg1 = the free carboxylic group<br />
arg1-<strong>of</strong> = TRP267<br />
arg2 =<br />
arg2-with = [the catalytic triad ( HIS145, ASP167, and SER219)]/ENZ ACT<br />
D122 ARY2 RAT<br />
”Substitution <strong>of</strong> the catalytic triad Asp-122 with either alanine or asparagine resulted in the<br />
complete loss <strong>of</strong> protein structural integrity and catalytic activity.” (PMID:15209520)<br />
pred = resulted<br />
arg1 = Substitution<br />
arg1-<strong>of</strong> = the catalytic triad ASP122<br />
arg1-with = either alanine or asparagine<br />
arg2 =<br />
arg2-in = the complete loss<br />
arg2-<strong>of</strong> = [protein structural integrity and catalytic activity]/ENZ ACT<br />
. . . continuation <strong>of</strong> table F.1<br />
D156 LYPA1 HUMAN<br />
”To investigate whether this bridging function occurs in vivo, two transgenic mouse lines<br />
were established expressing a muscle creatine kinase promoter-driven human LPL (hLPL)<br />
minigene mutated in the catalytic triad (Asp156 to Asn).” (PMID:9811888)<br />
pred = mutated<br />
arg1 = ( hLPL ) minigene<br />
arg2 =<br />
arg2-in = [the catalytic triad (ASP156 ASN)]/ENZ ACT<br />
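The selection criterion named in the caption of Table F.1 (co-mention of a residue with the term "catalytic triad", followed by a check against CSA) can be illustrated with a deliberately naive sketch. The regular expression, the function `triad_candidates`, and the empty `known` set standing in for CSA records are all assumptions made for illustration; this is not the extraction pipeline used to produce the table.<br />

```python
import re

RESIDUE = re.compile(
    r"\b(Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|"
    r"Ser|Thr|Trp|Tyr|Val)-?(\d+)\b"
)


def triad_candidates(sentence):
    """Return all residue mentions in a sentence that contains 'catalytic triad'."""
    if "catalytic triad" not in sentence.lower():
        return []
    return [f"{name.upper()}{pos}" for name, pos in RESIDUE.findall(sentence)]


# Sentence quoted in Table F.1 (PMID 18177892).
sentence = (
    "After this autoproteolytic cleavage, the free carboxylic group of "
    "Trp267 interacts with the catalytic triad (His145, Asp167 and "
    "Ser219) and inactivates the enzyme."
)
candidates = triad_candidates(sentence)

# Hypothetical stand-in for residues already recorded in a reference database.
known = set()
novel = [residue for residue in candidates if residue not in known]
```

Sentence-level co-mention over-generates: for this sentence the sketch also returns TRP267, which the sentence describes as interacting with the triad rather than belonging to it. Table F.1 lists only H145, D167 and S219 for this sentence, so a stricter criterion than the one sketched here is needed to reproduce it.<br />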
Appendix G<br />
Glossary<br />
3D pattern – a recurrent residue triplet configuration (with k=2 or k=3 interaction <strong>of</strong> residues) within a dataset <strong>of</strong> protein<br />
structures.<br />
arg – the argument <strong>of</strong> a PAS<br />
BIND – the set <strong>of</strong> binding-related <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> extracted protein residues, i.e. <strong>annotation</strong>s are labelled as<br />
BINDING.<br />
BINDING – a category in MAN, describing binding events <strong>of</strong> a protein residue.<br />
CSA – a database <strong>of</strong> manually curated <strong>active</strong> <strong>sites</strong> with structure templates derived from PDB.<br />
Contextual feature .<br />
EC – Enzyme Commission classification identifier.<br />
ER – entity recognition.<br />
ENZ – the set <strong>of</strong> enzyme-related <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> extracted protein residues, i.e. <strong>annotation</strong>s are labelled as<br />
ENZ ACT.<br />
ENZ ACT – a category in MAN, describing enzyme-related information.<br />
FA – a <strong>functional</strong> <strong>annotation</strong>; or the set <strong>of</strong> extracted protein residues with <strong>functional</strong> <strong>annotation</strong>s.<br />
FEAT – a categorisation scheme based on UniProtKB.<br />
FN – a false negative.<br />
FP – a false positive.<br />
FT – a record in a UniProtKB data file with <strong>functional</strong> <strong>annotation</strong>.<br />
Functional <strong>annotation</strong> – Information on biological function assigned to a protein residue.<br />
GC – a manually annotated test set with abstract texts drawn from a random selection <strong>of</strong> UniProtKB citations.<br />
GO – Gene Ontology.<br />
MAN – a categorisation scheme based on manual analysis on MEDLINE.<br />
MEDLINE – a database <strong>of</strong> citations and abstract texts from biomedical publications.<br />
NP – a noun phrase is defined as a nominal sequence.<br />
OLDFIELD – a non-redundant structure dataset <strong>of</strong> protein domains selected from PDB by sequence alignments.<br />
OPR – a semantic relation between a residue, its source protein, and hosting organism; or the set <strong>of</strong> mined protein residues.<br />
PAS – a data structure to accommodate the semantic relation between a predicate and its arguments.<br />
PDBID – PDB identifier.<br />
PDB – the primary database <strong>of</strong> protein structure with spatial coordinates.<br />
PMID – a PubMed identifier.<br />
POS – a class <strong>of</strong> words, e.g. noun, verb, adjective, used for linguistic analysis.<br />
PP – a prepositional phrase is defined as preposition + noun phrase.<br />
pred – the predicate <strong>of</strong> a PAS.<br />
Protein residue – a residue with known association to its source protein within a hosting organism (OPR).<br />
RE – Relation extraction.<br />
RID – a Residue identifier: residue name + residue position in the protein sequence.<br />
SCOP40 – a non-redundant protein structure dataset derived from SCOP.<br />
SCOP – a derived protein structure database with manual classification <strong>of</strong> proteins based on structure similarities.<br />
SITE – a record in the PDB data file denoting residues <strong>of</strong> a <strong>functional</strong> site.<br />
Structure pattern – cf. 3D pattern.<br />
TN – a true negative.<br />
TP – a true positive.<br />
TID – a Taxonomy identifier based on the NCBI Taxonomy guideline.<br />
UID – a Protein identifier based on the UniProtKB guideline.<br />
UniProtKB – a protein sequence database with manual <strong>annotation</strong>s on protein residues.<br />
VG – a verb group is a sequence <strong>of</strong> verbs, auxiliaries, or verb modifiers.<br />
VP – a verb phrase, consisting <strong>of</strong> a verb group + noun phrase.<br />
XC – a cross-validation corpus based on references from UniProtKB.<br />
chainID – a protein chain identifier in a PDB entry.<br />
k=2, k=3 – a residue triplet configuration with two-way or three-way interaction.<br />
resName – a residue name.<br />
resSeq – a protein residue sequence identifier from a PDB entry.<br />
seqIndex – a protein residue sequence identifier from a UniProtKB entry.<br />
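The glossary terms chainID, resName and resSeq correspond to fixed-width fields of the ATOM and HETATM records in the PDB file format (resName in columns 18-20, chainID in column 22, resSeq in columns 23-26). The sketch below shows how these three fields could be read from one record; the names `ResidueKey` and `residue_key` are hypothetical, and the example record is constructed for illustration rather than taken from a PDB entry.<br />

```python
from typing import NamedTuple


class ResidueKey(NamedTuple):
    chain_id: str
    res_name: str
    res_seq: int


def residue_key(atom_line: str) -> ResidueKey:
    """Read the residue fields of a fixed-width PDB ATOM/HETATM record."""
    return ResidueKey(
        chain_id=atom_line[21],              # column 22
        res_name=atom_line[17:20].strip(),   # columns 18-20
        res_seq=int(atom_line[22:26]),       # columns 23-26
    )


# Illustrative record prefix, built field by field to keep the columns exact:
# record name, atom serial, blank, atom name, altLoc, resName, blank,
# chainID, resSeq.
line = "ATOM  " + "    1" + " " + " N  " + " " + "SER" + " " + "A" + "  15"
key = residue_key(line)
```

Note that resSeq is the numbering used in the PDB entry and need not coincide with seqIndex, the position in the UniProtKB sequence; mapping between the two requires a separate alignment step that is not shown here.<br />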