
Automatic functional annotation
of predicted active sites:
combining PDB and literature mining

Kevin Nagel
Wolfson College

A dissertation submitted to the University of Cambridge
for the degree of Doctor of Philosophy

European Molecular Biology Laboratory,
European Bioinformatics Institute,
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, United Kingdom.

Email: kevin5jan@googlemail.com

January 2009


Declaration

This dissertation is the result of my own work, and includes nothing which is the outcome of work done in collaboration, except where specifically indicated in the text. The dissertation does not exceed the specified length limit of 300 pages as defined by the Biology Degree Committee. This thesis has been typeset in 12pt font using LaTeX 2ε according to the specifications defined by the Board of Graduate Studies and the Biology Degree Committee.



Summary

Kevin Nagel
European Bioinformatics Institute
University of Cambridge

Dissertation title: Automatic functional annotation of predicted active sites: combining PDB and literature mining.

Proteins are essential to cell function, and their roles are mainly identified in biological experiments. Structural models of proteins help to explain their function, but are not direct evidence for it. Nonetheless, we can mine structural databases, such as the Protein Data Bank (PDB), to extract shared structural components that are meaningful with regard to protein function.

This thesis applied mining techniques to PDB to identify evolutionarily conserved structural patterns, e.g. active sites. This analysis retrieved 3- and 4-bodies with assumed two- and three-way residue interactions, which were selected from a distribution analysis of residue triplets. A subset of the mined patterns is assumed to represent active sites, which should be confirmed by annotations gathered through automatic literature analysis.

Literature analysis for the functional annotation of proteins relies on the extraction of GO terms from the context of a protein mention. The annotation of protein residues requires the identification of chemical functions, which can be found in the context of residue mentions. MEDLINE abstracts have been processed to identify protein mentions in combination with species and residues (F1-measure 0.52; the F1-measure is a statistical measure of a test's accuracy based on its precision and recall). The identified protein-species-residue triplets have been validated and benchmarked against reference data resources. Then, contextual features were extracted through shallow and deep parsing, and the features were classified into predefined categories (F1-measure ranging from 0.15 to 0.67). Furthermore, the feature sets have been aligned with annotation types in UniProtKB to assess the relevance of the annotations for ongoing curation projects. Altogether, the annotations have been assessed automatically and manually against reference data resources.
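For reference, the F1-measure quoted above is conventionally defined as the harmonic mean of precision and recall. The formulation below is the standard textbook one; the symbols TP, FP, and FN (true positives, false positives, false negatives) are introduced here for illustration and do not appear in the original text.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}
\]
```

As a purely illustrative calculation (not a result of this thesis), a system with P = 0.80 and R = 0.40 would score F1 ≈ 0.53, below the arithmetic mean of 0.60, because the harmonic mean penalises an imbalance between precision and recall.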

All of MEDLINE has been processed to extract annotations for residues. A subset of the identified catalytic sites could be cross-validated against the Catalytic Site Atlas (CSA; 44 out of 221). Of 512 protein residues from MSDsite, 429 were then annotated with contextual data. Altogether, MEDLINE does not provide sufficient data to fully annotate the content of PDB. Conversely, residue annotation is achieved with a different feature set than that provided by GO, and incomplete annotations in the reference datasets can be filled in from the public literature.



Acknowledgements

This thesis would not have been possible without the support, direction, and love of a multitude of people. First, I would like to thank my supervisor Dietrich Rebholz-Schuhmann for his trust, encouragement, and for all his unconditional support and guidance. Dietrich has throughout given me opportunity and a sound research methodology. Working with him I have learned the value of vision, and persistence in achieving it.

I am blessed to have had Tom Oldfield as my second supervisor. Ever since I was interviewed by Tom, he has been inspiring, helpful and most of all patient. I will look back fondly on our discussions, the “insights” into protein science he gave me, and the cheerful and motivational chats. I am deeply indebted to him for his belief in me.

I would like to thank my thesis committee members, Michael Ashburner, Kim Henrick, and Rob Russell, for their constructive comments and valuable criticism. They all seemed to find time for me despite their busy schedules.

A special thank you must go to Kim Henrick; had he not encouraged me to pursue a research position I would not be a scientist now.

I would also like to acknowledge Antonio Jimeno for his time, patience, and suggestions, and especially for reminding me always to keep my focus. But most of all I will remember the great times we had cycling to and from work.

I would like to thank the past and present members of the Rebholz Group (Text Mining). During my years of research, the group has expanded and I have had the chance to learn from them as well as to have fun with them.

I am also thankful to the European Molecular Biology Laboratory (EMBL) for the scholarship and the organised EMBL International PhD programme, through which I have had the chance to meet many talented and cheerful PhD students from the EMBL/EBI Hinxton.

A special thank you to Christina Granroth and Dagmar Harzheim, who proofread this thesis. Thank you, Dagmar, for helping me make clearer what I want to say.

Finally, I would like to acknowledge my wife Almut Nagel and my daughter Juli Nagel. Without Almut I would have become a working maniac with no joy in life; she helped me to maintain balance during my PhD research and also for the future. My special thanks and love go to Juli, aged one, from whom I have learned so much.



Contents

1 Introduction
1.1 Proteins and functional sites
1.2 Motivation
1.3 Objective
1.4 Related works
1.5 Challenges
1.6 Guide to remaining chapters

2 Background
2.1 Protein related data resources
2.1.1 Protein Data Bank
2.1.2 Universal Protein Knowledge base
2.1.3 Gene Ontology
2.1.4 Biomedical literature
2.2 Protein structure data mining
2.2.1 Hypothesis-driven data analysis
2.2.2 Discovery-driven data mining
2.3 Biomedical literature mining
2.3.1 Biological entity recognition
2.3.2 Biological relation extraction
2.4 Conclusion

3 Mining residue interactions as triads from PDB
3.1 Algorithms
3.1.1 Structural feature extraction
3.1.2 Detection of significant configurations as interactions
3.1.3 Grouping and selecting frequent configurations
3.2 Analysing available non-redundant protein structure sets
3.3 Evaluation methods
3.4 Results
3.4.1 Identification of residue interactions is dependent on data selection
3.4.2 The interaction distance correlates with the distribution of residue triads
3.4.3 Interaction classification is sensitive to the size of cross-validation
3.5 Discussion
3.6 Conclusion

4 Prediction of functions for mined residue triads
4.1 Evaluation methods
4.2 Results
4.2.1 Identification of homologous metal binding sites
4.2.2 Validation of convergent metal binding sites
4.2.3 Recovering active sites and catalytic triads from the dataset
4.2.4 Discovering the conserved serine residue in the catalytic triad (quartet)
4.3 Discussion
4.4 Conclusion

5 Identification of protein residues in MEDLINE
5.1 Algorithms
5.1.1 Protein and organism entity recognition
5.1.2 Entity recognition of protein residue
5.1.3 Association identification of the entity triplet organism, protein, and residue
5.2 The construction of evaluation test corpora
5.3 Evaluation methods
5.4 Results
5.4.1 Evaluation of organism, protein, and residue entity recognition
5.4.2 Performance study on the entity triplet association
5.4.3 Cross-validation of identified residues with UniProtKB
5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins
5.5 Discussion
5.6 Conclusion

6 Information extraction from the context of a residue in text
6.1 Algorithms
6.1.1 Extraction of contextual features
6.1.2 Categorisation of contextual features
6.2 Evaluation methods
6.3 Results
6.3.1 Contextual feature extraction evaluated
6.3.2 Performance analysis of the classifiers
6.4 Discussion
6.5 Conclusion

7 Extraction of functional annotation for protein residues from MEDLINE
7.1 Evaluation methods
7.2 Results
7.2.1 Evaluation of the developed functional annotation extraction system
7.2.2 Studying mined functional annotations for the proteins p53 and Jak2
7.2.3 Cross-validation of mined catalytic residues with CSA
7.2.4 Annotation of protein residues in MSDsite
7.3 Discussion
7.4 Conclusion

8 Combining active site prediction with mined functional annotations
8.1 Algorithms
8.1.1 Combining protein structure data with literature data
8.2 Evaluation methods
8.3 Results
8.3.1 Protein residue mapping between three data resources
8.3.2 Rediscovery of active sites and catalytic residues
8.3.3 Search for novel catalytic residues
8.3.4 General correlation found between predicted functional sites and extract functional annotations.
8.4 Discussion
8.5 Conclusion

9 Conclusions and future work
9.1 Summary of main contributions
9.2 Limitations and future works

A Examples of errors in relation extraction.
B Examples of extracted functional annotations compared with UniProtKB
C Examples of extracted functional annotations for the protein p53
D Examples of extracted functional annotations for the protein Jak2
E Examples of extracted functional annotations of the category binding event
F Examples of extracted functional annotations of active site residues
G Glossary


List of Figures

1.1 The standard amino acids
1.2 Examples of functional sites in proteins
1.3 The protein universe and its knowledge representation
2.1 Data banks in the protein universe
2.2 Three hyperlinked protein data banks
2.3 Categories for protein sequence annotation UniProtKB
2.4 GO terms are not suitable for protein residue annotation
3.1 Overview of processes and evaluation methods of the developed 3D pattern identification system
3.2 Four classes of interactions within a 3-body
3.3 Non-redundant structure set for 3D pattern mining
3.4 Distribution analysis of extracted residue triplets
3.5 Comparison of extracted residue triplets based on their interaction type
3.6 The effect of varying the cross-validation sample size on significance testing of residue interaction
4.1 A metal binding site with the 3Cys pattern in OLDFIELD
4.2 A metal binding site with the Cys-2His pattern in OLDFIELD
4.3 A metal binding site with the 3Cys pattern in SCOP40
4.4 A metal binding site with the Cys-2His pattern in SCOP40
4.5 Re-discovery of the catalytic triad as Asp-His-Ser pattern in OLDFIELD
5.1 Overview of processes and evaluation methods for the developed protein residue identification system
5.2 Test corpora for information extraction evaluation
5.3 Identified protein residues in MEDLINE
5.4 Cross-validation of citations from identified protein residues with UniProtKB/PDB
6.1 Overview of processes and evaluation methods of the developed contextual feature extraction system
7.1 Performance evaluation of the functional annotation extraction system
7.2 Cross-validation of text mined catalytic residues with CSA
7.3 Cross-validation of text mined binding residues with MSDsite
8.1 Overview of processes and evaluation methods of combining the protein structure dataset and literature dataset
8.2 Lookup table for PDB/UniProtKB mapping
8.3 Overview of the combined datasets from protein structure data and biomedical literature data


List of Tables

3.1 Study on the effect of varying the interaction distance threshold in structure triangulation
4.1 Summary of extracted data at each protein structure data mining step
4.2 Identification of metal binding sites in OLDFIELD
4.3 Convergent metal binding sites identified in SCOP40
4.4 List of cross-validated active site residues
4.5 Extending the catalytic triad into 4-bodies
5.1 Regular expression patterns for the detection of residue mentions in text
5.2 Performance evaluation of residue entity recognition
5.3 Performance evaluation of protein entity recognition
5.4 Performance evaluation of organism entity recognition
5.5 Performance evaluation of residue-protein-organism entity association detection
5.6 Performance evaluation of protein-organism and protein-residue entity association detections
5.7 A specialised performance evaluation between GC and XC2.
6.1 Biological categories for the classification of protein residue related information
6.2 Category distribution in the text feature reference set
6.3 Evaluation of syntactical language parser performance
6.4 Performance analysis of the classifiers (confusion matrix)
6.5 Performance evaluation of the classifiers (precision, recall, F1 measure)
8.1 Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen
8.2 Identified catalytic residues from MEDLINE extraction
8.3 Catalytic triad residues available from the mined functional annotations
8.4 Functional annotations of protein residues in predicted functional sites.
8.5 Homology-based transfer of extracted functional annotations for protein residues in the mined pattern data.
A.1 Examples of errors in the relation extraction for the detection of contextual features.
B.1 Comparison of extracted functional annotations from GC with UniProtKB.
C.1 Examples of literature mined annotations of protein residues in p53.
D.1 Examples of literature mined annotations of protein residues in Jak2.
E.1 Mined functional annotations of protein residues with information on binding events.
F.1 Identified catalytic triad residues from MEDLINE extraction.


Chapter 1

Introduction

1.1 Proteins and functional sites

The genomic information encodes the blueprint to build an organism. The decoding and implementation of genetic information depends on the functions of the proteins. Each protein is the result of transcribing a gene into mRNA, which is translated into a polypeptide. Hence, a protein is a gene product. The elementary units of a protein are the 20 natural standard amino acids, each with four invariant parts, namely a central alpha carbon (Cα; chiral in all but glycine), an amine group (NH2), a carboxylic acid group (COOH), and a hydrogen atom (H), plus a characteristic side chain (R). Apart from the invariant amine and carboxylic acid groups, which give every amino acid the property of a zwitterion, distinctive physicochemical properties are defined by the side chain group. Side chains can be polar, acidic or basic, aromatic, bulky, conformationally flexible, capable of cross-linking, capable of hydrogen bonding, or chemically reactive. Figure 1.1 lists all the standard amino acids and their common classification on the basis of the nature of the side chain group.

Amino acid      3-letter  1-letter  Side-chain polarity
Alanine         Ala       A         nonpolar
Arginine        Arg       R         polar
Asparagine      Asn       N         polar
Aspartic acid   Asp       D         polar
Cysteine        Cys       C         nonpolar
Glutamic acid   Glu       E         polar
Glutamine       Gln       Q         polar
Glycine         Gly       G         nonpolar
Histidine       His       H         polar
Isoleucine      Ile       I         nonpolar
Leucine         Leu       L         nonpolar
Lysine          Lys       K         polar
Methionine      Met       M         nonpolar
Phenylalanine   Phe       F         nonpolar
Proline         Pro       P         nonpolar
Serine          Ser       S         polar
Threonine       Thr       T         polar
Tryptophan      Trp       W         nonpolar
Tyrosine        Tyr       Y         polar
Valine          Val       V         nonpolar

Figure 1.1: The standard amino acids. The trivial names, 3-letter and 1-letter abbreviations are listed along with the physicochemical properties of their side chains.

During biosynthesis, ribosomes catalyse the polymerisation of amino acids through condensation and form peptide bonds between the NH2 and COOH groups of two consecutive amino acids. The backbone (main chain) of the resulting polypeptide is the repeating sequence NH2-C-CO-[NH-C-CO]n-NH-C-CO. This is the primary structure of a protein, which folds spontaneously owing to interactions between its constituent amino acids and environmental factors, e.g. solvent, salt, and chaperones. The most prominent formation during the folding process is the hydrophobic core, which stabilises the protein structure. Amino acids such as alanine, valine, leucine, isoleucine, phenylalanine, and methionine are clustered in the interior of a protein, while charged or polar side chains are turned to the solvent-exposed surface and interact with surrounding water molecules. Minimising the exposure of hydrophobic side chains to water is the principal driving force of folding.
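The classification in Figure 1.1 can be written down as a small lookup, which makes the polar/nonpolar split behind the hydrophobic core easy to inspect. The following is a minimal sketch: the residue codes and polarity classes are copied from Figure 1.1, while the function names `to_one_letter` and `nonpolar_fraction` are hypothetical helpers introduced here for illustration and are not part of the thesis software.

```python
# Side-chain polarity as listed in Figure 1.1 (1-letter codes).
NONPOLAR = set("ACGILMFPWV")
POLAR = set("RNDEQHKSTY")

# 3-letter to 1-letter abbreviations, also from Figure 1.1.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(pattern: str) -> str:
    """Convert a hyphen-separated 3-letter pattern (hypothetical helper)."""
    return "".join(THREE_TO_ONE[name] for name in pattern.split("-"))


def nonpolar_fraction(sequence: str) -> float:
    """Fraction of residues in a 1-letter sequence with a nonpolar side chain."""
    if not sequence:
        raise ValueError("empty sequence")
    unknown = set(sequence) - NONPOLAR - POLAR
    if unknown:
        raise ValueError(f"non-standard residue codes: {sorted(unknown)}")
    return sum(res in NONPOLAR for res in sequence) / len(sequence)
```

For instance, `to_one_letter("Asp-His-Ser")` yields `"DHS"`, and `nonpolar_fraction("AVLD")` yields `0.75`. A two-class split like this is only a coarse proxy for hydrophobicity; it says nothing about where a residue actually sits in a folded structure.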

The process of protein folding involves the formation of regular secondary structure elements (SSE), such as alpha helices and beta strands, which are stabilised by intramolecular hydrogen bonds and contacts between side chain atoms (van der Waals interaction). By following a helical path, the carboxyl group of residue i and the amino group of residue i+4 of the main chain are brought into alignment and stabilise the local structure by hydrogen-bond formation. The side chains protrude from the helically coiled backbone and define the surface of the helix. In contrast, beta strands are formed by hydrogen bonds between distant regions of the peptide. Depending on the relative direction of the peptide regions, two adjacent strands can be characterised as parallel or antiparallel. Because the backbone adopts an almost fully extended conformation, the side chains of residues i and i+2 face the same direction. A set of interacting strands is called a sheet. During the intramolecular stabilisation of the main chain, the regions between secondary structure elements adopt loosely defined conformations such as turns and random coils or loops.
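The two index rules in this paragraph (i to i+4 hydrogen bonding in a helix, and residues i and i+2 sharing a face in a strand) can be made concrete with a short sketch. It assumes idealised geometry and 1-based residue numbering, works on indices only rather than on real coordinates, and the function names are hypothetical.

```python
from __future__ import annotations


def helix_hbond_pairs(n_residues: int) -> list[tuple[int, int]]:
    """Backbone hydrogen bonds in an ideal alpha helix.

    Pairs the carboxyl group of residue i with the amino group of
    residue i + 4, using 1-based residue numbers.
    """
    return [(i, i + 4) for i in range(1, n_residues - 3)]


def strand_faces(n_residues: int) -> dict[str, list[int]]:
    """Split an ideal extended strand into its two side-chain faces.

    Residues i and i + 2 point the same way, so odd and even
    positions end up on opposite faces of the sheet.
    """
    return {
        "face_a": list(range(1, n_residues + 1, 2)),
        "face_b": list(range(2, n_residues + 1, 2)),
    }
```

For an eight-residue helix, `helix_hbond_pairs(8)` gives `[(1, 5), (2, 6), (3, 7), (4, 8)]`; for a five-residue strand, `strand_faces(5)` gives `{"face_a": [1, 3, 5], "face_b": [2, 4]}`. Real structures deviate from these patterns, so the sketch only illustrates the numbering, not a method for assigning secondary structure.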

The attractive and repulsive forces (e.g. ionic or van der Waals interactions between residues) among the SSEs balance each other during the folding process and lead to a relatively stable and complex three-dimensional structure. Stabilisation of the conformation may involve covalent bonding, e.g. disulphide bridges between two cysteine residues, or the formation of metal-binding motifs. The spatial arrangement of sequentially proximate or distant residues allows the generation of biochemical functional sites. Identifying these and other novel biologically functional regions in a protein is one of the greatest research interests in the protein bioinformatics community, because they explain phenological data, e.g. cellular processes. Figure 1.2 lists some of the well-known functional sites in various proteins, classified according to a categorisation scheme of my own design.

Finally, the formation <strong>of</strong> quaternary structure is the assembly <strong>of</strong> tertiary structures<br />

within a multi-chain protein. In this respect, each polypeptide chain is regarded as an<br />

individual <strong>functional</strong> unit (subunit or domain). Within the interfaces <strong>of</strong> the subunits,<br />

a multi-domain based <strong>functional</strong> site can be formed, which is not present or <strong>functional</strong><br />

in the individual domains. For example, the proteins cAMP-dependent protein kinase<br />

(PDBID:1rdq), hexokinase (PDBID:1bdq), or maltodextrin phosphorylase (PDBID:1l5w)<br />

contain ligand binding <strong>sites</strong> consisting <strong>of</strong> more than one protein structure domain (A.<br />

Kahraman, pers. comm.). The identification <strong>of</strong> these multi-domain <strong>functional</strong> <strong>sites</strong> is<br />

another great challenge in protein bioinformatics.<br />

First, the prediction system has to<br />

find the correct assembly <strong>of</strong> tertiary structures (a crystal structure <strong>of</strong> a protein does<br />

not necessarily reflect the biological state <strong>of</strong> assembly).<br />

Second, the structure models<br />

have to be adjusted (proteins are not rigid molecules and have flexible parts), and finally<br />



site<br />

1. evolutionary site<br />

1.1. conserved site<br />

2. <strong>functional</strong> site<br />

2.1. interaction site<br />

2.1.1. <strong>active</strong> site<br />

2.1.1.1. catalytic site / re<strong>active</strong> site<br />

2.1.1.1.1. catalytic residue<br />

2.1.1.1.2. donor site<br />

2.1.1.1.3. acceptor site<br />

2.1.1.2. binding site / contact site / substrate binding site / ligand binding site / recognition site<br />

2.1.1.2.1. specificity residue / specific site<br />

2.1.1.2.1.1. high affinity binding site<br />

2.1.1.2.1.2. low affinity binding site<br />

2.1.1.2.2. peptide binding site<br />

2.1.1.2.3. protein binding / receptor site<br />

2.1.1.2.3.1. nf kappab site<br />

2.1.1.2.3.2. antibody binding site<br />

2.1.1.2.3.3. antigen binding site<br />

2.1.1.2.3.4. actin binding site<br />

2.1.1.2.4. sugar binding<br />

2.1.1.2.5. lipid binding<br />

2.1.1.2.6. nucleic acid binding<br />

2.1.1.2.6.1. atp binding site<br />

2.1.1.2.7. metal binding site<br />

2.1.1.2.7.1. calcium binding site / ca(2+) binding site<br />

2.1.1.2.7.2. copper site<br />

2.1.2. passive site / target site<br />

2.1.2.1. cleavage site / lesion site / processing site / proteolytic cleavage site<br />

2.1.2.2. PTM site<br />

2.1.2.2.1. phosphorylation site<br />

2.1.2.2.1.1. tyrosine phosphorylation site<br />

2.1.2.2.2. glycosylation site<br />

2.1.2.2.3. regulatory site<br />

2.1.2.2.4. inhibitory site<br />

2.1.2.2.5. activation site<br />

2.2. structural site<br />

2.2.1. hydrophobic site<br />

2.2.1.1. hydrophobic core<br />

2.2.1.2. hydrophobic patch<br />

2.2.2. n terminal site<br />

2.2.3. c terminal site<br />

2.2.4. transmembrane site<br />

2.2.5. intracellular site / cellular site<br />

2.2.6. extracellular site<br />

2.2.7. anionic site<br />

2.2.8. cationic site<br />

2.2.9. nucleation site<br />

Figure 1.2: Examples <strong>of</strong> <strong>functional</strong> <strong>sites</strong> in proteins. A proposed classification scheme (excerpt)<br />

is shown, based on my own perspective <strong>of</strong> the biomolecular function <strong>of</strong> specific residue configurations in<br />

protein structures.<br />



co-factors, e.g. metal ions, have to be considered.<br />

1.2 Motivation<br />

The understanding <strong>of</strong> the biological function <strong>of</strong> proteins remains a central challenge in<br />

biology.<br />

Our knowledge <strong>of</strong> the protein universe can be partitioned into at least three<br />

knowledge spaces (cf. figure 1.3): protein sequence space, protein structure space, and<br />

protein function space. Each space represents a specific view <strong>of</strong> proteins. For example, the<br />

protein structure space contains information about the number <strong>of</strong> biological conformations<br />

<strong>of</strong> protein structures (cf. figure 1.3, top panel), whereas the function space describes the<br />

spectrum <strong>of</strong> protein function. Although information from each space partially overlaps,<br />

few data are available to explain their relationship.<br />

For example, site-directed<br />

mutational analysis is <strong>of</strong>ten reported in the context <strong>of</strong> a gain or loss <strong>of</strong> protein function,<br />

while the biological correlation between sequence and function is not understood. This is<br />

because the mechanism <strong>of</strong> protein function is not explained by information within sequence<br />

space. In contrast, structural data are more expressive than sequence data, because a<br />

protein structure provides spatial context <strong>of</strong> residues. Proteins are physical entities and<br />

as such, they perform interactions with other proteins or ligands. The shape <strong>of</strong> a protein,<br />

or more precisely, the spatial configuration <strong>of</strong> a set <strong>of</strong> residues in a <strong>functional</strong> site, is<br />

one explanation for protein function. While protein structure data mining is concerned<br />

with the prediction <strong>of</strong> novel <strong>functional</strong> <strong>sites</strong> in proteins, a mined structural pattern has<br />

no evidence <strong>of</strong> biological function.<br />

In contrast, biomedical literature reports a range<br />

<strong>of</strong> biological function <strong>of</strong> protein residues without a structural context and explanation <strong>of</strong><br />

molecular mechanism (cf. figure 1.3, middle panel). The combination <strong>of</strong> information from<br />

protein structure space and protein function space seems to be an obvious approach in<br />

order to gain new knowledge on protein function.<br />



Figure 1.3: The protein universe and its knowledge representation. Information on a protein can be collected<br />

from at least three different knowledge domains: crystallography provides the spatial coordinates <strong>of</strong><br />

a protein, protein sequencing determines the linear composition <strong>of</strong> amino acids in a protein, and biochemical<br />

experiments characterise the biological function (top panel). In principle, protein function prediction<br />

can be based on information from each <strong>of</strong> these knowledge spaces; however, the combination <strong>of</strong> them<br />

can overcome some domain-specific limitations (middle panel).<br />



1.3 Objective<br />

This thesis aims to discover hypothetical <strong>functional</strong> <strong>sites</strong> from Protein Data Bank (PDB)<br />

and annotate them with <strong>functional</strong> information from biomedical literature.<br />

The main<br />

idea is to combine the information from two currently detached data resources: protein<br />

structure information from PDB, and <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> residues from MEDLINE<br />

(cf. figure 1.3, lower panel). More specifically, this research focuses on the prediction <strong>of</strong><br />

<strong>active</strong> <strong>sites</strong> by data mining recurrent spatial residue configurations (3D pattern) in proteins.<br />

Contextual features <strong>of</strong> residues are extracted from biomedical literature to provide<br />

<strong>functional</strong> <strong>annotation</strong>s. The results from both datasets are then combined to verify <strong>predicted</strong><br />

<strong>functional</strong> <strong>sites</strong> by evidence <strong>of</strong> biological function. While existing approaches in<br />

protein structure data mining and biomedical literature mining have been used to generate<br />

data for each research domain, the combination <strong>of</strong> the datasets is a novel approach in<br />

protein bioinformatics research.<br />
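
To make the notion of a recurrent spatial residue configuration concrete, the following is a minimal sketch and not the mining algorithm developed in this thesis. It assumes that each residue is reduced to a single representative coordinate, that a pattern is a residue triplet, and that two triplets are "the same" if their residue types and their binned pairwise distances agree. The function names, the 8 Å cut-off, and the 1 Å bin width are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from math import dist


def triplet_signature(triplet, bin_width=1.0):
    """Coarse, order-independent signature of three (name, (x, y, z)) residues."""
    names = tuple(sorted(name for name, _ in triplet))
    # Pairwise distances are binned so that small coordinate noise is tolerated.
    distances = tuple(sorted(
        round(dist(a, b) / bin_width)
        for (_, a), (_, b) in combinations(triplet, 2)
    ))
    return names, distances


def recurrent_patterns(proteins, min_support=2, max_dist=8.0):
    """Return signatures that occur in at least `min_support` proteins.

    `proteins` maps a (hypothetical) protein identifier to a list of
    (residue_name, (x, y, z)) tuples.
    """
    counts = Counter()
    for residues in proteins.values():
        seen = set()
        for triplet in combinations(residues, 3):
            # Keep only spatially compact triplets.
            if all(dist(a, b) <= max_dist
                   for (_, a), (_, b) in combinations(triplet, 2)):
                seen.add(triplet_signature(triplet))
        counts.update(seen)  # each signature counts once per protein
    return {sig: n for sig, n in counts.items() if n >= min_support}
```

The signature deliberately ignores which distance belongs to which residue pair, so it is only a coarse filter; a real system would need a proper structural superposition to confirm a match.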

1.4 Related works<br />

To verify a <strong>predicted</strong> protein function with <strong>functional</strong> <strong>annotation</strong>s extracted from biomedical<br />

literature, two different levels have to be considered: (1) the protein level, and (2) the residue<br />

level (i.e. groups <strong>of</strong> residues forming a <strong>functional</strong> site).<br />

The recent publication <strong>of</strong> [JGLRS08] is one example <strong>of</strong> case (1): the prediction <strong>of</strong><br />

protein function is based on the search for a conserved and connected subgraph (CCS) in<br />

protein-protein interaction graphs, generated from several biological databases. Within<br />

the set <strong>of</strong> CCS, all available <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> a protein in a database are transferred<br />

to homologous proteins. The <strong>annotation</strong>s consist <strong>of</strong> Gene Ontology (GO) terminologies<br />

and the transfer is the prediction <strong>of</strong> protein function. The verification <strong>of</strong> a <strong>predicted</strong><br />

function was done by identifying GO terms in abstract texts <strong>of</strong> the corresponding protein.<br />

The approach <strong>of</strong> this thesis has some similarities to this report [JGLRS08], e.g. in<br />



both approaches, results from data mining were verified by information extracted from<br />

biomedical literature. However, there are crucial differences between the two that need<br />

to be considered when assessing the result <strong>of</strong> this thesis. First, in contrast to the CCS<br />

identification, the data mining part in this work does not aim to identify known patterns,<br />

but to discover new structural features that may represent a novel <strong>functional</strong> site.<br />

Secondly, in [JGLRS08] the prediction <strong>of</strong> protein function utilises terminologies <strong>of</strong> a well-developed<br />

public resource, the Gene Ontology, while the same resource is not suitable<br />

for <strong>annotation</strong> <strong>of</strong> protein residues. This is because GO is designed to describe function<br />

<strong>of</strong> genes and gene products. From a conceptual point <strong>of</strong> view, terminologies in GO describe<br />

a high level <strong>of</strong> biological function, while descriptions <strong>of</strong> residue function are <strong>of</strong> a<br />

lower level. For example, descriptions <strong>of</strong> protein-protein interactions are found in the context <strong>of</strong><br />

metabolomics, signal-transduction or other cellular processes. In contrast, the function <strong>of</strong><br />

a protein residue can be explained in light <strong>of</strong> molecular interactions or chemical reaction<br />

mechanisms. Finally, the distribution <strong>of</strong> information on biological function is expected to<br />

be different in biomedical publications. Because protein function is conceptually a high<br />

level <strong>of</strong> biological function, it is likely that abstract texts <strong>of</strong> biomedical articles contain<br />

information on this level. Conversely, the interaction <strong>of</strong> protein residues is a detailed description<br />

<strong>of</strong> protein function, and key information is expected to be mentioned in the results<br />

or discussion sections <strong>of</strong> full-text articles. To my knowledge, the most closely related<br />

work in terms <strong>of</strong> <strong>functional</strong> <strong>annotation</strong> <strong>of</strong> protein residues (case (2)) is the system called<br />

Mutation extraction and STRucture Annotation Pipeline (mSTRAP) [KCRB07]. The key<br />

feature <strong>of</strong> mSTRAP is the visualisation <strong>of</strong> mutation <strong>annotation</strong>s, which are projected onto<br />

the structure <strong>of</strong> a protein <strong>of</strong> interest. The advantage <strong>of</strong> mSTRAP is that the impact <strong>of</strong> a<br />

mutation can be interpreted in the context <strong>of</strong> the protein structure. However, the prediction <strong>of</strong> <strong>functional</strong> <strong>sites</strong><br />

is done by visual analysis <strong>of</strong> the protein structure. The provided <strong>annotation</strong>s are sets<br />

<strong>of</strong> complete sentences extracted from MEDLINE, which means that the interpretation <strong>of</strong><br />

the information requires expert knowledge.<br />



The system developed in this work differs from mSTRAP in that the extracted information<br />

is not used exclusively to annotate point mutations; <strong>functional</strong><br />

descriptions <strong>of</strong> wild-type residues are also collected. Another distinction from mSTRAP is that<br />

the mined information is represented in a so-called predicate-argument structure (PAS)<br />

format: only those text segments <strong>of</strong> a sentence are extracted that describe a biological<br />

function or a biological context <strong>of</strong> a mentioned residue. The structured format allows,<br />

to some extent, queries for specific information in the extracted <strong>annotation</strong> dataset.<br />
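
As an illustration of why a predicate-argument representation can be queried more easily than whole sentences, the sketch below stores each extracted statement as a small record. The class, the field names, the role labels, and the query function are assumptions made for this example only; they are not the data model used by the system described in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidueAnnotation:
    """One predicate-argument record for a residue mention (hypothetical layout)."""
    residue: str      # e.g. "His57"
    protein: str      # protein the residue belongs to
    predicate: str    # normalised verb of the extracted text segment
    arguments: tuple  # (role, text) pairs, e.g. (("ARG1", "..."),)


def query(annotations, predicate=None, contains=None):
    """Select records by predicate and/or by a substring of any argument text."""
    hits = []
    for ann in annotations:
        if predicate is not None and ann.predicate != predicate:
            continue
        if contains is not None and not any(
            contains in text for _, text in ann.arguments
        ):
            continue
        hits.append(ann)
    return hits
```

With such records, a question like "which residues are reported to bind something?" becomes a filter on the predicate field, whereas a set of complete sentences would have to be read and interpreted by an expert.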

In conclusion, only few related works have been reported that describe an automated<br />

system to verify a <strong>predicted</strong> protein function by using <strong>functional</strong> <strong>annotation</strong>s extracted<br />

from the literature. This work retains its originality, because it aims to find novel <strong>functional</strong><br />

<strong>sites</strong> in proteins by mining the PDB, and by extracting <strong>functional</strong> <strong>annotation</strong>s from<br />

a wide range <strong>of</strong> biomedical literature data.<br />

1.5 Challenges<br />

Is it possible to identify a <strong>functional</strong> site, e.g.<br />

an <strong>active</strong> site, on the basis <strong>of</strong> mining<br />

PDB and the literature, and then combine the information <strong>of</strong> both?<br />

We can expect<br />

that a significant population <strong>of</strong> similarly arranged residues in a protein can be identified<br />

from a non-redundant protein set, if this evolutionarily conserved interaction provides a<br />

<strong>functional</strong> or structural advantage. We can also expect that residues are mentioned in<br />

conjunction with their corresponding protein, and that the biological role <strong>of</strong> a protein<br />

residue is reported in context <strong>of</strong> gain or loss <strong>of</strong> function <strong>of</strong> the overall protein in biomedical<br />

literature.<br />

One task presented in this thesis is the identification <strong>of</strong> textual features as <strong>functional</strong><br />

<strong>annotation</strong>. The problem differs from other information extraction tasks, e.g. the <strong>annotation</strong><br />

<strong>of</strong> proteins, because the target is to provide knowledge on the biological role <strong>of</strong> a<br />

residue. For example, to extract protein-protein interactions from text, a list <strong>of</strong> protein<br />

names is used, and the task is reduced to finding only associations between listed proteins.<br />

In contrast, extracting a protein residue and its corresponding biological function<br />

is difficult, because an adequate dictionary <strong>of</strong> terms is not available.<br />
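
Without a dictionary, residue mentions have to be recognised from their surface form. The following minimal sketch uses two regular expressions, one for three-letter residue mentions such as "Ser195" or "His-57" and one for point mutations such as "D102N". It only illustrates the idea; the patterns, the function names, and the covered notations are assumptions, and the identification system of chapter 5 is not reproduced here.

```python
import re

THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# Three-letter code, optional hyphen or space, then the sequence position.
RESIDUE_RE = re.compile(rf"\b({THREE_LETTER})[- ]?(\d+)\b")
# Wild-type residue, position, mutant residue in one-letter code.
MUTATION_RE = re.compile(rf"\b([{ONE_LETTER}])(\d+)([{ONE_LETTER}])\b")


def find_residue_mentions(text):
    """Return (RESIDUE_NAME, position) pairs in order of appearance."""
    return [(m.group(1).upper(), int(m.group(2)))
            for m in RESIDUE_RE.finditer(text)]


def find_point_mutations(text):
    """Return (wild_type, position, mutant) triples in order of appearance."""
    return [(m.group(1), int(m.group(2)), m.group(3))
            for m in MUTATION_RE.finditer(text)]
```

Patterns of this kind are cheap but imprecise: the mutation pattern, for instance, also matches any other token made of a letter, digits, and a letter from the same alphabet, which is one reason why the association with a protein and an organism has to be checked in a later step.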

1.6 Guide to remaining chapters<br />

Chapter 2 presents background knowledge that is important for this work. Four different<br />

data resources are reviewed and their limitations discussed in context <strong>of</strong> this<br />

thesis. Then follows an explanation <strong>of</strong> methods in the field <strong>of</strong> protein structure data<br />

mining and biomedical literature mining. Some <strong>of</strong> the introduced methodologies are<br />

reused in this work, while ideas and approaches <strong>of</strong> others were adopted to develop<br />

task specific extraction systems.<br />

Chapter 3 describes the developed protein structure data mining system for the identification<br />

<strong>of</strong> 3D patterns in PDB. Algorithms for the identification <strong>of</strong> conserved<br />

spatial residue configurations are explained and the effects <strong>of</strong> algorithm-related and<br />

data-related parameters are discussed.<br />

Chapter 4 demonstrates the biological implication <strong>of</strong> the mined 3D patterns from chapter<br />

3. Two examples <strong>of</strong> rediscovered <strong>functional</strong> <strong>sites</strong> in proteins are shown to justify<br />

the presented data mining approach. The first biological validation is the identification<br />

<strong>of</strong> metal binding <strong>sites</strong>, while the second validation is the rediscovery <strong>of</strong> the catalytic<br />

triad from the mined data.<br />

Chapter 5 is the first <strong>of</strong> three text mining chapters in this thesis. It explains the developed<br />

protein residue identification system, which consists <strong>of</strong> two main modules:<br />

biological entity recognition <strong>of</strong> residue, protein, and organism, and association detection<br />

<strong>of</strong> the entity triplet.<br />

Chapter 6 describes the approach to detect contextual features <strong>of</strong> a mentioned residue in<br />

text. An automatic method is introduced to assign semantic labels to the extracted<br />



textual features.<br />

Chapter 7 is the third <strong>of</strong> the three text mining chapters. Both text mining<br />

modules from the previous chapters (protein residue identification, and contextual<br />

feature extraction) are combined to form the <strong>functional</strong> <strong>annotation</strong> extraction system.<br />

The overall performance <strong>of</strong> this information extraction system is studied. The<br />

validity <strong>of</strong> the extracted information as <strong>functional</strong> <strong>annotation</strong> is demonstrated by<br />

manual analysis on two example proteins (p53 and Jak2), and by cross-validation<br />

<strong>of</strong> identified catalytic or binding residues with two reference databases: CSA and<br />

MSDsite.<br />

Chapter 8 presents results on combining protein structure data with literature data.<br />

The validity is studied by examining the correlation <strong>of</strong> <strong>predicted</strong> <strong>active</strong> site residues<br />

with enzyme-related <strong>functional</strong> <strong>annotation</strong>s.<br />

Chapter 9 summarises the thesis and presents limitations and open questions for follow<br />

up research.<br />



Chapter 2<br />

Background<br />

In the previous chapter, I have presented the motivation and objective <strong>of</strong> this thesis. The<br />

purpose <strong>of</strong> this chapter is to familiarise the reader with relevant concepts in protein science,<br />

data mining, and literature mining. The limitations <strong>of</strong> each reviewed data resource or<br />

methodology are discussed in context <strong>of</strong> this research work.<br />

2.1 Protein related data resources<br />

Proteins are both building blocks <strong>of</strong> cellular structures and the major machinery in cells.<br />

In order to perform their functions, proteins need to fold into their three-dimensional<br />

structures and thereby form <strong>functional</strong> <strong>sites</strong>. The prediction <strong>of</strong> a structural pattern associated<br />

with a biological function is an important aspect in protein bioinformatics. To<br />

interpret the multiple functions <strong>of</strong> proteins, <strong>annotation</strong>s are linked with results from<br />

bioinformatics analysis tools. In addition, data are collected from generic and specific<br />

databases, from biological knowledge accumulated in the literature, and from genome-wide<br />

experiments such as transcriptomics and proteomics. One major goal is to<br />

describe protein function within biological context by using a standardised hierarchical<br />

classification scheme and controlled vocabulary.<br />

The biological community has developed databases and <strong>functional</strong> <strong>annotation</strong> schemes<br />



that are not only used to archive protein data, but also to describe protein function on<br />

a molecular, cellular and phenotypical level. Figure 2.1 shows some <strong>of</strong> the most popular<br />

and relevant databases in the field <strong>of</strong> protein bioinformatics. These protein-related data<br />

resources are hyperlinked in order to foster bioinformatics research. Statistics for<br />

three example databanks and their hyperlinked references are given in figure 2.2.<br />

2.1.1 Protein Data Bank<br />

The Protein Data Bank (PDB) is an archive <strong>of</strong> 3D structures <strong>of</strong> large biological molecules,<br />

such as proteins and nucleic acids. Currently, PDB lists 43,099 proteins determined by<br />

crystallography (version November 2008).<br />

Despite the large amount <strong>of</strong> structure data<br />

available for a range <strong>of</strong> proteins, the information in the PDB has three significant limitations.<br />

First <strong>of</strong> all, the structure data correspond to only a small part <strong>of</strong> the sequence data. A<br />

comparison with the sequence data in UniProtKB (cf. section 2.1.2) shows that the coverage <strong>of</strong> sequence<br />

space is much larger than that <strong>of</strong> structure space. Therefore, the information derived<br />

from PDB is only applicable to a limited set <strong>of</strong> proteins.<br />

The second limitation is the coverage <strong>of</strong> <strong>annotation</strong> available for proteins.<br />

In the<br />

PDB, there are some facilities to annotate proteins, for example the SITE record is used<br />

to annotate protein residues that are part <strong>of</strong> <strong>active</strong> <strong>sites</strong>. However, <strong>annotation</strong>s are not<br />

mandatory and many other <strong>sites</strong> are not updated, although new evidence <strong>of</strong> the biological<br />

<strong>functional</strong>ity <strong>of</strong> these residues has been found. An automatically derived database called<br />

PDBSite [IPGK05] stores the SITE record information and makes these data<br />

accessible. Another, rather predictive, database <strong>of</strong> <strong>functional</strong> <strong>sites</strong> in protein structures is<br />

MSDmotif [GH08], which provides information about ligands, sequence and structure<br />

motifs, their relative position, and their neighbour environment. Another database <strong>of</strong> <strong>predicted</strong><br />

<strong>functional</strong> <strong>sites</strong> is MSDtemplate [Old02], which contains small fragments generated<br />

by data mining on a structurally unique protein set from PDB. Examples <strong>of</strong> biologically<br />

relevant fragments were identified in this data collection, such as the catalytic triad and<br />



Figure 2.1: Data banks in the protein universe. This figure shows my interpretation <strong>of</strong> how our knowledge<br />

about proteins can be categorised. A selection <strong>of</strong> the most relevant data resources and web services<br />

are reproduced in this figure. UniProtKB = Universal Protein Knowledge base [WAB+06]; PIR = Protein<br />

Information Resource [BGH+00]; PDB_SELECT = representative list <strong>of</strong> PDB chain identifiers [HSSS92];<br />

PISCES = Protein Sequence Culling Server [WD03]; UniqueProt = web-service to create representative<br />

protein sequence sets [MR03]; MEROPS = the Peptidase Database [RMK+07]; CAZy =<br />

Carbohydrate-Active enZYmes [CCR+08]; TC-DB = Membrane Transport Protein Classification Database [STB06];<br />

PMD = Protein Mutant Database [KON99]; Phospho.ELM = a database <strong>of</strong> S/T/Y phosphorylation <strong>sites</strong><br />

[DCG+04]; PROSITE = Database <strong>of</strong> protein domains, families and <strong>functional</strong> <strong>sites</strong> [HBB+08]; PRINTS<br />

= Protein Motif Fingerprint Database [Att02]; BMC = BioMed Central [BMC08]; PMC = PubMed<br />

Central [PMC08]; PDB = Protein Data Bank [BWF+00]; SCOP = Structural Classification <strong>of</strong> Proteins<br />

[HMBC97]; CATH = Class, Architecture, Topology, Homologous superfamily - Protein structure classification<br />

[OMJ+97]; Relibase = database <strong>of</strong> protein-ligand complexes [HBGK03]; CSA = Catalytic Site<br />

Atlas [PBT04]; MSDmotif = an integrated resource <strong>of</strong> protein structure motifs.<br />



Figure 2.2: Three hyperlinked protein data banks. Illustrated is the size <strong>of</strong> three databanks, PDB,<br />

UniProtKB, and MEDLINE, along with their cross-references. For example, the PDB contains in total<br />

42,943 PDB identifiers (version November 2008) with cross-references to 42,085 out <strong>of</strong> 333,445 UniProt<br />

identifiers, which in turn point to 10,466 biomedical journal articles (PMIDs). Note that PDB<br />

also holds for each record a small number <strong>of</strong> primary citations; however, these are mainly pointers to<br />

crystallographic publications and provide few hints about the biological function <strong>of</strong> the protein or the <strong>annotation</strong><br />

<strong>of</strong> <strong>functional</strong> <strong>sites</strong>.<br />



various metal binding <strong>sites</strong>. The Catalytic Site Atlas (CSA) [PBT04] is another database<br />

documenting <strong>active</strong> <strong>sites</strong> in enzymes with known 3D structures.<br />

The data are either manually<br />

curated or <strong>predicted</strong>, based on searches for homologous proteins.<br />
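
The SITE record mentioned above is a fixed-column line in a PDB-format file. The sketch below shows how such records could be read, assuming the standard PDB column layout (site identifier in columns 12-14 and up to four residues per line, each given as residue name, chain identifier, and sequence number). The function name is hypothetical and insertion codes are ignored for brevity.

```python
from collections import defaultdict

# 0-based start offsets of the four residue fields on a SITE line.
RESIDUE_OFFSETS = (18, 29, 40, 51)


def parse_site_records(pdb_lines):
    """Map each site identifier to its list of (res_name, chain, seq_num)."""
    sites = defaultdict(list)
    for line in pdb_lines:
        if not line.startswith("SITE  "):
            continue
        site_id = line[11:14].strip()
        for start in RESIDUE_OFFSETS:
            res_name = line[start:start + 3].strip()
            if not res_name:
                continue  # fewer than four residues on this line
            chain = line[start + 4:start + 5].strip()
            seq_num = int(line[start + 5:start + 9])
            sites[site_id].append((res_name, chain, seq_num))
    return dict(sites)
```

A site that spans several SITE lines is collected under the same identifier, because every continuation line repeats the site name. An entry without SITE records simply yields an empty dictionary, which reflects the coverage problem described above.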

The third limitation <strong>of</strong> PDB concerns its use for statistical analysis <strong>of</strong> structure data.<br />

The PDB represents a redundant and biased snapshot <strong>of</strong> the protein universe. Redundancy<br />

is due to the fact that many highly similar structures or identical folds are deposited<br />

in the database leading to an over-representation <strong>of</strong> some proteins. In the past, structure<br />

determination has been guided by hypothesis-driven experiments, short-listed target<br />

proteins in the medical or commercial field, and by the fact that small proteins are methodologically<br />

more tractable for crystallisation.<br />

Consequently, the fold-space has not been fully explored<br />

yet. Although techniques in protein crystallography are improving, there are still other<br />

underrepresented proteins, e.g. membrane proteins or large proteins, which define the<br />

boundaries <strong>of</strong> representativeness <strong>of</strong> the structure data.<br />

While there is little we can do about exploring the complete ensemble <strong>of</strong> folds from<br />

a bioinformatics point <strong>of</strong> view, the over-representation can be filtered.<br />

For example,<br />

protein sequence-based clustering [AGM+90] [AMS+97] is the principal method used to produce<br />

the following datasets: PDB_SELECT [HSSS92], PISCES [WD03], UniqueProt [MR03].<br />

However, this approach is limited by the uncertain sequence-structure relationship in the<br />

so-called twilight zone, i.e. below 30 per cent sequence identity, proteins may or may not<br />

have similar folds [Ros99]. Another critical issue with sequence-based clustering is that<br />

whole protein chain sequences are compared, rather than segments defined by<br />

protein domain boundaries.<br />
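
The filtering of over-represented proteins can be pictured as a greedy selection of representatives under a sequence identity threshold. The sketch below is a simplified illustration and not the procedure of any of the cited tools. It assumes that the sequences are already aligned to equal length (with "-" for gaps) and that 30 per cent identity is used as the cut-off; both function names are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical residues over the gap-free aligned columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def select_representatives(aligned, threshold=30.0):
    """Keep a chain only if it is below `threshold` identity to all kept chains.

    `aligned` maps chain identifiers to aligned sequences; the iteration order
    decides which member of a redundant group is kept.
    """
    representatives = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[rep]) < threshold
               for rep in representatives):
            representatives.append(name)
    return representatives
```

Because whole chains are compared, two chains that share a single domain but differ elsewhere may both be kept, which is the domain-boundary issue raised above.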

Structure based approaches cluster the data on the basis <strong>of</strong> domain structures. Several<br />

databases <strong>of</strong> domain-based structure clustering were created, the most prominent<br />

ranging from entirely manual work (SCOP [HMBC97]) and a semi-automatic approach (CATH<br />

[OMJ+97]) to entirely unsupervised methods (FSSP-Dali [HS94]). Differences between these<br />

classifications were studied by [HJ99] and [DBAD03].<br />



2.1.2 Universal Protein Knowledge base<br />

The major repository <strong>of</strong> protein sequence data is the Universal Protein Knowledge base<br />

(UniProtKB). Along with the sequence data, it lists protein names<br />

and synonyms, taxonomic data, citation references, and other information manually curated<br />

from literature surveys.<br />

One important aspect <strong>of</strong> UniProtKB when evaluating<br />

structure-function relationships is the <strong>annotation</strong> <strong>of</strong> protein residues. In the feature table<br />

the biological function <strong>of</strong> a residue site is described along with several other key categories<br />

(cf. figure 2.3). Currently, UniProtKB lists 333,445 entries with 2,088,573 site-specific<br />

<strong>annotation</strong>s (version from January 2008).<br />
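
Site-specific annotations can be read from the feature table (FT) lines of a UniProtKB flat-file entry. The sketch below assumes the whitespace-separated layout of the flat files of that period (feature key, start position, end position, free-text description, with continuation lines that leave the key column blank). The function name and the selection of site-related keys are assumptions for illustration.

```python
# Feature keys that describe individual residue sites (illustrative selection).
SITE_KEYS = {"ACT_SITE", "METAL", "BINDING", "SITE"}


def parse_feature_table(lines):
    """Return a list of feature dictionaries parsed from FT lines."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue
        body = line[5:]
        if body[:8].strip():
            # New feature: key, from, to, and an optional description.
            parts = body.split(None, 3)
            if len(parts) < 3:
                continue
            description = parts[3].strip() if len(parts) > 3 else ""
            features.append({"key": parts[0], "from": parts[1],
                             "to": parts[2], "description": description})
        elif features:
            # Continuation line: extend the description of the last feature.
            merged = features[-1]["description"] + " " + body.strip()
            features[-1]["description"] = merged.strip()
    return features


def site_features(features):
    """Keep only features whose key describes a residue site."""
    return [f for f in features if f["key"] in SITE_KEYS]
```

Positions are kept as strings because flat-file positions may carry uncertainty markers; converting them to integers is left to the caller.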

Despite the high quality data contained in UniProtKB, the process <strong>of</strong> extracting <strong>functional</strong><br />

<strong>annotation</strong>s from literature remains a laborious human expert curation work. The<br />

curator surveys the biomedical literature, represents the experimentally determined <strong>functional</strong><br />

information, and formulates the precise <strong>functional</strong> role by utilising standardised<br />

semantic resources (cf. section 2.1.3). Despite the highly reliable quality <strong>of</strong> manual curation,<br />

this approach is evidently inefficient considering the number <strong>of</strong> full-text publications<br />

curators have to distil. According to Frishman, if we assume<br />

“[...] that one needs on average roughly 30 min to assess published fact<br />

and bioinformatics evidence for one protein, one thousand annotators would<br />

have to work 1 year long, 8 h a day, to annotate all 5 million sequences that<br />

are currently known. However, since the size <strong>of</strong> the protein database has been<br />

consistently doubling every 18 months, the moving target <strong>of</strong> annotating all<br />

proteins will never be achieved.” [Fri07]<br />
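
The estimate in this quotation can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the quotation (5 million sequences, 30 minutes per protein, one thousand annotators, 8 hours a day); the function name is hypothetical.

```python
def annotation_days(sequences, minutes_per_protein, annotators, hours_per_day):
    """Working days each annotator needs to cover all sequences once."""
    total_hours = sequences * minutes_per_protein / 60
    return total_hours / annotators / hours_per_day


# 5,000,000 proteins at 30 min each are 2,500,000 person-hours,
# i.e. 2,500 hours or 312.5 eight-hour days per annotator.
days = annotation_days(5_000_000, 30, 1_000, 8)
```

The result of 312.5 working days per annotator corresponds to the "1 year long" of the quotation if no days off are assumed, and it covers only a single pass over a database that keeps growing.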

Considering that the estimated total number <strong>of</strong> proteins is in excess <strong>of</strong> 10^10 [CK06],<br />

an automatic or semi-automatic solution is needed to facilitate the laborious human expert<br />

work.<br />

Currently, methods for the automatic expansion <strong>of</strong> citation sets [YLPV07]<br />

[HLC04] [LHC07] and the automatic <strong>annotation</strong> <strong>of</strong> protein function with GO terminologies<br />

[CSL+06] [GJYLRS08] [RSKA+07] are being developed in the field <strong>of</strong> text mining.<br />



Key: Description<br />

INIT_MET: Initiator methionine.<br />

SIGNAL: Extent of a signal sequence (prepeptide).<br />

PROPEP: Extent of a propeptide.<br />

TRANSIT: Extent of a transit peptide (mitochondrion, chloroplast, thylakoid, cyanelle, peroxisome etc.).<br />

CHAIN: Extent of a polypeptide chain in the mature protein.<br />

PEPTIDE: Extent of a released active peptide.<br />

TOPO_DOM: Topological domain.<br />

TRANSMEM: Extent of a transmembrane region.<br />

DOMAIN: Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure or fold.<br />

REPEAT: Extent of an internal sequence repetition.<br />

CA_BIND: Extent of a calcium-binding region.<br />

ZN_FING: Extent of a zinc finger region.<br />

DNA_BIND: Extent of a DNA-binding region.<br />

NP_BIND: Extent of a nucleotide phosphate-binding region.<br />

REGION: Extent of a region of interest in the sequence.<br />

COILED: Extent of a coiled-coil region.<br />

MOTIF: Short (up to 20 amino acids) sequence motif of biological interest.<br />

COMPBIAS: Extent of a compositionally biased region.<br />

ACT_SITE: Amino acid(s) involved in the activity of an enzyme.<br />

METAL: Binding site for a metal ion.<br />

BINDING: Binding site for any chemical group (co-enzyme, prosthetic group, etc.).<br />

SITE: Any interesting single amino-acid site on the sequence, that is not defined by another feature key. It can also apply to an amino acid bond which is represented by the positions of the two flanking amino acids.<br />

NON_STD: Non-standard amino acid.<br />

MOD_RES: Posttranslational modification of a residue.<br />

LIPID: Covalent binding of a lipid moiety.<br />

CARBOHYD: Glycosylation site.<br />

DISULFID: Disulfide bond.<br />

CROSSLNK: Posttranslationally formed amino acid bonds.<br />

VAR_SEQ: Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting.<br />

VARIANT: Authors report that sequence variants exist.<br />

MUTAGEN: Site which has been experimentally altered by mutagenesis.<br />

CONFLICT: Different sources report differing sequences.<br />

Figure 2.3: Categories for protein sequence annotation in UniProtKB. Key categories used to describe<br />

regions or sites of interest in a protein sequence are listed. The key and the corresponding information<br />

(value) are stored in the feature table (FT line) in UniProtKB. The definition of each listed category is<br />

given alongside its key.<br />



Clearly, the annotation for a whole protein cannot be transferred to residue site annotation,<br />

because different groups of residues in the protein structure have different functions.<br />

In this respect, the biological community is missing an information extraction system for<br />

the annotation of proteins at residue level.<br />

2.1.3 Gene Ontology<br />

The Gene Ontology (GO) [AL02] [GOC06] is one of the most widely used functional<br />

classification schemes, including all of the most important criteria for annotations of biological<br />

data [PKS06]. Currently, the ontology lists a total of 26,302 terms with 15,643<br />

biological process terms, 2,233 cellular component terms, and 8,426 molecular function<br />

terms (version November 2008). The UniProtKB/InterPro group at the European Bioinformatics<br />

Institute (EBI) belongs to the Gene Ontology Consortium, and uses its standard<br />

vocabulary for the annotation of protein function. The vocabulary is meant to describe<br />

the biological phenomenology of genes and gene products (proteins). This is the reason why<br />

GO terms are not suitable to describe the function and properties of a protein<br />

residue. Figure 2.4 lists some examples where the identification of GO terms [GJYLRS08]<br />

did not find the more relevant keywords for the annotation of residues. At the moment,<br />

an ontology dedicated solely to the functional annotation of protein residues has not been<br />

developed. However, terminologies can in general be collected from other substantial resources,<br />

such as the Open Biomedical Ontologies [SAR+07], which contain, for example,<br />

REX (an ontology of physico-chemical processes) and PSI-MOD (an ontology describing<br />

protein chemical modifications).<br />

2.1.4 Biomedical literature<br />

Biomedical research tackles biological questions from a number of perspectives, and the<br />

published experimental data are always heterogeneous. The sum of the descriptions of biological<br />

phenomena enables scientists to understand mechanisms in biology within various<br />



Sentence: ”The catalytic mechanism of the non-phosphorylating glyceraldehyde-3-phosphate dehydrogenase and the other aldehyde dehydrogenases resembles a thioester mechanism involving the universally conserved cysteine 298 (pea GAPN).” (PMID:9461340)<br />

Manual annotation: thioester mechanism, conserved cysteine<br />

GO annotation: glyceraldehyde-3-phosphate dehydrogenase (NADP+) (phosphorylating activity), glyceraldehyde-3-phosphate biosynthesis, glyceraldehyde-3-phosphate catabolism, phosphoglycerate dehydrogenase activity<br />

Sentence: ”However, mutations of a key residue, His48, show significant deviation from the relationship, implying a role for the side chain in protection of the complex from hydroxide attack.” (PMID:2690955)<br />

Manual annotation: protection of the complex from hydroxide attack<br />

GO annotation: AT DNA binding, tRNA, tyrosine tRNA ligase activity<br />

Sentence: ”Second, this reactive cysteinyl residue, which is required for L-cysteine desulfurization activity, was identified as Cys325 by the specific alkylation of that residue and by site-directed mutagenesis experiments.” (PMID:81615929)<br />

Manual annotation: L-cysteine desulfurization activity<br />

GO annotation: pyridoxal biosynthesis, phosphate binding, mutagenesis, nitrogenase activity, L-alanine biosynthesis, pyridoxal phosphate binding<br />

Figure 2.4: GO terms are not suitable for protein residue <strong>annotation</strong>. The presented examples demonstrate<br />

that <strong>predicted</strong> GO terms are not always suitable for protein residue <strong>annotation</strong>. The prediction <strong>of</strong><br />

GO terms was done with an information theory based parser [GJYLRS08].<br />



contexts. This body of text has also been compared to an ”unstructured knowledge<br />

database”, where information is present, but difficult to retrieve due to the complexity of<br />

natural language. According to Sidhu,<br />

”[...] it is generally acknowledged that only 20 per cent of biological knowledge<br />

and data is available in a structured format or a database. The remaining<br />

80 per cent of biological information is hidden in the unstructured, free text<br />

of scientific publications.” [SDC06]<br />

In the context of information extraction, the data to be extracted from an article are<br />

words (keywords) regarding biological concepts that could summarise the key message<br />

of the article. At first glance, abstract texts have a high density of keywords but a<br />

low coverage of information, while full-texts cover a larger but more dispersed quantity of data<br />

[FKY+01] [YHF+02] [SPIBA03] [SWS+04] [NBD+06].<br />

Another key distinction between abstract texts and full-texts is the availability of<br />

data resources. Biomedical abstract texts can be publicly downloaded from MEDLINE<br />

without restriction, while full-texts from various journals are only available to<br />

subscribers. Although some full-text articles are accessible through various initiatives<br />

[BMC08] [Plo08] [PMC08], the extraction of information from a whole document is expected<br />

to be much more complex than from an abstract text. For example, a biological<br />

feature of a residue may be expressed over several sentences, requiring co-reference<br />

resolution of the residue and the feature.<br />

2.2 Protein structure data mining<br />

Data mining is an analytic method to identify valid and novel patterns in data. A general<br />

data mining solution does not exist. Instead, human data mining expertise and human<br />

domain expertise are required to solve each specific data mining problem. A data mining<br />



process consists of the following main steps: data selection, feature extraction, and<br />

correlation analysis.<br />

With respect to protein structure data mining, data selection means the identification<br />

of a non-redundant set of protein structures from the PDB (cf. section 2.1.1). Although a<br />

protein structure contains only geometrical information, it is important to distinguish<br />

the types of structural features to be analysed. The candidate target features are:<br />

the configuration of amino acids represented by their Cα atoms, the configuration of backbone<br />

atoms, the spatial arrangement of chemical groups [JIDG03] [YEC+07] [Rus98] [SSR03]<br />

[Old02], and the physicochemical environments [OCR01] [YEC+07]. In order to discover<br />

new information from the data, a data mining algorithm must not contain any<br />

biochemical knowledge. The target should be a mathematical model and not a biological<br />

template.<br />

2.2.1 Hypothesis-driven data analysis<br />

”Within the field <strong>of</strong> bioinformatics research, the term data mining is used very loosely to<br />

describe any type <strong>of</strong> data analysis. (T. Oldfield, pers. comm.).” Hypothesis-driven data<br />

analysis consists <strong>of</strong> defining a biological target (hypothesis), and searching for the target.<br />

Consequently, the result <strong>of</strong> a hypothesis-driven data analysis is not the discovery <strong>of</strong> new<br />

information.<br />

A number of methods have been published that predict a known protein function on the<br />

basis of protein structure information. Initially, research focused on global fold<br />

recognition [HS96] [WR97] [MB99] [KH04] [HPS+03] [AZP+05] to identify evolutionarily<br />

distant, but structurally conserved homologues. Once a match is found, functional annotations<br />

are transferred from the target to the query. Another, more specific approach<br />

focuses on the search for matching local substructures in the proteins. The rationale is<br />

that a biological function can be mapped to a particular residue configuration in the<br />

protein, which is independent in function from the global fold of the structure. One obvious<br />

approach was to design structure templates, which contain all the essential residues<br />

for a biological function. Several specific types of sites or motifs have been studied in<br />

detail to capture metal binding sites [Glu91], the catalytic triad of the serine proteases<br />

[FWLN94] [WBT97], and binding sites for anions such as sulphate and phosphate [Cha93]<br />

[CB94]. Computer-assisted methods were subsequently developed to help experts<br />

design templates by analysing motifs over large sets of proteins corresponding to active<br />

sites [APG+94] [Rus98] [SSR03] [Kle99] [FS98] [FGS98] [WBT97] [BT03] [PB06], surface<br />

patches or clefts [Las95] [KJ94] [LEW98] [SPNW04] [BFL04] or structural binding site<br />

locations [GPP+03] [KN03].<br />

2.2.2 Discovery-driven data mining<br />

The key feature of discovery-driven data mining is the search for common characteristics<br />

(patterns) in the data, without providing any domain knowledge. More specifically, the<br />

target is mathematically defined and the system aims to identify over-representations,<br />

data variations, or singularities in the dataset. Hence discovery-driven data mining can<br />

deliver novel information, although the biological significance of the result is not trivial to establish.<br />

One important aspect in identifying residue interactions in protein structures is the<br />

consideration of contextual information, such as interaction distance, chemical environment,<br />

and evolutionary conservation, in the data mining algorithm. The systems<br />

ET/MA [CFK+05] and ConSurf use evolutionary information in combination with structural<br />

and chemical data in order to highlight regions of local structure with functional<br />

importance. In contrast, the systems PINTS [Rus98] [SSR03] and SIDEMINE [Old02] find<br />

patterns within the distribution of a non-redundant structure set by using solely mathematical<br />

models of interactions. One critical issue in the development of these data mining<br />

methods was the improvement of the signal/noise ratio. In order to boost the signal frequency,<br />

two structural features are merged if one is biologically equivalent to the other.<br />

While the analysis showed that the mined output contained biologically valid data, the<br />

result actually incurs some bias, because biological knowledge was introduced.<br />

2.3 Biomedical literature mining<br />

Biomedical text mining extracts information from text for integration into biological<br />

databases. Due to the complexity of natural language, text processing involves structuring<br />

the text input by means of parsing and the annotation of some linguistic features,<br />

e.g. part-of-speech tags. The majority of biological text analysis is concerned with the<br />

extraction of explicitly stated facts from text, a task referred to as biological information extraction<br />

[Hob02]. Biomedical text mining processes typically consist of two main analysis<br />

steps: biological entity recognition, and biological relation extraction.<br />

The vast number of published biomedical articles contains phenomenological data on<br />

proteins, such as their molecular function. The information is encoded in unstructured<br />

text, and mining it requires different levels of complexity. There are several levels<br />

of text mining challenges in extracting functional annotation: the identification of mutations<br />

[LHC07] [WK07] [BW05] [RSMA+04] [HLC04] or genetic sequences [MG03], the identification<br />

of gene or protein names [RSAG+08] [PJYLRS08] [TMA08] [Fuk98] and chemical entities<br />

[CMR06], the extraction of annotations of molecular function [GJYLRS08] [RSKA+07]<br />

[DS05] [KNT05] [GDAW03] [HNR+05], and the identification of semantic relations between<br />

the biological entities [BLK+08] [LCM03] [SB06].<br />

2.3.1 Biological entity recognition<br />

The process <strong>of</strong> entity recognition (ER) can be split into three parts: location <strong>of</strong> the mentioned<br />

entity in text, classification <strong>of</strong> the entity into a predefined category, and normalising<br />

the entity by referencing to an entry in a database.<br />

Biological entities are <strong>of</strong>ten ambiguous in terms <strong>of</strong> their boundaries and categories.<br />

Probably the most challenging task is the correct identification <strong>of</strong> protein or gene names.<br />



For example, ”hunchback” is a protein in Drosophila, while it is also a general English<br />

term. Furthermore, protein names consist mostly <strong>of</strong> multiple words, e.g. ”Rho-like protein”<br />

or ”HIV-1 envelope glycoprotein gp120”. An ER system needs to identify all the<br />

constituents <strong>of</strong> a protein name in order to relate the detected entity to its reference entry<br />

in a database. The BioCreAtIvE challenge addressed this problem with the 1B subtask;<br />

the target is the identification <strong>of</strong> protein/gene names in text, and the <strong>annotation</strong> <strong>of</strong> their<br />

correct gene identifier. Various solutions have been published, ranging from rule-based methods<br />

[HFM+05] [TW02] [Fuk98] to machine learning approaches [CMP05]. The developed<br />

methods are, in general, reusable for any other biological entity recognition or terminology<br />

identification problem.<br />

Work has also been published that focuses on the extraction of protein point mutations<br />

[RSMA+04] [HLC04] [BW05] [LHC07] [YLPV07], which is one category of protein<br />

residue terminology. Other categories are residue sequences and residue interaction pairs.<br />

The most widely adopted method to identify these terminologies is the design of regular<br />

expression patterns.<br />
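As an illustration of such regular expression patterns, the following minimal sketch recognises residue mentions of the kind quoted in Figure 2.4 (e.g. ”His48”, ”Cys325”, ”cysteine 298”). The pattern set, the names RESIDUE, MUTATION and find_residues, and the example mutation string are assumptions made for this illustration; they are not the patterns used by any of the cited systems.<br />

```python
import re

# Three-letter codes are matched case-sensitively to avoid hits such as
# the English word "his"; full residue names are matched case-insensitively.
THREE_LETTER = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
                "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")
FULL_NAME = ("alanine|arginine|asparagine|aspartate|cysteine|glutamine|"
             "glutamate|glycine|histidine|isoleucine|leucine|lysine|"
             "methionine|phenylalanine|proline|serine|threonine|"
             "tryptophan|tyrosine|valine")

# Residue name followed by its sequence position, e.g. "His48", "Ser-195".
RESIDUE = re.compile(
    rf"\b(?P<name>{THREE_LETTER}|(?i:{FULL_NAME}))[- ]?(?P<pos>\d+)\b")

# Point mutation in three-letter notation, e.g. "His48Ala" (made-up example).
MUTATION = re.compile(
    rf"\b(?P<wt>{THREE_LETTER})(?P<pos>\d+)(?P<mut>{THREE_LETTER})\b")


def find_residues(text):
    """Return (residue name, position) pairs mentioned in the text."""
    return [(m.group("name"), int(m.group("pos")))
            for m in RESIDUE.finditer(text)]
```

Patterns of this kind are cheap to write and precise on conventional notations, but they miss mentions that are phrased freely, which is one reason why the later association steps are needed.<br />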

2.3.2 Biological relation extraction<br />

Relation extraction (RD) aims to find associations between entities, or between an entity<br />

and a terminology within a text phrase. One objective in biomedical information<br />

extraction is the mining of biological facts from text. An example of a biological fact is<br />

the semantic relation between two biological entities, such as a protein-protein interaction<br />

[TOT04].<br />

Until now, three strategies have been investigated for biological relation extraction:<br />

co-occurrence based analysis [LC05] [SB05], pattern-based approaches [HZH+04] [LCM03],<br />

and machine learning based methods [BM05] [BM06]. The common limitation of all of<br />

these extraction systems is that only the relation targets, e.g. the proteins within a protein-protein<br />

interaction, are extracted. Contextual information that would describe or explain<br />

the association of the entities is not considered in the extraction at all. Within the<br />

information extraction community, a consensus has been reached that deeper analysis of<br />

sentence structures is required in order to adequately acquire biomolecular relations from<br />

text [WSC04].<br />

With respect to biological relation extraction, two classes of syntactic parsers have been studied.<br />

The first is the shallow parsing technique, which aims at detecting the main constituents<br />

of a sentence, without determining the complete syntactic structure. Results have been published<br />

in which protein-protein interactions [KNT05] and general biological entity relations<br />

[LCM03] were extracted based on shallow parsing. The second class of syntactic parser<br />

is the full parser, which attempts a deep analysis of the syntactic structure of a sentence.<br />

Several systems have been reported [NED03] [FKY+01] that utilise full parsing<br />

for relation extraction from biomedical literature. One interesting full parser is ENJU<br />

[YMTT05] [MT05], a so-called head-driven phrase structure grammar (HPSG) parser,<br />

which identifies predicate-argument structures (PAS) in a text sentence.<br />

The use of PAS as a template for biomolecular relation extraction was first reported<br />

in [TOT04] [YMTT05]. Recently, two proposition banks were reported that are designed<br />

to capture relations in molecular biology: PASBio [WSC04] and BioProp [TCS+07].<br />

Within this work, there are two types of semantic relations to be extracted. The<br />

first is the residue-protein association. The system called MEMA [RSMA+04] uses a<br />

word distance metric to associate a list of residue-protein pairs with the smallest word<br />

distance. Another approach is to look up valid associations between a residue and a<br />

protein in the context of a predetermined association of a protein and an organism. Three<br />

systems have been reported that adopt this approach: MuteXt [HLC04], MutationMiner<br />

[BW05], and MutationGraB [LHC07].<br />
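The word distance idea can be sketched as follows. This is a simplified illustration of the principle only, not the implementation of MEMA or of any other cited system; the function name, the whitespace tokenisation, and the placeholder protein names in the usage note are assumptions.<br />

```python
def associate_by_word_distance(tokens, residues, proteins):
    """Pair each residue mention with the nearest protein mention.

    tokens   -- the sentence as a list of words
    residues -- set of tokens already recognised as residue mentions
    proteins -- set of tokens already recognised as protein mentions
    Returns (residue, protein, word distance) triples.
    """
    protein_positions = [i for i, t in enumerate(tokens) if t in proteins]
    pairs = []
    for r, token in enumerate(tokens):
        if token not in residues or not protein_positions:
            continue
        # Choose the protein mention with the smallest word distance.
        p = min(protein_positions, key=lambda i: abs(i - r))
        pairs.append((token, tokens[p], abs(p - r)))
    return pairs
```

For the made-up sentence ”Mutation of His48 in ProtA affects binding whereas ProtB uses Cys298”, split on whitespace, the function pairs His48 with ProtA and Cys298 with ProtB, each at a word distance of 2.<br />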

The other semantic relation to be extracted in this work is the association between<br />

a residue entity and the description of its function. The systems MuteXt [HLC04], MEMA<br />

[RSMA+04], MutationMiner [BW05], and MutationGraB [LHC07] are all dedicated to<br />

the extraction of point mutations, but provide no extraction of functional annotation. In<br />

a recent publication [WK07], an ontological model was proposed that should hold information<br />

extracted from MutationMiner as well as point mutation annotations. However,<br />

neither results of feature extraction nor an extraction strategy were provided.<br />

2.4 Conclusion<br />

In this chapter, I have reviewed some <strong>of</strong> the most relevant data resources and research<br />

works in the field <strong>of</strong> protein structure data mining and text mining. Some <strong>of</strong> the data<br />

resources are used in this thesis. In the following, I will present the extraction systems I<br />

have developed during my PhD.<br />



Chapter 3<br />

Mining residue interactions as triads from PDB<br />

In this chapter, I present a novel approach to mining 3D patterns from protein structures.<br />

More specifically, a pattern is defined as the irreducible interaction of a chemical and<br />

spatial configuration of residues. The goal is to identify new information from a non-redundant<br />

dataset using solely mathematical targets. The mined 3D<br />

patterns represent predictions of functional sites in proteins.<br />

3.1 Algorithms<br />

The novelty of the presented 3D pattern mining approach lies in the classification of<br />

residue triplets into one of four interaction classes. The idea of analysing side chain interactions<br />

within a residue triplet is based on the work of [Old02], while the classification of<br />

residue interactions relies on the methodology developed by [JB04]. The developed data<br />

mining method consists of three processing steps: structural feature extraction, detection<br />

of significant configurations as interactions, and grouping and selection of frequent configurations.<br />

Figure 3.1 illustrates the procedures of the entire protein structure data mining<br />

system developed in this thesis.<br />



Figure 3.1: Overview <strong>of</strong> processes and evaluation methods <strong>of</strong> the developed 3D pattern identification<br />

system.<br />



3.1.1 Structural feature extraction<br />

Theory<br />

Residue triplet as spatial pattern unit.<br />

The presented protein structure data mining<br />

algorithm aims to identify significant interactions of residues within a triplet configuration.<br />

The rationale for analysing residue triplets is as follows. In order to<br />

form a functional site in a protein structure, residues need to be physically in close contact.<br />

In other words, there exists a mutual dependency or interaction among the residues.<br />

The interaction can be studied on a two-residue basis (doublet 3D pattern). However,<br />

given the size of the structure data, the probability of any two-residue configuration is<br />

too high for it to be detected as specific. Hence, the signal/noise ratio issue is the reason why<br />

a two-residue 3D pattern is not the target of protein structure data mining [Old02].<br />

A two-residue contact is completely defined by a scalar property, while a three-residue<br />

contact is defined by vectors. Consequently, a three-residue constellation encodes much<br />

more information. This makes it tractable for information theory based methods to find conserved<br />

residue interactions as signals.<br />

In reality, functional sites can be composed of more than three residues, e.g. various<br />

metal binding sites use four coordinating cysteine residues. However, data sparseness<br />

and the mathematical complexity [CL64] [Sin04] of modelling interactions of four or more residues<br />

make this infeasible. In principle, the more variables are introduced in modelling<br />

residue interactions, the more specific the data mining. It should be noted that the identification<br />

of N-body interactions of residues can be solved with a combinatorial approach.<br />

Two triplets are combined if they have two out of three residues in<br />

common [Old02]. This approach was adopted in this study to demonstrate that larger interaction<br />

configurations are extractable. However, this investigation concentrates mainly<br />

on the identification of three-residue interactions. The assumption is that if the output<br />

of the data mining provides valid results, the approach is justified and more complex residue<br />

configurations may inherit this property.<br />
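The combinatorial step can be sketched in a few lines. The sketch below is one straightforward reading of the rule that two triplets are combined when they agree in two out of three residues; the function name and the representation of a residue as any hashable identifier (for example a sequence position) are assumptions, and the procedure of [Old02] may differ in detail.<br />

```python
from itertools import combinations


def merge_triplets(triplets):
    """Combine triplets that share exactly two residues.

    Each triplet is a tuple of three residue identifiers. The result is
    a set of four-residue configurations, each stored as a frozenset.
    """
    merged = set()
    for a, b in combinations(triplets, 2):
        if len(set(a) & set(b)) == 2:
            merged.add(frozenset(a) | frozenset(b))
    return merged
```

Applied to the triplets (1, 2, 3), (2, 3, 4) and (7, 8, 9), the function returns the single four-residue configuration {1, 2, 3, 4}; repeating the step on its output would yield larger configurations.<br />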



Side chain interaction model.<br />

The determination <strong>of</strong> residue interactions requires a<br />

transformation <strong>of</strong> a full atom model into a simpler representation. This is because the<br />

mathematical model, that needs to describe all combinations <strong>of</strong> atom interactions <strong>of</strong> two<br />

residues, would be too complex. The solution is to replace the all-atom structure model<br />

with a coarse grained model, by reducing each residue to a single point. In principle,<br />

a residue point can be calculated either by the centre <strong>of</strong> mass, or the geometric centre<br />

(centroid). Each representation can be calculated from main chain atoms, main and side<br />

chain atoms, or side chain atoms only.<br />

The focus of this study is on side chain interactions within residue triplet configurations.<br />

For this reason, a protein structure is represented as a point spread of side chain<br />

centroids.<br />
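A side chain centroid is simply the arithmetic mean of the side chain atom coordinates. The following minimal sketch assumes that atoms are given as (atom name, (x, y, z)) pairs using PDB atom names, and that residues without side chain atoms in the input (such as glycine when hydrogens are absent) yield no centroid; how the thesis treats such residues is not stated, so this is an assumption of the sketch.<br />

```python
# PDB names of main chain atoms; everything else counts as side chain.
BACKBONE_ATOMS = {"N", "CA", "C", "O", "OXT"}


def side_chain_centroid(atoms):
    """Geometric centre of the side chain atoms of one residue.

    atoms -- list of (atom_name, (x, y, z)) tuples
    Returns an (x, y, z) tuple, or None if there are no side chain atoms.
    """
    coords = [xyz for name, xyz in atoms if name not in BACKBONE_ATOMS]
    if not coords:
        return None  # assumption: no centroid without side chain atoms
    n = len(coords)
    return tuple(sum(c[axis] for c in coords) / n for axis in range(3))
```

Using the geometric centre rather than the centre of mass means that all atoms are weighted equally, regardless of element.<br />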

Protein structure triangulation.<br />

The extraction of residue triplets from a protein is<br />

based on triangulation of structures. Here structures are triangulated on the basis of three<br />

criteria. The first is the compositional constraint. Each residue in a triplet must be<br />

one of the 20 natural amino acids, while hetero atoms are excluded. One prominent<br />

reason is that there are not enough examples of residue-hetero atom interactions in the<br />

dataset to support a statistical analysis.<br />

The second condition of triplet extraction requires that none of the residues are direct<br />

neighbours in the protein sequence. The assumption made here is that two covalently<br />

bonded residues are more likely to be next to each<br />

other in space than two residues that are not bonded. Similarly, the probability of finding three<br />

connected residues close together in space is higher than that of finding three unconnected residues. Consequently,<br />

such residues would be over-represented in the distribution of interacting residues in space. The<br />

definition of residue neighbourhood affects the data mining result, e.g. by requiring a<br />

pair interaction in the triplet to have a sequence distance of more than one residue, patches of<br />

residues on one side of a beta-sheet may not be discovered. While tuning this parameter<br />

can modify the result of the data mining, the objective here is to discover new knowledge<br />



from the input data set by providing as little biological information as possible.<br />

The last criterion in triplet extraction is concerned with the geometrical properties of a<br />

triplet. The Euclidean distances between the residues must fulfil the triangle inequality,<br />

while only two interaction distances of less than 6 Å were allowed. Although the interaction<br />

distance threshold is based on an empirical study of a number of protein structures, this<br />

value may not be adequate, because for residue pairs with large side chains it accepts<br />

only close contacts. For example, the pair interaction of two tryptophans may have a<br />

centroid distance near the maximal allowed interaction distance, while the contacting<br />

atoms are actually very close. The alternative is to set up a threshold system for residue<br />

pairs or triplets, which depends on the types of residues. Although this approach was not<br />

studied in this thesis, future work could use it to improve the developed algorithm. Yet another<br />

approach to selecting residue interactions from a protein structure is based on the analysis<br />

of surface contacts of the side chain groups. While not all functional sites require their<br />

constituents to be in physicochemical contact (e.g. a metal binding site consists of metal<br />

ion coordinating residues without physical contacts), a protein binding site is an example<br />

where residues of two different proteins are in non-covalent interaction. However, the<br />

presented data mining approach aims at an unbiased search for residue interactions in<br />

a dataset of monomeric protein structure domains, and therefore a surface-based selection<br />

criterion would bias the analysis.<br />

Implementation<br />

A coarse grained representation is used in this protein structure analysis. From a full atom<br />

model <strong>of</strong> a protein structure, centroid positions <strong>of</strong> each protein residue were calculated<br />

on the basis <strong>of</strong> their side chain atoms. The resulting simplified structure model is then<br />

triangulated based on three criteria: (1) each residue in a triplet must be an element <strong>of</strong><br />

the 20 natural amino acids; (2) pairs <strong>of</strong> residues in the triplet must not have a sequential<br />

relation in respect of their protein sequence position; and (3) two pairs of residues<br />

must each lie within the maximal interaction distance of 6 Å, while the remaining pair<br />

must have an interaction distance of less than 12 Å.<br />
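The three triangulation criteria can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the `Residue` record and the `triangulate` function are hypothetical names, a sequential relation is read as directly adjacent sequence positions, and criterion (3) is read as requiring the two shortest centroid distances to be within 6 Å and the longest one below 12 Å.<br />

```python
from itertools import combinations
from math import dist
from typing import NamedTuple

AMINO_ACIDS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


class Residue(NamedTuple):
    """Hypothetical coarse-grained residue record."""
    name: str        # three-letter residue name
    seq_pos: int     # position in the protein sequence
    centroid: tuple  # (x, y, z) of the side chain centroid in Å


def triangulate(residues, d_max=6.0):
    """Yield residue triplets that satisfy criteria (1) to (3)."""
    # Criterion (1): keep only the 20 natural amino acids.
    candidates = [r for r in residues if r.name in AMINO_ACIDS]
    for trip in combinations(candidates, 3):
        pairs = list(combinations(trip, 2))
        # Criterion (2): assumed here to exclude direct sequence neighbours.
        if any(abs(p.seq_pos - q.seq_pos) <= 1 for p, q in pairs):
            continue
        d = sorted(dist(p.centroid, q.centroid) for p, q in pairs)
        # Criterion (3): two distances within d_max, the third below 2 * d_max.
        if d[1] <= d_max and d[2] < 2 * d_max:
            yield trip
```

The brute-force enumeration is cubic in the number of residues. It is meant to make the criteria explicit; a spatial index would be needed for realistic structure sets.<br />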

For the interaction analysis it is necessary to define a hash table, based on integer<br />

values <strong>of</strong> centroid distances, and the name <strong>of</strong> residue. The integer value <strong>of</strong> a distance is<br />

calculated by dividing the measured distance by a precision value (hash precision), which<br />

was set at ±0.5 Å. Given a 3-body with<br />

\[ \mathrm{trip} = (A, B, C), \tag{3.1} \]<br />

a three-dimensional hash table is defined as<br />

\[ \mathrm{HT}(A, B, C) = \text{3D hash bin}[i][j][k], \tag{3.2} \]<br />

where i, j, and k are the integer values of the measured distances between the spatial coordinates<br />

of two residues. The integer values are given by the equation<br />

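\[
\begin{aligned}
i &= \mathrm{INT}\bigl(\mathrm{dist}(A, B) / \text{hash precision}\bigr) \\
j &= \mathrm{INT}\bigl(\mathrm{dist}(B, C) / \text{hash precision}\bigr) \\
k &= \mathrm{INT}\bigl(\mathrm{dist}(A, C) / \text{hash precision}\bigr).
\end{aligned}
\tag{3.3}
\]<br />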

For a detailed definition <strong>of</strong> the implemented hashtable cf. [Old01].<br />
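Equations 3.1 to 3.3 can be illustrated as follows. The sketch assumes that the hash precision acts as the bin width in Å and that the residues of a triplet are supplied in a fixed order. The names `hash_key`, `add_triplet` and `hash_tables` are hypothetical; the actual hash table is the one defined in [Old01].<br />

```python
from collections import Counter
from math import dist

# One Counter per residue-name triplet plays the role of the 3D hash table.
hash_tables = {}


def hash_key(a, b, c, hash_precision):
    """Return the bin indices (i, j, k) of equation 3.3 for three centroids."""
    i = int(dist(a, b) / hash_precision)
    j = int(dist(b, c) / hash_precision)
    k = int(dist(a, c) / hash_precision)
    return i, j, k


def add_triplet(names, coords, hash_precision):
    """Count one observed triplet in the table of its residue names."""
    table = hash_tables.setdefault(names, Counter())
    table[hash_key(*coords, hash_precision)] += 1
```

A sparse mapping is used instead of a dense three-dimensional array because most distance combinations are never observed.<br />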

3.1.2 Detection <strong>of</strong> significant configurations as interactions<br />

Theory<br />

The method for residue interaction detection relies on the comparison <strong>of</strong> two probabilistic<br />

models: the reductionistic part-to-whole approximation model, and the holistic reference<br />

model. Part-to-whole approximation is modelled with a collection <strong>of</strong> marginal distributions<br />

defined by subsets <strong>of</strong> the variables. Formally, a 3-body consists <strong>of</strong> three variables<br />

(cf. equation 3.1). To verify whether the probability <strong>of</strong> a triplet, P (A, B, C), can be<br />



factorised, we attempt to approximate it by using all attainable marginals<br />

\[ M = \{P(A, B),\, P(A, C),\, P(B, C),\, P(A),\, P(B),\, P(C)\}. \tag{3.4} \]<br />

If the approximation fits the data, i.e. the probability of finding a particular triplet<br />

is explained by the approximation model, then there is no evidence for an interaction.<br />

In other words, a significant interaction is given when the two models are significantly<br />

different.<br />

The difference between two joint probability density functions O and M is<br />

measured by the Kullback-Leibler divergence<br />

\[ D(O \,\|\, M) = \sum_{i} O(i) \log\left(\frac{O(i)}{M(i)}\right). \tag{3.5} \]<br />

In this context O usually refers to the observed probability or the reference model,<br />

while M is the approximation model. The null hypothesis in testing the interaction model<br />

is that the part-to-whole approximation matches the observed data. The alternative one<br />

is that the approximation does not fit and that there is an interaction. Three cases can<br />

be listed:<br />

\[
\begin{aligned}
D(O \,\|\, M) > 0 &: \text{there is a pattern among $k$ attributes} \\
D(O \,\|\, M) = 0 &: \text{there is no pattern of order $k$} \\
D(O \,\|\, M) < 0 &: \text{there is redundancy among the parts.}
\end{aligned}
\tag{3.6}
\]<br />
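The divergence of equation 3.5 can be computed directly from two probability tables. The following is a minimal sketch: the function name `kl_divergence`, the dictionary representation and the use of the natural logarithm are assumptions. Note that the divergence cannot be negative when both O and M sum to one, so the third case can only arise when the approximation M is not normalised, as can happen for a part-to-whole model built from products and quotients of marginals.<br />

```python
from math import log


def kl_divergence(observed, model):
    """D(O||M) = sum_i O(i) * log(O(i) / M(i)) over states with O(i) > 0.

    `observed` and `model` map each state to a probability. Every state with
    O(i) > 0 must have M(i) > 0, otherwise the divergence is undefined.
    """
    total = 0.0
    for state, o in observed.items():
        if o > 0.0:
            total += o * log(o / model[state])
    return total
```

With identical tables the function returns zero, which corresponds to the case of no pattern of order k.<br />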

Within a 3-body system, four different configurations <strong>of</strong> interactions can be defined<br />

(cf. figure 3.2): no-interaction, one-pair interaction, two-pair interactions, and three-pair<br />

interactions. For each <strong>of</strong> these configurations it is possible to formulate a part-to-whole<br />

approximation model, i.e. the interaction can be factorised. In the case <strong>of</strong> no-interaction,<br />

the probability <strong>of</strong> the observable is expected to be estimated by its singlet probabilities<br />

\[ k = 0:\quad \hat{P}_{0}(A, B, C) = P(A)\,P(B)\,P(C), \tag{3.7} \]<br />



Figure 3.2: Four classes <strong>of</strong> interactions within a 3-body. A circle represents a protein residue, and an<br />

intersection resembles an interaction between two residues. k=0: no-interaction; k=1: one-way or one<br />

pair interaction; k=2: two-way or two-pair interactions; k=3: three-way or three-pair interactions.<br />



whereas in a system with one-pair interaction, two variables are dependent on each other.<br />

Consequently, within a 3-body state there are three isoforms of one-pair interactions:<br />

\[
k = 1:\quad
\begin{cases}
\hat{P}_{1,1}(A, B, C) = \dfrac{P(A,B)\,P(C)}{P(A)\,P(B)} \\[1ex]
\hat{P}_{1,2}(A, B, C) = \dfrac{P(A,C)\,P(B)}{P(A)\,P(C)} \\[1ex]
\hat{P}_{1,3}(A, B, C) = \dfrac{P(B,C)\,P(A)}{P(B)\,P(C)}
\end{cases}
. \tag{3.8}
\]<br />

There are two forms <strong>of</strong> three variable interactions, but with different dependencies:<br />

two-pair interaction (k=2) and three-pair interaction (k=3). These interactions represent<br />

the target <strong>of</strong> 3D pattern mining. In a two-pair interaction, two pairs <strong>of</strong> variables are dependent<br />

on each other, while sharing a common attribute. For example, given A interacts<br />

with B, and B interacts with C, there is no clear observation that A also interacts with<br />

C. Three isoforms are formulated for this interaction:<br />

\[
k = 2:\quad
\begin{cases}
\hat{P}_{2,1}(A, B, C) = \dfrac{P(A,B)\,P(B,C)}{P(B)} \\[1ex]
\hat{P}_{2,2}(A, B, C) = \dfrac{P(A,C)\,P(A,B)}{P(A)} \\[1ex]
\hat{P}_{2,3}(A, B, C) = \dfrac{P(B,C)\,P(A,C)}{P(C)}
\end{cases}
. \tag{3.9}
\]<br />

In case <strong>of</strong> a three-pair interaction, all three variables are dependent on each other, and<br />

the approximation model is defined as<br />

\[ k = 3:\quad \hat{P}_{3,1}(A, B, C) = \frac{P(A,B)\,P(B,C)\,P(A,C)}{P(A)\,P(B)\,P(C)}. \tag{3.10} \]<br />
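To make the part-to-whole models concrete, the following sketch estimates the marginals from a list of observed triplet states and evaluates equations 3.7, 3.9 (first isoform) and 3.10 for one state. It is an illustration only: the function names, the dictionary layout and the treatment of each variable as an opaque label are assumptions, and the one-pair isoforms of equation 3.8 are left out for brevity.<br />

```python
from collections import Counter


def marginals(triplets):
    """Estimate joint, doublet and singlet probabilities from (a, b, c) tuples."""
    n = len(triplets)

    def prob(counter):
        return {key: count / n for key, count in counter.items()}

    return {
        "ABC": prob(Counter(triplets)),
        "AB": prob(Counter((a, b) for a, b, c in triplets)),
        "BC": prob(Counter((b, c) for a, b, c in triplets)),
        "AC": prob(Counter((a, c) for a, b, c in triplets)),
        "A": prob(Counter(a for a, b, c in triplets)),
        "B": prob(Counter(b for a, b, c in triplets)),
        "C": prob(Counter(c for a, b, c in triplets)),
    }


def approximations(m, a, b, c):
    """Evaluate the approximation models for the state (a, b, c)."""
    p_a, p_b, p_c = m["A"][a], m["B"][b], m["C"][c]
    p_ab, p_bc, p_ac = m["AB"][(a, b)], m["BC"][(b, c)], m["AC"][(a, c)]
    return {
        "k0": p_a * p_b * p_c,                          # equation 3.7
        "k2_1": p_ab * p_bc / p_b,                      # equation 3.9, first isoform
        "k3": p_ab * p_bc * p_ac / (p_a * p_b * p_c),   # equation 3.10
    }
```

Each returned value can be compared with the observed joint probability `m["ABC"][(a, b, c)]` when computing the divergence between the reference model and an approximation.<br />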

If the state is disturbed, e.g. by exchanging one variable, a partial interaction will<br />

not be observed. In respect <strong>of</strong> protein biology, this could mean that a residue mutation<br />

abolishes an intramolecular stabilising network. However, as this does not provide an<br />

evolutionary advantage the conservation <strong>of</strong> this residue is likely to be promoted and can<br />

be detected as a recurrent structural feature.<br />

The determined sets <strong>of</strong> two-way (k=2) and three-way (k=3) interactions are the targets<br />

in this data mining.<br />



Implementation<br />

Triplets <strong>of</strong> residues are classified into one <strong>of</strong> the four defined interaction configurations.<br />

The classification is based on a non-parametric cross-validation sampling method described<br />

by [JB04]. A significant interaction is given when the two models O and M are<br />

significantly different. Because the data can be regarded as a sample <strong>of</strong> a multinomial distribution,<br />

the representativeness of the approximation model can be tested by the self-loss<br />

function $D(P' \,\|\, P)$. Here, $P'$ and $P$ are the probability distributions from two samples of equal<br />

size. The weight of evidence for accepting the null hypothesis, i.e. the approximation<br />

model, can be estimated by $p_{cv}$-values from a 2-fold cross-validation. For each random<br />

sampling the dataset is partitioned into two equally sized subsets: one training set and<br />

one test set. From these subsets two joint probability distribution functions, $P'$ and $P$,<br />

are determined from the training and test set, respectively. The marginal distributions,<br />

singlets and doublets, are determined from $P'$ to construct the part-to-whole approximation<br />

$\hat{P}'$. The $p_{cv}$-value is defined as the probability that the self-loss is greater than or equal<br />

to the approximation loss<br />

\[ p_{cv} = \Pr\{ D(P \,\|\, P') \ge D(P \,\|\, \hat{P}') \}. \tag{3.11} \]<br />

On the basis of $p_{cv}$-values, an interaction is discovered if $p_{cv}$ ≤ α, and an interaction<br />

is rejected when $p_{cv}$ > α. High threshold values of α, e.g. 0.95, will bias towards an<br />

interaction and risk overfitting, while lower values, e.g. 0.05, move the bias towards the no-interaction<br />

model and risk underfitting. In this study, a reductionistic bias approach was<br />

chosen, to prefer a simpler no-interaction model, by selecting α = 0.05. The used value<br />

of α is based on the research work of [JB04].<br />
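The estimation of a cross-validated p-value by repeated random 2-fold splits can be sketched as below. This is a simplified illustration and not the procedure of [JB04] in detail. The assumptions are: probabilities are estimated with additive smoothing (a pseudocount of one) so that no state has zero probability, the no-interaction model of equation 3.7 serves as the example approximation, and all function names and the fixed random seed are hypothetical.<br />

```python
import random
from collections import Counter
from math import log


def distribution(sample, states, pseudocount=1.0):
    """Smoothed relative frequencies over a fixed state space."""
    counts = Counter(sample)
    total = len(sample) + pseudocount * len(states)
    return {s: (counts[s] + pseudocount) / total for s in states}


def kl(p, q):
    """Kullback-Leibler divergence D(p||q) for tables with identical keys."""
    return sum(p[s] * log(p[s] / q[s]) for s in p if p[s] > 0.0)


def independence_model(p_joint):
    """Part-to-whole approximation from the singlets of p_joint (equation 3.7)."""
    singles = [Counter(), Counter(), Counter()]
    for state, p in p_joint.items():
        for axis, value in enumerate(state):
            singles[axis][value] += p
    return {s: singles[0][s[0]] * singles[1][s[1]] * singles[2][s[2]]
            for s in p_joint}


def p_cv(data, states, approximate=independence_model, iterations=100, seed=0):
    """Fraction of random splits in which self-loss >= approximation loss."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(iterations):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        p_train = distribution(shuffled[:half], states)   # P'
        p_test = distribution(shuffled[half:], states)    # P
        self_loss = kl(p_test, p_train)                   # D(P || P')
        approx_loss = kl(p_test, approximate(p_train))    # D(P || approximation)
        hits += self_loss >= approx_loss
    return hits / iterations
```

A value at or below α indicates that the approximation loses more than the sampling noise alone would explain, so the simpler model is rejected in favour of an interaction.<br />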



3.1.3 Grouping and selecting frequent configurations<br />

Theory<br />

The result of data mining protein structures can be a large set of 3D patterns. The<br />

data need to be clustered in order to select the most frequent patterns. The assumption<br />

behind data clustering is that residue configurations in protein structures are unlikely<br />

to be absolute and static. By grouping spatially similar configurations, the geometrical<br />

variation of patterns can be compensated and their frequencies improved.<br />

Implementation<br />

The objective in this section is to identify frequent groups <strong>of</strong> geometrically similar triplets<br />

with identical chemical configurations. Data clustering was done in two steps. For each<br />

residue triplet combination, the initial step is to group geometrically similar patterns,<br />

and then count the combined frequencies<br />

\[ G(\mathrm{HT}(i, j, k)) = \sum_{a=i-1}^{i+1} \sum_{b=j-1}^{j+1} \sum_{c=k-1}^{k+1} \mathrm{HT}(a, b, c), \tag{3.12} \]<br />

where HT is a hash table of the residue triplets (cf. equation 3.2). Then local geometrical<br />

peaks were searched by comparing the frequencies of the grouped triplets<br />

\[ \max_{a,\,b,\,c} G(\mathrm{HT}(a, b, c)) < G(\mathrm{HT}(i, j, k)), \tag{3.13} \]<br />

where HT (a, b, c) ≠ HT (i, j, k) with a ∈ {i − 1, i, i + 1}, b ∈ {j − 1, j, j + 1} and<br />

c ∈ {k − 1, k, k + 1}.<br />
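The grouping of equation 3.12 and the peak search of equation 3.13 can be sketched for a sparse hash table that maps bin indices to counts. This is an illustrative reading, not the thesis code: the function names are hypothetical, only occupied bins are considered as peak candidates, and the strict inequality of equation 3.13 is applied literally, so that tied neighbouring bins yield no peak.<br />

```python
from itertools import product

# All 27 index offsets of a bin and its direct neighbours in three dimensions.
NEIGHBOUR_OFFSETS = list(product((-1, 0, 1), repeat=3))


def grouped_frequency(table, i, j, k):
    """Equation 3.12: sum the counts of bin (i, j, k) and its 26 neighbours."""
    return sum(table.get((i + di, j + dj, k + dk), 0)
               for di, dj, dk in NEIGHBOUR_OFFSETS)


def local_peaks(table):
    """Occupied bins whose grouped frequency exceeds that of every neighbour."""
    peaks = []
    for (i, j, k) in table:
        own = grouped_frequency(table, i, j, k)
        neighbours = (grouped_frequency(table, i + di, j + dj, k + dk)
                      for di, dj, dk in NEIGHBOUR_OFFSETS
                      if (di, dj, dk) != (0, 0, 0))
        if all(own > g for g in neighbours):
            peaks.append((i, j, k))
    return peaks
```

Whether the original implementation breaks ties between adjacent bins differently is not stated in the text, so the tie behaviour here should be treated as an assumption.<br />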

Dataset | PDBIDs | Domains | Domain definition | Data selection | Properties<br />

OLDFIELD | 1,442 | 2,320 | mathematical | Sequence alignment | Homologous structural features of divergent proteins.<br />

SCOP40 | 3,449 | 4,734 | human expert | Structure comparison | Convergent structural features of divergent proteins.<br />

Figure 3.3: Non-redundant structure set for 3D pattern mining. The dataset OLDFIELD is based on<br />

the publication of [Old02], and SCOP40 was obtained from ASTRAL Compendium [BKL00]. The size<br />

of the datasets, the method for data selection, and key properties are summarised.<br />

The second step in data clustering finds subgroups of triplets from a local peak, based<br />

on an all atom structure alignment. The determined clusters are ranked by their probability<br />

scores, which are defined as:<br />

\[ P(\mathrm{cluster}) = \frac{\#\text{cluster member}}{\#\text{peak member}}. \tag{3.14} \]<br />

On the basis of P(cluster), a cluster of residue interactions is selected if P(cluster) ≥<br />

τ. In this study, the threshold τ for selecting a cluster was set to 0.66.<br />
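The selection rule of equation 3.14 amounts to a simple ratio test. In the sketch below, `select_clusters` and its arguments are hypothetical names, and the cluster sizes are assumed to be available from the preceding structure alignment step.<br />

```python
def select_clusters(cluster_sizes, peak_members, tau=0.66):
    """Return {cluster: P(cluster)} for clusters with P(cluster) >= tau.

    `cluster_sizes` maps a cluster label to its number of members and
    `peak_members` is the number of triplets in the local peak.
    """
    scores = {name: size / peak_members for name, size in cluster_sizes.items()}
    return {name: p for name, p in scores.items() if p >= tau}
```

If the clusters of a peak are disjoint, a threshold above 0.5 implies that at most one cluster per peak can be selected.<br />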

3.2 Analysing available non-redundant protein structure<br />

sets<br />

The significance <strong>of</strong> this data mining result is greatly dependent on the representativeness<br />

<strong>of</strong> the data. For the frequencies <strong>of</strong> structural features to be true, they would have to be<br />

taken from protein structures <strong>of</strong> all <strong>of</strong> the naturally occurring protein folds. However,<br />

such a data resource is not available at present (cf. section 2.1.1). This effectively means<br />

that protein structure data mining is bound by the availability <strong>of</strong> fold examples. While<br />

from a bioinformatical point <strong>of</strong> view, little can be done to improve the coverage <strong>of</strong> the<br />

fold space, a number <strong>of</strong> efforts have been dedicated to the compilation <strong>of</strong> non-redundant<br />

datasets from PDB.<br />

The results in this thesis are based on the study <strong>of</strong> two non-redundant protein structure<br />

sets: OLDFIELD [Old02] and SCOP40 [HMBC97] [BKL00]. Figure 3.3 summarises<br />



key features <strong>of</strong> each dataset. The major distinction between both datasets lies in the<br />

definition <strong>of</strong> a non-redundant dataset. The purpose in compiling OLDFIELD is to create<br />

a dataset that allows the detection <strong>of</strong> interesting structural equivalence from the<br />

non-specific structural features. The primary data selection is in sequence space. The<br />

resulting dataset contains only sequentially dissimilar protein fragments, while common<br />

fold motifs are preserved. This allows the detection <strong>of</strong> homologous structural components<br />

<strong>of</strong> divergent proteins. In contrast, SCOP represents a biased view <strong>of</strong> protein data by defining<br />

classes in structure space. The assignment to a class, <strong>of</strong> a novel protein, is based on<br />

structure and sequence comparisons. SCOP40 is the data subset <strong>of</strong> SCOP, where sequentially<br />

divergent proteins with convergent structural features are retained. Because the<br />

classification contains structurally divergent proteins, any identified recurrent structural<br />

feature in SCOP40 is an indication <strong>of</strong> convergent evolution.<br />

Another distinction between OLDFIELD and SCOP40 is the method <strong>of</strong> identifying<br />

domain structures. In OLDFIELD, protein fragmentation was done mathematically by<br />

analysing Cα distances [Old01], while in SCOP40 human experts were recruited to process<br />

a batch <strong>of</strong> protein structures. Both approaches have their advantages and caveats. On one<br />

hand, an automatic structure domain identification system can deliver reproducible data,<br />

while the results may not be justified in some cases. On the other hand, expert curated<br />

data represent a single precision view, but the information is difficult to reproduce<br />

as new data become available.<br />

The difference in automatic and manual data selection is also reflected in the size <strong>of</strong><br />

the datasets. In 2002, the compiled non-degenerated domain structure set from OLDFIELD<br />

listed 2,320 domain structures, corresponding to 1,442 PDB identifiers. In contrast,<br />

SCOP40 contained 4,734 domain structures determined from 3,449 PDB identifiers<br />

in the same year.<br />



3.3 Evaluation methods<br />

The presented 3D pattern identification system is a discovery-driven data mining solution.<br />

The assessment <strong>of</strong> performance is done on two levels: the study <strong>of</strong> parameter dependency<br />

(presented in this chapter), and the validation <strong>of</strong> biological significance <strong>of</strong> the data (cf.<br />

chapter 4).<br />

The effect <strong>of</strong> data-related parameters was studied by comparing the mined results from<br />

OLDFIELD and SCOP. In the first part <strong>of</strong> the analysis, the distributions <strong>of</strong> extracted<br />

residue triplets were compared. Then the determined sets <strong>of</strong> k=2 and k=3 interactions<br />

were studied.<br />

The developed data mining method is a three-step process, and the effects of algorithm-related<br />

parameters were studied on two levels. Although the developed data mining<br />

method is controlled by many different parameters, only the following key parameters were<br />

studied: the residue interaction distance, and the size of the cross-validation used to compute p-values.<br />

The effect <strong>of</strong> the interaction distance parameter was studied by varying the maximal<br />

distance between the centroids <strong>of</strong> residues. Three different distance settings were tested:<br />

4Å, 6Å, and 8Å.<br />

Repeated cross-validation sampling was used to determine confidence values for residue<br />

triplet classification. Various iterations were tested (from 100 to 1,500 in steps <strong>of</strong> 100) to<br />

study the effect on the size <strong>of</strong> interaction datasets.<br />

3.4 Results<br />

3.4.1 Identification <strong>of</strong> residue interactions is dependent on data<br />

selection<br />

The result of a data mining analysis is greatly dependent on the input dataset. The<br />

objective in this section is to study the effect of data-related parameters by comparing<br />

results from data mining on OLDFIELD and SCOP40.<br />

With 590,255 unique triplet configurations in SCOP40 and 429,471 in OLDFIELD,<br />

the common set <strong>of</strong> triangulated triplets is 381,578 (cf. figure 3.4). Due to the difference<br />

in the probability distributions <strong>of</strong> both datasets, the classification <strong>of</strong> residue interactions<br />

resulted in different sizes <strong>of</strong> interaction classes. A set analysis on the classification data<br />

shows, that the classes have different sizes <strong>of</strong> overlaps (cf. figure 3.5). For example,<br />

OLDFIELD/k=3 and SCOP40/k=3 have a large common set <strong>of</strong> residue configurations <strong>of</strong><br />

around 89 per cent for OLDFIELD and 44 per cent for SCOP40. In contrast, the common<br />

set <strong>of</strong> k=2 interaction is much lower, i.e. 21 per cent for OLDFIELD and 13 per cent for<br />

SCOP40. The analysis also found two proportions <strong>of</strong> non-agreed classifications (k2/k3<br />

between OLDFIELD/SCOP40).<br />

These results highlight the effect <strong>of</strong> data selection on the data mining result. A different<br />

probability distribution <strong>of</strong> residue triplets, singlets and doublets is the reason, why certain<br />

residue configurations were classified as k=2 in one dataset, and k=3 in another dataset.<br />

3.4.2 The interaction distance correlates with the distribution<br />

<strong>of</strong> residue triads<br />

The extraction <strong>of</strong> residue configurations is controlled by the data representation, feature<br />

extraction, and by the feature selection method. Structural features were extracted by<br />

triangulation <strong>of</strong> a protein structure, which was modelled by a point spread <strong>of</strong> side chain<br />

centroids. The goal in this section is to study the effect <strong>of</strong> varying the interaction distance<br />

parameter. For this analysis the dataset OLDFIELD was used.<br />

Table 3.1 summarises the determined set <strong>of</strong> residue triplets by using three different<br />

maximal interaction distances. With the change <strong>of</strong> the distance threshold, the amount<br />

<strong>of</strong> extracted triplets, and the probability distributions <strong>of</strong> the singlets and doublets are<br />

changed (data not shown). Consequently, the testing <strong>of</strong> significance <strong>of</strong> residue interactions<br />

returns different results. It must be noted that a complete analysis with an 8 Å interaction<br />



Figure 3.4: Distribution analysis <strong>of</strong> extracted residue triplets. The determined residue triplet distribution<br />

from OLDFIELD is compared with SCOP40. The upper panel shows a set analysis <strong>of</strong> the extracted<br />

residue triplets (numbers are the unique counts <strong>of</strong> the residue configuration). The middle panel illustrates<br />

the frequency <strong>of</strong> each triplet (t) (represented as information, I(t)) from the set <strong>of</strong> triplets (T). For<br />

a better visualisation the difference <strong>of</strong> the distributions is measured by the Kullback-Leibler divergence<br />

(lower panel).<br />



Figure 3.5: Comparison <strong>of</strong> extracted residue triplets based on their interaction type. The determined<br />

k=2 and k=3 classification sets from OLDFIELD and SCOP40 are compared by a set analysis. Due to<br />

the interaction classification (k=2, and k=3) there is no intersection <strong>of</strong> all four datasets.<br />

Distance (Å) | Triplets (total) | Triplets (unique) | k=2 | k=3<br />

4 | 2,938 | 1,799 | 16 | 165<br />

6 | 1,379,545 | 429,471 | 9,681 | 134,465<br />

8 | 7,128,886 | 2,016,306 | N/A | N/A<br />

Table 3.1: Study on the effect <strong>of</strong> varying the interaction distance threshold in structure triangulation.<br />

The different determined sets <strong>of</strong> residue triplet configurations in OLDFIELD were achieved by using the<br />

interaction distance thresholds: 4Å, 6Å, and 8Å.<br />



distance was not done in this study.<br />

In conclusion, the effect <strong>of</strong> varying the interaction distance on the triangulation output<br />

is in agreement with the expected result. While the frequencies of "small" triplet<br />

configurations stay the same as the interaction distance threshold is increased, the calculated<br />

probabilities differ, because the underlying distributions differ. This also affects<br />

the result <strong>of</strong> interaction classification.<br />

3.4.3 Interaction classification is sensitive to the size of cross-validation<br />

Significance testing <strong>of</strong> residue interactions is a method for assigning confidence values to<br />

the classification of residue triplets. The p-values were calculated from a two-fold cross-validation<br />

with n iterations of random data sampling. Here, the effect of varying the number<br />

of iterations is studied. OLDFIELD is used as the dataset for this analysis.<br />

Figure 3.6 shows the logarithmic dependency between iteration size and determined<br />

classification sets.<br />

Regression analysis indicates that the finite classification set was<br />

not found after 1,500 iterations. The study of classified residue interactions from each<br />

iteration revealed that the set from iteration i is always a subset of the set from iteration j<br />

with i < j.<br />

In conclusion, the result of varying the iteration sizes indicates that the classification<br />

sets are stable and reproducible. With the increase of iteration size, the previously determined sets are<br />

not altered, meaning that the classification result is reliable, but additional elements are identified.<br />

3.5 Discussion<br />

3D pattern identification is the result <strong>of</strong> a data mining method that finds recurrent structural<br />

features within a protein dataset. The developed analysis method consists <strong>of</strong> three<br />

major modules: triangulation of a protein structure, significance testing of residue interaction,<br />

and data clustering of the determined residue interactions.<br />

Figure 3.6: The effect of varying the cross-validation sample size on significance testing of residue<br />

interaction. The diagram shows the increasing but converging number of determined residue triplet<br />

configurations with one-way, two-way, and three-way interactions at various iteration steps (from 100 to<br />

1,500 in steps of 100) of a non-parametric cross-validation sampling.<br />

Protein structure triangulation is the basis <strong>of</strong> collecting spatial configurations <strong>of</strong> residues.<br />

The definition <strong>of</strong> residue interaction is a complex task, because an amino acid consists<br />

<strong>of</strong> many atoms. Many <strong>of</strong> them are candidates <strong>of</strong> interaction partners. A coarse grained<br />

model was used to overcome this problem, however, with the cost <strong>of</strong> redefining the interaction<br />

distance. Instead <strong>of</strong> measuring interaction distances between atoms <strong>of</strong> two different<br />

amino acids, the distance between the side chain centroids is used. The theoretical physicochemical<br />

interaction distance between two atoms cannot be transferred to measure the<br />

centroid based side chain interactions. The upper bound of interaction distance of 6 Å was<br />

determined from several visual inspections and measurements of residue configurations.<br />

The analysis shows that with d = 6 Å, various side chain rotamer configurations are captured,<br />

which may represent a physicochemical interaction. By reducing the interaction<br />

distance threshold, a bias towards tightly inert residue configurations is observed. Conversely,<br />

the increase in d results in a huge set <strong>of</strong> triplet combinations. Some <strong>of</strong> the larger<br />

triplets do not capture a 3-body interaction, but may be part <strong>of</strong> a four-body interaction,<br />

where the fourth residue is situated between all three residues. Although larger interaction<br />

states may reflect a complete picture <strong>of</strong> a structural unit, the primary aim here is to<br />

find local and adjacent interactions <strong>of</strong> residues.<br />

The performance <strong>of</strong> correlation analysis based on hash tables is sensitive to positional<br />

errors, which typically translate into the computation of "wrong" hash bin indices.<br />

Consider the sample values a = 3.99, b = 4.01, and c = 4.99, where a is assigned to hash<br />

bin index i(a) = 1, while b and c are assigned to i(b) = i(c) = 2. The difference between a<br />

and b is actually smaller than that between b and c. A correlation analysis with these hashed data therefore seems<br />

to be inadequate, although the "correct" hash bin is in the neighbourhood. A solution<br />

to this problem is to consider adjacent hash bins, i.e. a rectangular region, of the table<br />

[LW91].<br />
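The boundary effect and the adjacent-bin remedy can be reproduced with the sample values from the text. The sketch is hedged in two ways: the bin width of 2 Å is assumed only because it reproduces the quoted indices, and `bin_index` and `same_or_adjacent` are hypothetical helper names rather than functions from [LW91].<br />

```python
def bin_index(distance, bin_width):
    """Integer hash bin of a distance, as in equation 3.3."""
    return int(distance / bin_width)


def same_or_adjacent(x, y, bin_width):
    """Treat two distances as matching if their bins coincide or touch."""
    return abs(bin_index(x, bin_width) - bin_index(y, bin_width)) <= 1


a, b, c = 3.99, 4.01, 4.99
indices = [bin_index(v, 2.0) for v in (a, b, c)]   # a falls in bin 1, b and c in bin 2
matched = same_or_adjacent(a, b, 2.0)              # the neighbouring bin recovers the a/b match
```

Extending the comparison to neighbouring bins trades some specificity for robustness against small positional errors.<br />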

The identification <strong>of</strong> an interaction class, e.g. a two-way interaction, is based on a<br />



probabilistic classification approach. Confidence values were assigned to the classification<br />

result, by calculating p-values from non-parametric cross-validation sampling. Theoretically,<br />

the more sampling iterations are used, the more stable the calculated p-values<br />

become. At a certain point, the size of the set of determined interacting residues should converge<br />

to some value. The implication <strong>of</strong> determining a stable p-value is the identification <strong>of</strong> a<br />

finite set <strong>of</strong> residue interactions. Within this study, the final set was not determined and<br />

for practical reasons, a set after 100 iterations was used.<br />

The output <strong>of</strong> extracted patterns depends on the distribution <strong>of</strong> structural features<br />

in the input dataset. The introduced algorithm is based on the assumption that there<br />

are significant trends <strong>of</strong> residue configurations in proteins, if these interactions provide<br />

a significant <strong>functional</strong> or structural advantage. Obviously, we cannot expect that data<br />

mining on two differently defined data selections would deliver the same mining output.<br />

From a mathematical point <strong>of</strong> view, the results are still correct, because the algorithm is<br />

detecting recurrent residue configurations in the data.<br />

3.6 Conclusion<br />

In this chapter, I have presented a novel data mining approach for the discovery <strong>of</strong> 3D<br />

patterns in protein structures.<br />

A pattern is a residue triplet with two- or three-way<br />

interaction of residues. The extraction of 3D patterns is not only dependent on algorithm-related<br />

parameters, but also on the data selection.<br />

The validity <strong>of</strong> the data mining<br />

approach is justified on the basis <strong>of</strong> knowing the limits and effects <strong>of</strong> data and parameters.<br />

In the following chapter, I will present the biological significance <strong>of</strong> the mined result.<br />



Chapter 4<br />

Prediction <strong>of</strong> functions for mined<br />

residue triads<br />

In the previous chapter, a data mining approach was introduced, that identifies recurrent<br />

interacting residues as triplets in protein structures. Assuming, that a certain residue<br />

configuration is conserved in evolution, if it provides a structural or <strong>functional</strong> advantage,<br />

then the mined 3D pattern may represent a <strong>functional</strong> site in the protein. The objective in<br />

this chapter is to demonstrate the biological validity of the data mined results by cross-validation<br />

with a reference database. I present two example cases <strong>of</strong> validated residue<br />

interactions. The first example represents the validation <strong>of</strong> a metal binding site, where<br />

the mined patterns represent either a homologous or a convergent structural feature.<br />

The second validation identifies the catalytic triad from the mined data. The analysis<br />

includes the search for a 4-body configuration <strong>of</strong> the catalytic triad (quartet), in order to<br />

find a previously reported conserved serine residue. The results presented in this chapter<br />

demonstrate the biological significance of the mined data, and justify the data mining<br />

approach.<br />



4.1 Evaluation methods<br />

The biological significance <strong>of</strong> the mined 3D patterns is demonstrated by the rediscovery <strong>of</strong><br />

known residue interactions. A systematic performance analysis, in terms <strong>of</strong> coverage and<br />

accuracy is not possible, because a test set with complete <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> local<br />

residue interactions with biological function is not available. Therefore, various protein<br />

databases were used as references for cross-validations.<br />

The automatic cross-validation <strong>of</strong> metal binding <strong>sites</strong> is based on the comparison <strong>of</strong><br />

the mined 3D patterns with a metal binding site database. Two reference databases were<br />

used and the results compared with each other: MSDsite [GDO+05] and MDB [CHR+02].<br />

The identification <strong>of</strong> available metal binding <strong>sites</strong> in the input dataset considered only<br />

configurations with more than 2 residues. A hit was found, if all residues <strong>of</strong> a metal binding<br />

site were present in a protein structure. Likewise, a mined 3D pattern was identified as<br />

a metal binding site, if all residues <strong>of</strong> the pattern resemble a subset <strong>of</strong> a metal binding<br />

site. However, because a metal binding site can contain more than three residues, and<br />

the mined patterns can have two overlapping triplets, only identified metal binding <strong>sites</strong><br />

were counted and not every matched pattern. The coverage is computed as:<br />

\[
c_{\text{coverage}} = \frac{\#\,\text{unique sites matched by all residues in a 3D pattern}}{\#\,\text{available sites in protein structure set}} \tag{4.1}
\]
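The counting rule behind equation 4.1 can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the data layout (a structure identifier paired with a set of residues), the function name `coverage`, and the toy identifiers `xxx1` and `xxx2` are all assumptions made for the example.

```python
from typing import FrozenSet, Iterable, Tuple

Residue = Tuple[str, str, int]              # (chain id, residue name, sequence position)
Annotated = Tuple[str, FrozenSet[Residue]]  # (structure id, residues of one site or pattern)


def coverage(sites: Iterable[Annotated], patterns: Iterable[Annotated]) -> float:
    """Fraction of reference sites that contain at least one mined pattern."""
    # Only sites with more than two residues are considered, as in the text.
    usable = [s for s in sites if len(s[1]) > 2]
    patterns = list(patterns)
    matched = 0
    for structure_id, site_residues in usable:
        # A site counts once, however many overlapping triplets fall inside it.
        if any(pid == structure_id and residues <= site_residues
               for pid, residues in patterns):
            matched += 1
    return matched / len(usable) if usable else 0.0


# Hypothetical toy data: one four-residue site hit by two overlapping triplets,
# and one three-residue site that no pattern matches.
site_1 = ("xxx1", frozenset({("A", "CYS", 10), ("A", "CYS", 13),
                             ("A", "HIS", 30), ("A", "HIS", 34)}))
site_2 = ("xxx2", frozenset({("A", "ASP", 5), ("A", "HIS", 8), ("A", "HIS", 50)}))
pattern_1 = ("xxx1", frozenset({("A", "CYS", 10), ("A", "CYS", 13), ("A", "HIS", 30)}))
pattern_2 = ("xxx1", frozenset({("A", "CYS", 13), ("A", "HIS", 30), ("A", "HIS", 34)}))

example_coverage = coverage([site_1, site_2], [pattern_1, pattern_2])
```

In the toy data both triplets fall inside the same site, so the site is counted once and one of the two available sites is covered.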

The result <strong>of</strong> metal binding site cross-validation is compared with the performance <strong>of</strong><br />

SIDEMINE [Old02] extraction. Because no such cross-validation had been reported for SIDEMINE,<br />

the SIDEMINE extraction was repeated here. The cross-validation <strong>of</strong> a metal binding site is analogous to the<br />

identification <strong>of</strong> <strong>active</strong> <strong>sites</strong> in the dataset (cf. above).<br />

The identification <strong>of</strong> a convergent metal binding site was done by a manual search in<br />

the mined output from SCOP40. The protein structures <strong>of</strong> a found metal binding site<br />

pattern were analysed with respect to their SCOP classification identifiers.<br />



Dataset | #Triangulated triplets | Interaction type | #Classified interactions | #Clustered patterns | #Pattern frequencies

OLDFIELD | 429,471 | k=2 | 9,681 | 925 | 5,697

OLDFIELD | 429,471 | k=3 | 134,465 | 1,007 | 11,957

SCOP40 | 590,255 | k=2 | 15,455 | 765 | 927

SCOP40 | 590,255 | k=3 | 269,683 | 2,019 | 2,361

Table 4.1: Summary <strong>of</strong> extracted data at each protein structure data mining step. The data mining<br />

was performed on OLDFIELD and SCOP40. The number <strong>of</strong> identified residue triplet interactions is<br />

given in “#Classified interactions”, while the column “#Clustered patterns” indicates the number <strong>of</strong> unique<br />

residue interaction configurations after data clustering, and “#Pattern frequencies” is the total number<br />

<strong>of</strong> examples <strong>of</strong> the found residue interactions in the dataset.<br />

The automatic cross-validation <strong>of</strong> catalytic residues was done by comparing residues<br />

from <strong>active</strong> site templates in CSA [PBT04]. The validation <strong>of</strong> a catalytic <strong>active</strong> site for<br />

all example protein structures was based on manual analysis.<br />

To test whether the mined result contains a second conserved serine residue in the<br />

catalytic triad (quartet) (Asp-His-Ser/Ser), larger residue configurations were constructed.<br />

The method for finding N-bodies is based on the algorithm <strong>of</strong> [Old02]: two 3D patterns<br />

(triplets) from the same protein structure were combined if they shared two common<br />

residues. The analysis considered only the search for 4-, 5-, and 6-bodies.<br />
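The merging rule can be illustrated with a short sketch. It is one possible reading of the rule stated above, not the algorithm of [Old02] itself: the function name `extend_bodies` and the representation of a residue as a plain label are assumptions. The example triplets are the two 1ar5 patterns discussed in section 4.2.1.

```python
from typing import FrozenSet, Iterable, Set

Body = FrozenSet[str]  # residue labels such as "HIS75" (name + sequence position)


def extend_bodies(bodies: Iterable[Body], triplets: Iterable[Body]) -> Set[Body]:
    """Grow each body by one residue using triplets that share two residues with it."""
    triplets = list(triplets)
    extended = set()
    for body in bodies:
        for triplet in triplets:
            if len(body & triplet) == 2:      # two common residues, one new residue
                extended.add(body | triplet)
    return extended


triplets_1ar5 = [
    frozenset({"HIS27", "HIS75", "TRP126"}),   # 2His-Trp
    frozenset({"ASP161", "HIS75", "TRP126"}),  # Asp-His-Trp
]
four_bodies = extend_bodies(triplets_1ar5, triplets_1ar5)
five_bodies = extend_bodies(four_bodies, triplets_1ar5)
```

Applying the same function to the resulting 4-bodies yields 5-bodies, and so on. For the two 1ar5 triplets the single 4-body is Asp-2His-Trp, and no 5-body can be formed from it.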

4.2 Results<br />

In the following sections, the biological significance <strong>of</strong> the mined 3D patterns is evaluated.<br />

Data mining was performed on the datasets OLDFIELD and SCOP40 with the following<br />

parameters: interaction distance d = 6 Å, 100 cross-validation iterations, and selection<br />

<strong>of</strong> clusters based on τ = 0.66 (cf. section 3.4). Table 4.1 summarises the extracted data<br />

at each processing step.<br />



Reference | Method | Dataset | Determined | Validated | Coverage

MSDsite | Mined 3D patterns | OLDFIELD | 567 | 85 | 0.15

MSDsite | SIDEMINE | OLDFIELD | 567 | 60 | 0.11

MDB | Mined 3D patterns | OLDFIELD | 302 | 36 | 0.12

MDB | SIDEMINE | OLDFIELD | 302 | 18 | 0.06

Table 4.2: Identification <strong>of</strong> metal binding <strong>sites</strong> in OLDFIELD. The available metal binding <strong>sites</strong> in the<br />

protein domain structures in OLDFIELD (input dataset) were determined by two reference databases<br />

(MSDsite and MDB). The figures were compared with the cross-validated metal binding <strong>sites</strong> in the<br />

mined 3D pattern dataset. A hit was found in the pattern data if all three residues <strong>of</strong> a pattern are a<br />

subset <strong>of</strong> the residues <strong>of</strong> a metal binding site. The performance was measured in terms <strong>of</strong> coverage.<br />

4.2.1 Identification <strong>of</strong> homologous metal binding <strong>sites</strong><br />

Metal binding proteins play a vital role in a wide range <strong>of</strong> biological processes, such as<br />

structural stability and complex formation. The identification <strong>of</strong> metal binding proteins<br />

is therefore crucial. The objective in this section is to identify metal binding <strong>sites</strong> within<br />

the mined 3D patterns from OLDFIELD by cross-validation with the reference databases<br />

MSDsite [GDO+05] and MDB [CHR+02].<br />

Table 4.2 lists the number <strong>of</strong> determined metal binding <strong>sites</strong> in the input dataset and<br />

the validated 3D patterns. The analysis shows that the determined coverage for both<br />

references is quite similar, providing some confidence in the determined value. While<br />

the mined result covers only a small fraction <strong>of</strong> the available metal binding <strong>sites</strong>, the<br />

coverage is at least that <strong>of</strong> SIDEMINE (0.15 vs. 0.11 for MSDsite, 0.12 vs. 0.06 for MDB).<br />

A manual analysis shows that some <strong>of</strong> the annotated metal binding <strong>sites</strong> can be partially<br />

recovered by merging two 3-bodies into a single 4-body. For example, MSDsite<br />

lists the iron binding site, Asp-3His, for the PDB entry 1ar5 with the residues ASP161,<br />

HIS27, HIS75, and HIS165. The mined result from OLDFIELD contains the patterns<br />

2His-Trp and Asp-His-Trp, with the residues HIS27, HIS75, TRP126, and ASP161, HIS75,<br />

TRP126, respectively. Both triplets can be merged into the 4-body Asp-2His-Trp, which<br />

recovers three <strong>of</strong> the four annotated residues.<br />

A systematic analysis <strong>of</strong> false negatives is beyond the scope <strong>of</strong> this work. However,<br />

preliminary studies indicate that the selection <strong>of</strong> the interaction distance plays an important<br />

role in discovering 3D patterns. For example, by setting the interaction distance d to 8 Å,<br />

various triplet configurations can be extracted that contain the missing histidine, HIS165,<br />

from the example above.<br />

The validity <strong>of</strong> a mined 3D pattern as a metal binding site is demonstrated by manual<br />

analysis <strong>of</strong> several example structures. The examples show that the residues <strong>of</strong> a metal<br />

binding site have a strong conservation <strong>of</strong> the side chain groups, indicating a high energy<br />

bond in the formation <strong>of</strong> a coordinative tetrahedral site. Figure 4.1 illustrates an example<br />

configuration with three cysteines from six structure examples. The listed proteins are<br />

heterogeneous in nature but share the 3Cys-mediated ion binding site. Except<br />

for one entry (1h2r, nickel-iron), all structures coordinate a zinc ion in a tetrahedral configuration.<br />

Another metal binding site with the configuration Cys-2His is shown in figure 4.2.<br />

The cluster lists 11 proteins with the majority being electron transfer proteins.<br />

In conclusion, the mined 3D pattern data contain validated metal coordinating residue<br />

configurations. The result indicates that the presented data mining system is able to<br />

identify homologous structural features, which are recurrent in the dataset.<br />

4.2.2 Validation <strong>of</strong> convergent metal binding <strong>sites</strong><br />

Proteins with different folds can share a common structural feature. For example, various<br />

metal binding <strong>sites</strong> share a common residue arrangement, while the global fold <strong>of</strong> the<br />

metal binding proteins is quite different. In this case, the common pattern represents<br />

a convergent structural feature.<br />

The objective in this section is to test whether the<br />

developed data mining algorithm is able to find patterns <strong>of</strong> convergent structural features.<br />

For this analysis, the data mining was performed on SCOP40.<br />



PDBID Description Bound metal<br />

1h2r periplasmic hydrogenase nickel-iron<br />

1lat glucocorticoid receptor zinc<br />

2nll retinoic acid receptor zinc<br />

1ptq protein kinase c zinc<br />

2ohx alcohol dehydrogenase zinc<br />

4mt2 metallothionein isoform II zinc<br />

Figure 4.1: A metal binding site with the 3Cys pattern. Cross-validation <strong>of</strong> metal binding <strong>sites</strong> with 3D<br />

pattern from OLDFIELD identified the 3Cys configuration (top panel). List <strong>of</strong> protein structures with<br />

the common 3Cys residue configuration (bottom panel).<br />



PDBID Description Bound metal<br />

1kdi plastocyanin cu<br />

1aoz ascorbate oxidase cu<br />

6paz pseudoazurin cu<br />

1jer stellacyanin cu<br />

2azu azurin cu<br />

1bqk pseudoazurin cu<br />

1aac amicyanin cu<br />

1byo plastocyanin cu<br />

1as7 nitrite reductase cu<br />

1nic nitrite reductase cu<br />

1rcy rusticyanin cu<br />

Figure 4.2: A metal binding site with the Cys-2His pattern. Cross-validation <strong>of</strong> metal binding <strong>sites</strong> with<br />

3D pattern from OLDFIELD identified the Cys-2His configuration (top panel). List <strong>of</strong> protein structures<br />

with the common Cys-2His residue configuration (bottom panel).<br />



PDBID Description Bound metal<br />

1iml metal-binding protein zn<br />

1zin phosphotransferase zn<br />

1kk1 translation zn<br />

1ibi metal-binding protein zn<br />

1dgs ligase zn<br />

1hc7 aminoacyl-trna synthetase zn<br />

1gax ligase/rna zn<br />

1dsv virus/virus protein zn<br />

1i50 transcription zn<br />

1ptq phosphotransferase zn<br />

1zbd complex (gtp-binding/effector) zn<br />

1kb4 transcription/dna zn<br />

1dcq metal binding protein zn<br />

1jj2 ribosome cd<br />

1vfy transport protein zn<br />

1ffy ligase/rna zn<br />

1dcq metal binding protein zn<br />

1dsz transcription/dna zn<br />

1d66 transcription regulation cd<br />

2alc dna binding protein zn<br />

1tfi transcription regulation zn<br />

4mt2 metallothionein zn<br />

1jr3 transferase zn<br />

1a5t zinc finger zn<br />

1jjd metal binding protein zn<br />

1bor transcription regulation zn<br />

1zbd complex (gtp-binding/effector) zn<br />

1g25 metal binding protein zn<br />

1pyi complex (dna-binding protein/dna) zn<br />

1hwt complex (activator/dna) zn<br />

1het oxidoreductase zn<br />

Figure 4.3: A metal binding site with the 3Cys pattern. Cross-validation <strong>of</strong> metal binding <strong>sites</strong> with<br />

3D pattern from SCOP40 identified the 3Cys configuration (top panel). List <strong>of</strong> protein structures with<br />

the common 3Cys residue configuration (bottom panel).<br />



PDBID Description Bound metal<br />

1ncs transcription regulation zn<br />

1rmd dna-binding protein zn<br />

2drp complex (transcription regulation/dna) zn<br />

1yuj complex (dna-binding protein/dna) zn<br />

1a1i complex (zinc finger/dna) zn<br />

1ubd complex (transcription regulation/dna) zn<br />

5znf zinc finger dna binding domain zn<br />

2gli complex (dna-binding protein/dna) co<br />

1tf3 complex (transcription regulation/dna) zn<br />

1bhi dna-binding regulatory protein n/a<br />

1e53 transcription zn<br />

1g2a hydrolase ni<br />

1jym hydrolase co<br />

Figure 4.4: A metal binding site with the Cys-2His pattern. Cross-validation <strong>of</strong> metal binding <strong>sites</strong> with<br />

3D pattern from SCOP40 identified the Cys-2His configuration (top panel). List <strong>of</strong> protein structures<br />

with the common Cys-2His residue configuration (bottom panel).<br />



3Cys<br />

SCOP classification | SCOP domain identifiers<br />

a.4.11.1 | 1i50j_<br />

a.27.1.1 | 1ffya1<br />

a.60.2.2 | 1dgsa1<br />

b.35.1.2 | 1heta1<br />

c.26.1.1 | 1gaxa3<br />

c.37.1.8 | 1kk1a3<br />

c.37.1.13 | 1jr3a2, 1a5t_2<br />

g.38.1.1 | 1d66a1, 2alca_, 1pyia1, 1hwtc1<br />

g.39.1.2 | 1kb4b_, 1dsza_<br />

g.39.1.3 | 1iml_2, 1ibia1, 1ibia2<br />

g.39.1.6 | 1jj2t_<br />

g.40.1.1 | 1dsva_<br />

g.41.2.1 | 1zin_2<br />

g.41.3.1 | 1tfi__<br />

g.44.1.1 | 1bor__, 1g25a_<br />

g.45.1.1 | 1dcqa2<br />

g.46.1.1 | 4mt2__, 1jjda_<br />

g.49.1.1 | 1ptq__<br />

g.50.1.1 | 1vfya_, 1zbdb_<br />

g.56.1.1 | 1hc7a3<br />

Cys2His<br />

SCOP classification | SCOP domain identifiers<br />

g.37.1.1 | d1ncs__, d1rmd_1, d2drpa1, d2drpa2, d1yuja_, d1a1ia1, d1ubdc1, d5znf__, d1ubdc2, d2glia4, d2glia2, d2glia3, d1tf3a1, d1bhi__<br />

g.49.1.2 | d1e53a_<br />

d.167.1.1 | d1g2aa_, d1jyma_<br />

Table 4.3: Convergent metal binding <strong>sites</strong> identified in SCOP40. The determined metal binding <strong>sites</strong><br />

from the 3D patterns in SCOP40 belong to different fold classes <strong>of</strong> unrelated proteins (convergent structural<br />

feature).<br />

Two patterns were identified in this study that represent metal binding <strong>sites</strong>. The<br />

3Cys configuration is the first example, with 31 structure examples (cf. figure 4.3). The<br />

second metal binding configuration is the Cys-2His pattern, with 17 structure examples<br />

(cf. figure 4.4). A visual analysis determined that the identified metal binding <strong>sites</strong> from<br />

SCOP40 are similar to the mined result from OLDFIELD (cf. previous section). According<br />

to the SCOP classification scheme, groups <strong>of</strong> protein structures can be determined<br />

that have different domain structures but share the same metal binding site (cf. table<br />

4.3). This indicates that the pattern was found as a recurrent structural feature in<br />

evolutionarily distant proteins.<br />



The result <strong>of</strong> this analysis suggests that the developed data mining algorithm is able<br />

to find recurrent and convergent structural features in a non-redundant structure set.<br />

4.2.3 Recovering <strong>active</strong> <strong>sites</strong> and catalytic triads from the dataset<br />

The catalytic triad is one <strong>of</strong> the best characterised non-metal <strong>active</strong> <strong>sites</strong> <strong>of</strong> serine proteases.<br />

The enzymatic reaction is based on the conserved residues serine, aspartate, and<br />

histidine that work together in a specific spatial arrangement. Previously, the identification<br />

<strong>of</strong> the catalytic triad has been described as the key evaluation analysis in protein<br />

structure data mining, because the occurrence <strong>of</strong> this pattern is just above the noise level<br />

in a dataset <strong>of</strong> analogous proteins [Old02]. The objective in this section is the search<br />

for <strong>active</strong> <strong>sites</strong>, and the catalytic triad in particular, by cross-validation with CSA. The<br />

mined result from OLDFIELD was analysed in this study.<br />

Within OLDFIELD, 235 <strong>active</strong> <strong>sites</strong> were determined, while the number <strong>of</strong> cross-validated<br />

<strong>active</strong> <strong>sites</strong> from the mined output was 27. Table 4.4 lists the validated protein<br />

residues. The majority <strong>of</strong> these residues are found in the Asp-His-Ser pattern, which<br />

was validated as the catalytic triad by manual analysis. The identified catalytic triad<br />

configuration lists 22 structure examples, with the majority belonging to the enzyme class<br />

hydrolase, and only a few belonging to the class oxidoreductase. In comparison, [Old02]<br />

identified 9 proteins, 7 <strong>of</strong> which were rediscovered in this analysis. The remaining<br />

15 <strong>of</strong> the 22 structure examples are additional, validated solutions. Figure 4.5 shows the superimposed<br />

structures for the Asp-His-Ser configuration.<br />

This study shows that the presented data mining system is able to find the catalytic<br />

triad in OLDFIELD. The mined result contains 15 additional valid solutions that were<br />

not discovered in [Old02].<br />



3D pattern (k=2)<br />

Cross-validated<br />

Pattern PDBID RID CSA SIDEMINE EC UID<br />

Ala-Arg-Asn 1qgj A ALA 71, A ARG 38, A ASN67 + 1.11.1.7 PER59 ARATH<br />

7atj A ALA 74, A ARG 38, A ASN 70 1.11.1.7 PER1A ARMRU<br />

His-2Ser 1elt A HIS 57, A SER 195, A SER 214 + 3.4.21.36 ELA1 SALSA<br />

1ppf E HIS 57, E SER 195, E SER 214 3.4.21.37 ELNE HUMAN<br />

1bma A HIS 60, A SER 203, A SER 222 + 3.4.21.36 ELA1 PIG<br />

1avw A HIS 57, A SER 195, A SER 214 + 3.4.21.4 N/A<br />

1hyl A HIS 57, A SER 195, A SER 214 + 3.4.21.- COGS HYPLI<br />

1bit A HIS 57, A SER 195, A SER 214 3.4.21.4 TRY1 SALSA<br />

1jrt A HIS 57, A SER 195, A SER 214 + 3.4.21.4 TRY1 BOVIN<br />

1try A HIS 57, A SER 195, A SER 214 + 3.4.21.4 TRYP FUSOX<br />

1au8 A HIS 57, A SER 195, A SER 214 3.4.21.20 CATG HUMAN<br />

1ct0 E HIS 57, E SER 195, E SER 214 + N/A N/A<br />

Asp-His-Ser 1a8q A ASP 223, A HIS 252, A SER 94 + 1.11.1.10 BPA1 STRAU<br />

1a7u A ASP 228, A HIS 257, A SER 98 + 1.11.1.10 PRXC STRAU<br />

1a88 A ASP 226, A HIS 255, A SER 96 + 1.11.1.10 PRXC STRLI<br />

1a8s A ASP 224, A HIS 253, A SER 94 + 1.11.1.10 PRXC PSEFL<br />

1tib A ASP 201, A HIS 258, A SER 146 3.1.1.3 LIP THELA<br />

3tgl A ASP 203, A HIS 257, A SER 144 3.1.1.3 LIP RHIMI<br />

1bs9 A ASP 175, A HIS 187, A SER 90 + 3.1.1.6 AXE2 PENPU<br />

1avw A ASP 102, A HIS 57, A SER 195 + + 3.4.21.4 N/A<br />

1acb E ASP 102, E HIS 57, E SER 195 + + 3.4.21.4 CTRA BOVIN<br />

1taw A ASP 102, A HIS 57, A SER 195 + 3.4.21.4 N/A<br />

1au8 A ASP 102, A HIS 57, A SER 195 + + 3.4.21.20 CATH HUMAN<br />

1elt A ASP 102, A HIS 57, A SER 195 + 3.4.21.36 ELA1 SALSA<br />

3tgi E ASP 102, E HIS 57, E SER 195 + 3.4.21.4 TRY2 RAT<br />

1agj A ASP 120, A HIS 72, A SER 195 + 3.4.21.- ETA STAAU<br />

1auo A ASP 168, A HIS 199, A SER 114 + 3.4.22.38 CATK HUMAN<br />

1arb A ASP 113, A HIS 57, A SER 194 3.4.21.50 API ACHLY<br />

1jrt A ASP 102, A HIS 57, A SER 195 + 3.4.21.4 TRY1 BOVIN<br />

1try A ASP 102, A HIS 57, A SER 195 3.4.21.4 TRYP FUSOX<br />

2tec E ASP 38, E HIS 71, E SER 225 + 3.4.21.66 THET THEVU<br />

1ppf E ASP 102, E HIS 57, E SER 195 + + 3.4.21.37 ELNE HUMAN<br />

1jfr A ASP 177, A HIS 209, A SER 131 N/A P83850 STREX<br />

1ct0 E ASP 102, E HIS 57, E SER 195 + + N/A N/A<br />

3D pattern (k=3)<br />

Cross-validated<br />

Pattern PDBID RID CSA SIDEMINE EC UID<br />

Ala-Asp-Ser 1brt A ALA 123, A ASP 228, A SER 98 + 1.11.1.10 BPOA2 STRAU<br />

1onr A ALA 225, A ASP 17, A SER 176 2.2.1.2 TALB ECOLI<br />

Asp-Cys-Lys 1nba A ASP 51, A CYS 177, A LYS 144 + 3.5.1.59 CSH ARTSP<br />

Table 4.4: List <strong>of</strong> cross-validated <strong>active</strong> site residues. The catalytic residues in the mined k=2 or k=3<br />

residue triplets were compared against <strong>active</strong> site templates in CSA. RID = a Residue identifier consisting<br />

<strong>of</strong> a chain identifier + a residue name + a residue sequence position.<br />



Figure 4.5: Re-discovery <strong>of</strong> the catalytic triad in OLDFIELD. Examples <strong>of</strong> protein structures with the<br />

Asp-His-Ser pattern were cross-validated by CSA.<br />

4.2.4 Discovering the conserved serine residue in the catalytic<br />

triad (quartet)<br />

The catalytic triad template (Asp-His-Ser) has been reported as a four residue configuration<br />

(Asp-His-Ser/Ser) [WBT97] [BFW+94]. Based on the identified catalytic triad<br />

pattern in OLDFIELD (cf. previous section), the objective in this section is to test<br />

whether a 4-body or even larger residue configurations can be generated, based on the<br />

mined 3D patterns. In addition, the analysis searches for the conserved serine residue in these<br />

extended configurations.<br />

The result <strong>of</strong> extending the catalytic triad is summarised in table 4.5. Of the<br />

22 structure examples, 10 have a single residue extension, and only 7 <strong>of</strong> these 10<br />

4-bodies contain the conserved serine residue (Asp-His-2Ser).<br />

Other 4-bodies were also found with an additional alanine or cysteine residue. Preliminary<br />

studies indicate that even larger configurations can be obtained by combining the<br />

determined 4-bodies into a 5- or 6-body. However, the biological validity <strong>of</strong> the additional<br />

alanine or cysteine in a 4-body, or even other amino acids in larger residue configurations,<br />

needs to be determined.<br />

PDBID Asp-His-Ser His-2Ser Ala-His-Ser Cys-His-Ser Ala-Asp-His

1jrt + + +

1au8 + + +

1ppf + + +

1avw + + + +

1ct0 + + + +

1elt + + + +

1try + + + +

3tgi + + + +

1acb + + + +

1arb + + +

2tec +

1agj +

1taw +

1a8s +

1jfr +

1a7u +

1auo +

1a88 +

1a8q +

1tib +

3tgl +

Table 4.5: Extending the catalytic triad into 4-bodies. Pairs <strong>of</strong> residue triplets from the same<br />

protein structure are merged if two <strong>of</strong> their residues are identical. The first column indicates the<br />

catalytic triad configuration, while the second column represents an extension with a previously reported<br />

conserved serine residue. The remaining columns show other solutions <strong>of</strong> 3-body extensions <strong>of</strong> the<br />

catalytic triad.<br />

In conclusion, the presented algorithm is able to find the catalytic triad (quartet),<br />

i.e. the second conserved serine residue was rediscovered from data mining. While other<br />

residue configurations <strong>of</strong> 4-bodies were also found, the biological role <strong>of</strong> these residues is<br />

being investigated further.<br />

4.3 Discussion<br />

The biological cross-validation <strong>of</strong> the mined 3D patterns requires an adequate knowledge<br />

base as reference. A precision score cannot be estimated from cross-validation studies,<br />

because the result is the solution <strong>of</strong> discovery-driven data mining, and current knowledge<br />

bases have an incomplete coverage <strong>of</strong> <strong>functional</strong> <strong>sites</strong>.<br />

In this respect, the mined 3D<br />

patterns may contain known biological motifs, which are the detectable true positives,<br />



or unknown <strong>functional</strong> <strong>sites</strong>, which cannot be confirmed yet. In addition, the result may<br />

contain noise, which is impossible to detect as false positives. The biological significance<br />

<strong>of</strong> the presented data mining was evaluated by examples <strong>of</strong> known biological <strong>functional</strong><br />

<strong>sites</strong>: the metal binding site, and the catalytic triad. In particular, only known <strong>functional</strong><br />

<strong>sites</strong> for proteins in the input structure set were used as benchmark. An alternative to this<br />

stringent evaluation is to transfer <strong>functional</strong> <strong>sites</strong> from homologous proteins, e.g. based<br />

on the Homology-derived Secondary Structure <strong>of</strong> Proteins (HSSP) database [SS96], and<br />

consider this information as the true positive reference.<br />

About one third <strong>of</strong> the data in the PDB are protein structures co-crystallised with<br />

metal ions, which allows the study <strong>of</strong> metal binding <strong>sites</strong> [BW03]. Within the analysis,<br />

only a small fraction <strong>of</strong> proteins with metal binding <strong>sites</strong> were rediscovered. A systematic<br />

optimisation <strong>of</strong> the developed data mining algorithm was not pursued, e.g. by modification<br />

<strong>of</strong> feature selection criteria, because this would have exceeded the limit <strong>of</strong> this thesis.<br />

Preliminary studies on the source <strong>of</strong> false negatives indicate that the interaction<br />

distance threshold is the first parameter to be optimised. However, with the change <strong>of</strong> this<br />

parameter the probability distribution <strong>of</strong> triangulated structural features is also modified,<br />

and the effect cannot be estimated easily.<br />

The datasets OLDFIELD and SCOP40 are quite different (cf. section 3.2). OLDFIELD<br />

consists <strong>of</strong> sequentially dissimilar protein structures, while the proteins may still<br />

share structure similarity. This property allows the mining <strong>of</strong> homologous structural<br />

features <strong>of</strong> divergent proteins, such as metal binding <strong>sites</strong> or the catalytic triad. The developed<br />

data mining method was also tested for its ability to extract convergent structural<br />

features, by analysing SCOP40. This dataset consists only <strong>of</strong> divergent proteins with no<br />

global structural similarities. As a consequence, structural components are mainly represented<br />

by convergent features, and these residue configurations might be<br />

below the detection level. That is, the occurrences <strong>of</strong> convergent structural features are similar<br />

to the background level. However, metal binding <strong>sites</strong> are examples <strong>of</strong> convergent patterns<br />



that were found in this study. The coordination <strong>of</strong> metal ions is greatly dependent on the<br />

distances and orientations <strong>of</strong> the coordinating residues. For that reason, data mining can<br />

detect these convergent structural features in structurally unrelated proteins.<br />

The presented data mining system identifies local three residue interactions with respect<br />

to their spatial and chemical configuration. In addition, examples <strong>of</strong> 4- and 5-body<br />

interactions were shown as a solution in extending the catalytic triad pattern. The analysis<br />

shows that larger residue configurations can be found with the presented combinatorial<br />

approach. However, the search for larger structural patterns might deliver only protein<br />

stabilising features or other biological units in protein structures that are difficult to<br />

interpret.<br />

4.4 Conclusion<br />

The output <strong>of</strong> the developed data mining algorithm is justified by the cross-validation<br />

<strong>of</strong> biologically relevant structure motifs provided in this study. The mining system is<br />

able to detect recurrent homologous or convergent structural features in the dataset.<br />

More importantly, two biological motifs, the metal binding site and the catalytic triad,<br />

were rediscovered, indicating that the mined output contains biologically valid solutions.<br />

While the prediction <strong>of</strong> <strong>functional</strong> <strong>sites</strong> is an important task in structural biology, the<br />

biological interpretation <strong>of</strong> a 3D pattern requires evidence <strong>of</strong> biological significance. The<br />

combination with published biochemical and experimental data can provide such evidence and<br />

a biological context for data interpretation. In the next chapter, I will present a biomedical<br />

literature mining system for the extraction <strong>of</strong> <strong>functional</strong> <strong>annotation</strong> <strong>of</strong> protein residues.<br />



Chapter 5<br />

Identification <strong>of</strong> protein residues in<br />

MEDLINE<br />

In this chapter, I present a text mining method to identify protein residues in biomedical<br />

texts. In the first step, the algorithm identifies the biological entities <strong>of</strong> residue, protein,<br />

and organism, and then determines the association <strong>of</strong> entity triplets. As a result, a residue<br />

is linked to its source protein, and the protein is mapped to its hosting organism. Because<br />

the developed text mining solution relies on information from UniProtKB, an identified<br />

protein residue is directly linked to a unique UniProtKB entry. One application <strong>of</strong> this<br />

method is the search for abstract texts in MEDLINE with protein residues, and the use <strong>of</strong><br />

the result for the update <strong>of</strong> citations in UniProtKB. The identification <strong>of</strong> protein residues<br />

in biomedical texts is a prerequisite for the extraction <strong>of</strong> <strong>functional</strong> <strong>annotation</strong> <strong>of</strong> residues.<br />

5.1 Algorithms<br />

The developed protein residue identification system is based on the algorithm <strong>of</strong> [HLC04].<br />

The developed method is a four-step procedure: biological entity recognition<br />

<strong>of</strong> organism, protein, and residue (one step each), followed by the association <strong>of</strong> the entity triplet. Figure 5.1<br />

illustrates the procedures <strong>of</strong> this text mining system.<br />



Figure 5.1: Overview <strong>of</strong> processes and evaluation methods for the developed protein residue identification<br />

system.<br />



5.1.1 Protein and organism entity recognition<br />

Theory<br />

The recognition <strong>of</strong> protein and organism entities in text is based on a dictionary lookup<br />

approach. Names <strong>of</strong> proteins, their synonyms, and their gene names are collected<br />

from UniProtKB to populate a protein terminology dictionary. The lookup <strong>of</strong> the<br />

protein dictionary considers the matching <strong>of</strong> morphological variants. The dictionary is<br />

not expanded by syntactical variants <strong>of</strong> terminological entries, such as structural or formal<br />

variants or the addition <strong>of</strong> a modifier or head word, because a lookup approach over the<br />

vast number <strong>of</strong> permutations would require much more memory. The<br />

alternative is to use a probabilistic approach.<br />

A similar method is also used to populate the organism terminology dictionary with<br />

names and synonyms from the NCBI Taxonomy database [WBB+06]. The lookup <strong>of</strong><br />

terminologies also considers the matching <strong>of</strong> morphological variants.<br />

Implementation<br />

The recognition <strong>of</strong> protein entities was based on an approach that combined dictionary<br />

lookup with basic disambiguation [RSKA+07]. All protein names and synonyms were<br />

collected from UniProtKB.<br />

Names <strong>of</strong> species were extracted from the NCBI Taxonomy references in UniProtKB,<br />

and their scientific and common names collected. The dictionary was complemented with<br />

terms describing only the referenced genus. Full organism names were augmented<br />

with abbreviated genus forms, i.e. first letter abbreviation <strong>of</strong> genus + species.<br />
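The dictionary augmentation described above can be sketched as follows. This is an illustrative assumption about how such variants could be generated, not the code used to build the dictionary: the function name `organism_variants` and the choice of "E. coli" style punctuation are hypothetical.

```python
from typing import List


def organism_variants(scientific_name: str) -> List[str]:
    """Return the full name, the genus alone, and the abbreviated-genus form."""
    parts = scientific_name.split()
    if len(parts) < 2:
        # A single word has no genus + species structure to abbreviate.
        return [scientific_name]
    genus, species = parts[0], " ".join(parts[1:])
    return [
        scientific_name,             # full organism name
        genus,                       # genus-only term
        f"{genus[0]}. {species}",    # first-letter abbreviation of the genus
    ]


variants = organism_variants("Escherichia coli")
```

For "Escherichia coli" this produces the full name, "Escherichia", and "E. coli". Abbreviated forms are ambiguous across genera that share an initial, so a real dictionary would still need the disambiguation step mentioned in the text.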

The fast and efficient method for annotating texts with protein and organism names<br />

was based on the publicly available web service called Whatizit [RSAG+08]. The result is<br />

an <strong>annotation</strong> <strong>of</strong> protein and organism names in text with references to UniProtKB and<br />

NCBI Taxonomy.<br />



5.1.2 Entity recognition <strong>of</strong> protein residue<br />

Theory<br />

The identification <strong>of</strong> residue entities is based on a re-implementation <strong>of</strong> previously published<br />

regular expression patterns for point mutations [HLC04] [RSMA+04]. Here, the<br />

patterns are extended to capture three types <strong>of</strong> residue mention in total: wild-type residue,<br />

point mutation, and range or pair <strong>of</strong> residues.<br />

Although amino acid sequences mentioned in text could<br />

also be treated as residue entities, the lack <strong>of</strong> sequence position information<br />

prevents their precise association with proteins.<br />

The first basic type <strong>of</strong> residue mention is the single protein residue sequence reference,<br />

which consists <strong>of</strong> the name <strong>of</strong> an amino acid, followed by the sequence position number,<br />

e.g. ”Gly-12”, ”arginine 4”, ”Tyr74”, ”Arg(53)”. A point mutation is the second type <strong>of</strong><br />

residue mention, where the description details the exchange <strong>of</strong> an amino acid at a given<br />

position.<br />

The common notation is the name <strong>of</strong> the amino acid, its sequence position<br />

number, followed by the exchange. The following are examples <strong>of</strong> point mutations found<br />

in text: ”W77R”, ”Cys560Arg”, ”ser-52->ala”, ”ala2-methionine”.<br />

Finally, the third<br />

type <strong>of</strong> residue mention describes either a range <strong>of</strong> residues or an interaction pair, e.g.<br />

”Tyr 85 to Ser 85”, ”Trp27–Cys29”. The correct identification <strong>of</strong> this type <strong>of</strong> residue<br />

mention requires the consideration <strong>of</strong> contextual information, which is not handled in<br />

this version. The common notation is the string sequence: amino acid name, sequence<br />

position, a connection symbol or connection word, amino acid name, and then sequence<br />

position.<br />

In addition to the abbreviated notation, protein residues can be expressed in syntactical<br />

form, e.g. ”isoleucine at position 3”, ”substitution <strong>of</strong> Ala at position 4 to Gly”,<br />

”Ser472 to glutamic acid”. Additional patterns were developed to accommodate these<br />

and other, less precisely defined residue mentions in syntactical form, e.g. ”residue at position<br />

22, 34, and 40”. Although the entity triplet association algorithm does not use<br />

these underspecified residue mentions, <strong>annotation</strong> can generally still be extracted for<br />

them to increase the recall <strong>of</strong> information extraction.<br />

Implementation<br />

The extraction <strong>of</strong> residue mentions reuses the idea <strong>of</strong> designing regular expressions to find<br />

residue entities in text [RSMA+04] [HLC04]. Some <strong>of</strong> the previously published regular<br />

expression patterns were adopted, while new patterns were created to cover other types<br />

<strong>of</strong> residue mentions, such as basic abbreviated point mutation patterns. In this thesis,<br />

sets <strong>of</strong> regular expressions were developed and implemented as finite state transducers to<br />

identify three types <strong>of</strong> residue entities (cf. table 5.1): wild-type, point mutation, and<br />

range or pair <strong>of</strong> residues. The result is an <strong>annotation</strong> <strong>of</strong> residue mentions in text with<br />

normalised expressions.<br />
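As an illustration of the pattern-based approach, the sketch below uses Python's `re` module to recognise a few abbreviated wild-type and point mutation mentions. It covers only a small part of the notations discussed in this section. The restriction to the 20 standard amino acids, the function names, and the normal form (three-letter code plus position) are assumptions made for the example, not the normalisation used in the thesis.<br />

```python
import re
from typing import List

THREE_LETTER = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
                "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")
ONE_LETTER = "ARNDCQEGHILKMFPSTWYV"   # same order as THREE_LETTER
ONE_TO_THREE = dict(zip(ONE_LETTER, THREE_LETTER.split("|")))
POS = r"[1-9][0-9]*"

# Wild-type site, e.g. "Gly-12", "Tyr74", "Arg(53)".
# Case-insensitive, so ordinary words such as "his 3" would also match.
SITE = re.compile(rf"\b(?P<res>{THREE_LETTER})[-\s(]?(?P<pos>{POS})\)?",
                  re.IGNORECASE)
# Point mutation in three-letter notation, e.g. "Cys560Arg".
MUTATION_3 = re.compile(
    rf"\b(?P<wt>{THREE_LETTER})(?P<pos>{POS})(?P<mut>{THREE_LETTER})\b",
    re.IGNORECASE)
# Point mutation in one-letter notation, e.g. "W77R".
MUTATION_1 = re.compile(
    rf"\b(?P<wt>[{ONE_LETTER}])(?P<pos>{POS})(?P<mut>[{ONE_LETTER}])\b")


def find_sites(text: str) -> List[str]:
    """Return wild-type mentions in the assumed normal form, e.g. 'Gly12'."""
    return [f"{m['res'].capitalize()}{m['pos']}" for m in SITE.finditer(text)]


def find_mutations(text: str) -> List[str]:
    """Return point mutations as 'Trp77Arg' (three-letter hits listed first)."""
    hits = [f"{m['wt'].capitalize()}{m['pos']}{m['mut'].capitalize()}"
            for m in MUTATION_3.finditer(text)]
    hits += [f"{ONE_TO_THREE[m['wt']]}{m['pos']}{ONE_TO_THREE[m['mut']]}"
             for m in MUTATION_1.finditer(text)]
    return hits
```

Note that `SITE` also matches the wild-type part of a mutation such as "Cys560Arg"; a fuller implementation would have to resolve such overlaps, which this sketch does not attempt.<br />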

5.1.3 Association identification <strong>of</strong> the entity triplet organism,<br />

protein, and residue<br />

Theory<br />

The association <strong>of</strong> the entities organism, protein, and residue is a difficult text mining<br />

task. Unlike the association <strong>of</strong> two proteins, e.g. the physical interactions <strong>of</strong> two proteins<br />

(protein-protein interaction), the binary semantic relationships <strong>of</strong> organism-protein and<br />

protein-residue are not necessarily explicitly stated in biomedical texts. For example, a<br />

protein may be mentioned at the beginning <strong>of</strong> a paragraph, while a site-directed mutation<br />

on the same protein is described in later sections. This is one reason why approaches<br />

relying only on language patterns or word distance metrics are not suitable for finding protein-residue<br />

associations. The association task becomes more complex when multiple proteins<br />

are mentioned in the text. Usually a residue has a one-to-one relationship with a protein;<br />

however, two proteins can have the same residue at the same sequence position. While<br />

this ambiguity cannot be resolved without deeper natural language processing techniques,<br />

the problem can be tackled with a knowledge-based approach.<br />



RANGE-TO = ("-"+ ("to" "-+")? | "to");<br />
CONVERT-TO = ("to" | "-"+ ">"?);<br />
XAA = ( "X" | "XAA" | "xaa" );<br />
POS = (1-9)(0-9)*;<br />
RESN1 = [ARNDCQEGHILKMFPSTWYVOUBZX];<br />
RESN3 = ( [aA]la|ALA | [aA]rg|ARG | [aA]sn|ASN | [aA]sp|ASP | [cC]ys|CYS<br />
| [gG]ln|GLN | [gG]lu|GLU | [gG]ly|GLY | [hH]is|HIS | [iI]le|ILE<br />
| [lL]eu|LEU | [lL]ys|LYS | [mM]et|MET | [pP]he|PHE | [pP]ro|PRO<br />
| [sS]er|SER | [tT]hr|THR | [tT]rp|TRP | [tT]yr|TYR | [vV]al|VAL<br />
| [pP]yl|PYL | [sS]ec|SEC | [aA]sx|ASX | [gG]lx|GLX | [xX]aa|XAA);<br />
RESNF = ( [aA]lanine | [aA]rginine | [aA]sparagine | [aA]spart(ate|ic acid) | [cC]ysteine<br />
| [gG]lutamine | [gG]lutam(ate|ic acid) | [gG]lycine | [hH]istidine | [iI]soleucine<br />
| [lL]eucine | [lL]ysine | [mM]ethionine | [pP]henylalanine | [pP]roline<br />
| [sS]erine | [tT]hreonine | [tT]ryptophan | [tT]yrosine | [vV]aline<br />
| [pP]yrrolysine | [sS]elenocysteine | [aA]spartic acid or [aA]sparagine<br />
| [gG]lutamic acid or [gG]lutamine);<br />
SITE = ( (RESN3 | RESNF) POS "residue"?<br />
| (RESN3 | RESNF) "-"+ POS "residue"?<br />
| (RESN3 | RESNF) "residue"? "at position"? POS "residue"?<br />
| (RESN3 | RESNF) "(" POS ")" "residue"?<br />
| "amino acid"? "residue" "at position"? POS<br />
| "amino acid" "residue"? "at position"? POS<br />
| RESNF "residue" POS);<br />
SITES = ( RESNF"s" (("," | "and" | "or") RESNF"s")*<br />
| RESNF"s"? ("at position""s"?)? ("," | "and" | "or") (("at position""s"?)? ("," | "and" | "or") POS)+<br />
| RESNF "residue""s"?<br />
| RESN3 "residue""s"? ("at position""s"?)? POS (("at position""s"?)? ("," | "and" | "or") POS)+<br />
| RESN3 "residue""s"?<br />
| "residue""s"? ("at position""s"?)? POS ("," | "and" | "or") POS)+<br />
| (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS ("," | "and" | "or") POS)+<br />
| RESNF ("," | "and" | "or") POS)* "residue""s"?);<br />
RANGE/PAIR = ( "residue""s"? ("," | "and" | "or") RANGE-TO POS)+<br />
| "amino acid" "residue"? "s"? ("," | "and" | "or") RANGE-TO POS)+<br />
| ("residue""s"?)? "at position""s"? ("," | "and" | "or") RANGE-TO POS)+<br />
| RESI RANGE-TO RESI);<br />
MUTATION = ( RESN1 POS RESN1<br />
| RESN1 "-" POS "-" RESN1<br />
| RESN1 "(" POS ")" RESN1<br />
| RESI CONVERT-TO (RESN3 | RESNF)<br />
| RESI RESN3<br />
| "from" (RESNF | RESN3) CONVERT-TO (RESNF | RESN3) "at position" POS<br />
| (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS<br />
| RESI ("-"+ | CONVERT-TO) RESI "substitution");<br />

Table 5.1: Regular expression patterns for the detection <strong>of</strong> residue mentions in text. The patterns<br />

recognise single (SITE) or multiple wild-type residue <strong>sites</strong> (SITES), a sequence range or residue pair<br />

(RANGE/PAIR), and point mutation (MUTATION). The set covers abbreviated notations <strong>of</strong> residues<br />

as well as grammatical expressions found in text.<br />



The method developed in this work is based on the algorithm <strong>of</strong> [HLC04]. An<br />

identified protein residue can only be validated if it is part <strong>of</strong> the protein<br />

sequence recorded in a reference database, e.g. UniProtKB. This requires that the<br />

protein mentioned in the text is further supported by evidence for the organism under<br />

scrutiny, so that the appropriate protein sequence can be selected from the database; this<br />

avoids the risk <strong>of</strong> using orthologous protein sequences.<br />

Implementation<br />

In this study, the system developed to identify the entity triplet association <strong>of</strong> organism,<br />

protein, and residue was based on the algorithm described by [HLC04], with some modifications.<br />

In the first step, proteins were associated with their host organisms. Given a<br />

protein, all protein-organism (species) pairs were determined from the text and ranked according<br />

to a word distance measure. The word distance between two entities was defined<br />

as the smallest number <strong>of</strong> words between them. The identification <strong>of</strong> protein-organism<br />

associations began with the pair with the smallest word distance. An association was<br />

considered valid if the corresponding relation was specified in UniProtKB. If an association was validated,<br />

the search was terminated and the protein was annotated with the corresponding<br />

Uniprot identifier; otherwise the next entity pair from the list was tested. If no match<br />

between protein and organism (species) was found, the search was relaxed to genus<br />

matching. This relaxed matching is an extension <strong>of</strong> the [HLC04] algorithm. Because<br />

entries in UniProtKB are species-specific, a protein-organism (genus) association<br />

results in a list <strong>of</strong> Uniprot identifiers as <strong>annotation</strong> <strong>of</strong> the protein.<br />
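The first step can be sketched as follows, under simplifying assumptions: each entity has a single mention given as a word offset, and the UniProtKB relations are represented by a small in-memory dictionary. The names `word_distance`, `associate_protein`, and the `Lookup` type are hypothetical, and the identifiers used with them are placeholders rather than real accessions.<br />

```python
from typing import Dict, List, Tuple

# (protein name, species name) -> identifier; stands in for UniProtKB here.
Lookup = Dict[Tuple[str, str], str]


def word_distance(pos_a: int, pos_b: int) -> int:
    """Number of words strictly between two mentions at the given word offsets."""
    return max(abs(pos_a - pos_b) - 1, 0)


def associate_protein(protein: str, protein_pos: int,
                      organisms: List[Tuple[str, int]],
                      lookup: Lookup) -> List[str]:
    """Return the identifier(s) supported by the closest matching organism."""
    ranked = sorted(organisms,
                    key=lambda org: word_distance(protein_pos, org[1]))
    # Species-level match, closest organism mention first.
    for species, _ in ranked:
        uid = lookup.get((protein, species))
        if uid is not None:
            return [uid]
    # Relaxed genus-level match: may yield several identifiers.
    for species, _ in ranked:
        genus = species.split()[0]
        uids = sorted(uid for (prot, sp), uid in lookup.items()
                      if prot == protein and sp.split()[0] == genus)
        if uids:
            return uids
    return []
```

A species-level hit returns a single identifier, whereas the genus fallback returns every identifier recorded for that protein name within the genus, which mirrors the list-valued annotation described above.<br />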

The second step <strong>of</strong> this algorithm was the association <strong>of</strong> residues with their source<br />

proteins. The procedure <strong>of</strong> selecting and ranking the residue-protein pairs was similar<br />

to that used for the protein-organism association. For each pair to be tested,<br />

the annotated Uniprot identifier <strong>of</strong> the protein was used to retrieve the protein sequence<br />

from the database. Three cases can be distinguished: (1) the residue<br />

matches the protein sequence; (2) several alternative sequences from a list<br />

<strong>of</strong> proteins match; and (3) no match can be found for the residue in the available protein<br />

sequences. If a match was found, the residue was annotated with references to the<br />

protein; otherwise the search continued with the next pair from the ranked list.<br />
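The sequence check in the second step can be illustrated with a short sketch. It assumes one-letter residue codes, 1-based sequence positions, and an in-memory mapping from identifier to sequence in place of the database; the function names and the toy sequences are hypothetical.<br />

```python
from typing import Dict, List


def residue_matches(sequence: str, residue: str, position: int) -> bool:
    """True if the one-letter residue occurs at the 1-based position."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == residue


def associate_residue(residue: str, position: int,
                      ranked_proteins: List[List[str]],
                      sequences: Dict[str, str]) -> List[str]:
    """Test protein candidates in ranked order; each carries a list of identifiers.

    Returns one identifier (case 1), several identifiers (case 2),
    or an empty list if no sequence matches (case 3).
    """
    for candidate_ids in ranked_proteins:
        matches = [uid for uid in candidate_ids
                   if residue_matches(sequences.get(uid, ""), residue, position)]
        if matches:
            return matches
    return []
```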

5.2 The construction <strong>of</strong> evaluation test corpora<br />

UniProtKB is one <strong>of</strong> the most comprehensive protein knowledge bases (cf. section 2.1.2).<br />

It contains manually curated <strong>functional</strong> <strong>annotation</strong>s on three levels: protein, protein sequence,<br />

and protein residue. Information is derived from surveys <strong>of</strong> biomedical articles,<br />

and entries are annotated with citation references (PMIDs; PubMed identifiers). However,<br />

the precise association <strong>of</strong> a citation and a protein residue in context <strong>of</strong> <strong>functional</strong><br />

<strong>annotation</strong> is generally not available.<br />

The test dataset for the developed <strong>functional</strong> <strong>annotation</strong> extraction is based on the<br />

citation references from UniProtKB. A Uniprot corpus was generated by retrieving abstract<br />

texts from MEDLINE that are indexed by the knowledge base. From the 136,566<br />

citations listed in UniProtKB, a virtually complete set <strong>of</strong> 136,559 abstract texts was retrieved<br />

from MEDLINE. Although not all information presented in UniProtKB is<br />

necessarily available in the Uniprot corpus, the Uniprot corpus is a starting point for the<br />

evaluation <strong>of</strong> the developed text mining modules. In particular, three test corpora<br />

were derived from the Uniprot corpus: the gold standard corpus with manual <strong>annotation</strong><br />

(GC), and the two cross-validation corpora with annotated information derived from<br />

UniProtKB (XC1 and XC2). Figure 5.2 summarises the key features <strong>of</strong> these test corpora.<br />

For the automatic evaluation <strong>of</strong> extracted data, a cross-validation corpus (XC) was<br />

derived from the Uniprot corpus. This test set was used to analyse the performance <strong>of</strong> protein-organism<br />

(XC1) and residue-protein (XC2) associations.<br />

The test set was annotated<br />

automatically, i.e. the biological entities were detected with the same ER systems. The<br />

documents in the Uniprot corpus were scanned for tri-occurrences <strong>of</strong> organism, protein,<br />



Dataset | Gold standard corpus (GC) | Cross-validation corpus (XC1) | Cross-validation corpus (XC2)<br />
Abstracts count | 100 | 55,998 | 4,503<br />
Method <strong>of</strong> <strong>annotation</strong> | manual | automatic | automatic<br />
Total/unique residues | 362/262 (with 262/191 having residue name + residue sequence position) | N/A | N/A<br />
Total/unique proteins | 990/511 | N/A | N/A<br />
Total/unique organisms | 323/123 | N/A | N/A<br />
Total/unique associations | 240/172 residue-protein-organism associations | NA/70,401 protein-organism as UTP | NA/10,152 protein-residue as URP<br />
Application | Test the type, amount and reliability <strong>of</strong> the extracted information (reproduction <strong>of</strong> manually annotated information). | Test set is assumed to contain the same type <strong>of</strong> information as GC, but certainty is not clear. Study the reproduction <strong>of</strong> information contained in the database. | Test set is assumed to contain the same type <strong>of</strong> information as GC, but certainty is not clear. Study the reproduction <strong>of</strong> information contained in the database.<br />

Figure 5.2: Test corpora for information extraction evaluation. Based on the citation references from<br />

UniProtKB a base corpus was generated by retrieving abstract texts from MEDLINE. Two test corpora<br />

were derived from this corpus: (1) the gold standard corpus (GC), which represents a manually annotated<br />

test set; and (2) the cross-validation corpora (XC1, XC2), which contain automatically assigned<br />

<strong>annotation</strong>s based on information from UniProtKB.<br />

and residue in text and a subset was retained if the combinations <strong>of</strong> the identifier triplet<br />

(UID+TID+PMID) for each document can be found in the database. UID is the Uniprot<br />

ID, TID is the NCBI Taxonomy ID, and PMID is the PubMed identifier. If at least a single<br />

match was found, then a document was selected. For the non-matching combinations the<br />

corresponding <strong>annotation</strong>s were removed from text. This results in the test set XC1 with<br />

the associated set <strong>of</strong> the triple identifier combinations UTP = (UID+TID+PMID). XC2<br />

is a subset <strong>of</strong> XC1, obtained by filtering for documents in which the identifier combination<br />

URP = (UID+RID+PMID) was validated by entries in UniProtKB. RID is a residue<br />

identifier, which consists <strong>of</strong> a residue name + sequence position. 70,401 UTPs from 55,998<br />

abstract texts were determined for XC1, and correspondingly 10,152 URPs were derived<br />

from 4,503 MEDLINE articles in XC2.<br />
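The corpus filtering can be sketched as a set membership test. The function name `filter_corpus` and the data layout (detected identifier pairs per PMID, reference triples as a set) are assumptions for illustration; the same function serves for UTP triples (UID, TID, PMID) and URP triples (UID, RID, PMID).<br />

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]           # (UID, TID) or (UID, RID)
Triple = Tuple[str, str, str]    # (UID, TID, PMID) or (UID, RID, PMID)


def filter_corpus(detected: Dict[str, Set[Pair]],
                  reference: Set[Triple]) -> Dict[str, Set[Pair]]:
    """Keep, per document, only the pairs confirmed by the reference database.

    A document (PMID) is retained if at least one pair is confirmed;
    unconfirmed pairs are dropped from retained documents.
    """
    corpus: Dict[str, Set[Pair]] = {}
    for pmid, pairs in detected.items():
        confirmed = {(uid, other) for uid, other in pairs
                     if (uid, other, pmid) in reference}
        if confirmed:
            corpus[pmid] = confirmed
    return corpus
```

Applying the function first to the detected (UID, TID) pairs and then, on the retained documents, to the detected (UID, RID) pairs reproduces the two-stage selection of XC1 and XC2 in outline.<br />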

The gold standard corpus (GC) was created through manual curation, since no suitable<br />

annotated corpora are available for this study.<br />

A random sample <strong>of</strong> 100 MEDLINE<br />

abstract texts was drawn from the Uniprot corpus, where every abstract text had to contain<br />

a tri-occurrence <strong>of</strong> organism, protein, and residue. Note that the detection <strong>of</strong> the<br />

entities was based on the entity recognition (ER) systems described in the previous section.<br />

The ER systems are not expected to perform perfectly, and therefore a certain<br />

proportion <strong>of</strong> the filtered abstract texts contains falsely identified entities.<br />

From this set <strong>of</strong> 100 abstract texts, manual analysis provided four types <strong>of</strong> <strong>annotation</strong>s.<br />

The first type is the <strong>annotation</strong> <strong>of</strong> the biological entities <strong>of</strong> organism, protein, and residue,<br />

while the second is the <strong>annotation</strong> <strong>of</strong> entity triplet associations, i.e. organism-protein-residue.<br />

Note that this process did not include the grounding <strong>of</strong> protein or organism<br />

entities to entries in the specialised databases, i.e. UniProtKB and NCBI Taxonomy. In<br />

addition, text segments <strong>of</strong> sentences with a residue entity were annotated, if they represent<br />

keywords for <strong>functional</strong> <strong>annotation</strong>. Finally, the association <strong>of</strong> a keyword and a residue<br />

was also annotated in GC.<br />

Note that the documents in GC are only partially contained in XC2: 26 abstracts<br />

are shared between the two datasets. For these shared abstracts, manual <strong>annotation</strong> determined 38 entity triplet associations,<br />

while the corresponding number from XC2 was 58. The total number<br />

<strong>of</strong> unique manually annotated triplet associations in GC is 172 (cf. figure 5.2).<br />

The major difference between the evaluation corpora is that GC contains manually<br />

confirmed biological entities and their associations. In contrast, the <strong>annotation</strong>s<br />

in XC1 and XC2 were derived from UniProtKB, based on the assumption that the<br />

information recorded in the database is also present in the abstract texts.<br />

The interpretation <strong>of</strong> performance<br />

analysis has to consider the properties <strong>of</strong> these evaluation test corpora.<br />

5.3 Evaluation methods<br />

The performance <strong>of</strong> each process <strong>of</strong> the developed protein residue identification system<br />

was scored against the manually annotated gold standard corpus.<br />

Protein mentions to which the<br />

protein entity recognition system and manual curation assigned the same entity (full<br />

term matching) were counted as true positives (TP). The same rule was applied to<br />

count TPs for the detection <strong>of</strong> residue and organism entities.<br />

The evaluation <strong>of</strong> the entity triplet association detection counted an association<br />

as TP only if both pair relations, organism-protein and protein-residue, were determined correctly.<br />

If one <strong>of</strong> the relations was incorrect, the association was counted as a false<br />

positive (FP).<br />
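The counting rule for triplet associations can be expressed with set operations. This is a sketch under the assumption that predicted and manually annotated associations are both available as sets of (organism, protein, residue) tuples; the function name `score_triplets` is hypothetical.<br />

```python
from typing import Set, Tuple

Triplet = Tuple[str, str, str]   # (organism, protein, residue)


def score_triplets(predicted: Set[Triplet],
                   gold: Set[Triplet]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) counts for predicted triplets against the gold set.

    A predicted triplet is a TP only if the whole tuple is in the gold set,
    i.e. both the organism-protein and the protein-residue relation are right.
    """
    true_pos = len(predicted & gold)
    false_pos = len(predicted - gold)
    false_neg = len(gold - predicted)
    return true_pos, false_pos, false_neg
```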

In contrast, the automatic evaluation <strong>of</strong> the entity recognition and entity association<br />

detection systems was performed on XC. An annotated entity within<br />

an abstract text was counted as a true positive if UniProtKB lists the same entity in the context <strong>of</strong> the<br />

given PMID. For example, if organism X in text Y is also indexed in UniProtKB as a<br />

combination <strong>of</strong> TID+PMID, then a TP was counted.<br />

A correct protein-organism association was detected, if the determined identifier combination<br />

UTP was found in XC. Similarly, a correct residue-protein association was found,<br />

if the derived identifier combination URP was found in the test corpus.<br />

The effectiveness <strong>of</strong> the ER and the association detection systems was measured in<br />

terms <strong>of</strong> precision, recall and the balanced F-measure (F1):<br />

\[ \text{precision} = \frac{\#\text{true positive}}{\#\text{true positive} + \#\text{false positive}}, \tag{5.1} \]<br />

\[ \text{recall} = \frac{\#\text{true positive}}{\#\text{true positive} + \#\text{false negative}}, \tag{5.2} \]<br />

\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{5.3} \]<br />
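The three measures can be computed directly from the counts reported in the result tables. The sketch below assumes that "Available" is the number of annotated items (true positives plus false negatives), "Extracted" is the number of items returned by the system (true positives plus false positives), and "Common" is the number of true positives; the function name is hypothetical.<br />

```python
from typing import Tuple


def precision_recall_f1(available: int, extracted: int,
                        common: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and balanced F-measure from table counts."""
    precision = common / extracted if extracted else 0.0
    recall = common / available if available else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

With the residue entity counts of the gold standard corpus (191 available, 203 extracted, 187 common), the function returns values that round to 0.92, 0.98, and 0.95.<br />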

5.4 Results<br />

The developed protein residue identification system in this study consists <strong>of</strong> four modules.<br />

The following sections assess first performances <strong>of</strong> biological entity recognition, and then<br />



Unique residue entities<br />
Reference | Dataset | Available | Extracted | Common | Precision | Recall | F1<br />
(this study) | Gold standard corpus | 191 | 203 | 187 | 0.92 | 0.98 | 0.95<br />
MutationGraB | GPCR corpus | N/A | N/A | N/A | 0.98 | 0.77 | 0.86<br />
MutationMiner | Xylanase corpus | N/A | N/A | N/A | 1.00 | 0.85 | 0.92<br />
MEMA | Mutation corpus | N/A | N/A | N/A | 0.98 | 0.75 | 0.85<br />

Table 5.2: Performance evaluation <strong>of</strong> residue entity recognition. The performance is compared with other<br />

published residue entity recognition systems: MutationGraB (GPCR corpus) [LHC07]; MutationMiner<br />

(Xylanase corpus) [BW05]; and MEMA (Mutation corpus) [RSMA+04]. Performance was measured in<br />

terms <strong>of</strong> precision, recall, and F1 measure.<br />

the association <strong>of</strong> the entity triplet organism, protein, and residue.<br />

The final section<br />

presents an application <strong>of</strong> the presented text mining solution that can be used to update<br />

the citation set <strong>of</strong> UniProtKB or any other derived databases.<br />

5.4.1 Evaluation <strong>of</strong> organism, protein, and residue entity recognition<br />

The goal <strong>of</strong> biological entity recognition, in this study, is to detect the mentions <strong>of</strong> residue,<br />

protein, and organism in biomedical abstract texts. In order to evaluate the performance<br />

<strong>of</strong> the developed ER systems, the detections were compared against the<br />

manually curated test set, the gold standard corpus (GC).<br />

The evaluation shows that the developed regular expression patterns are well suited<br />

to the detection <strong>of</strong> residue mentions in biomedical texts. ER for residue mentions yields<br />

a precision <strong>of</strong> 0.92 and a recall <strong>of</strong> 0.98. With an F1 measure <strong>of</strong> 0.95, the performance<br />

<strong>of</strong> this ER system is within the range <strong>of</strong> previous reports on point mutation identification<br />

[LHC07] [BW05] [RSMA+04] (cf. table 5.2).<br />

Protein mention identification achieves 65% precision and<br />

60% recall (62% F1 measure). This result is difficult to compare with previously reported<br />

systems, e.g. ProMiner and MutationMiner (cf. table 5.3), because <strong>of</strong> the different experimental<br />

setups. ProMiner was evaluated on the BioCreAtIvE corpus (80% F1 measure),<br />



Unique protein entities<br />
Reference | Dataset | Available | Extracted | Common | Precision | Recall | F1<br />
(this study) | Gold standard corpus | 511 | 471 | 305 | 0.65 | 0.60 | 0.62<br />
ProMiner | BioCreAtIvE corpus | N/A | N/A | N/A | 0.8 | 0.8 | 0.8<br />
MutationMiner | Xylanase corpus | N/A | N/A | N/A | 0.88 | 0.71 | 0.79<br />

Table 5.3: Performance evaluation <strong>of</strong> protein entity recognition. The performance is compared with<br />

other published protein entity recognition systems: ProMiner (BioCreAtIvE corpus, Task 1B, protein<br />

and gene name identification) [HFM+05]; and MutationMiner (Xylanase corpus) [BW05]. Performance<br />

was measured in terms <strong>of</strong> precision, recall, and F1 measure.<br />

Unique organism entities<br />
Reference | Dataset | Available | Extracted | Common | Precision | Recall | F1<br />
(this study) | Gold standard corpus | 123 | 109 | 88 | 0.81 | 0.72 | 0.76<br />
MutationMiner | Xylanase corpus | N/A | N/A | N/A | 0.88 | 0.71 | 0.79<br />

Table 5.4: Performance evaluation <strong>of</strong> organism entity recognition. The performance is compared with<br />

the NER system <strong>of</strong> MutationMiner (Xylanase corpus) [BW05]. Performance was measured in terms <strong>of</strong><br />

precision, recall, and F1 measure.<br />

which links the contained protein mentions to only a small set <strong>of</strong> organisms. However,<br />

we have repeated the experiment on the BioCreAtIvE dataset and the result suggests<br />

that our method yields a comparable performance (76% F1 measure). Conversely, the<br />

evaluation <strong>of</strong> MutationMiner not only considers abstract texts but also the content <strong>of</strong> the<br />

full-text articles which should improve the results (79% F1 measure).<br />

Although the developed organism entity recognition system relies on a similar dictionary<br />

lookup approach as protein entity recognition, the performance is higher (precision<br />

<strong>of</strong> 0.81 and recall <strong>of</strong> 0.72; cf. table 5.4). This indicates that the list <strong>of</strong> terms is<br />

precise and covers a wide range <strong>of</strong> expressions.<br />

In conclusion, with F1 measures <strong>of</strong> 0.95, 0.62, and 0.76 for the entity recognition <strong>of</strong><br />

residue, protein, and organism, the developed text mining system is able to detect these<br />

three biological entities in biomedical abstract texts.<br />



Unique residue-protein-organism associations<br />
Reference | Dataset | Available | Extracted | Common | Precision | Recall | F1<br />
(this study) | Gold standard corpus | 172 | 79 | 65 | 0.82 | 0.38 | 0.52<br />
MutationGraB | Mutation corpus | N/A | N/A | N/A | 0.85 | 0.69 | 0.76<br />
MEMA | Mutation corpus | N/A | N/A | N/A | 0.93 | 0.35 | 0.51<br />
MuteXt | tinyGRAP | N/A | N/A | N/A | 0.88 | 0.83 | 0.85<br />

Table 5.5: Performance evaluation <strong>of</strong> residue-protein-organism entity association detection. The performance<br />

is compared with other published point mutation detection systems: MutationGraB (Mutation<br />

corpus1) [LHC07]; and MEMA (Mutation corpus2) [RSMA+04]. Note that MEMA identified associations<br />

only, without grounding. Performance was measured in terms <strong>of</strong> precision, recall, and F1 measure.<br />

5.4.2 Performance study on the entity triplet association<br />

The objective <strong>of</strong> the developed association detection system is to identify the entity triplet<br />

<strong>of</strong> organism, protein, and residue.<br />

In this section, the performance <strong>of</strong> this detection<br />

system is studied by comparing the <strong>predicted</strong> association with the manually annotated<br />

associations in the gold standard corpus (GC).<br />

With a precision <strong>of</strong> 0.82 and a recall <strong>of</strong> 0.38, the developed system detects associations<br />

reliably but incompletely, and the precision is comparable to other related reports<br />

(cf. table 5.5). In comparison to MutationGraB and MuteXt, the low recall<br />

can be explained by the differences in the test corpora: both systems were evaluated on<br />

protein-family-specific full-text articles. The precision reported for MEMA is not directly<br />

comparable with this study, because MEMA identifies only associations, without grounding to Uniprot<br />

entries.<br />

Manual analysis isolated two main reasons for the low recall. First, the association <strong>of</strong><br />

all three entities failed in several cases because the system did not find an association<br />

between protein and organism.<br />

Second, there were cases in which a protein-organism<br />

association was correctly identified, but a protein-residue association could not<br />

be found. A detailed explanation is given in the discussion section.<br />

Despite the low recall <strong>of</strong> this text mining module, the evaluation indicates that the<br />

developed method is able to detect associations <strong>of</strong> residue, protein, and organism. More<br />



UTP<br />
Dataset | Available | Extracted | Common | Precision | Recall | F1<br />
XC1 | 70,401 | 77,407 | 62,068 | 0.82 | 0.88 | 0.85<br />
URP<br />
Dataset | Available | Extracted | Common | Precision | Recall | F1<br />
XC2 | 10,152 | 10,876 | 9,325 | 0.86 | 0.92 | 0.89<br />

Table 5.6: Performance evaluation <strong>of</strong> protein-organism and protein-residue entity association detection.<br />

A cross-validation corpus (XC) was derived from UniProtKB by first retrieving the cited<br />

abstract texts from MEDLINE, searching them for tri-occurrences <strong>of</strong> the named entities residue, protein, and organism,<br />

and then retaining only those entries for which the identifier combination UTP (Uniprot identifier<br />

+ NCBI Taxonomy identifier + PubMed identifier) was found in UniProtKB. The result is the test set<br />

XC1 for the protein-organism association study. XC2 is a subset <strong>of</strong> XC1, obtained by scanning for documents in which<br />

the identifier combination URP (Uniprot identifier + residue identifier + PubMed<br />

identifier) was validated by UniProtKB. Performance was measured in terms <strong>of</strong> precision, recall, and F1<br />

measure.<br />

importantly, the detected associations are in accordance with manually identified semantic<br />

relations between the three biological entities. With a precision <strong>of</strong> 0.82, the developed<br />

method is able to identify protein residues in biomedical texts precisely.<br />

5.4.3 Cross-validation <strong>of</strong> identified residues with UniProtKB<br />

In the previous section the system for the association <strong>of</strong> the entity triplet organism,<br />

protein, and residue, was evaluated manually on the gold standard corpus. The objective<br />

in this section is to perform an analysis on a larger test set by cross-validation with<br />

UniProtKB. For this task, the cross-validation corpora XC1 and XC2 were used. The<br />

analysis consists <strong>of</strong> a two-step association study, i.e. the association <strong>of</strong> protein-organism<br />

and residue-protein were evaluated individually. Table 5.6 summarises the results.<br />

With a precision <strong>of</strong> 0.82 and a recall <strong>of</strong> 0.88, the result for organism-protein association<br />

indicates that the system is able to extract correct semantic relations from XC1. The second<br />

step <strong>of</strong> the evaluation determines the performance <strong>of</strong> the residue-protein association<br />

detection. A similar precision score <strong>of</strong> 0.86 was determined, while the recall (0.92) was<br />



Triplet association/UTRP<br />
Resource | Available | Extracted | Common | Precision | Recall | F1<br />
GC | 38 | 61 | 29 | 0.48 | 0.76 | 0.59<br />
XC2 | 58 | 61 | 52 | 0.84 | 0.90 | 0.87<br />

Table 5.7: A specialised performance evaluation between GC and XC2. The test set consists <strong>of</strong> the 26<br />

documents common to GC and XC2. A comparison <strong>of</strong> the annotated entity triplet associations<br />

from both resources shows that the lists <strong>of</strong> targets are different.<br />

almost twice as high as the triple entity association determined with GC (cf. table 5.5).<br />

This can be explained by the differences <strong>of</strong> the used <strong>annotation</strong> methods for both test<br />

corpora. The entities and their associations in GC were determined manually and did not<br />

consider a grounding step.<br />

To better compare the performance between the GC and XC2 data the common set <strong>of</strong><br />

26 abstract texts from both corpora were studied (cf. section 5.2). By reusing the URP<br />

information from the cross-validation corpus the determined performance is similar to the<br />

one evaluated on the whole XC2 dataset (compare table 5.7 with table 5.6).<br />

However, this result is different from the evaluation based on manual <strong>annotation</strong>. A<br />

detailed analysis shows that manual <strong>annotation</strong> determined 38 entity triplets, whereas<br />

XC2 lists 58 associations and only 25 <strong>of</strong> these are common among both data sets (data<br />

not shown). This indicates that the annotated targets in GC and XC2 are different and<br />

cannot be compared directly.<br />
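The scores in table 5.7 can be reproduced from the listed counts. The following minimal sketch assumes that "Available" is the number of reference associations, "Extracted" the number of associations returned by the system, and "Common" their intersection; the function name `prf` is chosen here for illustration only.

```python
def prf(available: int, extracted: int, common: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from annotation counts."""
    precision = common / extracted   # share of extracted associations that are correct
    recall = common / available      # share of reference associations that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# GC row of table 5.7: 38 available, 61 extracted, 29 common.
gc_scores = tuple(round(score, 2) for score in prf(38, 61, 29))
print(gc_scores)  # (0.48, 0.76, 0.59)
```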

The results indicate that the developed method is able to detect correct associations<br />

<strong>of</strong> residue, protein, and organism.<br />

5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins<br />

The developed text mining system annotates an identified protein residue in a text passage<br />

with references to its source protein and its hosting organism. Therefore, each MEDLINE<br />



Figure 5.3: Identified protein residues in MEDLINE. From a MEDLINE extraction, a subset <strong>of</strong> 2,884<br />

Uniprot proteins were identified, with cross-references to 14,007 PDB entries, and a corresponding set <strong>of</strong><br />

18,427 MEDLINE records. In comparison, the citation set <strong>of</strong> the corresponding entries in UniProtKB<br />

has only 4,652 PMIDs. Only 657 out <strong>of</strong> 18,427 PMIDs are cross-validated by UniProtKB data. Dashed<br />

line = MEDLINE based extraction; solid line = database values.<br />

record with an identified protein residue can be used to update the citation set <strong>of</strong> a<br />

corresponding protein entry in UniProtKB, or any other hyperlinked database, e.g. PDB<br />

(UniProtKB/PDB). In this study, the whole MEDLINE was scanned with the developed<br />

protein residue identification method, and the determined set <strong>of</strong> PMIDs was compared with the<br />

citation sets in UniProtKB/PDB (cf. figure 5.3; for an overview <strong>of</strong> databank hyperlinks<br />

and citation references cf. section 2.1).<br />

The protein residue identification system found a total <strong>of</strong> 40,750 MEDLINE records<br />

where residues were associated with co-mentioned proteins. The unique count <strong>of</strong> Uniprot<br />

proteins within the entity triplet associations is 9,354, where 2,884 out <strong>of</strong> 9,364 proteins<br />

have hyperlinks to 14,007 PDB entries. Corresponding to these 2,884 Uniprot proteins<br />



is the set <strong>of</strong> 18,427 out <strong>of</strong> 40,750 PMIDs. In comparison, UniProtKB indexes for these<br />

2,884 Uniprot entries a set <strong>of</strong> 4,652 PMIDs. A set analysis determined that the two datasets<br />

share 657 PMIDs. This means that only 3.6 per cent <strong>of</strong> the identified PMIDs<br />

can be cross-validated with UniProtKB (cf. figure 5.4).<br />

The low rediscovery rate can be explained by the fact that most <strong>of</strong> the <strong>annotation</strong>s<br />

in UniProtKB are derived from sections only available in full-text articles. Although the<br />

analysis was based on MEDLINE, the extraction was already able to find a large number<br />

<strong>of</strong> relevant abstract texts for citation expansion. With a precision <strong>of</strong> 0.82 (determined<br />

by gold standard evaluation), the estimated number <strong>of</strong> true positives in the PMID set is<br />

15,110. Compared with the 4,652 citations in the database for the 2,884 Uniprot proteins,<br />

and taking into account the 657 re-discovered abstract texts, the MEDLINE<br />

analysis expands the citation set roughly threefold.<br />
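The arithmetic behind these figures can be retraced as follows. The sketch assumes that the expansion factor is obtained by subtracting the re-discovered PMIDs from the estimated true positives and dividing by the size of the existing citation set; this reading is an interpretation of the text, not a formula stated in it.

```python
extracted_pmids = 18_427   # PMIDs found for the 2,884 Uniprot/PDB proteins
uniprot_pmids = 4_652      # PMIDs already cited in UniProtKB for these proteins
shared_pmids = 657         # PMIDs present in both sets
precision = 0.82           # from the gold standard evaluation

overlap = shared_pmids / extracted_pmids              # cross-validated share
estimated_true = round(extracted_pmids * precision)   # expected true positives
new_citations = estimated_true - shared_pmids         # not yet in UniProtKB (assumed reading)
expansion = new_citations / uniprot_pmids

print(f"{overlap:.1%}", estimated_true, new_citations, f"{expansion:.1f}")
# 3.6% 15110 14453 3.1
```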

In conclusion, the presented text mining system can be used to determine relevant<br />

literature data for the update <strong>of</strong> the citation sets in UniProtKB/PDB.<br />

The extracted abstract texts for those proteins provide the basis for <strong>functional</strong> <strong>annotation</strong><br />

extraction.<br />

5.5 Discussion<br />

The presented text mining method identifies protein residues in biomedical texts. The<br />

first step is the recognition <strong>of</strong> the entities residue, protein, and organism in texts. The<br />

language expressions <strong>of</strong> all three biological entities are quite different. A residue entity,<br />

for example, is generally mentioned in the text by its three-letter abbreviation form +<br />

protein sequence position. The regular expression patterns were designed specifically for<br />

these and other derived expressions, which explains the high precision and recall <strong>of</strong> the<br />

residue entity recognition system. However, a residue can also be expressed by its one-letter<br />

abbreviation or syntactical form.<br />

While the latter expression is considered and<br />

implemented in this thesis, it was suggested that these expressions represent only a small<br />



Figure 5.4: Cross-validation <strong>of</strong> citations from identified protein residues with UniProtKB/PDB. For a<br />

subset <strong>of</strong> UniProtKB/PDB proteins (i.e. proteins with UID and PDBID) the determined PMIDs can be<br />

cross-validated with the relevant citation set from UniProtKB. Dashed line = the number <strong>of</strong> common<br />

PMIDs; uni = UniProtKB/PDB based citations; med = protein residue identification based citations;<br />

comm = common set <strong>of</strong> citations between uni and med.<br />



fraction [LHC07] in biomedical texts.<br />

The implementation <strong>of</strong> one-letter abbreviation<br />

would increase the recall, but the method would become less precise. For example the<br />

matched string ”C4” could be a nucleotide, a gene, an atom in a chemical compound, or<br />

any other acronym.<br />
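A pattern of the kind described above can be sketched as follows. The expression is an illustration only and is not the pattern set used in this work: it covers capitalised three-letter codes and full amino acid names followed by a sequence position, and it deliberately ignores one-letter codes. All names in the code are hypothetical.

```python
import re

THREE_TO_FULL = {
    "Ala": "alanine", "Arg": "arginine", "Asn": "asparagine", "Asp": "aspartate",
    "Cys": "cysteine", "Gln": "glutamine", "Glu": "glutamate", "Gly": "glycine",
    "His": "histidine", "Ile": "isoleucine", "Leu": "leucine", "Lys": "lysine",
    "Met": "methionine", "Phe": "phenylalanine", "Pro": "proline", "Ser": "serine",
    "Thr": "threonine", "Trp": "tryptophan", "Tyr": "tyrosine", "Val": "valine",
}
FULL_TO_THREE = {full: code for code, full in THREE_TO_FULL.items()}

codes = "|".join(THREE_TO_FULL)            # case-sensitive, e.g. "Ser"
full = "|".join(THREE_TO_FULL.values())    # case-insensitive, e.g. "serine"
RESIDUE = re.compile(rf"\b(?:({codes})|((?i:{full})))[ -]?(\d+)\b")


def find_residues(text):
    """Return (three-letter code, position) for every residue mention in text."""
    hits = []
    for m in RESIDUE.finditer(text):
        code = m.group(1) or FULL_TO_THREE[m.group(2).lower()]
        hits.append((code, int(m.group(3))))
    return hits


print(find_residues("by phosphorylation of serine 77"))   # [('Ser', 77)]
print(find_residues("the C4 atom"))                       # []
```

Keeping the three-letter codes case-sensitive is a design choice of this sketch: it avoids matching ordinary words such as "his" or "met" when they happen to be followed by a number.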

The identification <strong>of</strong> protein terminologies in text is a great challenge in the biomedical<br />

text mining community. This is based on the fact that protein names are not standardised,<br />

and the usage <strong>of</strong> many alternative names is common, e.g. abbreviations, pet names,<br />

or synonymous names. In addition, there is no guideline in the construction <strong>of</strong> names,<br />

therefore a name can be short or long in respect <strong>of</strong> word counts, e.g. ”MAP kinase kinase”<br />

and ”MAP kinase kinase kinase”.<br />

The developed protein entity recognition system is<br />

based on a lookup <strong>of</strong> names and synonyms in a dictionary. Because the entries are finite,<br />

syntactical variants <strong>of</strong> protein names cannot be detected, if they are not covered by the<br />

dictionary. This explains the low recall <strong>of</strong> this ER system. In contrast, sub-matching <strong>of</strong><br />

a whole protein name or the tagging <strong>of</strong> ambiguous protein names reduces the precision<br />

<strong>of</strong> the method. For example, ”SNF” could be a protein in yeast or the funding agency<br />

”Swiss National Science Foundation”.<br />
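The behaviour described above can be illustrated with a greedy longest-match dictionary lookup. This is a minimal sketch and not the implemented system; the dictionary content and the identifiers `ID_A`, `ID_B`, and `ID_C` are placeholders.

```python
def tag_proteins(tokens, dictionary):
    """Greedy longest-match lookup of token n-grams in a name dictionary."""
    max_len = max(len(name.split()) for name in dictionary)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                hits.append((candidate, dictionary[candidate]))
                i += n
                break
        else:
            i += 1   # no name starts at this token
    return hits


# Placeholder dictionary; the identifiers are not real database accessions.
names = {
    "MAP kinase kinase": "ID_A",
    "MAP kinase kinase kinase": "ID_B",
    "SNF": "ID_C",
}

print(tag_proteins("the MAP kinase kinase kinase was purified".split(), names))
# [('MAP kinase kinase kinase', 'ID_B')]
print(tag_proteins("MAP kinase kinase kinases were purified".split(), names))
# [('MAP kinase kinase', 'ID_A')]   sub-match caused by an uncovered plural variant
print(tag_proteins("funded by the SNF".split(), names))
# [('SNF', 'ID_C')]                 ambiguous acronym tagged as a protein
```

The second and third calls reproduce the two error types discussed above: a variant that is not in the dictionary is reduced to a shorter, wrong entry, and an ambiguous acronym is tagged regardless of its context.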

The principal method for organism entity recognition is the same as protein name<br />

identification in this investigation. A list <strong>of</strong> terms from NCBI taxonomy was utilised to<br />

generate an organism name dictionary. Although the developed method is the same as<br />

protein entity recognition, the system yielded a higher performance. One explanation is<br />

that the dictionary contains predominantly unambiguous terms. However, some<br />

ambiguous terms can also be found, e.g. ”RAT” could be a protein, an organism, or a<br />

method. To my knowledge, dedicated research on organism entity recognition has not<br />

been published, nor is a gold standard for performance evaluation available.<br />

Based on the identified residue, protein, and organism entities in a text, the developed system<br />

identifies semantic relations between these biological entities. The approach is based<br />

on the idea <strong>of</strong> reusing explicitly stated relations contained in UniProtKB. The correct<br />



association between protein and residue relies on several factors: the ER performance,<br />

the correct protein sequence retrieval, which is dependent on the correct organism-protein<br />

association, and the correct alignment <strong>of</strong> a residue with a protein sequence at the specified<br />

position. On one hand, a low recall in residue-protein association can be explained by<br />

a missing protein sequence variant in the repository. On the other hand, an incorrect<br />

protein-organism association leads to the retrieval <strong>of</strong> a wrong protein sequence. Another<br />

consideration is that the protein sequence in the database could deviate from the author’s<br />

data, because either side may have used different indexing rules. Conversely, the<br />

true positive rate can be distorted for the same reason: a non-corresponding residue<br />

sequence index may match a protein sequence by chance. One solution to<br />

this specific problem is to consider all residues <strong>of</strong> the same protein in the sequence alignment.<br />

However, this method may only be applicable for full-text analysis, as abstract<br />

texts rarely mention multiple residues <strong>of</strong> the same protein.<br />
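The position check and the proposed use of several residues can be sketched as follows. The toy sequence and the function name are hypothetical, and positions are assumed to be 1-based as in the residue mentions.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def residues_match(sequence, residues):
    """True if every (three-letter code, 1-based position) agrees with the sequence."""
    for code, position in residues:
        if not 1 <= position <= len(sequence):
            return False
        if sequence[position - 1] != THREE_TO_ONE[code]:
            return False
    return True


toy_sequence = "MSTKAS"   # hypothetical sequence, for illustration only

print(residues_match(toy_sequence, [("Ser", 6)]))               # True, possibly by chance
print(residues_match(toy_sequence, [("Ser", 6), ("Lys", 5)]))   # False, position 5 is Ala
print(residues_match(toy_sequence, [("Ser", 2), ("Lys", 4)]))   # True, both positions agree
```

Requiring all residues of the same protein to agree with the sequence makes a chance match less likely, which is the motivation for the solution proposed above.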

The evaluation <strong>of</strong> the entity recognition and the association detection systems was<br />

done by a manual analysis on the gold standard corpus, and by an automatic cross-validation<br />

study. This has the following reasons. Protein <strong>annotation</strong>s in UniProtKB are<br />

primarily derived from manual information extraction from full-text articles. Although a<br />

considerable amount <strong>of</strong> this information may not be present in MEDLINE, the combination<br />

<strong>of</strong> X+PMID, where X is either UID or TID, can be used to estimate the information<br />

extraction performance. However, the false positive rate in this cross-validation study<br />

cannot be determined, because the knowledge base is incomplete,<br />

even for the indexed citations. Therefore, manual evaluation on a gold standard test set<br />

has the advantage that both the false positive and the false negative rate can be studied.<br />

An identified protein residue is annotated with references to its source protein (Uniprot<br />

identifier) and the hosting organism (NCBI Taxonomy identifier). Based on these <strong>annotation</strong>s<br />

a link can be made between MEDLINE and biological knowledge bases.<br />

One<br />

immediate application is to scan MEDLINE for protein residues and use the Uniprot<br />



identifier <strong>annotation</strong>s in combination with the MEDLINE identifier (or PubMed identifier;<br />

PMID) to update the citation sets <strong>of</strong> corresponding Uniprot entries. The significance<br />

<strong>of</strong> this approach was studied by automatic cross-validation analysis. The results<br />

indicate that only a small proportion <strong>of</strong> Uniprot proteins can be found and associated with<br />

residues from MEDLINE analysis, and that the identified set <strong>of</strong> PMIDs has only a small overlap<br />

with the corresponding citation sets. One explanation is that <strong>annotation</strong>s were extracted<br />

from full-text articles, where the same information is not present in the abstract texts;<br />

they represent the true negative fraction in the sense that the information cannot be identified<br />

from abstract sections. Another explanation is based on the fact that curators provide<br />

only a list <strong>of</strong> relevant citations from a batch <strong>of</strong> processed biomedical articles. In other<br />

words, the information <strong>of</strong> irrelevant citations (false positives) or the complete list <strong>of</strong> true<br />

positives <strong>of</strong> citations, from the sample <strong>of</strong> reviewed biomedical articles, is not available in<br />

UniProtKB which would have allowed a more precise evaluation.<br />

5.6 Conclusion<br />

The developed text mining solution identifies protein residues in text and annotates them<br />

with references to UniProtKB and NCBI Taxonomy. Based on these references, a link<br />

between MEDLINE and UniProtKB is created. Although the identification <strong>of</strong> protein<br />

residues in MEDLINE does not necessarily mean that <strong>functional</strong> <strong>annotation</strong>s are present<br />

in abstract texts, the analysis is a prerequisite for the mining <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>.<br />

The extraction <strong>of</strong> contextual features as <strong>annotation</strong>s <strong>of</strong> a protein residue is the topic <strong>of</strong> the<br />

following chapter.<br />



Chapter 6<br />

Information extraction from the<br />

context <strong>of</strong> a residue in text<br />

In the previous chapter, I have introduced a method for the identification <strong>of</strong> protein<br />

residues in biomedical texts. The objective, in this chapter, is to extract textual features<br />

from the context <strong>of</strong> protein residues that can be used as <strong>functional</strong> <strong>annotation</strong>. Because a<br />

terminological resource is not utilised, the developed method can discover new information<br />

from text.<br />

The extracted contextual features are then enriched with semantic labels<br />

according to a categorisation scheme. The design <strong>of</strong> this scheme was data-driven, and<br />

contains concepts <strong>of</strong> biological interests. The overall result <strong>of</strong> this text mining solution<br />

is the <strong>annotation</strong> <strong>of</strong> protein residues with text segments that are classified by a set <strong>of</strong><br />

biological categories.<br />

6.1 Algorithms<br />

The developed information extraction system can be divided into two parts: extraction<br />

<strong>of</strong> contextual features associated with protein residues, and classification <strong>of</strong> the extracted<br />

textual features. Figure 6.1 illustrates the procedures involved in the developed information<br />

extraction system.<br />



Figure 6.1: Overview <strong>of</strong> processes and evaluation methods <strong>of</strong> the developed contextual feature extraction<br />

system.<br />



6.1.1 Extraction <strong>of</strong> contextual features<br />

Theory<br />

Finding <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> protein residues in biomedical text.<br />

In this<br />

study, several assumptions have been made for the extraction <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s<br />

from biomedical texts, which are explained in the following. The first assumption is that<br />

noun phrases in a text are semantically rich in the sense that they are able to represent<br />

a subject content (keyword) [JK95]. Consequently, they are good candidates <strong>of</strong> textual<br />

features for the <strong>functional</strong> <strong>annotation</strong> <strong>of</strong> protein residues.<br />

The second assumption is that a biological function <strong>of</strong> a protein residue can be found<br />

as verbal or nominal expression in natural language. In other words, a syntactical relation<br />

between a residue and a term can capture their semantic relation. Therefore, a syntactical<br />

analysis <strong>of</strong> a sentence enables the identification <strong>of</strong> an explicitly stated biological function.<br />

For example, from the phrase<br />

”A inhibits B by phosphorylation <strong>of</strong> C”,<br />

the relations<br />

A—inhibits—B-by-phosphorylation-<strong>of</strong>-C<br />

A—inhibits—B-by-phosphorylation<br />

A—inhibits—B<br />

UNK—phosphorylate—C,<br />

can be identified. Although the identification <strong>of</strong> a residue-keyword association can be<br />

attempted with co-occurrence analysis, the target is to extract reliable associations with<br />

contextual information on their association. In other words the type <strong>of</strong> association expressed<br />

by a verb or by a preposition, and the context expressed by a prepositional phrase,<br />

are important bits <strong>of</strong> information that represent a justifiable <strong>functional</strong> <strong>annotation</strong>. A<br />



discussion on semantic relation and syntactical relation extraction can be found in section<br />

2.3.2.<br />

Generally, to identify descriptions <strong>of</strong> biological function in text, the terminologies from<br />

GO can be reused. However, this ontology is not specialised in protein residues;<br />

for example, the term ”<strong>active</strong> site” does not even appear as a stand-alone term in the<br />

repository. Generally, descriptions <strong>of</strong> protein function refer to a higher level <strong>of</strong> biological<br />

function, e.g. metabolomics or cell signalling. In contrast, the <strong>annotation</strong> <strong>of</strong> protein<br />

residues requires a different set <strong>of</strong> terminologies that describe molecular interactions or<br />

chemical reactions.<br />

Because a suitable terminological resource is not available, the extraction <strong>of</strong> syntactical<br />

relation focuses on semantic relations with the elements: residue entity and contextual<br />

feature (keyword). The following is a demonstration <strong>of</strong> how a description <strong>of</strong> function can<br />

be identified from a parsed sentence. Given the example sentence from MEDLINE<br />

”Parathyroid hormone inhibits renal phosphate transport by phosphorylation<br />

<strong>of</strong> serine 77 <strong>of</strong> sodium-hydrogen exchanger regulatory factor-1.”<br />

(PMID:17975671),<br />

a syntactical analysis produces the following phrase structure representation<br />



[Parathyroid hormone]/NP<br />

[inhibits]/V<br />

[renal phosphate transport]/NP<br />

[by]/P<br />

[phosphorylation]/NP<br />

[<strong>of</strong>]/P<br />

[serine 77]/NP<br />

[<strong>of</strong>]/P<br />

[sodium-hydrogen exchanger regulatory factor-1]/NP,<br />

where NP is a noun phrase, P a preposition, and V a verb. From this parsed sentence,<br />

the following semantic relations can be determined:<br />

Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation-<strong>of</strong>-serine 77<br />

Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation<br />

Parathyroid hormone—inhibits—renal phosphate transport<br />

UNK—phosphorylate—serine 77.<br />
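The derivation of these relations from the phrase structure can be sketched in a few lines. The sketch assumes that the chunks after the object strictly alternate between preposition and noun phrase, uses a hypothetical one-entry lookup for nominalised verbs, and omits the trailing phrase "of sodium-hydrogen exchanger regulatory factor-1", as the listing above does. Relations are returned as triples instead of dash-joined strings.

```python
# Hypothetical lookup from nominalisation to verb; not a resource used in this work.
NOMINALISATIONS = {"phosphorylation": "phosphorylate"}


def extract_relations(chunks):
    """Derive (subject, verb, object) triples from a list of (text, tag) chunks."""
    v = next(i for i, (_, tag) in enumerate(chunks) if tag == "V")
    subject, verb, obj = chunks[v - 1][0], chunks[v][0], chunks[v + 1][0]
    relations = [(subject, verb, obj)]
    # Pair up the trailing chunks as (preposition, noun phrase).
    tail = [text for text, _ in chunks[v + 2:]]
    pairs = list(zip(tail[0::2], tail[1::2]))
    for k, (prep, noun) in enumerate(pairs):
        obj = f"{obj}-{prep}-{noun}"          # each phrase specifies the object further
        relations.append((subject, verb, obj))
        # A nominalised verb followed by "of NP" implies a relation with unknown agent.
        if k > 0 and prep == "of" and pairs[k - 1][1] in NOMINALISATIONS:
            relations.append(("UNK", NOMINALISATIONS[pairs[k - 1][1]], noun))
    return relations


chunks = [
    ("Parathyroid hormone", "NP"), ("inhibits", "V"),
    ("renal phosphate transport", "NP"), ("by", "P"),
    ("phosphorylation", "NP"), ("of", "P"), ("serine 77", "NP"),
]
for relation in extract_relations(chunks):
    print(relation)
```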

In the next section, a template for storing the extracted relation information is discussed.<br />

Semantic representation <strong>of</strong> extracted relations.<br />

The objective <strong>of</strong> syntactical relation<br />

extraction is to identify biological relations in a sentence, i.e. a semantic relation<br />

between a residue entity and a terminology. While the result is a set <strong>of</strong> syntactical relations<br />

with different contextual specification (cf. example in previous section), a suitable<br />



data collation method is necessary to avoid data redundancy. That is, the set <strong>of</strong> determined<br />

relations within a given syntactic frame contains relations that are specifications<br />

<strong>of</strong> other ones. For example, the relation<br />

A—inhibits—B-by-phosphorylation,<br />

is a specification <strong>of</strong> the relation<br />

A—inhibits—B.<br />

Here, the predicate-argument structure (PAS) is proposed as a semantic representation<br />

<strong>of</strong> extracted syntactical relations. A PAS is a template for information extraction,<br />

where the predicate and the arguments represent the slots to be filled. In this study, the<br />

predicate (pred) <strong>of</strong> a PAS is defined as the verb, while the arguments <strong>of</strong> the verb are<br />

the numerically labelled arguments arg1 and arg2, or even higher numerically labelled<br />

arguments. The arg1 label is assigned to arguments, which are understood as agents,<br />

causers, or experiencers, i.e. the semantic subject. Conversely, the arg2 label is usually<br />

assigned to the patient argument, i.e. the argument which undergoes the change <strong>of</strong> state<br />

or is being affected by the action.<br />

The transformation <strong>of</strong> the extracted relations into PAS data does not consider the<br />

analysis <strong>of</strong> the semantic role <strong>of</strong> the verb arguments, i.e. argument modifiers, such as<br />

location, time, cause, etc. Noun phrases <strong>of</strong> the extracted relations can have prepositional<br />

attachments, and the prepositions are <strong>of</strong>ten indicators <strong>of</strong> thematic roles <strong>of</strong> the verb arguments.<br />

Therefore, prepositional phrases are listed as modifiers <strong>of</strong> arguments with the<br />

following label notation: main argument label + preposition, e.g. arg1-<strong>of</strong> and arg2-by.<br />

The following illustrates the transformation <strong>of</strong> relations into a PAS for the previous<br />

example:<br />



pred = inhibit<br />

arg1 = Parathyroid hormone<br />

arg2 = renal phosphate transport<br />

arg2-by = phosphorylation<br />

arg2-<strong>of</strong> = serine 77,<br />

which corresponds to the following verb frame set:<br />

inhibit sub-arg1 obj-arg2 P by-arg2 P <strong>of</strong>-arg2.<br />
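Filling the PAS template can be sketched as a small function over the chunked sentence. This is an illustration under stated assumptions, not the implemented transformation: the chunks after the object are assumed to alternate between preposition and noun phrase, the verb lemma comes from a hypothetical lookup table, and the input stops at "serine 77" as in the example above.

```python
def to_pas(chunks, lemmas):
    """Fill a predicate-argument structure from (text, tag) chunks."""
    v = next(i for i, (_, tag) in enumerate(chunks) if tag == "V")
    pas = {
        "pred": lemmas.get(chunks[v][0], chunks[v][0]),   # verb lemma if known
        "arg1": chunks[v - 1][0],                         # semantic subject
        "arg2": chunks[v + 1][0],                         # affected argument
    }
    tail = chunks[v + 2:]
    # Prepositional phrases become modifiers labelled "arg2-<preposition>".
    for (prep, _), (noun, _) in zip(tail[0::2], tail[1::2]):
        pas[f"arg2-{prep}"] = noun
    return pas


chunks = [
    ("Parathyroid hormone", "NP"), ("inhibits", "V"),
    ("renal phosphate transport", "NP"), ("by", "P"),
    ("phosphorylation", "NP"), ("of", "P"), ("serine 77", "NP"),
]
pas = to_pas(chunks, lemmas={"inhibits": "inhibit"})   # hypothetical lemma table
for slot, value in pas.items():
    print(slot, "=", value)
```

A sentence with two phrases introduced by the same preposition would overwrite a slot in this sketch; how such cases are labelled is not covered here.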

Note that the defined PAS does not conform to the PAS schemes <strong>of</strong> some propositional<br />

banks, e.g. PropBank or PASBio. For example, for the verb ”inhibit” PropBank lists the<br />

following frame set:<br />

inhibit sub-ARG0 obj-ARG1<br />

inhibit sub-ARG0 S-ARG1,<br />

while additional arguments are not defined (note that the definition <strong>of</strong> ARG0 in PropBank<br />

is equivalent to arg1 in this definition, and ARG1 corresponds to arg2). Although<br />

verb frame sets from publicly available propositional banks could be considered in this study,<br />

the set <strong>of</strong> listed verbs has a low coverage <strong>of</strong> the set <strong>of</strong> verbs co-occurring with residue<br />

mentions in MEDLINE. The low coverage and the non-domain specific verb frame sets<br />

are the main reasons why these resources were not reused.<br />

Implementation<br />

The extraction <strong>of</strong> contextual features is based on a syntactical analysis <strong>of</strong> natural language<br />

sentences. Two approaches were developed in this work and compared in the performance<br />



evaluation study: shallow parser based relation extraction, and full parser based relation<br />

extraction.<br />

Shallow parser based relation extraction.<br />

The first approach was to develop a<br />

shallow parser, which aims to find the boundaries <strong>of</strong> major constituents in a sentence,<br />

such as noun phrases. The design is based on heuristics and the idea <strong>of</strong> finding general<br />

relations between closed-class English words [LCM03]. The reported parser finds verbal<br />

relations between noun phrases, and prepositional relations <strong>of</strong> a set <strong>of</strong> the most frequent<br />

prepositions, i.e. ”<strong>of</strong>”, ”in”, and ”by”. Here, the parser is implemented as a general<br />

relation extraction method, where the list <strong>of</strong> prepositions is not limited to the three<br />

mentioned ones. The purpose is to find more contextual features, and thereby discover<br />

more information.<br />

Initially, an abstract text was split into sentences, and then annotated with part-<strong>of</strong>-speech<br />

(POS) tags using the CISTAGGER. The tagger was trained on the CISLEX<br />

lexical resource that contains a rich terminological set <strong>of</strong> the biomedical domain [Gue96].<br />

Based on a rule set and the POS information the developed shallow parser identified noun<br />

phrases, verb groups, verb phrases, and prepositional phrases for analysed sentences:<br />

NP = Det? (Adj|Adv|N)* N<br />

PP = P NP<br />

VG = (Adv|Aux|V|InfTo)* V<br />

VP = VG NP PP*.<br />

N is a noun, Det a determiner, Adj an adjective, Adv an adverb, P a preposition, PP a<br />

prepositional phrase, VP a verb phrase, and VG a verb group. Note that the grammar<br />

does not consider coordinating conjunctions, e.g. with ”and”, ”or” and ”,”. The grammar<br />

can be easily extended to capture conjunctions by<br />



NPx = NP (CC NP)*,<br />

where<br />

CC = (”and” | ”or” | ”,”){1,2}.<br />

However, the pattern would then also find false positives as illustrated in the following<br />

example. The sentence<br />

”Highly conserved phosphopantothenate binding residues include Asn59,<br />

Ala179, Ala180, and Asp183 from one monomer and Arg55’ from the<br />

adjacent monomer.” (PMID:12906824),<br />

contains the noun phrases<br />

NP1 = ”Asn59, Ala179, Ala180, and Asp183 from one monomer”<br />

NP2 = ”Arg55’ from the adjacent monomer”.<br />

The extended patterns would have extracted a single noun phrase, from which the identification<br />

<strong>of</strong> the correct post-nominal prepositional phrase attachment cannot be done<br />

easily:<br />

NPx =<br />

”Asn59, Ala179, Ala180, and Asp183 from one monomer and<br />

Arg55’ from the adjacent monomer”.<br />

Based on the determined phrase structure, the parser then extracts verbal relations <strong>of</strong><br />

noun phrases or prepositional phrases. A condition <strong>of</strong> the extraction is that at least one<br />

relation element must contain one or more residue mentions:<br />



REL = NP PP* VP.<br />

The extracted relation is then transformed to fill the slots <strong>of</strong> the predefined PAS template.<br />
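The rule set can be expressed as regular expressions over the sequence of POS tags. The following sketch is one possible encoding and not the implemented parser: the tags are assigned by hand instead of by the CISTAGGER, a residue mention is assumed to have been merged into a single token, and VP is written out inside REL so that each part can be named.

```python
import re


def t(tag):
    """Pattern for one POS tag followed by its separating space."""
    return rf"(?:\b{tag} )"


NP = rf"(?:{t('Det')}?(?:{t('Adj')}|{t('Adv')}|{t('N')})*{t('N')})"
PP = rf"(?:{t('P')}{NP})"
VG = rf"(?:(?:{t('Adv')}|{t('Aux')}|{t('V')}|{t('InfTo')})*{t('V')})"
# REL = NP PP* VP, with VP = VG NP PP*.
REL = re.compile(
    rf"(?P<subj>{NP})(?P<subj_pp>{PP}*)(?P<vg>{VG})(?P<obj>{NP})(?P<obj_pp>{PP}*)"
)


def extract_relation(tokens, tags, residues):
    """Return the first NP PP* VP relation that contains a residue mention."""
    tag_string = "".join(f"{tag} " for tag in tags)

    def words(start, end):
        # Character offsets in tag_string map to token indices via space counts.
        first = tag_string[:start].count(" ")
        last = tag_string[:end].count(" ")
        return " ".join(tokens[first:last])

    for m in REL.finditer(tag_string):
        if any(residue in words(m.start(), m.end()) for residue in residues):
            return {name: words(*m.span(name)) for name in REL.groupindex}
    return None


tokens = ["Parathyroid", "hormone", "inhibits", "renal", "phosphate",
          "transport", "by", "phosphorylation", "of", "serine 77"]
tags = ["N", "N", "V", "Adj", "N", "N", "P", "N", "P", "N"]   # hand-assigned tags
print(extract_relation(tokens, tags, residues=["serine 77"]))
```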

Full parser based relation extraction.<br />

The second approach in contextual feature<br />

extraction utilises the full parser ENJU [MT05] (version 2.3), which generates a so-called<br />

head-driven parse tree from a sentence. The advantage <strong>of</strong> this parser is that a parsing<br />

model adapted to biomedical text is utilised. This parser generates predicate-argument<br />

relations between words.<br />

Because the generated output contains a lot <strong>of</strong> information,<br />

different interpretations are possible. In this study, a wrapper was developed that converts<br />

the parser’s output into the presented PAS data format.<br />

The assumption is that by<br />

following the direct links <strong>of</strong> a verb to its arguments in the tree, and then collecting all the<br />

sub-branches <strong>of</strong> each argument, the phrase structure <strong>of</strong> a verb argument can be found.<br />

The identified NP PP* VP structures are then decomposed to fill the PAS template.<br />
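The traversal described above can be illustrated on a toy tree. The `Node` class below is a hypothetical data structure chosen for this sketch; it does not reproduce the output format of ENJU or the developed wrapper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    index: int                                      # position in the sentence
    args: dict = field(default_factory=dict)        # role -> Node (direct argument links)
    modifiers: list = field(default_factory=list)   # dependent sub-branches


def collect(node):
    """Gather a node and all of its sub-branches in sentence order."""
    nodes = [node]
    for child in list(node.args.values()) + node.modifiers:
        nodes.extend(collect(child))
    return sorted(nodes, key=lambda n: n.index)


def verb_arguments(verb):
    """Follow the verb's direct links and spell out each argument phrase."""
    return {role: " ".join(n.word for n in collect(arg))
            for role, arg in verb.args.items()}


hormone = Node("hormone", 1, modifiers=[Node("Parathyroid", 0)])
transport = Node("transport", 5, modifiers=[Node("renal", 3), Node("phosphate", 4)])
inhibits = Node("inhibits", 2, args={"arg1": hormone, "arg2": transport})

print(verb_arguments(inhibits))
# {'arg1': 'Parathyroid hormone', 'arg2': 'renal phosphate transport'}
```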

6.1.2 Categorisation <strong>of</strong> contextual features<br />

Theory<br />

A PAS captures a verb frame within a text sentence, where the arguments may represent a<br />

subject content. In order to evaluate the relevance <strong>of</strong> these arguments a semantic interpretation<br />

is needed. Here, a classification method was developed that automatically assigns<br />

semantic labels to the arguments <strong>of</strong> a PAS. For this task, the categories have to be defined<br />

as suitable labels for information interpretation. Although an ontological model <strong>of</strong> protein<br />

residue function is not available, there are two approaches to this problem. The first is<br />

to adopt <strong>annotation</strong> schemes from various protein databases, e.g. the UniProtKB. This<br />

represents a top-down approach. One motivation for reusing the categorisation scheme <strong>of</strong><br />

UniProtKB is, that classified information with this scheme can be directly used to update<br />



the relevant fields in the database.<br />

Alternatively, a bottom-up approach can propose new categories. In this study,<br />

text segments from MEDLINE were analysed to determine whether they represent suitable <strong>functional</strong><br />

<strong>annotation</strong>s for residues. The result is an overview <strong>of</strong> information distribution in MEDLINE,<br />

which has led to the proposal <strong>of</strong> a categorisation scheme. The defined categories<br />

<strong>of</strong> both schemes are compared in table 6.1. Both categorisation schemes reflect concepts<br />

<strong>of</strong> biological interest. However, the bottom-up approach has the advantage that proposed<br />

categories are data-driven, while in a top-down approach examples <strong>of</strong> listed categories may<br />

not be present in natural language text, or other categories are missing in the scheme.<br />

The assignment <strong>of</strong> categories to contextual features is based on the endogenous classification<br />

approach [Cer00]. In contrast, the exogenous, i.e. corpus-based, approach requires<br />

large amounts <strong>of</strong> contextual cues, which are difficult to obtain. According to the author,<br />

the endogenous approach produces results more reliably even under conditions <strong>of</strong><br />

sparse data.<br />

From a reference set <strong>of</strong> terms with manually assigned labels according to a categorisation<br />

scheme, the algorithm computes the mutual information <strong>of</strong> the lexical constituents <strong>of</strong><br />

terms and their assigned categories. These scores are then used to calculate and select the<br />

highest scoring association <strong>of</strong> a term and a category. The algorithm was re-implemented<br />

and used in this study.<br />
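The principle can be sketched as follows. The exact scoring of [Cer00] is not reproduced here: the sketch assumes pointwise mutual information between a word and a category, summed over the words of a term, and the reference terms are invented examples labelled with three categories of the proposed scheme.

```python
import math
from collections import Counter


def train(labelled_terms):
    """Return PMI scores for (word, category) pairs from (term, category) examples."""
    pair, word, cat = Counter(), Counter(), Counter()
    for term, category in labelled_terms:
        for w in term.lower().split():
            pair[w, category] += 1
            word[w] += 1
            cat[category] += 1
    total = sum(word.values())
    # PMI = log2( P(w, c) / (P(w) * P(c)) ), estimated from token counts.
    return {(w, c): math.log2(n * total / (word[w] * cat[c]))
            for (w, c), n in pair.items()}


def classify(term, scores, categories):
    """Pick the category with the highest summed PMI over the term's words."""
    def score(category):
        return sum(scores.get((w, category), 0.0) for w in term.lower().split())
    return max(categories, key=score)


# Invented reference set; category labels follow the proposed scheme.
reference = [
    ("catalytic domain", "STR COMP"), ("transmembrane domain", "STR COMP"),
    ("serine phosphorylation", "CHEM MOD"), ("tyrosine phosphorylation", "CHEM MOD"),
    ("hydrogen bond", "BINDING"), ("disulfide bond", "BINDING"),
]
categories = ["STR COMP", "CHEM MOD", "BINDING"]
scores = train(reference)

print(classify("kinase domain", scores, categories))               # STR COMP
print(classify("threonine phosphorylation", scores, categories))   # CHEM MOD
```

Because only the words of the term itself are used, an unseen term can still be classified as long as one of its constituents occurs in the reference set, which is why the approach copes with sparse data.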

Implementation<br />

The semantic interpretation <strong>of</strong> contextual features, which are the arguments <strong>of</strong> the extracted<br />

PAS, relies on the endogenous classification approach described by [Cer00]. The<br />

method was re-implemented in this study. The algorithm relies only on the mutual information<br />

<strong>of</strong> the lexical constituents <strong>of</strong> terms and their assigned categories.<br />

During the training phase, lexical constituents <strong>of</strong> multi-word terms were extracted<br />

from a labelled reference set. They represent the features <strong>of</strong> the predefined categories.<br />



MAN FEAT<br />

Category Definition Category Definition<br />

STR COMP<br />

Structure component. Class denoting concepts that<br />

represent pieces and parts <strong>of</strong> the protein structure.<br />

DOMAIN Extent <strong>of</strong> a domain, which is defined as a specific combination <strong>of</strong> secondary<br />

structures organised into a characteristic three-dimensional structure <strong>of</strong> fold.<br />

MOTIF Short (up to 20 amino acids) sequence motif <strong>of</strong> biological interest.<br />

TOPO DOM Topological domain.<br />

CHAIN Extent <strong>of</strong> a polypeptide chain in the mature protein.<br />

TRANSMEM Extent <strong>of</strong> a transmembrane region.<br />

COILED Extent <strong>of</strong> a coiled-coil region.<br />

CHEM MOD<br />

Chemical modification. Class denoting changes to<br />

the protein sequence and the chemical composition.<br />

VARIANT Authors report that sequence variants exist.<br />

MOD RES Posttranslational modification <strong>of</strong> a residue.<br />

PEPTIDE Extent <strong>of</strong> a released <strong>active</strong> peptide.<br />

VAR SEQ Description <strong>of</strong> sequence variants produced by alternative splicing, alternative<br />

promoter usage, alternative initiation and ribosomal frameshifting.<br />

LIPID Covalent binding <strong>of</strong> a lipid moiety.<br />

CARBOHYD Glycosylation site.<br />

STR MOD Structural modification. Class denoting the changes<br />

to the protein structure without changes to the<br />

chemical composition.<br />

REGION Extent <strong>of</strong> a region <strong>of</strong> interest in the sequence.<br />

SITE Any interesting single amino-acid site on the sequence, that is not defined by<br />

another feature key.<br />

BINDING Binding type. Class denoting different<br />

physico-chemical forces leading to a bond formation<br />

between a protein structure component and a<br />

chemical entity.<br />

BINDING Binding site for any chemical group (co-enzyme, prosthetic group, etc.).<br />

METAL Binding site for a metal ion.<br />

DISULFID Disulfide bond.<br />

CROSSLNK Posttranslationally formed amino acid bonds.<br />

DNA BIND Extent <strong>of</strong> a DNA-binding region.<br />

NP BIND Extent <strong>of</strong> a nucleotide phosphate-binding region.<br />

ZN FING Extent <strong>of</strong> a zinc finger region.<br />

CA BIND Extent <strong>of</strong> a calcium-binding region.<br />

ENZ ACT Enzymatic activity. Types <strong>of</strong> enzymatic reactions as<br />

a subpart to protein functions.<br />

ACT SITE Amino acid(s) involved in the activity <strong>of</strong> an enzyme.<br />

CELL Cellular phenotype. Class denoting different cellular<br />

phenotypes that can be affected by structural or compositional<br />

changes <strong>of</strong> a protein.<br />

N/A<br />

Table 6.1: Biological categories for the classification <strong>of</strong> protein residue related information. Two sets<br />

<strong>of</strong> schemes were used: a text data motivated definition <strong>of</strong> categories (MAN) determined from manual<br />

analysis <strong>of</strong> sentences with <strong>annotation</strong>s for protein residues from MEDLINE, and key categories from the<br />

feature table <strong>of</strong> UniProtKB (FEAT).<br />



The association between a feature (w) and a category (c) was estimated based on<br />

their mutual information score<br />

\[ I(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}. \tag{6.1} \]<br />

The association between the multi-word term $T = \{w_i\}_{i=1}^{n}$ and a category $c$ was<br />

computed as the sum of the associations of its words<br />

\[ A(T, c) = P^{*}(c) \sum_{i=1}^{n} I(w_i, c), \tag{6.2} \]<br />

where $P^{*}(c)$ is the probability of a category associated with a term. The categorisation<br />

of a multi-word term into one of the categories amounts to the identification of the best<br />

fitting category $c^{*}$ for a term, based on the words in the term<br />

\[ c^{*} = \arg\max_{c} A(T, c). \tag{6.3} \]<br />
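The three equations can be illustrated with a minimal Python sketch. It is not the re-implementation used in this study: the whitespace tokenisation, the maximum-likelihood estimates from raw counts, the estimate of P*(c) as the fraction of reference terms labelled c, the zero score for unseen word-category pairs, and the toy reference set are all assumptions made for illustration.<br />

```python
import math
from collections import Counter


def train(reference):
    """Collect counts from (term, category) pairs of a labelled reference set."""
    pair_counts = Counter()      # (word, category) co-occurrences
    word_counts = Counter()      # word occurrences
    cat_word_counts = Counter()  # word occurrences per category
    cat_term_counts = Counter()  # labelled terms per category
    for term, category in reference:
        cat_term_counts[category] += 1
        for word in term.lower().split():  # assumption: whitespace tokenisation
            pair_counts[(word, category)] += 1
            word_counts[word] += 1
            cat_word_counts[category] += 1
    return pair_counts, word_counts, cat_word_counts, cat_term_counts


def mutual_information(word, category, model):
    """I(w, c) = log2(P(w, c) / (P(w) P(c))), cf. eq. (6.1)."""
    pair_counts, word_counts, cat_word_counts, _ = model
    total = sum(word_counts.values())
    if pair_counts[(word, category)] == 0:
        return 0.0  # assumption: unseen pairs contribute nothing
    p_wc = pair_counts[(word, category)] / total
    p_w = word_counts[word] / total
    p_c = cat_word_counts[category] / total
    return math.log2(p_wc / (p_w * p_c))


def association(term, category, model):
    """A(T, c) = P*(c) * sum_i I(w_i, c), cf. eq. (6.2)."""
    cat_term_counts = model[3]
    # assumption: P*(c) is the fraction of reference terms labelled c
    p_star = cat_term_counts[category] / sum(cat_term_counts.values())
    words = term.lower().split()
    return p_star * sum(mutual_information(w, category, model) for w in words)


def classify(term, model):
    """c* = argmax_c A(T, c), cf. eq. (6.3)."""
    return max(model[3], key=lambda c: association(term, c, model))


# Hypothetical toy reference set, not taken from the thesis data
reference = [
    ("complex formation", "BINDING"),
    ("hydrogen bond", "BINDING"),
    ("point mutation", "CHEM_MOD"),
    ("missense mutation", "CHEM_MOD"),
    ("beta sheet", "STR_COMP"),
]
model = train(reference)
label = classify("covalent complex formation", model)
```

In this toy setting the unseen word "covalent" adds nothing to any category, so the decision rests on the constituents "complex" and "formation".<br />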

The reference set was generated using maximal length noun phrase (MLNP) analysis.<br />

The assumption of this approach is that textual features co-occurring with a residue<br />

within a noun phrase ($NP_r$) are good candidate terms for functional annotation. In<br />

order to identify the boundaries of these candidate terms, the MLNP algorithm relies on<br />

the lookup of a determined set of noun phrases without nested residue entities ($NP_{\neg r}$). In<br />

other words, the algorithm assumes that terms nested in $NP_r$ are also expressed as stand-alone<br />

noun phrases, which can be identified by a broad syntactical analysis of MEDLINE.<br />

The following example illustrates the approach. Consider the term<br />

“complex formation”,<br />

which is identified as a stand-alone noun phrase $NP_{\neg r}$ in the sentence<br />

“The GlyNH2 was removed and the reactive-site peptide bond X18-Glu19<br />

was synthesized by complex formation with proteinase K.”<br />

(PMID:9047374).<br />

The same term co-occurs with a residue entity within another noun phrase ($NP_r$),<br />

“Rb-E2F-DNA complex formation”,<br />

in the sentence<br />

“MDM2 also interacts with Rb through its central acidic domain and inhibits<br />

Rb function in part by blocking Rb-E2F-DNA complex formation.”<br />

(PMID:16337594).<br />

The determined MLNP in this example is “complex formation”.<br />
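The lookup step can be sketched in Python as follows. The text does not spell out the exact matching procedure, so this is only one plausible reading: the noun phrase containing the residue is split on whitespace, and the longest contiguous word sequence that also occurs in the set of stand-alone noun phrases is returned. The function name and the small lookup set are hypothetical.<br />

```python
def longest_nested_np(np_with_residue, standalone_nps):
    """Return the longest contiguous sub-phrase found among the stand-alone NPs.

    Assumptions of this sketch: whitespace tokenisation and case-insensitive
    exact matching against the lookup set.
    """
    tokens = np_with_residue.split()
    known = {np.lower() for np in standalone_nps}
    # Try longer spans first, so the first hit is a maximal-length match
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            candidate = " ".join(tokens[start:start + length])
            if candidate.lower() in known:
                return candidate
    return None


# Hypothetical lookup set of stand-alone noun phrases
standalone = {"complex formation", "formation", "proteinase K"}
mlnp = longest_nested_np("Rb-E2F-DNA complex formation", standalone)
```

Searching from the longest span downwards means that "complex formation" is preferred over the shorter nested candidate "formation".<br />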

Once the set of MLNPs was extracted, each item (NP) was manually labelled according<br />

to a categorisation scheme. Within this study, two categorisation schemes (cf. table 6.1)<br />

were used and studied independently: the categories defined by manual analysis of MEDLINE<br />

sentences (bottom-up approach), and the categories defined as keys in the feature<br />

table of UniProtKB (top-down approach). The sets of categories from the bottom-up<br />

approach and from the top-down approach are referred to as MAN and FEAT, respectively, in this study.<br />

Table 6.2 compares the distribution of labels within the reference set.<br />

The following example illustrates how a determined MLNP can be used to find relevant<br />

information in the contextual features of a protein residue. From the sentence<br />



MAN category | Frequency | FEAT category | Frequency<br />
STR COMP | 433 | DOMAIN | 28<br />
 | | MOTIF | 8<br />
 | | TOPO DOM | 4<br />
 | | CHAIN | 2<br />
 | | TRANSMEM | 2<br />
 | | COIL | 1<br />
CHEM MOD | 361 | VARIANT | 275<br />
 | | MOD RES | 59<br />
 | | PEPTIDE | 13<br />
 | | VAR SEQ | 6<br />
 | | LIPID | 3<br />
 | | CARBOHYD | 1<br />
STR MOD | 25 | REGION | 100<br />
 | | SITE | 246<br />
BINDING | 195 | BINDING | 139<br />
 | | METAL | 25<br />
 | | DISULFID | 11<br />
 | | CROSSLNK | 10<br />
 | | DNA BIND | 6<br />
 | | NP BIND | 5<br />
 | | ZN FING | 2<br />
 | | CA BIND | 1<br />
ENZ ACT | 90 | ACT SITE | 110<br />
CELL | 161 | N/A | <br />
GEN BIOL | 2,172 | GEN BIOL | 2,372<br />
GEN ENG | 643 | GEN ENG | 651<br />

Table 6.2: Category distribution in the text feature reference set. The text feature reference set was<br />

compiled by maximal length noun phrase (MLNP) analysis from two sets of noun phrases: one without<br />

residue mentions and the other with identified protein residue entities. The features in the reference set<br />

were manually assigned labels of the categorisation schemes MAN and FEAT. GEN BIOL = general<br />

biological terminology; GEN ENG = general English words.<br />



“Mutation K241Q completely abolishes DNA glycosylase activity and<br />

covalent complex formation in the presence of NaBH4.” (PMID:9241232),<br />

the following relation can be identified:<br />

mutation K241Q—abolish—covalent complex formation.<br />

A semantic label can be assigned to the relation argument “covalent complex formation”<br />

because the nested term “complex formation” is labelled in the reference set.<br />

6.2 Evaluation methods<br />

The extraction of contextual features of residues results in a set of syntactical relations,<br />

which are represented as PAS. The performance of this extraction module was evaluated<br />

by comparing the returned PAS data with the manual annotations in the gold standard test<br />

corpus (cf. section 5.2). A true positive was counted if the syntactical relations in a PAS<br />

were correct and the arguments in the PAS contained the annotated residue entity and<br />

the marked keyword(s) in the test corpus. If either of these conditions was not met, a<br />

false positive was registered. The performance was measured in terms of precision, recall<br />

and F1-measure, as described earlier in section 5.3.<br />

The performance of the developed classification method was evaluated by 100 repetitions of<br />

5-fold cross-validation. For each iteration, the terms in the reference set were shuffled and<br />

partitioned into a test set (1/5 of the data) and a training set (4/5 of the data). The<br />

average precision, recall and F1-measure (cf. section 5.3) were calculated for each classifier<br />

from the resulting confusion matrix.<br />
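The resampling scheme can be sketched in Python as below. The sketch assumes that every shuffle is followed by a rotation through all five folds; if only a single 1/5 versus 4/5 split was drawn per iteration, the first split of each repetition corresponds to that reading. The function name and the seed parameter are illustrative choices.<br />

```python
import random


def repeated_kfold(items, k=5, repeats=100, seed=0):
    """Yield (train, test) splits for repeated k-fold cross-validation."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    for _ in range(repeats):
        shuffled = list(items)
        rng.shuffle(shuffled)
        # Deal the shuffled items into k folds of (nearly) equal size
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            test_items = folds[i]
            train_items = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train_items, test_items


# Usage with a hypothetical reference set of 20 labelled terms
splits = list(repeated_kfold(list(range(20)), k=5, repeats=3, seed=1))
```

Each split keeps the test and training items disjoint, so a classifier trained on `train_items` never sees the terms it is scored on.<br />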



Method | PAS available | PAS extracted | PAS common | Precision | Recall | F1<br />
Shallow parsing | 117 | 82 | 56 | 0.68 | 0.48 | 0.56<br />
Full parsing | 117 | 86 | 32 | 0.37 | 0.27 | 0.31<br />

Table 6.3: Evaluation of syntactical language parser performance. The performance of the two language<br />

parsers (shallow and full parsing) was evaluated on the basis of precision, recall and F1 measures by<br />

comparing the annotated PAS data in the test set with the PAS output returned by the parsers.<br />

6.3 Results<br />

In this section, the performances <strong>of</strong> contextual feature extraction and categorisation are<br />

studied. The test dataset is the gold standard corpus.<br />

6.3.1 Contextual feature extraction evaluated<br />

The objective in contextual feature extraction is to find textual features that are suitable<br />

as <strong>functional</strong> <strong>annotation</strong>s for protein residues.<br />

In this section, the performance <strong>of</strong> this extraction system is studied by comparing<br />

the results produced with two different language parsers: the shallow parser, and the full<br />

parser. Sentences from the gold standard corpus (GC) were used as test dataset for this<br />

analysis.<br />

Within this study, the analysis determined that the developed shallow parser performs<br />

better than the full parser ENJU. The shallow parser yielded an F1 measure<br />

of 0.56 (precision of 0.68 and recall of 0.48), while the full parser ENJU yielded an F1 measure<br />

of 0.31 (precision of 0.37 and recall of 0.27) (cf. table 6.3).<br />

The results suggest that contextual information of a residue entity can be extracted<br />

by syntactical analysis with an F1 measure of 0.56 and 0.31 for shallow parsing and<br />

full parsing, respectively.<br />
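The reported measures follow directly from the PAS counts in table 6.3, as the short Python sketch below shows. The function name is illustrative. The last digit of F1 can differ slightly depending on whether it is computed from rounded or unrounded precision and recall.<br />

```python
def precision_recall_f1(available, extracted, common):
    """Compute precision, recall and F1 from PAS counts.

    available: annotated PAS in the test corpus
    extracted: PAS returned by the parser
    common:    extracted PAS that match an annotated PAS
    """
    precision = common / extracted if extracted else 0.0
    recall = common / available if available else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts as listed in table 6.3
shallow = precision_recall_f1(available=117, extracted=82, common=56)
full = precision_recall_f1(available=117, extracted=86, common=32)
```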



6.3.2 Performance analysis <strong>of</strong> the classifiers<br />

One problem in functional annotation extraction is the semantic interpretation of the<br />

extracted text data. The solution proposed in this work is based on a classification<br />

approach. Two different categorisation schemes were tested in this study: MAN and<br />

FEAT. The performance of the developed classification method was evaluated by repeated<br />

cross-validation studies. Table 6.5 summarises the results derived from the confusion<br />

matrix (cf. table 6.4).<br />

For MAN, the top three performing classifiers, with F1 measures of 0.62, 0.57 and 0.57,<br />

are STR COMP (precision of 0.56, recall of 0.69), CHEM MOD (precision of 0.54, recall<br />

of 0.59) and BINDING (precision of 0.63, recall of 0.52). The whole classification<br />

system for this categorisation scheme yielded an average precision<br />

of 0.48 and an average recall of 0.42. In comparison, the classification based on FEAT has<br />

a much lower average performance: average precision of 0.24, average recall of 0.18. The<br />

weak performance of the FEAT classifiers is explained by the distribution of examples<br />

across the categories; for some categories, the number of corresponding features or examples<br />

is low (cf. table 6.2). A discussion is presented in section 6.4.<br />

Examining the false positives in the confusion matrix of MAN reveals that the classifiers<br />

most often confuse their categories with GEN BIOL (general biological terms) or GEN ENG<br />

(general English terms). This is not surprising, considering that English terms are ambiguous.<br />

In addition, some categories show confusions with others, e.g. STR COMP with<br />

CHEM MOD, and ENZ ACT with STR COMP. One explanation is that some terms<br />

can be assigned to more than one category. For example, “mutant structure” refers to<br />

an altered protein structure state, which is based on a chemical change in the protein<br />

sequence.<br />

Despite the moderate performance of some classifiers, the presented method can be<br />

used to assign categories to textual features. However, significant improvements in the<br />

performance of some classifiers are necessary before the system can be used automatically.<br />



Actual \ Prediction | BINDING | GEN BIOL | CELL | CHEM MOD | GEN ENG | ENZ ACT | STR COMP | STR MOD<br />
BINDING | 1,772 | 762 | 28 | 93 | 165 | 26 | 546 | 0<br />
GEN BIOL | 560 | 15,815 | 525 | 1,496 | 4,514 | 159 | 1,714 | 65<br />
CELL | 96 | 1,167 | 836 | 150 | 325 | 91 | 67 | 0<br />
CHEM MOD | 38 | 1,103 | 12 | 3,742 | 761 | 79 | 546 | 25<br />
GEN ENG | 144 | 2,556 | 126 | 510 | 1,820 | 46 | 480 | 35<br />
ENZ ACT | 33 | 338 | 80 | 201 | 226 | 324 | 457 | 0<br />
STR COMP | 160 | 783 | 64 | 551 | 592 | 35 | 4,914 | 11<br />
STR MOD | 1 | 91 | 1 | 129 | 125 | 0 | 21 | 43<br />

Table 6.4: Performance analysis of the classifiers (confusion matrix). Classification with categories<br />

from MAN was analysed by cross-validation studies with 100 iterations. The result is represented as a<br />

confusion matrix; rows give the actual category and columns the predicted category.<br />
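As an illustration of how per-category measures are read off such a matrix, the following Python sketch recomputes precision, recall and F1 from the counts in table 6.4, taking rows as actual and columns as predicted categories. The function and variable names are illustrative, and the macro-average (unweighted mean over categories) is an assumption about how the averages were formed.<br />

```python
LABELS = ["BINDING", "GEN BIOL", "CELL", "CHEM MOD",
          "GEN ENG", "ENZ ACT", "STR COMP", "STR MOD"]

# Counts from table 6.4: MATRIX[i][j] = terms of actual class i predicted as class j
MATRIX = [
    [1772, 762, 28, 93, 165, 26, 546, 0],
    [560, 15815, 525, 1496, 4514, 159, 1714, 65],
    [96, 1167, 836, 150, 325, 91, 67, 0],
    [38, 1103, 12, 3742, 761, 79, 546, 25],
    [144, 2556, 126, 510, 1820, 46, 480, 35],
    [33, 338, 80, 201, 226, 324, 457, 0],
    [160, 783, 64, 551, 592, 35, 4914, 11],
    [1, 91, 1, 129, 125, 0, 21, 43],
]


def per_class_metrics(labels, matrix):
    """Return {label: (precision, recall, f1)} from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


metrics = per_class_metrics(LABELS, MATRIX)
# Assumption: averages are unweighted means over the eight categories
macro_precision = sum(m[0] for m in metrics.values()) / len(metrics)
macro_recall = sum(m[1] for m in metrics.values()) / len(metrics)
```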



MAN category | Precision | Recall | F1 | FEAT category | Precision | Recall | F1<br />
STR COMP | 0.56 | 0.69 | 0.62 | DOMAIN | 0.50 | 0.24 | 0.32<br />
 | | | | MOTIF | 0.98 | 0.36 | 0.53<br />
 | | | | TOPO DOM | 0 | 0 | 0<br />
 | | | | CHAIN | 0 | 0 | 0<br />
 | | | | TRANSMEM | 0 | 0 | 0<br />
 | | | | COIL | 0 | 0 | 0<br />
CHEM MOD | 0.54 | 0.59 | 0.57 | VARIANT | 0.50 | 0.69 | 0.58<br />
 | | | | MOD RES | 0.40 | 0.23 | 0.29<br />
 | | | | PEPTIDE | 0.05 | 0.06 | 0.05<br />
 | | | | VAR SEQ | 0 | 0 | 0<br />
 | | | | LIPID | 1 | 0.32 | 0.48<br />
 | | | | CARBOHYD | 0 | 0 | 0<br />
STR MOD | 0.24 | 0.10 | 0.15 | REGION | 0.44 | 0.44 | 0.44<br />
 | | | | SITE | 0.40 | 0.55 | 0.46<br />
BINDING | 0.63 | 0.52 | 0.57 | BINDING | 0.41 | 0.45 | 0.43<br />
 | | | | METAL | 0.05 | 0.02 | 0.03<br />
 | | | | DISULFID | 0.53 | 0.15 | 0.23<br />
 | | | | CROSSLNK | 0 | 0 | 0<br />
 | | | | DNA BIND | 0 | 0 | 0<br />
 | | | | NP BIND | 0 | 0.06 | 0<br />
 | | | | ZN FING | 0 | 0 | 0<br />
 | | | | CA BIND | 0 | 0 | 0<br />
ENZ ACT | 0.43 | 0.20 | 0.27 | ACT SITE | 0.45 | 0.31 | 0.36<br />
CELL | 0.50 | 0.31 | 0.38 | N/A | | | <br />
GEN BIOL | 0.70 | 0.64 | 0.67 | GEN BIOL | 0.76 | 0.65 | 0.70<br />
GEN ENG | 0.21 | 0.32 | 0.26 | GEN ENG | 0.23 | 0.32 | 0.27<br />
Average | 0.48 | 0.42 | 0.43 | Average | 0.25 | 0.18 | 0.19<br />

Table 6.5: Performance evaluation of the classifiers (precision, recall, F1 measure). Evaluation of the<br />

classification of textual features (noun phrases). Classification with categories from MAN and FEAT was<br />

analysed by cross-validation studies with 100 iterations. The performance was measured in terms of<br />

precision, recall, and F1 measure.<br />



One option is to increase the amount of training data, or the number of features for each<br />

classifier. An alternative is to modify the definition of the classes. The results suggest<br />

that the algorithm is, in general, suitable for classification.<br />

6.4 Discussion<br />

The presented text mining solution extracts textual features from the context of residue<br />

entities. The identification of the contextual features, and their association with the residue<br />

entity, is based on the syntactical analysis of the sentence. More specifically, only a subset<br />

of semantic relations, namely those found in verbal and prepositional relations, is extracted<br />

from text. The advantage of this approach is that not only the semantic relation partners<br />

and the semantic relation type are found, but contextual information is extracted as well.<br />

Within this study, two approaches to syntactical analysis were compared, i.e. shallow<br />

parsing and full parsing. The result indicates that the ENJU parser had a weaker<br />

performance than the developed shallow parser. Manual analysis of the false positives<br />

indicates that incorrectly determined syntactical structures originate from<br />

false part-of-speech (POS) tagging. For example, in the sentence<br />

“Conversely, K382Q displays a highly altered responsiveness to the activator,<br />

suggesting that Lys(382) is involved in both activator binding and<br />

allosteric transition mechanism.” (PMID:10751408),<br />

both parsers identified “altered” as a past-tense verb, although the correct POS is a<br />

noun modifier. The performance of the POS tagger is critical for the detection of phrase<br />

boundaries. However, the two parsers rely on different methods for POS tagging, so the<br />

performance of the POS tagger has to be considered as well when comparing the shallow<br />

and the full parser. Table A.1 lists some examples where a parser failed to extract the<br />

annotated PAS data from GC.<br />



The extracted information is difficult to normalise, because there is no gold standard<br />

for how to represent the association and how to qualify the contextual information. In<br />

this work, the predicate-argument structure is used as a template for the extracted information.<br />

Although verb frame sets from PropBank or PASBio can be used to normalise<br />

the extracted data, they are not designed to capture descriptions of protein residue function.<br />

On the other hand, this gives the extraction method the advantage of being able to discover new<br />

knowledge. Because the extracted information is not normalised, the performance can<br />

only be measured in terms of sensitivity.<br />

The evaluation <strong>of</strong> the classification method indicates, that the presented approach can<br />

provide an automatic solution for text interpretation. However, some <strong>of</strong> the categories<br />

have only few examples, which is reflected in weak performances <strong>of</strong> the classifiers. One<br />

solution to this problem is to balance the example sets <strong>of</strong> each category, for example,<br />

by collecting more terminologies from MEDLINE. Alternatively, other categories may<br />

be defined to balance the ratio between a category and the associated set <strong>of</strong> examples.<br />

Yet another approach is not to classify arguments <strong>of</strong> a PAS, but cluster them based on<br />

their, for example, contextual usage.<br />

The advantage here is to find more information<br />

similarities among the PAS data by overcoming the information representativeness <strong>of</strong> a<br />

training (reference) set.<br />

Despite the fact that semantic labels can be assigned to the arguments in a PAS,<br />

the developed method is not able to interpret the meaning of the whole extracted text<br />

segment. For example, from the sentence<br />

“Specific binding of the WT and mutant receptors Cys14Ala and<br />

Cys199Ala was inhibited in the presence of the disulfide bond reducing<br />

agent, DTT, implying that disulfide bonds are formed and can be<br />

reduced in these mutant receptors.” (PMID:9202220),<br />

the following information was extracted, and semantic categories were assigned to the<br />

arguments of the PAS:<br />

pred = inhibited<br />

arg1 = Specific binding<br />

arg1-<strong>of</strong> = [the WT and mutant receptors CYS14 ALA and<br />

CYS199 ALA]/CHEM MOD<br />

arg2-in = the presence<br />

arg2-<strong>of</strong> = the disulfide bond reducing agent.<br />

Although one part <strong>of</strong> the information in the example has been correctly assigned with the<br />

label CHEM MOD, the entire text phrase should be labelled with BINDING. A solution<br />

to this problem is not trivial and requires several levels <strong>of</strong> linguistic analysis.<br />

6.5 Conclusion<br />

In this chapter, I have presented the developed contextual feature extraction system for<br />

the <strong>annotation</strong> <strong>of</strong> residue entities. Because a suitable terminological resource is not available,<br />

the identification <strong>of</strong> <strong>functional</strong> <strong>annotation</strong> is based on the extraction <strong>of</strong> syntactical<br />

relations between a residue entity and a noun phrase. The developed method allows the<br />

discovery <strong>of</strong> novel information that can provide key information for <strong>functional</strong> <strong>annotation</strong>.<br />

In the next chapter, I will demonstrate the validity <strong>of</strong> the extracted information as<br />

<strong>functional</strong> <strong>annotation</strong> <strong>of</strong> protein residues.<br />



Chapter 7<br />

Extraction <strong>of</strong> <strong>functional</strong> <strong>annotation</strong><br />

for protein residues from MEDLINE<br />

In the previous two chapters, two fundamental text mining components for functional<br />

annotation extraction were presented. In this chapter, I present the results of the combined<br />

extraction and assess the performance of the combined system. The objective of<br />

this study is to determine the qualitative and quantitative distribution of information in<br />

MEDLINE. Because the information is derived solely from biomedical abstract texts, it<br />

is necessary to examine the data in terms of validity, novelty, and biological significance.<br />

In the first part of the evaluation, the performance of the functional annotation extraction<br />

is studied on the gold standard corpus. Then the biological significance of the<br />

data extracted from MEDLINE is studied on two example proteins, the tumour suppressor protein<br />

p53 and the Janus kinase 2 protein. Finally, the distribution of information is examined<br />

by two specific analyses: the cross-validation of identified active site residues with CSA,<br />

and the cross-validation of binding residues with MSDsite.<br />



7.1 Evaluation methods<br />

The evaluation <strong>of</strong> the <strong>functional</strong> <strong>annotation</strong> extraction system was based on the performance<br />

analysis <strong>of</strong> its extraction components: protein residue identification, and contextual<br />

feature extraction (cf. section 5.3 and section 6.2).<br />

The analysis on the biological validity <strong>of</strong> the mined <strong>functional</strong> <strong>annotation</strong>s was done by<br />

manual analysis. For each protein residue, the set <strong>of</strong> extracted <strong>annotation</strong>s was reviewed<br />

and grouped by similar topics. Because a set <strong>of</strong> <strong>annotation</strong>s for each associated protein<br />

residue can be very large, random samples were drawn from a list <strong>of</strong> <strong>annotation</strong>s sorted<br />

by residue name and position. The result is a set <strong>of</strong> sample <strong>annotation</strong>s for each extracted<br />

residue <strong>of</strong> a protein. The information was compared with the corresponding <strong>annotation</strong>s<br />

in UniProtKB.<br />

The validation <strong>of</strong> catalytic residues was done by cross-validation with CSA [PBT04].<br />

The analysis was performed on three levels, i.e.<br />

the comparison <strong>of</strong> identified protein<br />

residues from MEDLINE with CSA, comparison <strong>of</strong> residues with extracted <strong>functional</strong> <strong>annotation</strong>s,<br />

and comparison <strong>of</strong> residues with extracted <strong>annotation</strong>s classified as ENZ ACT<br />

(cf. section 6.1.2). The residues were compared by using the combination <strong>of</strong> the identifiers<br />

RID+UID (cf. section 5.3).<br />

The validation <strong>of</strong> binding residues from MEDLINE extraction was done accordingly.<br />

The third level <strong>of</strong> validation compared residues with extracted <strong>annotation</strong>s classified as<br />

BINDING.<br />
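The three-level comparison amounts to set intersections over residue keys, which can be sketched in Python as follows. Each residue is represented here as a (RID, UID) tuple; the identifier values, the level names and the function name are hypothetical placeholders and do not come from CSA, MSDsite or the mined data.<br />

```python
def compare_with_reference(mined_levels, reference):
    """Count, per validation level, how many mined residues occur in the reference set."""
    report = {}
    for level, residues in mined_levels.items():
        common = residues & reference  # residues sharing the same (RID, UID) key
        report[level] = {
            "mined": len(residues),
            "common": len(common),
            "fraction_of_reference": len(common) / len(reference) if reference else 0.0,
        }
    return report


# Hypothetical reference set of catalytic residues keyed by (RID, UID)
reference = {("H57", "UID_A"), ("D102", "UID_A"), ("S195", "UID_A"), ("C25", "UID_B")}

# Hypothetical mined residues at the three validation levels
mined_levels = {
    "identified": {("H57", "UID_A"), ("D102", "UID_A"), ("S195", "UID_A"),
                   ("K10", "UID_A"), ("C25", "UID_B")},
    "annotated": {("H57", "UID_A"), ("S195", "UID_A"), ("K10", "UID_A")},
    "classified": {("S195", "UID_A")},
}
report = compare_with_reference(mined_levels, reference)
```

Keying on the pair rather than on the residue alone keeps identical residue names in different proteins apart.<br />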



7.2 Results<br />

7.2.1 Evaluation <strong>of</strong> the developed <strong>functional</strong> <strong>annotation</strong> extraction<br />

system<br />

The presented <strong>functional</strong> <strong>annotation</strong> extraction system consists <strong>of</strong> two basic modules:<br />

identification <strong>of</strong> protein residues, and contextual feature extraction. The following describes<br />

an analysis <strong>of</strong> the overall performance <strong>of</strong> the combined text mining system. The<br />

test set is the gold standard corpus (GC; cf. section 5.2). The evaluation was done<br />

in two respects: manual validation <strong>of</strong> extracted information, and cross-validation with<br />

UniProtKB <strong>annotation</strong>s.<br />

Manual validation <strong>of</strong> extracted information.<br />

The gold standard corpus consists<br />

<strong>of</strong> 100 abstract texts with tri-occurrences <strong>of</strong> the triplet protein, residue and organism.<br />

However, manual analysis identified only 51 abstract texts with residue entities that can<br />

be associated with their proteins and hosting organisms.<br />

The number <strong>of</strong> associations<br />

(OPR) is 172. This represents the target for protein residue identification.<br />

Corresponding to these OPRs is the set <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s (PAS data). For 109<br />

out <strong>of</strong> 172 OPRs, keywords were co-mentioned in verbal relations. The number <strong>of</strong> PAS<br />

associated with the 109 OPRs is 117. This represents the target <strong>of</strong> <strong>functional</strong> <strong>annotation</strong><br />

extraction.<br />

Figure 7.1 summarises the performance of the functional annotation extraction. With<br />

a previously determined precision of 0.82 and a recall of 0.38, the protein residue identification<br />

module detects 79 OPRs, of which 65 are correct. Contextual<br />

feature extraction for these 65 protein residues resulted in 35 PAS. In comparison<br />

with the 117 annotated PAS of the 109 OPRs, only 16 of the 35 extracted PAS are true<br />

positives. However, the total number of extracted PAS is 46, which results in a precision<br />

of 0.35 and a recall of 0.13. A systematic analysis revealed that the rate of false positives<br />



Dataset | PAS available | PAS extracted | PAS common | Precision | Recall | F1<br />
GC | 117 | 46 | 16 | 0.35 | 0.13 | 0.25<br />

Figure 7.1: Performance evaluation of the functional annotation extraction system. The performance<br />

depends on the two combined text mining modules: protein residue identification and contextual<br />

feature extraction. The performance was measured in terms of precision, recall, and F1 measure.<br />



has the following sources: a false positive OPR with an extracted PAS, a true positive<br />

OPR with no annotated PAS, and a true positive OPR with a false positive PAS.<br />

In comparison, if the system had identified all protein residues correctly, the<br />

whole extraction would have yielded a precision of 0.68 and a<br />

recall of 0.48 (cf. section 6.3). Considering that the presented text mining solution is a pilot<br />

approach to extracting functional annotations for the validation of predicted functional sites,<br />

the result is good for this area and comparable to the first studies in BioCreAtIvE or the Critical<br />

Assessment of Techniques for Protein Structure Prediction (CASP). The low recall can be<br />

explained by the performance of the contextual feature extraction module.<br />
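The effect of chaining the two modules can be made explicit by recomputing the measures from the counts quoted in this section. The Python sketch below only restates that arithmetic; the function name is illustrative, and values derived from the raw counts may differ in the last digits from the tabulated ones.<br />

```python
def stage_metrics(available, extracted, correct):
    """Precision, recall and F1 for one stage, computed from raw counts."""
    precision = correct / extracted if extracted else 0.0
    recall = correct / available if available else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Protein residue identification: 172 annotated OPRs, 79 detected, 65 correct
residue_stage = stage_metrics(available=172, extracted=79, correct=65)

# Combined system: 117 annotated PAS, 46 extracted, 16 correct
combined = stage_metrics(available=117, extracted=46, correct=16)

# Contextual feature extraction alone (shallow parser counts from table 6.3),
# i.e. the case in which all protein residues are identified correctly
upper_bound = stage_metrics(available=117, extracted=82, correct=56)
```

Comparing `combined` with `upper_bound` shows how much recall is lost once residue identification errors propagate into contextual feature extraction.<br />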

The result indicates that the extracted functional annotations have a reasonable precision<br />

in this first attempt at functional annotation extraction, but a low coverage.<br />

This can be explained by the combined performance of the text mining modules. On the<br />

one hand, an incorrectly determined protein residue leads to a false positive PAS. On<br />

the other hand, a failed entity recognition contributes to the false negative rate. In addition,<br />

language complexity and incorrectly parsed sentences are further reasons for the<br />

false positive and false negative rates of functional annotation extraction.<br />

In conclusion, the presented functional annotation extraction system delivers precise<br />

information, but has a low coverage. However, in the context of the bioinformatics<br />

work of this thesis, a precision-driven extraction system is preferred over a recall-oriented<br />

text mining solution.<br />

Cross-validation with UniProtKB functional annotations. Despite the low coverage<br />

of the functional annotation extraction system, the extracted information is correct<br />

and reusable for the annotation of protein residues. Table B.1 lists the 16 verified PAS<br />

data, corresponding to 17 verified protein residues. A comparison with UniProtKB shows<br />

that 5 out of 16 are rediscovered knowledge. The remaining 11 out of 16 contain novel<br />

information that can be used to update the protein knowledge base.<br />

The extraction <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s is a multi-step process. Although the<br />

performance <strong>of</strong> each module may not be optimal, the results demonstrate that<br />

<strong>functional</strong> <strong>annotation</strong>s are available and extractable from MEDLINE.<br />

7.2.2 Studying mined <strong>functional</strong> <strong>annotation</strong>s for the proteins p53 and Jak2<br />

UniProtKB curates <strong>functional</strong> <strong>annotation</strong>s for proteins on three levels: protein level,<br />

protein domain level, and protein residue level. The objective in this section is to study the<br />

validity and novelty <strong>of</strong> mined <strong>functional</strong> <strong>annotation</strong>s from whole MEDLINE extraction.<br />

The result provides an indication <strong>of</strong> the biological significance for automatic extraction<br />

from MEDLINE. The <strong>annotation</strong>s <strong>of</strong> two example proteins, p53 and Jak2, are analysed<br />

and compared with relevant information from UniProtKB.<br />

Tumour suppressor protein p53.<br />

p53 plays a critical role in preventing human cancer<br />

formation. In the native state, the protein assembles to a tetrameric phosphoprotein.<br />

It consists <strong>of</strong> four <strong>functional</strong> domains: (1) the proline-rich, acidic, N-terminus, which is<br />

involved in transcriptional activation, e.g. Mdm2 binding; (2) the central core, which<br />

binds DNA; (3) the oligomerisation domain with nuclear localisation signals, which allows<br />

the transfer into the nucleus; and (4) the C-terminus, which regulates DNA-binding<br />

[SYH + 03].<br />

The extraction <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s from MEDLINE for the human tumor protein<br />

p53 resulted in 1,665 PAS data.<br />

A manual analysis on samples <strong>of</strong> mined <strong>functional</strong><br />

<strong>annotation</strong>s indicates, that there are two main topics: the regulatory post-translational<br />

modification, and the binding activity <strong>of</strong> residues, where in some cases the interaction<br />

partner is also stated. Table C.1 lists example <strong>annotation</strong>s grouped by similar topics. For 5<br />

out <strong>of</strong> 6 <strong>of</strong> the identified residues with post-translational modification, i.e. THR18, SER46,<br />

SER15, THR55, and SER315, the extracted information is similar to the <strong>annotation</strong>s in<br />

the UniProtKB entry. The remaining residue, SER6, has no <strong>annotation</strong> in the UniProtKB.<br />

The knowledge base does not provide further information on the biological implication<br />

<strong>of</strong> these residues, while the extracted data contain more contextual information.<br />

For<br />

example:<br />

”[...]ATM-mediated phosphorylation <strong>of</strong> the ser15 site <strong>of</strong> p53[...]”<br />

(PMID:14757188),<br />

”[...]Ser46 phosphorylation activates p53-dependent apoptosis[...]”<br />

(PMID:17172844).<br />

The analysis also found <strong>annotation</strong>s for some critical residues that are not recorded in<br />

UniProtKB. For example:<br />

”[...]the amino acid change C135R generates the loss <strong>of</strong> TP53 DNA-binding<br />

activity[...]” (PMID:17914575),<br />

”[...]R248W abolish the association with p63[...]” (PMID:11172034).<br />
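The quoted snippets mention residues in several surface forms ("ser15", "Ser46", "C135R", "R248W"). As a rough illustration of how such mentions could be normalised, the following Python sketch uses two regular expressions. It is not the residue identification module of this thesis (cf. chapter 5); the pattern names and helper functions are assumptions made for this example, and the patterns cover only these simple forms.

```python
import re

# Three-letter residue mention followed by a position, e.g. "ser15" or "Ser46".
RESIDUE = re.compile(
    r"\b(ala|arg|asn|asp|cys|gln|glu|gly|his|ile|leu|lys|met|phe|pro|ser|thr|trp|tyr|val)"
    r"[- ]?(\d+)\b",
    re.IGNORECASE,
)
# One-letter point mutation: wild type, position, mutant, e.g. "C135R".
MUTATION = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b")


def find_residues(text):
    """Return normalised residue mentions such as 'SER15'."""
    return [f"{m.group(1).upper()}{m.group(2)}" for m in RESIDUE.finditer(text)]


def find_mutations(text):
    """Return (wild type, position, mutant) triples such as ('C', 135, 'R')."""
    return [(m.group(1), int(m.group(2)), m.group(3)) for m in MUTATION.finditer(text)]


print(find_residues("ATM-mediated phosphorylation of the ser15 site of p53"))
print(find_mutations("R248W abolish the association with p63"))
```

Protein names such as "p53" or "TP53" are not matched, because neither pattern accepts a lowercase "p" or a number without a trailing residue letter.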

The activity <strong>of</strong> p53 is thought to be regulated through a number <strong>of</strong> post-translational<br />

modifications at the N- and C-terminal regions. Review articles report that seven serines<br />

(SER6, SER9, SER15, SER20, SER33, SER37, and SER46) and two threonines (THR18,<br />

and THR81) in the N-terminal domain are modified by kinases upon exposure <strong>of</strong> cells to<br />

ionising radiation or UV light. The analysis shows that MEDLINE extraction can recover<br />

this information for the residues SER6, SER15, SER46, and THR18.<br />
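The comparison in this paragraph amounts to a set intersection. The sketch below restates it in Python using only the residues named above; the function name `coverage` is chosen here for illustration.

```python
# Residues reported in review articles as kinase targets in the N-terminal domain.
reviewed = {"SER6", "SER9", "SER15", "SER20", "SER33", "SER37", "SER46", "THR18", "THR81"}
# Residues for which MEDLINE extraction recovered this information.
recovered = {"SER6", "SER15", "SER46", "THR18"}


def coverage(found, reference):
    """Fraction of reference residues that were also found."""
    return len(found & reference) / len(reference)


missed = sorted(reviewed - recovered)
print(f"recovered {len(recovered & reviewed)} of {len(reviewed)} residues")
print("not recovered:", ", ".join(missed))
```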

Janus Kinase 2 (Jak2).<br />

Jak2 plays a crucial part in various growth factor and cytokine<br />

signalling pathways. Similar to other protein tyrosine kinases <strong>of</strong> the Janus kinase<br />

family, Jak2 consists <strong>of</strong> a tyrosine kinase domain and a tyrosine kinase-like domain. It is<br />

thought that the kinase-like domain can negatively regulate the kinase domain.<br />

The set <strong>of</strong> extracted <strong>functional</strong> <strong>annotation</strong>s for Jak2 comprises 624 PAS data, and<br />

contains information on only seven residues: L539 (1 <strong>annotation</strong>), W515 (1 <strong>annotation</strong>),<br />

K607 (2 <strong>annotation</strong>s), V617 (630 <strong>annotation</strong>s), F617 (5 <strong>annotation</strong>s; a reported variant<br />

associated with Budd-Chiari syndrome), V678 (3 <strong>annotation</strong>s), and D816 (1 <strong>annotation</strong>).<br />

A comparison with UniProtKB data shows that the extracted information for F617, K607,<br />

and L539 is similar to the <strong>annotation</strong>s in the database. These and other <strong>annotation</strong>s for<br />

D816, V678, and W515 describe mutation events (data not shown).<br />

In order to assess the extracted information on V617, random samples were selected<br />

and studied manually. The result <strong>of</strong> the analysis indicates, that the set <strong>of</strong> <strong>annotation</strong>s<br />

contains a lot <strong>of</strong> redundant information. The data can be grouped into two main topics:<br />

disease, and genetic origin. Table D.1 lists some examples <strong>of</strong> extracted <strong>functional</strong><br />

<strong>annotation</strong>s.<br />

The effect <strong>of</strong> mutating residue 617 on cellular function and its association with particular<br />

diseases have already been reported, but none <strong>of</strong> the extracted <strong>annotation</strong>s provides a<br />

molecular explanation. A survey <strong>of</strong> research publications on Jak2 revealed that myeloid<br />

and lymphoid malignancies are associated with Jak2 V617F. It is proposed that the mutation at<br />

residue 617 destabilises the interaction between the kinase and kinase-like domains, and thereby<br />

promotes kinase activation [POHS05]. These results suggest that the extracted<br />

information reflects pieces <strong>of</strong> evidence; however, their biological relations may not be<br />

available in the mined output or even in MEDLINE.<br />

In summary, the study <strong>of</strong> the mined <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> residues for the two proteins<br />

presented here indicates that MEDLINE contains information that recurs<br />

in a number <strong>of</strong> abstract texts. Despite the data redundancy, some <strong>functional</strong> <strong>annotation</strong>s<br />

are not contained in UniProtKB, indicating that MEDLINE extraction still yields<br />

novel information.<br />

7.2.3 Cross-validation <strong>of</strong> mined catalytic residues with CSA<br />

In the previous section, <strong>functional</strong> <strong>annotation</strong>s were extracted from MEDLINE, and for a<br />

range <strong>of</strong> <strong>annotation</strong>s, the contained information was analysed on its biological validity and<br />

novelty. This section focuses on enzyme-related information in the extracted <strong>annotation</strong>s.<br />

The objective is to study how reliable the extracted information is for the validation <strong>of</strong><br />

catalytic residues. The identified residues with these associated <strong>annotation</strong>s are compared<br />

with CSA. Figure 7.2 summarises the result <strong>of</strong> this analysis.<br />

The CSA lists 12,971 protein residues (RID+UID), <strong>of</strong> which 799 were identified in<br />

MEDLINE. The missing 12,172 protein residues in CSA can be explained by the performance<br />

<strong>of</strong> the identification system (cf. section 5.4). Another explanation is, that CSA<br />

is curated from full-text publication extraction, and the same information may not be<br />

available in MEDLINE.<br />

By selecting residues with extracted <strong>functional</strong> <strong>annotation</strong>s from MEDLINE, 691 out<br />

<strong>of</strong> 799 protein residues were retained. This result indicates that a lot <strong>of</strong> <strong>functional</strong> descriptions<br />

are available as contextual features <strong>of</strong> the identified protein residues. The result<br />

is consistent with previous performance evaluation studies (cf. section 6.4). With a precision<br />

<strong>of</strong> 0.43 and recall <strong>of</strong> 0.20, the classifier for the category ENZ ACT (cf. section 6.3)<br />

identified enzyme-related <strong>functional</strong> <strong>annotation</strong>s for 77 out <strong>of</strong> 691 protein residues. Manual<br />

analysis shows, that this reduction can be explained by the classifier’s performance.<br />

Another explanation is the absence <strong>of</strong> relevant contextual cues in the extracted text.<br />
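For orientation, the two scores reported for the ENZ ACT classifier can be combined into a single F-measure. The snippet below is a minimal sketch; an F-measure is not reported in this section, and the function name `f_measure` is an assumption made for this example.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall reported above for the category ENZ ACT.
enz_act_f1 = f_measure(0.43, 0.20)
print(f"F1 = {enz_act_f1:.2f}")
```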

A search for the term ”catalytic triad” in the sentences <strong>of</strong> the identified protein residues<br />

yielded a sub-selection <strong>of</strong> 221 out <strong>of</strong> 46,750 residues. A comparison with CSA shows<br />

that 44 out <strong>of</strong> 221 are rediscoveries <strong>of</strong> <strong>active</strong> site residues.<br />

The <strong>annotation</strong>s for the<br />

remaining 177 may contain supporting evidence to identify the residues as catalytic. A<br />

systematic analysis <strong>of</strong> these <strong>predicted</strong> catalytic residues should start with the 27 out <strong>of</strong><br />

177 residues, which have <strong>annotation</strong>s classified as ENZ ACT.<br />
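The cross-validation described here can be read as a split of the mined RID+UID pairs into those already listed in the reference database and those that are not. The following sketch illustrates that split on toy data; the identifiers are placeholders and do not reflect actual CSA or MEDLINE content, and the function name is chosen for this example.

```python
def cross_validate(mined, reference):
    """Split mined (RID, UID) pairs into rediscovered and candidate residues."""
    mined, reference = set(mined), set(reference)
    rediscovered = mined & reference  # already listed in the reference database
    candidates = mined - reference    # not listed: potential new entries
    return rediscovered, candidates


# Toy data only: placeholder identifiers.
mined_pairs = {("H57", "PROT1_TOY"), ("D102", "PROT1_TOY"), ("K12", "PROT2_TOY")}
reference_pairs = {("H57", "PROT1_TOY"), ("S195", "PROT1_TOY")}

rediscovered, candidates = cross_validate(mined_pairs, reference_pairs)
print(sorted(rediscovered))
print(sorted(candidates))
```

The counts reported above follow the same pattern: 44 rediscovered plus 177 remaining residues give the 221 residues of the sub-selection.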

In conclusion, the developed text mining system rediscovers <strong>active</strong> site residues by<br />

solely mining abstract text from MEDLINE. While the false positive rate is not known,<br />

the extraction identified 1,391 protein residues with enzyme-related <strong>functional</strong> <strong>annotation</strong>s.<br />

The significance <strong>of</strong> these potentially new CSA residues is being studied further in<br />

ongoing work.<br />

Figure 7.2: Cross-validation <strong>of</strong> text mined catalytic residues with CSA. The analysis is based<br />

on the comparison <strong>of</strong> the determined RID+UID pairs; the numbers give the counts <strong>of</strong> these<br />

pairs. RID = Residue identifier; UID = Uniprot identifier.<br />

Figure 7.3: Cross-validation <strong>of</strong> text mined binding residues with MSDsite. Annotation was studied<br />

at three levels: the mentioned protein residue alone, the residue with PAS data, and the residue with<br />

information on binding. The numbers give the counts <strong>of</strong> RID+UID pairs in the data. RID = Residue<br />

identifier; UID = Uniprot identifier.<br />

7.2.4 Annotation <strong>of</strong> protein residues in MSDsite<br />

The MSDsite [GDO + 05] holds a number <strong>of</strong> <strong>predicted</strong> ligand binding <strong>sites</strong>, by automatically<br />

analysing ligand contacting residues in the PDB. The objective in this section is to analyse<br />

how many <strong>of</strong> these binding residues can be annotated from mining MEDLINE.<br />

The analysis shows that 512 out <strong>of</strong> the 46,750 identified protein residues in MEDLINE<br />

are also contained in MSDsite (cf. figure 7.3). A large proportion <strong>of</strong> these residues are<br />

associated with PAS data (429 out <strong>of</strong> 512), while only a smaller subset <strong>of</strong> 12 have information<br />

classified as BINDING. Manual analysis shows, that all <strong>of</strong> these 12 <strong>annotation</strong>s are<br />

correct. They can be used to validate the <strong>predicted</strong> ligand binding residues in MSDsite<br />

(table E.1).<br />

For the remaining 417 out <strong>of</strong> 512 residues, the associated PAS data may still contain<br />

valid information for the <strong>annotation</strong>. However, a systematic analysis was not performed<br />

at this stage <strong>of</strong> study.<br />

In summary, a relatively small set <strong>of</strong> protein residues recovered from MEDLINE extraction<br />

can be used for the <strong>annotation</strong> <strong>of</strong> MSDsite entries.<br />

7.3 Discussion<br />

The extraction <strong>of</strong> <strong>functional</strong> <strong>annotation</strong> is a multi-step process, and the quality <strong>of</strong> the<br />

result has to be interpreted in context <strong>of</strong> each subprocess’ performance. Although the<br />

performances <strong>of</strong> each extraction module may not be at optimal level, the evaluation results<br />

indicate that the mined output contains biologically meaningful data. Considering that the<br />

validation <strong>of</strong> a <strong>predicted</strong> function requires evidence <strong>of</strong> biological function, the developed<br />

text mining system can become a valuable tool, for example for the protein function<br />

prediction assessment in the Critical Assessment <strong>of</strong> Techniques for Protein Structure Prediction<br />

(CASP) [LRTV07]. With the improvement <strong>of</strong> the information extraction modules,<br />

the quality <strong>of</strong> mined <strong>functional</strong> <strong>annotation</strong>s is expected to become more reliable.<br />

The biological relevance <strong>of</strong> the extracted <strong>functional</strong> <strong>annotation</strong> was demonstrated on<br />

two different proteins, p53 and Jak2.<br />

The results show that information in<br />

UniProtKB can be rediscovered from MEDLINE, and that novel information can be extracted<br />

as well. These <strong>functional</strong> <strong>annotation</strong>s can be considered to complement existing<br />

<strong>annotation</strong>s in UniProtKB. However, manual analysis on subsets <strong>of</strong> the extracted<br />

<strong>annotation</strong>s indicates that the information is represented redundantly in MEDLINE. One major<br />

reason is that biological facts are expressed repeatedly within the biological community.<br />

The study <strong>of</strong> identifying catalytic residues and binding residues from the mined <strong>functional</strong><br />

<strong>annotation</strong>s, and the cross-validation with CSA and MSDsite shows, that the developed<br />

text mining solution is able to find relevant data from MEDLINE. Although the<br />

developed classifiers perform weakly, it is not clear whether this fully explains<br />

the cross-validation results. It is possible that key information that would identify<br />

the biological role <strong>of</strong> the protein residues is not mentioned in abstract texts. Another<br />

explanation lies in the protein residue identification step, for which a<br />

low recall was measured.<br />

Although abstract texts cover only a subset <strong>of</strong> information from full-text articles, and<br />

information is represented repeatedly in MEDLINE, this study shows that the text mined<br />

information is biologically valid and contains snippets <strong>of</strong> additional information that are<br />

relevant for UniProtKB. For example, the extracted <strong>annotation</strong>s complement existing<br />

information in UniProtKB and provide first data on <strong>functional</strong> <strong>sites</strong> in proteins that<br />

have not yet been curated.<br />

7.4 Conclusion<br />

In this chapter, two text mining components were combined to form the <strong>functional</strong> <strong>annotation</strong><br />

extraction system. Performance analysis shows that the system is precise but<br />

has low coverage. However, the low recall is compensated by the fact that information<br />

is distributed redundantly. The extracted information is biologically valid and contains<br />

some novel data, which can be used to update UniProtKB. So far, <strong>functional</strong> <strong>annotation</strong>s<br />

<strong>of</strong> residues have been evaluated in isolation, i.e. independent from structural context in<br />

proteins. In the following chapter a biological context is created, by combining <strong>functional</strong><br />

<strong>annotation</strong>s with protein structure data (cf. chapter 3 and chapter 4).<br />

Chapter 8<br />

Combining <strong>active</strong> site prediction with mined <strong>functional</strong> <strong>annotation</strong>s<br />

The goal in this thesis is to combine information from two disjoint information resources.<br />

To this end, various methodologies were developed for the prediction <strong>of</strong> <strong>functional</strong> <strong>sites</strong><br />

in proteins, and the extraction <strong>of</strong> relevant information for the <strong>functional</strong> <strong>annotation</strong> <strong>of</strong><br />

protein residues from scientific articles.<br />

More specifically, a <strong>predicted</strong> <strong>functional</strong> site<br />

can be validated by a set <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> protein residues.<br />

Conversely, a<br />

set <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s requires a structural context to understand the molecular<br />

mechanism <strong>of</strong> a protein function.<br />

In the previous chapters, I have presented the results on 3D pattern mining from PDB<br />

(cf. chapter 3) and <strong>functional</strong> <strong>annotation</strong> extraction from MEDLINE (cf. chapters 5, 6,<br />

and 7). Here, the produced datasets are combined and analysed. The objective in this<br />

chapter is to validate <strong>predicted</strong> <strong>active</strong> <strong>sites</strong> that the data mining output may contain,<br />

by combining specific <strong>functional</strong> <strong>annotation</strong>s extracted from MEDLINE. The result is<br />

compared with data from CSA.<br />

Figure 8.1: Overview <strong>of</strong> processes and evaluation methods <strong>of</strong> combining the protein structure dataset<br />

and literature dataset.<br />

8.1 Algorithms<br />

8.1.1 Combining protein structure data with literature data<br />

Theory<br />

The method to combine PDB with MEDLINE data, i.e. the <strong>functional</strong> <strong>annotation</strong> <strong>of</strong> a<br />

residue from a protein structure, is based on the combination <strong>of</strong> two identifiers: RID+UID<br />

(cf. section 5.3). There are two major subtasks to combine the datasets (cf. figure 8.1):<br />

linking PDB entries to a Uniprot entry, and associating a residue with its co-mentioned<br />

protein in text.<br />
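Once both datasets are keyed by RID+UID, combining them reduces to a lookup on that shared key. The sketch below illustrates this idea; the function name and the placeholder entry `("A1", "PROT_TOY")` are assumptions made for this example, while the two thioredoxin residues and their PMID are taken from table 8.2.

```python
def annotate_structure_residues(structure_residues, literature):
    """Attach mined annotations to structure residues that share a (RID, UID) key."""
    return {key: literature[key] for key in structure_residues if key in literature}


# Keys are (RID, UID) pairs; the last structure residue is a placeholder without annotation.
structure_residues = [("C32", "THIO_HUMAN"), ("C35", "THIO_HUMAN"), ("A1", "PROT_TOY")]
literature = {
    ("C32", "THIO_HUMAN"): ["PMID:8805557"],
    ("C35", "THIO_HUMAN"): ["PMID:8805557"],
}

annotated = annotate_structure_residues(structure_residues, literature)
for (rid, uid), sources in annotated.items():
    print(rid, uid, sources)
```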

Mapping residues in PDB to UniProtKB.<br />

The mapping between PDB and UniProtKB,<br />

and the inherited mapping <strong>of</strong> a protein residue from a PDB entry to its UniProtKB sequence<br />

index, is a non-trivial task. One problem is that the author <strong>of</strong> a protein structure<br />

may have used an arbitrary residue numbering scheme that does not follow the wild-type<br />

protein sequence.<br />

Furthermore, residues in a protein deletion mutant may have<br />

been numbered sequentially, irrespective <strong>of</strong> sequence gaps. Another problem is that<br />

UniProtKB may not contain the corresponding protein sequence for a crystallised protein,<br />

which may be, for example, a novel splice variant.<br />

In some cases, cross-links from PDB to UniProtKB, or UniProtKB to PDB are available.<br />

However, over time the links may have become outdated. In order to find the correct<br />

mapping between the protein residue indices in both databases, an exhaustive sequence<br />

alignment is required. Various solutions and services have been provided for the periodic<br />

update <strong>of</strong> UniProtKB-PDB mappings [VMMR + 05] [Mar05] [VZHC05] [MSD08].<br />
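To make the idea of an alignment-derived index mapping concrete, the following Python sketch maps author residue numbers to sequence positions. It uses `difflib.SequenceMatcher` from the standard library, which finds matching blocks between two strings and is not a biological sequence aligner; it stands in here only for illustration and is not the method behind the mappings cited above. The toy UniProtKB sequence uses `X` for positions that are not shown in this chapter.

```python
from difflib import SequenceMatcher


def map_indices(structure_seq, uniprot_seq, structure_numbers):
    """Map author residue numbers to 1-based positions in the UniProtKB sequence."""
    matcher = SequenceMatcher(None, structure_seq, uniprot_seq, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            # block.a indexes the structure sequence, block.b the UniProtKB sequence.
            mapping[structure_numbers[block.a + k]] = block.b + k + 1
    return mapping


structure_seq = "PYTVVYFPVR"             # residues as observed in the structure
structure_numbers = list(range(2, 12))   # author numbering 2..11
uniprot_seq = "XXPYTVVYFPVR"             # toy sequence, first two positions unknown

mapping = map_indices(structure_seq, uniprot_seq, structure_numbers)
print(mapping)
```

A real mapping has to cope with gaps, mutations and missing residues, which is why a dedicated alignment procedure is needed in practice.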

Here, I reuse a previously published lookup table file [Mar05] for the mapping <strong>of</strong><br />

protein residues in PDB to UniProtKB. Note that the lookup table is based on the<br />

alignment analysis work <strong>of</strong> the Macromolecular Structure Database (MSD) group at the<br />

<strong>European</strong> Bioinformatics Institute [MSD08].<br />

Mapping protein residue in text to UniProtKB.<br />

The mapping <strong>of</strong> a residue entity<br />

in text to its co-mentioned protein, and ultimately the mapping to UniProtKB, is<br />

explained in section 5.1.<br />

Implementation<br />

The correct sequence index mapping <strong>of</strong> a PDB entry to its corresponding Uniprot entry<br />

was based on the lookup table produced by [Mar05] (version October 2008). An example<br />

<strong>of</strong> the lookup table data is shown in figure 8.2. The combination <strong>of</strong> the following keys was<br />

used to unambiguously map a residue from PDB to its Uniprot native sequence position:<br />

PDBID + chainID + RID.<br />

PDBID | chainID | serial | resName (PDB) | resSeq | UID | resName (UniProtKB) | seqIndex
11gs | B | 1 | PRO | 2 | GSTP1_HUMAN | P | 3
11gs | B | 2 | TYR | 3 | GSTP1_HUMAN | Y | 4
11gs | B | 3 | THR | 4 | GSTP1_HUMAN | T | 5
11gs | B | 4 | VAL | 5 | GSTP1_HUMAN | V | 6
11gs | B | 5 | VAL | 6 | GSTP1_HUMAN | V | 7
11gs | B | 6 | TYR | 7 | GSTP1_HUMAN | Y | 8
11gs | B | 7 | PHE | 8 | GSTP1_HUMAN | F | 9
11gs | B | 8 | PRO | 9 | GSTP1_HUMAN | P | 10
11gs | B | 9 | VAL | 10 | GSTP1_HUMAN | V | 11
11gs | B | 10 | ARG | 11 | GSTP1_HUMAN | R | 12

Figure 8.2: Lookup table for PDB/UniProtKB mapping. Excerpt <strong>of</strong> the lookup table to map protein<br />

residues from a PDB entry to the corresponding UniProtKB entry.<br />
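The key-based lookup can be sketched as a dictionary keyed by PDBID, chainID and the PDB residue number. The example below is a minimal illustration that assumes whitespace-separated columns in the order shown in the excerpt and UniProtKB entry names written with an underscore; the actual file format of the lookup table is not described here, and `build_lookup` is a name chosen for this example.

```python
ROWS = """\
11gs B 1 PRO 2 GSTP1_HUMAN P 3
11gs B 2 TYR 3 GSTP1_HUMAN Y 4
11gs B 3 THR 4 GSTP1_HUMAN T 5
"""


def build_lookup(text):
    """Map (PDBID, chainID, resSeq) to the corresponding (RID, UID) pair."""
    lookup = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        pdb_id, chain_id, _serial, _res3, res_seq, uid, res1, seq_index = line.split()
        # resSeq is kept as a string, since PDB residue numbers may carry insertion codes.
        lookup[(pdb_id, chain_id, res_seq)] = (f"{res1}{seq_index}", uid)
    return lookup


lookup = build_lookup(ROWS)
print(lookup[("11gs", "B", "3")])
```

Keeping the UniProtKB side as an RID+UID pair means the result can be compared directly with the residues identified in text.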

8.2 Evaluation methods<br />

The validation <strong>of</strong> identified catalytic residues was done by manual examination <strong>of</strong> the<br />

<strong>functional</strong> descriptions <strong>of</strong> annotated protein residues.<br />

Within this analysis 6 datasets<br />

were used (cf. section 7.2): CSA is the set <strong>of</strong> <strong>active</strong> site residues from the Catalytic Site<br />

Atlas [PBT04]; OLDFIELD is the set <strong>of</strong> residues in the non-redundant structure set from<br />

[Old02]; PATTERN is the set <strong>of</strong> residues from the data mined 3D patterns; OPR is the<br />

set <strong>of</strong> protein residues identified from MEDLINE extraction; FA is the subset <strong>of</strong> OPR,<br />

which have <strong>functional</strong> <strong>annotation</strong>s extracted from MEDLINE; and ENZ is the subset <strong>of</strong><br />

FA whose <strong>annotation</strong>s are classified as ENZ ACT, i.e. are<br />

enzyme-related.<br />

8.3 Results<br />

8.3.1 Protein residue mapping between three data resources<br />

This section gives an overview <strong>of</strong> the analysed datasets. Figure 8.3 summarises the data.<br />

OLDFIELD contains in total 341,365 protein residues, counted as RID+PDBID.<br />

328,796 out <strong>of</strong> 341,365 residues are found in the lookup table, which corresponds to<br />

280,521 RID+UID. In parallel, the residues from the mined 3D pattern set (PATTERN) were<br />

mapped to 24,500 RID+UID. The identification <strong>of</strong> protein residues in MEDLINE found<br />

a total <strong>of</strong> 132,476 RID+UID with a unique count <strong>of</strong> 46,750 RID+UID. This dataset is<br />

referred to as OPR. 36,569 out <strong>of</strong> 46,750 protein residues have <strong>functional</strong> <strong>annotation</strong>s (FA),<br />

while a subset <strong>of</strong> 1,467 out <strong>of</strong> 36,569 have <strong>annotation</strong>s classified as ENZ ACT<br />

(ENZ). A set analysis between OLDFIELD and OPR determined 2,402 common protein<br />

residues, <strong>of</strong> which 197 are also listed in CSA.<br />

Figure 8.3: Overview <strong>of</strong> the combined datasets from protein structure data and biomedical literature<br />

data. The combined dataset is analysed to identify <strong>active</strong> site residues. CSA = <strong>active</strong> site database; OPR<br />

= identified protein residues; PAS = contextual feature assigned to a protein residue; ENZ = contextual<br />

feature with enzyme-related information; OLDFIELD = protein structure subset from PDB; PATTERN<br />

= data mined structural features from OLDFIELD.<br />

In summary, for a large fraction <strong>of</strong> protein residues in OLDFIELD, a mapping to<br />

UniProtKB sequence indices is available. However, only 2,402 <strong>of</strong> them are recovered from<br />

MEDLINE extraction and can be used for validation.<br />
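The proportions behind this summary can be derived directly from the counts reported in this section. The following sketch does the arithmetic; the dictionary labels and the helper `share` are chosen for this example.

```python
counts = {
    "OLDFIELD residues (RID+PDBID)": 341_365,
    "found in lookup table": 328_796,
    "OPR (unique RID+UID)": 46_750,
    "FA": 36_569,
    "ENZ": 1_467,
    "OLDFIELD and OPR": 2_402,
    "OLDFIELD, OPR and CSA": 197,
}


def share(part, whole):
    """Percentage of `whole` that is covered by `part`."""
    return 100.0 * part / whole


pairs = [
    ("found in lookup table", "OLDFIELD residues (RID+PDBID)"),
    ("FA", "OPR (unique RID+UID)"),
    ("ENZ", "FA"),
    ("OLDFIELD and OPR", "OPR (unique RID+UID)"),
    ("OLDFIELD, OPR and CSA", "OLDFIELD and OPR"),
]
for part, whole in pairs:
    print(f"{part} / {whole}: {share(counts[part], counts[whole]):.1f}%")
```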

8.3.2 Rediscovery <strong>of</strong> <strong>active</strong> <strong>sites</strong> and catalytic residues<br />

The identification <strong>of</strong> catalytic residues from protein structure data mining, and from<br />

biomedical literature mining was studied previously (cf. sections 4.2 and 7.2). Each<br />

result was evaluated by cross-validation with CSA. This section studies the validation <strong>of</strong><br />

<strong>predicted</strong> <strong>active</strong> <strong>sites</strong> from the combined datasets.<br />

Previously, three structural patterns were identified as <strong>active</strong> <strong>sites</strong>, by cross-validation<br />

with CSA (cf. chapter 4). One <strong>of</strong> the patterns represents the well-known catalytic triad.<br />

This pattern was found in 19 proteins within the dataset (cf. section 4.2). Associated<br />

with these 19 proteins is a set <strong>of</strong> 57 protein residues. The analysis shows that only 3 out<br />

<strong>of</strong> 57 residues were identified in MEDLINE. The 3 residues identified in text belong<br />

to the same protein, bovine chymotrypsinogen (cf. table 8.1). The associated <strong>functional</strong><br />

<strong>annotation</strong>s for the residues ASP102 and HIS57 were not classified as ENZ ACT. The<br />

information contained in these <strong>annotation</strong>s only indirectly indicates the catalytic property<br />

<strong>of</strong> these residues; the <strong>annotation</strong>s do not mention them as part <strong>of</strong> the catalytic triad. In<br />

conclusion, a structure-based prediction <strong>of</strong> an <strong>active</strong> site was not validated by literature<br />

data.<br />

The intersection <strong>of</strong> PATTERN, OPR, and CSA results in a set <strong>of</strong> 15 protein residues.<br />

RID+UID: S195 CTRA_BOVIN; D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence: ”These include the NH2-terminal four residues, the sequences near histidine-57 (chymotrypsinogen A numbering system), aspartic acid-102, aspartic acid-189, and serine-195, the regions <strong>of</strong> the three disulfide bridges, and the COOH-terminal end (residues 225-229) <strong>of</strong> the proteins. When aligned to maximize homology the identity <strong>of</strong> residues is 34%.” (PMID:804314)
PAS: N/A

RID+UID: D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence: ”In bovine chymotrypsinogen A in 2H2O at 31 degrees C, histidine-57 has a pK’ <strong>of</strong> 7.3 and aspartate-102 a pK’ <strong>of</strong> 1.4, and the histidine-40-aspartate-194 system exhibits inflections at pH 4.6 and 2.3.” (PMID:31898)
PAS: pred = has; arg1 = HIS57; arg2 = a pK; arg2-<strong>of</strong> = 7.3 and ASP102 a pK; arg2-<strong>of</strong> = 1.4

RID+UID: D102 CTRA_BOVIN
Sentence: ”In bovine chymotrypsin Aalpha under the same conditions, the histidine-57-aspartate-102 system has pK’ values <strong>of</strong> 6.1 and 2.8, and histidine-40 has a pK’ <strong>of</strong> 7.2.” (PMID:31898)
PAS: pred = have; arg1 = the HIS57 ASP102 system; arg2 = pK values; arg2-<strong>of</strong> = 6.1 and 2.8

RID+UID: D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence: ”The results suggest that the pK’ <strong>of</strong> histidine-57 is higher than the pK’ <strong>of</strong> aspartate-102 in both zymogen and enzyme.” (PMID:31898)
PAS: pred = is; arg1 = that the pK; arg1-<strong>of</strong> = HIS57; arg2 = higher than the pK; arg2-<strong>of</strong> = ASP102; arg2-in = both zymogen and enzyme

RID+UID: H57 CTRA_BOVIN
Sentence: ”The 1H NMR chemical shift <strong>of</strong> the Cepsilon1 H <strong>of</strong> histidine-57 in the chymotrypsin Aalpha-pancreatic trypsin inhibitor (Kunitz) complex is constant between pH 3 and 9 at a value similar to that <strong>of</strong> histidine-57 in the porcine trypsin-pancreatic trypsin inhibitor complex [Markley, J.L., and Porubcan, M. A. (1976), J. Mol. Biol. 102, 487–509], suggesting that the mechanisms <strong>of</strong> interaction are similar in the two complexes.” (PMID:31898)
PAS: pred = is; arg1 = complex; arg2 = constant; arg2-between = pH 3 and 9; arg2-at = a value similar; arg2-to = that; arg2-<strong>of</strong> = HIS57; arg2-in = the porcine trypsin-pancreatic trypsin inhibitor complex

Table 8.1: Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen.<br />

Owing to the performance <strong>of</strong> the <strong>functional</strong> <strong>annotation</strong> extraction system and the availability <strong>of</strong> information<br />

in MEDLINE, only little information was extracted. The mined information on the <strong>active</strong> site<br />

residues mentions their catalytic properties only indirectly.<br />
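The PAS entries in table 8.1 consist of a predicate and a list of labelled arguments, where a label such as arg2-of may occur more than once. One way to hold such a record in code is sketched below in Python; the class and method names are assumptions made for this illustration and do not describe the data model used in this thesis.

```python
from dataclasses import dataclass, field


@dataclass
class PAS:
    """Predicate-argument structure: a predicate plus ordered (role, text) arguments."""
    pred: str
    args: list = field(default_factory=list)

    def mentions(self, residue):
        """True if any argument text contains the given residue identifier."""
        return any(residue in text for _role, text in self.args)

    def roles(self, prefix):
        """Argument texts whose role label starts with the given prefix."""
        return [text for role, text in self.args if role.startswith(prefix)]


# Fourth entry of table 8.1 (PMID:31898).
pas = PAS(
    pred="is",
    args=[
        ("arg1", "that the pK"),
        ("arg1-of", "HIS57"),
        ("arg2", "higher than the pK"),
        ("arg2-of", "ASP102"),
        ("arg2-in", "both zymogen and enzyme"),
    ],
)
print(pas.mentions("HIS57"), pas.roles("arg2"))
```

A list of pairs is used instead of a dictionary so that repeated role labels are preserved in their original order.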

RID+UID<br />

Sentence<br />

PAS<br />

RID+UID<br />

Sentence<br />

PAS<br />

C32 THIO HUMAN; C35 THIO HUMAN<br />

”A hydrogen bond between the sulfhydryls <strong>of</strong> Cys32 and Cys35 may reduce the pKa <strong>of</strong> Cys32<br />

and this pKa depression probably results in increased nucleophilicity <strong>of</strong> the Cys32 thiolate<br />

group.” (PMID:8805557)<br />

pred = reduce<br />

arg1 = A hydrogen bond<br />

arg1-between = the sulfhydryls<br />

arg1-<strong>of</strong> = CYS32 and CYS35<br />

arg2 = the pKa<br />

arg2-<strong>of</strong> = [CYS32 and this pKa depression]/ENZ ACT<br />

C215 PTN1 HUMAN<br />

”The structure <strong>of</strong> the catalytically in<strong>active</strong> mutant (C215S) <strong>of</strong> the human proteintyrosine<br />

phosphatase 1B (PTP1B) has been solved to high resolution in two complexes.”<br />

(PMID:9391040)<br />

pred = solved<br />

arg1 = [in<strong>active</strong> mutant (C215S)]/ENZ ACT<br />

arg1-<strong>of</strong> = the human protein-tyrosine phosphatase 1B (PTP1B)<br />

arg2 = unk<br />

arg2-to = to high resolution<br />

arg2-in = in two complexes<br />

Table 8.2: Identified catalytic residues from MEDLINE extraction. The mined <strong>functional</strong> <strong>annotation</strong>s<br />

were classified as enzyme-related, suggesting that the corresponding protein residue has catalytic properties.<br />

The identified residues were also cross-validated with CSA; however, the mined 3D patterns containing<br />

these residues were not validated as <strong>active</strong> site residues by the database.<br />

The analysis shows that only 3 out <strong>of</strong> 15 protein residues have enzyme-related <strong>annotation</strong>s.<br />

Two <strong>of</strong> these three residues belong to human thioredoxin (cf. table 8.2). However,<br />

none <strong>of</strong> the mined 3D patterns provides a structural context for the identified catalytic<br />

residues. A manual analysis <strong>of</strong> the remaining 12 residues shows that some <strong>of</strong> the associated<br />

<strong>annotation</strong>s were not correctly classified as enzyme-related, which can be explained by<br />

the performance <strong>of</strong> the classifier (cf. section 6.3).<br />

Of the 197 protein residues in the intersection <strong>of</strong> OLDFIELD, OPR,<br />

and CSA, 16 are co-mentioned with the term ”catalytic triad” within a sentence. While none<br />

<strong>of</strong> the 16 residues are associated with a mined 3D pattern, 6 out <strong>of</strong> 16 residues have<br />

enzyme-related <strong>functional</strong> <strong>annotation</strong>s (cf. table 8.3).<br />
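The co-mention search described above can be sketched as a simple filter over extracted annotation records. This is a minimal illustration, not the system used in the thesis: the record layout `MinedAnnotation`, the function `co_mentioned`, and the toy entry `X1 TOY_PROT` are assumptions introduced here, and the `enzyme_related` flag stands in for the output of the annotation classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinedAnnotation:
    """One extracted annotation (hypothetical record layout)."""
    residue: str          # residue and protein identifier, e.g. "S80 HNL_HEVBR"
    sentence: str         # source sentence from MEDLINE
    enzyme_related: bool  # label assigned by the annotation classifier


def co_mentioned(annotations, term="catalytic triad"):
    """Return (residues co-mentioned with term, those also enzyme-related)."""
    needle = term.lower()
    hits = [a for a in annotations if needle in a.sentence.lower()]
    all_residues = {a.residue for a in hits}
    enzyme_residues = {a.residue for a in hits if a.enzyme_related}
    return all_residues, enzyme_residues


# Toy records; the third entry is invented purely for illustration.
annotations = [
    MinedAnnotation("S80 HNL_HEVBR",
                    "... involving the catalytic triad Ser80, His235, and Asp207 ...",
                    True),
    MinedAnnotation("C215 PTN1_HUMAN",
                    "The structure of the catalytically inactive mutant (C215S) ...",
                    True),
    MinedAnnotation("X1 TOY_PROT",
                    "Residue X1 is part of a Catalytic Triad.",
                    False),
]
found, enzyme = co_mentioned(annotations)
print(sorted(found), sorted(enzyme))
```

The search is case-insensitive and sentence-based, so a residue counts as co-mentioned only if the term occurs in the same sentence as the residue.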

In conclusion, the results <strong>of</strong> this study indicate that the coverage <strong>of</strong> relevant information<br />

for validating <strong>predicted</strong> <strong>active</strong> <strong>sites</strong> is too low. Some <strong>of</strong> the enzyme-related<br />

<strong>annotation</strong>s are biologically valid, but have no correlation with a 3D pattern.<br />



RID+UID: S80 HNL_HEVBR; D207 HNL_HEVBR; H235 HNL_HEVBR<br />
Sentence: ”Our results yielded further support for an enzymatic mechanism involving the catalytic triad Ser80, His235, and Asp207 as a general acid/base.” (PMID:11354003)<br />
PAS:<br />
pred = involving<br />
arg1 = further support<br />
arg1-for = for an enzymatic mechanism<br />
arg2 = [the catalytic triad SER80, HIS235, and ASP207]/ENZ ACT<br />

RID+UID: E132 LINB_PSEPA; D108 LINB_PSEPA; H272 LINB_PSEPA<br />
Sentence: ”The enzyme belongs to the alpha/beta hydrolase family and contains a catalytic triad (Asp108, His272, and Glu132) in the lipase-like topological arrangement previously proposed from mutagenesis experiments.” (PMID:11087355)<br />
PAS:<br />
pred = contains<br />
arg1 = unk<br />
arg1-to = the alpha/beta hydrolase family and<br />
arg2 = [a catalytic triad (ASP108, HIS272, and GLU132)]/ENZ ACT<br />

Table 8.3: Catalytic triad residues available from the mined <strong>functional</strong> <strong>annotation</strong>s. The <strong>active</strong> site<br />

residues were identified by a search for the term ”catalytic triad” in the mined <strong>functional</strong> <strong>annotation</strong><br />

data. The validity was also confirmed by comparison with CSA.<br />

8.3.3 Search for novel catalytic residues<br />

In the previous section, the combined dataset was evaluated by cross-validation with CSA.<br />

Thus the identified catalytic residues represent only re-discoveries <strong>of</strong> known data. The<br />

goal in this section is to search for novel catalytic residues by combining enzyme-related<br />

<strong>annotation</strong>s with mined 3D patterns.<br />

A set analysis <strong>of</strong> CSA, OLDFIELD, and OPR revealed that 2,205 residues<br />

are included in both OLDFIELD and OPR, but not in CSA (cf. figure 8.3). A search for<br />

the term ”catalytic triad” in the sentences <strong>of</strong> these 2,205 residues resulted in a<br />

subselection <strong>of</strong> 24 residues. None <strong>of</strong> the 24 residues was found in<br />

the mined 3D patterns. However, 15 <strong>of</strong> the 24 residues have enzyme-related <strong>annotation</strong>s<br />

(cf. table F.1), suggesting that they are catalytic residues. A manual analysis confirmed<br />

that the <strong>annotation</strong>s contain valid evidence for identifying the residues as catalytic.<br />
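The set analysis in this section reduces to plain set operations once each dataset is represented as a set of residue identifiers. The sketch below assumes exactly that representation; the function names and the identifiers `r1` to `r6` are hypothetical and do not come from the thesis data.

```python
def candidate_novel_residues(oldfield, opr, csa):
    """Residues present in both OLDFIELD and OPR but absent from CSA."""
    return (oldfield & opr) - csa


def known_residues(oldfield, opr, csa):
    """Residues shared by all three datasets (usable for cross-validation)."""
    return oldfield & opr & csa


# Toy identifiers standing in for residue keys.
oldfield = {"r1", "r2", "r3", "r4"}
opr = {"r2", "r3", "r4", "r5"}
csa = {"r3", "r6"}

novel = candidate_novel_residues(oldfield, opr, csa)
known = known_residues(oldfield, opr, csa)
print(sorted(novel), sorted(known))
```

In these terms, the 2,205 residues correspond to the first function applied to the real datasets, and the 197 residues of the previous section to the second.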

The result <strong>of</strong> this study indicates that MEDLINE extraction can find some additional<br />

catalytic residues that are not represented in CSA. However, a correlation with the mined<br />

3D patterns was not found, and <strong>functional</strong> <strong>annotation</strong>s were not interpreted in a structural<br />

context.<br />



8.3.4 General correlation found between <strong>predicted</strong> <strong>functional</strong><br />

<strong>sites</strong> and extracted <strong>functional</strong> <strong>annotation</strong>s<br />

Previously, the validation <strong>of</strong> <strong>predicted</strong> <strong>active</strong> <strong>sites</strong> was studied by cross-validation <strong>of</strong> known<br />

catalytic residues. In this section a more general correlation analysis between structure<br />

and function data is studied. Because the coverage <strong>of</strong> extracted <strong>functional</strong> <strong>annotation</strong>s<br />

<strong>of</strong> protein residues is too low to be useful to annotate the residues <strong>of</strong> the prediction,<br />

we cannot expect that all residues in one prediction are annotated with a description <strong>of</strong><br />

biological function. However, if a <strong>predicted</strong> <strong>functional</strong> site has some features which point<br />

to a common concept <strong>of</strong> function, then this can be used to prioritise the prediction.<br />

Table 8.4 (left panel) shows the top 25 mined structural patterns which were ranked<br />

by the number <strong>of</strong> distinct residues with PAS data. In total, 168 patterns have <strong>annotation</strong>s,<br />

ranging from one to a maximum <strong>of</strong> nine distinct annotated residues. Another<br />

view takes into consideration the number <strong>of</strong> annotated residues relative to the total<br />

number <strong>of</strong> residues in a prediction (cf. table 8.4, right panel). This gives an indication <strong>of</strong><br />

how frequent a pattern is and how much we know about each residue from the text-mined<br />

data.<br />
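The two rankings can be illustrated with a short sketch. It uses three rows taken from table 8.4, each given as (pattern, A, B), where A is the number of residues with PAS data and B the total number of residues in the pattern. The function names are assumptions made for this example.

```python
# (pattern, A = residues with PAS data, B = residues in pattern), from table 8.4
rows = [
    ("9 10 16 CYS CYS PHE-1", 6, 12),
    ("10 11 11 ALA HIS HIS-1", 4, 6),
    ("17 11 9 ALA LEU VAL-1", 3, 102),
]


def rank_by_annotated(rows):
    """Left-panel view: order patterns by the number of annotated residues A."""
    return sorted(rows, key=lambda r: r[1], reverse=True)


def rank_by_fraction(rows):
    """Right-panel view: order patterns by the annotated fraction A/B."""
    return sorted(rows, key=lambda r: r[1] / r[2], reverse=True)


for pattern, a, b in rank_by_fraction(rows):
    print(f"{pattern}: A={a}, B={b}, A/B={a / b:.4f}")
```

A frequent pattern such as ALA LEU VAL (B = 102) drops to the bottom of the second ranking, because only a small fraction of its residues carries any annotation.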

The biological features extracted from text for protein residues match a number<br />

<strong>of</strong> different proteins, including homologous proteins. So far, the <strong>annotation</strong> <strong>of</strong> residues<br />

in a <strong>predicted</strong> <strong>functional</strong> site considered only first-level information (<strong>annotation</strong>s for the exact<br />

protein); however, the correlation analysis can also exploit information from homologous<br />

proteins (second-level information). Based on the information from the Homology-derived<br />

Secondary Structure <strong>of</strong> Proteins (HSSP) database [SS96], the <strong>annotation</strong> <strong>of</strong> the prediction<br />

was expanded with information extracted for homologues. The result <strong>of</strong> this study shows<br />

that the number <strong>of</strong> residue <strong>annotation</strong>s increased by 10% (cf. table 8.5). A control analysis<br />

<strong>of</strong> how many residues in the non-redundant protein dataset OLDFIELD are identified<br />

in MEDLINE and how many <strong>of</strong> these have an association with PAS data indicates that<br />

the low recall <strong>of</strong> the developed text mining system is the reason for the weak <strong>annotation</strong><br />



Columns in both panels: #residues with PAS (A) | Pattern | #residues in pattern (B) | A/B<br />
Left panel (ranked by A) || Right panel (ranked by A/B)<br />
6 | 9 10 16 CYS CYS PHE-1 | 12 | 0.5 || 4 | 10 11 11 ALA HIS HIS-1 | 6 | 0.6667<br />
4 | 10 15 11 ASP HIS TRP-2 | 18 | 0.2222 || 4 | 9 15 11 GLN LEU TRP-2 | 6 | 0.6667<br />
4 | 10 11 20 HIS MET PHE-1 | 12 | 0.3333 || 6 | 9 10 16 CYS CYS PHE-1 | 12 | 0.5<br />
4 | 9 18 11 GLY MET TYR-1 | 12 | 0.3333 || 3 | 10 13 10 CYS PHE TYR-1 | 6 | 0.5<br />
4 | 9 11 17 ALA LEU VAL-1 | 30 | 0.1333 || 4 | 10 11 20 HIS MET PHE-1 | 12 | 0.3333<br />
4 | 8 9 10 CYS CYS HIS-1 | 12 | 0.3333 || 4 | 11 18 9 CYS ILE PHE-1 | 12 | 0.3333<br />
4 | 11 8 18 HIS HIS SER-1 | 12 | 0.3333 || 4 | 11 8 18 HIS HIS SER-1 | 12 | 0.3333<br />
4 | 11 18 9 CYS ILE PHE-1 | 12 | 0.3333 || 4 | 18 10 10 ASP CYS PHE-1 | 12 | 0.3333<br />
4 | 11 11 12 HIS HIS MET-1 | 21 | 0.1905 || 4 | 19 11 10 ASP CYS ILE-1 | 12 | 0.3333<br />
4 | 9 15 11 GLN LEU TRP-2 | 6 | 0.6667 || 4 | 20 9 11 ASP GLY MET-1 | 12 | 0.3333<br />
4 | 10 15 11 ASP HIS TRP-1 | 15 | 0.2667 || 4 | 8 9 10 CYS CYS HIS-1 | 12 | 0.3333<br />
4 | 10 11 11 ALA HIS HIS-1 | 6 | 0.6667 || 4 | 9 18 11 GLY MET TYR-1 | 12 | 0.3333<br />
4 | 20 9 11 ASP GLY MET-1 | 12 | 0.3333 || 3 | 9 10 8 CYS HIS MET-1 | 9 | 0.3333<br />
4 | 18 10 10 ASP CYS PHE-1 | 12 | 0.3333 || 2 | 11 13 9 ASN LYS SER-1 | 6 | 0.3333<br />
4 | 19 11 10 ASP CYS ILE-1 | 12 | 0.3333 || 2 | 11 14 8 ALA ARG ASN-2 | 6 | 0.3333<br />
4 | 11 14 7 ASP MET SER-1 | 18 | 0.2222 || 2 | 11 17 10 CYS PHE PRO-1 | 6 | 0.3333<br />
4 | 9 17 10 ALA ILE PHE-1 | 18 | 0.2222 || 2 | 18 10 11 ARG GLU PRO-1 | 6 | 0.3333<br />
3 | 9 10 8 CYS HIS MET-1 | 9 | 0.3333 || 2 | 19 9 11 ALA PRO TYR-1 | 6 | 0.3333<br />
3 | 10 13 10 CYS PHE TYR-1 | 6 | 0.5 || 2 | 9 11 9 ASP CYS LYS-1 | 6 | 0.3333<br />
3 | 21 11 10 CYS GLY VAL-1 | 21 | 0.1429 || 1 | 10 10 20 HIS PRO TYR-1 | 3 | 0.3333<br />
3 | 11 9 9 ASP MET SER-1 | 15 | 0.2 || 1 | 10 12 11 ILE LEU PHE-1 | 3 | 0.3333<br />
3 | 17 11 9 ALA LEU VAL-1 | 102 | 0.0294 || 1 | 14 8 7 ASP HIS SER-1 | 3 | 0.3333<br />
3 | 10 10 19 ALA HIS MET-1 | 18 | 0.1667 || 1 | 8 11 17 GLU THR THR-1 | 3 | 0.3333<br />
3 | 8 8 15 ASP HIS SER-1 | 33 | 0.0909 || 4 | 10 15 11 ASP HIS TRP-1 | 15 | 0.2667<br />
3 | 10 9 11 CYS VAL VAL-1 | 33 | 0.0909 || 4 | 10 15 11 ASP HIS TRP-2 | 18 | 0.2222<br />

Table 8.4: Functional <strong>annotation</strong>s <strong>of</strong> protein residues in <strong>predicted</strong> <strong>functional</strong> <strong>sites</strong>. A <strong>functional</strong> site is<br />

<strong>predicted</strong> as a structure pattern that is recurrent among a non-redundant set <strong>of</strong> proteins. The table on<br />

the left panel lists the top 25 patterns ranked by the total number <strong>of</strong> annotated protein residues for each<br />

pattern, while the table on the right panel ranks the pattern by the total number <strong>of</strong> annotated protein<br />

residues in context <strong>of</strong> total number <strong>of</strong> residues found in all structure examples.<br />



Residue Annotations<br />
Dataset | OPR (-HSSP) | FA (-HSSP) | OPR (+HSSP) | FA (+HSSP)<br />
OLDFIELD | 2,402 | 1,963 | 243 | 192<br />
PATTERN | 168 | 132 | 16 | 19<br />

Table 8.5: Homology-based transfer <strong>of</strong> extracted <strong>functional</strong> <strong>annotation</strong>s for protein residues in the<br />

mined pattern data. Based on the HSSP information the identified protein residues and their associated<br />

<strong>functional</strong> <strong>annotation</strong>s were transferred from homologous proteins to the target proteins and residues in<br />

the mined structure pattern data.<br />

expansion.<br />
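The homology-based transfer described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: annotations are held in a dictionary keyed by residue identifier, and the mapping from a target residue to its equivalent residues in homologous proteins is assumed to have been derived beforehand (for example from HSSP alignments). The function name, the residue keys, and the annotation strings are hypothetical.

```python
def transfer_annotations(direct, homologues):
    """Expand first-level annotations with second-level ones from homologues.

    direct:     dict mapping residue id -> list of annotations for that residue
    homologues: dict mapping target residue id -> list of equivalent residue
                ids in homologous proteins (assumed to be precomputed)
    """
    # Copy the first-level annotations so the input is left untouched.
    expanded = {res: list(anns) for res, anns in direct.items()}
    for target, sources in homologues.items():
        for src in sources:
            # Only first-level annotations are transferred (no chaining).
            for ann in direct.get(src, []):
                bucket = expanded.setdefault(target, [])
                if ann not in bucket:
                    bucket.append(ann)
    return expanded


direct = {"A:C32": ["reduce pKa"], "B:C30": ["nucleophile"]}
homologues = {"A:C32": ["B:C30"], "C:C10": ["B:C30"]}
expanded = transfer_annotations(direct, homologues)
print(expanded)
```

Reading from `direct` rather than from the growing `expanded` dictionary keeps the transfer to a single step, so an annotation never propagates from homologue to homologue.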

In conclusion, a general correlation between protein structure and function data is<br />

found in this study. The set <strong>of</strong> available <strong>annotation</strong>s for protein residues is an indication<br />

<strong>of</strong> biological function for a <strong>predicted</strong> <strong>functional</strong> site. The biological significance <strong>of</strong> this<br />

result is being investigated further.<br />

8.4 Discussion<br />

The distribution <strong>of</strong> information in the combined data was studied by a search for <strong>active</strong><br />

site residues. Another approach in sampling the dataset is the identification <strong>of</strong> ligand<br />

binding residues. A search can be done from the protein structure data, by selecting only<br />

residues <strong>of</strong> an identified metal binding site, and then consulting the literature for relevant<br />

<strong>annotation</strong>s.<br />

The validation <strong>of</strong> a <strong>predicted</strong> <strong>active</strong> site in this study demonstrates that the amount<br />

<strong>of</strong> extracted <strong>functional</strong> <strong>annotation</strong>s was not sufficient for this task.<br />

Considering that<br />

the catalytic triad is a well characterised structural feature, the information should be<br />

available in MEDLINE. In fact, by searching for the term ”catalytic triad” in the text<br />

mined data, several associations between the term and residues can be found. A close<br />

examination reveals that some are <strong>annotation</strong>s for homologous proteins with the Asp-<br />

His-Ser catalytic triad motif (data not shown).<br />

However, the results <strong>of</strong> the presented<br />

studies indicate that the recall <strong>of</strong> the text mining system is too low to capture sufficient<br />

<strong>annotation</strong>s for protein homologues.<br />

Despite the identification <strong>of</strong> some catalytic residues in this analysis, it must be noted<br />

that literature-based verification <strong>of</strong> <strong>predicted</strong> <strong>active</strong> <strong>sites</strong> cannot rule out the detection <strong>of</strong><br />

false positives. The absence <strong>of</strong> biological evidence in the literature does not mean that<br />

the prediction is wrong, but simply that no knowledge is currently available. Biological<br />

research is hypothesis-driven, and therefore not all <strong>of</strong> the <strong>predicted</strong> <strong>active</strong> site residues<br />

are expected to be reported in the literature, if they have not been a biological research<br />

target.<br />

8.5 Conclusion<br />

In this chapter I performed a correlation analysis between the dataset from protein structure<br />

data mining and literature mining.<br />

The result <strong>of</strong> this study suggests that the<br />

combined data show little correlation. For example, a structure-based prediction <strong>of</strong> an<br />

<strong>active</strong> site had no <strong>functional</strong> <strong>annotation</strong>s with biological evidence, although the result was<br />

cross-validated with CSA. Conversely, literature-based identification <strong>of</strong> catalytic residues<br />

could not be interpreted in an evolutionarily conserved structural context, because data<br />

mining did not find a suitable recurrent structure pattern.<br />



Chapter 9<br />

Conclusions and future work<br />

9.1 Summary <strong>of</strong> main contributions<br />

The goal <strong>of</strong> this thesis was to identify <strong>functional</strong> <strong>sites</strong> in proteins. For this purpose a<br />

novel approach that combines protein structure data mining and literature mining was<br />

used. Below is a summary <strong>of</strong> contributions.<br />

Significance testing <strong>of</strong> residue interaction is a novel approach to identify statistically<br />

significant spatial and chemical configurations <strong>of</strong> residues.<br />

The developed<br />

method relies solely on mathematical models, and the analysis shows that recurrent<br />

homologous or convergent structural features can be extracted. More importantly,<br />

the mined result contains biologically valid data. For example, 22 proteins with the<br />

catalytic triad were identified from cross-validation studies. Altogether, the developed<br />

data mining method can be used to discover novel information; the result is a<br />

prediction <strong>of</strong> <strong>functional</strong> <strong>sites</strong>.<br />

Identification <strong>of</strong> protein residues is an important text mining component developed<br />

in this study for the extraction <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s. The implemented solution<br />

utilises regular expression patterns, and lists <strong>of</strong> terminologies from UniProtKB and<br />

NCBI Taxonomy, in order to find and associate biological entities. Ultimately, an<br />

identified protein residue is mapped to a UniProt protein, which means other extracted<br />

information can be integrated into UniProtKB. With a precision <strong>of</strong> 0.82 and<br />

a recall <strong>of</strong> 0.38, residues can be identified and associated precisely with their UniProt<br />

proteins. From an analysis <strong>of</strong> the whole <strong>of</strong> MEDLINE, 15,110 abstracts were found that<br />

can be used for information extraction for 2,884 UniProtKB/PDB proteins.<br />

Contextual feature extraction is a discovery-driven information extraction approach<br />

for finding descriptions <strong>of</strong> function associated with a residue entity in the text. The developed<br />

method extracts from a parsed sentence the verbal and prepositional relations<br />

between a residue and its contextual features. The Gene Ontology was not used, because<br />

it does not contain suitable terminologies for the identification <strong>of</strong> <strong>functional</strong> descriptions<br />

<strong>of</strong> residues. With a precision <strong>of</strong> 0.68 and a recall <strong>of</strong> 0.48, the language parser<br />

found 46,750 <strong>annotation</strong>s for the identified protein residues from MEDLINE. Manual<br />

analysis indicates that some <strong>of</strong> the extracted <strong>annotation</strong>s are valid, and contain<br />

novel information that can be used to update the feature table in UniProtKB.<br />

Annotation <strong>of</strong> protein structures is the main objective in this thesis. The goal is to<br />

create a synthesis between protein structure data and protein function data. The<br />

hypothesis is that the intersection <strong>of</strong> information from both datasets can lead to<br />

the discovery <strong>of</strong> new biological information. For example, a <strong>predicted</strong> <strong>active</strong> site can<br />

be validated with evidence from the set <strong>of</strong> <strong>functional</strong> <strong>annotation</strong>s. Although cross-validation<br />

demonstrates that the information mined from PDB and the literature contains<br />

correct results, no correlation was found between the two datasets. Nevertheless, the<br />

text-mined information is valid, and 1,391 catalytic residues were found that can<br />

be used to update CSA.<br />
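As an illustration of the regular-expression component summarised above, the following sketch recognises residue mentions written as a three-letter amino acid code followed by a sequence position (for example "Ser80") and normalises them to the upper-case form used in the PAS tables. The pattern and the function name are assumptions made for this example; the patterns actually used in the thesis, and the mapping of a mention to a UniProtKB protein through terminology lists, are not reproduced here.

```python
import re

# Three-letter codes of the 20 standard amino acids.
AMINO_ACIDS = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
               "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# A code, an optional hyphen or space, then the sequence position.
RESIDUE_MENTION = re.compile(rf"\b({AMINO_ACIDS})[- ]?(\d+)\b")


def find_residue_mentions(sentence):
    """Return normalised residue mentions such as 'SER80' in order of occurrence."""
    return [f"{name.upper()}{position}"
            for name, position in RESIDUE_MENTION.findall(sentence)]


sentence = ("Our results yielded further support for an enzymatic mechanism "
            "involving the catalytic triad Ser80, His235, and Asp207 as a "
            "general acid/base.")
mentions = find_residue_mentions(sentence)
print(mentions)
```

A motif name such as "Asp-His-Ser" yields no match, because none of its residue codes is followed by a position. One-letter notations such as "C215S" would need additional patterns.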



9.2 Limitations and future work<br />

In the course <strong>of</strong> this thesis, various research techniques and three major analysis<br />

components have been developed. Their algorithms and implementations were explained,<br />

their performance analysed, and suggestions for improvement made. The<br />

following is a discussion <strong>of</strong> improvements for the combined dataset analysis.<br />

To biologically validate a <strong>predicted</strong> <strong>functional</strong> site with published experimental<br />

results, it has to be assumed that the <strong>functional</strong> <strong>annotation</strong>s extracted from the literature<br />

provide sufficient supporting evidence for a biological function. This has been shown to<br />

be partly correct for some examples. However, it will probably not work in all cases. My<br />

results suggest that other factors have to be considered in order to achieve one <strong>of</strong> the<br />

following: (1) a standardised description <strong>of</strong> the function <strong>of</strong> protein residues; (2) identification<br />

<strong>of</strong> a representative <strong>functional</strong> concept <strong>of</strong> a structural feature; and (3) verification <strong>of</strong> the<br />

validity <strong>of</strong> the pattern as a consensus <strong>functional</strong> site, where other protein<br />

examples share the same <strong>annotation</strong>s. Although the verification approach uses the vast<br />

and broad information available in MEDLINE, the analysis indicates that this might<br />

not be sufficient for this task.<br />

Another serious limitation <strong>of</strong> the literature-based verification <strong>of</strong> <strong>functional</strong> <strong>sites</strong> is<br />

that our knowledge <strong>of</strong> the protein function space could be incomplete or<br />

even incorrect. Protein structure data mining aims to deliver biologically unbiased results,<br />

since 3D pattern mining relies on mathematical models and no biological knowledge is<br />

used. The result is a prediction <strong>of</strong> <strong>functional</strong> <strong>sites</strong>. However, the input is biologically<br />

biased.<br />

Currently, we do not have the complete knowledge <strong>of</strong> the fold space, which<br />

means the actual distribution <strong>of</strong> structural features may be skewed. As a consequence,<br />

the prediction may contain a large fraction <strong>of</strong> false positives. In the long run, various<br />

structural genomics initiatives may expand our knowledge <strong>of</strong> the fold space.<br />

In the meantime, the literature is the main resource <strong>of</strong> biological evidence for validating<br />

predictions. Yet our knowledge <strong>of</strong> protein residue function, and even the spectrum <strong>of</strong><br />

biological function, has still to be determined.<br />

This can lead to four scenarios: (1) a<br />

true <strong>functional</strong> site is fully supported by evidence (true positive); (2) a true <strong>functional</strong><br />

site is partly supported by evidence (incomplete knowledge); (3) a falsely <strong>predicted</strong><br />

<strong>functional</strong> site is partly supported by evidence (incomplete knowledge); and (4) a falsely<br />

<strong>predicted</strong> <strong>functional</strong> site is fully supported by contradictory evidence (false positive).<br />

While, from a bioinformatics point <strong>of</strong> view, there is little we can do about this problem,<br />

the identification <strong>of</strong> cases (2), (3), and (4) can suggest further biological experiments<br />

to find the missing data.<br />
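The four scenarios can be written down as a small decision function. This is only a restatement of the case distinction above. The function name and the two boolean inputs are assumptions made for this sketch, and in practice neither the true status of a site nor the completeness of the evidence is known in advance.

```python
def validation_scenario(is_true_site, evidence_is_complete):
    """Map a prediction to one of the four scenarios described in the text."""
    if is_true_site and evidence_is_complete:
        return "true positive"          # scenario (1)
    if not is_true_site and evidence_is_complete:
        return "false positive"         # scenario (4)
    # Partial evidence: scenario (2) for a true site, (3) for a false one.
    return "incomplete knowledge"


for true_site in (True, False):
    for complete in (True, False):
        print(true_site, complete, validation_scenario(true_site, complete))
```

Scenarios (2) and (3) produce the same label, which is the point made in the text: with partial evidence, a true and a falsely predicted site cannot be told apart without further experiments.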



Bibliography<br />

[AGM + 90]<br />

SF Altschul, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local<br />

alignment search tool. Journal <strong>of</strong> Molecular Biology, 215(3):403–10, 1990.<br />

[AL02]<br />

M Ashburner and SE Lewis. On ontologies for biologists: the gene ontology<br />

- uncoupling the web. Novartis Foundation Symposium, 2002.<br />

[AMS + 97]<br />

SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and<br />

DJ Lipman. Gapped BLAST and PSI-BLAST: a new generation <strong>of</strong> protein<br />

database search programs. Nucleic Acids Research, 25(17):3389–402, 1997.<br />

[APG + 94] PJ Artymiuk, AR Poirrette, HM Grindley, DW Rice, and P Willett. A<br />

graph-theoretic approach to the identification <strong>of</strong> three-dimensional patterns<br />

<strong>of</strong> amino acid side-chains in protein structures. Journal <strong>of</strong> Molecular Biology,<br />

243(2):327–44, 1994.<br />

[Att02]<br />

TK Attwood. The PRINTS database: a resource for identification <strong>of</strong> protein<br />

families. Brief Bioinform, 3(3):252–63, 2002.<br />

[AZP + 05]<br />

G Ausiello, A Zanzoni, D Peluso, A Via, and M Helmer-Citterich. pdbFun:<br />

mass selection and fast comparison <strong>of</strong> annotated PDB residues.<br />

Nucleic<br />

Acids Research, 33:W133–137, Jul 2005.<br />



[BFL04]<br />

T Binkowski, P Freeman, and J Liang. pvSOAR: detecting similar surface<br />

patterns <strong>of</strong> pocket and void surfaces <strong>of</strong> amino acid residues on proteins.<br />

Nucleic Acids Research, 32:555–558, 2004.<br />

[BFW + 94]<br />

A Barth, K Frost, M Wahab, W Brandt, HD Schadler, and R Franke. Classification<br />

<strong>of</strong> serine proteases derived from steric comparisons <strong>of</strong> their <strong>active</strong><br />

<strong>sites</strong>, part ii: ”ser, his, asp arrangements in proteolytic and nonproteolytic<br />

proteins”. Drug Design Discovery, 2:89–111, November 1994.<br />

[BGH + 00]<br />

WC Barker, JS Garavelli, H Huang, PB Mcgarvey, BC Orcutt, GY Srinivasarao,<br />

C Xiao, LL Yeh, RS Ledley, JF Janda, F Pfeiffer, HW Mewes,<br />

A Tsugita, and C Wu. The protein information resource (pir). Nucleic<br />

Acids Research, 28(1):41–44, January 2000.<br />

[BKL00]<br />

SE Brenner, P Koehl, and M Levitt. The astral compendium for protein<br />

structure and sequence analysis. Nucleic Acids Research, 28(1):254–256,<br />

January 2000.<br />

[BLK + 08]<br />

E Beisswanger, V Lee, JJ Kim, D Rebholz-Schuhmann, A Splendiani,<br />

O Dameron, S Schulz, and U Hahn. Gene regulation ontology (gro): design<br />

principles and use cases. Studies in health technology and informatics,<br />

136:9–14, 2008.<br />

[BM05]<br />

R Bunescu and RJ Mooney. A shortest path dependency kernel for relation<br />

extraction.<br />

In Proceedings <strong>of</strong> the Joint Conference on Human Language<br />

Technology / Empirical Methods in Natural Language Processing<br />

(HLT/EMNLP’05), 2005.<br />

[BM06]<br />

R Bunescu and RJ Mooney. Subsequence kernels for relation extraction. In<br />

Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information<br />

Processing Systems 18, pages 171–178. MIT Press, 2006.<br />



[BMC08] BMC. Biomed central. http://www.biomedcentral.com/, November 2008.<br />

[BT03]<br />

JA Barker and JM Thornton. An algorithm for constraint-based structural<br />

template matching: application to 3D templates with statistical analysis.<br />

Bioinformatics, 19(13):1644–1649, September 2003.<br />

[BW03]<br />

PE Bourne and H Weissig. Structural Bioinformatics (Methods <strong>of</strong> Biochemical<br />

Analysis, V. 44). Wiley-Liss, 1 edition, February 2003.<br />

[BW05]<br />

CJO Baker and R Witte. Mutation miner - textual <strong>annotation</strong> <strong>of</strong> protein<br />

structures. CERMM Symposium, 2005.<br />

[BWF + 00]<br />

HM Berman, J Westbrook, Z Feng, G Gilliland, TN Bhat, H Weissig,<br />

IN Shindyalov, and PE Bourne. The protein data bank. Nucleic Acids<br />

Research, 28(1):235–242, January 2000.<br />

[CB94]<br />

RR Copley and GJ Barton. A structural analysis <strong>of</strong> phosphate and sulphate<br />

binding <strong>sites</strong> in proteins. Estimation <strong>of</strong> propensities for binding and conservation<br />

<strong>of</strong> phosphate binding <strong>sites</strong>. Journal <strong>of</strong> Molecular Biology, 242:321–<br />

329, Sep 1994.<br />

[CCR + 08]<br />

BL Cantarel, PM Coutinho, C Rancurel, T Bernard, V Lombard, and<br />

B Henrissat. The Carbohydrate-Active EnZymes database (CAZy): an<br />

expert resource for Glycogenomics. Nucleic Acids Research, Oct 2008.<br />

[Cer00]<br />

F Cerbah. Exogenous and endogenous approaches to semantic categorization<br />

<strong>of</strong> unknown technical terms. In Proceedings <strong>of</strong> the 18th International<br />

Conference on Computational Linguistics (COLING), pages 145–151,<br />

2000.<br />

[CFK + 05]<br />

BY Chen, VY F<strong>of</strong>anov, DM Kristensen, M Kimmel, O Lichtarge, and<br />

LE Kavraki. Algorithms for structural comparison and statistical analysis<br />



<strong>of</strong> 3D protein motifs. Pacific Symposium on Biocomputing, pages 334–345,<br />

2005.<br />

[Cha93]<br />

P Chakrabarti. Anion binding <strong>sites</strong> in protein structures. Journal <strong>of</strong> Molecular<br />

Biology, 234:463–482, Nov 1993.<br />

[CHR + 02]<br />

JM Castagnetto, SW Hennessy, VA Roberts, ED Getz<strong>of</strong>f, JA Tainer, and<br />

ME Pique. Mdb: the metalloprotein database and browser at the scripps<br />

research institute. Nucleic Acids Research, 30(1):379–382, January 2002.<br />

[CK06] IG Choi and SH Kim. Evolution <strong>of</strong> protein structural classes and protein<br />

sequence families. Proceedings <strong>of</strong> the National Academy <strong>of</strong> Sciences,<br />

September 2006.<br />

[CL64]<br />

RV Cochran and LH Lund. On the kirkwood superposition approximation.<br />

Journal <strong>of</strong> Physical Chemistry, 1964.<br />

[CMP05]<br />

J Crim, R McDonald, and F Pereira. <strong>Automatic</strong>ally annotating documents<br />

with normalized gene lists. BMC Bioinformatics, 6 Suppl 1, 2005.<br />

[CMR06]<br />

P Corbett and P Murray-Rust. High-throughput identification <strong>of</strong> chemistry<br />

in life science texts. In Computational Life Sciences II, pages 107–118.<br />

Springer, 2006.<br />

[CSL + 06]<br />

FM Couto, MJ Silva, V Lee, E Dimmer, E Camon, R Apweiler, H Kirsch,<br />

and D Rebholz-Schuhmann. Goannotator: linking protein go <strong>annotation</strong>s<br />

to evidence text. Journal <strong>of</strong> Biomedical Discovery and Collaboration, 1:19+,<br />

December 2006.<br />

[DBAD03]<br />

R Day, DA Beck, RS Armen, and V Daggett. A consensus view <strong>of</strong> fold<br />

space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein<br />

Science, 12:2150–2160, Oct 2003.<br />



[DCG + 04]<br />

F Diella, S Cameron, C Gemuend, R Linding, A Via, B Kuster, ST Ponten,<br />

N Blom, and TJ Gibson. Phospho.elm: a database <strong>of</strong> experimentally verified<br />

phosphorylation <strong>sites</strong> in eukaryotic proteins. BMC Bioinformatics, 5, June<br />

2004.<br />

[DS05]<br />

A Doms and M Schroeder. Gopubmed: exploring pubmed with the gene<br />

ontology. Nucleic Acids Research, 33(Web Server issue), July 2005.<br />

[FGS98]<br />

JS Fetrow, A Godzik, and J Skolnick. Functional analysis <strong>of</strong> the escherichia<br />

coli genome using the sequence-to-structure-to-function paradigm: identification<br />

<strong>of</strong> proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase<br />

activity. Journal <strong>of</strong> Molecular Biology, 282(4):703–711, October<br />

1998.<br />

[FKY + 01]<br />

C Friedman, P Kra, H Yu, M Krauthammer, and A Rzhetsky. Genies: a<br />

natural-language processing system for the extraction <strong>of</strong> molecular pathways<br />

from journal articles. Bioinformatics, 17 Suppl 1, 2001.<br />

[Fri07]<br />

D Frishman. Protein <strong>annotation</strong> at genomic scale: the current status. Chem<br />

Rev, 107(8):3448–3466, August 2007.<br />

[FS98]<br />

JS Fetrow and J Skolnick. Method for prediction <strong>of</strong> protein function from sequence<br />

using the sequence-to-structure-to-function paradigm with application<br />

to glutaredoxins/thioredoxins and T1 ribonucleases. Journal <strong>of</strong> Molecular<br />

Biology, 281(5), September 1998.<br />

[Fuk98]<br />

K Fukuda. Toward information extraction: identifying protein names from<br />

biological papers, 1998.<br />

[FWLN94] D Fischer, H Wolfson, SL Lin, and R Nussinov. Three-dimensional, sequence<br />

order-independent structural comparison <strong>of</strong> a serine protease against<br />



the crystallographic database reveals <strong>active</strong> site similarities: potential implications<br />

to evolution and to protein folding. Protein Science, 3(5):769–778,<br />

May 1994.<br />

[GDAW03]<br />

R Gaizauskas, G Demetriou, PJ Artymiuk, and P Willett. Protein structures<br />

and information extraction from biological texts: the pasta system.<br />

Bioinformatics, 19(1):135–143, January 2003.<br />

[GDO + 05]<br />

A Golovin, D Dimitropoulos, TJ Oldfield, A Rachedi, and K Henrick.<br />

Msdsite: A database search and retrieval system for the analysis and viewing<br />

<strong>of</strong> bound ligands and <strong>active</strong> <strong>sites</strong>. Proteins: Structure, Function, and<br />

Bioinformatics, 58(1):190–199, 2005.<br />

[GH08]<br />

A Golovin and K Henrick. Msdmotif: exploring protein <strong>sites</strong> and motifs.<br />

BMC Bioinformatics, 9(1), 2008.<br />

[GJYLRS08] S Gaudan, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Combining<br />

evidence, specificity, and proximity towards the normalization <strong>of</strong> gene ontology<br />

terms in text. EURASIP journal on bioinformatics & systems biology,<br />

2008.<br />

[Glu91]<br />

JP Glusker. Structural aspects <strong>of</strong> metal liganding to <strong>functional</strong> groups in<br />

proteins. Advances in Protein Chemistry, 42:1–76, 1991.<br />

[GOC06] GOConsortium. The gene ontology (go) project in 2006. Nucleic Acids<br />

Research, 34(Database issue), January 2006.<br />

[GPP + 03]<br />

F Glaser, T Pupko, I Paz, RE Bell, D Bechor-Shental, E Martz, and N Ben-<br />

Tal.<br />

ConSurf: identification <strong>of</strong> <strong>functional</strong> regions in proteins by surface-mapping<br />

<strong>of</strong> phylogenetic information. Bioinformatics, 19(1):163–164, January<br />

2003.<br />

[Gue96]<br />

F Guenthner. Electronic lexica and corpora research at cis. CIS Bericht-<br />

96-100, 1996.<br />

[HBB + 08]<br />

N Hulo, A Bairoch, V Bulliard, L Cerutti, BA Cuche, E de Castro,<br />

C Lachaize, PS Langendijk-Genevaux, and CJ Sigrist.<br />

The 20 years <strong>of</strong><br />

PROSITE. Nucleic Acids Research, 36:D245–249, Jan 2008.<br />

[HBGK03]<br />

M Hendlich, A Bergner, J Günther, and G Klebe. Relibase: design and<br />

development <strong>of</strong> a database for comprehensive analysis <strong>of</strong> protein-ligand interactions.<br />

Journal <strong>of</strong> Molecular Biology, 326(2):607–620, February 2003.<br />

[HFM + 05]<br />

D Hanisch, K Fundel, HT Mevissen, R Zimmer, and J Fluck. Prominer:<br />

rule-based protein and gene entity recognition. BMC Bioinformatics, 6<br />

Suppl 1, 2005.<br />

[HJ99] C Hadley and DT Jones. A systematic comparison <strong>of</strong> protein structure<br />

classifications: SCOP, CATH and FSSP. Structure, 7:1099–1112, Sep 1999.<br />

[HLC04]<br />

F Horn, AL Lau, and FE Cohen. Automated extraction <strong>of</strong> mutation data<br />

from the literature: application <strong>of</strong> mutext to g protein-coupled receptors and<br />

nuclear hormone receptors. Bioinformatics, 20(4):557–568, March 2004.<br />

[HMBC97]<br />

TJ Hubbard, AG Murzin, SE Brenner, and C Chothia. SCOP: a structural<br />

classification <strong>of</strong> proteins database. Nucleic Acids Research, 25:236–239, Jan<br />

1997.<br />

[HNR + 05]<br />

ZZ Hu, M Narayanaswamy, KE Ravikumar, K Vijay-Shanker, and CH Wu.<br />

Literature mining and database <strong>annotation</strong> <strong>of</strong> protein phosphorylation using<br />

a rule-based system. Bioinformatics, 21(11):2759–2765, June 2005.<br />

[Hob02]<br />

JR Hobbs. Information extraction from biomedical text. Journal <strong>of</strong> Biomedical<br />

Informatics, 35(4):260–264, August 2002.<br />

[HPS + 03]<br />

A Harrison, F Pearl, I Sillitoe, T Slidel, R Mott, JM Thornton, and<br />

CA Orengo. Recognizing the fold <strong>of</strong> a protein structure. Bioinformatics,<br />

19(14):1748–1759, September 2003.<br />

[HS94]<br />

L Holm and C Sander. The fssp database <strong>of</strong> structurally aligned protein<br />

fold families. Nucleic Acids Research, 22(17):3600–3609, September 1994.<br />

[HS96] L Holm and C Sander. Mapping the protein universe. Science,<br />

273(5275):595–603, August 1996.<br />

[HSSS92]<br />

U Hobohm, M Scharf, R Schneider, and C Sander. Selection <strong>of</strong> representative<br />

protein data sets. Protein Science, 1(3):409–417, March 1992.<br />

[HZH + 04]<br />

M Huang, X Zhu, Y Hao, DG Payan, K Qu, and M Li. Discovering patterns<br />

to extract protein-protein interactions from full texts. Bioinformatics,<br />

20(18):3604–3612, December 2004.<br />

[IPGK05]<br />

VA Ivanisenko, SS Pintus, DA Grigorovich, and NA Kolchanov. PDBSite:<br />

a database <strong>of</strong> the 3D structure <strong>of</strong> protein <strong>functional</strong> <strong>sites</strong>. Nucleic Acids<br />

Research, 33:D183–187, Jan 2005.<br />

[JB04]<br />

A Jakulin and I Bratko. Testing the significance <strong>of</strong> attribute interactions.<br />

In ICML, pages 409–416. ACM Press, 2004.<br />

[JGLRS08] S Jaeger, S Gaudan, U Leser, and D Rebholz-Schuhmann. Integrating<br />

protein-protein interactions and text mining for protein function prediction.<br />

BMC Bioinformatics, 9(Suppl 8), 2008.<br />

[JIDG03]<br />

M Jambon, A Imberty, G Deléage, and C Geourjon. A new bioinformatic<br />

approach to detect common 3d <strong>sites</strong> in protein structures. Proteins:<br />

Structure, Function, and Genetics, 52:137–145, 2003.<br />

[JK95]<br />

J Justeson and S Katz. Technical terminology: some linguistic properties<br />

and an algorithm for identification in text. Natural Language Engineering,<br />

pages 9–27, 1995.<br />

[KCRB07]<br />

R Kanagasabai, KH Choo, S Ranganathan, and CJ Baker. A workflow for<br />

mutation extraction and structure <strong>annotation</strong>. Journal <strong>of</strong> Bioinformatics<br />

and Computational Biology, 5(6):1319–1337, December 2007.<br />

[KH04]<br />

E Krissinel and K Henrick. Secondary-structure matching (ssm), a new tool<br />

for fast protein structure alignment in three dimensions. Acta Crystallographica<br />

Section D: Biological Crystallography, 60(1):2256–2268, December<br />

2004.<br />

[KJ94]<br />

GJ Kleywegt and TA Jones. Detection, delineation, measurement and display<br />

<strong>of</strong> cavities in macromolecular structures. Acta Crystallographica Section<br />

D: Biological Crystallography, 50(Pt 2):178–185, March 1994.<br />

[Kle99]<br />

GJ Kleywegt. Recognition <strong>of</strong> spatial motifs in protein structures. Journal<br />

<strong>of</strong> Molecular Biology, 285(4):1887–1897, January 1999.<br />

[KN03]<br />

K Kinoshita and H Nakamura. Identification <strong>of</strong> protein biochemical functions<br />

by similarity search using the molecular surface database ef-site. Protein<br />

Science, 12(8):1589–1595, August 2003.<br />

[KNT05] A Koike, Y Niwa, and T Takagi. <strong>Automatic</strong> extraction <strong>of</strong> gene/protein<br />

biological functions from biomedical text. Bioinformatics, 21(7):1227–1236,<br />

April 2005.<br />

[KON99] T Kawabata, M Ota, and K Nishikawa. The protein mutant database.<br />

Nucleic Acids Research, 27(1):355–357, January 1999.<br />

[Las95]<br />

RA Laskowski. Surfnet: a program for visualizing molecular surfaces, cavities,<br />

and intermolecular interactions. Journal <strong>of</strong> Molecular Graphics, 13(5),<br />

October 1995.<br />

[LC05] G Leroy and H Chen. Genescene: An ontology-enhanced integration <strong>of</strong><br />

linguistic and co-occurrence based relations in biomedical texts: Research<br />

articles. Journal <strong>of</strong> the American Society for Information Science and Technology,<br />

56(5):457–468, March 2005.<br />

[LCM03] G Leroy, H Chen, and JD Martinez. A shallow parser based on closed-class<br />

words to capture relations in biomedical text. Journal <strong>of</strong> Biomedical<br />

Informatics, pages 145–158, June 2003.<br />

[LEW98]<br />

J Liang, H Edelsbrunner, and C Woodward. Anatomy <strong>of</strong> protein pockets<br />

and cavities: measurement <strong>of</strong> binding site geometry and implications for<br />

ligand design. Protein Science, 7(9):1884–1897, September 1998.<br />

[LHC07] LC Lee, F Horn, and FE Cohen. <strong>Automatic</strong> extraction <strong>of</strong> protein point<br />

mutations using a graph bigram association. PLoS Computational Biology,<br />

3(2):e16+, February 2007.<br />

[LRTV07]<br />

Gonzalo Lopez, Ana Rojas, Michael Tress, and Alfonso Valencia. Assessment<br />

<strong>of</strong> predictions submitted for the CASP7 function prediction category.<br />

Proteins, 69 Suppl 8:165–74, 2007.<br />

[LW91] Y Lamdan and HJ Wolfson. On the error analysis of geometric hashing.<br />

Computer Vision and Pattern<br />

Recognition, 1991. Proceedings CVPR ’91., IEEE Computer Society Conference<br />

on, pages 22–27, June 1991.<br />

[Mar05] AC Martin. Mapping pdb chains to uniprotkb entries. Bioinformatics,<br />

21(23):4297–4301, December 2005.<br />

[MB99] Y Matsuo and SH Bryant. Identification <strong>of</strong> homologous core structures.<br />

Proteins, 35:70–79, Apr 1999.<br />

[MG03] J McCallum and S Ganesh. Text mining <strong>of</strong> DNA sequence homology<br />

searches. Applied Bioinformatics, 2:59–63, 2003.<br />

[MR03]<br />

S Mika and B Rost. UniqueProt: Creating representative protein sequence<br />

sets. Nucleic Acids Research, 31:3789–3791, Jul 2003.<br />

[MSD08] MSDmapping. Msdmapping. http://www.ebi.ac.uk/msd-as/<br />

MSDMapping/, November 2008.<br />

[MT05] Y Miyao and J Tsujii. Probabilistic disambiguation models for wide-coverage<br />

hpsg parsing. In ACL ’05: Proceedings <strong>of</strong> the 43rd Annual Meeting<br />

on Association for Computational Linguistics, pages 83–90. Association for<br />

Computational Linguistics, 2005.<br />

[NBD + 06]<br />

J Natarajan, D Berrar, W Dubitzky, C Hack, Y Zhang, C Desesa,<br />

JR Van Brocklyn, and EG Bremer.<br />

Text mining <strong>of</strong> full-text journal articles<br />

combined with gene expression analysis reveals a relationship between<br />

sphingosine-1-phosphate and invasiveness <strong>of</strong> a glioblastoma cell line. BMC<br />

Bioinformatics, 7:373+, August 2006.<br />

[NED03]<br />

S Novichkova, S Egorov, and N Daraselia. Medscan, a natural language<br />

processing engine for medline abstracts. Bioinformatics, 19(13):1699–1706,<br />

September 2003.<br />

[OCR01]<br />

MJ Ondrechen, JG Clifton, and D Ringe. Thematics: A simple computational<br />

predictor <strong>of</strong> enzyme function from structure.<br />

Proceedings <strong>of</strong> the<br />

National Academy <strong>of</strong> Sciences, 98(22):12473–12478, October 2001.<br />

[Old01]<br />

TJ Oldfield. Creating structure features by data mining the PDB to use as<br />

molecular-replacement models. Acta Crystallographica Section D: Biological<br />

Crystallography, 57:1421–1427, Oct 2001.<br />

[Old02]<br />

TJ Oldfield. Data mining the protein data bank: residue interactions. Proteins,<br />

49(4):510–528, December 2002.<br />

[OMJ + 97]<br />

CA Orengo, AD Michie, S Jones, DT Jones, MB Swindells, and JM Thornton.<br />

CATH-a hierarchic classification <strong>of</strong> protein domain structures. Structure,<br />

5:1093–1108, Aug 1997.<br />

[PB06]<br />

BJ Polacco and PC Babbitt. Automated discovery <strong>of</strong> 3d motifs for protein<br />

function <strong>annotation</strong>. Bioinformatics, 22(6):723–730, March 2006.<br />

[PBT04]<br />

CT Porter, GJ Bartlett, and JM Thornton. The Catalytic Site Atlas: a<br />

resource <strong>of</strong> catalytic <strong>sites</strong> and residues identified in enzymes using structural<br />

data. Nucleic Acids Research, 32(Database issue), January 2004.<br />

[PJYLRS08] P Pezik, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Static dictionary<br />

features for term polysemy identification. Building and evaluating<br />

resources for biomedical text mining, LREC Workshop, 2008.<br />

[PKS06] G Pandey, V Kumar, and M Steinbach. Computational approaches for<br />

protein function prediction: A survey. Technical Report 06-028, Department<br />

<strong>of</strong> Computer Science and Engineering, University <strong>of</strong> Minnesota, Twin Cities,<br />

2006.<br />

[Plo08] PloS. Public library <strong>of</strong> science. http://www.plos.org/, November 2008.<br />

[PMC08]<br />

PMC. Pubmed central. http://www.pubmedcentral.nih.gov/, November<br />

2008.<br />

[POHS05]<br />

M Pesu, J O’Shea, L Hennighausen, and O Silvennoinen. Identification <strong>of</strong> an<br />

acquired mutation in Jak2 provides molecular insights into the pathogenesis<br />

<strong>of</strong> myeloproliferative disorders.<br />

Molecular Interventions, 5:211–215, Aug<br />

2005.<br />

[RMK + 07]<br />

ND Rawlings, FR Morton, CY Kok, J Kong, and AJ Barrett. Merops: the<br />

peptidase database.<br />

Nucleic Acids Research, pages gkm954+, November<br />

2007.<br />

[Ros99]<br />

B Rost. Twilight zone <strong>of</strong> protein sequence alignments. Protein Engineering<br />

Design and Selection, 12(2):85–94, February 1999.<br />

[RSAG + 08]<br />

D Rebholz-Schuhmann, M Arregui, S Gaudan, H Kirsch, and A Jimeno<br />

Yepes. Text processing through web services: Calling whatizit. Bioinformatics,<br />

2008.<br />

[RSKA + 07]<br />

D Rebholz-Schuhmann, H Kirsch, M Arregui, S Gaudan, M Riethoven, and<br />

P Stoehr. Ebimed-text crunching to gather facts for proteins from medline.<br />

Bioinformatics, 23(2), January 2007.<br />

[RSMA + 04]<br />

D Rebholz-Schuhmann, S Marcel, S Albert, R Tolle, G Casari, and H Kirsch.<br />

<strong>Automatic</strong> extraction <strong>of</strong> mutations from medline and cross-validation with<br />

omim. Nucleic Acids Research, 2004.<br />

[Rus98] RB Russell. Detection <strong>of</strong> protein three-dimensional side-chain patterns:<br />

new examples <strong>of</strong> convergent evolution.<br />

Journal <strong>of</strong> Molecular Biology,<br />

279(5):1211–1227, June 1998.<br />

[SAR + 07]<br />

B Smith, M Ashburner, C Rosse, K Bard, W Bug, W Ceusters, LJ Goldberg,<br />

K Eilbeck, A Ireland, CJ Mungall, N Leontis, P Rocca-Serra, A Ruttenberg,<br />

SA Sansone, RH Scheuermann, N Shah, PL Whetzel, and S Lewis. The<br />

OBO Foundry: coordinated evolution <strong>of</strong> ontologies to support biomedical<br />

data integration. Nature Biotechnology, 25(11):1251–5, 2007.<br />

[SB05]<br />

A Schutz and P Buitelaar. Relext: A tool for relation extraction from text<br />

in ontology extension. The Semantic Web - ISWC 2005, pages 593–606,<br />

2005.<br />

[SB06]<br />

J Schuman and S Bergler. Postnominal prepositional phrase attachment<br />

in proteomics. In Proceedings <strong>of</strong> the HLT-NAACL BioNLP Workshop on<br />

Linking Natural Language and Biology. Association for Computational Linguistics,<br />

2006.<br />

[SDC06]<br />

A Sidhu, T Dillon, and E Chang. Unification <strong>of</strong> protein data and knowledge<br />

sources. Knowledge-Based Intelligent Information and Engineering Systems,<br />

pages 728–737, 2006.<br />

[Sin04] A Singer. Maximum entropy formulation <strong>of</strong> the Kirkwood superposition<br />

approximation. Journal <strong>of</strong> Chemical Physics, 121:3657–3666, Aug 2004.<br />

[SPIBA03]<br />

PK Shah, C Perez-Iratxeta, P Bork, and MA Andrade. Information extraction<br />

from full text scientific articles: where are the keywords?<br />

BMC<br />

Bioinformatics, 4(1), May 2003.<br />

[SPNW04]<br />

A Shulman-Peleg, R Nussinov, and HJ Wolfson. Recognition <strong>of</strong> <strong>functional</strong><br />

<strong>sites</strong> in protein structures. Journal <strong>of</strong> Molecular Biology, 339(3):607–633,<br />

June 2004.<br />

[SS96] R Schneider and C Sander. The HSSP database <strong>of</strong> protein structure-sequence<br />

alignments. Nucleic Acids Research, 24(1):201–5, 1996.<br />

[SSR03]<br />

A Stark, S Sunyaev, and RB Russell. A model for statistical significance <strong>of</strong><br />

local similarities in structure. Journal <strong>of</strong> Molecular Biology, 326(5):1307–<br />

1316, March 2003.<br />

[STB06]<br />

MH Saier, CV Tran, and RD Barabote. Tcdb: the transporter classification<br />

database for membrane transport protein analyses and information. Nucleic<br />

Acids Research, 34(Database issue), January 2006.<br />

[SWS + 04]<br />

MJ Schuemie, M Weeber, BJ Schijvenaars, EM van Mulligen, CC van der<br />

Eijk, R Jelier, B Mons, and JA Kors. Distribution <strong>of</strong> information in biomedical<br />

abstracts and full-text publications. Bioinformatics, 20(16):2597–2604,<br />

November 2004.<br />

[SYH + 03]<br />

S Saito, H Yamaguchi, Y Higashimoto, C Chao, Y Xu, AJ Fornace, E Appella,<br />

and CW Anderson. Phosphorylation site interdependence <strong>of</strong> human<br />

p53 post-translational modifications in response to stress. Journal <strong>of</strong> Biological<br />

Chemistry, 278:37536–37544, Sep 2003.<br />

[TCS + 07]<br />

RT Tsai, WC Chou, YS Su, YC Lin, CL Sung, HJ Dai, IT Yeh, W Ku,<br />

TY Sung, and WL Hsu.<br />

Biosmile: A semantic role labeling system for<br />

biomedical verbs using a maximum-entropy model with automatically generated<br />

template features. BMC Bioinformatics, 8:325+, September 2007.<br />

[TMA08]<br />

Y Tsuruoka, J Mcnaught, and S Ananiadou. Normalizing biomedical terms<br />

by minimizing ambiguity and variability. BMC Bioinformatics, 9(Suppl 3),<br />

2008.<br />

[TOT04]<br />

Y Tateisi, T Ohta, and J Tsujii. Annotation <strong>of</strong> predicate-argument structure<br />

on molecular biology text. In First International Joint Conference on Natural<br />

Language Processing In the IJCNLP-04 workshop on Beyond Shallow<br />

Analyses, March 2004.<br />

[TW02]<br />

L Tanabe and WJ Wilbur. Tagging gene and protein names in biomedical<br />

text. Bioinformatics, 18(8):1124–1132, August 2002.<br />

[VMMR + 05] S Velankar, P McNeil, V Mittard-Runte, A Suarez, D Barrell, R Apweiler,<br />

and K Henrick.<br />

E-msd: an integrated data resource for bioinformatics.<br />

Nucleic Acids Research, 33(Database issue), January 2005.<br />

[VZHC05] A Via, A Zanzoni, and M Helmer-Citterich. Seq2Struct: a resource for<br />

establishing sequence-structure links. Bioinformatics, 21(4):551–3, 2005.<br />

[WAB + 06]<br />

CH Wu, R Apweiler, A Bairoch, DA Natale, WC Barker, B Boeckmann,<br />

S Ferro, E Gasteiger, H Huang, R Lopez, M Magrane, MJ Martin,<br />

R Mazumder, C O’Donovan, N Redaschi, and B Suzek. The universal<br />

protein resource (uniprot): an expanding universe <strong>of</strong> protein information.<br />

Nucleic Acids Research, 34(Database issue), January 2006.<br />

[WBB + 06]<br />

DL Wheeler, T Barrett, DA Benson, SH Bryant, K Canese, V Chetvernin,<br />

DM Church, M Dicuccio, R Edgar, S Federhen, LY Geer, W Helmberg,<br />

Y Kapustin, DL Kenton, O Khovayko, DJ Lipman, TL Madden, DR Maglott,<br />

J Ostell, KD Pruitt, GD Schuler, LM Schriml, E Sequeira, ST Sherry,<br />

K Sirotkin, A Souvorov, G Starchenko, TO Suzek, R Tatusov, TA Tatusova,<br />

L Wagner, and E Yaschenko.<br />

Database resources <strong>of</strong> the national center<br />

for biotechnology information. Nucleic Acids Research, 34(Database issue),<br />

January 2006.<br />

[WBT97] AC Wallace, N Borkakoti, and JM Thornton. Tess: a geometric hashing<br />

algorithm for deriving 3d coordinate templates for searching structural<br />

databases. application to enzyme <strong>active</strong> <strong>sites</strong>. Protein Science, 6(11):2308–<br />

2323, November 1997.<br />

[WD03]<br />

G Wang and RL Dunbrack. Pisces: a protein sequence culling server. Bioinformatics,<br />

19(12):1589–1591, August 2003.<br />

[WK07]<br />

R Witte and T Kappler. Enhanced semantic access to the protein engineering<br />

literature using ontologies populated by text mining. International<br />

Journal <strong>of</strong> Bioinformatics Research and Applications, 2007.<br />

[WR97] HJ Wolfson and I Rigoutsos. Geometric hashing: an overview. Computational<br />

Science and Engineering, IEEE [see also Computing in Science &<br />

Engineering], 4(4):10–21, 1997.<br />

[WSC04] T Wattarujeekrit, PK Shah, and N Collier. Pasbio: predicate-argument<br />

structures for event extraction in molecular biology. BMC Bioinformatics,<br />

5, October 2004.<br />

[YEC + 07]<br />

S Yoon, JC Ebert, EY Chung, G De Micheli, and RB Altman. Clustering<br />

protein environments for function prediction: finding prosite motifs in 3d.<br />

BMC Bioinformatics, 8 Suppl 4, 2007.<br />

[YHF + 02]<br />

H Yu, V Hatzivassiloglou, C Friedman, A Rzhetsky, and WJ Wilbur. <strong>Automatic</strong><br />

extraction <strong>of</strong> gene and protein synonyms from medline and journal<br />

articles. Proceedings <strong>of</strong> the AMIA Symposium, pages 919–923, 2002.<br />

[YLPV07] YL Yip, N Lachenal, V Pillet, and AL Veuthey. Retrieving mutation-specific<br />

information for human proteins in UniProt/Swiss-Prot Knowledgebase.<br />

Journal <strong>of</strong> Bioinformatics and Computational Biology, 5:1215–1231,<br />

Dec 2007.<br />

[YMTT05] A Yakushiji, Y Miyao, Y Tateisi, and J Tsujii. Biomedical information<br />

extraction with predicate-argument structure patterns. In SMBM, 2005.<br />

Appendix A<br />

Examples <strong>of</strong> errors in relation<br />

extraction.<br />

Table A.1: Examples <strong>of</strong> errors in the relation extraction for the detection <strong>of</strong> contextual features.<br />

Sentence: "This observation provides a rationale for the reduced electron-transfer efficiency displayed by the E92K mutant." (PMID:10089511)<br />
Annotated residue: GLU92<br />
Annotated keywords: reduced electron-transfer efficiency<br />
Annotated PAS: pred = displayed; arg1 = the reduced electron-transfer efficiency; arg2-by = the E92K mutant<br />
TP shallow parsing: pred = displayed; arg1 = a rationale; arg1-for = the reduced electron-transfer efficiency; arg2-by = the GLU92 LYS mutant<br />
FP full parsing: pred = displayed; arg1-by = the GLU92 LYS mutant<br />

Sentence: "An apparent 'acceptor consensus overlap' at Ser474 suggests that the mechanism behind the glycosaminoglycan split <strong>of</strong> TM may involve a competition for substrate between xylosyltransferase and N-acetylgalactosaminyltransferase." (PMID:8216207)<br />
Annotated residue: SER474<br />
Annotated keywords: acceptor consensus overlap<br />
Annotated PAS: pred = suggests; arg1 = An apparent 'acceptor consensus overlap'; arg1-at = SER474; arg2 = the mechanism behind the glycosaminoglycan split; arg2-<strong>of</strong> = TM<br />
FP shallow parsing: pred = suggests; arg1-at = SER474; arg2 = that the mechanism; arg2-behind = the glycosaminoglycan split; arg2-<strong>of</strong> =<br />
TP full parsing: pred = suggests; arg1 = An apparent 'acceptor consensus overlap'; arg1-at = SER474; arg2 = that the mechanism; arg2-behind = the glycosaminoglycan split; arg2-<strong>of</strong> = TM<br />

Sentence: "Using this approach, coupled with Edman degradation <strong>of</strong> the 32PO4-labeled tryptic peptides, and comparison with tryptic peptides analyzed after labeling normal human colonic tissues, we identified ser-52 as the major K18 physiologic phosphorylation site." (PMID:7523419)<br />
Annotated residue: SER52<br />
Annotated keywords: physiologic phosphorylation site<br />
Annotated PAS: pred = identified; arg1 = unk; arg2 = SER52; arg2-as = the major K18 physiologic phosphorylation site<br />
FP shallow parsing: pred = identified; arg2 = SER52; arg2-as = the major<br />
FP full parsing: pred = identified; arg1 = we; arg2 = SER52<br />
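The sentences in Table A.1 mention residues in several surface forms (E92K, Ser474, ser-52), while the annotated residue is always a three-letter code followed by the position (GLU92, SER474, SER52). The following is a minimal sketch of such a normalization, not the method used in this thesis: the function name `normalize_residues`, the regular expressions, and the partial list of full residue names are all assumptions made for illustration.

```python
import re

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}
THREE_LETTER = set(ONE_TO_THREE.values())
# Partial, illustrative list of full residue names (assumption).
FULL_NAME = {"serine": "SER", "threonine": "THR", "cysteine": "CYS"}

# Point mutations such as E92K: wild type, position, mutant.
POINT_MUTATION = re.compile(r"\b([A-Z])(\d+)([A-Z])\b")
# Three-letter mentions such as Ser474, ser-52 or Cys(48).
THREE_LETTER_MENTION = re.compile(r"\b([A-Za-z]{3})[-(]?(\d+)")
# Full-name mentions such as "serine 15".
FULL_NAME_MENTION = re.compile(
    r"\b(" + "|".join(FULL_NAME) + r")\s+(\d+)", re.IGNORECASE
)


def normalize_residues(text):
    """Return residue IDs (e.g. GLU92) in order of first mention."""
    hits = []
    for m in POINT_MUTATION.finditer(text):
        wild, pos, mutant = m.groups()
        if wild in ONE_TO_THREE and mutant in ONE_TO_THREE:
            # The annotated residue is the wild-type residue.
            hits.append((m.start(), ONE_TO_THREE[wild] + pos))
    for m in THREE_LETTER_MENTION.finditer(text):
        code = m.group(1).upper()
        if code in THREE_LETTER:
            hits.append((m.start(), code + m.group(2)))
    for m in FULL_NAME_MENTION.finditer(text):
        hits.append((m.start(), FULL_NAME[m.group(1).lower()] + m.group(2)))
    seen, ordered = set(), []
    for _, rid in sorted(hits):
        if rid not in seen:
            seen.add(rid)
            ordered.append(rid)
    return ordered


print(normalize_residues("displayed by the E92K mutant."))
```

Patterns this simple would over-generate on real abstracts (any three-letter residue code followed directly by a number matches), so a real system would need additional filtering.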

Appendix B<br />

Examples <strong>of</strong> extracted <strong>functional</strong><br />

<strong>annotation</strong>s compared with<br />

UniProtKB<br />

Table B.1: Comparison <strong>of</strong> extracted protein residue <strong>annotation</strong>s from GC with UniProtKB. Mined <strong>functional</strong> <strong>annotation</strong>s are listed as PAS, while relevant information from UniProtKB is reproduced from the feature table (FT) entry line.<br />

RID+UID: SER15 P53_HUMAN<br />
Sentence: "Previous studies have demonstrated that phosphorylation <strong>of</strong> human p53 on serine 15 contributes to protein stabilization after DNA damage and that this is mediated by the ATM family <strong>of</strong> kinases." (PMID:11865061)<br />
UniProtKB/FT: SER15 MOD_RES: Phosphoserine; by PRPK<br />
UniProtKB/FT: SER15 VARIANT: S->R in a sporadic cancer; somatic mutation.<br />
PAS: pred = contributes; arg1 =; arg1-on = SER15; arg2 =; arg2-to = protein stabilization; arg2-after = DNA damage and that<br />

RID+UID: GLU189 CP27B_HUMAN, LEU343 CP27B_HUMAN<br />
Sentence: "The R389G mutant was totally in<strong>active</strong>, but mutant L343F retained 2.3% <strong>of</strong> wild-type activity, and mutant E189G retained 22% <strong>of</strong> wild-type activity." (PMID:12050193)<br />
UniProtKB/FT: GLU189 VARIANT: E->K in VDDR I; 11% <strong>of</strong> wild-type activity.<br />
UniProtKB/FT: LEU343 VARIANT: L->F in VDDR I; 2.3% <strong>of</strong> wild-type activity.<br />
PAS: pred = retained; arg1 = but mutant LEU343 PHE; arg2 = 2.3 %; arg2-<strong>of</strong> = wild-type activity<br />
PAS: pred = retained; arg1 = and mutant GLU189 GLY; arg2 = 22 %; arg2-<strong>of</strong> = wild-type activity<br />

RID+UID: CYS260 TGA1_ARATH, CYS266 TGA1_ARATH<br />
Sentence: "Furthermore, site-directed mutagenesis <strong>of</strong> TGA1 Cys-260 and Cys-266 enables the interaction with NPR1 in yeast and Arabidopsis." (PMID:12953119)<br />
UniProtKB/FT: C260/C266 DISULFID: (potential).<br />
UniProtKB/FT: C260 MUTAGEN: C->N; Gain <strong>of</strong> interaction with NPR1; when associated with S-266.<br />
UniProtKB/FT: C266 MUTAGEN: C->S: Gain <strong>of</strong> interaction with NPR1; when associated with S-260.<br />
PAS: pred = enables; arg1 = site-directed mutagenesis; arg1-<strong>of</strong> = TGA1 CYS260 and CYS266; arg2 = the interaction; arg2-with = NPR1; arg2-in = yeast and Arabidopsis<br />

RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO<br />
Sentence: "Direct in vitro kinase assay using GST-fusion proteins <strong>of</strong> wild-type as well as various mutants <strong>of</strong> p25(rum1) demonstrated that MAPK phosphorylates the N-terminal portion <strong>of</strong> p25(rum1) and residues Thr13 and Ser19 are major phosphorylation <strong>sites</strong> for MAPK." (PMID:12135491)<br />
UniProtKB/FT: THR13 MOD_RES: Phosphothreonine; by MAPK<br />
UniProtKB/FT: SER19 MOD_RES: Phosphoserine; by MAPK<br />
UniProtKB/FT: SER19 MUTAGEN: S->E:reduces activity as a cdc2 inhibitor; when associated with E-13<br />
PAS: pred = are; arg1 = the N-terminal portion; arg1-<strong>of</strong> = p25(rum1) and residues THR13 and SER19; arg2 = major phosphorylation <strong>sites</strong>; arg2-for = MAPK<br />

RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO<br />
Sentence: "Together with the fact that replacement <strong>of</strong> both Thr13 and Ser19 with Glu, which mimics the phosphorylated state <strong>of</strong> these residues, also significantly reduces the activity <strong>of</strong> p25(rum1) as a Cdc2 inhibitor, it was suggested that the phosphorylation <strong>of</strong> Thr13 and Ser19 negatively regulates the function <strong>of</strong> p25(rum1)." (PMID:12135491)<br />
UniProtKB/FT: THR13 N/A<br />
UniProtKB/FT: SER19 N/A<br />
PAS: pred = suggested; arg2 = that the phosphorylation; arg2-<strong>of</strong> = THR13 and SER19<br />
PAS: pred = regulates; arg1 = that the phosphorylation; arg1-<strong>of</strong> = THR13 and SER19; arg2 = the function; arg2-<strong>of</strong> = p25(rum1)<br />

RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO<br />
Sentence: "Further evidence indicates that phosphorylation <strong>of</strong> Thr13 and Ser19 may retain a negative effect on the function <strong>of</strong> p25(rum1) even in vivo." (PMID:12135491)<br />
UniProtKB/FT: THR13 N/A<br />
UniProtKB/FT: SER19 N/A<br />
PAS: pred = retain; arg1 = that; arg1-<strong>of</strong> = THR13 and SER19; arg2 = a negative effect; arg2-on = the function; arg2-<strong>of</strong> = p25(rum1)<br />

RID+UID: GLU55 DHMA_MYCAV, ASP123 DHMA_MYCAV, TRP124 DHMA_MYCAV<br />
Sentence: "Many residues essential for the dehalogenation reaction are conserved in DhmA; the putative catalytic triad consists <strong>of</strong> Asp123, His279, and Asp250, and the putative oxyanion hole consists <strong>of</strong> Glu55 and Trp124." (PMID:12147465)<br />
UniProtKB/FT: GLU55 N/A<br />
UniProtKB/FT: ASP123 ACT_SITE: Nucleophile (by similarity).<br />
UniProtKB/FT: TRP124 N/A<br />
PAS: pred = consists; arg1 = the putative catalytic triad; arg2 =; arg2-<strong>of</strong> = ASP123<br />
PAS: pred = consists; arg1 = and the putative oxyanion hole; arg2 =; arg2-<strong>of</strong> = GLU55 and TRP124<br />

RID+UID: CYS48 THIO_RAT, CYS152 THIO_RAT, CYS73 THIO_RAT<br />
Sentence: "Thus, PrxV mutants lacking Cys(48) or Cys(152) showed no detectable thioredoxin-dependent peroxidase activity, whereas mutation <strong>of</strong> Cys(73) had no effect on activity." (PMID:10751410)<br />
UniProtKB/FT: N/A<br />
PAS: pred = showed; arg1 = CYS48 or CYS152; arg2 = no detectable thioredoxin-dependent peroxidase activity<br />
PAS: pred = had; arg1 = whereas mutation; arg1-<strong>of</strong> = CYS73; arg2 = no effect on activity<br />

RID+UID: GLY43 PPCS_HUMAN<br />
Sentence: "Highly conserved ATP binding residues include Gly43, Ser61, Gly63, Gly66, Phe230, and Asn258." (PMID:12906824)<br />
UniProtKB/FT: N/A<br />
PAS: pred = include; arg1 = conserved ATP binding residues; arg2 = GLY43<br />

RID+UID: ASN59 PPCS_HUMAN<br />
Sentence: "Highly conserved phosphopantothenate binding residues include Asn59, Ala179, Ala180, and Asp183 from one monomer and Arg55' from the adjacent monomer." (PMID:12906824)<br />
UniProtKB/FT: N/A<br />
PAS: pred = include; arg1 = conserved phosphopantothenate binding residues; arg2 = ASN59<br />

RID+UID: GLU50 SHD_HUMAN, GLU51 SHD_HUMAN<br />
Sentence: "Rab3A binding-defective mutants <strong>of</strong> rabphilin (E50A) and Noc2 (E51A) were still localized in the distal portion <strong>of</strong> the neurites (where dense-core vesicles had accumulated) in nerve growth factor-differentiated PC12 cells, the same as the wild-type proteins, whereas Rab27A binding-defective mutants <strong>of</strong> rabphilin (E50A/I54A) and Noc2 (E51A/I55A) were present throughout the cytosol." (PMID:14722103)<br />
UniProtKB/FT: N/A<br />
PAS: pred = localized; arg1 = Rab3A binding-defective mutants; arg1-<strong>of</strong> = rabphilin ( GLU50 ALA ) and Noc2 ( GLU51 ALA ); arg2 =; arg2-in = the distal portion; arg2-<strong>of</strong> = the neurites ( where dense-core vesicles<br />

RID+UID: TRP124 DHMA_MYCAV<br />
Sentence: "Trp124 should be involved in substrate binding and product (halide) stabilization, while the second halide-stabilizing residue cannot be identified from a comparison <strong>of</strong> the DhmA sequence with the sequences <strong>of</strong> three dehalogenases with known tertiary structures." (PMID:12147465)<br />
UniProtKB/FT: N/A<br />
PAS: pred = involved; arg1 = TRP124; arg2 =; arg2-in = substrate binding and product (halide) stabilization<br />



Appendix C<br />

Examples <strong>of</strong> extracted <strong>functional</strong><br />

<strong>annotation</strong>s for the protein p53<br />



Table C.1: Examples <strong>of</strong> literature-mined <strong>annotation</strong>s <strong>of</strong> protein residues in<br />

p53. The listed data are grouped by topic.<br />

regulatory PTM<br />

RID+UID<br />

PMID 10930428<br />

PAS<br />

RID+UID<br />

SER6 P53 HUMAN<br />

pred = creased<br />

arg1 = a background<br />

arg1-<strong>of</strong> = constitutive phosphorylation<br />

arg1-at = SER6 that<br />

arg2 = 10-fold<br />

arg2-upon = upon exposure<br />

arg2-to = either ionizing radiation or UV light<br />

pred = exhibited<br />

arg1 = Untreated A549 cells<br />

arg2 = a background<br />

arg2-<strong>of</strong> = constitutive phosphorylation<br />

arg2-at = SER6 that<br />

pred = is<br />

arg1 = The relative phosphorylation<br />

arg1-<strong>of</strong> = THR18<br />

arg1-by = VRK2B<br />

arg2 = similar<br />

arg2-in = magnitude<br />

arg2-to = that induced<br />

arg2-by = taxol<br />

PMID 12487430<br />

PAS<br />

RID+UID<br />

THR18 P53 HUMAN<br />

pred = compared<br />

arg1 = that phosphorylation<br />

arg1-at = THR18 decreased binding<br />

arg1-to = recombinant Mdm2 protein<br />

arg2 =<br />

arg2-with = the unphosphorylated and the two other single phosphorylated analogues<br />

PMID 11030628<br />

PAS<br />

RID+UID<br />

SER46 P53 HUMAN<br />

pred = regulates<br />

arg1 = and phosphorylation<br />

arg1-<strong>of</strong> = SER46<br />

arg2 = the transcriptional activation<br />

arg2-<strong>of</strong> = this apoptosis-inducing gene<br />

PMID 11875057<br />

PAS<br />

RID+UID<br />

SER46 P53 HUMAN<br />

pred = hibited<br />

arg1 = IR-induced phosphorylation<br />

arg1-at = SER46<br />

arg2 =<br />

arg2-by = wortmannin<br />

PMID 14757188<br />

PAS<br />

SER15 P53 HUMAN<br />

pred = duce<br />

arg1 =<br />

arg1-in = synergy<br />

arg2 = ATM-mediated phosphorylation<br />

arg2-<strong>of</strong> = the SER15 site<br />



. . . continuation <strong>of</strong> table C.1<br />

RID+UID<br />

arg2-<strong>of</strong> =<br />

PMID 17292432<br />

PAS<br />

RID+UID<br />

SER15 P53 HUMAN<br />

pred = suppressed<br />

arg2 = both NaVO(3)-induced SER15 phosphorylation and accumulation<br />

arg2-<strong>of</strong> =<br />

PMID 11850826<br />

PAS<br />

RID+UID<br />

SER15 P53 HUMAN<br />

pred = observed<br />

arg1 = Increased phosphorylation<br />

arg1-<strong>of</strong> = SER15<br />

arg2 =<br />

arg2-in = heat shocked GM638<br />

PMID 10933801<br />

PAS<br />

RID+UID<br />

THR55 P53 HUMAN<br />

pred = define<br />

arg1 = These data<br />

arg2 = THR55<br />

arg2-as = a novel phosphorylation site and<br />

arg2-for = the first time show threonine phosphorylation<br />

arg2-<strong>of</strong> = human<br />

PMID 15116093<br />

PAS<br />

RID+UID<br />

PMID 9246643<br />

PAS<br />

THR55 P53 HUMAN<br />

pred = clarify<br />

arg1 = This study<br />

arg2 = the biological significance<br />

arg2-<strong>of</strong> = doxorubicin-induced THR55 phosphorylation<br />

pred = reduced<br />

arg1 = phosphorylation<br />

arg1-at = SER15<br />

arg2 = and phosphorylation<br />

arg2-at = SER392<br />

SER315 P53 HUMAN<br />

pred = reversed<br />

arg1 = but SER315<br />

arg2 = the effect<br />

arg2-<strong>of</strong> = phosphorylation<br />

arg2-at = SER392<br />

RID+UID<br />

PMID 7926727<br />

PAS<br />

RID+UID<br />

PHE19 P53 HUMAN<br />

pred = are<br />

arg1 = PHE19<br />

arg2 = crucial<br />

arg2-for = the interactions<br />

arg2-between =<br />

SER20 P53 HUMAN<br />

binding activity<br />



. . . continuation <strong>of</strong> table C.1<br />

PMID 11323395<br />

PAS<br />

RID+UID<br />

pred = play<br />

arg1 =<br />

arg1-<strong>of</strong> = SER20<br />

arg2 = a key role<br />

arg2-in = the dissociation<br />

arg2-<strong>of</strong> = mdm2<br />

arg2-in = response<br />

arg2-to = Cr(VI)<br />

PMID 17914575<br />

PAS<br />

RID+UID<br />

CYS135 P53 HUMAN<br />

pred = generates<br />

arg1 = that the amino acid change CYS135˜ARG<br />

arg1-in = the human TP53<br />

arg2 = the loss<br />

arg2-<strong>of</strong> = TP53 DNA-binding activity<br />

PMID 16784539<br />

PAS<br />

SER315 P53 HUMAN<br />

pred = dephosphorylates<br />

arg1 = both<br />

arg1-in = vitro and<br />

arg1-in = vivo and<br />

arg2 = the SER315 site<br />

arg2-<strong>of</strong> =<br />

RID+UID<br />

PMID 10432310<br />

PAS<br />

RID+UID<br />

SER20 P53 HUMAN<br />

protein-protein-interaction<br />

pred = containing<br />

arg2 = phosphate<br />

arg2-at = SER20 inhibited DO-1 binding<br />

PMID 11960368<br />

PAS<br />

RID+UID<br />

SER166 P53 HUMAN<br />

pred = mutated<br />

arg1 = analysis<br />

arg1-<strong>of</strong> = HDM2 proteins<br />

arg2 =<br />

arg2-at = the consensus Akt recognition <strong>sites</strong><br />

arg2-at = SER166<br />

PMID 11172034<br />

PAS<br />

RID+UID<br />

PMID 7624134<br />

PAS<br />

ARG175 P53 HUMAN<br />

pred = abolish<br />

arg1 = mutations ARG175˜HIS or ARG248˜TRP<br />

arg2 = the association<br />

arg2-<strong>of</strong> =<br />

SER315 P53 HUMAN<br />

pred = abolished<br />

arg1 =<br />

arg1-to = alanine ( p53- SER315˜ALA )<br />



. . . continuation <strong>of</strong> table C.1<br />

arg2 = phosphorylation<br />

arg2-by = cdk2 kinase<br />

RID+UID<br />

PMID 7624134<br />

PAS<br />

RID+UID<br />

SER315 P53 HUMAN<br />

pred = required<br />

arg1 = SER315<br />

arg1-<strong>of</strong> = wtp53<br />

arg2 =<br />

arg2-for = transcriptional activity<br />

arg2-in = vivo<br />

PMID 16818505<br />

PAS<br />

RID+UID<br />

CYS238 P53 HUMAN<br />

pred = retains<br />

arg1 = ( CYS238˜TYR ) mutant<br />

arg2 = <strong>functional</strong> wild-type<br />

PMID 16707427<br />

PAS<br />

ARG175 P53 HUMAN<br />

biological activity<br />

pred = displayed<br />

arg1 = the ARG175˜LEU mutant<br />

arg2 = an attenuated tumor suppressor activity<br />

arg2-in = the regulation<br />

arg2-<strong>of</strong> = transcription<br />

RID+UID<br />

PMID 10616523<br />

PAS<br />

RID+UID<br />

ARG72 P53 HUMAN<br />

disease<br />

pred = suggests<br />

arg1 = The acquisition<br />

arg1-<strong>of</strong> = both mutations ( GLY245˜VAL and ARG72˜PRO )<br />

arg1-in = the transformation<br />

arg1-from = transient leukemia<br />

arg1-to = overt acute megakaryoblastic leukemia<br />

arg2 = a <strong>functional</strong> role<br />

arg2-<strong>of</strong> = mutant<br />

PMID 18181044<br />

PAS<br />

ARG72 P53 HUMAN<br />

pred = sociated<br />

arg1 = the development<br />

arg1-<strong>of</strong> = lung carcinoma and that ARG72˜PRO genotype<br />

arg2 =<br />

arg2-with = a poorer prognosis<br />

arg2-<strong>of</strong> = lung cancer<br />



. . . continuation <strong>of</strong> table C.1<br />

RID+UID<br />

PMID 7761089<br />

PAS<br />

RID+UID<br />

VAL138 P53 HUMAN<br />

molecular stability<br />

pred = showed<br />

arg1 = The human VAL138 mutant<br />

arg2 = temperature-sensitive transformation<br />

arg2-<strong>of</strong> = rat embryo fibroblasts ( REFs )<br />

arg2-in = collaboration assay<br />

arg2-with = activated<br />

PMID 15703170<br />

PAS<br />

ARG249 P53 HUMAN<br />

pred = duce<br />

arg1 = oncogenic mutations HIS168˜ARG and ARG249˜SER<br />

arg2 = substantial structural perturbation<br />

arg2-around = the mutation site<br />

arg2-in = the L2 and L3 loops<br />
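The PAS rows in these tables follow a simple `label = value` layout: a `pred` line opens a new structure, and `arg1`/`arg2` lines (optionally suffixed with a preposition, as in `arg2-in`) attach to it. The Python sketch below shows one way such rows could be read back into records. It is not the extraction system described in this thesis; the names `PAS` and `parse_pas` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PAS:
    """Hypothetical container for one predicate-argument structure."""
    pred: str = ""
    # (label, text) pairs kept in table order, because a label such as
    # "arg1-of" can occur more than once within the same structure.
    args: list = field(default_factory=list)


def parse_pas(lines):
    """Group 'label = value' rows into PAS records; 'pred' starts a new one."""
    records = []
    current = None
    for raw in lines:
        if "=" not in raw:
            continue  # skip column labels such as 'PAS' or 'RID+UID'
        key, _, value = raw.partition("=")
        key, value = key.strip(), value.strip()
        if key == "pred":
            current = PAS(pred=value)
            records.append(current)
        elif current is not None and key.startswith("arg"):
            current.args.append((key, value))
    return records


# Rows copied from the SER315 entry (PMID 7624134) in Table C.1.
rows = [
    "PAS",
    "pred = required",
    "arg1 = SER315",
    "arg1-of = wtp53",
    "arg2 =",
    "arg2-for = transcriptional activity",
    "arg2-in = vivo",
]
pas = parse_pas(rows)[0]
```

Keeping the arguments as an ordered list rather than a dictionary is a deliberate choice in this sketch, since several entries repeat a label (for example two `arg1-of` rows in a row).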



Appendix D<br />

Examples <strong>of</strong> extracted <strong>functional</strong><br />

<strong>annotation</strong>s for the protein Jak2<br />



Table D.1: Examples <strong>of</strong> literature-mined <strong>annotation</strong>s <strong>of</strong> protein residues in<br />

Jak2. The listed data are grouped by topic.<br />

disease<br />

PMID 16896569<br />

RID+UID<br />

VAL617 JAK2 HUMAN<br />

pred = improved<br />

arg1 = The improved knowledge<br />

arg1-<strong>of</strong> = the molecular basis<br />

arg1-<strong>of</strong> = the disease because<br />

arg1-<strong>of</strong> = the discovery<br />

arg1-<strong>of</strong> = the VAL617˜PHE mutation<br />

arg1-in = the JAK2 gene<br />

arg2 = the molecular diagnosis and<br />

PMID 16503548<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = is<br />

arg1 = that the JAK2 VAL617˜PHE mutation<br />

arg2 = rare<br />

arg2-in = patients<br />

arg2-with = idiopathic erythrocytosis<br />

PMID 16247455<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = reported<br />

arg1 = A missense somatic mutation<br />

arg1-in = JAK2 gene ( JAK2 VAL617˜PHE )<br />

arg2 =<br />

arg2-in = chronic myeloproliferative disorders<br />

PMID 18024388<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = is<br />

arg1 = The JAK2 VAL617˜PHE point mutation<br />

arg2 = rare<br />

arg2-in = hypereosinophilic syndrome and/or chronic eosinophilic leukemia<br />

PMID 15858187<br />

genetic<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = had<br />

arg1 = All 51 patients<br />

arg1-with = 9pLOH<br />

arg2 = the VAL617˜PHE mutation<br />

pred = is<br />

arg1 = VAL617˜PHE<br />

arg2 = a somatic mutation present<br />

arg2-in = hematopoietic cells<br />

molecular function<br />



. . . continuation <strong>of</strong> table D.1<br />

PMID 15970705<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = sociated<br />

arg1 = JAK2 ( VAL617˜PHE )<br />

arg2 =<br />

arg2-with = constitutive phosphorylation<br />

arg2-<strong>of</strong> = JAK2 and its downstream effectors<br />

arg2-as =<br />

PMID 16239216<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = duces<br />

arg1 = that the homologous VAL617˜PHE mutation<br />

arg2 = activation<br />

arg2-<strong>of</strong> = JAK1 and Tyk2<br />

PMID 16384930<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = link<br />

arg1 = the presence<br />

arg1-in = PV erythroblasts<br />

arg1-<strong>of</strong> = proliferative and antiapoptotic signals that<br />

arg2 = the JAK2 VAL617˜PHE mutation<br />

arg2-with = the inhibition<br />

arg2-<strong>of</strong> = death receptor signaling<br />

PMID 16442619<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = does<br />

arg1 = crease<br />

arg1-<strong>of</strong> = expression and kinase activity<br />

arg1-<strong>of</strong> = JAK2<br />

arg1-in = CML cells<br />

arg2 = result<br />

arg2-from = the JAK2 VAL617˜PHE activation mutation and that transformation<br />

arg2-into = to blast crisis<br />

PMID 16461300<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = sociated<br />

arg1 = the presence<br />

arg1-<strong>of</strong> = the JAK2 VAL617˜PHE mutation<br />

arg2 =<br />

arg2-with = higher platelet activation<br />

PMID 16904848<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = transmit<br />

arg1 = that JAK2 VAL617˜PHE<br />

arg2 = signals<br />

arg2-from = ligand-activated TpoR or EpoR<br />

PMID 15863514<br />

RID+UID<br />

PAS<br />

VAL617 JAK2 HUMAN<br />

pred = changes<br />

arg2 = conserved VAL617˜PHE<br />

arg2-in = the pseudokinase domain<br />

arg2-<strong>of</strong> = JAK2 that<br />



Appendix E<br />

Examples <strong>of</strong> extracted <strong>functional</strong><br />

<strong>annotation</strong>s <strong>of</strong> the category binding<br />

event<br />



Table E.1: Mined <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> protein residues with information<br />

on binding events. The mined information corresponds to 17 protein residues<br />

listed in MSDsite. The extracted information can be used for <strong>functional</strong> <strong>annotation</strong><br />

and validation <strong>of</strong> <strong>predicted</strong> binding sites in the database.<br />

Each entry lists the RID+UID, the source sentence, and the extracted PAS.<br />

T199 CAH2 HUMAN<br />

”The three-dimensional structures <strong>of</strong> azide-bound and sulfate-bound T199V CAIIs were determined<br />

by x-ray crystallographic methods at 2.25 and 2.4 A, respectively (final crystallographic<br />

R factors are 0.173 and 0.174, respectively).” (PMID:8262987)<br />

pred = determined<br />

arg1 = The three-dimensional structures<br />

arg1-<strong>of</strong> = [azide-bound and sulfate-bound THR199 VAL CAIIs]/BINDING<br />

arg2 =<br />

arg2-by = x-ray crystallographic methods<br />

arg2-at = at 2.25 and 2.4 A ,respectively ( final crystallographic<br />

R55 PPIA HUMAN<br />

”On the basis <strong>of</strong> the structure, it is proposed that Arg55 hydrogen-bonds to the nitrogen<br />

to deconjugate the resonance <strong>of</strong> the prolyl amide bond and thus facilitates the cis-trans<br />

rotation.” (PMID:8652511)<br />

pred = proposed<br />

arg2 = [that ARG55 hydrogen-bonds]/BINDING<br />

arg2-to = the nitrogen<br />

pred = deconjugate<br />

arg1 = [that ARG55 hydrogen-bonds]/BINDING<br />

arg1-to = the nitrogen<br />

arg2 = the resonance<br />

arg2-<strong>of</strong> = the prolyl amide bond and<br />

L255 PH4H HUMAN<br />

”Only for the R252Q and L255V mutants were catalytically <strong>active</strong> tetramer and dimer recovered<br />

and for R252G some dimer, i.e. 20% (R252Q, tetramer), 44% (L255V, tetramer)<br />

and 4.4% (R252G, dimer) <strong>of</strong> the activity for the respective wild-type (wt) forms.”<br />

(PMID:9799096)<br />

pred = recovered<br />

arg1 = <strong>active</strong> tetramer and dimer<br />

arg2 = and<br />

arg2-for = [ARG252 GLY some dimer]/BINDING<br />

Y156 HGXR TRIFO<br />

”But the forces involved in recognizing the exocyclic C2-substituents <strong>of</strong> the purine ring, which<br />

involve the Tyr156 hydroxyl, Ile157 backbone carbonyl, and Asp163 side-chain carboxyl, may<br />

be weakened by the shifted conformation <strong>of</strong> the peptide backbone resulted from loss <strong>of</strong> the<br />

Glu11-Arg155 salt bridge.” (PMID:9843428)<br />

pred = resulted<br />

arg1 =<br />

arg1-by = the shifted conformation<br />

arg1-<strong>of</strong> = the peptide backbone<br />

arg2 =<br />

arg2-from = loss<br />

arg2-<strong>of</strong> = [the GLU11 ARG155 salt bridge]/BINDING<br />

K79 HGXR TOXGO<br />

”The Leu78-Lys79 peptide bond in the <strong>active</strong> site adopts the cis configuration, which it must<br />

to bind PRPP or pyrophosphate.” (PMID:10545171)<br />

pred = adopts<br />

arg1 = [The LEU78 LYS79 peptide bond]/BINDING<br />

arg1-in = the <strong>active</strong> site<br />

arg2 = the<br />

G57 FLAV CLOBE<br />



. . . continuation <strong>of</strong> table E.1<br />


”In the Clostridium beijerinckii flavodoxin, the reduction <strong>of</strong> the flavin mononucleotide (FMN)<br />

c<strong>of</strong>actor is accompanied by a local conformation change in which the Gly57-Asp58 peptide<br />

bond ”flips” from primarily the unusual cis O-down conformation in the oxidized state to<br />

the trans O-up conformation such that a new hydrogen bond can be formed between the<br />

carbonyl group <strong>of</strong> Gly57 and the proton on N(5) <strong>of</strong> the neutral FMN semiquinone radical<br />

[Ludwig, M. L., Pattridge, K. A., Metzger, A. L., Dixon, M. M., Eren, M., Feng, Y., and<br />

Swenson, R. P. (1997) Biochemistry 36, 1259-1280].” (PMID:10353827)<br />

pred = accompanied<br />

arg1 = ) c<strong>of</strong>actor<br />

arg2 =<br />

arg2-by = a local conformation change<br />

arg2-in = [which the GLY57 ASP58 peptide bond]/BINDING<br />

D160 APX STRGR; M161 APX STRGR; G201 APX STRGR; R202 APX STRGR; F219<br />

APX STRGR<br />

”These studies allowed the tracing <strong>of</strong> the previously disordered region <strong>of</strong> the enzyme (Glu196-<br />

Arg202) and the identification <strong>of</strong> some <strong>of</strong> the <strong>functional</strong> groups <strong>of</strong> the enzyme that are<br />

involved in enzyme-substrate interactions (Asp160, Met161, Gly201, Arg202 and Phe219).”<br />

(PMID:10771423)<br />

pred = involved<br />

arg1 = disordered region<br />

arg1-<strong>of</strong> = the enzyme ( GLU196 ARG202 ) and the identification<br />

arg1-<strong>of</strong> = some<br />

arg1-<strong>of</strong> = the <strong>functional</strong> groups<br />

arg1-<strong>of</strong> = the enzyme that<br />

arg2 =<br />

arg2-in = [enzyme-substrate interactions ( ASP160, MET161, GLY201, ARG202,<br />

PHE219)]/BINDING<br />

I209 FIXL RHIME<br />

”Interaction between the iron-bound O(2) and Ile209 was also observed in the resonance<br />

Raman spectra <strong>of</strong> RmFixLH as evidenced by the fact that the Fe-O(2) and Fe-CN stretching<br />

frequencies were shifted from 575 to 570 cm(-1) (Fe-O(2)), and 504 to 499 cm(-1), respectively,<br />

as the result <strong>of</strong> the replacement <strong>of</strong> Ile209 with an Ala residue.” (PMID:10926518)<br />

pred = observed<br />

arg1 = Interaction<br />

arg1-between = [the iron-bound O(2) and ILE209]/BINDING<br />

arg2 =<br />

arg2-in = the resonance Raman spectra<br />

arg2-<strong>of</strong> = RmFixLH as<br />



Appendix F<br />

Examples <strong>of</strong> extracted <strong>functional</strong><br />

<strong>annotation</strong>s <strong>of</strong> <strong>active</strong> site residues<br />



Table F.1: Catalytic triad residues identified by MEDLINE extraction. The<br />

listed sentences describe the mentioned protein residues as catalytic (co-mention<br />

with the term ”catalytic triad”). None <strong>of</strong> them is recorded<br />

in CSA, so the identified information is novel.<br />

Each entry lists the RID+UID, the source sentence, and the extracted PAS.<br />

D44 TPP2 HUMAN, H264 TPP2 HUMAN, S449 EPHA3 HUMAN<br />

”The amino acids forming the putative catalytic triad (Asp-44, His-264, Ser-449) as well as<br />

the conserved Asn-362, potentially stabilizing the transition state, were replaced by alanine<br />

and the mutated cDNAs were transfected into human embryonic kidney (HEK) 293 cells.”<br />

(PMID:12445476)<br />

pred = forming<br />

arg1 = The amino acids<br />

arg2 = [the putative catalytic triad ( ASP44, HIS264, SER449)]/ENZ ACT<br />

C25 CYSP1 CARCN, H159 CYSP1 CARCN, D175 CYSP1 CARCN<br />

”The seven cysteine residues are aligned with those <strong>of</strong> papain and the catalytic triad<br />

(Cys25, His159, Asn175) <strong>of</strong> all cysteine peptidases <strong>of</strong> the papain family is conserved.”<br />

(PMID:10355634)<br />

pred = aligned<br />

arg1 = The seven CYS+<br />

arg2 =<br />

arg2-with = with those<br />

arg2-<strong>of</strong> = <strong>of</strong> papain and the catalytic triad ( CYS25<br />

C176 NADE MYCTU, E52 NADE MYCTU, K121 NADE MYCTU<br />

”The residues forming the putative catalytic triad (Cys176, Glu52 and Lys121) were replaced<br />

by alanine; the mutated enzymes were expressed in the Escherichia coli Origami (DE3) strain<br />

and purified.” (PMID:15748981)<br />

pred = forming<br />

arg1 = The residues<br />

arg2 = [the putative catalytic triad ( CYS176, GLU52, and LYS121)]/ENZ ACT<br />

S1752 POLG BVDVS<br />

”Our study provides experimental evidence that histidine at position 1658 and aspartic acid<br />

at position 1686 constitute together with the previously identified serine at position 1752<br />

(S1752) the catalytic triad <strong>of</strong> the pestiviral NS3 serine protease.” (PMID:10915606)<br />

pred = identified<br />

arg1 =<br />

arg1-with = the<br />

arg2 = [SER1752 ( S1752 ) the catalytic triad]/ENZ ACT<br />

arg2-<strong>of</strong> = the pestiviral NS3 serine protease.<br />

D167 POLS SFV, H145 POLS SFV, S219 POLS SFV<br />

”After this autoproteolytic cleavage, the free carboxylic group <strong>of</strong> Trp267 interacts with the<br />

catalytic triad (His145, Asp167 and Ser219) and inactivates the enzyme.” (PMID:18177892)<br />

pred = interacts<br />

arg1 = the free carboxylic group<br />

arg1-<strong>of</strong> = TRP267<br />

arg2 =<br />

arg2-with = [the catalytic triad ( HIS145, ASP167, and SER219)]/ENZ ACT<br />

D122 ARY2 RAT<br />

”Substitution <strong>of</strong> the catalytic triad Asp-122 with either alanine or asparagine resulted in the<br />

complete loss <strong>of</strong> protein structural integrity and catalytic activity.” (PMID:15209520)<br />

pred = resulted<br />

arg1 = Substitution<br />

arg1-<strong>of</strong> = the catalytic triad ASP122<br />

arg1-with = either alanine or asparagine<br />

arg2 =<br />

arg2-in = the complete loss<br />

arg2-<strong>of</strong> = [protein structural integrity and catalytic activity]/ENZ ACT<br />



. . . continuation <strong>of</strong> table F.1<br />


D156 LYPA1 HUMAN<br />

”To investigate whether this bridging function occurs in vivo, two transgenic mouse lines<br />

were established expressing a muscle creatine kinase promoter-driven human LPL (hLPL)<br />

minigene mutated in the catalytic triad (Asp156 to Asn).” (PMID:9811888)<br />

pred = mutated<br />

arg1 = ( hLPL ) minigene<br />

arg2 =<br />

arg2-in = [the catalytic triad (ASP156 ASN)]/ENZ ACT<br />
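The selection criterion for Table F.1 is sentence-level co-mention of a residue with the term "catalytic triad". A minimal Python sketch of that idea is given below, assuming residues are written with three-letter names as in the listed sentences (e.g. "Asp-44", "Cys176"). The pattern `RESIDUE` and the function `catalytic_triad_residues` are hypothetical names, and the regular expression is a simplification rather than the residue recogniser used in this work.

```python
import re

# Three-letter names of the 20 standard amino acids.
THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)
# Matches mentions such as "Asp-44", "Cys176" or "Cys(48)".
RESIDUE = re.compile(r"\b(" + THREE_LETTER + r")[-(]?(\d+)\)?")


def catalytic_triad_residues(sentence):
    """Return normalised residue ids (e.g. 'CYS176') for sentences that
    mention 'catalytic triad'; return an empty list otherwise."""
    if "catalytic triad" not in sentence.lower():
        return []
    return [name.upper() + pos for name, pos in RESIDUE.findall(sentence)]


sentence = (
    "The residues forming the putative catalytic triad (Cys176, Glu52 and "
    "Lys121) were replaced by alanine; the mutated enzymes were expressed "
    "in the Escherichia coli Origami (DE3) strain and purified."
)
hits = catalytic_triad_residues(sentence)
```

Because the check works on the whole sentence, every residue in a matching sentence is returned, including ones outside the triad (such as Trp267 in the PMID:18177892 example), so the output would still need filtering or manual review.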



Appendix G<br />

Glossary<br />

3D pattern – a recurrent residue triplet configuration (with k=2 or k=3 interaction <strong>of</strong> residues) within a dataset <strong>of</strong> protein<br />

structures.<br />

arg – the argument <strong>of</strong> a PAS<br />

BIND – the set <strong>of</strong> binding-related <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> extracted protein residues, i.e. <strong>annotation</strong>s are labelled as<br />

BINDING.<br />

BINDING – a category in MAN, describing binding events <strong>of</strong> a protein residue.<br />

CSA – a database <strong>of</strong> manually curated <strong>active</strong> <strong>sites</strong> with structure templates derived from PDB.<br />

Contextual feature .<br />

EC – Enzyme classification identifier.<br />

ER – entity recognition.<br />

ENZ – the set <strong>of</strong> enzyme-related <strong>functional</strong> <strong>annotation</strong>s <strong>of</strong> extracted protein residues, i.e. <strong>annotation</strong>s are labelled as<br />

ENZ ACT.<br />

ENZ ACT – a category in MAN, describing enzyme-related information.<br />

FA – a <strong>functional</strong> <strong>annotation</strong>; or the set <strong>of</strong> extracted protein residues with <strong>functional</strong> <strong>annotation</strong>s.<br />

FEAT – a categorisation scheme based on UniProtKB.<br />

FN – a false negative.<br />

FP – a false positive.<br />

FT – a record in a UniProtKB data file with <strong>functional</strong> <strong>annotation</strong>.<br />

Functional <strong>annotation</strong> – Information on biological function assigned to a protein residue.<br />

GC – a manually annotated test set with abstract texts drawn from a random selection <strong>of</strong> UniProtKB citations.<br />

GO – Gene Ontology.<br />

MAN – a categorisation scheme based on manual analysis on MEDLINE.<br />

MEDLINE – a database <strong>of</strong> citations and abstract texts from biomedical publications.<br />

NP – a noun phrase is defined as a nominal sequence.<br />

OLDFIELD – a non-redundant structure dataset <strong>of</strong> protein domains selected from PDB by sequence alignments.<br />

OPR – a semantic relation between a residue, its source protein, and hosting organism; or the set <strong>of</strong> mined protein residues.<br />



PAS – a data structure to accommodate the semantic relation between a predicate and its arguments.<br />

PDBID – PDB identifier.<br />

PDB – the primary database <strong>of</strong> protein structure with spatial coordinates.<br />

PMID – a PubMed identifier.<br />

POS – a class <strong>of</strong> words, e.g. noun, verb, adjective, used for linguistic analysis.<br />

PP – a prepositional phrase is defined as preposition + noun phrase.<br />

pred – the predicate <strong>of</strong> a PAS.<br />

Protein residue – a residue with known association to its source protein within a hosting organism (OPR).<br />

RE – Relation extraction.<br />

RID – a Residue identifier: residue name + residue position in the protein sequence.<br />

SCOP40 – a non-redundant protein structure dataset derived from SCOP.<br />

SCOP – a derived protein structure database with manual classification <strong>of</strong> proteins based on structure similarities.<br />

SITE – a record in the PDB data file denoting residues <strong>of</strong> a <strong>functional</strong> site.<br />

Structure pattern – cf. 3D pattern.<br />

TN – a true negative.<br />

TP – a true positive.<br />

TID – a Taxonomy identifier based on the NCBI Taxonomy guideline.<br />

UID – a Protein identifier based on the UniProtKB guideline.<br />

UniProtKB – a protein sequence database with manual <strong>annotation</strong>s on protein residues.<br />

VG – a verb group is a sequence <strong>of</strong> verbs, auxiliaries, or verb modifiers.<br />

VP – a verb phrase, consisting <strong>of</strong> a verb group + noun phrase.<br />

XC – a cross-validation corpus based on references from UniProtKB.<br />

chainID – a protein chain identifier in a PDB entry.<br />

k=2, k=3 – a residue triplet configuration with two-way or three-way interaction.<br />

resName – a residue name.<br />

resSeq – a protein residue sequence identifier from a PDB entry.<br />

seqIndex – a protein residue sequence identifier from a UniProtKB entry.<br />
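The identifiers defined above (RID, UID, resName, seqIndex) combine into the RID+UID labels used in the appendix tables, which appear both with one-letter residue codes ("T199 CAH2 HUMAN") and three-letter codes ("SER6 P53 HUMAN"). The Python sketch below shows how such a label might be normalised. The class `ProteinResidue` and function `parse_residue` are hypothetical names, and joining the UID parts with an underscore is an assumption based on the UniProtKB entry-name convention, since the labels as reproduced here show a space.

```python
from dataclasses import dataclass

# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


@dataclass(frozen=True)
class ProteinResidue:
    res_name: str   # three-letter residue name (resName)
    seq_index: int  # position in the UniProtKB sequence (seqIndex)
    uid: str        # UniProtKB entry name (UID)

    @property
    def rid(self):
        """Residue identifier: residue name followed by its position."""
        return f"{self.res_name}{self.seq_index}"


def parse_residue(label):
    """Parse a label such as 'T199 CAH2 HUMAN' or 'SER6 P53 HUMAN'."""
    rid, uid = label.split(maxsplit=1)
    letters = rid.rstrip("0123456789")
    number = rid[len(letters):]
    name = ONE_TO_THREE[letters] if len(letters) == 1 else letters.upper()
    # Assumption: the UID parts belong together as NAME_ORGANISM.
    return ProteinResidue(name, int(number), uid.replace(" ", "_"))


residue = parse_residue("T199 CAH2 HUMAN")
```

With this normalisation, a one-letter table label and the three-letter form used inside the PAS rows (THR199) resolve to the same RID.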
