Automatic functional annotation of predicted active sites - European ...

For example, "hunchback" is a protein in Drosophila, but it is also a general English term. Furthermore, protein names mostly consist of multiple words, e.g. "Rho-like protein" or "HIV-1 envelope glycoprotein gp120". An ER system needs to identify all the constituents of a protein name in order to relate the detected entity to its reference entry in a database. The BioCreAtIvE challenge addressed this problem in subtask 1B, whose goal is to identify protein/gene names in text and to annotate them with the correct gene identifier. Various solutions have been published, ranging from rule-based methods [HFM+05] [TW02] [Fuk98] to machine learning approaches [CMP05]. The developed methods are, in general, reusable for any other biological entity recognition or terminology identification problem. Other work has focused on the extraction of protein point mutations [RSMA+04] [HLC04] [BW05] [LHC07] [YLPV07], which are one category of protein residue terminology. Other categories are residue sequences and residue interaction pairs. The most widely adopted method for identifying these terminologies is the design of regular expression patterns.

2.3.2 Biological relation extraction

Relation extraction (RD) aims to find associations between entities, or between an entity and a terminology, within a text phrase. One objective of biomedical information extraction is the mining of biological facts from text. An example of a biological fact is a semantic relation between two biological entities, such as a protein-protein interaction [TOT04]. Until now, three strategies have been investigated for biological relation extraction: co-occurrence based analysis [LC05] [SB05], pattern-based approaches [HZH+04] [LCM03], and machine learning based methods [BM05] [BM06]. The common limitation of all of these extraction systems is that only the relation targets, e.g. the proteins within a protein-protein interaction, are extracted.
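
To make the first of these strategies concrete, the following is a minimal sketch of sentence-level co-occurrence counting in Python. It is not taken from any of the cited systems: the function name `cooccurrence_counts`, the dictionary-lookup matching, and the protein names ProtA, ProtB and ProtC are assumptions made purely for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences, entity_names):
    """Count how often two known entity names occur in the same sentence."""
    # Map lower-cased names back to their canonical spelling.
    canonical = {name.lower(): name for name in entity_names}
    # Try longer names first so multi-word names win over their substrings.
    alternatives = sorted(canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    counts = Counter()
    for sentence in sentences:
        # Each entity is counted at most once per sentence.
        found = {canonical[m.group(0).lower()] for m in pattern.finditer(sentence)}
        for pair in combinations(sorted(found), 2):
            counts[pair] += 1
    return counts


# Hypothetical sentences and protein names, made up for this sketch.
example_sentences = [
    "ProtA binds to ProtB in vitro.",
    "ProtB is phosphorylated by ProtA.",
    "ProtC is expressed in the embryo.",
]
example_counts = cooccurrence_counts(example_sentences, ["ProtA", "ProtB", "ProtC"])
```

Under these assumptions the pair (ProtA, ProtB) is counted twice and ProtC is paired with nothing. The result holds only the pair and its frequency, so the words that say how the two entities are related ("binds to", "is phosphorylated by") are lost.
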
Contextual information that would describe or explain the association between the entities is not considered in the extraction at all. Within the information extraction community, a consensus has been reached that deeper analysis of sentence structure is required in order to adequately acquire biomolecular relations from text [WSC04]. For biological relation extraction, two classes of syntactical parser have been studied. The first is the shallow parsing technique, which aims to detect the main constituents of a sentence without determining its complete syntactical structure. Protein-protein interactions [KNT05] and general biological entity relations [LCM03] have been extracted on the basis of shallow parsing. The second class is the full parser, which attempts a deep analysis of the syntactical structure of a sentence. Several systems have been reported [NED03] [FKY+01] that utilise full parsing for relation extraction from biomedical literature. One interesting full parser is ENJU [YMTT05] [MT05], a so-called head-driven phrase structure grammar (HPSG) parser, which identifies the predicate-argument structure (PAS) of a sentence. The use of PAS as a template for biomolecular relation extraction was first reported in [TOT04] [YMTT05]. Recently, two proposition banks designed to capture relations in molecular biology were reported: PASBio [WSC04] and BioProp [TCS+07].

Within this work, two types of semantic relation are to be extracted. The first is the residue-protein association. The system MEMA [RSMA+04] uses a word distance metric, pairing residues and proteins that have the smallest word distance between them. Another approach is to look up valid associations between a residue and a protein in the context of a predetermined association between a protein and an organism. Three systems have been reported that adopt this approach: MuteXt [HLC04], MutationMiner [BW05], and MutationGraB [LHC07].
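
The word-distance idea can be sketched in a few lines of Python. This is an illustrative approximation, not the MEMA implementation: the residue pattern, the tokenisation, the restriction to single-token protein names, and the names `associate_by_word_distance`, ProtA and ProtB are all assumptions of this sketch.

```python
import re

THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)
ONE_LETTER = "[ACDEFGHIKLMNPQRSTVWY]"

# Matches point mutations such as "K241Q" and residue mentions such as "Ser195".
RESIDUE = re.compile(rf"{ONE_LETTER}\d+{ONE_LETTER}|(?:{THREE_LETTER})\d+")


def associate_by_word_distance(sentence, protein_names):
    """Pair each residue mention with the protein name fewest tokens away."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    proteins = [(i, t) for i, t in enumerate(tokens) if t in protein_names]
    pairs = []
    for i, token in enumerate(tokens):
        if RESIDUE.fullmatch(token) and proteins:
            # On a tie, min() keeps the protein that occurs first in the sentence.
            j, protein = min(proteins, key=lambda p: abs(p[0] - i))
            pairs.append((token, protein, abs(j - i)))
    return pairs


# Hypothetical sentence and protein names, made up for this sketch.
example_pairs = associate_by_word_distance(
    "Mutation K241Q in ProtA abolished binding to ProtB.",
    {"ProtA", "ProtB"},
)
```

Here `K241Q` is two tokens away from ProtA and six tokens away from ProtB, so the sketch pairs it with ProtA. A purely positional metric like this can pick the wrong protein when sentence structure places the true partner further away, which is one motivation for the syntactic approaches described above.
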
The other semantic relation to be extracted in this work is the association between a residue entity and the description of its function. The systems MuteXt [HLC04], MEMA [RSMA+04], MutationMiner [BW05], and MutationGraB [LHC07] are all dedicated to