Automatic functional annotation of predicted active sites - European ...

2.1.2 Universal Protein Knowledge base

The major repository of protein sequence data is the Universal Protein Knowledge base (UniProtKB). Alongside the sequence data, it lists protein names and synonyms, taxonomic data, citation references, and other information manually curated from the literature. One aspect of UniProtKB that is important for evaluating structure-function relationships is the annotation of protein residues. The feature table describes the biological function of a residue site, along with several other key categories (cf. figure 2.3). Currently, UniProtKB lists 333,445 entries with 2,088,573 site-specific annotations (version from January 2008).

Despite the high quality of the data in UniProtKB, extracting functional annotations from the literature remains laborious work for human expert curators. The curator surveys the biomedical literature, records the experimentally determined functional information, and formulates the precise functional role using standardised semantic resources (cf. section 2.1.3). Although manual curation is highly reliable, it is evidently inefficient given the number of full-text publications curators have to distil. According to Frishman, if we assume "[...] that one needs on average roughly 30 min to assess published fact and bioinformatics evidence for one protein, one thousand annotators would have to work 1 year long, 8 h a day, to annotate all 5 million sequences that are currently known. However, since the size of the protein database has been consistently doubling every 18 months, the moving target of annotating all proteins will never be achieved." [Fri07] Considering that the estimated total number of proteins is in excess of 10^10 [CK06], an automatic or semi-automatic solution is needed to support the laborious work of human experts.
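The arithmetic behind Frishman's estimate can be checked with a short sketch. The figures for minutes per protein, number of annotators, hours per day, number of sequences and doubling time are taken from the quote; the 365 working days per year and all function names are assumptions made here for illustration only.

```python
# Back-of-the-envelope check of the workload estimate quoted from [Fri07].
MINUTES_PER_PROTEIN = 30      # from the quote
KNOWN_SEQUENCES = 5_000_000   # from the quote
ANNOTATORS = 1_000            # from the quote
HOURS_PER_DAY = 8             # from the quote
DOUBLING_TIME_YEARS = 1.5     # from the quote (18 months)
DAYS_PER_YEAR = 365           # assumption: no weekends or holidays


def required_hours(sequences, minutes_each=MINUTES_PER_PROTEIN):
    """Total curation effort in person-hours."""
    return sequences * minutes_each / 60


def available_hours(annotators=ANNOTATORS, hours_per_day=HOURS_PER_DAY,
                    days=DAYS_PER_YEAR):
    """Person-hours the whole team can deliver in one year."""
    return annotators * hours_per_day * days


def years_needed(sequences=KNOWN_SEQUENCES, days=DAYS_PER_YEAR):
    """Years the team needs for a fixed (non-growing) set of sequences."""
    return required_hours(sequences) / available_hours(days=days)


def sequences_after(years, start=KNOWN_SEQUENCES):
    """Database size after `years`, assuming steady doubling."""
    return start * 2 ** (years / DOUBLING_TIME_YEARS)


if __name__ == "__main__":
    print(f"effort needed:   {required_hours(KNOWN_SEQUENCES):,.0f} h")
    print(f"effort per year: {available_hours():,.0f} h")
    print(f"years needed:    {years_needed():.2f}")
    print(f"size after 18 months: {sequences_after(1.5):,.0f} sequences")
```

Under these assumptions the fixed workload of 2.5 million person-hours takes a little under one year, which matches the "1 year" in the quote, while the database has doubled again within 18 months.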
Currently, methods for the automatic expansion of citation sets [YLPV07] [HLC04] [LHC07] and for the automatic annotation of protein function with GO terminology [CSL+06] [GJYLRS08] [RSKA+07] are being developed in the field of text mining.
Key        Description
INIT_MET   Initiator methionine.
SIGNAL     Extent of a signal sequence (prepeptide).
PROPEP     Extent of a propeptide.
TRANSIT    Extent of a transit peptide (mitochondrion, chloroplast, thylakoid, cyanelle, peroxisome etc.).
CHAIN      Extent of a polypeptide chain in the mature protein.
PEPTIDE    Extent of a released active peptide.
TOPO_DOM   Topological domain.
TRANSMEM   Extent of a transmembrane region.
DOMAIN     Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure or fold.
REPEAT     Extent of an internal sequence repetition.
CA_BIND    Extent of a calcium-binding region.
ZN_FING    Extent of a zinc finger region.
DNA_BIND   Extent of a DNA-binding region.
NP_BIND    Extent of a nucleotide phosphate-binding region.
REGION     Extent of a region of interest in the sequence.
COILED     Extent of a coiled-coil region.
MOTIF      Short (up to 20 amino acids) sequence motif of biological interest.
COMPBIAS   Extent of a compositionally biased region.
ACT_SITE   Amino acid(s) involved in the activity of an enzyme.
METAL      Binding site for a metal ion.
BINDING    Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
SITE       Any interesting single amino-acid site on the sequence, that is not defined by another feature key. It can also apply to an amino acid bond which is represented by the positions of the two flanking amino acids.
NON_STD    Non-standard amino acid.
MOD_RES    Posttranslational modification of a residue.
LIPID      Covalent binding of a lipid moiety.
CARBOHYD   Glycosylation site.
DISULFID   Disulfide bond.
CROSSLNK   Posttranslationally formed amino acid bonds.
VAR_SEQ    Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting.
VARIANT    Authors report that sequence variants exist.
MUTAGEN    Site which has been experimentally altered by mutagenesis.
CONFLICT   Different sources report differing sequences.
Figure 2.3: Categories for protein sequence annotation in UniProtKB. The keys used to describe regions or sites of interest in a protein sequence are listed together with their definitions. Each key and its corresponding information (value) are stored in the feature table (FT lines) of a UniProtKB entry.
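To illustrate how the key/value pairs of the feature table might be read programmatically, the following is a minimal sketch. It assumes the older flat-file layout, in which each feature starts with the line code FT followed by key, start position, end position and an optional description, and continuation lines leave the key column blank. The names Feature and parse_feature_table and the sample lines are hypothetical and hand-written for illustration; they are not taken from a real UniProtKB entry or from the thesis.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    key: str          # feature key, e.g. ACT_SITE
    start: str        # kept as text: positions may carry markers such as '<' or '?'
    end: str
    description: str = ""


def parse_feature_table(lines):
    """Collect features from the FT lines of one flat-file entry (assumed layout)."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue  # not a feature-table line
        if line[5:13].strip():
            # Key column is filled: a new feature begins on this line.
            fields = line[5:].split(None, 3)
            if len(fields) < 3:
                continue  # malformed line, skipped in this sketch
            key, start, end = fields[:3]
            description = fields[3].strip() if len(fields) > 3 else ""
            features.append(Feature(key, start, end, description))
        elif features:
            # Key column is blank: continuation of the previous description.
            extra = line[2:].strip()
            last = features[-1]
            last.description = (last.description + " " + extra).strip()
    return features


# Hand-written sample lines (hypothetical, for illustration only).
SAMPLE = [
    "ID   EXAMPLE_ENTRY",
    "FT   ACT_SITE     57     57       Charge relay",
    "FT                                system.",
    "FT   DISULFID     42     58",
]

if __name__ == "__main__":
    for feature in parse_feature_table(SAMPLE):
        print(feature)
```

Splitting on whitespace instead of slicing exact columns keeps the sketch tolerant of small alignment differences, at the cost of not validating the column positions.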