Automatic functional annotation of predicted active sites - European ...


Side chain interaction model. Determining residue interactions requires transforming the full atom model into a simpler representation, because a mathematical model that describes all combinations of atom interactions between two residues would be too complex. The solution is to replace the all-atom structure model with a coarse grained model in which each residue is reduced to a single point. In principle, a residue point can be calculated either as the centre of mass or as the geometric centre (centroid). Each representation can be calculated from main chain atoms, from main and side chain atoms, or from side chain atoms only. The focus of this study is the side chain interactions within a residue triplet configuration. For this reason, a protein structure is represented as a set of points, one side chain centroid per residue.

Protein structure triangulation. The extraction of residue triplets from a protein is based on triangulation of structures. Here structures are triangulated on the basis of three criteria.

The first is the compositional constraint. Each residue in a triplet must be one of the 20 natural amino acids, while hetero atoms are excluded. One prominent reason is that the dataset does not contain enough examples of residue-hetero atom interactions to support a statistical analysis.

The second condition of triplet extraction requires that none of the residues are direct neighbours in the protein sequence. The assumption made here is that two covalently bonded residues are more likely to be next to each other in space than two residues that are not bonded. Similarly, the probability of finding three connected residues close together in space is higher than that of finding an unconnected triplet. Including them would therefore over-represent such residues in the distribution of interacting residues in space. The definition of residue neighbourhood affects the data mining result, e.g. by requiring a pair interaction in the triplet to have a distance of more than one residue, patches of residues on one side of a beta-sheet may not be discovered. While tuning this parameter can modify the result of the data mining, the objective here is to discover new knowledge from the input data set while providing as little biological information as possible.

The last criterion in triplet extraction concerns the geometrical property of a triplet. The Euclidean distances between the residues must fulfil the triangle inequality, while only two interaction distances of less than 6 Å were allowed. Although the interaction distance threshold is based on an empirical study of a number of protein structures, this value may not be adequate, because it is restrictive for residue pairs with large side chains, which pass only when in close contact. For example, two interacting tryptophans may have a centroid distance near the maximal allowed interaction distance, while the contacting atoms are actually very close. The alternative is to set up a threshold system for residue pairs or triplets that depends on the types of residues. Although this approach was not studied in this thesis, future work could use it to improve the developed algorithm.

Yet another approach to selecting residue interactions from a protein structure is based on the analysis of surface contacts of the side chain groups. While not all functional sites require their constituents to be in physicochemical contact (e.g. a metal binding site consists of metal ion coordinating residues without physical contacts), a protein binding site is an example where residues of two different proteins are in non-covalent interaction. However, the presented data mining approach aims at an unbiased search for residue interactions in a dataset of monomeric protein structure domains, and a surface-based selection criterion would therefore bias the analysis.

Implementation. A coarse grained representation is used in this protein structure analysis. From a full atom model of a protein structure, the centroid position of each protein residue was calculated on the basis of its side chain atoms.
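The centroid calculation described above can be illustrated with a minimal Python sketch. The atom record layout, the function name side_chain_centroids, the set of backbone atom names, and the fallback to the C-alpha position for residues without side chain atoms (such as glycine) are assumptions made for this example; the text does not specify them.

```python
from collections import defaultdict

# Assumed main chain atom names (PDB convention); everything else in a
# residue is treated as a side chain atom.
BACKBONE = {"N", "CA", "C", "O", "OXT"}


def side_chain_centroids(atoms):
    """Reduce each residue to the geometric centre of its side chain atoms.

    atoms: iterable of (res_index, res_name, atom_name, x, y, z) records,
           assumed to contain heavy atoms only.
    Returns {res_index: (res_name, (x, y, z))}.
    """
    side = defaultdict(list)  # side chain coordinates per residue
    ca = {}                   # C-alpha coordinates per residue
    names = {}                # residue name per residue
    for res_index, res_name, atom_name, x, y, z in atoms:
        names[res_index] = res_name
        if atom_name == "CA":
            ca[res_index] = (x, y, z)
        if atom_name not in BACKBONE:
            side[res_index].append((x, y, z))

    centroids = {}
    for res_index, res_name in names.items():
        coords = side.get(res_index)
        if not coords:
            # Assumption: a residue without side chain atoms (e.g. glycine)
            # is represented by its C-alpha atom; skip it if that is missing.
            if res_index not in ca:
                continue
            coords = [ca[res_index]]
        n = len(coords)
        centre = tuple(sum(c[k] for c in coords) / n for k in range(3))
        centroids[res_index] = (res_name, centre)
    return centroids
```

The unweighted mean gives the geometric centre; a centre of mass would weight each coordinate by the atomic mass instead.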
The resulting simplified structure model is then triangulated based on three criteria: (1) each residue in a triplet must be one of the 20 natural amino acids; (2) pairs of residues in the triplet must not be sequential neighbours with respect to their protein sequence position; and (3) only two pairs of residues
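The three extraction criteria of this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: the function name extract_triplets and its parameters are hypothetical, residues are assumed to come from a single chain with integer sequence positions, and the geometric criterion is read here as "at least two of the three pairwise centroid distances are below 6 Å". That reading is an interpretation, so the count is exposed as a parameter.

```python
from itertools import combinations
from math import dist

# The 20 natural amino acids (three-letter codes).
STANDARD = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def extract_triplets(centroids, cutoff=6.0, min_close_pairs=2, min_seq_sep=2):
    """Enumerate residue triplets from {res_index: (res_name, (x, y, z))}.

    cutoff, min_close_pairs and min_seq_sep are assumed parameter names.
    Returns a list of (i, j, k) sequence positions with i < j < k.
    """
    # Criterion 1: keep only the 20 natural amino acids (no hetero groups).
    residues = sorted(
        (i, name, xyz) for i, (name, xyz) in centroids.items()
        if name in STANDARD
    )
    triplets = []
    # Naive O(n^3) enumeration; a spatial index would be needed at scale.
    for trio in combinations(residues, 3):
        pairs = list(combinations(trio, 2))
        # Criterion 2: no two residues are direct sequence neighbours.
        if any(abs(p[0] - q[0]) < min_seq_sep for p, q in pairs):
            continue
        d = sorted(dist(p[2], q[2]) for p, q in pairs)
        # Criterion 3a: triangle inequality. Points in Euclidean space
        # always satisfy it, so this check is kept only to mirror the text.
        if d[0] + d[1] < d[2] - 1e-9:
            continue
        # Criterion 3b (assumed reading): enough pair distances below cutoff.
        if sum(x < cutoff for x in d) >= min_close_pairs:
            triplets.append(tuple(r[0] for r in trio))
    return triplets
```

A residue-type-dependent threshold, which the text names as a possible improvement, could replace the single cutoff with a lookup keyed on the two residue names of each pair.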
Page 3 and 4: Summary Kevin Nagel European Bioinf