Automatic functional annotation of predicted active sites - European ...
2.1.2 Universal Protein Knowledge base
The major repository of protein sequence data is the Universal Protein Knowledge base
(UniProtKB). Along with the sequence data, it lists protein names and synonyms,
taxonomic data, citation references, and other information manually curated
from the literature.
One important aspect of UniProtKB for evaluating
structure-function relationships is the annotation of protein residues. The feature table
describes the biological function of a residue site, along with several other key categories
(cf. figure 2.3). As of January 2008, UniProtKB lists 333,445 entries with 2,088,573
site-specific annotations.
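To make the idea of a site-specific annotation concrete, the following is a minimal sketch of reading feature-table (FT) lines from a flat-file entry. It assumes the fixed-column flat-file layout in use around 2008 (feature key in columns 6 to 13, followed by start position, end position, and a free-text description), and it simplifies that layout by splitting the remainder of the line on whitespace. The `SiteAnnotation` class, the `parse_feature_table` function, and the sample lines are hypothetical illustrations and are not taken from a real entry.

```python
from dataclasses import dataclass


@dataclass
class SiteAnnotation:
    key: str          # feature key, e.g. ACT_SITE or BINDING
    start: str        # first residue position, kept as text
    end: str          # last residue position, kept as text
    description: str  # free-text functional description (may be empty)


def parse_feature_table(lines):
    """Collect feature-table records from flat-file lines (simplified sketch)."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue  # only feature-table lines are of interest here
        key = line[5:13].strip()
        if key:
            # A new feature: key, start, end, optional description.
            fields = line[13:].split(None, 2)
            if len(fields) < 2:
                continue  # malformed line, skipped in this sketch
            description = fields[2].strip() if len(fields) > 2 else ""
            features.append(SiteAnnotation(key, fields[0], fields[1], description))
        elif features:
            # No key in the key columns: treat as a wrapped description line.
            continuation = line[2:].strip()
            merged = features[-1].description + " " + continuation
            features[-1].description = merged.strip()
    return features


# Illustrative sample lines (assumed layout, not a real UniProtKB entry).
SAMPLE = [
    "DE   Hypothetical example protein.",
    "FT   ACT_SITE     57     57       Charge relay system.",
    "FT   BINDING     189    189       Substrate (illustrative",
    "FT                                continuation line).",
    "FT   DISULFID     42     58",
]

for feature in parse_feature_table(SAMPLE):
    print(feature.key, feature.start, feature.end, feature.description)
```

Positions are kept as text rather than integers because real entries can mark uncertain boundaries with non-numeric symbols, which this sketch does not attempt to interpret.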
Despite the high quality of the data in UniProtKB, extracting functional
annotations from the literature remains laborious work for human expert curators. The
curator surveys the biomedical literature, records the experimentally determined functional
information, and formulates the precise functional role using standardised
semantic resources (cf. section 2.1.3). Although manual curation is highly reliable,
this approach is evidently inefficient given the number of full-text publications
curators have to distil. According to Frishman, if we assume
“[...] that one needs on average roughly 30 min to assess published fact
and bioinformatics evidence for one protein, one thousand annotators would
have to work 1 year long, 8 h a day, to annotate all 5 million sequences that
are currently known. However, since the size of the protein database has been
consistently doubling every 18 months, the moving target of annotating all
proteins will never be achieved.” [Fri07]
Considering that the estimated total number of proteins is in excess of 10^10 [CK06],
an automatic or semi-automatic solution is needed to ease this laborious expert
work.
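The arithmetic behind the quoted estimate can be checked with a short sketch. The constants are the figures given in the quotation from [Fri07]; the function names and the assumption that the team's throughput stays fixed at 5 million proteins per year while the database keeps doubling are illustrative simplifications, not part of the cited work.

```python
# Figures taken from the quotation above.
MINUTES_PER_PROTEIN = 30
ANNOTATORS = 1_000
HOURS_PER_DAY = 8
SEQUENCES = 5_000_000
DOUBLING_MONTHS = 18


def working_days(sequences, annotators=ANNOTATORS):
    """Days the team needs to annotate `sequences` proteins at 8 h a day."""
    total_hours = sequences * MINUTES_PER_PROTEIN / 60
    return total_hours / (annotators * HOURS_PER_DAY)


def database_size(months_elapsed, initial=SEQUENCES):
    """Database size after exponential growth with an 18-month doubling time."""
    return initial * 2 ** (months_elapsed / DOUBLING_MONTHS)


def backlog(years, yearly_capacity=SEQUENCES):
    """Unannotated sequences after `years`, assuming constant team throughput."""
    return database_size(12 * years) - yearly_capacity * years


print(working_days(SEQUENCES))   # days needed for the current 5 million sequences
for years in (1, 3, 6):
    print(years, round(backlog(years)))
```

Under these assumptions the team needs about 312 eight-hour days for the current 5 million sequences, which matches the "1 year" in the quotation, while the backlog grows with every doubling because linear throughput cannot keep pace with exponential growth.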
Currently, methods for the automatic expansion of citation sets [YLPV07]
[HLC04] [LHC07] and the automatic annotation of protein function with GO terms
[CSL+06] [GJYLRS08] [RSKA+07] are being developed in the field of text mining.