24.10.2014 Views

Automatic functional annotation of predicted active sites - European ...


2.1.2 Universal Protein Knowledge base

The major repository of protein sequence data is the Universal Protein Knowledge base (UniProtKB). Alongside the sequence data, it lists protein names and synonyms, taxonomic data, citation references, and other information manually curated from the literature. One important aspect of UniProtKB for evaluating structure-function relationships is the annotation of protein residues. The feature table describes the biological function of a residue site along with several other key categories (cf. figure 2.3). Currently, UniProtKB lists 333,445 entries with 2,088,573 site-specific annotations (version from January 2008).
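
As a rough illustration of how site-specific annotations could be read from the feature table, the following sketch collects the FT lines of one flat-file entry into records. It is a minimal sketch under stated assumptions, not the procedure used in this work: the line layout (key, start position, end position, free-text description, with indented continuation lines) is modelled on the legacy flat-file format and may differ in other releases, the sample entry is invented for illustration, and the names `SiteFeature` and `parse_feature_table` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteFeature:
    key: str          # feature key, e.g. ACT_SITE or BINDING
    start: int        # first residue position (1-based)
    end: int          # last residue position (1-based)
    description: str  # free-text functional description


def parse_feature_table(entry_text: str) -> list[SiteFeature]:
    """Collect the FT lines of one flat-file entry into SiteFeature records.

    Assumes the legacy layout 'FT   KEY  FROM  TO  description'.
    """
    features: list[SiteFeature] = []
    for line in entry_text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[5:]  # text after the two-letter line code and its padding
        if not body.strip():
            continue
        if body[0] == " ":
            # Continuation line: it extends the previous description.
            if features:
                features[-1].description += " " + body.strip()
            continue
        fields = body.split(None, 3)
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # skip uncertain positions such as "<1" or "?"
        description = fields[3].strip() if len(fields) > 3 else ""
        features.append(
            SiteFeature(fields[0], int(fields[1]), int(fields[2]), description)
        )
    return features


# Invented sample entry, for illustration only.
SAMPLE_ENTRY = """\
ID   EXAMPLE_PROT            Reviewed;         245 AA.
FT   ACT_SITE     57     57       Charge relay system.
FT   BINDING     189    189       Substrate (illustrative, spans
FT                                two lines).
FT   DISULFID     42     58
"""

for feature in parse_feature_table(SAMPLE_ENTRY):
    print(feature.key, feature.start, feature.end, feature.description)
```

Keeping the feature key separate from the free-text description makes it straightforward to filter for residue-level categories such as active sites before any further processing.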

Despite the high quality of the data in UniProtKB, extracting functional annotations from the literature remains laborious work for human expert curators. The curator surveys the biomedical literature, records the experimentally determined functional information, and formulates the precise functional role using standardised semantic resources (cf. section 2.1.3). Although manual curation is highly reliable, it is evidently inefficient given the number of full-text publications curators have to distil. According to Frishman, if we assume

“[...] that one needs on average roughly 30 min to assess published fact and bioinformatics evidence for one protein, one thousand annotators would have to work 1 year long, 8 h a day, to annotate all 5 million sequences that are currently known. However, since the size of the protein database has been consistently doubling every 18 months, the moving target of annotating all proteins will never be achieved.” [Fri07]
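
The arithmetic behind this estimate can be checked with a short calculation. The sketch below takes its figures from the quotation and adds two assumptions of its own: "1 year long" is read as 365 working days, and the team size stays fixed while the database keeps doubling every 18 months. The function names are hypothetical. Under these assumptions, annotating 5 million sequences requires 2,500,000 annotator-hours, against a yearly capacity of 2,920,000 hours.

```python
MINUTES_PER_PROTEIN = 30      # from the quoted estimate
KNOWN_SEQUENCES = 5_000_000   # from the quoted estimate
ANNOTATORS = 1_000            # from the quoted estimate
HOURS_PER_DAY = 8             # from the quoted estimate
DOUBLING_MONTHS = 18          # from the quoted estimate
WORKING_DAYS_PER_YEAR = 365   # assumption: "1 year long" read as every calendar day


def hours_required(sequences: float) -> float:
    """Annotator-hours needed to annotate the given number of sequences."""
    return sequences * MINUTES_PER_PROTEIN / 60


def yearly_capacity_hours() -> int:
    """Annotator-hours the whole team can deliver in one year."""
    return ANNOTATORS * HOURS_PER_DAY * WORKING_DAYS_PER_YEAR


def database_size_after(months: float) -> float:
    """Number of sequences after the given time, doubling every 18 months."""
    return KNOWN_SEQUENCES * 2 ** (months / DOUBLING_MONTHS)


def backlog_after(months: float) -> float:
    """Sequences still unannotated if the fixed team works continuously."""
    hours_worked = yearly_capacity_hours() * months / 12
    annotated = hours_worked / (MINUTES_PER_PROTEIN / 60)
    return database_size_after(months) - annotated


print("hours required today:", hours_required(KNOWN_SEQUENCES))
print("yearly capacity:", yearly_capacity_hours())
for month in (0, 12, 24, 36, 48):
    print(month, "months, backlog:", round(backlog_after(month)))
```

In this simplified model the backlog shrinks at first because the team starts out faster than the database grows, but the exponential growth overtakes the linear annotation rate before the backlog is cleared, which is the "moving target" of the quotation.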

Considering that the estimated total number of proteins is in excess of 10^10 [CK06], an automatic or semi-automatic solution is needed to ease the laborious work of human experts. Currently, methods for the automatic expansion of citation sets [YLPV07] [HLC04] [LHC07] and the automatic annotation of protein function with GO terminology [CSL+06] [GJYLRS08] [RSKA+07] are being developed in the field of text mining.
