Automatic functional annotation of predicted active sites - European ...
contexts. This summary of text has also been compared with an “unstructured knowledge
database”, where information is present but difficult to retrieve due to the complexity of
natural language. According to Sidhu,
“[...] it is generally acknowledged that only 20 per cent of biological knowledge
and data is available in a structured format or a database. The remaining
80 per cent of biological information is hidden in the unstructured, free text
of scientific publications.” [SDC06]
In the context of information extraction, the data to be extracted from an article are
words (keywords) describing biological concepts that summarise the key message
of the article.
At first glance, abstract texts have a high density of keywords but a
low coverage of information, while full-texts cover a larger but more dispersed quantity of data
[FKY+01] [YHF+02] [SPIBA03] [SWS+04] [NBD+06].
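The contrast between density and coverage can be made concrete with a small sketch. The definitions used here are assumptions for illustration only, not measures taken from the cited studies: density is the fraction of tokens that are keywords, and coverage is the fraction of a keyword list that occurs at least once in the text. The keyword list and both example texts are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_density(text, keywords):
    """Fraction of tokens that are keywords (assumed definition)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)

def keyword_coverage(text, keywords):
    """Fraction of the keyword list found at least once (assumed definition)."""
    if not keywords:
        return 0.0
    found = set(tokenize(text)) & keywords
    return len(found) / len(keywords)

# Hypothetical keyword list and toy texts, for illustration only.
KEYWORDS = {"histidine", "catalytic", "nucleophile", "zinc", "binding"}

abstract = "The catalytic histidine acts as nucleophile."
full_text = (
    "The enzyme was purified from cell extracts and stored at low temperature. "
    "We observed that the catalytic histidine acts as nucleophile in the first step. "
    "Further experiments show zinc binding at a second site of the protein."
)

for name, text in [("abstract", abstract), ("full text", full_text)]:
    print(name, keyword_density(text, KEYWORDS), keyword_coverage(text, KEYWORDS))
```

In this toy case the abstract has the higher density, while the longer text reaches the higher coverage, which mirrors the trade-off described above.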
Another key distinction between abstract texts and full-texts is the availability of
data resources. Biomedical abstract texts can be publicly downloaded from MEDLINE
without restriction, while full-texts from various journals are only available to
subscribers.
Although some full-text articles are accessible through various initiatives
[BMC08] [Plo08] [PMC08], the extraction of information from a whole document is expected
to be much more complex than from an abstract text. For example, a biological
feature of a residue may be expressed over several sentences, requiring co-reference
resolution of the residue and the feature.
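To illustrate why a feature spread over several sentences calls for co-reference resolution, the following minimal sketch links anaphoric expressions to the most recent explicit residue mention. It is a deliberately naive heuristic under stated assumptions, not the method used in this work: the residue pattern, the list of anaphoric expressions, the function name `link_residue_mentions` and the example sentences are all hypothetical.

```python
import re

# Assumed pattern: three-letter amino acid code followed by a sequence number, e.g. "His57".
RESIDUE = re.compile(
    r"\b(Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)"
    r"[- ]?(\d+)\b"
)
# Assumed, very small set of anaphoric expressions.
ANAPHOR = re.compile(r"\b(this residue|the residue|it)\b", re.IGNORECASE)

def link_residue_mentions(sentences):
    """Return (sentence_index, residue) pairs.

    Explicit mentions are recorded directly. A sentence without an explicit
    mention but with an anaphoric expression is linked to the most recent
    explicit mention (naive "nearest antecedent" heuristic).
    """
    last_residue = None
    links = []
    for index, sentence in enumerate(sentences):
        explicit = [m.group(1) + m.group(2) for m in RESIDUE.finditer(sentence)]
        if explicit:
            links.extend((index, residue) for residue in explicit)
            last_residue = explicit[-1]
        elif last_residue is not None and ANAPHOR.search(sentence):
            links.append((index, last_residue))
    return links

# Hypothetical example: the feature of His57 is only stated in the second sentence.
sentences = [
    "His57 is located in the active site.",
    "This residue acts as a general base during catalysis.",
    "The structure was solved at high resolution.",
]
links = link_residue_mentions(sentences)
print(links)
```

Without the link from the second sentence back to His57, an extraction restricted to single sentences would miss the functional feature entirely. A realistic system would need a far more robust treatment of antecedents than this nearest-mention rule.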
2.2 Protein structure data mining<br />
Data mining is an analytic method to identify valid and novel patterns in data. A general
data mining solution does not exist. Instead, human data mining expertise and human
domain expertise are required to solve each specific data mining problem. A data mining