24.10.2014 Views

Automatic functional annotation of predicted active sites - European ...



contexts. This summary of text has also been compared with an "unstructured knowledge database", where information is present but difficult to retrieve due to the complexity of natural language. According to Sidhu,

"[...] it is generally acknowledged that only 20 per cent of biological knowledge and data is available in a structured format or a database. The remaining 80 per cent of biological information is hidden in the unstructured, free text of scientific publications." [SDC06]

In the context of information extraction, the data to be extracted from an article are words (keywords) describing biological concepts that could summarise the key message of the article. At first glance, abstract texts have a high density of keywords but a low coverage of information, while full texts cover a larger but more dispersed quantity of data [FKY+01] [YHF+02] [SPIBA03] [SWS+04] [NBD+06].
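The trade-off between keyword density and coverage can be made concrete with a small sketch. The definitions used here are illustrative assumptions, not measures taken from the cited studies: density is the fraction of tokens that are keywords, and coverage is the fraction of a keyword set that occurs at least once. The keyword set and both example texts are invented for the illustration.

```python
import re

def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_density(text, keywords):
    """Fraction of all tokens in the text that are keywords (assumed definition)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in keywords) / len(tokens)

def keyword_coverage(text, keywords):
    """Fraction of the keyword set found at least once in the text (assumed definition)."""
    if not keywords:
        return 0.0
    return len(keywords & set(tokenize(text))) / len(keywords)

# Hypothetical keyword set and toy texts, invented for this example.
KEYWORDS = {"histidine", "catalytic", "serine", "nucleophile", "zinc", "binding"}

abstract = "The catalytic histidine activates the serine nucleophile."
full_text = (
    "We expressed the protein in bacteria and purified it over two columns. "
    "The catalytic histidine activates the serine nucleophile. "
    "In a separate experiment we observed zinc binding at a distant site, "
    "which we discuss at length in the following sections of this report."
)

for name, text in [("abstract", abstract), ("full text", full_text)]:
    print(name,
          "density=%.2f" % keyword_density(text, KEYWORDS),
          "coverage=%.2f" % keyword_coverage(text, KEYWORDS))
```

In this toy case the abstract is denser in keywords, while only the full text mentions every keyword in the set, which mirrors the qualitative observation above.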

Another key distinction between abstract texts and full texts is the availability of data resources. Biomedical abstract texts can be publicly downloaded from MEDLINE without restriction, while full texts from various journals are only available to subscribers. Although some full-text articles are accessible through various initiatives [BMC08] [Plo08] [PMC08], the extraction of information from a whole document is expected to be much more complex than from an abstract text. For example, a biological feature of a residue may be expressed over several sentences, requiring co-reference resolution to link the residue and the feature.
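The co-reference problem mentioned above can be illustrated with a minimal sketch. It is not a real co-reference resolver: the residue pattern, the feature cue words and the "most recent residue" heuristic are all hypothetical choices made for this example, and the input text is invented.

```python
import re

# Hypothetical pattern for residue mentions such as "His57" or "Ser195".
RESIDUE = re.compile(
    r"\b(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|"
    r"Thr|Trp|Tyr|Val)\d+\b"
)
# Hypothetical cue words that signal a functional feature.
FEATURE = re.compile(r"\b(catalytic|nucleophile|binding)\b", re.IGNORECASE)
# Anaphoric expressions that may refer back to an earlier residue mention.
ANAPHOR = re.compile(r"\b(?:this|the) residue\b", re.IGNORECASE)

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def link_features(text, resolve_anaphora=False):
    """Pair residues with feature cue words found in the same sentence.

    With resolve_anaphora=True, a sentence that contains an anaphor but no
    explicit residue inherits the most recently mentioned residue. This is a
    crude stand-in for co-reference resolution, assumed for illustration.
    """
    pairs = []
    last_residue = None
    for sentence in split_sentences(text):
        residues = RESIDUE.findall(sentence)
        features = [f.lower() for f in FEATURE.findall(sentence)]
        if residues:
            last_residue = residues[-1]
        elif resolve_anaphora and last_residue and ANAPHOR.search(sentence):
            residues = [last_residue]
        pairs.extend((r, f) for r in residues for f in features)
    return pairs

# Invented example: the residue and its feature sit in different sentences.
text = "His57 lies close to the substrate. This residue acts as the catalytic base."

print(link_features(text))                         # sentence-level matching only
print(link_features(text, resolve_anaphora=True))  # with the naive anaphora heuristic
```

Sentence-level matching finds no residue-feature pair in this text, whereas the heuristic links the feature in the second sentence back to the residue in the first. In full-text articles such links can span far more than two sentences, which is one reason extraction from whole documents is harder than from abstracts.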

2.2 Protein structure data mining<br />

Data mining is an analytic method to identify valid and novel patterns in data. A general data mining solution does not exist. Instead, human data mining expertise and human domain expertise are required to solve each specific data mining problem. A data mining

