11.03.2014

Data integration in microbial genomics ... - Jacobs University



tegrated. If the MIxS standards are widely adopted, the integration effort can be reduced significantly. However, if the effort needed to integrate the data in the megx.net project cannot be handled by a handful of people, it will be impossible to keep up with the pace of data accumulation. From this point of view, megx.net represents a proof of principle of a future-oriented approach that should be pursued.

The megx.net platform has been classified as being at the interface between information and knowledge, because knowledge extraction itself still requires human input, even though the platform offers its users a lot of useful information.⁴ That knowledge can be gained with this approach has been demonstrated in chapter 6 and is discussed in the following.

7.4 Towards knowledge generation in silico

To exemplify the power of in silico knowledge generation, a case study was conducted (chapter 6). The study presents a computational approach for analyzing large-scale metagenomic data sets enriched with contextual data in order to generate functional hypotheses. This was done by examining the co-occurrence of domains of unknown function (DUF) across different habitats and by correlating their occurrence with environmental parameters offered by the megx.net platform (chapter 5). Without the “x,y,z,t-key-data-tuple”, this analysis would not have been possible. The in silico approach demonstrates a way to quickly process and interpret large metagenomic data sets. The GOS data set was analyzed, which at that point in time included more than 10 million sequence reads from 79 sampling sites, all of which were georeferenced. The Hidden Markov Model (HMM) search returned 473,251 hits. The graphical representation of co-occurring DUF hits in networks provided a quick and intuitive overview of this aspect of the data. Patterns were revealed that allowed interpretation already at this stage. Hypotheses about the function of protein families with no previously known function could be derived through their association with protein families with a puta-

⁴ There are no green buttons that can be pressed to auto-generate scientific papers that contain explicit knowledge.
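The two analysis steps described in this section, counting how often DUFs co-occur across sampling sites and correlating their occurrence with an environmental parameter, can be illustrated with a minimal sketch. It is not the implementation used in chapter 6. The site names, DUF identifiers, hit counts and temperatures are invented toy values, the function names are hypothetical, and the Pearson coefficient is an assumption, since the correlation measure is not specified here.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy input: DUF hit counts per sampling site (not GOS data).
hits_per_site = {
    "site_A": {"DUF1": 4, "DUF2": 3, "DUF3": 0},
    "site_B": {"DUF1": 2, "DUF2": 5, "DUF3": 1},
    "site_C": {"DUF1": 0, "DUF2": 0, "DUF3": 6},
    "site_D": {"DUF1": 3, "DUF2": 2, "DUF3": 0},
}
# Hypothetical environmental parameter per site (water temperature, deg C).
temperature = {"site_A": 24.0, "site_B": 22.0, "site_C": 8.0, "site_D": 25.0}


def cooccurrence_edges(hits, min_sites=2):
    """Return DUF pairs found together at >= min_sites sites, with counts."""
    pair_counts = Counter()
    for counts in hits.values():
        present = sorted(duf for duf, n in counts.items() if n > 0)
        pair_counts.update(combinations(present, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_sites}


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlate_with_parameter(hits, parameter):
    """Correlate each DUF's per-site hit count with one site parameter."""
    sites = sorted(hits)
    values = [parameter[site] for site in sites]
    dufs = sorted({duf for counts in hits.values() for duf in counts})
    return {
        duf: pearson([hits[site].get(duf, 0) for site in sites], values)
        for duf in dufs
    }


# Edges of the co-occurrence network and per-DUF correlations.
edges = cooccurrence_edges(hits_per_site)
correlations = correlate_with_parameter(hits_per_site, temperature)
```

In this toy example the edge dictionary plays the role of the co-occurrence network: each key is a pair of DUFs and each value is the number of sites they share. Linking per-site counts to a parameter such as temperature is only possible because every site carries its contextual data.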
