11.03.2014 Views

Data integration in microbial genomics ... - Jacobs University



spreadsheet and web-based software tool which assists users in the consistent acquisition, electronic storage and submission of CD associated with their samples [Hankeln et al., 2010]. However, a tool that integrates CD and sequence data by directly enriching FASTA files for submission does not exist yet.

Here we present CDinFusion (Contextual Data and FASTA in fusion). CDinFusion has been designed to submit sequence data together with CD to INSDC. CDinFusion intends to facilitate the integration of CD and sequence data prior to submission by directly enriching sequence data using the FASTA format. It generates submission-ready outputs for INSDC by implementing the MIxS standard defined by the GSC. CDinFusion processes single as well as MultiFASTA files, containing up to millions of sequences. It was successfully applied to several use cases. Example submissions to the INSDC can be accessed with the following accession numbers: JF681370, JF268327-JF268425 and Genome Project ID 63253. A public installation is hosted and maintained at the Max Planck Institute for Marine Microbiology, Bremen, Germany: http://www.megx.net/cdinfusion. The tool is easy to install and released under the LGPL 3 open source license to promote distribution in aid of increasing the quantity and quality of CD in the public repositories.

3.3 Results

CDinFusion has been designed as a web-based tool that enables users to upload single or MultiFASTA files, from single-sequence to high-throughput analyses, and enrich them with CD. After uploading the sequences, the user is requested to select the appropriate GSC checklist and environmental package. CD can be entered in the web forms, or CSV templates can be downloaded, filled in with CD offline and uploaded. The CSV files help to store and share the data. The merged sequence and CD can be downloaded for subsequent submission to INSDC.
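The merge of a CD set with uploaded sequences can be illustrated with a minimal sketch. It is not CDinFusion's actual implementation: the function names `read_cd_set` and `enrich_fasta`, the two-column CSV layout (field, value) and the choice of appending `[key=value]` modifiers to each FASTA header line are assumptions made for illustration, and the coordinate and depth values are invented.

```python
import csv
import io


def read_cd_set(csv_text):
    """Read one CD set from two-column CSV text (field,value).

    The two-column layout is an assumption, not the CDinFusion template format.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return {
        row[0].strip(): row[1].strip()
        for row in reader
        if len(row) >= 2 and row[0].strip()
    }


def enrich_fasta(fasta_text, cd):
    """Append every CD item as a [key=value] modifier to each FASTA header."""
    modifiers = " ".join(f"[{key}={value}]" for key, value in cd.items())
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and modifiers:
            # Header line: keep the original identifier, then add the CD.
            out.append(f"{line.rstrip()} {modifiers}")
        else:
            # Sequence line: pass through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical CD values for two sequences sharing one CD set.
    cd = read_cd_set("lat_lon,54.18 N 7.90 E\ndepth,1 m\n")
    print(enrich_fasta(">seq1\nACGT\n>seq2\nGGCC\n", cd), end="")
```

Because the same CD set is applied to every header, the sketch corresponds to enriching one or many sequences with a single CD set; assigning different CD sets to different sequences would require a per-sequence lookup instead.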

The implemented workflow covers the three typical scenarios of sequence submission to an INSDC database, namely: 1) Enriching a single sequence with one CD set, 2) Enriching many sequences in a Mul-
