11.03.2014 Views

Data integration in microbial genomics ... - Jacobs University



3.2. Introduction

2007, Ratnasingham and Hebert, 2007].

Early on, scientists recognized the necessity of sharing sequence data to facilitate reuse, reproducibility and comparisons. This has become an integral part of the research and publication process. In the ‘Bermuda Principles’, drawn up at the first international strategy meeting on human genome sequencing in 1996, it was agreed that all human genomic sequence information generated by centers funded for large-scale human sequencing should be freely available in the public domain to encourage research and to maximize its benefits to society. At the Fort Lauderdale meeting in 2003, organized by the Wellcome Trust, it was finally agreed to deposit all kinds of sequencing data that are analyzed in scientific publications in public databases. Over the past two decades, the amount of sequence data submitted to the world’s largest public nucleotide sequence data repository, INSDC (International Nucleotide Sequence Database Collaboration, comprising DDBJ (DNA Data Bank of Japan), ENA (European Nucleotide Archive), and GenBank), has grown exponentially [Stratton et al., 2009]. Recently, Next Generation Sequencing (NGS) technologies [Mardis, 2008] have allowed even faster and more economical sequence generation, resulting in an unprecedented accumulation of sequences.

Despite the impressive magnitude of sequence data generation, numerous life science studies have shown that contextual (meta)data (CD) are crucial for their interpretation [DeLong et al., 2006, Fuhrman et al., 2006, Schriml et al., 2010]. CD are metadata about features such as the environmental origin and the processing steps that were applied to obtain the sequences. They range from data about the geographic location (latitude, longitude), sampling time and habitat, through the experimental procedures used to obtain the sequences, up to video data recorded during sampling. However, items such as latitude and longitude (INSDC: lat_lon) and time (INSDC: collection_date), which the public repositories have accepted for years, have so far been reported in only 7.3% and 7.2% of all submissions, respectively [Hankeln et al., 2010]. This strongly implies that the procedure for depositing these data is hampered. Common reasons are: 1) no clear descriptors exist to guide submitters on which metadata should be deposited and 2) no appropriate tools exist that support the combined submission of sequence data and CD.
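To make the reporting figures above concrete, the following minimal Python sketch shows how the presence of lat_lon and collection_date could be checked across a set of submissions. It is an illustration under stated assumptions, not the method used in the cited survey: the records are invented, the helper names `parse_lat_lon` and `coverage` are hypothetical, and the lat_lon pattern assumes the INSDC-style form of decimal degrees followed by a hemisphere letter (for example "54.18 N 7.90 E").

```python
import re

# Assumed INSDC-style lat_lon value: decimal degrees plus hemisphere letters,
# e.g. "54.18 N 7.90 E". This pattern is an assumption made for illustration.
LAT_LON = re.compile(r"^(\d+(?:\.\d+)?) ([NS]) (\d+(?:\.\d+)?) ([EW])$")


def parse_lat_lon(value):
    """Return (latitude, longitude) as signed floats, or None if malformed."""
    match = LAT_LON.match(value.strip())
    if match is None:
        return None
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    # Reject coordinates outside the valid geographic range.
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon


def coverage(records, field):
    """Fraction of records (dicts) carrying a non-empty value for `field`."""
    if not records:
        return 0.0
    present = sum(1 for record in records if record.get(field))
    return present / len(records)


# Hypothetical submissions, invented purely for this example.
records = [
    {"accession": "X1", "lat_lon": "54.18 N 7.90 E", "collection_date": "2009-03-12"},
    {"accession": "X2"},
    {"accession": "X3", "lat_lon": "", "collection_date": "2008"},
    {"accession": "X4"},
]

for field in ("lat_lon", "collection_date"):
    print(f"{field}: reported in {coverage(records, field):.1%} of submissions")
```

Counting only non-empty values treats an empty qualifier the same as a missing one, which seems the more useful reading when the question is whether the information can actually be reused.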

These concerns have recently prompted the Genomic Standards Con-
