Data integration in microbial genomics ... - Jacobs University
3.2. Introduction
2007, Ratnasingham and Hebert, 2007].
Early on, scientists recognized the necessity to share sequence data
to facilitate reuse, reproducibility and comparisons. This has become
an integral part of the research and publication process. In the
’Bermuda Principles’, formulated at the first international strategy meeting on human
genome sequencing in 1996, it was agreed that all human
genomic sequence information generated by centers funded for large-scale
human sequencing should be freely available in the public domain
to encourage research and to maximize its benefits to society. In the
Fort Lauderdale meeting in 2003, organized by the Wellcome Trust, it
was finally agreed to deposit all kinds of sequencing data that are analyzed
in scientific publications in public databases. Over the past two
decades, the amount of sequence data submitted to the INSDC (International
Nucleotide Sequence Database Collaboration), the world’s largest
public nucleotide sequence data repository, comprising DDBJ (DNA
Data Bank of Japan), ENA (European Nucleotide Archive), and GenBank,
has grown exponentially [Stratton et al., 2009]. More recently, Next
Generation Sequencing (NGS) technologies [Mardis, 2008] have allowed even
faster and more economical sequence generation, resulting in an unprecedented
sequence accumulation.
Despite the impressive magnitude of sequence data generation, numerous
life science studies have shown that contextual (meta)data (CD)
are crucial for their interpretation [DeLong et al., 2006, Fuhrman et al.,
2006, Schriml et al., 2010]. CD are metadata about features such as
the environmental origin and the processing steps that were applied
to obtain the sequences. They range from data about the geographic
location (latitude, longitude), sampling time, and habitat, through the experimental
procedures used to obtain the sequences, to video data recorded during
sampling. However, CD such as latitude and longitude (INSDC:
lat_lon) and time (INSDC: collection_date), which could be submitted
to the public repositories for years, have so far been reported
in only 7.3% and 7.2% of all submissions, respectively [Hankeln et al., 2010]. This strongly
implies that the procedure for depositing these data is hampered. Common
reasons are: 1) no clear descriptors exist to guide submitters on
which metadata should be deposited and 2) no appropriate tools exist
that support the combined submission of sequence data and CD.
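To make the notion of CD more concrete, the following minimal Python sketch shows how a record holding the core fields mentioned above (latitude, longitude, collection date, habitat) might be checked for completeness before submission. The class `ContextualData`, the function `missing_fields`, and the example values are hypothetical and serve illustration only; they are not part of any INSDC tool or submission format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ContextualData:
    """Hypothetical minimal CD record; field names are illustrative only."""
    latitude: Optional[float] = None      # decimal degrees, north positive
    longitude: Optional[float] = None     # decimal degrees, east positive
    collection_date: Optional[date] = None
    habitat: Optional[str] = None         # free text, not checked below


def missing_fields(cd: ContextualData) -> List[str]:
    """Return the core CD fields that are absent or outside their valid range."""
    problems = []
    if cd.latitude is None or not -90.0 <= cd.latitude <= 90.0:
        problems.append("latitude")
    if cd.longitude is None or not -180.0 <= cd.longitude <= 180.0:
        problems.append("longitude")
    if cd.collection_date is None:
        problems.append("collection_date")
    return problems


# Made-up example values, not taken from any real submission.
complete = ContextualData(latitude=54.18, longitude=7.9,
                          collection_date=date(2009, 3, 12),
                          habitat="marine surface water")
print(missing_fields(complete))          # []
print(missing_fields(ContextualData()))  # all three core fields reported
```

A check of this kind addresses only the second reason given above (tool support); which fields count as mandatory would still have to be defined by an agreed set of descriptors.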
These concerns have recently prompted the Genomic Standards Con-