Data integration in microbial genomics ... - Jacobs University
1. Introduction
possibilities like temporal or spatial comparisons are hampered or simply
not possible, and comparability of the data cannot be assured. The
fact that “latitude, longitude, and time, elements of the key contextual
data tuple (x,y,z,t), are only reported in 7.3% and 7.2% of all
submissions” [Hankeln et al., 2010] shows that the majority of public
sequence data is not sufficiently annotated. This problem has been
recognized by the Genomic Standards Consortium (GSC), which “is
an open-membership working body which formed in September 2005.
The goal of this international community is to promote mechanisms
that standardize the description of genomes and the exchange and
integration of genomic data” (www.gensc.org).
The GSC developed a series of checklists to specify which data should
be captured and stored along with sequence data [Field et al., 2008].
The INSDC databases support the storage of these parameters. Recently,
the life science community has begun to develop tools that
implement these standards and to actively integrate these different
data sources. “Data integration is the process of combining data residing
at different sources and providing the user with a unified view
of these data” [Lenzerini, 2002].
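The definition above can be illustrated with a minimal sketch in Python. It assumes two hypothetical sources keyed by accession number: one holding sequence entries and one holding contextual data. The record layout, field names, and the `integrate` function are illustrative assumptions and do not reflect the INSDC schema or any GSC checklist format. The sketch combines both sources into one unified record per sequence and reports whether the contextual tuple (x,y,z,t) is complete.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class UnifiedRecord:
    """One sequence entry joined with its contextual data (hypothetical layout)."""
    accession: str
    sequence_length: int
    latitude: Optional[float] = None    # x
    longitude: Optional[float] = None   # y
    depth_m: Optional[float] = None     # z
    collected: Optional[str] = None     # t, as an ISO 8601 date string

    def has_full_context(self) -> bool:
        # True only if every element of the (x,y,z,t) tuple is present.
        return None not in (self.latitude, self.longitude,
                            self.depth_m, self.collected)


def integrate(sequences: Dict[str, int],
              contexts: Dict[str, dict]) -> List[UnifiedRecord]:
    """Join sequence entries with contextual data on the accession number.

    Sequences without contextual data are kept, with empty context fields,
    so that missing annotation stays visible in the unified view.
    """
    unified = []
    for accession, length in sequences.items():
        ctx = contexts.get(accession, {})
        unified.append(UnifiedRecord(
            accession=accession,
            sequence_length=length,
            latitude=ctx.get("latitude"),
            longitude=ctx.get("longitude"),
            depth_m=ctx.get("depth_m"),
            collected=ctx.get("collected"),
        ))
    return unified
```

Keeping unannotated sequences in the result, rather than dropping them, makes it possible to count how many entries lack the (x,y,z,t) tuple, which is the kind of gap the figures quoted from Hankeln et al. describe.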
Once the data are contextualized, a far greater scope of analyses can be performed.
Studies in various disciplines of life science have already shown the
power of sequence studies enriched with contextual data. In marine microbiology,
conserved diversity patterns were shown
along the depth continuum [DeLong et al., 2006]. Furthermore, annually
recurring diversity patterns could be identified in certain regions
of the ocean [Fuhrman et al., 2006]. In the medical field,
outbreaks of epidemics can be monitored globally [Janies et al.,
2007, Salzberg et al., 2007, Schriml et al., 2010] 14. All these studies
exemplify the potential of globally integrated data.
The tighter the
integration of sequence data with contextual data, the easier
it becomes to carry out sequence data analysis studies in larger
contexts. This offers an approach to answer the basic questions “Who
is out there?”, “How many of which kind?” and “What are they doing?”.
Moreover, it will yield knowledge about the complex
mechanisms of the Earth’s biosphere on the micro and macro scale.
14 There are many more examples that show the increased interpretability of contextualized
sequence data: [Tyson et al., 2004, Sogin et al., 2006, Seshadri et al., 2007, Huber et al., 2007,
Rusch et al., 2007].