Annual Scientific Report 2015
EMBL_EBI_ASR_2015_DigitalEdition
EMBL_EBI_ASR_2015_DigitalEdition
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
European Nucleotide Archive<br />
Our team builds and maintains the European Nucleotide Archive (ENA), an<br />
open, supported platform for the management, sharing, integration, archiving<br />
and dissemination of public-domain sequence data. ENA comprises both the<br />
globally comprehensive data resource that preserves the world’s public-domain<br />
output of sequence data, and a rich portfolio of services that support the research<br />
community in handling and using sequence data.<br />
As nucleotide sequencing becomes increasingly central<br />
to applied areas such as healthcare and environmental<br />
sciences, ENA has become a foundation upon which<br />
scientific understanding may be assembled. Our<br />
users comprise data submitters, data coordinators for<br />
sequence-based studies, direct data consumers and<br />
secondary service providers (e.g., UniProt, RNAcentral,<br />
EBI Metagenomics, Ensembl, Ensembl Genomes,<br />
ArrayExpress) that build on ENA services and content.<br />
By extending and adapting software and services that<br />
started within ENA, we provide technology and datarepository<br />
support beyond ENA, including the CRAM<br />
sequence-data compression technology and the Webin<br />
submissions application.<br />
We nurture communities of scientists and engineers<br />
in building and improving data infrastructure, helping<br />
them incorporate richer, more discoverable content and<br />
enable higher-impact, sequence-based science. The<br />
ENA and its partner databases represent the engine<br />
room of global data sharing in the life sciences, and<br />
our team members help drive progress in our domain.<br />
We focus on data standards driven by domain experts,<br />
technical interoperability and social aspects of data<br />
sharing such as policy and regulatory requirements.<br />
Major achievements<br />
Throughout <strong>2015</strong> we provided services to more than<br />
2000 sequence data submitters. We made the ENA’s<br />
content discoverable and accessible to a user base many<br />
times this size. We improved many of our services,<br />
including the Webin technical infrastructure, which<br />
provides the means for users to transfer and describe<br />
their newly generated data sets. For example, we<br />
made it easier for users to describe the materials they<br />
sequenced, and updated the programmatic interface to<br />
support both tabular and XML formats for metadata<br />
files. We enhanced the ENA website and associated Web<br />
Services through greater data integration and usability.<br />
Improvements to the presentation of assembly data<br />
include options to browse those sets of raw reads that<br />
have been aligned to a given reference. This represents<br />
a major joining together of the reads and sequence<br />
domains of ENA.<br />
We continued to provide support for the global<br />
programmes for which we offer data coordination,<br />
including those handling marine (e.g. Tara Oceans,<br />
Ocean Sampling Day) and, in the context of the<br />
COMPARE project, pathogen-related data.<br />
Pathogen surveillance and genomics<br />
There is growing global enthusiasm for whole-genome<br />
sequencing as a platform for pathogen surveillance.<br />
It offers opportunities to apply a single, affordable<br />
technology across multiple pathogen groups, and<br />
promises to cater for vital analyses spanning public,<br />
animal, food and environmental health/safety areas.<br />
The ENA provides the robust data infrastructure<br />
required for whole-genome sequencing to come to<br />
fruition, supporting the global sharing of sequence data<br />
and providing new technologies, tools and services for<br />
pathogen surveillance. As a partner on the EU-funded<br />
COMPARE project, we are working actively to help<br />
speed up the detection of and response to infectious<br />
disease outbreaks using genome technology.<br />
In <strong>2015</strong> we updated many of the core technologies<br />
required to adapt ENA to support rapid sharing of<br />
pathogen data, for example deploying a dedicated<br />
viral assembly data submission workflow in Webin,<br />
accelerating the validation of incoming data, enhancing<br />
performance in INSDC global data sharing pipelines<br />
and optimising the efficiency of data structures for<br />
the presentation and sharing of assembled bacterial<br />
genomes. We developed data reporting standards for raw<br />
and assembled viral and bacterial pathogen data sets,<br />
such that the descriptive information supplied is well<br />
suited for data discoverability and reuse. We packaged<br />
core technologies, standards and data submission<br />
services into the COMPARE ‘Data Hubs’. These hubs<br />
facilitate the immediate sharing of data amongst<br />
collaborating public-health scientists. Built on the ENA<br />
and the global INSDC system, datasets shared through<br />
these hubs are structured appropriately and described<br />
richly, ready for full public release as soon as any<br />
embargo lifts. To facilitate the systematic interpretation<br />
of pathogen surveillance data flowing into ENA, we<br />
made available an Embassy Cloud compute environment<br />
to support computational workflows.<br />
81<br />
<strong>2015</strong> EMBL-EBI <strong>Annual</strong> <strong>Scientific</strong> <strong>Report</strong>