22.08.2016 Views

Annual Scientific Report 2015

EMBL_EBI_ASR_2015_DigitalEdition

EMBL_EBI_ASR_2015_DigitalEdition

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

European Nucleotide Archive<br />

Our team builds and maintains the European Nucleotide Archive (ENA), an<br />

open, supported platform for the management, sharing, integration, archiving<br />

and dissemination of public-domain sequence data. ENA comprises both the<br />

globally comprehensive data resource that preserves the world’s public-domain<br />

output of sequence data, and a rich portfolio of services that support the research<br />

community in handling and using sequence data.<br />

As nucleotide sequencing becomes increasingly central<br />

to applied areas such as healthcare and environmental<br />

sciences, ENA has become a foundation upon which<br />

scientific understanding may be assembled. Our<br />

users comprise data submitters, data coordinators for<br />

sequence-based studies, direct data consumers and<br />

secondary service providers (e.g., UniProt, RNAcentral,<br />

EBI Metagenomics, Ensembl, Ensembl Genomes,<br />

ArrayExpress) that build on ENA services and content.<br />

By extending and adapting software and services that<br />

started within ENA, we provide technology and datarepository<br />

support beyond ENA, including the CRAM<br />

sequence-data compression technology and the Webin<br />

submissions application.<br />

We nurture communities of scientists and engineers<br />

in building and improving data infrastructure, helping<br />

them incorporate richer, more discoverable content and<br />

enable higher-impact, sequence-based science. The<br />

ENA and its partner databases represent the engine<br />

room of global data sharing in the life sciences, and<br />

our team members help drive progress in our domain.<br />

We focus on data standards driven by domain experts,<br />

technical interoperability and social aspects of data<br />

sharing such as policy and regulatory requirements.<br />

Major achievements<br />

Throughout <strong>2015</strong> we provided services to more than<br />

2000 sequence data submitters. We made the ENA’s<br />

content discoverable and accessible to a user base many<br />

times this size. We improved many of our services,<br />

including the Webin technical infrastructure, which<br />

provides the means for users to transfer and describe<br />

their newly generated data sets. For example, we<br />

made it easier for users to describe the materials they<br />

sequenced, and updated the programmatic interface to<br />

support both tabular and XML formats for metadata<br />

files. We enhanced the ENA website and associated Web<br />

Services through greater data integration and usability.<br />

Improvements to the presentation of assembly data<br />

include options to browse those sets of raw reads that<br />

have been aligned to a given reference. This represents<br />

a major joining together of the reads and sequence<br />

domains of ENA.<br />

We continued to provide support for the global<br />

programmes for which we offer data coordination,<br />

including those handling marine (e.g. Tara Oceans,<br />

Ocean Sampling Day) and, in the context of the<br />

COMPARE project, pathogen-related data.<br />

Pathogen surveillance and genomics<br />

There is growing global enthusiasm for whole-genome<br />

sequencing as a platform for pathogen surveillance.<br />

It offers opportunities to apply a single, affordable<br />

technology across multiple pathogen groups, and<br />

promises to cater for vital analyses spanning public,<br />

animal, food and environmental health/safety areas.<br />

The ENA provides the robust data infrastructure<br />

required for whole-genome sequencing to come to<br />

fruition, supporting the global sharing of sequence data<br />

and providing new technologies, tools and services for<br />

pathogen surveillance. As a partner on the EU-funded<br />

COMPARE project, we are working actively to help<br />

speed up the detection of and response to infectious<br />

disease outbreaks using genome technology.<br />

In <strong>2015</strong> we updated many of the core technologies<br />

required to adapt ENA to support rapid sharing of<br />

pathogen data, for example deploying a dedicated<br />

viral assembly data submission workflow in Webin,<br />

accelerating the validation of incoming data, enhancing<br />

performance in INSDC global data sharing pipelines<br />

and optimising the efficiency of data structures for<br />

the presentation and sharing of assembled bacterial<br />

genomes. We developed data reporting standards for raw<br />

and assembled viral and bacterial pathogen data sets,<br />

such that the descriptive information supplied is well<br />

suited for data discoverability and reuse. We packaged<br />

core technologies, standards and data submission<br />

services into the COMPARE ‘Data Hubs’. These hubs<br />

facilitate the immediate sharing of data amongst<br />

collaborating public-health scientists. Built on the ENA<br />

and the global INSDC system, datasets shared through<br />

these hubs are structured appropriately and described<br />

richly, ready for full public release as soon as any<br />

embargo lifts. To facilitate the systematic interpretation<br />

of pathogen surveillance data flowing into ENA, we<br />

made available an Embassy Cloud compute environment<br />

to support computational workflows.<br />

81<br />

<strong>2015</strong> EMBL-EBI <strong>Annual</strong> <strong>Scientific</strong> <strong>Report</strong>

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!