Annual Scientific Report 2015

Guy Cochrane
European Nucleotide Archive
PhD University of East Anglia, 1999.
At EMBL-EBI since 2002.
Team Leader since 2009.

Quality through standards
The ENA team tracks growth in the uptake of
sequencing and the emergence of innovations in the
field, as these directly impact the growth and evolution
of our services. In 2015 we entered the final stages of a
transition from direct, manual submission processing
to a system that focuses curator input using formally
defined checklists and validation rules. Checklists offer
a structured way to collect information about a sample,
presenting attribute names alongside their definitions,
usage conventions and syntactic rules (e.g. regular
expression, controlled vocabulary). This approach
allows us to optimise datasets for discoverability and
reanalysis across classes of data submission. It also
allows ENA curators to create and edit attributes
efficiently, supports concurrent working on the
system and allows for safety operations such as rollback.
The system makes it possible for a single editing event
to drive a change to an attribute across all checklists in
which it appears, where appropriate. Such normalisation
ensures a consistent experience for the data submitter,
supports the capture of consistent and reliable data and,
ultimately, improves the presentation of search services.
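
The checklist idea described above can be illustrated with a minimal Python sketch. It is not the ENA implementation: the class names, attribute names, vocabulary terms and checklist names are all hypothetical. It only shows attributes that carry a definition plus a syntactic rule (a regular expression or a controlled vocabulary), and how sharing one attribute definition between checklists lets a single edit take effect everywhere the attribute appears.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribute:
    """One attribute definition, shared by every checklist that uses it."""
    name: str
    definition: str
    pattern: Optional[str] = None            # regular-expression rule, if any
    vocabulary: Optional[frozenset] = None   # controlled vocabulary, if any

    def errors(self, value: str) -> list:
        """Return a list of rule violations for one submitted value."""
        problems = []
        if self.pattern is not None and re.fullmatch(self.pattern, value) is None:
            problems.append(f"{self.name}: '{value}' does not match {self.pattern}")
        if self.vocabulary is not None and value not in self.vocabulary:
            problems.append(f"{self.name}: '{value}' is not a permitted term")
        return problems


@dataclass
class Checklist:
    """A named collection of references to shared Attribute objects."""
    name: str
    attributes: list

    def validate(self, sample: dict) -> list:
        """Check that every attribute is present and satisfies its rules."""
        problems = []
        for attr in self.attributes:
            if attr.name not in sample:
                problems.append(f"{attr.name}: missing")
            else:
                problems.extend(attr.errors(sample[attr.name]))
        return problems


# Hypothetical attributes and checklists, for illustration only.
depth = Attribute("depth", "Sampling depth in metres", pattern=r"\d+(\.\d+)?")
env = Attribute("environment", "Broad environment of the sample",
                vocabulary=frozenset({"marine", "soil"}))
marine = Checklist("marine (hypothetical)", [depth, env])
pathogen = Checklist("pathogen (hypothetical)", [env])

sample = {"depth": "12.5", "environment": "freshwater"}
before = marine.validate(sample)   # one error: term not in the vocabulary

# A single edit to the shared attribute affects both checklists at once.
env.vocabulary = env.vocabulary | {"freshwater"}
after = marine.validate(sample)    # no errors after the edit
```

Because both checklists hold a reference to the same `Attribute` object rather than a copy, the vocabulary change is visible to `marine` and `pathogen` without editing either checklist.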

Data compression
We advanced our CRAM reference-based sequence
data compression technology in 2015. We continued to
offer and support CRAM as a public software package
for its broadest possible use, extended the technology
itself and adopted it more deeply across ENA services.
We transitioned to CRAM v. 3; extended the software
to include more effective, faster compression; adopted
new compression codecs; improved the treatment of
unmapped reads; established greater controls on data
integrity under random access; and provided more
support for external tools such as the widely used
htsjdk. We enriched services for CRAM as a core data
format within the Webin and ENA systems, providing
full support across the Webin interfaces for CRAM
submission and systematic reference indexing of all
submitted raw read CRAM data files, which makes these
reads available through genomic coordinate-based queries.
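
To make the idea of reference-based compression concrete, here is a toy Python sketch. It is not the CRAM format or its software: it handles substitutions only and ignores insertions, deletions, quality scores, read names and the compression codecs mentioned above. All function names and the example sequences are hypothetical. It shows only the core principle, namely that a mapped read can be stored as a position plus its differences from the reference, and that the stored coordinates are what make region queries possible.

```python
def encode_read(reference: str, position: int, read: str) -> tuple:
    """Encode a mapped read as (position, length, substitutions).

    Only bases that differ from the reference are stored.
    """
    if position < 0 or position + len(read) > len(reference):
        raise ValueError("read does not fit inside the reference")
    window = reference[position:position + len(read)]
    diffs = [(offset, base)
             for offset, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return position, len(read), diffs


def decode_read(reference: str, record: tuple) -> str:
    """Rebuild the original read from the reference and its stored differences."""
    position, length, diffs = record
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


def overlaps(record: tuple, start: int, end: int) -> bool:
    """True if the read overlaps the half-open reference interval [start, end)."""
    position, length, _ = record
    return position < end and position + length > start


# Hypothetical example data.
reference = "ACGTACGTACGTACGT"
read = "ACGAACGT"                      # differs from the reference at one base
record = encode_read(reference, 4, read)
restored = decode_read(reference, record)
```

Here `record` holds a position, a length and a single substitution instead of eight bases, and `decode_read` recovers the read exactly as long as the same reference is available. The saving grows with read length and with how closely reads match the reference.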

Future plans
In 2016 we will continue to work with user communities
on data standards, for example extending the established
Marine Microbial Biodiversity, Bioinformatics and
Biotechnology (M2B3) standard, including coverage of
aquaculture and blue biotechnology-related studies. We
also expect further work on pathogen-related standards.
We will actively seek to collaborate with further
communities to target coverage gaps, with a view to
having checklist coverage across all classes of incoming
data. Curation of data submissions representing
non-assembly annotated sequence will become a fully
autonomous strand of activity in 2016, which will
complete our transition to having all major submission
workflows operating in a scalable, quality-assured mode.

We will implement specific computational workflows
in the COMPARE Embassy Cloud system, initially
covering bacterial assembly, functional annotation
and typing/resistance profiling. We will further develop
the COMPARE Data Hub concept to allow simpler
user management and more integrated access. We
will begin to construct a data portal for the pathogen
surveillance community, with tailored search, browse
and visualisation tools, and will continue to support data
sharing and analysis efforts around emerging outbreaks.

We will extend the existing ENA system for structured
analysis output data, for example for antimicrobial
drug-resistance profiles and abundance profiles from
ecological studies. This will allow an agile response
to submissions and data presentation for as-yet-unsupported
data types. The system has already been used as
the basis for submission and indexing support for
assembly data and for variation data in the EVA. Extending
it to serve as data infrastructure for EBI Metagenomics
will help us improve submission and retrieval flexibility.

Selected publications
Gibson R, et al. (2016) Biocuration of functional annotation at the European nucleotide archive. Nucleic Acids Res. 44:D58-D66. doi:10.1093/nar/gkv1311
Cochrane G, et al. (2016) The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 44:D48-D50. doi:10.1093/nar/gkv1323
Ten Hoopen P, et al. (2015) Marine microbial biodiversity, bioinformatics and biotechnology (M2B3) data reporting and service standards. Stand Genomic Sci. 10:20. doi:10.1186/s40793-015-0001-5
Ip CL, et al. (2015) MinION Analysis and Reference Consortium: Phase 1 data release and analysis. F1000Res. 4:1075. doi:10.12688/f1000research.7201.1
Mitchell A, et al. (2016) EBI metagenomics in 2016 - an expanding and evolving resource for the analysis and archiving of metagenomic data. Nucleic Acids Res. 44:D595-D603. doi:10.1093/nar/gkv1195

2015 EMBL-EBI Annual Scientific Report