22.08.2016 Views

Annual Scientific Report 2015

EMBL_EBI_ASR_2015_DigitalEdition

EMBL_EBI_ASR_2015_DigitalEdition

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Genomics Technology Infrastructure<br />

Our team is responsible for developing technology to store, process, access and<br />

disseminate genomic annotation. We provide Ensembl and Ensembl Genomes<br />

with its production infrastructure, and maintain database application<br />

programming interfaces (APIs), web services, schemas and data-mining tools.<br />

Genomic annotation, a foundational service provided<br />

by EMBL-EBI, is fundamental to many downstream<br />

programs for interpreting genomic data sets. We provide<br />

a single software stack, empowering our users to explore<br />

multiple datasets on many species, in reference to<br />

several genome assemblies, and ensure a consistent<br />

interface-to-annotation experience with minimal<br />

recoding required.<br />

For the past 15 years, Ensembl has developed APIs for<br />

data access through its MySQL databases. For many<br />

bioinformatics communities, these APIs, including<br />

the highly used Variant Effect Predictor (VEP), are a<br />

vital component of many analysis methods. They are<br />

essential to Ensembl’s genome-annotation pipelines,<br />

allowing our solutions to be rolled out quickly to new<br />

species, and enable the Ensembl website to serve over 4<br />

million users a year.<br />

Our team is also responsible for developing new<br />

methods for data access. Our current focus is on<br />

providing efficient web service APIs to access genomic<br />

annotation. Bioinformatics is no longer a language<br />

dominated by a small set of programing languages; we<br />

can increase the value of our APIs by ensuring they<br />

appeal to as large a consumer base as possible. The<br />

Ensembl REST API, first released in 2012, was our<br />

first foray into providing these types of API and one we<br />

continue to improve upon.<br />

We also provide access to genome annotation via our<br />

public FTP sites and data-mining tool, BioMart. Both<br />

solutions provide ‘de-normalised’ representations of<br />

Ensembl data customised for genome-wide analysis and<br />

batch processing. We build these resources for Ensembl<br />

and Ensembl Genomes using the same codebases,<br />

ensuring consistent, high quality and correct data across<br />

both projects.<br />

BioMart, developed at EMBl-EBI, is a well-used<br />

service, especially by R programmers through the<br />

Bioconductor biomaRt project. EMBL-EBI projects<br />

(e.g. Gene Expression Atlas, UniProt, Reactome) build<br />

their own resources from annotations provided by these<br />

alternative, efficient access methods.<br />

Major achievements<br />

In <strong>2015</strong> we issued a beta release of a new Track Hub<br />

Registry service, which makes community-submitted<br />

datasets more discoverable. The Registry was developed<br />

as part of our BBSRC funded project ProteoGenomics.<br />

Track Hubs enable users visualise genomic annotation<br />

data within a genome browser, using a UCSC Genome<br />

Browser text format to group and configure these<br />

datasets into ‘tracks’. Ensembl has supported Track<br />

Hubs since their use in the ENCODE project, when<br />

they required more user involvement. Initially, it was<br />

necessary to know the location of a dataset, then and<br />

manually attach it to the Ensembl Browser. Now, the<br />

Track Hub Registry enables users to publish Track Hub<br />

locations and provides a comprehensive search interface<br />

(see Figure). We see Track Hubs and the Track Hub<br />

Registry as the natural successor to the Distributed<br />

Annotation System (DAS).<br />

In <strong>2015</strong> use of our REST API continued to grow, serving<br />

over 70 million requests for data. We made REST<br />

available for all Ensembl and Ensembl Genomes species,<br />

and provided access to annotation based on GRCh37.<br />

We released our first Global Alliance for Genomics and<br />

Health (GA4GH) endpoints, which provide access to<br />

Ensembl-hosted variation data. We also expanded the<br />

amount of data returned from a number of endpoints,<br />

providing more comprehensive information about genes<br />

and variations. We improved the performance of our<br />

endpoints, notably of retrieving genomic variations in<br />

protein coordinates and the VEP.<br />

We developed the first set of programmatic interfaces to<br />

access CRAM data from Perl. Our work, funded as part<br />

of the Crop Bioinformatics Initiative from the BBSRC,<br />

enabled CRAM data attachment into Ensembl; making<br />

Ensembl genome browsers among the first to support<br />

the next-generation format.<br />

In <strong>2015</strong> we built new layers in our indexing schemes,<br />

FAIDX and Tabix, providing fast access to FASTA<br />

formatted sequence and tab-delimited data files,<br />

respectively. The library is available from CPAN under<br />

an Apache 2.0 license.<br />

89<br />

<strong>2015</strong> EMBL-EBI <strong>Annual</strong> <strong>Scientific</strong> <strong>Report</strong>

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!