Annual Scientific Report 2015
Genomics Technology Infrastructure
Our team is responsible for developing technology to store, process, access and
disseminate genomic annotation. We provide Ensembl and Ensembl Genomes
with their production infrastructure, and maintain database application
programming interfaces (APIs), web services, schemas and data-mining tools.
Genomic annotation, a foundational service provided
by EMBL-EBI, is fundamental to many downstream
programs for interpreting genomic data sets. We provide
a single software stack, empowering our users to explore
multiple datasets on many species, in reference to
several genome assemblies, and ensure a consistent
interface-to-annotation experience with minimal
recoding required.
For the past 15 years, Ensembl has developed APIs for<br />
data access through its MySQL databases. For many<br />
bioinformatics communities, these APIs, including<br />
the highly used Variant Effect Predictor (VEP), are a<br />
vital component of many analysis methods. They are<br />
essential to Ensembl’s genome-annotation pipelines,<br />
allowing our solutions to be rolled out quickly to new<br />
species, and enable the Ensembl website to serve over 4<br />
million users a year.<br />
Our team is also responsible for developing new<br />
methods for data access. Our current focus is on<br />
providing efficient web service APIs to access genomic<br />
annotation. Bioinformatics is no longer a field
dominated by a small set of programming languages; we
can increase the value of our APIs by ensuring they<br />
appeal to as large a consumer base as possible. The<br />
Ensembl REST API, first released in 2012, was our<br />
first foray into providing these types of API and one we<br />
continue to improve upon.<br />
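
As a rough illustration of why a web service API widens the consumer base, any language with an HTTP client can issue a request like the one sketched below. This is a minimal Python sketch and is not taken from the report: the server name `rest.ensembl.org` and the `lookup/symbol` endpoint are assumptions based on the public Ensembl REST documentation, and the helper names `build_lookup_url` and `fetch_json` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed public server name; the report itself does not state it.
SERVER = "https://rest.ensembl.org"


def build_lookup_url(species, symbol):
    """Build a URL for the lookup/symbol endpoint (hypothetical helper)."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    # Ask for JSON through a query parameter rather than a request header.
    query = urllib.parse.urlencode({"content-type": "application/json"})
    return SERVER + path + "?" + query


def fetch_json(url):
    """Send the GET request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(build_lookup_url("homo_sapiens", "BRCA2"))` would, under these assumptions, return a dictionary describing the gene. Only the standard library is used, so the same pattern carries over to other languages with little change.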
We also provide access to genome annotation via our<br />
public FTP sites and data-mining tool, BioMart. Both<br />
solutions provide ‘de-normalised’ representations of<br />
Ensembl data customised for genome-wide analysis and<br />
batch processing. We build these resources for Ensembl<br />
and Ensembl Genomes using the same codebases,<br />
ensuring consistent, high-quality and correct data across
both projects.<br />
BioMart, developed at EMBL-EBI, is a well-used
service, especially by R programmers through the<br />
Bioconductor biomaRt project. EMBL-EBI projects<br />
(e.g. Gene Expression Atlas, UniProt, Reactome) build<br />
their own resources from annotations provided by these<br />
alternative, efficient access methods.<br />
Major achievements<br />
In 2015 we issued a beta release of a new Track Hub
Registry service, which makes community-submitted<br />
datasets more discoverable. The Registry was developed<br />
as part of our BBSRC-funded project ProteoGenomics.
Track Hubs enable users to visualise genomic annotation
data within a genome browser, using a UCSC Genome<br />
Browser text format to group and configure these<br />
datasets into ‘tracks’. Ensembl has supported Track<br />
Hubs since their use in the ENCODE project, when<br />
they required more user involvement. Initially, it was<br />
necessary to know the location of a dataset and then
manually attach it to the Ensembl Browser. Now, the
Track Hub Registry enables users to publish Track Hub<br />
locations and provides a comprehensive search interface<br />
(see Figure). We see Track Hubs and the Track Hub<br />
Registry as the natural successor to the Distributed<br />
Annotation System (DAS).<br />
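
To make the Track Hub text format more concrete, the sketch below parses the kind of key/value stanzas used in UCSC-style `hub.txt` and `trackDb.txt` files. It is a minimal Python illustration, not the Registry's own code: the function name `parse_stanzas`, the hub name and the `example.org` URLs are all hypothetical, and only a handful of well-known settings (`hub`, `shortLabel`, `longLabel`, `genomesFile`, `email`, `track`, `bigDataUrl`, `type`) are shown.

```python
def parse_stanzas(text):
    """Split Track Hub text into stanzas: one dict per blank-line-separated block."""
    stanzas = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current stanza, if any.
            if current:
                stanzas.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        # Each setting is "key value"; the value may itself contain spaces.
        parts = line.split(None, 1)
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    if current:
        stanzas.append(current)
    return stanzas


# Hypothetical hub description and track configuration, for illustration only.
HUB_TXT = """\
hub exampleHub
shortLabel Example hub
longLabel Example hub used to illustrate the format
genomesFile genomes.txt
email curator@example.org
"""

TRACKDB_TXT = """\
track coverage
bigDataUrl http://example.org/data/coverage.bw
shortLabel Coverage
longLabel Read coverage
type bigWig

track peaks
bigDataUrl http://example.org/data/peaks.bb
shortLabel Peaks
longLabel Called peaks
type bigBed
"""
```

Here `parse_stanzas(HUB_TXT)` yields a single dictionary describing the hub, while `parse_stanzas(TRACKDB_TXT)` yields one dictionary per track, each pointing at a remote data file. A registry can index these fields so that hubs become searchable without the user knowing their location in advance.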
In 2015 use of our REST API continued to grow, serving
over 70 million requests for data. We made REST<br />
available for all Ensembl and Ensembl Genomes species,<br />
and provided access to annotation based on GRCh37.<br />
We released our first Global Alliance for Genomics and<br />
Health (GA4GH) endpoints, which provide access to<br />
Ensembl-hosted variation data. We also expanded the<br />
amount of data returned from a number of endpoints,<br />
providing more comprehensive information about genes<br />
and variations. We improved the performance of our
endpoints, notably those for retrieving genomic variations in
protein coordinates and for the VEP.
We developed the first set of programmatic interfaces to<br />
access CRAM data from Perl. Our work, funded as part<br />
of the Crop Bioinformatics Initiative from the BBSRC,<br />
enabled CRAM data attachment into Ensembl, making
Ensembl genome browsers among the first to support<br />
the next-generation format.<br />
In 2015 we built new layers in our indexing schemes,
FAIDX and Tabix, providing fast access to FASTA<br />
formatted sequence and tab-delimited data files,<br />
respectively. The library is available from CPAN under<br />
an Apache 2.0 license.<br />
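
The reason an index gives fast access to FASTA sequence can be sketched in a few lines. The Python example below is an illustration only and is not the Perl library described above: it assumes the five-column layout used by `.fai` files (name, length, byte offset, bases per line, bytes per line), and the function names `build_fai` and `fetch` are hypothetical.

```python
import io


def build_fai(fasta_bytes):
    """Return {name: (length, offset, line_bases, line_width)} for each record."""
    index = {}
    name = None
    length = offset = line_bases = line_width = 0
    position = 0  # byte position of the start of the current line
    for line in io.BytesIO(fasta_bytes):
        if line.startswith(b">"):
            if name is not None:
                index[name] = (length, offset, line_bases, line_width)
            name = line[1:].split()[0].decode("ascii")
            length = line_bases = line_width = 0
            offset = position + len(line)  # sequence starts after the header
        else:
            bases = len(line.rstrip(b"\r\n"))
            if line_bases == 0:
                # The first sequence line defines the record's line layout.
                line_bases, line_width = bases, len(line)
            length += bases
        position += len(line)
    if name is not None:
        index[name] = (length, offset, line_bases, line_width)
    return index


def fetch(handle, index, name, start, end):
    """Return bases [start, end) (0-based, end-exclusive) with a single seek."""
    length, offset, line_bases, line_width = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Convert base coordinates to byte positions, allowing for line endings.
    first = offset + (start // line_bases) * line_width + start % line_bases
    last = offset + (end // line_bases) * line_width + end % line_bases
    handle.seek(first)
    chunk = handle.read(last - first)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Tiny in-memory FASTA used only for demonstration.
FASTA = b">chr1 test\nACGT\nACGT\nAC\n>chr2\nGGGGCC\nTT\n"
```

With `index = build_fai(FASTA)`, a call such as `fetch(io.BytesIO(FASTA), index, "chr1", 2, 9)` reads only the bytes covering the requested region instead of scanning the whole file, which is the property that makes indexed access practical for genome-sized sequences.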
2015 EMBL-EBI Annual Scientific Report