Annual Scientific Report 2015
EMBL_EBI_ASR_2015_DigitalEdition
EMBL_EBI_ASR_2015_DigitalEdition
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Genome Analysis<br />
The Genome Analysis team is responsible for the development of scalable<br />
solutions for the functional annotation of genomes, with a focus on genetics<br />
and epigenomics. Within Ensembl, we maintain the Variation and Regulation<br />
databases that collect all available data on vertebrate genetic markers and gene<br />
expression regulation.<br />
Whereas the cost of sequencing a genome has dropped,<br />
the downstream analysis is still costly and error<br />
prone. Outside of the coding sequences, which are<br />
comparatively well understood, our knowledge of what<br />
the vast majority of the genome does is still hazy, and<br />
our ability to predict the functional consequence of<br />
most variants approximate. For example, most genetic<br />
markers that are revealed by genome wide association<br />
studies (GWAS) lie outside of coding regions. It is<br />
therefore very difficult to understand the corresponding<br />
disease mechanisms, let alone devise effective<br />
therapeutic strategies to counter them.<br />
The first obstacle that needs to be addressed is easy<br />
access to the data. For a number of practical, commercial<br />
or ethical reasons, many genetic datasets are isolated<br />
in separate databases. It is now clear that to detect<br />
minute genetic associations researchers must pool their<br />
resources so as to create a greater common dataset. It<br />
is similarly clear that to understand the full dynamics<br />
of the epigenome, the world research community needs<br />
to coordinate a very large number of assays. With<br />
this in mind, we are active members in a number of<br />
international collaborations to collect and share data.<br />
Having gained access to the relevant datasets, we<br />
develop technical solutions to efficiently process them,<br />
and cross-examine large numbers of experiments. We<br />
therefore develop specialised infrastructure to process<br />
multidimensional genome-wide datasets. This includes<br />
user-friendly webtools as well as more advanced<br />
command-line tools and micro-services. In particular,<br />
the Variant Effect Predictor (VEP) stands out as the<br />
most popular way of querying Ensembl for a set of<br />
DNA variants.<br />
Downstream of our data and our tools, we deliver<br />
integrative analyses that summarise all the data that<br />
we processed. Our users can rely on these results, with<br />
the knowledge that we integrated as many experiments<br />
as possible to produce the most precise annotation<br />
possible. For example, the Ensembl Regulatory Build is<br />
the first published annotation that integrates ChIP-Seq<br />
and open chromatin experiments from multiple<br />
consortia into a single synthetic summary.<br />
Major achievements<br />
A lot of our efforts were focused on obtaining more<br />
datasets and encouraging healthy data sharing between<br />
research partners. We are active members of the<br />
Global Alliance for Genomics and Health (GA4GH),<br />
an organisation that strives to break the technical and<br />
legal barriers to genomic data sharing across the world.<br />
This year, the Global Alliance finished its first standard<br />
exchange protocols. We also play an important role<br />
in the International Human Epigenome Consortium<br />
(IHEC) and the Functional Annotation of Animal<br />
Genomes (FAANG) consortium, that set the standards<br />
and collect references to epigenomic datasets produced<br />
by all major consortia. Finally, we are collaborating with<br />
the GWAS Catalog project to help collect and curate<br />
available GWAS results.<br />
We also continued to improve access to our data. We<br />
continued to improve the VEP, for example by speeding<br />
it up. We also helped develop and release user-friendly<br />
tools for epigenomics as produced by the BLUEPRINT<br />
consortium, GenomeStats and EpiExplorer. We<br />
also improved the querying of the datasets that we<br />
store locally, for example by acceleration linkage<br />
disequilibrium calculations and displaying them as<br />
Manhattan plots. Cis-regulatory interactions are making<br />
their way into Ensembl, and we are currently able to<br />
display user datasets (see Figure 1).<br />
Finally, we kept abreast of technical development in<br />
the field. Our Ensembl Regulatory Build algorithm was<br />
published in Genome Biology. We collaborated with<br />
Peter Fraser and Mikhail Spivakov of the Babraham<br />
Institute on the downstream analysis of an exciting<br />
technology, called Promoter Capture Hi-C, that produces<br />
high-resolution maps of the 3 dimensional conformation<br />
of the chromatin. We are also working with the CTTV<br />
Epigenome Project, where we compare and contrast cell<br />
line epigenomic data.<br />
Future plans<br />
In the next few years, we aim to greatly expand the<br />
number of species, tissues and cell types covered by<br />
the Ensembl Regulatory Build. Having spent much of<br />
85<br />
<strong>2015</strong> EMBL-EBI <strong>Annual</strong> <strong>Scientific</strong> <strong>Report</strong>