Annual Scientific Report 2015
Vertebrate Annotation<br />
The Vertebrate Annotation team aims to create comprehensive, up-to-date gene<br />
annotation and comparative genomics resources that further our understanding<br />
of biology, evolution and the mechanisms of disease. The data from these resources<br />
are distributed by the Ensembl project, and provide a foundation for clinical and<br />
research communities.<br />
While the reference genome assembly is an increasingly<br />
important tool for research, most scientists working in<br />
this area need to link genomic sequence to biological<br />
function. This is made possible by gene annotations,<br />
which identify the location, structure and expression of<br />
genes. Valuable insights can be gained by comparing the<br />
annotated sequences of individuals of the same species,<br />
or across a wide range of species.<br />
Our team produces reference gene annotation that is<br />
used by clinical, agricultural and research communities<br />
as well as by other data services at EMBL-EBI. Our<br />
gene annotation service provides high-quality gene<br />
sets for human, mouse and almost 100 other vertebrate<br />
species, including key model organisms and farmed<br />
animals. These are the primary annotations used in the<br />
initial genomic analyses of many international genome<br />
projects. In collaboration with GENCODE, we produce<br />
the gold-standard gene annotation for human and<br />
mouse.<br />
For every genome assembly in Ensembl, we also produce<br />
comparative genomics resources that link diverse<br />
species at the DNA and gene level. These data are used<br />
for investigating gene function and evolution, adaptive<br />
traits and conservation biology.<br />
We are responsible for TreeFam, which clusters<br />
similar gene sequences into homologous families and<br />
indicates gene history events such as duplication and<br />
speciation. Our comparative annotations include gene<br />
families, gene orthologues, whole genome multi-species<br />
alignments and conserved genomic regions across<br />
many species.<br />
We develop alignment and annotation methods<br />
that integrate diverse data from the public archives.<br />
We collaborate closely with other service teams at<br />
EMBL-EBI, including the ENA, UniProt, and Expression<br />
Atlas to annotate new assemblies and to update<br />
annotation on existing assemblies as new data<br />
become available.<br />
Major achievements<br />
We are proud to maintain the reference human and<br />
mouse gene sets through our GENCODE collaboration.<br />
In 2015, we improved and updated these gene resources<br />
by annotating assembly updates provided by the<br />
Genome Reference Consortium (GRC), incorporating<br />
new manual annotation, identifying gene alleles across<br />
the various haplotypes, and contributing to the CCDS<br />
project.<br />
We released major updates to the rat and zebrafish<br />
resources. We made new assemblies available for both<br />
species, and produced genome-wide annotation for<br />
them using our evidence-based methods to identify<br />
protein-coding genes, noncoding RNA genes, and<br />
pseudogenes. We produced tissue- and sample-specific<br />
transcript sets from RNA-seq data in the public<br />
archives. We updated the multi-species whole-genome<br />
alignments, gene trees and orthologues in Ensembl to<br />
include these new assemblies.<br />
We extended our gene-annotation methods to include<br />
annotation of lincRNA genes. We applied this method<br />
to human, mouse, rat and sheep and will be producing<br />
lincRNA annotation for more species in 2016.<br />
TreeFam produces phylogenetic trees and orthology<br />
predictions for all Ensembl eukaryotes. The number<br />
of publicly available genomes is increasing rapidly,<br />
providing an opportunity for new insights via<br />
comparative genomics. To achieve scalability, we<br />
designed a novel workflow that will classify protein<br />
sequences from thousands of genomes into gene<br />
families in a quick and robust manner. This workflow is<br />
now partially in production and uses our new library of<br />
Hidden Markov Model (HMM) profiles.<br />
Our comprehensive genome annotations are the<br />
foundation for myriad downstream analysis tools<br />
and research, including the Ensembl Variant Effect<br />
Predictor (VEP). Access to consistent gene annotation<br />
for a wide range of vertebrates is important for<br />
evolutionary studies. Members of our team collaborated<br />
with others in studying gene families in the vervet<br />
(African green monkey) lineage, using freely available<br />
data from ten of our annotation projects.<br />
2015 EMBL-EBI Annual Scientific Report