Annual Scientific Report 2015
EMBL_EBI_ASR_2015_DigitalEdition
EMBL_EBI_ASR_2015_DigitalEdition
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Sequence Families<br />
The Sequence Families team is responsible for the InterPro, Pfam, RNAcentral<br />
and Rfam data resources, and coordinates the EBI Metagenomics project. We<br />
are also responsible for the HMMER web server, which performs extremely fast<br />
protein-homology searches.<br />
101<br />
InterPro integrates protein-family data from 11 major<br />
sources, and classifies hierarchically the different<br />
protein family, domain and functional site definitions<br />
to provide a unified view of diverse data. InterPro<br />
curators build on this, adding functional descriptions<br />
to the families and annotating entire datasets with<br />
structured Gene Ontology terms. The InterProScan<br />
tool allows users to identify InterPro entries on protein<br />
sequences. Pfam generates new protein family entries<br />
and has the largest sequence coverage of all the InterPro<br />
member databases. Both InterPro and Pfam have<br />
many important applications, including the automatic<br />
annotation of proteins for UniProtKB/TrEMBL<br />
and genome/metagenome annotation. HMMER, a<br />
sequence-similarity search tool for identifying distant<br />
relationships, enables users to infer function, annotate<br />
protein sequences and trace evolutionary histories.<br />
Rfam classifies non-coding RNA sequences into<br />
families, using probabilistic models that take account<br />
of both sequence and secondary-structure information,<br />
termed covariance models. Rfam is used to annotate<br />
non-coding RNAs in genome projects and is a major<br />
contributor to RNAcentral, an integrating database of<br />
non-coding RNA sequences that serves as a single entry<br />
point for searching and accessing data from over 20<br />
established RNA resources. RNAcentral gives<br />
users a unified view of all non-coding RNA types from<br />
all organisms.<br />
Metagenomics is the study of the sum of genetic<br />
material found in an environmental sample or host<br />
species, using next-generation sequencing (NGS)<br />
technology. The EBI Metagenomics service enables<br />
researchers to submit sequence data and descriptive<br />
metadata to public nucleotide archives. We help ensure<br />
the data is functionally and taxonomically analysed, and<br />
the results primed for visualisation and download.<br />
Major achievements<br />
InterPro, Pfam and HMMER<br />
In <strong>2015</strong> we issued six releases of InterPro. We integrated<br />
over 2000 new member database signatures, resulting<br />
in over 1800 new InterPro entries. Version 54.0, released<br />
in October, included an update to PANTHER 10.0,<br />
which involved a substantial rebuild of the database and<br />
enabled the addition of almost 60 000 new signatures<br />
and the removal of over 20 000. Although this rebuild<br />
<strong>2015</strong> EMBL-EBI <strong>Annual</strong> <strong>Scientific</strong> <strong>Report</strong><br />
resulted in the loss of over 500 of the 3673 integrated<br />
InterPro entries, the curation team ensured the resource<br />
ultimately comprised over 4600 integrated PANTHER<br />
signatures by the end of the year.<br />
Despite the removal of redundant bacterial proteomes<br />
from UniProtKB (see UniProt Development), InterPro<br />
v55.0 coverage remained substantial (79.7% of<br />
UniProtKB proteins). InterPro also continued to be<br />
a major provider of GO terms, with the latest release<br />
assigning almost 110 million terms to around 35 million<br />
proteins in UniProt release 2016_01.<br />
InterPro’s member databases use diverse analysis<br />
algorithms, and our team collaborates with several<br />
groups to improve analysis throughput. Release 50.0 saw<br />
the incorporation of a new version of PIRSF, which uses<br />
the HMMER3 analysis algorithm. HMMER3 runs about<br />
1000 times faster than HMMER2.0, allowing InterPro<br />
to calculate the growing volumes of UniProtKB match<br />
data in a timely manner. InterPro release 51.0 debuted a<br />
sequence database pre-filtering heuristic to reduce the<br />
time it takes to calculate matches against the HAMAP<br />
database. This sped up our protein-match-generation<br />
process, and bolsters our capacity to handle future<br />
sequence data growth.<br />
We completed the migration of Pfam from the Wellcome<br />
Trust Sanger Institute to EMBL-EBI in 2014, and in<br />
<strong>2015</strong> focused on issuing more frequent releases of<br />
the databases. We changed the underlying database<br />
to UniProt reference proteomes, rather than the<br />
whole of UniProt. Using this subset, which provides<br />
excellent representation of the whole of UniProt, brings<br />
significant efficiencies. Training Pfam probabilistic<br />
models on this representative set of sequences<br />
demonstrated no reduction in sensitivity. Furthermore,<br />
it reduces curation overheads, speeds up production and<br />
reduces the amount of data to present via the website.<br />
Pfam still calculates matches for all UniProt, and these<br />
remain accessible for download.<br />
To streamline and improve our curation processes,<br />
we modified our quality-assurance measures to allow<br />
a limited overlap between Pfam entries. This new<br />
measure actually improves Pfam definitions, and helps<br />
speed up entry curation. In <strong>2015</strong>, these improvements<br />
enabled the production of Pfam releases 28.0 and 29.0<br />
We migrated the HMMER website from Janelia<br />
Research Campus to EMBL-EBI, where it provides