Annual Scientific Report 2015


Sequence Families

The Sequence Families team is responsible for the InterPro, Pfam, RNAcentral and Rfam data resources, and coordinates the EBI Metagenomics project. We are also responsible for the HMMER web server, which performs extremely fast protein-homology searches.

InterPro integrates protein-family data from 11 major sources, and hierarchically classifies the different protein family, domain and functional site definitions to provide a unified view of diverse data. InterPro curators build on this, adding functional descriptions to the families and annotating entire datasets with structured Gene Ontology terms. The InterProScan tool allows users to identify the InterPro entries that match their protein sequences. Pfam generates new protein family entries and has the largest sequence coverage of all the InterPro member databases. Both InterPro and Pfam have many important applications, including the automatic annotation of proteins for UniProtKB/TrEMBL and genome/metagenome annotation. HMMER, a sequence-similarity search tool for identifying distant relationships, enables users to infer function, annotate protein sequences and trace evolutionary histories.

Rfam classifies non-coding RNA sequences into families using covariance models, which are probabilistic models that take account of both sequence and secondary-structure information. Rfam is used to annotate non-coding RNAs in genome projects and is a major contributor to RNAcentral, an integrating database of non-coding RNA sequences that serves as a single entry point for searching and accessing data from over 20 established RNA resources. RNAcentral gives users a unified view of all non-coding RNA types from all organisms.

Metagenomics is the study of the sum of genetic material found in an environmental sample or host species, using next-generation sequencing (NGS) technology. The EBI Metagenomics service enables researchers to submit sequence data and descriptive metadata to public nucleotide archives. We help ensure that the data is functionally and taxonomically analysed, and that the results are primed for visualisation and download.

Major achievements

InterPro, Pfam and HMMER

In 2015 we issued six releases of InterPro. We integrated over 2000 new member database signatures, resulting in over 1800 new InterPro entries. Version 54.0, released in October, included an update to PANTHER 10.0, which involved a substantial rebuild of the database and enabled the addition of almost 60 000 new signatures and the removal of over 20 000. Although this rebuild resulted in the loss of over 500 of the 3673 integrated InterPro entries, the curation team ensured the resource ultimately comprised over 4600 integrated PANTHER signatures by the end of the year.

Despite the removal of redundant bacterial proteomes from UniProtKB (see UniProt Development), InterPro v55.0 coverage remained substantial (79.7% of UniProtKB proteins). InterPro also continued to be a major provider of GO terms, with the latest release assigning almost 110 million terms to around 35 million proteins in UniProt release 2016_01.

InterPro’s member databases use diverse analysis algorithms, and our team collaborates with several groups to improve analysis throughput. Release 50.0 saw the incorporation of a new version of PIRSF, which uses the HMMER3 analysis algorithm. HMMER3 runs about 1000 times faster than HMMER2.0, allowing InterPro to calculate the growing volumes of UniProtKB match data in a timely manner. InterPro release 51.0 debuted a sequence database pre-filtering heuristic to reduce the time it takes to calculate matches against the HAMAP database. This sped up our protein-match-generation process and bolstered our capacity to handle future sequence data growth.

We completed the migration of Pfam from the Wellcome Trust Sanger Institute to EMBL-EBI in 2014, and in 2015 focused on issuing more frequent releases of the database. We changed the underlying sequence database to UniProt reference proteomes, rather than the whole of UniProt. Using this subset, which provides excellent representation of the whole of UniProt, brings significant efficiencies. Training Pfam probabilistic models on this representative set of sequences demonstrated no reduction in sensitivity. Furthermore, it reduces curation overheads, speeds up production and reduces the amount of data to present via the website. Pfam still calculates matches for all of UniProt, and these remain accessible for download.

To streamline and improve our curation processes, we modified our quality-assurance measures to allow a limited overlap between Pfam entries. This new measure improves Pfam definitions and helps speed up entry curation. In 2015, these improvements enabled the production of Pfam releases 28.0 and 29.0.

We migrated the HMMER website from Janelia Research Campus to EMBL-EBI, where it provides
