Annual Scientific Report 2015
EMBL_EBI_ASR_2015_DigitalEdition
Protein Function Development
The work of our team spans several major resources under the umbrella of
UniProt, the comprehensive resource of protein sequences and functional
annotation: the UniProt Knowledgebase, the UniProt Archive and the UniProt
Reference Clusters. We develop software and services for protein information
in the UniProt, Gene Ontology (GO) annotation and enzyme data resources at
EMBL-EBI. We are also responsible for developing tools for UniProt and GO
Annotation (GOA) curation, and for the study of novel, automatic methods for
protein annotation.
Major achievements
The UniProt website facilitates the search,<br />
identification and analysis of gene products. In 2015 our
team released new web interfaces and functionalities, all<br />
built in response to user feedback gathered in a number<br />
of user workshops, usability interviews/sessions,<br />
helpdesk reviews and surveys. We now offer better ways<br />
to customise search and present results. A new UniProt<br />
course in Train online allows users to browse, explore<br />
and analyse the profoundly rich, integrated collection of<br />
protein sequence data in this resource.
Prior to the April 2015 release of UniProt, the UniProt
Knowledgebase (UniProtKB) had doubled in size over<br />
the previous year to over 90 million entries, with a<br />
high level of redundancy. This was especially the case<br />
for bacterial species, where different genomes of the<br />
same bacterium have been sequenced and submitted<br />
independently (e.g. 4080 proteomes for Staphylococcus<br />
aureus, comprising 10.88 million entries). To deal with<br />
this redundancy, we developed a procedure to identify<br />
highly redundant proteomes within species groups. We<br />
implemented this procedure for bacterial species and<br />
the sequences corresponding to redundant proteomes<br />
(approximately 47 million entries) were moved from<br />
UniProtKB to the UniProt Archive (UniParc), where<br />
they are still available. This is the first concerted effort<br />
in a public protein database to deal firmly and effectively<br />
with redundancy in big data.
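
One way to picture such a procedure is sketched below. This is a minimal illustration, not the method UniProt actually used: it assumes each proteome can be represented as a set of sequence identifiers, and it greedily keeps a proteome only when it is not too similar to one already kept. The Jaccard measure, the 0.9 threshold and the names `jaccard` and `split_redundant` are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Fraction of sequences shared by two proteomes (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_redundant(
    proteomes: Dict[str, Set[str]], threshold: float = 0.9
) -> Tuple[List[str], List[str]]:
    """Greedily partition one species group into kept and redundant proteomes.

    Larger proteomes are considered first; a proteome is marked redundant
    when its similarity to any already-kept proteome reaches the threshold.
    (Assumed heuristic for illustration only.)
    """
    kept: List[str] = []
    redundant: List[str] = []
    # Sort by size (descending), then by name so the result is deterministic.
    for name in sorted(proteomes, key=lambda n: (-len(proteomes[n]), n)):
        seqs = proteomes[name]
        if any(jaccard(seqs, proteomes[k]) >= threshold for k in kept):
            redundant.append(name)
        else:
            kept.append(name)
    return kept, redundant


# Toy species group: strain_B is nearly identical to strain_A.
group = {
    "strain_A": {f"seq{i}" for i in range(100)},
    "strain_B": {f"seq{i}" for i in range(2, 100)},
    "strain_C": {f"seq{i}" for i in range(60, 160)},
}
kept, redundant = split_redundant(group)
```

In this toy group, `strain_B` shares 98 of 100 sequences with `strain_A` and is flagged as redundant, while `strain_C` overlaps much less and is kept. In the scheme described above, the flagged proteomes would be the ones moved to UniParc.
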
We released a new version of the UniProt Java API that
addressed several issues, for example frequent library
updates, retrieval speeds and server availability. With
the new API, users can create their own UniProt service,
and query and retrieve sets of proteins of interest, for
instance all records updated in the past few months, or
belonging to a particular family or species.
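
The kinds of selection mentioned above can be illustrated without the Java API itself. The following Python sketch filters a handful of in-memory records by update date, family and species. The `Record` type, its fields, the `select` function and the accessions are hypothetical stand-ins invented for this example; they are not part of any UniProt library.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified protein record (not a UniProt API type)."""
    accession: str
    species: str
    family: str
    last_updated: date


def select(
    records: Iterable[Record],
    updated_since: Optional[date] = None,
    family: Optional[str] = None,
    species: Optional[str] = None,
) -> List[Record]:
    """Return the records matching every criterion that is not None."""
    result = []
    for r in records:
        if updated_since is not None and r.last_updated < updated_since:
            continue
        if family is not None and r.family != family:
            continue
        if species is not None and r.species != species:
            continue
        result.append(r)
    return result


# Made-up example data, for illustration only.
records = [
    Record("X00001", "Homo sapiens", "kinase", date(2015, 3, 1)),
    Record("X00002", "Homo sapiens", "protease", date(2014, 6, 15)),
    Record("X00003", "Mus musculus", "kinase", date(2015, 2, 10)),
]
recent = select(records, updated_since=date(2015, 1, 1))
human_kinases = select(records, family="kinase", species="Homo sapiens")
```

Combining criteria narrows the result: `recent` holds the two records updated in 2015, and `human_kinases` holds only the first record.
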
Our team worked with the UniProt user community<br />
as well as the NCBI RefSeq, Ensembl and<br />
Ensembl Genomes teams to provide a collection of<br />
non-redundant reference proteomes, and to maintain<br />
well-annotated organisms for biomedical and<br />
biotechnological research. New species released in<br />
2015 include Theobroma cacao (cacao / cocoa), Brassica
napus (rapeseed) and Papio anubis (olive baboon),
among others.
In collaboration with genomics resources Ensembl<br />
and COSMIC, we created data links between DNA<br />
sequences and the functional proteins they encode.<br />
Cross-references to specific genomic sequences<br />
are now provided for each protein isoform. We also<br />
began distributing variants with consequences at the<br />
protein level for human and other species, and released<br />
variants from external resources including the Exome<br />
Aggregation Consortium (ExAC) and the Exome<br />
Sequencing Project (ESP) in the protein context.
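
To illustrate what a consequence "at the protein level" means, the sketch below maps a single-nucleotide substitution in a coding sequence to the residue it affects. It assumes the standard genetic code, a forward-strand coding sequence that starts at the first base of the first codon, and a made-up sequence. `protein_consequence` is a hypothetical helper written for this example, not part of any UniProt, Ensembl or COSMIC tool.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def protein_consequence(cds: str, cds_pos: int, alt: str) -> str:
    """Describe the protein-level effect of a substitution in a CDS.

    cds_pos is 1-based within the coding sequence; alt is the new base.
    Returns a string such as 'E2V' (reference residue, residue position,
    new residue), where '*' denotes a stop codon.
    """
    if not 1 <= cds_pos <= len(cds) - len(cds) % 3:
        raise ValueError("position outside complete codons")
    codon_index = (cds_pos - 1) // 3   # 0-based codon number
    offset = (cds_pos - 1) % 3         # position of the base within the codon
    start = codon_index * 3
    ref_codon = cds[start:start + 3].upper()
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    return f"{ref_aa}{codon_index + 1}{alt_aa}"


# Made-up CDS: ATG GAG AAG encodes M E K.
cds = "ATGGAGAAG"
missense = protein_consequence(cds, 5, "T")    # GAG -> GTG
synonymous = protein_consequence(cds, 6, "A")  # GAG -> GAA
nonsense = protein_consequence(cds, 7, "T")    # AAG -> TAG
```

The three calls cover the usual cases: a changed residue, an unchanged residue, and a premature stop. Real pipelines also have to handle strand, introns, isoform-specific coding sequences and indels, which this sketch deliberately ignores.
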
We introduced new genome annotation track files
in two formats, BED and bigBed, which allow users
to map and visualise UniProtKB sequence feature
annotations, including domains, sites and post-translational
modifications, as genome browser tracks. These can be
visualised in Ensembl, the UCSC Genome Browser and NCBI
Genome. This beta release of the UniProt genome annotation
tracks resource contains sequence annotations only for human; other
species will be added in future.
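
As a rough illustration of the BED side of this, the sketch below formats one feature as a six-column BED line. BED uses 0-based, half-open coordinates, so a 1-based inclusive start is decremented by one while the end is left unchanged. The `Feature` type, the `to_bed6` function and the coordinates are invented for the example and do not come from the UniProt track files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical sequence feature already mapped to the genome."""
    chrom: str
    start: int   # 1-based, inclusive genomic start
    end: int     # 1-based, inclusive genomic end
    name: str
    strand: str  # "+" or "-"


def to_bed6(f: Feature, score: int = 0) -> str:
    """Format a feature as a tab-separated BED6 line.

    BED columns: chrom, chromStart (0-based), chromEnd (exclusive),
    name, score (0-1000), strand.
    """
    if f.start < 1 or f.end < f.start:
        raise ValueError("invalid coordinates")
    if f.strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    fields = [f.chrom, str(f.start - 1), str(f.end), f.name, str(score), f.strand]
    return "\t".join(fields)


# Invented coordinates, for illustration only.
domain = Feature("chr1", 1001, 1300, "example_domain", "+")
line = to_bed6(domain)
```

A feature spanning bases 1001 to 1300 becomes `chr1`, `1000`, `1300` in BED terms. bigBed is the indexed binary counterpart of BED, typically generated from a sorted BED file with the UCSC `bedToBigBed` utility.
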
We worked with ProteomeXchange resources such
as PeptideAtlas and MaxQB to provide experimental peptides
from publicly available mass-spectrometry studies for
UniProt proteins in several reference species.
In 2015 our team extended
the functionality of our<br />
automated annotation<br />
system, which assists<br />
in the curation of the<br />
2015 EMBL-EBI Annual Scientific Report