
Annual Scientific Report 2015

Protein Function Development

The work of our team spans several major resources under the umbrella of UniProt, the comprehensive resource of protein sequences and functional annotation: the UniProt Knowledgebase, the UniProt Archive and the UniProt Reference Clusters. We develop software and services for protein information in the UniProt, Gene Ontology (GO) annotation and enzyme data resources at EMBL-EBI. We are also responsible for developing tools for UniProt and GO Annotation (GOA) curation, and for the study of novel, automatic methods for protein annotation.

Major achievements

The UniProt website facilitates the search, identification and analysis of gene products. In 2015 our team released new web interfaces and functionalities, all built in response to user feedback gathered in a number of user workshops, usability interviews/sessions, helpdesk reviews and surveys. We now offer better ways to customise search and present results. A new UniProt course in Train online allows users to browse, explore and analyse the profoundly rich, integrated collection of protein sequence data in this resource.

Prior to the April 2015 release of UniProt, the UniProt Knowledgebase (UniProtKB) had doubled in size over the previous year to over 90 million entries, with a high level of redundancy. This was especially the case for bacterial species, where different genomes of the same bacterium have been sequenced and submitted independently (e.g. 4080 proteomes for Staphylococcus aureus, comprising 10.88 million entries). To deal with this redundancy, we developed a procedure to identify highly redundant proteomes within species groups. We implemented this procedure for bacterial species and the sequences corresponding to redundant proteomes (approximately 47 million entries) were moved from UniProtKB to the UniProt Archive (UniParc), where they are still available. This is the first concerted effort in a public protein database to deal firmly and effectively with redundancy in big data.
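The report does not describe how redundant proteomes are identified, so the following is only a rough sketch of the general idea, not the UniProt procedure. It assumes each proteome can be represented as a set of sequence identifiers, and it swaps in a simple greedy selection based on Jaccard similarity. The function names, the toy strain names and the 0.9 threshold are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_redundant(
    proteomes: Dict[str, Set[str]], threshold: float = 0.9
) -> Tuple[List[str], List[str]]:
    """Greedily pick representative proteomes within one species group.

    Illustrative only: the threshold and the similarity measure are
    assumptions, not values taken from the report.
    """
    kept: List[str] = []
    redundant: List[str] = []
    # Visit larger proteomes first (ties broken by name) so that the most
    # complete ones become the representatives.
    for pid in sorted(proteomes, key=lambda p: (-len(proteomes[p]), p)):
        if any(jaccard(proteomes[pid], proteomes[k]) >= threshold for k in kept):
            redundant.append(pid)  # candidate for moving to an archive
        else:
            kept.append(pid)
    return kept, redundant


# Toy data with made-up strain names and sequence identifiers.
toy = {
    "strain_A": {"s1", "s2", "s3", "s4"},
    "strain_B": {"s1", "s2", "s3", "s4"},
    "strain_C": {"s1", "s8", "s9"},
}
kept, redundant = split_redundant(toy)
print(kept, redundant)
```

In this toy run, `strain_B` duplicates `strain_A` and is flagged as redundant, while `strain_C` shares too little with `strain_A` and is kept.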

We released a new version of the UniProt Java API that addresses several issues, for example frequent library updates, retrieval speeds and server availability. With the new API, users can create their own UniProt service and query and retrieve sets of proteins of interest, for instance all records updated in the past few months, or those belonging to a particular family or species.

Our team worked with the UniProt user community as well as the NCBI RefSeq, Ensembl and Ensembl Genomes teams to provide a collection of non-redundant reference proteomes, and to maintain well-annotated organisms for biomedical and biotechnological research. New species released in 2015 include Theobroma cacao (cacao / cocoa), Brassica napus (rapeseed) and Papio anubis (olive baboon), among others.

In collaboration with genomics resources Ensembl and COSMIC, we created data links between DNA sequences and the functional proteins they encode. Cross-references to specific genomic sequences are now provided for each protein isoform. We also began distributing variants with consequences at the protein level for human and other species, and released variants from external resources including the Exome Aggregation Consortium (ExAC) and the Exome Sequencing Project (ESP) in the protein context.

We introduced new genome annotation track files in two formats, BED and bigBed, which allow users to map and visualise UniProtKB sequence feature annotations, including domains, sites and post-translational modifications, as genome browser tracks. These can be visualised in Ensembl, the UCSC Genome Browser and NCBI Genome. This beta release of the UniProt genome annotation tracks resource contains sequence annotations only for human; other species will be added in future.
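For readers unfamiliar with BED, a minimal sketch of reading one line of such a track file is given here. It relies only on the general BED convention of tab-separated fields, where the first three are the chromosome, a 0-based start and an exclusive end, and an optional fourth field is a feature name. The `BedFeature` type, the helper names and the example line are hypothetical and are not taken from a UniProt track file.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str   # empty string when the optional name field is absent


def parse_bed_line(line: str) -> BedFeature:
    """Parse the first four fields of a tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs chrom, chromStart and chromEnd")
    name = fields[3] if len(fields) > 3 else ""
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name)


def feature_length(feature: BedFeature) -> int:
    """Length in bases; valid because the end coordinate is exclusive."""
    return feature.end - feature.start


# Made-up example line for illustration only.
example = "chr1\t1000\t1090\texample_domain"
feature = parse_bed_line(example)
print(feature.chrom, feature.start, feature.end, feature.name,
      feature_length(feature))
```

bigBed is the indexed binary counterpart of BED and needs a dedicated reader, so it is not covered by this sketch.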

We worked with ProteomeXchange resources such as PeptideAtlas and MaxQB to provide experimental peptides from publicly available mass-spectrometry studies for UniProt proteins for several reference species.

In 2015 our team extended the functionality of our automated annotation system, which assists in the curation of the

2015 EMBL-EBI Annual Scientific Report
