Annual Scientific Report 2015

Protein Sequence Resources

EMBL-EBI provides foundational resources for researchers who work with protein sequences and protein families, including the UniProt, InterPro and Pfam data services and the HMMER homology search tool. We develop and curate UniProt, the universal protein resource, in collaboration with the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). Building on the UniProt Knowledgebase, we provide further resources for exploring and comparing protein families, domains and motifs.

HMMER, a fast, sensitive search tool, helps biologists find sequence relationships deep in evolutionary time. In 2015 we made HMMER algorithms available through a dedicated, open-source website at EMBL-EBI, providing an advanced tool to help researchers infer the function of a protein and its evolutionary history.

In 2015 we re-launched the Enzyme portal, which integrates all information about enzymes from EMBL-EBI resources. Built following user-centred design methodology, the service makes it easier to navigate comprehensive summaries, compare enzymes, run sequence searches and find enzymes by disease, pathway, taxonomy and Enzyme Commission (EC) number.

Reducing redundancy is important for maximising efficiency and quality, but it is a major challenge in data management. In 2015 we implemented a new method for identifying highly redundant proteome datasets and removed them from UniProtKB. The result is a more streamlined, efficient resource.

We also began distributing a new data type: variants with consequences at the protein level. We incorporated variants in the protein context from the Exome Aggregation Consortium (ExAC, hosted by the Broad Institute) and the Exome Sequencing Project (ESP, hosted by the University of Washington). Working with PeptideAtlas at the Institute for Systems Biology in the US and MaxQB in Germany, we released peptide data from mass spectrometry (MS) experiments, mapped to UniProt proteins.

Visualisation was a focus area in 2015. Users can now map and visualise UniProtKB sequence feature annotations, including domains, sites and post-translational modifications. This feature viewer was released in beta in 2015 and made public in early 2016. We also implemented a PSICQUIC server for visualising protein–protein interaction annotations using the open-source Cytoscape software.

In 2015 we helped establish minimum standards for genome annotation, which will make it easier for diverse communities to work with public genome and proteome datasets.

UniProt

UniProt provides a single, centralised, authoritative resource for protein sequences and functional annotation. The UniProt Consortium supports biological research by maintaining a freely accessible, high-quality database that serves as a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces. Our Protein Function teams contribute to several resources, each optimised for a different purpose:

• UniProt Knowledgebase (UniProtKB): the central database of protein sequences, providing accurate, consistent, rich annotations about sequence and function;
• UniProt Archive (UniParc): a stable, comprehensive, non-redundant collection representing the complete body of publicly available protein sequence data;
• UniProt Reference Clusters (UniRef): non-redundant data collections that draw on UniProtKB and UniParc to provide complete coverage of the ‘sequence space’ at multiple resolutions.

www.uniprot.org

InterPro

InterPro is used to classify proteins into families and to predict the presence of domains and functionally important sites. The project integrates signatures from 11 major protein signature databases. InterPro rationalises instances where more than one protein signature describes the same protein family or domain, uniting these into a single InterPro entry and noting relationships between them.
It adds biological annotation and links to external databases such as GO, PDB, SCOP and CATH. InterPro pre-computes all matches of its signatures to UniParc proteins using the InterProScan software, making the data available in a variety of machine-readable formats and via web-based interfaces. This data is updated and incorporated into each UniProtKB release.

InterPro applications include the automatic annotation of proteins for UniProtKB/TrEMBL and genome annotation projects, and large-scale mapping of proteins to GO terms for Ensembl and the GOA project. It forms a core component of the EBI Metagenomics analysis pipeline.

www.ebi.ac.uk/interpro

Pfam

Pfam is a database of protein sequence families. Each Pfam family is represented by a statistical model (a profile hidden Markov model) trained on a curated alignment of representative sequences. These models can be searched against all protein sequences to find occurrences of Pfam families, thereby aiding the identification of evolutionarily related sequences. As homologous proteins are more likely to share structural and functional features, Pfam families can aid the annotation of uncharacterised sequences and guide experimental work.

http://pfam.xfam.org

HMMER

HMMER is a sequence-analysis package that can be used with both protein and nucleotide sequences. At the core of the software is an algorithm that searches one or more probabilistic models (profile hidden Markov models, HMMs) against either a single sequence or a database of sequences. The HMMER website implements this software as a set of fast web services, with both a programmatic interface and graphical user interfaces. Profile HMMs are powerful, allowing users to detect distant evolutionary relationships.

www.ebi.ac.uk/Tools/Hmmer

Protein Function Development
Maria Martin

• Re-launched the Enzyme portal and developed new interfaces and tools for UniProt and QuickGO, with a focus on optimising user interaction with these websites;
• Implemented a method for identifying highly redundant proteomes and removing them from UniProtKB;
• Extended the provision of variants with consequences at the protein level, incorporating variation data from ExAC and the Exome Sequencing Project (ESP);
• Released experimental peptides mapped to UniProt proteins from mass-spectrometry studies in collaboration with PeptideAtlas and MaxQB;
• Extended the scope of the annotation tool Protein2GO and the GO browser QuickGO, and implemented a PSICQUIC server for visualising protein–protein interaction annotations in Cytoscape;
• Implemented the automatic annotation of domains, signal peptides, transmembrane and coiled-coil regions for millions of protein sequences in UniProtKB/TrEMBL.

Protein Function Content
Claire O’Donovan

• In the context of the Consensus Coding Sequence (CCDS) project, ensured the curated, complete synchronisation with the HGNC, which has assigned unique gene symbols and names to 39 000 human loci (19 001 of which are listed as coding for proteins);
• Helped establish minimum standards for genome annotation to enable scientists to exploit complete genome and proteome datasets to their full potential;
• Improved UniProt Automatic Annotation by significantly increasing the number of UniRules, with an emphasis on enzymes across the taxonomic space;
• Secured funding to continue our contribution to the validation of the computational approaches submitted to the Critical Assessment of Function Annotation experiment.

Sequence Families
Rob Finn

• Refactored Pfam to utilise UniProt reference proteomes as the underlying sequence database, streamlining curation and production processes while minimising impact on sensitivity;
• Optimised Pfam quality control to permit minor overlaps between Pfam entries, allowing better modelling of protein families;
• Streamlined production and delivered monthly updates of InterPro data to UniProt for their automatic annotation procedures;
• Integrated a net gain of over 2000 new member database signatures within InterPro, resulting in over 1800 new entries;
• Provided GO terms to UniProt, with the latest release assigning approximately 110 million terms to approximately 35 million proteins in UniProt release 2016_01;
• Migrated the HMMER web services from Janelia Research Campus;
• Expanded HMMER services to include PIRSF HMM searches and support for UniProt reference proteomes, now the default sequence database;
• Issued two releases of Pfam and six releases of InterPro.
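
The idea behind the family models described in the Pfam and HMMER sections, a statistical model built from a curated alignment and then searched against sequences, can be illustrated with a much simpler stand-in. The sketch below is not HMMER or a Pfam model: it builds a position-specific log-odds profile with no insert or delete states, assumes an ungapped alignment, a uniform background and the 20 standard amino-acid letters, and uses a made-up toy alignment. The function names `build_profile` and `best_window_score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one log-odds table per column of an ungapped alignment.

    Assumption: uniform background frequencies (1/20 per residue).
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window_score(profile, sequence):
    """Slide the profile along the sequence; return (best score, start index).

    Returns (-inf, -1) if the sequence is shorter than the profile.
    """
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[aa] for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


# Toy "curated alignment" (invented for illustration only).
toy_alignment = ["ACDEF", "ACDEY", "ACNEF", "GCDEF"]
toy_profile = build_profile(toy_alignment)

hit_score, hit_start = best_window_score(toy_profile, "KKKACDEFKKK")
miss_score, _ = best_window_score(toy_profile, "KKKKKKKKKKK")
print(hit_score > miss_score, hit_start)
```

A sequence containing a family-like segment scores well above an unrelated one. Real profile HMMs additionally model insertions and deletions and use calibrated statistics, which is what allows them to detect much more distant relationships than this toy profile can.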