Proteins and protein families

UniProt, the unified resource of protein sequence and functional information, is closely integrated with Ensembl and Ensembl Genomes and in 2012 generated new reference proteome sets to match their genes in the reference genomes.

As the genome data collections broaden their taxonomic range, UniProt continues to refine its automatic annotation pipelines, as well as improving its tools for manual annotation.

User-experience design has been a recurring theme for our protein resources: the UniProt Development and InterPro teams spent significant time in 2012 creating user-experience-driven interfaces.

After 15 years of overseeing protein sequence resources at EMBL-EBI, Rolf Apweiler welcomed Alex Bateman as his successor in late 2012. Alex and his team bring with them a portfolio of important resources, including Pfam, which will reside under the EMBL-EBI umbrella in 2013.

UniProt

UniProt is a collaboration among EMBL-EBI, the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR) group in the US. Its purpose is to provide the scientific community with a single, centralised, authoritative resource for protein sequences and functional annotation. The consortium supports biological research by maintaining a freely accessible, high-quality database that serves as a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces.

The work of our team spans several major resources under the umbrella of UniProt, each of which is optimised for a different purpose:

• The UniProt Knowledgebase (UniProtKB) is the central database of protein sequences and provides accurate, consistent and rich annotation about sequence and function.
• The UniProt Metagenomic and Environmental Sequences (UniMES) database serves researchers who are exploring the rapidly expanding area of metagenomics, which encompasses both health and environmental data.
• The UniProt Archive (UniParc) is a stable, comprehensive, non-redundant collection representing the complete body of publicly available protein sequence data.
• UniProt Reference Clusters (UniRef) are non-redundant data collections that draw on UniProtKB and UniParc to provide complete coverage of the ‘sequence space’ at multiple resolutions.

InterPro

InterPro is used to classify proteins into families and predict the presence of domains and functionally important sites. The project integrates signatures from 11 major protein signature databases: Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, CATH-Gene3D, PANTHER and HAMAP. During the integration process, InterPro rationalises instances where more than one protein signature describes the same protein family or domain, uniting these into single InterPro entries and noting relationships between them where applicable.

InterPro adds biological annotation and links to external databases such as GO, PDB, SCOP and CATH. It precomputes all matches of its signatures to UniProt Archive (UniParc) proteins using the InterProScan software, and displays the matches to the UniProt Knowledgebase (UniProtKB) in various formats, including XML files and web-based graphical interfaces.

InterPro has a number of important applications, including the automatic annotation of proteins for UniProtKB/TrEMBL and genome annotation projects. InterPro is used by Ensembl and in the GOA project to provide large-scale mapping of proteins to GO terms.

2012 EMBL-EBI Annual Scientific Report

Rolf Apweiler
PhD 1994, University of Heidelberg. At EMBL since 1987. At EMBL-EBI since 1994. Joint Associate Director since 2012.

Summary of progress 2012

Sarah Hunter, InterPro

• Issued five major releases of the InterPro database: created 1756 new entries and integrated 2355 signatures;
• Released a new version of InterProScan;
• Redesigned and re-launched the InterPro website;
• Incorporated a new InterPro search facility based on the central EBI search engine.

Claire O’Donovan, UniProt content

• Continued to manually annotate UniProtKB, with a particular focus on the human and other reference proteomes;
• Collaborated closely with other resources worldwide to ensure comprehensiveness, avoiding duplication of effort and achieving mutually beneficial exchange of data;
• Substantially progressed automatic annotation efforts, achieving a widening of the taxonomic and annotation depth as well as establishing more collaborations with external informatics- and laboratory-oriented groups;
• Increased manual and electronic GO annotation efforts: as of November 2012 there were 127 million GO annotations for 18.9 million UniProtKB entries, covering more than 370 000 taxonomic groups.

Maria-Jesus Martin, UniProt development

• Analysed different interface designs for accessing UniProt data, focusing on user interaction with the website;
• Integrated new species as Reference proteomes in collaboration with Ensembl and Ensembl Genomes to achieve consensus sequence annotation;
• Improved annotation tools (UniRule, Gene Ontology, proteome editors) to support curation of these resources;
• In collaboration with Ensembl, RefSeq and PRIDE, extended the data-import infrastructure to incorporate variation and proteomics data;
• Consolidated software and extended the databases to accommodate a rapidly growing volume of data.