InterPro

Our team co-ordinates the InterPro and Metagenomics projects at EMBL-EBI. InterPro integrates protein data from 11 major sources, classifying them into families and predicting the presence of domains and functionally important sites. InterPro has a number of important applications, including the automatic annotation of proteins for UniProtKB/TrEMBL and genome annotation projects. InterPro is used by Ensembl and in the GOA project to provide large-scale mapping of proteins to GO terms.

Metagenomics is the study of the sum of genetic material found in an environmental sample or host species, typically using next-generation sequencing (NGS) technology. The Metagenomics Portal, a resource established at EMBL-EBI in 2011, enables metagenomics researchers to submit sequence data and associated descriptive metadata to the public nucleotide archives. Deposited data is subsequently functionally analysed using an InterPro-based pipeline, and the results generated are visualised via a web interface.

Major achievements

We redesigned and re-launched the InterPro website in late 2012, and played a key role in the EMBL-EBI website redesign process. We also built a new InterPro search facility that utilises the central EBI search engine. Search results are now much easier to interpret and browse: the engine behaves in a Google-like manner, allowing users to enter wildcards (e.g., * and ?), use logic (AND or NOT), search with single words or phrases and quickly select subsets of the results using faceted filtering. InterPro results are now paginated and highlight the context of the query terms.

The new EMBL-EBI website, which will launch in early 2013, features improved discoverability of InterPro and other resources.
Global EBI search results are shown in categories on local search pages to encourage users to explore the data in different ways.

In 2012 we moved the InterPro DAS and BioMart services to the London Data Centres; the main InterPro website will join them there shortly.

The InterPro database continues to benefit from improved coverage of UniProtKB proteins, increasing to 80.8% in the latest release (v. 40.0). This is partly due to significant data curation and integration efforts, which led to an additional 2355 signatures being incorporated into the database in 2012.

Focussed curation of InterPro2GO term associations led to 334 additional entries being assigned GO terms; 44% of entries now have at least one term associated. The total number of GO mappings has increased by 838, despite a concerted effort to remove terms that are too general (and therefore uninformative) or erroneously mapped. In 2012 we published the first paper describing how this highly utilised annotation resource is created and maintained.

InterProScan5 is poised to take over as the main InterPro scanning software in 2013. Multiple release candidates were made publicly available in 2012, each containing new features and improved implementation.

InterProScan5: release candidate 4 features

• Search all 11 member databases, plus four additional algorithms: Phobius, TMHMM, Coils and SignalPv4;
• Predict potential membership of a protein in a pathway based on InterPro results;
• Use a BerkeleyDB-based protein match look-up service that reduces calculation overheads by only searching sequences not already found in UniProtKB (install this locally or query the EBI-hosted service);
• Use multiple output formats: HTML, GFF3, XML, TSV and SVG;
• Run it ‘out of the box’ on any Linux machine with minimal configuration, and utilise cluster-queuing technologies;
• Handle both protein and nucleotide sequences, with results mapped back to the original sequence.

2012 EMBL-EBI Annual Scientific Report

Sarah Hunter
MSc University of Manchester, 1998. Pharmaceutical and Biotech Industry (Sweden), 1999–2005. At EMBL-EBI since 2005. Team Leader since 2007.

EBI Metagenomics reached 20 public metagenomics projects in 2012, comprising 131 separate samples, in addition to a significant number of privately held studies. In collaboration with the European Nucleotide Archive, we developed a system for the submission of sequence files and minimum-standards-compliant metadata. We expanded the initial analysis pipeline from quality control, clustering, CDS prediction and functional classification steps to include an rRNA prediction step (using rRNAselector) and taxonomic diversity estimation, using the Qiime software. We are investigating Taverna for structuring and managing the complex workflows used in the analysis pipeline (see Figure) and in 2012 developed a utility to integrate Taverna processes with the LSF queue system.

Our work on the organisation and display of data on the website has made it easier for users to access analysis results. In addition, we developed a metagenomics ‘GO slim’ (a subset of GO terms particularly useful to metagenomics) to assist users in their interpretation of function prediction results. The data can be downloaded in a variety of formats, and we have made it possible to download sequences that are functionally classified by the resource or remain of unknown function.

Future plans

To facilitate the move of the InterPro website to the London Data Centres in early 2013, we have re-written the InterPro relational database into a data warehouse structure.
This simplifies the web application code written to access the data, and greatly reduces the amount of down-time experienced by our curation team during release. Together with the official release of InterProScan5, we expect these developments to simplify our data-production processes. InterProScan5 will be used by the EBI-hosted installation, completing the five-year effort to re-architect the InterPro resource.

We are designing and testing new EBI Metagenomics webpages that will help users visualise taxonomic prediction data from a variety of experiment types (i.e., shotgun metagenomics, amplicon-based marker gene analysis, metatranscriptomics). We believe these changes, to be implemented in 2013, will provide a more complete suite of analysis tools, bringing us in line with competing resources. We will transition our pipeline fully into the Taverna software, simplifying maintenance and offering multiple workflows, depending on the environment that has been sequenced. Finally, we will encourage data submission to the repository to increase the coverage of the experiments carried out by the metagenomics community.

Figure. The analysis workflow for a shotgun metagenomics experiment, as processed by EBI Metagenomics.

Selected publications

Burge, S., et al. (2012) Manual GO annotation of predictive protein signatures: the InterPro approach to GO curation. Database (Oxford) 2012, bar068.
Lewis, T.E., et al. (2012) Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains. Nucleic Acids Res 41 (D1), D499-507.
Salazar, G.A., et al. (2012) MyDas, an Extensible Java DAS Server. PLoS One 7, e44180.
Hunter, C., et al. (2012) Metagenomic analysis: the challenge of the data bonanza. Brief Bioinform 13, 743-746.