Annual Scientific Report 2015

Goldman Group

Evolutionary Tools for Genomic Analysis

Nick Goldman: PhD, University of Cambridge, 1992. Postdoctoral work at National Institute for Medical Research, London, 1991-1995, and University of Cambridge, 1995-2002. Wellcome Trust Senior Fellow, 1995-2006.

The diversity of all life has been shaped by its evolutionary history. Our research focuses on the processes of molecular sequence evolution, developing data analysis methods that allow us to exploit this information and glean powerful insights into genomic function, evolutionary processes and phylogenetic history.

To understand the evolutionary relationships between all organisms, it is necessary to analyse molecular sequences with consideration of their underlying structure. This is usually represented by an evolutionary tree indicating the branching relationships of organisms as they diverge from their common ancestors, and showing degrees of genetic difference between them. We develop mathematical, statistical and computational techniques to reveal information from genome data, draw inferences about the processes that gave rise to these interrelationships and make predictions about the biology of the systems whose components are encoded in those genomes.

We develop new evolutionary models and methods, sharing them via stand-alone software and web services, and apply new techniques to interesting biological questions. We participate in comparative genomic studies, both independently and in collaboration with others. Our evolutionary studies involve the analysis of next-generation sequencing (NGS) data, which enables enormous gains in our understanding of genomes but poses many new challenges.

Major achievements

In 2015 we investigated the impact of popular multiple sequence alignment (MSA) programs on reconstruction of ancestral sequences. Many researchers are interested in synthesising proteins of parental or extinct species to study their biochemical properties and compare them with those of their extant relatives. Accurate reconstruction of ancestral states is vital to these studies; however, we discovered that different aligners introduce various biases.

We also studied the impact of different models on estimation of divergence time, as traditional substitution models tend to underestimate genetic distances between distantly related species, introducing large biases. By relaxing the assumption of site-invariant selective pressures, we demonstrated that longer distances are estimated for basal branches of species trees.

We published the results of a long-running comparative study of MSA filtering methods, which automatically identify and remove unreliable alignment sites, including data that may introduce errors into evolutionary analyses. The filtering step has been assumed to lead to better trees; however, our study showed that with most filtering methods and in most circumstances, alignment filtering worsens the resulting trees. These findings have implications for scientists working with difficult-to-align sequences.

We investigated how the presence of gaps in an MSA, for example those introduced by insertions or deletions of genetic material (indels), affects the accuracy of inferred phylogenies. Standard phylogenetic methods do not attempt to model indels, instead treating gaps as missing data. It has been suggested that this could lead to statistical inconsistency, giving rise to an incorrect evolutionary tree even as more and more data are collected. To address this suggestion, we derived a new, simple proof of statistical consistency of maximum likelihood phylogenetic reconstruction for un-gapped alignments. In so doing we showed that the suggested inconsistency only pertains to one, very specific, outlier case.

As most cell divisions introduce a small number of mutations into the genome, it is possible to use single-cell exome/genome sequencing data to infer somatic evolutionary histories of individual cells. These phylogenies, called ‘cell lineage trees’, are useful in developmental biology and cancer research but are difficult to resolve due to very low mutation rates and high sequencing error rates resulting from allelic dropout. The most commonly used algorithm does not distinguish between mutations and sequencing or sample amplification errors, so we developed a scalable algorithm based on a phylogenetic model that explicitly models mutations and sequencing errors as two separate processes. In addition to resolving a long-standing, tractable, mathematical problem, the new method gives more accurate trees than those produced using standard existing methods.

‘Incongruence’, increasingly common with the rise of multi-locus and whole-genome sequencing, occurs when an estimated tree varies depending on the particular set of sequences on which it was built. Incongruence can arise due to stochastic differences between loci, or to different sets of sequences having undergone different processes of evolution. To address the latter case we developed treeCl, a clustering method that groups loci that share a common evolutionary history and distinguishes sets that do not. We tested the method using simulated data, a curated set of yeast proteins and RAD-sequenced DNA loci from globeflower flies (Chiastocheta spp.; Diptera, Anthomyiidae). We improved the curation of the yeast data by identifying several non-orthologous sequences, and produced better-resolved evolutionary trees for the globeflower flies, gaining insights into the underlying causes of the flies’ incongruent trees.

We continued to investigate structural and functional determinants of selective evolutionary constraint in mammals, focusing on the level of genes and domains. To facilitate the analysis of the mode of evolution, we developed a web service that integrates and displays structural information with selective constraints discovered using our Sitewise Likelihood Ratio (SLR) method.

We regularly share our expertise with experimental wet-lab biologists studying specific biological problems, and in one such collaboration in 2015 we contributed analysis of the evolutionary dynamics of DNA regions differentially methylated between different tissues. DNA in blood has more sites that become methylated than DNA in other tissues but, contrary to expectation, we found that this subset of sites shows the same evolutionary patterns as those that exhibit higher fractional methylation.

Our work to re-purpose DNA as a medium for archiving digital information continues to be of great interest worldwide, and in 2015 the Biotechnology and Biological Sciences Research Council (BBSRC) supported our development of the computational and laboratory DNA-handling technologies needed to bring DNA storage closer to market. Extending state-of-the-art methods in coding theory, we began modelling the statistical properties of both the storage medium itself and the errors induced by the “DNA synthesis > processing > storage > sequencing” channel, and developed algorithms to exploit the properties of this channel.
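To make the idea of writing digital data into DNA concrete, the Python sketch below maps bytes to nucleotides so that no base is ever followed by the same base, a property that is often desirable because runs of identical bases are harder to synthesise and sequence reliably. It is a toy illustration only and not the group's coding scheme. The fixed six base-3 digits per byte, the starting base and the function names are all assumptions made for this example, and it includes none of the channel modelling or error correction that a real system needs.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # assumption: 3**6 = 729 >= 256, so six base-3 digits cover one byte


def byte_to_trits(value):
    """Return the fixed-width base-3 digits of one byte, most significant first."""
    digits = []
    for _ in range(TRITS_PER_BYTE):
        value, digit = divmod(value, 3)
        digits.append(digit)
    return digits[::-1]


def encode(data, start="A"):
    """Map bytes to a DNA string in which no base repeats its predecessor."""
    previous = start  # hypothetical virtual base preceding the first real base
    out = []
    for byte in data:
        for trit in byte_to_trits(byte):
            # The three candidates are the bases that differ from the previous one.
            candidates = [b for b in BASES if b != previous]
            previous = candidates[trit]
            out.append(previous)
    return "".join(out)


def decode(dna, start="A"):
    """Invert encode(): recover the base-3 digits, then regroup them into bytes."""
    previous = start
    trits = []
    for base in dna:
        candidates = [b for b in BASES if b != previous]
        trits.append(candidates.index(base))
        previous = base
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


message = b"EBI"
dna = encode(message)
print(dna, decode(dna) == message)
```

Each byte costs six nucleotides here, and a single substitution, insertion or deletion in the DNA string would corrupt the decoded data, which is why modelling the errors of the synthesis and sequencing steps matters in practice.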
We began developing a system that will allow mass storage of data on DNA with proven reliability and guaranteed efficiency at a level comparable to industry norms. Working closely with molecular biologists and DNA synthesis and sequencing specialists, we are developing solutions that will make DNA a viable choice for long-term, reliable and robust digital storage.

Nick Goldman has been at EMBL-EBI since 2002 and an EMBL Senior Scientist since 2009.

Future plans

We are dedicated to using mathematical modelling, statistics and computation to enable biologists to draw as much scientific value as possible from modern molecular sequence data. We will continue to improve and develop new methods for phylogenetic analysis, and new techniques to analyse incomplete datasets. We will apply our cell lineage tree algorithm to single-cell sequencing data, and further develop the method to identify lineage divergences that are extremely difficult to pinpoint by manual analysis.

Past work in the group was among the first in phylogenetics to relate protein sequence evolution to features of the entire evolving protein, rather than assuming that mutations at different locations had independent effects. We will further develop our method to model the evolutionary forces acting on proteins involved in cellular information processing, shifting focus from the 3D structure of the evolving proteins to the interactions between binding pairs of molecules in signalling networks. The method will enable us to infer how evolutionary pressures have impacted the evolution of sequences.

Clinicians are looking to genome sequencing to provide diagnostic aids and inform treatment decisions, for example in determining the correct antibiotic based on rapid determination of pathogen species and strain, or detecting mutations known to impact antibiotic resistance. We believe that state-of-the-art genomic analysis methods can assist clinicians and be further optimised to be fast and accurate.
In collaboration with clinicians who have expertise in diagnostics and treatment policy, we will work on methods for informing their choices based on bacterial whole-genome sequencing. We will analyse the performance of existing methods for detecting antibiotic resistance using limited data sets, producing knowledge of value for linking the latest NGS technologies with the appropriate software for diagnostic and clinical applications.

Selected publications

Lowe R, Slodkowicz G, Goldman N, Rakyan VK (2015) The human blood DNA methylome displays a highly distinctive profile compared with other somatic tissues. Epigenetics 10:274–281

Schwarz RF, et al. (2015) Changes in postural syntax characterize sensory modulation and natural variation of C. elegans locomotion. PLoS Comp Biol 11:e1004322

Schwarz RF, et al. (2015) Spatial and temporal heterogeneity in high-grade serous ovarian cancer: a phylogenetic analysis. PLoS Medicine 12:e1001789

Tan GM, et al. (2015) Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Syst Biol 64:778–791

Truszkowski J, Goldman N (2015) Maximum likelihood phylogenetic inference is consistent on multiple sequence alignments, with or without gaps. Syst Biol doi: 10.1093/sysbio/syv089

2015 EMBL-EBI Annual Scientific Report