Annual Scientific Report 2015
Goldman Group
Evolutionary Tools for Genomic Analysis

The diversity of all life has been shaped by its evolutionary history. Our research
focuses on the processes of molecular sequence evolution, developing data
analysis methods that allow us to exploit this information and glean powerful
insights into genomic function, evolutionary processes and phylogenetic history.

To understand the evolutionary relationships between
all organisms, it is necessary to analyse molecular
sequences with consideration of their underlying
structure. This structure is usually represented by an evolutionary
tree indicating the branching relationships of organisms
as they diverge from their common ancestors, and
showing degrees of genetic difference between them.

We develop mathematical, statistical and computational
techniques to reveal information from genome data,
draw inferences about the processes that gave rise to
these interrelationships and make predictions about
the biology of the systems whose components are
encoded in those genomes. We develop new evolutionary
models and methods, sharing them via stand-alone
software and web services, and apply new techniques
to interesting biological questions. We participate in
comparative genomic studies, both independently and
in collaboration with others. Our evolutionary studies
involve the analysis of next-generation sequencing
(NGS) data, which enables enormous gains in our
understanding of genomes but poses many new challenges.

Major achievements

In 2015 we investigated the impact of popular multiple
sequence alignment (MSA) programs on reconstruction
of ancestral sequences. Many researchers are interested
in synthesising proteins of parental or extinct species
to study their biochemical properties and compare
them with those of their extant relatives. Accurate
reconstruction of ancestral states is vital to these
studies; however, we discovered that different aligners
introduce various biases. We also studied the impact of
different models on estimation of divergence time, as
traditional substitution models tend to underestimate
genetic distances between distantly related species,
introducing large biases. By relaxing the assumption
of site-invariant selective pressures, we demonstrated
that longer distances are estimated for basal branches of
species trees.

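The general reason that model choice affects distance estimates can be illustrated with the simplest textbook case. The sketch below is not the group's model (which relaxes site-invariant selective pressures); it only compares the raw proportion of differing sites with the Jukes-Cantor (JC69) corrected distance, to show how uncorrected or under-corrected estimates fall increasingly short as sequences diverge. The function names are illustrative assumptions.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned, ungapped sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jc69_distance(p: float) -> float:
    """JC69 corrected distance (expected substitutions per site) from p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is undefined for p outside [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # The gap between observed and corrected distance widens with divergence,
    # because multiple substitutions at the same site go unobserved.
    for p in (0.05, 0.30, 0.60):
        print(f"observed p = {p:.2f}, JC69 distance = {jc69_distance(p):.3f}")
```

Under these assumptions the correction is negligible for closely related sequences and large for distant ones, which is why deep (basal) branches are the most sensitive to the choice of model.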
We published the results of a long-running comparative
study of MSA filtering methods, which automatically
identify and remove unreliable alignment sites,
including data that may introduce errors into
evolutionary analyses. The filtering step has been
assumed to lead to better trees; however, our study
showed that with most filtering methods and in most
circumstances, alignment filtering worsens the resulting
trees. These findings have implications for scientists
working with difficult-to-align sequences.

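To make concrete what "filtering" means mechanically, here is a minimal sketch of one very simple criterion: dropping alignment columns whose gap fraction exceeds a threshold. This is an assumption chosen for illustration, not one of the specific methods compared in the study, and the function name and default threshold are hypothetical. Given the finding above, it should be read as a description of the operation, not as a recommendation to apply it.

```python
def filter_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-"):
    """Keep only columns whose fraction of gap characters is <= max_gap_fraction.

    `alignment` maps sequence names to aligned strings of equal length.
    """
    if not alignment:
        return {}
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_seqs = len(alignment)
    n_cols = lengths.pop()

    kept_columns = []
    for col in range(n_cols):
        gaps = sum(1 for seq in alignment.values() if seq[col] in gap_chars)
        if gaps / n_seqs <= max_gap_fraction:
            kept_columns.append(col)

    # Rebuild every sequence from the retained columns only.
    return {
        name: "".join(seq[col] for col in kept_columns)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    msa = {"a": "AC-GT", "b": "A--GT", "c": "ACTG-"}
    print(filter_gappy_columns(msa))
```

Every removed column also removes whatever phylogenetic signal it carried, which is one plausible way such a step can harm rather than help tree inference.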
We investigated how the presence of gaps in an MSA – for
example those introduced by insertions or deletions
of genetic material (indels) – affects the accuracy of
inferred phylogenies. Standard phylogenetic methods
do not attempt to model indels, instead treating gaps
as missing data. To address the suggestion that this
could lead to statistical inconsistency giving rise to an
incorrect evolutionary tree, even as more and more
data are collected, we derived a new, simple proof
of statistical consistency of maximum likelihood
phylogenetic reconstruction for un-gapped alignments.
In so doing we showed that the suggested inconsistency
only pertains to one, very specific, outlier case.

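"Treating gaps as missing data" has a precise meaning in likelihood calculations: a gapped leaf is given a partial-likelihood vector of all ones, so it is compatible with every state. The sketch below illustrates this on a deliberately simplified setting, a star tree under the JC69 model, which is an assumption made to keep the example short; real software applies the same idea within Felsenstein's pruning algorithm on arbitrary trees. All names are illustrative.

```python
import math

STATES = "ACGT"
MISSING = "-?N"


def jc69_prob(i: str, j: str, t: float) -> float:
    """JC69 probability of state i becoming j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def leaf_vector(char: str):
    """One-hot vector for a nucleotide; all ones for a gap (missing data)."""
    if char in MISSING:
        return [1.0] * len(STATES)
    if char not in STATES:
        raise ValueError(f"unexpected character: {char!r}")
    return [1.0 if s == char else 0.0 for s in STATES]


def star_tree_site_likelihood(column: str, branch_lengths) -> float:
    """Likelihood of one alignment column on a star tree under JC69."""
    total = 0.0
    for root_state in STATES:  # sum over the unobserved root state
        term = 0.25  # JC69 stationary frequency of each nucleotide
        for char, t in zip(column, branch_lengths):
            vec = leaf_vector(char)
            term *= sum(
                jc69_prob(root_state, s, t) * vec[k]
                for k, s in enumerate(STATES)
            )
        total += term
    return total


if __name__ == "__main__":
    with_gap = star_tree_site_likelihood("AC-", [0.1, 0.2, 0.3])
    without = star_tree_site_likelihood("AC", [0.1, 0.2])
    print(with_gap, without)  # the gapped leaf contributes a factor of 1
```

Because each row of the transition matrix sums to one, the gapped leaf drops out of the calculation entirely: the column is scored exactly as if that sequence were absent at that site.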
As most cell divisions introduce a small number
of mutations into the genome, it is possible to use
single-cell exome/genome sequencing data to infer
somatic evolutionary histories of individual cells.
These phylogenies, called ‘cell lineage trees’, are useful
in developmental biology and cancer research but
are difficult to resolve due to very low mutation rates
and high sequencing error rates resulting from allelic
dropout. The most commonly used algorithm does
not distinguish between mutations and sequencing or
sample amplification errors, so we developed a scalable
algorithm based on a phylogenetic model that explicitly
models mutations and sequencing errors as two separate
processes. In addition to resolving a long-standing,
tractable, mathematical problem, the new method gives
more accurate trees than those produced using standard
existing methods.

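The idea of keeping mutation and error as two separate processes can be sketched for a single binary site in a single cell. The code below is not the group's algorithm: it is a generic two-layer model in which a hypothetical probability `p_mutated` (standing in for whatever a tree-based mutation model would supply) is combined with assumed false-positive and false-negative (dropout) rates. All parameter names and default values are illustrative assumptions.

```python
def call_given_state(observed: int, true_state: int,
                     fp_rate: float, fn_rate: float) -> float:
    """Error process: P(observed call | true mutation state) for a binary site."""
    if true_state == 0:
        return fp_rate if observed == 1 else 1.0 - fp_rate
    return 1.0 - fn_rate if observed == 1 else fn_rate


def marginal_call_probability(observed: int, p_mutated: float,
                              fp_rate: float = 0.01,
                              fn_rate: float = 0.2) -> float:
    """Combine the mutation process (p_mutated) with the error process.

    Sums over the unobserved true state, so the two processes stay
    separately parameterised.
    """
    return (
        (1.0 - p_mutated) * call_given_state(observed, 0, fp_rate, fn_rate)
        + p_mutated * call_given_state(observed, 1, fp_rate, fn_rate)
    )


if __name__ == "__main__":
    # With a low mutation probability and a non-trivial error rate, a large
    # share of observed "mutant" calls are expected to be errors.
    p_call = marginal_call_probability(1, p_mutated=0.01)
    p_true_given_call = 0.01 * (1.0 - 0.2) / p_call
    print(f"P(call = 1) = {p_call:.4f}, P(truly mutated | call) = {p_true_given_call:.3f}")
```

A method that treats every observed call as a genuine mutation effectively sets both error rates to zero, which is hard to justify when mutations are rare and dropout is common.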
‘Incongruence’, increasingly common with the rise of
multi-locus and whole-genome sequencing, occurs
when an estimated tree varies depending on the
particular set of sequences on which it was built.
Incongruence can arise due to stochastic differences
between loci, or to different sets of sequences having
undergone different processes of evolution. To address
the latter case we developed treeCl, a clustering method
that groups loci that share a common evolutionary
history and distinguishes sets that do not. We tested
the method using simulated data, a curated
set of yeast proteins and RAD-sequenced DNA loci
from globeflower flies (Chiastocheta spp.; Diptera,
Anthomyiidae). We improved the curation of the yeast
data by identifying several non-orthologous sequences,
2015 EMBL-EBI Annual Scientific Report