22.08.2016 Views

Annual Scientific Report 2015

EMBL_EBI_ASR_2015_DigitalEdition

Goldman Group<br />

Evolutionary Tools for Genomic Analysis<br />

The diversity of all life has been shaped by its evolutionary history. Our research<br />

focuses on the processes of molecular sequence evolution, developing data<br />

analysis methods that allow us to exploit this information and glean powerful<br />

insights into genomic function, evolutionary processes and phylogenetic history.<br />

To understand the evolutionary relationships between<br />

all organisms, it is necessary to analyse molecular<br />

sequences with consideration of their underlying<br />

structure. This is usually represented by an evolutionary<br />

tree indicating the branching relationships of organisms<br />

as they diverge from their common ancestors, and<br />

showing degrees of genetic difference between them.<br />

We develop mathematical, statistical and computational<br />

techniques to reveal information from genome data,<br />

draw inferences about the processes that gave rise to<br />

these interrelationships and make predictions about<br />

the biology of the systems whose components are<br />

encoded in those genomes. We develop new evolutionary<br />

models and methods, sharing them via stand-alone<br />

software and web services, and apply new techniques<br />

to interesting biological questions. We participate in<br />

comparative genomic studies, both independently and<br />

in collaboration with others. Our evolutionary studies<br />

involve the analysis of next-generation sequencing<br />

(NGS) data, which enables enormous gains in our<br />

understanding of genomes but poses many new challenges.<br />

Major achievements<br />

In 2015 we investigated the impact of popular multiple<br />

sequence alignment (MSA) programs on reconstruction<br />

of ancestral sequences. Many researchers are interested<br />

in synthesising proteins of parental or extinct species<br />

to study their biochemical properties and compare<br />

them with those of their extant relatives. Accurate<br />

reconstruction of ancestral states is vital to these<br />

studies; however, we discovered that different aligners<br />

introduce various biases. We also studied the impact of<br />

different models on estimation of divergence time, as<br />

traditional substitution models tend to underestimate<br />

genetic distances between distantly related species,<br />

introducing large biases. By relaxing the assumption<br />

of site-invariant selective pressures, we demonstrated<br />

that longer distances are estimated for basal branches of<br />

species trees.<br />
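
The general effect, that a model assuming every site behaves identically can underestimate long distances, can be illustrated with a much simpler stand-in than the models used in the study. The sketch below is an assumption-laden toy: it swaps selective-pressure variation for plain among-site rate variation and compares the Jukes-Cantor distance with its gamma-corrected form. The function names and the shape parameter `alpha` are illustrative only and do not come from the report.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance, assuming all sites evolve at the same rate.

    p is the observed proportion of differing sites between two sequences.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_gamma_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates across sites.

    alpha is the gamma shape parameter (hypothetical value chosen by the
    caller); small alpha means strong rate variation among sites.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


if __name__ == "__main__":
    # Compare estimates for increasingly divergent sequence pairs.
    for p in (0.1, 0.3, 0.5):
        print(p, jc69_distance(p), jc69_gamma_distance(p, alpha=0.5))
```

For the same observed difference, the estimate that allows variation across sites is the larger of the two, and the gap widens as sequences become more divergent. This mirrors, in a simplified setting, the longer basal branches described above.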

We published the results of a long-running comparative<br />

study of MSA filtering methods, which automatically<br />

identify and remove unreliable alignment sites,<br />

including data that may introduce errors into<br />

evolutionary analyses. The filtering step has been<br />

assumed to lead to better trees; however, our study<br />

showed that with most filtering methods and in most<br />

circumstances, alignment filtering worsens the resulting<br />

trees. These findings have implications for scientists<br />

working with difficult-to-align sequences.<br />

We investigated how the presence of gaps in an MSA – for<br />

example those introduced by insertions or deletions<br />

of genetic material (indels) – affects the accuracy of<br />

inferred phylogenies. Standard phylogenetic methods<br />

do not attempt to model indels, instead treating gaps<br />

as missing data. To address the suggestion that this<br />

could lead to statistical inconsistency giving rise to an<br />

incorrect evolutionary tree, even as more and more<br />

data are collected, we derived a new, simple proof<br />

of statistical consistency of maximum likelihood<br />

phylogenetic reconstruction for un-gapped alignments.<br />

In so doing we showed that the suggested inconsistency<br />

only pertains to one, very specific, outlier case.<br />
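
What "treating gaps as missing data" means in a likelihood calculation can be shown with a minimal sketch. It assumes a Jukes-Cantor substitution model and a star-shaped tree with every leaf attached directly to the root, both chosen only to keep the example short. A gap contributes a vector of ones at its leaf, so the site likelihood is the same as if that sequence were absent at that site. All names here are hypothetical and are not taken from any particular phylogenetics package.

```python
import math

BASES = "ACGT"
GAP_CHARS = {"-", "?", "N"}


def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def leaf_vector(char):
    """Partial likelihoods at a leaf; a gap is compatible with every base."""
    if char in GAP_CHARS:
        return [1.0, 1.0, 1.0, 1.0]
    return [1.0 if base == char else 0.0 for base in BASES]


def site_likelihood(leaves):
    """Likelihood of one alignment column on a star tree.

    leaves is a list of (character, branch_length) pairs, each leaf being
    attached directly to the root (a simplifying assumption).
    """
    total = 0.0
    for root_state in range(4):
        term = 0.25  # uniform root frequencies under Jukes-Cantor
        for char, t in leaves:
            probs = jc69_matrix(t)
            vec = leaf_vector(char)
            term *= sum(probs[root_state][j] * vec[j] for j in range(4))
        total += term
    return total


if __name__ == "__main__":
    with_gap = site_likelihood([("A", 0.1), ("C", 0.2), ("-", 0.3)])
    without = site_likelihood([("A", 0.1), ("C", 0.2)])
    print(with_gap, without)
```

Because each row of the transition matrix sums to one, the gapped leaf multiplies the likelihood by exactly one. The indel that produced the gap therefore carries no weight in the calculation, which is the behaviour whose consistency the paragraph above discusses.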

As most cell divisions introduce a small number<br />

of mutations into the genome, it is possible to use<br />

single-cell exome/genome sequencing data to infer<br />

somatic evolutionary histories of individual cells.<br />

These phylogenies, called ‘cell lineage trees’, are useful<br />

in developmental biology and cancer research but<br />

are difficult to resolve due to very low mutation rates<br />

and high sequencing error rates resulting from allelic<br />

dropout. The most commonly used algorithm does<br />

not distinguish between mutations and sequencing or<br />

sample amplification errors, so we developed a scalable<br />

algorithm based on a phylogenetic model that explicitly<br />

models mutations and sequencing errors as two separate<br />

processes. In addition to resolving a long-standing,<br />

tractable, mathematical problem, the new method gives<br />

more accurate trees than those produced using standard<br />

existing methods.<br />
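
Why it helps to model errors separately from mutations can be sketched with a toy calculation. This is not the group's algorithm. It assumes binary mutation calls, a false-positive rate `fp` for amplification or sequencing errors and a false-negative rate `fn` for allelic dropout, all of them hypothetical parameters introduced for illustration.

```python
def observation_likelihoods(observed, fp, fn):
    """Return [P(obs | no mutation), P(obs | mutation)] for one call.

    observed is 1 if a mutation was called at the site and 0 otherwise.
    fp and fn are assumed false-positive and false-negative rates.
    """
    if observed == 1:
        return [fp, 1.0 - fn]
    return [1.0 - fp, fn]


def posterior_mutation(observed, prior, fp, fn):
    """Posterior probability that the cell truly carries the mutation."""
    like_absent, like_present = observation_likelihoods(observed, fp, fn)
    numerator = prior * like_present
    return numerator / (numerator + (1.0 - prior) * like_absent)


if __name__ == "__main__":
    # With a low prior mutation probability, one positive call is weak evidence.
    print(posterior_mutation(1, prior=0.001, fp=0.01, fn=0.2))
```

In a tree-based method, vectors like the one returned by `observation_likelihoods` could take the place of the usual 0/1 vectors at the leaves, so that the substitution process and the error process are kept as two separate components. With a low prior probability of mutation, a single positive call leaves the posterior well below one half, which illustrates why methods that read every call as a real mutation can be misled.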

‘Incongruence’, increasingly common with the rise of<br />

multi-locus and whole-genome sequencing, occurs<br />

when an estimated tree varies depending on the<br />

particular set of sequences on which it was built.<br />

Incongruence can arise due to stochastic differences<br />

between loci, or to different sets of sequences having<br />

undergone different processes of evolution. To address<br />

the latter case we developed treeCl, a clustering method<br />

that groups loci that share a common evolutionary<br />

history and distinguishes sets that do not. We tested<br />

the method using simulated data, a curated<br />

set of yeast proteins and RAD-sequenced DNA loci<br />

from globeflower flies (Chiastocheta spp.; Diptera,<br />

Anthomyiidae). We improved the curation of the yeast<br />

data by identifying several non-orthologous sequences,<br />

2015 EMBL-EBI Annual Scientific Report
