Phylogeny and molecular evolution of green algae - Phycology ...
Missing data
Deep phylogenies require the simultaneous analysis of many characters and many taxa (Delsuc et al.
2005). Individual orthologous genes can be combined into a supermatrix, which inevitably contains a
certain amount of missing data. Many studies have examined the effects of missing data on
phylogenetic reconstruction. A simulation study suggests that the placement of individual taxa in a
tree is robust to large amounts of missing data in the sequences of the taxa in question (up to 50%
under the simulated conditions) and that model-based methods can deal with even greater amounts
of missing data (Wiens 2005). Another simulation study demonstrates that Bayesian analyses are
even more robust to missing data: the phylogenetic position of taxa with 95% missing data in
their sequences is still recovered accurately, as long as the total number of characters in the
dataset is large (Wiens and Moen 2008). Studies of empirical datasets have shown that datasets with
up to 92% missing data can still provide insights into various parts of the tree of life
(Driskell et al. 2004, Philippe et al. 2004, Delsuc et al. 2005).
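The percentages quoted above refer to the share of missing cells in an alignment, either for a single taxon or for the whole supermatrix. As a minimal sketch of how these two quantities could be computed, the Python below uses a made-up toy matrix. The taxon names, the sequences, and the choice to count '?', '-' and 'N' as missing are all assumptions for illustration and are not taken from the studies cited.

```python
# Characters counted as missing (an assumption; some analyses treat gaps differently).
MISSING = set("?-Nn")


def missing_fraction(seq):
    """Fraction of positions in one aligned sequence that are missing."""
    return sum(ch in MISSING for ch in seq) / len(seq)


def summarize_missing(matrix):
    """Return (per-taxon missing fractions, overall missing fraction).

    matrix maps a taxon name to its concatenated, aligned sequence.
    """
    lengths = {len(seq) for seq in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    per_taxon = {taxon: missing_fraction(seq) for taxon, seq in matrix.items()}
    total_cells = sum(len(seq) for seq in matrix.values())
    total_missing = sum(
        sum(ch in MISSING for ch in seq) for seq in matrix.values()
    )
    return per_taxon, total_missing / total_cells


# Hypothetical toy supermatrix: three taxa, twelve aligned positions.
supermatrix = {
    "taxon_A": "ACGTACGTACGT",
    "taxon_B": "ACGT????ACGT",
    "taxon_C": "????????ACGA",
}

per_taxon, overall = summarize_missing(supermatrix)
for taxon, fraction in per_taxon.items():
    print(f"{taxon}: {fraction:.1%} missing")
print(f"whole matrix: {overall:.1%} missing")
```

Reporting both numbers is useful because a matrix with a modest overall proportion of missing data can still contain individual taxa that are almost entirely missing.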
Models of sequence evolution
The General Time Reversible (GTR) model and its simpler variants include one or more parameters to
describe the substitution rates between the different bases. The GTR model uses a set of parameters
to describe the relative substitution rate for each of the six pairs of bases (AC, AG, AT, CG, CT, and
GT). The simpler models only distinguish transitions from transversions or attribute an equal
substitution rate to all possible changes. A second important component of a model is the set of base
frequencies. They can be calculated directly from the dataset ('empirical' base frequencies) or
optimized along with the other parameters of the model. A third common element of the model
allows for variation of evolutionary rate across sites (e.g. different codon positions in protein-coding
genes, loops and stems in ribosomal DNA). Such among-site rate variation is commonly accounted for
by assuming that the site rates follow a gamma distribution and/or by incorporating a proportion of
invariable sites.
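The first two components described above, the six relative substitution rates and the base frequencies, combine into a single instantaneous rate matrix. The sketch below builds such a matrix in plain Python under the standard GTR parameterization, in which the rate from base i to base j is the exchangeability of the pair multiplied by the frequency of j. The function names and all numerical values are hypothetical. The final rescaling to one expected substitution per unit time is a common convention, not something stated in the text. Among-site rate variation is not modelled here.

```python
BASES = "ACGT"
PAIRS = ("AC", "AG", "AT", "CG", "CT", "GT")


def empirical_frequencies(sequences):
    """Base frequencies counted directly from the data (non-ACGT symbols ignored)."""
    counts = {base: 0 for base in BASES}
    for seq in sequences:
        for ch in seq.upper():
            if ch in counts:
                counts[ch] += 1
    total = sum(counts.values())
    return {base: counts[base] / total for base in BASES}


def gtr_rate_matrix(exchangeabilities, freqs):
    """Build a GTR rate matrix Q as a nested dict, q[from_base][to_base].

    exchangeabilities: relative rate for each unordered pair in PAIRS
    freqs: equilibrium frequency of each base (should sum to 1)
    """
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                pair = "".join(sorted(i + j))
                q[i][j] = exchangeabilities[pair] * freqs[j]
        # Each row sums to zero.
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    # Rescale so that the expected rate at equilibrium is 1.
    mean_rate = -sum(freqs[i] * q[i][i] for i in BASES)
    for i in BASES:
        for j in BASES:
            q[i][j] /= mean_rate
    return q


equal_freqs = {base: 0.25 for base in BASES}

# Equal rates and equal frequencies: the simplest variant (Jukes-Cantor).
jc = gtr_rate_matrix({pair: 1.0 for pair in PAIRS}, equal_freqs)

# Transitions (AG, CT) versus transversions, with empirical frequencies
# taken from two made-up sequences.
freqs = empirical_frequencies(["ACGTTGCAAG", "ACGATGCAAT"])
kappa = 4.0  # hypothetical transition/transversion rate ratio
ti_tv = gtr_rate_matrix(
    {pair: (kappa if pair in ("AG", "CT") else 1.0) for pair in PAIRS}, freqs
)
```

Constraining the six exchangeabilities, as in the two calls at the end, is what turns the general model into its simpler variants.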
Partitioning strategies
A supermatrix, a dataset composed of different genes, often demands data partitioning to account
for across-site heterogeneity in evolutionary rate (Delsuc et al. 2005). Therefore, careful attention
has to be paid to the selection of suitable partitioning strategies (Brown and Lemmon 2007, Li et al.
2008, Verbruggen and Theriot 2008). Protein-coding genes usually benefit from partitioning by
codon position. Empirical studies have shown that codon position models perform better than models
that do not take codon position into account (Shapiro et al. 2006). To accommodate
differences in evolutionary rate among partitions, rate multipliers can be used.
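To make the partitioning by codon position and the use of rate multipliers concrete, here is a minimal Python sketch. The gene names, column ranges, and raw rate values are hypothetical, and each gene is assumed to start in frame at its first column. Rescaling the multipliers so that their site-weighted mean equals 1 is one common convention. It is an assumption here, not a description of any particular program.

```python
def codon_position_partitions(gene_ranges):
    """Split each gene into three partitions, one per codon position.

    gene_ranges maps a gene name to (start, end): 1-based, inclusive column
    numbers in the supermatrix. Each gene is assumed to begin at codon
    position 1.
    """
    partitions = {}
    for gene, (start, end) in gene_ranges.items():
        for pos in (1, 2, 3):
            columns = list(range(start + pos - 1, end + 1, 3))
            partitions[f"{gene}_pos{pos}"] = columns
    return partitions


def normalize_multipliers(raw_rates, partition_sizes):
    """Rescale per-partition rates so their site-weighted mean is 1."""
    total_sites = sum(partition_sizes.values())
    weighted_mean = (
        sum(raw_rates[p] * partition_sizes[p] for p in raw_rates) / total_sites
    )
    return {p: raw_rates[p] / weighted_mean for p in raw_rates}


# Hypothetical supermatrix layout: two protein-coding genes, 15 columns.
genes = {"geneA": (1, 9), "geneB": (10, 15)}
partitions = codon_position_partitions(genes)
sizes = {name: len(cols) for name, cols in partitions.items()}

# Made-up relative rates, with third codon positions evolving fastest.
raw = {
    "geneA_pos1": 0.6, "geneA_pos2": 0.3, "geneA_pos3": 2.5,
    "geneB_pos1": 0.8, "geneB_pos2": 0.4, "geneB_pos3": 3.0,
}
multipliers = normalize_multipliers(raw, sizes)

for name, cols in partitions.items():
    print(f"{name}: columns {cols}, rate multiplier {multipliers[name]:.2f}")
```

With multipliers of this kind, branch lengths can be shared across partitions while each partition evolves at its own relative rate.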