12.07.2015 Views

View - ResearchGate


Sybil: Multiple Genome Comparison and Visualization

a. The results did not appear to be overly sensitive to the values chosen (i.e., small changes in the parameter values in the neighborhood of 80% and 0.6 did not produce disproportionately large changes in the composition of the resulting protein clusters).

b. The protein clusters produced were, in the judgment of the curator, a good approximation of the "true" paralogous families in each of the genomes in question.

With respect to condition b it is worth noting that the Jaccard clustering phase of the clustering analysis can serve multiple purposes. Its primary goal is to cluster paralogs within each genome and prevent them from confusing the subsequent bidirectional best hit analysis. However, the Jaccard clustering phase can be viewed more generally as a kind of compression algorithm that eliminates duplicate or near-duplicate polypeptides and their corresponding genes from the data set. In realistic data sets such duplicates can be produced by processes other than recent gene duplication. For example, in one recent project (22) sequencing was performed on genomic DNA sampled from two distinct haplotypes, and in this case the Jaccard clustering was used to collapse the two extremely similar sets of polypeptides into one, which greatly simplified the downstream analyses. Incomplete or erroneously assembled sequence contigs in early versions of draft genomes may also contain small-scale duplications that are artifacts of the assembly process and lead to duplicate gene calls.

9. An earlier version of the clustering algorithm relied solely on the second phase of the clustering process (see Fig. 5), which is acceptable for analyzing compact genomes with relatively little recent gene duplication. But as a bidirectional best hit analysis is easily confounded by the presence of close paralogs, the initial Jaccard clustering phase was introduced and the best hit analysis was modified to run on (Jaccard) clusters instead of individual polypeptides (see Fig. 6).

10. The "highest-scoring" BLASTP match is determined by comparing BLAST E-values. In the case of a tie one of the matches is picked arbitrarily as the "highest-scoring." The exact method for doing this is not important, but it should be deterministic so that the algorithm generates reproducible results. In practice, it should not matter how such ties are broken, because any two polypeptides that match a third equally well are likely to be clustered together by the first phase of the algorithm.

11. A consequence of using connected components is that the clustering of genes from genomes A and B may depend on the other genomes included in the analysis. For example, if genomes A, B, and C are clustered and gene A1 is a reciprocal best hit of B1 but not of C1, and B1 is a reciprocal best hit of C1, then A1, B1, and C1 will be placed in the same cluster. If, however, genome B were not included in the analysis then A1 and C1 would not be clustered. At first glance this may seem to be an undesirable property of the algorithm. However, it is justifiable from a logical standpoint, because if it is believed that A1 and B1 are orthologs and B1 and C1 are orthologs then it follows from the definition of the term that it should also be believed that A1 and C1 are orthologs.

12. As particularly large clusters (in terms of the number of proteins) can take much longer to run through ClustalW, and may even cause the program to (eventually) fail, a parameter for this phase of the analysis allows the ClustalW computation to
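
The behavior described in notes 10 and 11 (deterministic tie-breaking on E-values, and clusters formed as connected components of reciprocal best hits) can be illustrated with a minimal sketch. This is not the Sybil implementation: the function names and gene IDs are hypothetical, the genome is assumed to be encoded as the first character of each gene ID, best hits are assumed to be computed separately for each target genome, and the sketch works on individual polypeptides rather than on Jaccard clusters.

```python
from collections import defaultdict


def best_hits(hits):
    """Map (query, target genome) to the single best-scoring subject.

    `hits` is an iterable of (query, subject, evalue) tuples. The lowest
    E-value wins; ties are broken by subject ID, an arbitrary but
    deterministic rule chosen here so that results are reproducible.
    """
    best = {}
    for query, subject, evalue in hits:
        key = (query, subject[0])       # assumption: first char = genome
        candidate = (evalue, subject)   # tuple order gives the tie-break
        if key not in best or candidate < best[key]:
            best[key] = candidate
    return {key: subject for key, (_, subject) in best.items()}


def reciprocal_pairs(hits):
    """Return the set of gene pairs that are each other's best hit."""
    best = best_hits(hits)
    pairs = set()
    for (query, _genome), subject in best.items():
        if best.get((subject, query[0])) == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


def connected_components(pairs):
    """Group genes linked by any chain of reciprocal best hits (union-find)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for gene in parent:
        groups[find(gene)].add(gene)
    return sorted(sorted(group) for group in groups.values())


def cluster(hits):
    """Cluster genes as connected components of reciprocal best hits."""
    return connected_components(reciprocal_pairs(hits))
```

Given hits in which A1 and B1 are reciprocal best hits, B1 and C1 are reciprocal best hits, and A1 and C1 are not, `cluster` places all three genes in one group. Removing every hit that involves genome B from the same input leaves A1 and C1 unclustered, which is the dependence on the set of included genomes that note 11 describes.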
