12.07.2015 Views

View - ResearchGate


Sybil: Multiple Genome Comparison and Visualization

Fig. 4. Computing JACs. A graph is created in which each pair of proteins with a nonzero Jaccard coefficient is connected by an edge (left panel). Edges labeled with Jaccard coefficients below the default threshold of 0.6 are removed (middle panel). The connected components of the resulting graph are the JACs (JAC1–JAC4 in the right panel). Note that in the current implementation of the algorithm only clusters of size two or greater are reported and stored in the database (i.e., JAC1) and any polypeptide not in one of these clusters is assumed, by convention, to be a cluster of size one.

3.1.4. Generate ClustalW Alignments

1. ClustalW is run on each of the protein clusters (see Note 12) generated by the previous step to produce a set of multiple sequence alignments (17). These alignments are stored alongside the clusters and presented in the Sybil interface as a means to assess the quality of each cluster.

3.1.5. Compute Cluster Summary Scores

The all-vs-all BLASTP results are used to compute two scores for each of the JACs and JOCs. The first score is an average percent identity score and the second is an average coverage score; together these two numbers allow one to make a rapid quantitative assessment of a cluster without having to examine its full ClustalW alignment. The average percent identity score reflects how well conserved the matching regions of the clustered polypeptides are, whereas the average percent coverage score reflects how much of each of the clustered polypeptides matches the others (i.e., how completely the BLASTP GSPs "cover" the clustered proteins). Using only these two scores one can quickly identify the most highly conserved high-confidence clusters: they are those with both a high average percent identity and a high percent coverage score. If, on the other hand, the average percent identity score is very high but the coverage score is relatively low, it may indicate a cluster of polypeptides that share a common motif (or one or more exons, in the case of alternatively spliced transcripts or misannotated genes). Finally, a cluster with a high percent coverage score but a relatively low percent identity score may be a genuine cluster of orthologous genes whose members are only distantly related.
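
The two summary scores and the way they are read together can be illustrated with a minimal Python sketch. It is not the Sybil implementation: the `Hit` record, the function names, the plain averaging over pairwise matches, and the cutoff of 80 are all assumptions made for illustration, since the text does not give the exact formulas or any numeric cutoffs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Hit:
    """One pairwise BLASTP match between two cluster members (hypothetical record)."""
    query: str
    subject: str
    percent_identity: float  # identity over the matching region, 0-100
    query_aligned: int       # residues of the query covered by the match
    subject_aligned: int     # residues of the subject covered by the match


def cluster_scores(hits, lengths):
    """Return (average percent identity, average percent coverage) for one cluster.

    Assumption: both scores are unweighted means over the pairwise matches;
    coverage is taken separately for the query and the subject of each match.
    """
    if not hits:
        raise ValueError("a cluster needs at least one pairwise match")
    identities = [h.percent_identity for h in hits]
    coverages = []
    for h in hits:
        coverages.append(100.0 * h.query_aligned / lengths[h.query])
        coverages.append(100.0 * h.subject_aligned / lengths[h.subject])
    return mean(identities), mean(coverages)


def describe(avg_identity, avg_coverage, high=80.0):
    """Map the two scores onto the cases discussed in the text.

    The cutoff `high` is a hypothetical value chosen only for this example.
    """
    high_identity = avg_identity >= high
    high_coverage = avg_coverage >= high
    if high_identity and high_coverage:
        return "highly conserved, high-confidence cluster"
    if high_identity:
        return "possible shared motif or exon"
    if high_coverage:
        return "possible distantly related orthologs"
    return "inconclusive"  # case not discussed in the text


# Example: two proteins of length 100 and 200 sharing a 100-residue match.
example_hits = [Hit("protA", "protB", 90.0, 100, 100)]
example_lengths = {"protA": 100, "protB": 200}
identity, coverage = cluster_scores(example_hits, example_lengths)
print(identity, coverage, describe(identity, coverage))
```

In this example the match covers all of `protA` but only half of `protB`, so the cluster has high identity and comparatively low coverage, the pattern the text associates with a shared motif or exon.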
