12.07.2015 Views

View - ResearchGate


Sybil: Multiple Genome Comparison and Visualization

3. Methods

3.1. Protein Clustering

3.1.1. “All-vs-All” BLASTP Analysis

1. xdformat is used to create a BLASTP-searchable database of the predicted polypeptide sequences from all of the input genomes: xdformat -p -I -o all-peptides all-peptides.fsa (15). It is assumed that each polypeptide has been assigned a unique identifier and can be related back to the gene of which it is a product.

2. Each of the predicted polypeptide sequences is searched against the database from step 1 with WU-BLASTP (15,16) (see Note 5) and the results are stored for use in subsequent steps (see Note 6): blastp all-peptides pep-1.fsa -E 1e-5 -matrix BLOSUM62 -wordmask none -B 150 -V 150 -gspmax 5 -shortqueryok -novalidctxok -cpus 1 > pep-1-vs-all-blastp.raw.

3.1.2. Clustering Phase 1: Jaccard Coefficient-Based Protein Clustering

The first phase of the protein clustering algorithm is run on each input genome separately. In this phase, a subset of the all-vs-all BLASTP matches is used to compute a Jaccard similarity coefficient (10) for every pair of polypeptides from the same genome. All pairs of polypeptides whose Jaccard coefficient is more than a specified threshold are then subjected to a straightforward graph analysis to determine the resulting clusters. For each input genome:

1. Identify the subset of the BLASTP matches to be used. By default only BLASTP matches with at least 80% sequence identity and an E-value of at most 1 × 10^-5 are used in the subsequent steps (see Note 7).

2. Use the BLASTP matches from step 1 to determine which pairs of polypeptides are “related” to one another; by definition one considers two polypeptides related if either one has a BLASTP match to the other that meets the conditions described in step 1. Every polypeptide is also considered to be related to itself, regardless of whether a BLASTP self-match was found in step 1.

3. Compute and record a Jaccard similarity coefficient for each pair of predicted polypeptides. Fig. 3 illustrates how this is done for a representative pair of polypeptides. For any two polypeptides P1 and P2 the Jaccard similarity coefficient is the ratio of the number of polypeptides (including P1 and P2 themselves) that are related to both P1 and P2 to the number of polypeptides that are related to either P1 or P2. Therefore, the Jaccard similarity coefficient for any pair of polypeptides P1 and P2 is a number between zero and one that reflects how similarly connected P1 and P2 are to the other polypeptides in the same data set (in this case, a single genome).

4. Create a graph (see Fig. 4) in which each node corresponds to one of the polypeptides from the selected input genome, and an edge is drawn between two polypeptides P1 and P2 only if the Jaccard similarity coefficient of P1 and P2 is equal to or more than a predetermined threshold (set to 0.6 by default) (see Note 8).

5. The connected components of the graph generated in step 4, when treated as sets of polypeptides, are referred to as “Jaccard clusters,” or “JACs” for short. These
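The Jaccard clustering steps of Subheading 3.1.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Sybil implementation: the function names (related_sets, jaccard, jaccard_clusters) are hypothetical, the input matches are assumed to be BLASTP hits from a single genome that have already passed the identity and E-value filter of step 1, and every identifier in a match is assumed to appear in the polypeptide list.

```python
from itertools import combinations


def related_sets(polypeptides, matches):
    """Step 2: map each polypeptide to the set of polypeptides related to it.

    `matches` is an iterable of (query, subject) pairs that already passed
    the step 1 filter (an assumption of this sketch).
    """
    # Every polypeptide is related to itself, even without a self-match.
    related = {p: {p} for p in polypeptides}
    for a, b in matches:
        # A match in either direction makes the pair related.
        related[a].add(b)
        related[b].add(a)
    return related


def jaccard(related, p1, p2):
    """Step 3: |related to both| / |related to either|."""
    both = related[p1] & related[p2]
    either = related[p1] | related[p2]
    return len(both) / len(either)


def jaccard_clusters(polypeptides, matches, threshold=0.6):
    """Steps 4 and 5: threshold graph, then its connected components."""
    related = related_sets(polypeptides, matches)

    # Step 4: draw an edge when the coefficient is >= threshold.
    adjacency = {p: set() for p in polypeptides}
    for p1, p2 in combinations(polypeptides, 2):
        if jaccard(related, p1, p2) >= threshold:
            adjacency[p1].add(p2)
            adjacency[p2].add(p1)

    # Step 5: collect connected components with an iterative depth-first search.
    seen = set()
    clusters = []
    for start in polypeptides:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        clusters.append(component)
    return clusters


if __name__ == "__main__":
    peptides = ["A", "B", "C", "D"]          # hypothetical identifiers
    hits = [("A", "B"), ("B", "C")]          # hypothetical filtered matches
    print(jaccard_clusters(peptides, hits))
```

In this toy input, A and C have no direct match but each has a coefficient of 2/3 with B, so all three fall into one component at the default threshold of 0.6, while D remains a singleton. The pairwise loop is quadratic in the number of polypeptides, which is acceptable for a sketch but would likely need restricting to pairs that share at least one related polypeptide for a full genome.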
