12.07.2015 Views

View - ResearchGate


94 Crabtree et al.

genomes, which is to examine their relative protein-coding gene complements. Doing this requires that one make judgments about which of the genes are orthologs, under the assumption that these genes are most likely to have conserved functional roles. Numerous published algorithms deal with the problem of computing clusters of orthologous and paralogous genes (1–6) and such clusters may also be refined or defined manually, with the aid of trained curators (7–9). Although the cluster analysis and display tools in Sybil are largely agnostic with respect to the question of how the proteins are clustered, they have been used primarily with the combination of simple protein clustering techniques described in Subheading 3.1. This is a two-phase heuristic protein clustering method that combines an initial step in which a Jaccard similarity coefficient (10) is calculated for every pair of proteins (see Subheading 3.1.2.), with a second step that performs a bidirectional best hit analysis (see Subheading 3.1.3.) on the clusters generated by the first phase of the algorithm, rather than on individual proteins.

Once protein clusters representing paralogs and/or orthologs have been defined, Sybil provides a web-based interface that allows the cluster data to be explored. At the level of entire genomes the protein clusters are used to support queries about relative gene complements (e.g., clusters which contain at least one representative from genome A and at least one representative from genome B but none from genomes C or D), and to support the generation of multiple-genome comparative figures (see Fig. 1). At the level of individual genes the clusters are used for finer-grained analyses (e.g., enumerate all differences in gene structure that appear to be unique to genome B). Central to this latter, high-resolution view of the protein clusters, is a graphical display that shows each gene in a cluster in its relevant genomic context, with nearby gene clusters highlighted (see Fig. 2). Variants of this basic graphical view are utilized in a number of places in Sybil and the method used to generate this view, which leverages the Bio::Graphics package of Bioperl (http://en.wikipedia.org/wiki/Open_source, http://www.bioperl.org/wiki/History_of_BioPerl) (11), is described in Subheading 3.2.

Other tools display matches between sequences and/or genomes in a similar way (12–14), but the figures produced by Sybil tend to be somewhat simpler and easier to interpret owing to the use of the protein cluster as a “minimum unit” of conservation. Sybil can also make use of the protein clusters to infer the presence of regions of conserved synteny or “syntenic blocks.” A number of tools have been developed in Sybil to identify and visualize such large-scale conserved regions and how they are rearranged between genomes (21; Fig. 2 and 21; color plate no. 1). However, these are beyond the scope of this chapter.

It should be noted that although the current system relies on certain software packages, programming libraries, data exchange formats, languages, and databases, these choices are largely incidental to the protein clustering method
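Two of the ideas in this passage can be made concrete with a short Python sketch. It is illustrative only and is not Sybil's implementation. The function names, the cluster data layout, and the example identifiers are all hypothetical. In particular, the passage does not say which sets the Jaccard coefficient is computed over (that detail is deferred to Subheading 3.1.2), so the sketch assumes each protein is represented by the set of proteins it matches in a similarity search.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity coefficient: |a & b| / |a | b|.

    Assumption: a and b are the sets of proteins matched by two
    proteins in a similarity search; the text above does not define them.
    """
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def select_clusters(
    clusters: Dict[str, List[Tuple[str, str]]],
    present: Set[str],
    absent: Set[str],
) -> List[str]:
    """Return ids of clusters that contain at least one protein from every
    genome in `present` and no protein from any genome in `absent`.

    `clusters` maps a cluster id to (genome, protein_id) pairs; this
    layout is a hypothetical one chosen for the example.
    """
    selected = []
    for cluster_id, members in clusters.items():
        genomes = {genome for genome, _ in members}
        if present <= genomes and not (absent & genomes):
            selected.append(cluster_id)
    return selected


# Hypothetical example data: four clusters over genomes A, B and C.
example_clusters = {
    "c1": [("A", "a1"), ("B", "b1")],
    "c2": [("A", "a2"), ("B", "b2"), ("C", "c7")],
    "c3": [("A", "a3")],
    "c4": [("A", "a4"), ("A", "a5"), ("B", "b9")],
}

# The query from the text: in A and in B, but in neither C nor D.
print(select_clusters(example_clusters, {"A", "B"}, {"C", "D"}))

# Two proteins sharing two of three distinct matches.
print(jaccard({"a1", "b1", "c1"}, {"a1", "b1"}))
```

Representing each cluster by the set of genomes it covers reduces the gene-complement query to a subset test and an intersection test, so any combination of required and excluded genomes can be expressed the same way.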
