13.07.2015 Views

The Genom of Homo sapiens.pdf


chromosome 7q31.3 (Thomas et al. 2003) and 7q11.23, respectively (Fig. 2; also see http://www.nisc.nih.gov/data). Together these regions span >8 Mb in the human genome and contain 52 known genes; they are also notably different with respect to their general sequence composition (e.g., GC content, amount of repetitive sequences, and proportion of coding sequence).

The sequences and associated analyses emanating from the NISC Comparative Sequencing Program reflect increasingly complex data sets, especially with respect to the number of vertebrate species represented and the various comparisons being performed. For visualizing and disseminating these data, we have collaborated with our colleagues at the University of California, Santa Cruz (UCSC) to facilitate their development of a “zoo browser” component of the UCSC Genome Browser (Kent et al. 2002; Karolchik et al. 2003; Thomas et al. 2003; also see http://genome.ucsc.edu). A sample display of this browser for a small portion of the 7q31.3 target is shown in Figure 3. In addition to providing convenient access to the assembled sequences for each species, the browser has been customized to display the results of pair-wise sequence alignments (which can be performed using any species’ sequence as the reference) and other comparative analyses (see below). By direct integration with the standard UCSC Genome Browser environment, such data can then be viewed within the context of the ever-growing collection of annotations and associated information about the human genome sequence.

MULTISPECIES CONSERVED SEQUENCES

The recent comparative analysis between the human and mouse genome sequences (Waterston et al. 2002) represents the most detailed comparison of vertebrate genomes performed to date. Several findings from this effort are particularly relevant to our multispecies sequencing program. First, roughly 40% of the mouse genome forms sequence alignments with the human genome using established alignment methods (e.g., blastz [Schwartz et al. 2003a]; see the human–mouse alignments depicted in Fig. 3). Second, only about 5% of the mammalian genome is estimated to be actively conserved (Waterston et al. 2002; Roskin et al. 2003), and this consists of roughly 1.5% that is protein coding and 3.5% that is noncoding. At present, the specific bases that constitute this 5% are not known. Thus, a critical challenge is to develop strategies for identifying this small fraction of actively conserved sequence, since it presumably reflects the bulk of the functionally important portion of the human genome. However, simple pair-wise comparisons of human and mouse sequences are ineffective at identifying the correct subset of actively conserved sequence in the mammalian genome (Thomas et al. 2003).

To broaden the above human–mouse sequence comparisons, we have performed similar pair-wise alignments between human and various other species’ sequences (Fig. 3), revealing a number of interesting findings. First, primate sequences are highly similar to the human sequence, as expected. Second, alignments between the human sequence and that of the other mammals show patterns similar to that seen with mouse; however, the sequence conservation with human is generally higher for most nonrodent placental mammals compared to rodents (Thomas et al. 2003). Alignments between human and fish sequences are almost exclusively confined to coding regions. Interestingly, marsupial, monotreme, and chicken sequences show significantly fewer align

Figure 2. Features of two targeted genomic regions sequenced in multiple species. The two indicated regions of human chromosome 7 were subjected to multispecies comparative sequencing (see http://www.nisc.nih.gov for details). The general features of each region are listed, with the indicated numbers corresponding to the human reference sequence. “No. Known Genes” is the number of annotated genes with defined coding sequences in each region. “Total Mb Sequenced” reflects the amount of sequence data generated in aggregate from the indicated number of different species. “Percent GC” is the average GC content (calculated in nonoverlapping 1-kb windows across each region). Also indicated is information about the MCSs detected in each region (see text). For the 7q31.3 region, the MCSs were identified using the sequences from a subset of 13 species (among those excluded were all primates, tetraodon, and zebrafish); for the 7q11.23 region, sequences from all 12 species were used to detect MCSs. The last four rows reflect the percent of MCS bases that overlap coding, UTRs, ARs, and noncoding sequence, respectively.

                               7q11.23    7q31.3
General Features
  Size (Mb)                    6.3        1.8
  No. Known Genes              42         10
  No. Species Sequenced        12         25
  Total Mb Sequenced           29         31
  Percent Coding               0.9        1.2
  Percent UTR                  0.5        0.6
  Percent Repeats              55.1       40.3
  Percent GC                   49.1       38.4
Detected MCSs
  No.                          4,520      1,572
  Average Length (bp)          70         60
  Percent of Total Sequence    5          5
  Percent Coding               16         22
  Percent UTR                  3.4        4
  Percent AR                   3.6        3.6
  Percent Non-Coding           77         70
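The Figure 2 legend describes “Percent GC” as an average GC content calculated in nonoverlapping 1-kb windows across each region. The legend does not give further detail, so the following is only a minimal sketch of one way such a value could be computed. The function names (gc_fraction, windowed_gc, mean_percent_gc) are hypothetical, and three choices are assumptions rather than anything stated in the source: ambiguous bases such as N are ignored, a trailing window shorter than the window size is dropped, and the per-window values are averaged with equal weight.

```python
def gc_fraction(seq):
    """Fraction of G/C among called bases (A, C, G, T); None if there are none.

    Assumption: ambiguous bases (e.g. N) are excluded from the denominator.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    return gc / called if called else None


def windowed_gc(seq, window=1000):
    """GC fraction of each full, nonoverlapping window of `window` bases.

    Assumption: a trailing partial window is dropped, and windows with no
    called bases are skipped.
    """
    values = []
    for start in range(0, len(seq) - window + 1, window):
        frac = gc_fraction(seq[start:start + window])
        if frac is not None:
            values.append(frac)
    return values


def mean_percent_gc(seq, window=1000):
    """Equal-weight mean of the per-window GC fractions, as a percentage."""
    values = windowed_gc(seq, window)
    if not values:
        return None
    return 100.0 * sum(values) / len(values)


if __name__ == "__main__":
    # Toy 2-kb sequence: one all-GC window followed by one all-AT window.
    toy = "GGCC" * 250 + "ATAT" * 250
    print(mean_percent_gc(toy))
```

On the toy input the two windows have GC fractions of 1.0 and 0.0, so the windowed mean is 50%. For a real region, the sequence would be read from the assembled reference for that interval; a different treatment of partial windows or ambiguous bases would shift the result slightly.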
