13.07.2015 Views

The Genom of Homo sapiens.pdf

The Genom of Homo sapiens.pdf

The Genom of Homo sapiens.pdf

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

PREDICTING GENE REGULATORY REGIONS 337count for at most 2% <strong>of</strong> the human genome, and they areobviously under selective constraint. <strong>The</strong> other 38% <strong>of</strong>the human genome that does not code for protein but stillaligns with mouse should include gene regulatory sequencesand other functional noncoding sequences. However,these alignments also include much neutral DNA;e.g., about one-fourth <strong>of</strong> all the ancestral repeats in humansalign with orthologs in mouse. All the sequencesthat align between mouse and human are conserved in thesense that they are present in both species, but the goal isto identify the sequences that are subject to purifying selection.It is the latter sequences that are playing a role insome conserved function.A major complication to answering this question is thatthe rate <strong>of</strong> neutral evolution varies across the genome.<strong>The</strong> distribution <strong>of</strong> nucleotide substitutions per site in ancestralrepeats (computed on 1-Mb nonoverlapping windows)is quite wide (Fig. 1), reflecting substantial regionalvariation in the underlying neutral substitutionrate. In addition, the amount <strong>of</strong> DNA inferred to bedeleted from mouse and the amount <strong>of</strong> transposable elementsinserted and retained show substantial variation(Fig. 1). Furthermore, the amounts <strong>of</strong> neutral substitution,deletion, insertion (<strong>of</strong> LTR repeats), recombination, andsingle-nucleotide polymorphisms (SNPs) covary dramatically(Fig. 2A). Because the substitutions are measuredin neutral DNA, different levels <strong>of</strong> selection cannot exonecan draw an informative inference about the nonaligningpart <strong>of</strong> the ancestral portion <strong>of</strong> a genome—it isnot likely to be present in the other genome. Because wedo not align lineage-specific insertions, and lineage-specificduplicates can align, the simplest explanation for thesequences not being in the comparison genome is thatthey were deleted. Other analyses based on a relativelyconstant genome size in mammals also argue that thenonaligning fraction reveals deletions in the comparisongenome (Waterston et al. 2002).<strong>Genom</strong>e sequences <strong>of</strong> additional species, such as rat (InternationalRat <strong>Genom</strong>e Sequencing Consortium, in prep.),are being assembled as large-scale genomic sequence andanalysis projects move into more functional and analyticalstudies (Collins et al. 2003). Pair-wise and multiple alignments<strong>of</strong> these sequences are regularly updated and madeavailable on the UCSC <strong>Genom</strong>e Browser (Kent et al. 2002)at http://genome.ucsc.edu. Additional mammalian genomesequences substantially improve the power <strong>of</strong> sequencealignment techniques to resolve functional from nonfunctionalDNA sequences (Thomas et al. 2003).CONSERVATION AND SELECTIONAbout 40% <strong>of</strong> the human genome aligns with sequencesin the mouse genome. As expected, almost all(99%) <strong>of</strong> the genes align between the genomes. <strong>The</strong>se ac-A.correlation coefficient0.50.40.30.20.1OriginalResiduals onGC parametersB.0.0deletions SNPs recombinationLTR insertionssubstitutions per AR site versusBase positionGata1HumanConsRepeat Masker4130000 4135000Primitive erythropoiesisDefinitive erythropoiesis40Figure 2. Covariation in divergence rates and application <strong>of</strong> an alignment score that accounts for local rate variation. (A) Correlationamong the amounts <strong>of</strong> neutral substitution (in ARs) with large deletions (based on human–mouse alignments), insertions <strong>of</strong> LTR repeats,recombination, and SNPs in human. Correlations are shown for the original data and for residuals after the quadratic effects <strong>of</strong>fraction G+C, change in fraction G+C, and CpG island density have been removed (Hardison et al. 2003). <strong>The</strong> correlations <strong>of</strong> variousdivergence processes with insertion and retention <strong>of</strong> different classes <strong>of</strong> repeats is the subject <strong>of</strong> ongoing work. (B) <strong>The</strong> HumanCons track plots the L score giving the log-likelihood that alignments reflect selection for the mouse Gata1 gene. <strong>The</strong> gene is transcribedfrom right to left. <strong>The</strong> upstream and intronic noncoding regions with high L scores correspond to previously described strongerythroid enhancers (Onodera et al. 1997).

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!