13.07.2015 Views

The Genom of Homo sapiens.pdf

The Genom of Homo sapiens.pdf

The Genom of Homo sapiens.pdf

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

252 CHIAROMONTE ET AL.However, we could still have a bias in the estimate <strong>of</strong> theshare under selection if there are properties that cannot becorrected for by a score function that accounts for compositionalbiases, and that affect the neutral substitutionrate differently in relics <strong>of</strong> ancestral transposons than theydo in other types <strong>of</strong> neutral DNA.One possibility is that relics <strong>of</strong> ancestral transposons,because <strong>of</strong> their similarity to each other, are more apt toundergo gene conversion (or ectopic conversion) eventsthan other neutral DNA. If we are looking at a pair <strong>of</strong> orthologouslyplaced transposon relics in the human andmouse genomes, but one <strong>of</strong> them, say the one in themouse, has undergone a gene conversion in the rodentlineage, then additional substitutions may have been introducedby that event, making the human–mouse alignedelements less conserved than would be a pair that did notundergo lineage-specific gene conversion. If this werequite common, it would cause us to overestimate theshare under selection using our neutral model. However,we note that all gene conversion events that occurred totransposon relics before the human–mouse split wouldhave no effect: <strong>The</strong>y would merely replace the DNA thatis inherited by both species. Only primate or rodent lineage-specificgene conversion events can introduce bias.Since triggering a gene conversion event requires a reasonablyhigh degree <strong>of</strong> sequence identity between the twocopies, this means that relics from ancestral transposonfamilies that were only active long before thehuman–mouse split, and hence whose copies were highlydissimilar to each other at the time <strong>of</strong> the split, are muchless likely to have had a lineage-specific gene conversionevent after the split. Basing the neutral model on these“most ancient” relics would then remove the bias.We divided ancestral repeats into 130 subfamilies asdefined by RepeatMasker (Smit and Green 1999), computedthe average number <strong>of</strong> mismatches between eachrepeat and the consensus sequence for its subfamily, andeliminated from the analysis all repeats belonging to the65 least diverged subfamilies, representing those transposonrelics that were present in the ancestral genome theleast amount <strong>of</strong> time before the human–mouse split. Thiseliminated roughly 60% <strong>of</strong> the bases in the original collection<strong>of</strong> ancestral repeats. For W = 50-bp WA-windows,the estimated share under selection obtained with this restrictedset <strong>of</strong> ancestral repeats was very similar to thatobtained with the full set <strong>of</strong> ancestral repeats as a neutralmodel (ratio <strong>of</strong> the two estimates was between 0.95 and 1in different tests <strong>of</strong> this type with various data sets andalignments). This argues against gene conversion being asource <strong>of</strong> bias.Another possibility is that after transposons are insertedthey undergo a more rapid substitution rate. Possiblecauses for this may include mechanisms to suppresstransposon transcription, or some holdover from anotherancient cellular defense mechanism against insertions <strong>of</strong>transposons, as well as rapid adaptation <strong>of</strong> their GC contentto the GC content <strong>of</strong> the surrounding DNA (Bernardi1993). This would also cause an overestimate in the shareunder selection using transposon relics as neutral model.However, it seems plausible that such an increased rate <strong>of</strong>substitutions would diminish as the transposon relics age,so that a very old transposon relic which has been accumulatingsubstitutions in the genome for 50 to 100 millionyears would behave more like typical neutral DNA.Consequently, we would have expected to see more <strong>of</strong> achange in the estimate <strong>of</strong> the share under selection in theabove-mentioned experiments, in which we eliminated“younger” ancestral transposon relics from the neutralmodel. <strong>The</strong> largest effect we saw when experimentingwith different subsets <strong>of</strong> ancestral transposons to definethe neutral model was when we used only SINEs (both“young” and “old” ancestral SINEs). Here the estimatedropped to 4.67%, possibly indicating some bias, but stillnot as big a fluctuation as we saw by varying the windowsize and alignment threshold (Table 1).Finally, one type <strong>of</strong> neutral DNA that may affect ourresults because it evolves in a distinct way, different fromthat <strong>of</strong> transposon relics, is DNA in simple repeats. However,we found that there are only 2 million bases <strong>of</strong> humansimple repeats aligned with mouse, less than1/1000th <strong>of</strong> the genome, so these by themselves could notsubstantially affect our estimate <strong>of</strong> the share under selection.Essentially, even rejecting all simple repeats as a priorinot being under selection, we would not reduce our estimate<strong>of</strong> the share <strong>of</strong> the genome under selection.DISCUSSIONUltimately, one would like to identify individual bases<strong>of</strong> the human genome that are under selection. However,with only one alignment available (to the mousegenome), there is insufficient information to do so at thistime. Even a global statistical estimate <strong>of</strong> the share underselection cannot be made from data relative to singlebases, because the neutral and genome-wide distributionswill be very similar for any reasonable conservation scorecomputed on two-species comparisons <strong>of</strong> individualbases.To illustrate this, consider the simple conservationfunction that has score = 1 if the aligned bases are identicalin human and mouse, and score = 0 otherwise. <strong>The</strong> estimatedprobability <strong>of</strong> score = 1 is 0.667 for aligned neutralbases (from ancestral repeats) and 0.699 for alignedbases genome-wide. Applying our basic mixture estimationmethod to these data gives an upper-bound estimate<strong>of</strong> p o = 0.904. This is the maximum fraction <strong>of</strong> thegenome-wide aligned sites that is compatible with theneutral score distribution, and is (implicitly) obtained byassuming that bases under selection are always identicalbetween human and mouse. We cannot do any betterwithout assuming prior knowledge <strong>of</strong> the score distributionfor selected sites, something we have avoided in ourapproach. When we then convert p o into a lower bound onthe fraction <strong>of</strong> sites in the human genome undergoing selection,we obtain a selected = 0.35*(1–0.904) = 0.0336, orabout 3.4% (35% <strong>of</strong> the bases in the human genome arealigned to mouse). In light <strong>of</strong> the implicit assumption thatall selected bases are exactly conserved, this is clearly avery weak lower bound. <strong>The</strong> problem here is the strongoverlap between the neutral distribution and the (unob-

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!