13.07.2015 Views

The Genom of Homo sapiens.pdf


250 CHIAROMONTE ET AL.

can use this formula to calculate the expected fraction of windows in C that are under selection as

$$\frac{1}{N} \sum_{w \in C} \Pr(w \text{ selected} \mid S(w)) \qquad (6)$$

If, for example, we apply Equation 6 with C defined as all W = 50-bp WA-windows (alignment threshold T = 40), we recover the mixture coefficient p_1 = 0.192 discussed above, because p_1 is the fraction of these WA-windows that are estimated by the mixture decomposition to be under selection, and this must be the same as the expected fraction of windows under selection. Here Equation 6 merely provides another way of calculating the same number, and hence a nice test for our software. However, if we apply Equation 6 with C defined as W = 50-bp WAC-windows, then we can calculate something more interesting; namely, the expected fraction of well-aligned coding windows that are under selection. We performed this calculation for various window sizes.

For 200-bp windows we obtained 86%, for 100-bp windows 78%, for 50-bp windows 65%, but for 30-bp windows, we obtained only 48%. This further indicates how our mixture decomposition method produces a very conservative lower bound for the share under selection when applied to the normalized percent identity distribution of small windows.

A Tighter Lower Bound: Splitting Well-aligned Windows

Our computational strategy requires enough separation between the neutral and selected distribution of normalized percent identity for the mixture to reliably detect the difference. In fact, the definition of well-aligned windows (T = 25 for W = 30 bp, T = 40 for W = 50 bp, T = 80 for W = 100 bp, T = 160 for W = 200 bp) and choice of window size for the main analysis (W = 50 bp) stemmed from separation considerations; see also the Discussion section below.
However, if ancestral transposon relics are a good neutral model, our figure of 5.2% may still represent a fairly conservative lower bound for the share under selection. As a means to tighten this lower bound, we can further isolate extremely well-aligned genome-wide and neutral windows, splitting WA-windows into a high and a low alignment range. We tried, respectively, 20–24 and 25–30 aligned bases for W = 30, 40–44 and 45–50 for W = 50, 80–94 and 95–100 for W = 100, and 160–194 and 195–200 for W = 200.

We repeated our calculations (estimating smooth densities for neutral and genome-wide scores, decomposing the genome-wide score distribution into a neutral and a selected component, computing a share under selection based on the mixture weight estimate and coverage) separately for high- and low-range WA-windows, and added the results. As shown in Table 2, this consistently produces slightly higher share figures.

The reason for the tighter lower bound is that neutral and genome-wide normalized percent identity distributions are more dissimilar within each of the two groups than they are for WA-windows as a whole; that is, the split increases separation between neutral and selected behavior. From a purely theoretical point of view, splitting could either increase or decrease separation (this represents an interesting area for further theoretical study), but if it increases separation, then still finer partitions of WA-windows may lead to even higher share estimates. However, finer partitions lead to the compounding of errors in the calculations performed for each group, and this limits their utility.
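The split-and-add procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' software. It takes the neutral mixture weight p_0 in each group to be the minimum ratio of genome-wide to neutral density, the form used in the control experiments. It also takes each group's share under selection to be (1 - p_0) times that group's coverage. The function names, the use of discrete score bins in place of Gaussian-smoothed densities, and all numbers in the example are hypothetical.

```python
def neutral_weight(f_genome, f_neutral):
    """Estimate p_0 as min over score bins of f_genome(S) / f_neutral(S).

    Both arguments are densities over the same score bins (each sums to 1).
    Bins where the neutral density is 0 are skipped to avoid dividing by zero.
    """
    ratios = [g / n for g, n in zip(f_genome, f_neutral) if n > 0]
    return min(ratios)


def share_under_selection(groups):
    """Add up (1 - p_0) * coverage over the alignment-range groups.

    Each group is a tuple (f_genome, f_neutral, coverage), where coverage is
    the (hypothetical) fraction of the genome covered by that group's windows.
    """
    total = 0.0
    for f_genome, f_neutral, coverage in groups:
        p0 = neutral_weight(f_genome, f_neutral)
        total += (1.0 - p0) * coverage
    return total


# Toy inputs, invented for illustration only (not values from the paper).
high_range = ([0.4, 0.3, 0.3], [0.5, 0.3, 0.2], 0.10)
low_range = ([0.2, 0.3, 0.5], [0.2, 0.3, 0.5], 0.20)

combined_share = share_under_selection([high_range, low_range])
print(f"combined share under selection: {combined_share:.3f}")
```

In this toy case the low-range group has identical neutral and genome-wide densities, so it contributes nothing, and the whole estimate comes from the high-range group. With real data the densities are estimated from finite samples, so each group's minimum ratio carries its own error, which is the compounding the text warns about.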
We address the issue of statistical error next.

Control Experiments

As a control for the error associated with our Gaussian smoothing and mixture decomposition, using 50-bp windows with a threshold of 40 aligned bases, we divided the WA-windows in ancestral repeats into two sets, A and B, at random. Set A was used to estimate the neutral score distribution. Set B was used to estimate a genome-wide distribution under a “null” scenario of no selection. Since both data sets contain neutral windows, one expects a near 0 estimate for the fraction under selection: If f_neutral(S) = f_genome(S) exactly for all scores S, we would have p_0 = min_S [f_genome(S)/f_neutral(S)] = 1, and hence 1 – p_0 = 0. However, random differences between f_neutral(S) and f_genome(S) do occur, especially for extreme values of S where very few observations are available and thus both densities are very close to 0. These differences between small density values can generate fairly wide fluctuations in the ratio, resulting in a minimum sizably smaller than 1 (on some control experiments the minimum was
