13.07.2015 Views

The Genom of Homo sapiens.pdf

The Genom of Homo sapiens.pdf

The Genom of Homo sapiens.pdf

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

338 HARDISON ET AL.plain the regional differences. Rather, the covariation inthe various divergence processes appears to reflect an inherenttendency <strong>of</strong> large, megabase-sized regions tochange at a fast or slow rate (Chiaromonte et al. 2001).<strong>The</strong> molecular and cellular basis for this inherent tendencyto change is unknown, although it is possible thatrepair <strong>of</strong> double-stranded breaks could be at least part <strong>of</strong>the explanation (Lercher and Hurst 2002).Given the variation in neutral substitution rates, thegoal is to find aligning segments whose similarity significantlyexceeds that expected from divergence at the localneutral rate. <strong>The</strong>se should be the sequences subject topurifying selection. Indeed, the significance <strong>of</strong> a particularalignment score will vary substantially depending onthe divergence rate <strong>of</strong> the surrounding DNA (Li andMiller 2002). Thus, the fraction <strong>of</strong> matching nucleotidesfor alignments in small (50 bp) windows was adjusted forthe local neutral rate, empirically estimated from nearbyaligning ancestral repeats. <strong>The</strong> overall distribution <strong>of</strong>these adjusted scores is broad; when compared to its neutralcomponent (the distribution for ancestral repeatsonly) it presents a marked right-skewedness—i.e., increasedfrequencies on higher score values (Waterston etal. 2002). A statistical decomposition <strong>of</strong> this skewedoverall distribution leads to the conclusion that about 5%<strong>of</strong> the human genome is under purifying selection (Waterstonet al. 2002; Chiaromonte et al., this volume). Thisis over twice the amount <strong>of</strong> DNA that codes for protein,showing that the noncoding portion <strong>of</strong> the genome contributessignificantly to the functional DNA. However, itis only about one-eighth <strong>of</strong> the conserved sequences, so amajority <strong>of</strong> the aligning sequences do not reflect selectionfor some function.To make these scores more useful to biomedical scientists,L scores (or Mouse Cons and Human Cons) havebeen computed that convert locally adjusted similarityscores into probabilities that alignments in a given 50 bpresult from selection. <strong>The</strong>se can be accessed at the UCSC<strong>Genom</strong>e Browser. An example <strong>of</strong> this track for the mouseGata1 gene shows that protein-coding exons, the first intron,the promoter, and a region about 3–4 kb further upstreamare not only conserved (align), but are highlylikely to be generated by selection (Fig. 2B). <strong>The</strong> upstreamregion corresponds to an enhancer that conferserythroid-specific expression during primitive erythropoiesisand collaborates with the intronic enhancer to activateexpression during definitive erythropoiesis (Onoderaet al. 1997). Thus, these scores, generated from thealignment scores adjusted for local rate variation, can beeffective indicators <strong>of</strong> CRMs.DISCRIMINATING CRMS FROMOTHER DNAs<strong>The</strong> Mouse/Human Cons or L scores are measures <strong>of</strong>alignment quality, where matches are favored more thanmismatches, which are favored more than gaps. NoncodingDNA sequences with a high L score are more likely tobe subject to purifying selection, and this set <strong>of</strong> sequencesshould contain CRMs regulating conserved functions.However, it should also contain other functional sequencessuch as genes encoding structural RNAs and microRNAs.<strong>The</strong>refore, we explored several computational approachesto analyzing interspecies genomic sequencealignments, aiming to develop computational methods todistinguish regulatory regions from neutrally evolvingDNA. To do so, we employed statistical models that recognizealignment patterns characteristic <strong>of</strong> those seen inknown CRMs. Alignments rich in these patterns need notbe those that score highest in quality (e.g., a similarityscore) or a likelihood <strong>of</strong> being under selection. Knownenhancers and other CRMs tend to be clusters <strong>of</strong> highlyconserved binding sites for transcription factors, but sequencesbetween those binding sites are more variablebetween species. Thus, alignment quality measurementsin the CRMs are usually less than those seen in regionsunder more uniform selection, such as coding exons.Three training sets were collected from the wholegenomehuman–mouse alignments: (1) known CRMs,which are a set <strong>of</strong> 93 experimentally defined mammaliangene regulatory regions (accessible from GALA athttp://www.bx.psu.edu/), (2) well-characterized exons(coding sequences, as a positive control), and (3) ancestralinterspersed repeats (the major sequence class usedfor neutrally evolving DNA). Quantitative evaluation <strong>of</strong>statistical models that potentially could distinguish functionalnoncoding sequences from neutral DNA showedthat discrimination based on frequencies <strong>of</strong> individual nucleotidepairs or gaps (i.e., <strong>of</strong> possible alignmentcolumns) is only partially successful. In contrast, scoringprocedures that include the alignment context, based onfrequencies <strong>of</strong> short runs <strong>of</strong> alignment columns, achievegood separation between regulatory and neutral features(Elnitski et al. 2003).<strong>The</strong> best-performing scoring function, called regulatorypotential (RP) score, employs transition probabilitiesfrom two Markov models estimated on the training data.In practice, the procedure evaluates short strings <strong>of</strong>columns in the alignments, giving a higher value to thosethat occur more frequently in the CRMs training set thanin the ancestral repeats set (Fig. 3). In this procedure,alignments are described using a reduced alphabet A. Ineach training set, we compute the frequencies with whichshort strings <strong>of</strong> alignment characters are followed by aparticular alignment character. As an example, consideralignment columns to consist <strong>of</strong> two types <strong>of</strong> matches,those that involve G or C (S) and those that involve A orT (W), plus transitions (I), transversions (V), and gaps(G). A 5-symbol alphabet can thus describe the alignments.For short strings, the number <strong>of</strong> possible arrangements<strong>of</strong> these 5 symbols is computationally manageable.<strong>The</strong>refore, we estimate the probability that any string <strong>of</strong>length T is followed by a particular symbol (transitionprobabilities), where T is the order <strong>of</strong> the Markov model.For example, the empirical frequencies <strong>of</strong> a pentamer, sayWIISV, followed by a given symbol, say S (or W, I, V, G)are used to estimate the transition probabilities <strong>of</strong> a fifthorderMarkov model.More generally, for a Markov model <strong>of</strong> order T, we estimatethe probability that within a regulatory region an

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!