
The Genom of Homo sapiens.pdf

The Effects of Evolutionary Distance on TWINSCAN, an Algorithm for Pair-wise Comparative Gene Prediction

M. WANG, J. BUHLER, AND M.R. BRENT
Department of Computer Science and Engineering, Washington University, St. Louis, Missouri 63130

Although the human genome sequence is finished, complete delineation of all human protein-coding genes remains a distant prospect. There currently are only about 13,000 genes (loci) for which at least one complete open reading frame is known with high confidence (http://mgc.nci.nih.gov/, http://www.ncbi.nlm.nih.gov/LocusLink/RSstatistics.html) (Pruitt and Maglott 2001; Strausberg et al. 2002), out of an estimated total of at least 20,000 (Roest Crollius et al. 2000; Waterston et al. 2002). Thus, we are in urgent need of improved techniques for delineating complete gene structures. One way in which such improvements have come about in the last few years is through comparison of the human genome to other sequenced vertebrate genomes. New gene modeling programs were developed to exploit information in alignments between the mouse and human genomes (Bafna and Huson 2000; Korf et al. 2001; Alexandersson et al. 2003; Flicek et al. 2003; Parra et al. 2003), and these are now being used to obtain cDNA sequence via hypothesis-driven RT-PCR and sequencing experiments (Guigó et al. 2003; Wu et al. 2004). One of the first comparative gene modeling programs to achieve substantial improvements over the previous state of the art was TWINSCAN, which can annotate a target genome by exploiting alignments from an informant genome even if the informant sequences are unassembled whole-genome shotgun reads (Flicek et al. 2003).

The rapid pace at which new vertebrate genomes are being sequenced can be expected to lead to further improvements in the accuracy of gene modeling systems. We now have complete, published draft sequences of the mouse (Waterston et al. 2002) and pufferfish (Aparicio et al. 2002) genomes, an unpublished assembly of the rat, and five- to sixfold coverage of the chicken and dog in whole-genome shotgun reads (also see Kirkness et al. 2003). Furthermore, regions orthologous to a 1.8-Mb segment of human Chromosome 7 containing CFTR and 9 other genes (the “greater CFTR region”) have now been fully sequenced in a number of vertebrate species (Thomas et al. 2003). It is therefore possible, for the first time, to evaluate the evolutionary distance at which pair-wise comparison of vertebrate genomes is most useful for improving the accuracy of gene modeling.

In this paper, we explore the effects of evolutionary distance on gene prediction using TWINSCAN. To gain insight into our observations about gene prediction accuracy, we investigate the characteristics of BLASTN alignments in coding and noncoding sequence as a function of evolutionary distance. Mouse sequence is used as the target in order to maximize the range of evolutionary distances at which whole-genome informant sequences are available. We first focus on the previously studied CFTR region, then test the generality of our CFTR results using complete mouse chromosomes.

TWINSCAN takes as input local alignments between a target genome and a database of sequences from an informant genome. For each nucleotide of the target genome, only the highest-scoring local alignment overlapping that nucleotide is used. These alignments are converted into a representation called conservation sequence, which assigns one of three symbols to each nucleotide of the target genome (Fig. 1). Each target nucleotide is paired with “|” if the alignment contains a match, “:” if the alignment contains a gap or mismatch, and “.” if there is no overlapping alignment. TWINSCAN has separate probability models for the likelihood that each conservation sequence pattern will occur in coding regions, UTRs, splice signals, and translation initiation and termination signals. Given a target DNA sequence and its conservation sequence, TWINSCAN predicts the most likely gene structures according to its probability model.

RESULTS

Mouse CFTR Region

We ran TWINSCAN on the mouse CFTR region using the CFTR regions of Fugu, Tetraodon, chicken, human, cat, and rat as the informant databases. TWINSCAN performs best with the chicken informant. When TWIN-

Figure 1. Conversion of the best local alignment in each region of the target genome (top) into the conservation sequence representation used by TWINSCAN (bottom). (Left) A typical coding region, in which there are no unaligned bases or gaps, and the distances between mismatches tend to be multiples of three. (Right) A typical intron, in which there are unaligned regions, gaps and adjacent mismatches.

Cold Spring Harbor Symposia on Quantitative Biology, Volume LXVIII. © 2003 Cold Spring Harbor Laboratory Press 0-87969-709-1/04. 125
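
The conversion of local alignments into conservation sequence described above can be illustrated with a minimal Python sketch. This is not the TWINSCAN implementation: the `LocalAlignment` record, the `conservation_sequence` function, and the example scores are hypothetical names and values introduced here for illustration. The sketch also assumes that alignment columns with a gap in the target row consume no target position and are skipped, a detail the text does not specify.

```python
from dataclasses import dataclass

# Symbols from the text: match, gap or mismatch, no overlapping alignment.
MATCH, GAP_OR_MISMATCH, UNALIGNED = "|", ":", "."


@dataclass
class LocalAlignment:
    """Hypothetical record for one local alignment (not a TWINSCAN type)."""
    target_start: int   # 0-based target position of the first aligned base
    score: float        # alignment score, used to pick the best alignment
    target_row: str     # aligned target bases, '-' marks a gap
    informant_row: str  # aligned informant bases, '-' marks a gap


def conservation_sequence(target_length, alignments):
    """Return one symbol per target nucleotide.

    For each target nucleotide, only the highest-scoring alignment
    overlapping it contributes a symbol.
    """
    symbols = [UNALIGNED] * target_length
    best_score = [None] * target_length
    for aln in alignments:
        if len(aln.target_row) != len(aln.informant_row):
            raise ValueError("alignment rows must have equal length")
        pos = aln.target_start
        for t, i in zip(aln.target_row, aln.informant_row):
            if t == "-":
                # Assumption: a gap in the target has no target nucleotide
                # to label, so the column is skipped.
                continue
            if pos >= target_length:
                raise ValueError("alignment extends past the target")
            if best_score[pos] is None or aln.score > best_score[pos]:
                best_score[pos] = aln.score
                # A gap in the informant ('-') never equals a base, so it
                # falls into the gap-or-mismatch case.
                same = t.upper() == i.upper()
                symbols[pos] = MATCH if same else GAP_OR_MISMATCH
            pos += 1
    return "".join(symbols)


if __name__ == "__main__":
    alignments = [
        LocalAlignment(2, 50.0, "ACGTAC", "ACCT-C"),
        LocalAlignment(6, 80.0, "ACGG", "ACGG"),
    ]
    print(conservation_sequence(12, alignments))  # ..||:|||||..
```

In the usage example, target positions 6 and 7 are covered by both alignments, and the symbols come from the higher-scoring one, so the informant gap at position 6 in the first alignment does not appear in the output. Tracking the best score per position keeps the result independent of the order in which alignments are supplied.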
