The Genom of Homo sapiens.pdf


128 WANG, BUHLER, AND BRENT

METHODS

Sequences: CFTR study. The sequence of the mouse CFTR region was downloaded from the NISC Web site (http://www.nisc.nih.gov/data/20020612_Target1_0051/mouse_T1.fasta).

Sequences: Mouse chromosomes study. The sequences of mouse chromosomes 11, 17, and 19 were downloaded from http://genome.ucsc.edu/goldenPath/mmFeb2003/chromosomes/. The chromosomes were divided into nonoverlapping 1-Mb segments for both the BLAST and TWINSCAN portions of the analysis. Shotgun reads from Fugu (Takifugu rubripes), chicken (Gallus gallus), dog (Canis familiaris), human (Homo sapiens), and rat (Rattus norvegicus) were downloaded from the NCBI trace archive (ftp://ftp.ncbi.nih.gov/pub/TraceDB/).

Fold coverage calculations. Each 1x database consists of clipped reads whose total length is equal to the estimated size of the source genome. The genome sizes for Tetraodon (0.31 Gb), Fugu (0.31 Gb), and rat (2.6 Gb) were estimated as the number of bases in the assembly; the size for human (2.9 Gb) was taken from the mouse genome paper (Waterston et al. 2002), the size for dog (2.4 Gb) was taken from the recent shotgun sequencing survey paper (Kirkness et al. 2003), and the size for chicken (1.2 Gb) was taken from the proposal to sequence the chicken genome (McPherson et al. 2002). An alternative definition of 1x is the number of raw reads needed to achieve an N-fold redundant assembly, divided by N (Waterston et al. 2002; Flicek et al. 2003). This alternative yields substantially larger 1x databases because assemblers typically discard many of the raw input reads.

BLAST.
To prepare BLAST databases, we masked repeats in the informant genome sequences with RepeatMasker (Smit and Green, http://ftp.genome.washington.edu/RM/RepeatMasker.html), performed an additional round of low-complexity masking with nseg (Wootton and Federhen 1996) using default parameters, and removed all strings of 15 or more consecutive Ns in order to speed processing. All BLAST jobs were run using WUBLAST 2.0, 18-Jan-2003, running under x86 Linux. The analysis reported here uses the following BLAST parameters: M = 1, N = –1, Q = 5, R = 1, Z = 3,000,000,000, Y = 3,000,000,000, B = 10,000, V = 100, W = 10, X = 30, S = 30, S2 = 30, and gapS2 = 30. The seg and dust filter options were used.

Sequence annotation. The annotation of the mouse CFTR region created by the NISC was downloaded from http://www.nisc.nih.gov/data/20020612_Target1_0051/mouse_T1.annot.ff (Genbank accession AE017189). For the mouse chromosomes, we downloaded the MGC transcripts aligned to the genome by BLAT from http://genome.ucsc.edu. This annotation contains 8,547 known genes for the entire genome.

TWINSCAN. We used TWINSCAN version 1.3. Both TWINSCAN source code and a Web server are available at http://genes.cse.wustl.edu.

Accuracy calculations. The predictions were compared to the MGC annotations using the Eval software package (Keibler and Brent 2003; http://genes.cse.wustl.edu/eval/). The accuracy measure plotted in Figure 2A is the average of exact exon sensitivity (the fraction of annotated exons for which TWINSCAN predicts both splice sites correctly) and exon specificity (the fraction of exons predicted by TWINSCAN that exactly match annotated exons).
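
The accuracy measure just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Eval implementation: the exon tuple layout, the function name `exon_accuracy`, and the toy coordinates are hypothetical, and an exon is counted as correct only when both of its boundaries match an annotated exon exactly.

```python
from typing import Iterable, Tuple

# Hypothetical exon representation: (sequence name, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_accuracy(annotated: Iterable[Exon],
                  predicted: Iterable[Exon]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, average) for exact exon matches."""
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    # An exact match requires identical sequence, start, end, and strand.
    exact = annotated_set & predicted_set
    # Sensitivity: fraction of annotated exons that were predicted exactly.
    sensitivity = len(exact) / len(annotated_set) if annotated_set else 0.0
    # Specificity: fraction of predicted exons that match an annotated exon.
    specificity = len(exact) / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2.0


# Toy data (made up for illustration): four annotated exons, three predicted,
# two of which match exactly; the third misses one boundary by 10 bases.
annotated = [("chr11", 100, 200, "+"), ("chr11", 300, 400, "+"),
             ("chr11", 500, 600, "+"), ("chr11", 700, 800, "+")]
predicted = [("chr11", 100, 200, "+"), ("chr11", 300, 410, "+"),
             ("chr11", 500, 600, "+")]

sn, sp, acc = exon_accuracy(annotated, predicted)
print(f"sensitivity={sn:.3f} specificity={sp:.3f} accuracy={acc:.3f}")
```

In the toy data, the prediction that is off by 10 bases at one boundary counts as wrong for both measures, which is what "exact" means here.
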
Because the MGC transcripts include only a fraction of the exons in the mouse genome, specificity measured against MGC is a systematic underestimate of actual specificity: all predicted exons that are not in the MGC transcripts are counted as wrong. In order to make the results on the chromosome comparable to those on the well-annotated CFTR region, the accuracy measure plotted in Figure 4A is the average of exon sensitivity and 3.5 times the exon specificity.

DISCUSSION

The availability of extensive genome sequence from a variety of vertebrate lineages has enabled the first direct investigation of how evolutionary distance affects comparative gene prediction in the vertebrates. Initially, we studied the greater CFTR region using the sequences of BACs that were selected for orthology to the human CFTR region. A previous analysis of this region (Thomas et al. 2003) based on different alignment methods (Schwartz et al. 2003) reported that chicken alignments covered a large fraction of coding sequence (CDS) in the mammalian CFTR region but only a small fraction of noncoding sequence. Our analysis confirms this and shows that it has the expected positive effect on the accuracy of TWINSCAN, a state-of-the-art gene prediction system. In addition to improving coding versus noncoding predictions, chicken alignments yielded the greatest accuracy as measured by prediction of exact exon boundaries.

Application of the same analysis to a much broader sample of the mouse genome (chromosomes 11, 17, and 19) told a very different story. The CDS of the greater CFTR region is exceptionally well conserved between the mammalian and avian lineages, relative to other portions of the genome.
In the broader survey, there was no sudden jump in the percentage of mouse CDS that aligns as one moves from fish comparisons to chicken comparisons. Instead, the aligned percentage seems to increase linearly with the percent identity in aligned regions as one moves from fish to chicken to human (Fig. 4). This difference was reflected directly in the TWINSCAN performance, which peaked at human rather than chicken. The curves in Figure 5, as well as previously reported results (Flicek et al. 2003), suggest that using complete, assembled informant genomes in place of the 5x shotgun coverage is unlikely to have any qualitative effect on the relative utility of the comparisons. Likewise, changing alignment parameters or algorithms affects the absolute fraction of intron and CDS that aligns in each genome pair, but it does not change the relative values of comparisons at the distances studied here (data not shown). Using translated alignments (TBLASTX) also has little qualitative effect and does not improve TWINSCAN per-
