The Genom of Homo sapiens.pdf

EVOLUTIONARY DISTANCE AND GENE PREDICTION

formance. Finally, our results on CFTR agree qualitatively with those that Thomas et al. obtained with BLASTZ (Schwartz et al. 2003), an algorithm that produces a very different type of alignment than BLASTN.

The accuracy curve shown in Figure 4A reflects comparisons at only four significantly different evolutionary distances. The alignment curves in Figure 4B suggest that filling in the intermediate distances will yield an accuracy curve with a single peak at the evolutionary distance that is optimal for informing gene modeling. If so, the peak is at a distance farther than that between mouse and rat but closer than that between mouse and chicken. The peak at the mouse–human comparison in Figure 4A is consistent with a highly idealized theoretical analysis suggesting a peak somewhere in the vicinity of mouse–human, or possibly somewhat farther out (Zhang et al. 2003). To determine the peak location more precisely, we will need the sequences of more genomes. The most immediate possibility for a comparison at a distance intermediate between mouse–human and mouse–chicken will come from the opossum, Monodelphis domestica, a marsupial that has been designated as a high-priority sequencing target by the National Human Genome Research Institute (http://www.genome.gov/page.cfm?pageID=10002154).

Although the mouse–rat divergence is much too close for optimal annotation of mammalian genomes using a single genome pair, preliminary data suggest that the situation may be quite different when the target is a genome with very short introns. For example, gene prediction in Cryptococcus neoformans serotype D benefits from alignments to serotype A. These alignments cover 100% of CDS and 87% of intron bases, excluding splice sites (A. Tenney, unpubl.).
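Coverage figures of this kind can be obtained by intersecting annotated features with alignment blocks. The following is a minimal sketch of that calculation, not the procedure used in the study: the function names `merge` and `covered_fraction`, the half-open coordinate convention, and the toy coordinates are all assumptions made for illustration, and the exclusion of splice sites is not modeled.

```python
from typing import Iterable, List, Tuple

# Half-open interval [start, end) in target-genome coordinates (an assumed convention).
Interval = Tuple[int, int]


def merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(features: Iterable[Interval], aligned: Iterable[Interval]) -> float:
    """Fraction of feature bases that fall inside at least one aligned block."""
    feats = merge(features)
    blocks = merge(aligned)
    total = sum(end - start for start, end in feats)
    if total == 0:
        return 0.0
    covered = 0
    # Both lists are disjoint after merging, so no base is counted twice.
    for f_start, f_end in feats:
        for b_start, b_end in blocks:
            overlap = min(f_end, b_end) - max(f_start, b_start)
            if overlap > 0:
                covered += overlap
    return covered / total


# Hypothetical toy gene: two exons flanking one 68-bp intron.
cds = [(0, 100), (168, 268)]
introns = [(100, 168)]
aligned_blocks = [(0, 110), (160, 268)]

print(f"CDS coverage:    {covered_fraction(cds, aligned_blocks):.2%}")
print(f"intron coverage: {covered_fraction(introns, aligned_blocks):.2%}")
```

The nested loop is quadratic in the number of intervals, which is adequate for a sketch; a sweep over the two sorted lists would be preferable at genome scale.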
Visual inspection suggests that when TWINSCAN is trained on these alignments, it predicts introns or intergenic regions that include most unaligned regions; conversely, it rarely predicts an intron that does not overlap an unaligned region. Because the introns are very short (68 bp on average), the locations of the intron boundaries are quite constrained relative to those of mammalian introns. Apparently, TWINSCAN uses these alignments to find the general locations of introns, rather than their precise boundaries.

The research presented here is a significant step toward determining the optimal distance for gene modeling using pair-wise genome alignments. Looking ahead to the next step, we and other investigators are developing methods that use information from multiple genome alignments rather than choosing a single best alignment (Boffelli et al. 2003; Siepel and Haussler 2003). When it has been fully developed, the multi-genome approach is expected to yield real breakthroughs in the accuracy of gene modeling.

ACKNOWLEDGMENTS

We are grateful to the centers that produced the genome sequences used in this study. Special mention is due to the Washington University Genome Sequencing Center for producing the chicken whole-genome shotgun sequence, the National Institutes of Health Intramural Sequencing Center for producing BAC-based sequences of the greater CFTR regions, and the Whitehead Institute Genome Research Center for producing the dog whole-genome shotgun sequence. The authors were supported in part by grant HG-02278 from the National Human Genome Research Institute.

REFERENCES

Alexandersson M., Cawley S., and Pachter L. 2003. SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 13: 496.

Aparicio S., Chapman J., Stupka E., Putnam N., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Gelpke M.D., Roach J., Oh T., Ho I.Y., Wong M., Detter C., Verhoef F., Predki P., Tay A., Lucas S., Richardson P., Smith S.F., Clark M.S., Edwards Y.J., and Doggett N., et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301.

Bafna V. and Huson D.H. 2000. The conserved exon method for gene finding. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8: 3.

Boffelli D., McAuliffe J., Ovcharenko D., Lewis K.D., Ovcharenko I., Pachter L., and Rubin E.M. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 1391.

Burge C. and Karlin S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78.

Flicek P., Keibler E., Hu P., Korf I., and Brent M.R. 2003. Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map. Genome Res. 13: 46.

Guigó R., Dermitzakis E.T., Agarwal P., Ponting C., Parra G., Reymond A., Abril J.F., Keibler E., Lyle R., Ucla C., Antonarakis S.E., and Brent M.R. 2003. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc. Natl. Acad. Sci. 100: 1140.

Keibler E. and Brent M.R. 2003. Eval: A software package for analysis of genome annotations. BMC Bioinformatics 4: 50.

Kirkness E.F., Bafna V., Halpern A.L., Levy S., Remington K., Rusch D.B., Delcher A.L., Pop M., Wang W., Fraser C.M., and Venter J.C. 2003. The dog genome: Survey sequencing and comparative analysis. Science 301: 1898.

Korf I., Flicek P., Duan D., and Brent M.R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics (suppl. 1) 17: S140.

McPherson J.D., Dodson J., Krumlauf R., and Olivier P. 2002. Proposal to sequence the genome of the chicken. (http://www.wattnet.com/library/DownLoad/PD12 genome.pdf)

Parra G., Agarwal P., Abril J.F., Wiehe T., Fickett J.W., and Guigó R. 2003. Comparative gene prediction in human and mouse. Genome Res. 13: 108.

Pruitt K.D. and Maglott D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137.

Roest Crollius H., Jaillon O., Bernot A., Dasilva C., Bouneau L., Fischer C., Fizames C., Wincker P., Brottier P., Quetier F., Saurin W., and Weissenbach J. 2000. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 25: 235.

Schwartz S., Kent W.J., Smit A., Zhang Z., Baertsch R., Hardison R.C., Haussler D., and Miller W. 2003. Human-mouse alignments with BLASTZ. Genome Res. 13: 103.

Siepel A.C. and Haussler D. 2003. Combining phylogenetic and hidden Markov models in biosequence analysis. In RECOMB 2003 (ed. W. Miller et al.), p. 277. ACM Press (ACM Digital Library), New York.

Strausberg R.L., Feingold E.A., Grouse L.H., Derge J.G., Klausner R.D., Collins F.S., Wagner L., Shenmen C.M., Schuler G.D., Altschul S.F., Zeeberg B., Buetow K.H., Schaefer C.F., Bhat N.K., Hopkins R.F., Jordan H., Moore T., Max S.I., Wang J., Hsieh F., Diatchenko L., Marusina K., Farmer A.A., Rubin G.M., and Hong L., et al. 2002. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc. Natl. Acad. Sci. 99: 16899.
