13.07.2015 Views

The Genom of Homo sapiens.pdf

ASSEMBLING LARGE GENOMES 193

tant adjunct to a draft reference sequence. Mapping such sequence variation has high value in genetic analysis, as well as for insights into population structures and natural variation. Where structured inbred laboratory populations or well-studied outbred populations are available, there is a clear justification for SNP discovery within genome sequencing efforts. In other organisms, SNP discovery is harder to justify, for example, when there is little genetic analysis to exploit such markers.

The range of genomes currently under study at the BCM-HGSC illustrates the variable value in studying polymorphism (Table 2). The human studies justified targeted SNP discovery by sequencing additional human genomes to light coverage. Similar data generation would be of utility in the rat and bovine genome projects, as each has interesting and mapped quantitative traits. In another project, the rhesus macaque, SNP discovery is harder to justify, but the degree of natural diversity of these populations is an intriguing question. Finally, in the sea urchin project, the most polymorphic outbred genome we are studying, there is no obvious application for a map of polymorphisms.

Polymorphism is also problematic for genome assemblies, as it results in a high level of mismatched bases between overlapping sequence reads.
In the extreme, a very highly diverged polymorphism can require the independent assembly of each separate haplotype, effectively doubling the effort required for the genome sequence. However, BACs offer a pathway to resolve this difficulty, as each BAC clone comes from a single chromosome and allows greater definition of its haplotype than would be possible with purely WGS reads.

Sequencing cDNAs is also an important part of the menu of items that can contribute to a completed genome. The BCM-HGSC has previously developed a novel method for rapid full-length cDNA sequencing and applied it as part of the Mammalian Gene Collection project (http://mgc.nci.nih.gov/) to sequence over 15,000 individual cDNAs. These data have enormous utility in annotation as well as providing a data set for measuring completeness of the draft sequence, and therefore cDNA generation has been included in each subsequent project where possible.

In the case of cDNA sequencing, BACs do not contribute a direct benefit in the generation of the primary data. Nevertheless, in BAC-based projects, there are frequent occasions where exons of interesting cDNAs are observed in the genome, and the BAC clones from which they are derived are scrutinized in order to give the most accurate genomic structure and complete the gene model.

OTHER KINDS OF DATA ON THE HORIZON

Fluorescent Sanger DNA sequencing technology has now been used for ~20 years, and has so far provided a superior performance to alternative procedures. Individual sequence determinations typically span >700 contiguous high-quality bases, creating a high standard for comparison to other technologies.
Nevertheless, there are many innovative alternative strategies under development and considerable interest in improving on the cumbersome gel-matrix separation phase of the Sanger approach. Much of this effort is in the industrial sector and is directed at the enablement of highly parallelized procedures, so many more reactions can occur simultaneously. A common feature of each new method is therefore that the likely length of each single contiguous sequence determination will be much less than the existing technology. In some cases, the effective length of new sequence read types will measure in the range of 100 bases or so, but techniques that produce very short reads (10–20 bases) cannot be discounted. It is also likely that the base reads generated will have an overall lower quality and reproducibility. Nevertheless, these new methods will very possibly provide superior performance in terms of the total number of newly sequenced bases for a given time or cost. Therefore, assembly strategies and algorithms will be required to use these data in an efficient manner.

The ability to confidently align sequences diminishes as the read length and quality decrease (reduced signal) and the total sequence target complexity increases (increased noise). In addition, shorter reads will not have the same ability to span many classes of genome repeats. Hence, the likely influx of these new data will provide a much greater challenge for WGS projects than for more localized BAC alignments.
This is a strong additional argument for a current focus on refining methods for assembly based on BAC clones.

SUMMARY OF STRATEGY FOR SEQUENCING NEW GENOMES

The balance of available resources and scientific priorities dictates that most new genomes will be “drafted” and not finished as in the human and mouse genome projects. A practical strategy therefore aims to maximize the yield of biological information that can be realized within this “draft sequencing” model by generating sequence coverage that is as complete, contiguous, and high-quality as possible. Achieving this requires coordination of many elements. On the basis of our previous experience, we recommend exploiting components of BAC-constrained sequences, clone pooling, WGS reads, and a modest amount of physical mapping. The overall concert that is orchestrated is complex, and each component has features that need to be tuned to take advantage of the individual contributions and characteristics of the particular genome being sequenced, to optimize the overall result. For example, the balance between the depth of coverage of BAC reads and WGS reads, and the genome size, needs careful choice. Similarly, the role of clones with different insert sizes in each of the categories is important in order to ensure precise joining of “contigs” into scaffolds. Each of these needs to be considered in the context of the ease and expense of generating that class of data and the history of resources available for that organism. The precise strategy for each individual organism is slightly different.
A global goal is to both standardize the approaches that can be used for future genomes and to better understand which elements of the characterization of individual genomes must be most precisely defined at the outset, to ensure smooth execution of a genome project.
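
The trade-off between coverage depth, read length, and genome size mentioned in the summary can be made concrete with some simple arithmetic. The sketch below is illustrative only and is not taken from the strategy described here: it assumes the standard idealized model in which reads land uniformly at random (a Poisson approximation in the style of Lander and Waterman), ignoring repeats, cloning bias, and read quality. The function names and the example genome size and depths are hypothetical; only the 700-base read length echoes the figure quoted above for Sanger reads.

```python
import math


def coverage_depth(num_reads, read_length, genome_size):
    """Average depth c = N * L / G (bases sequenced per genome base)."""
    return num_reads * read_length / genome_size


def reads_needed(depth, read_length, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(depth * genome_size / read_length)


def fraction_uncovered(depth):
    """Expected fraction of bases hit by no read, assuming reads are
    placed uniformly at random (Poisson approximation): exp(-c)."""
    return math.exp(-depth)


if __name__ == "__main__":
    # Hypothetical project: a 2.8 Gb genome, 700-base reads, and an
    # assumed split between BAC-derived and WGS coverage.
    genome_size = 2_800_000_000
    read_length = 700
    for label, depth in [("BAC reads", 2), ("WGS reads", 4), ("combined", 6)]:
        n = reads_needed(depth, read_length, genome_size)
        gap = fraction_uncovered(depth)
        print(f"{label}: depth {depth}x needs {n} reads; "
              f"expected uncovered fraction {gap:.4f}")
```

Under these assumptions the uncovered fraction falls exponentially with depth while the number of reads grows only linearly, which is one way to see why the division of a fixed read budget between BAC and WGS components depends on genome size. Real genomes, with repeats and polymorphism, will behave less favorably than this model suggests.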
