The Genom of Homo sapiens.pdf

Assessing the Quality of Finished Genomic Sequence

J. SCHMUTZ, J. WHEELER, J. GRIMWOOD, M. DICKSON, AND R.M. MYERS

Stanford Human Genome Center, Stanford University School of Medicine, Palo Alto, California 94304

This April, the Human Genome Project (HGP) announced the essential completion of the human genome sequence. In just a few years, from 2001 to 2003, the percentage of finished Homo sapiens sequence jumped from 25% to 99%. This represented a dramatic increase in the production finishing capacity of genome centers worldwide and a shift from a primary focus on the production of draft shotgun sequence (a streamlined production pipeline) to the generation of complete and accurate finished genomic sequence (a difficult process involving decision-making and consecutive rounds of experiments). By 2001, the large genome centers had proven that they could reduce the cost of the sequencing read through increased automation, conservation of reagents, and 24/7 production level processes, but could they do the same thing for producing finished sequence?
Although it is a significant challenge to maintain a production level of millions of shotgun sequencing reads per month, it is arguably more difficult to maintain a steady output of finished sequence that meets a defined accuracy standard. Perhaps surprising ourselves, we did it, overcoming the complexities of the finishing process and the allelic variation in the human genome to produce an essentially complete human genome sequence.

Now that 2.82 billion base pairs of finished human sequence have been generated, how can we be assured that the production of finished genomic sequence merited the enormous investment? Because of the cost and immensity of the project, the finished human reference sequence is not likely to be reproduced at any time in the near future. Therefore, to assess the general quality of the finished product, we must examine small portions of it and extrapolate to the rest of the completed human genome. During the project, the Stanford Human Genome Center was given a mandate by the National Human Genome Research Institute (NHGRI) to perform such an examination. In the process, we learned much about the problem of assessing the quality of the product generated by such a complex scientific process as the HGP.
In this paper, we summarize the results of our quality assessment of the finished human genome sequence and offer suggestions as to how to apply these lessons to the problem of evaluating the quality of future genome sequencing projects.

HISTORICAL MEASUREMENTS OF SEQUENCE ACCURACY IN THE HGP

In 1997, world standards for sequence accuracy were established at a meeting of HGP participants in Bermuda (now known as the “Bermuda Standards”). At this meeting, it was decided that any clone from the human genome sequence submitted as finished should have less than one error per 10,000 bases and that the sequence should be contiguous with no gaps (http://www.gene.ucl.ac.uk/hugo/bermuda2.htm). At the time, very few centers were submitting clone sequences that had no gaps, and laboratory and data analysis mechanisms for finishing all clones with no gaps were not in common practice among the data producers. A sufficient protocol for measuring the sequence accuracy component was also unknown. Due to the prohibitively high cost of producing finished sequence, it was impractical to independently resequence and refinish many clones to establish a firm error rate for the finished sequence that was being produced.

Since this time, two principal methods for estimating sequence accuracy have been employed by the HGP genome centers: the use of quality scores from Phred processed by Phrap (Ewing and Green 1998; Ewing et al. 1998) and examination of potential overlapping sequence from different clones for errors. To better understand the limitations, it is helpful to understand how the accuracy is generally estimated with these methods.

In the first of these methods, Phred assigns error probabilities to each base pair in every sequencing trace, based on large training data sets.
Traces base-called by Phred are subsequently assembled by the assembly algorithm Phrap, which propagates these single-base-pair error scores to the consensus sequence constructed from many overlapping sequence reads. Although Phred quality scores for measuring the accuracy of single sequencing traces have been extensively validated (Ewing et al. 1998; Richterich 1998) and are used to monitor the quality of production sequencing, the cumulative Phrap score for finished bases appearing in a consensus sequence has not been similarly examined. The Phrap score provides an indication of the value of the underlying base, but because of the complexity of the data assembly process, one cannot simply add up the Phrap estimated errors for all of the bases in a finished clone to determine the error rate. In our experience, the Phrap error rates are tenfold lower than the actual errors in the clone sequences. The simple reason for this is that Phrap error rates are assigned only to bases appearing in the consensus, and potential problem base pairs tend not to be included in the Phrap-derived consensus for the very reason that they are poor-quality bases. These problem bases include compressed or stretched peaks, and errors created by the Phrap assembly

Cold Spring Harbor Symposia on Quantitative Biology, Volume LXVIII. © 2003 Cold Spring Harbor Laboratory Press 0-87969-709-1/04. 31
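
The relationship between the per-base error probabilities described above and the Bermuda target of less than one error per 10,000 bases can be sketched in a few lines of Python. The conversion Q = -10 log10(P) is the standard Phred scale; it is not spelled out in the text here, and the function names below are hypothetical, chosen only for illustration. The naive sum of per-base error probabilities is included because the authors caution that it cannot be used on its own to determine the error rate of a finished clone.

```python
import math

# Bermuda Standards target quoted in the text: less than one error per 10,000 bases.
BERMUDA_MAX_ERROR_RATE = 1 / 10_000


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an estimated error probability.

    Uses the standard Phred scale, P = 10 ** (-Q / 10).
    """
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10 * math.log10(p)


def naive_expected_errors(consensus_quals):
    """Sum per-base error probabilities over a consensus sequence.

    This is the simple "add up the estimated errors" calculation that the
    text warns against relying on; it is shown only for comparison.
    """
    return sum(phred_to_error_prob(q) for q in consensus_quals)


def meets_bermuda_rate(observed_errors, bases_checked):
    """Check an independently observed error count against the Bermuda target."""
    return observed_errors / bases_checked < BERMUDA_MAX_ERROR_RATE


if __name__ == "__main__":
    # Hypothetical clone: 10,000 consensus bases, each scored Q40.
    quals = [40] * 10_000
    print(phred_to_error_prob(40))
    print(naive_expected_errors(quals))
    print(meets_bermuda_rate(observed_errors=4, bases_checked=50_000))
```

On this scale, a consensus base at Q40 carries an estimated error probability of 1 in 10,000, so a clone of 10,000 such bases has a naive expected error count of about one. Because the authors report that Phrap-derived estimates were, in their experience, tenfold lower than the actual errors, a figure computed this way is best read as optimistic and checked against independently observed errors.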
