sequencing gap - Rice Genome Annotation Project
sequencing gap - Rice Genome Annotation Project
sequencing gap - Rice Genome Annotation Project
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Library Construction• Break the genomic DNA from theorganism into many small fragments• The fragments are cloned into plasmidvectors• This collection of fragments is called alibraryTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Library Requirements1. Free vector should be at low or undetectable level.2. No chimeric clones. Chimeras occur two or more randomfragments from separate parts of the genome recombineand end up next to each other.3. The majority of the inserts should be of relatively uniformsize.4. Libraries need to be random and cover the wholegenome.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
DNA Shearing with HydroShearHydroShear Assembly Cleaning QC1 2 3 4Nebulized DNA vs HydroShear sheared DNASize Selection GelsTIGR7 ngQuant Gels Prior to Vector Ligation18ng 20ng 25 ngTHE INSTITUTE FOR GENOMIC RESEARCHPCR Products After 4 Wash Cycles1 BAC DNA, 2 sheared BAC DNA,3 sheared water, 4 sterile waterBACs assembled and checkedfor contaminant sequencesXSCQB – 3 seqs(2- Lotus japonicus, 1- Homo sapiens)XSCQC – 0XSCQD – 0XSCQE – 0
BstXI adaptor cloning systemDNA insertLigateBstXI adaptorCTTTCCAGCACAGAAAGGTCLigateComplementary to BstXI adaptorCTGGAAAGGTGTGACCTTTCVectorTIGRVectorplus insertTHE INSTITUTE FOR GENOMIC RESEARCH
Design Features of the BstXI AdaptorCloning SystemForward <strong>sequencing</strong> primerForward PCR primerrrnBT1BstXI site BstXI site Reverse <strong>sequencing</strong> primerReverse PCR primerrrnBT2ter2pHOS vector plus insertProriAmpRter1TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
pHOS Vector Features• The <strong>sequencing</strong> primer sites immediately flank the cloningsite to avoid excessive re-<strong>sequencing</strong> of vector DNA.• PCR primer sites are located immediately outside of theSequencing primer sites to allow PCR amplification fortemplate preparation.• The vector continues to have the regular, good vectorfeatures such as a gene for replication (origin of replication),antibiotic resistance and multiple cloning sites where wecan cut the circular plasmid and insert desired features.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Template Production• Transform clones intoE. coli• Plate and pick coloniesto isolate clones• Replicate and isolateplasmid for <strong>sequencing</strong>• 384 well HT plateprocessingTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Alkaline Lysis 384-well DNA Isolation MethodPick colonies into media, growFreezer copyCentrifuge to pellet cellsCell LysisDNA PurificationRe-suspend in TE/BlueAverage concentration30 - 40 ng/µLTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Template Production LaboratoryCurrent Capacity: 22,000,000 plasmids/yearExpansion Capacity: 54,000,000 plasmids/yearTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Sequence Production• Sequencing Reactionsand Precipitations• Samples are loaded intothe capillaries on a<strong>sequencing</strong> machine• Data Collection• Data AnalysisTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Sequence Production LaboratoryCurrent Capacity: 40,000,000 sequences/yearExpansion Capacity: 100,000,000 sequences/yearTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Capillary array viewTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Sample ElectropherogramTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Basecalling & Quality AssignmentsTIGRTHE INSTITUTE FOR GENOMIC RESEARCHPhred & TraceTuner•Read DNA sequencer traces•Call bases•Assign base quality values•Write basecalls and qualityvalues to output files.Warner Brothers, Inc.Lowering the error rate, averaging 40% - 50% fewer errors thanABI software independent of position in read, machine runningconditions, or <strong>sequencing</strong> chemistry
What are phred quality values?The quality value q assigned to a base call isdefined as:q = - 10 x log 10 (p)where p is the estimated error probabilityfor that base-call.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
ORA base-call having a probability of1/1000 of being incorrect is assigned aquality value of 30.ProbabilityQuality Value1/100 201/10 10TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
QC Tools1. Web based document control system including SOP’s andProcess Control Forms2. Receipt testing and lot control of <strong>sequencing</strong> reagents3. QC audits of reagents and protocols used byproduction teams4. Record of operators performing each step of the process5. Equipment calibration and preventive maintenance program6. QC/R&D group and QC representative on each teamTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Assembling the fragmentsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Celera Assembler• creates high confidence “uniquelyassembleable contigs” = unitigs• marks those that appear repetitive w.r.t.arrival-rate statistics (surrogates anddegenerates)• uses insert (clone-mate) information tobuild contigs & mark ambiguous unitigs(surrogates)TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Merging two sequencesoverlap (19 bases)overhang (6 bases)…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGACCAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…TIGRoverhangoverlap - region of similarity between regionsoverhang - un-aligned ends of the sequencesThe assembler screens merges based on:• length of overlap• % identity in overlap region• maximum overhang size.THE INSTITUTE FOR GENOMIC RESEARCH% identity = 18/19 % = 94.7%
Arrival rate statistics8x coverage, 800bp reads - a read every 100bp0 100 200 300 400 800A-stat = log (p(single copy) / p(two-copy))p(single copy) = ((pF/G)^k/k!) exp(-pF/G)p(two-copy) = ((2pF/G)^k/k!) exp(-2pF/G)TIGRTHE INSTITUTE FOR GENOMIC RESEARCHF - # fragmentsG - genome sizep/k - average arrival rate
Forward-reverse constraints• The sequenced ends are facing towards each other• The distance between the two fragments is known(within certain experimental error)clone lengthFRsequenced endsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Surrogate UnitigsA UB U CUAUBTIGRTHE INSTITUTE FOR GENOMIC RESEARCHC
Degenerate contigs• Too deep• Too few readsTIGR(NOT IN SCAFFOLDS)THE INSTITUTE FOR GENOMIC RESEARCH
GreedyTIGR Assembler• Build a rough map of fragmentoverlaps• Pick the largest scoring overlap• Merge the two fragments• Repeat until no more merges can bedoneTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Scaffolding• Given a set of non-overlapping contigsorder and orient them along a chromosomeII I III IVIIIIIIVTIGRTHE INSTITUTE FOR GENOMIC RESEARCHI
Clone-matesInsertFRIIIRFVectorIIIRFIIITIGRTHE INSTITUTE FOR GENOMIC RESEARCHFR
Linking information• Overlaps• Mate-pair links• Similarity links• Physical markersreference genomephysical map• Gene syntenyTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Grouping the contigsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Assembly <strong>gap</strong>sphysical <strong>gap</strong>group Agroup B<strong>sequencing</strong> <strong>gap</strong>s<strong>sequencing</strong> <strong>gap</strong> - we know the order and orientation of the contigs and have atleast one clone spanning the <strong>gap</strong>physical <strong>gap</strong> - no information known about the adjacent contigs, nor about the DNAspanning the <strong>gap</strong>TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Scaffolder outputPhysical <strong>gap</strong>sSequencing <strong>gap</strong>sTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Problems with the data• Incorrect sizing of inserts– cut from gel – sizing is subjective– error increases with size• Chimeras (ends belong to different inserts)– biological reasons (esp. for large sized inserts)– sample tracking (human error)• Software must handle a certain error rate.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Ambiguous scaffoldI II IIIIIIIIIIIIIIII III I’ IITIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Unifying view of assemblyAssemblyScaffoldingTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
BAC-by-BAC Sequencing and ClosureWhy?• Aids complete closure by avoiding genomic repeatproblems• More streamlined collaboration with other <strong>sequencing</strong>and closure centers• Allows for data release and annotation of completedsequence throughout the projectWhy Not?• Complete closure not required• Small genome size• Time/cost concernsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
BAC-by-BAC Sequencing and Closure• Generate BAC-end sequence data• Choose and sequence “seed” BACs• Select minimally overlapping “tiling” BACs for contigextensionChromosome Marker1 Marker2 Marker3BAC LibraryMarker4Select BACs based on BAC-end overlaps, physical mapping data, etc.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Life of a BAC at TIGRIdentify BACBAC prep, fingerprintLibraryAssembleShotgunSequence test and verifyReactions, reassemblyCLOSED BACCompare digestsVerify BACTIGRTHE INSTITUTE FOR GENOMIC RESEARCH<strong>Annotation</strong>
BAC Shotgun SequencingBAC DNASequencingShotgunLibraryEg. 4-6 KbAssembly and ScaffoldingContigsGap ClosureTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
<strong>Genome</strong> Closure/FinishingTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Why Completeness is Important• Improves characterization of genome features– Gene order, replication origins• Better comparative genomics– <strong>Genome</strong> duplications, inversions• Determination of presence and absence of particulargenes and features is less subjective• Missing sequence might be important (e.g.,centromere)• Allows researchers to focus on biology not <strong>sequencing</strong>• Facilitates large scale correlation studies• Controls for contaminationTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
What Is Closure?• Obtaining sequence that was not obtained during random<strong>sequencing</strong> which resulted in:• Sequencing Gaps• Physical Ends• Confirming the integrity of assemblies• Repeats and misassemblies• Verification of Clone Coverage• Confirming the base sequence of the consensus• Editing• Verification of Sequence CoverageTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Levels of <strong>Genome</strong> ClosureTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
TIGR Finishing CriteriaA genome is considered finished and ready if it satisfies the followingcriteria:1. A continuous consensus of DNA sequence2. No ambiguous consensus basepairs3. At least 2X sequence coverage over the entire genome.a) both strands of one clone are sequencedb) two different clones sequencedc) Same clone sequenced with dual chemistry4. At least 2X clone coverage over the entire genome5. Complete confidence in all repetitive areasTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
TIGRTHE INSTITUTE FOR GENOMIC RESEARCHCauses for <strong>gap</strong>s• Non-random shotgun library– Toxicity of genes or promoters in E. coli– Genomic DNA difficult to clone (capsularpolysaccharides)– Unstable regions (low complexity)• Sequencing problems– Hard stops• Secondary structures• Very high or low GC content• Small unit tandem repeats– Loss of signal• Homopolymeric tracts• Very high or low GC content
<strong>Genome</strong> Closure StepsAssemble SequencesOrder and orient contigsValidate Sequence andAssembly IntegrityGap ClosureRepeat Identification andVerificationDesign PrimersSequencing GapsPhysical EndsTIGRTHE INSTITUTE FOR GENOMIC RESEARCHComplete BAC
Sequencing GapsContig AUnderlyingsequencesGAPPrimer 1 Primer 2Contig BTo close Sequencing <strong>gap</strong>s:A. Resequence short reads at contig endsB. Perform <strong>sequencing</strong> reactions:1. Design primers at contig ends2. Do <strong>sequencing</strong> reactions with:Insert 1 + primer 1 Insert 1 + primer 2Insert 2 + primer 1 Insert 2 + primer 2TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Physical EndsNo known links exist; there is no template available for<strong>sequencing</strong>.Contig CPhysical endTo link and close Physical ends use PCR:1. Design primers at contig ends2. If PCR products are obtained, sequence PCR products completelyTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Repetitive Areas• Repetitive areas are regions of high similarity within thegenome/BAC.• Sequences in these areas may be misassembled by theAssembler.• Verification of the sequence of repetitive areas:• A. Identify potential repetitive areas, using repeatFinder and other tools.• B. Classify repeats based on length, copy number, % similarity,structure and complexity.• C. If repeats are misassembled, transpose spanning clones or obtainPCR products and sequence to verify assembly.• Misassemblies can occur in two ways:TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
The base sequence of the repeat may bemisassembled:RPT 1C• Repeat is spanned by linking clones• Underlying sequences are unlinked clones, with mates inother assemblies (branched clones) or mates that did notsequence (dead end clones).• Results in incorrect consensus sequenceTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Mis-assembled repeatClones link different repeat flanksTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Identifying repetitive areasRepetitive areas are identified by the programrepeatFinderftp://ftp.tigr.org/pub/software/repeatFinder/• Analyzes sets of contigs for repeats greater than 50bp and then groups these repeats into classes• repeatFinder calls REPuter[Copyright© University of Bielefeld, Germanywww.genomes.de; Used with permission]TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Sequence editor• In this example thereis an obviousdiscrepancy betweenthe base calls ofseveral of theunderlying clones inthis region.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Assembly Viewer• Several clones appear tobe misassembled• There are both obvioussize violations andorientation issuesTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Resultingcontig afterrepeatresolutionTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
AMOS Assembly InvestigatorToolbarPositionInsert andRead CoverageScaffoldFeaturesInsertsTIGRTHE INSTITUTE FOR GENOMIC RESEARCHCurrent Contig Position
Collapsed Repeat-5.5 CE DipReadCoverageSpikeIndividualCompressedMatesTIGRTHE INSTITUTE FOR GENOMIC RESEARCH68 Correlated SNPs
Resolved Repeat• Unique flank order is correct• Use linking information across the repeat (largeinsert clones or PCR)• Consensus sequence is correct• Use linked clones that have one mate in the repeatand the other anchored in unique sequence• Transposon mediated librariesTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Sequence Validation: Editing• Finishers look at all areas where the consensussequence is below a certain quality threshold.• Electropherograms are used to determine which basecalls may need to be changed.• Only supporting sequences are edited, not theconsensus.• Editing provides an integrity check, may aid in assembly,identifying repeat areas and other unique problems.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Sequence Validation:Sequence coverage1X2X3XSequence coverage rule:Every base in an assembly must be covered by at least twosequences of high quality.Why?Validating sequence coverage provides a high degree ofconfidence in the consensus base calls.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Sequence Coverage in CloeTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Sequence Coverage in CloeTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Assembly Verification: CloneCoverageClone Coverage Rule:Every base must be spanned by two valid subclones.A valid subclone is a pair of forward/reverse mates that are:• Oriented correctly (pointing toward each other)• Separated by an approximate length of sequence determined bythe expected template size for the library.Why?Validating clone coverage provides a high degree ofconfidence that the contig was assembled correctly.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Clone Coverage ExampleCATCCGAAGACGATGCATCAATAGCATAClone-1 (F)Clone-1 (R)Clone-3 (R)Clone-2 (R)Clone-2 (F)Clone-5 (F)Clone-4 (R) Clone-6 (F) Clone-7 (R)0x1x• Red: 0 x clone coverage• Black: 1x clone coverage• Blue: 2x clone coverage2x1xTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Verify Closure by FingerprintComparisonTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Finished BAC Ready for <strong>Annotation</strong>!BAC DNASequencingShotgunLibraryEg. 4-6 KbContigsAssembly and ScaffoldingGap ClosureTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
EXTRA SLIDES STARTHERE…TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
HT TemplateLibraryConstructionHT SequencingClosure SequencingLIMSTrace ProcessingTrace ProcessingTrace DeliveryTrace DeliveryTrace ArchiveTIGRTHE INSTITUTE FOR GENOMIC RESEARCH<strong>Project</strong> DB
Representing assembliesconsensus startconsensus endconsensus ACTTTAGGAAGACGA-GAAGCCTGATAGGATCTCCGTCTCCGACaligned sequence CTGAC-ATGAA-CCTGATAGCAGsequence startsequence endAssembly table:consensus w/ and w/o <strong>gap</strong>sTIGRTHE INSTITUTE FOR GENOMIC RESEARCHAsmbl_link table:asmbl_idseq_nameseq_lend, seq_rendasm_lend, asm_rendlsequenceSequence table:sequenceend5, end3
Schematic View of 3 Typical ScaffoldsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Types of repeats• Short Tandem Repeats• Large Multi Unit Tandem Repeats• Multi Class Repeats• Inverted RepeatsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Short Tandem RepeatsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Large Multi Unit TandemRepeatsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Inverted RepeatsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Construction of Transposon-MediatedLibraries• Transposons are genetic elementsthat provide mobile primer siteswithin the target DNA for<strong>sequencing</strong>• The GPS-1 <strong>Genome</strong> PrimingSystem (New England BiolabsE7100S) enables transposoninsertions by using TnsABC*transposase+• Only one random insertion occursper target DNATIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Use of Transposon-mediated libraries:•Make a library using a transposon kit on a spanning clone.•Sequence the clones using the transposon primers•Assemble the sequencesPhysical map of thetransposon inserts:M13FM13FTransposon1 Transposon2N S N SRepeatClone DNAN1 S1M13RM13RN2S2TIGRFinal AssemblyTHE INSTITUTE FOR GENOMIC RESEARCH
Assembly of Transposon-Mediated Clone Walks12500 5000 7500 10000 1250015000GEGDK94TR2940 bpGEGDK94TFGEGDK94T175GEGDK94T48GEGDK94T22GEGDK94T6GEGDK94T34GEGDK94T73GEGDK94T72GEGDK94T36GEGDK94T41GEGDK94T77GEGDK94T106GEGDK94T10GEGDK94T12GEGDK94T47GEGDK94T189GEGDK94T90TIGRforward and reverse clone matestandem repeat regiontransposon insertionsiteTHE INSTITUTE FOR GENOMIC RESEARCH
Closure Challenges: SequencingThrough Secondary StructuresTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Hairpin structureTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
Homopolymeric tractsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
• Apply different <strong>sequencing</strong> chemistries– Big-Dye terminator (default)– Dye-primer– dGTP mix (GC rich regions)• Denature structures - Additives– Betaine– DMSO• Break structure– Restriction digest– Transposon insertion– Micro-librariesTIGRTHE INSTITUTE FOR GENOMIC RESEARCHSolutions
Automated Finishing at TIGR• Closure is a feature driven activity• Feature Classification SystemTIGR– Identify/Classify Features• SEQ <strong>gap</strong>s• PHY <strong>gap</strong>s / scaffold ends• Low coverage• Repeat features• Feature Reaction Design Strategy– Bin by Reaction Type and Feature Size• Primer walks• PCR• Transposon BombingTHE INSTITUTE FOR GENOMIC RESEARCH
Automated Finishing (cont’d)• Laboratory Process Integration– Order reactions from the lab– Group reactions by lab process– Integrate changing laboratory automation– Monitor results• Reaction Product Resolution Strategy– Resolve reaction products back to features & contigs:conduct targeted assembly.– Evaluate resolution criteria for feature type.TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
System ArchitectureCA PipelineInputProcessingFeatureIdentificationReactionDesignDBWork OrderPreparationReactionProcessingWork OrderManagementTIGRProductResolutionTHE INSTITUTE FOR GENOMIC RESEARCH
• Dead EndsBAC <strong>Project</strong> Challenges• Filling small <strong>gap</strong>s between adjacentsupercontigs• bp mismatches in BAC sequence overlaps• <strong>Project</strong> management/organizationTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
TIGRTHE INSTITUTE FOR GENOMIC RESEARCHDead EndsA Dead End occurs when a BAC at the end of asequence contig has no BAC-end hits or FPC BACsto extend the contig further• Design primers and PCR small unique region nearend of BAC• Hybridize to BAC filters to identify candidate BACs• End sequence and fingerprint candidate BACs toselect minimal tiling BAC• Sequence and Close new tiling BAC
Filling small <strong>gap</strong>s in a chromosome• GAP < 15kb:– PCR using genomic DNA template– Sequence and close PCR product• GAP > 15kb or unable to PCR:– Select <strong>gap</strong>-spanning BAC– Construct large insert library (10+kb inserts)– Sequence to low coverage (2-3x)– Scaffold mate pairs and select subclone tiling to close <strong>gap</strong>TIGRTHE INSTITUTE FOR GENOMIC RESEARCH
p Mismatches In BAC OverlapsWhy?• Basecalling error in <strong>sequencing</strong>• Point mutation in BAC• Misassembly of repeats near BAC endWhat to do?• Examine sequence quality in each BAC• PCR region using genomic DNA template• Sequence PCR to verify correct basecallTIGRTHE INSTITUTE FOR GENOMIC RESEARCH
<strong>Project</strong> Organization• Database to store various BAC data:– Tiling, overlap info– Mapping, FPC info– Sequencing and Closure data– Collaborator’s BAC data• Web-based BAC status tracking andclosure task tracking toolsTIGRTHE INSTITUTE FOR GENOMIC RESEARCH