Read Mapping - EMBL
Read Mapping
Tobias Rausch
July 2011
Data Analysis
• Read Mapping: Reference
• Assembly: Overlap, Layout, Consensus
Paired-End Sequencing
[Figure: target genome, fragments, sequenced paired-end reads]
Paired-End Libraries
[Figure: orientation of reads R1 and R2 in a paired-end library]
Mate-Pair Libraries
[Figure: orientation of reads R1 and R2 in a mate-pair library (R1 R2, R2 R1)]
Read Mapping
• Dynamic Programming: Quadratic algorithm
– Requires O(m*n) time and space for a read of length m and a reference genome of length n
• Infeasible for millions of short reads
[Figure: dynamic programming matrix of the read (length m) against the reference genome (positions 0 to n)]
Filtering
• Preprocess: build an Index of the Genome
• Filtration Phase: a Filter Algorithm reports Potential Matches for each Read
• Verification Phase: an Exact Algorithm separates True Matches from False Matches
Simple k-mer Index, k=3
S = ACGAAAACTCGATTACTCGACC
The index is filled by sliding over S: ACG at position 0, CGA at position 1, GAA at position 2, and so on.
k-mer Hitlist    k-mer Hitlist    k-mer Hitlist
AAA 3,4          ACC 19           CGA 1,9,17
AAC 5            ACG 0            … …
AAG Empty        ACT 6,14         GAA 2
AAT Empty        AGA …            … …
ACA Empty        … …              TTT Empty
• Size of that table: $|\Sigma|^k = 4^3 = 64$ entries
Searching a Read
k-mer Hitlist    k-mer Hitlist    k-mer Hitlist
AAA 3,4          ACC 19           CGA 1,9,17
AAC 5            ACG 0            … …
AAG Empty        ACT 6,14         GAA 2
AAT Empty        AGA …            … …
ACA Empty        … …              TTT Empty
• Read Sequence: ACTG
– Looking up its first 3-mer, ACT, gives potential matches at positions 6 and 14
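The index construction and lookup on the slides above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular read mapper; the function names `build_kmer_index` and `candidate_positions` are made up for this example, and the filter uses only the first k-mer of the read as a single seed.

```python
from collections import defaultdict


def build_kmer_index(s, k):
    """Map every k-mer of s to the list of its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    return index


def candidate_positions(index, read, k):
    """Filtration with a single seed: look up the first k-mer of the read."""
    return list(index.get(read[:k], []))


S = "ACGAAAACTCGATTACTCGACC"
index = build_kmer_index(S, 3)
hits = candidate_positions(index, "ACTG", 3)
print(hits)  # potential match positions that still need verification
```

Every reported position is only a candidate: the verification phase still has to align the full read there and discard false matches.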
Verification Algorithm
• Banded Dynamic Programming
[Figure: dynamic programming matrix of the read (length m) against the reference genome (positions 0 to n), restricted to a band of a given bandwidth]
Pairwise Alignment - Global Alignment -
• Needleman-Wunsch
• Constant gap penalty s
[Figure: global alignment against the reference]
Pairwise Alignment - Local Alignment -
• Smith-Waterman
• Constant gap penalty s
Pairwise Alignments - Variants
• Overlap alignments
• Semi-global alignments
• Banded alignments
[Figure: example sequences ACGTTAGTTAGC, ACTTAGCAACTTG, ...ACGGCCAACTTAGTTAGC... and AA-TTTGT; a band of width k in the dynamic programming matrix of S1 (0 to m) against S0 (0 to n)]
Banded Alignment
Banded vs. non-banded Alignment
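To make the difference between banded and non-banded alignment concrete, here is a small sketch of a global edit-distance computation with an optional band. It assumes unit costs for mismatches and gaps, which is a simplification and not necessarily the scoring scheme used on the slides; the function name `edit_distance` and its `band` parameter are chosen for this example only.

```python
def edit_distance(a, b, band=None):
    """Global alignment cost with unit mismatch and gap costs.

    If band is given, only cells with |i - j| <= band are filled, so the
    work drops from O(m*n) to roughly O(m*band). Alignments that would
    have to leave the band get cost infinity.
    """
    INF = float("inf")
    m, n = len(a), len(b)
    prev = [j if band is None or j <= band else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [INF] * (n + 1)
        lo = 0 if band is None else max(0, i - band)
        hi = n if band is None else min(n, i + band)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            cur[j] = min(
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # match or mismatch
                prev[j] + 1,                           # gap in b
                cur[j - 1] + 1,                        # gap in a
            )
        prev = cur
    return prev[n]


read = "ACGTAGTTAGC"
ref = "ACGTTAGTTAGC"
print(edit_distance(read, ref), edit_distance(read, ref, band=1))
```

As long as the optimal alignment stays inside the band, both calls return the same cost; a band that is too narrow can miss it.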
Read Mapping
[Figure: reads mapped to the reference. Source: Illumina]
Techniques
• Index
– Hash tables, k-mer Index
– Suffix trees, suffix arrays
– Burrows-Wheeler transform (BWT), built via a suffix array
• Filtering Algorithms
– Single or multiple seeds
– Pigeonhole principle
– q-gram filtering
• Verification
– Simple seed-and-extend
– Banded dynamic programming
– Quality-based dynamic programming
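The pigeonhole principle mentioned above can be illustrated as follows: if a read matches the reference with at most e errors and is split into e + 1 non-overlapping pieces, at least one piece must match exactly. The sketch below uses plain substring search instead of a real index, and the function names are hypothetical.

```python
def pigeonhole_seeds(read, max_errors):
    """Split the read into max_errors + 1 non-overlapping (offset, piece) seeds."""
    parts = max_errors + 1
    base, extra = divmod(len(read), parts)
    seeds, start = [], 0
    for p in range(parts):
        length = base + (1 if p < extra else 0)
        seeds.append((start, read[start:start + length]))
        start += length
    return seeds


def candidate_starts(reference, read, max_errors):
    """Filtration: exact-match every seed and report implied read start positions."""
    candidates = set()
    for offset, seed in pigeonhole_seeds(read, max_errors):
        if not seed:
            continue
        pos = reference.find(seed)
        while pos != -1:
            if pos - offset >= 0:
                candidates.add(pos - offset)
            pos = reference.find(seed, pos + 1)
    return sorted(candidates)


reference = "ACGAAAACTCGATTACTCGACC"
read = "AAACTCTA"  # reference[4:12] with one mismatch
print(candidate_starts(reference, read, max_errors=1))
```

With indels, the true start can deviate from the implied start by up to e positions, which is one reason the verification step is usually a banded alignment.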
Variant Calling<br />
SNPs Short Indels Structural Variants
SNP Calling
[Figure: a set of reads, aligned in forward and reverse orientation, from which a consensus sequence is derived]
• Variations: Indels & SNPs
• Sequencing errors: Insertions, deletions & basecalling errors
Variant Calling
SNPs Short Indels Structural Variants
(i) ELANDv2e → GATK
(ii) ELANDv2e → Mpileup → Annovar
SNP Calling
• Tools
– GATK (Genome Analysis Toolkit)
– SAMtools mpileup (MAQ SNP Caller)
– CASAVA SNP Caller
– Pyrobayes (454)
– GigaBayes
– Commercial packages (CLC Bio, Genomatix, etc.)
• My rough guess for the open-source community is that about 80% or 90% use GATK or SAMtools mpileup
SNP Annotation
• Differentiating
– Coding/Non-coding SNPs
– Known/Unknown SNPs (dbSNP, 1kGP)
– Homozygous/Heterozygous
• Annotating coding SNPs
– Affected Gene/Transcript names
– Silent/Non-silent mutations
– Amino acid change
• Predicting possible impacts of an amino acid substitution (SIFT, PolyPhen, …)
Annotation Tools
• Annotation of coding SNPs
– GATK (Genome Analysis Toolkit)
– ANNOVAR
– Commercial packages (CLC Bio, Genomatix, etc.)
• Predicting possible impacts of an amino acid substitution
– PolyPhen2, Mutation Taster, SIFT, etc.
GATK
Software library for tool development
• Java library
• Easily extendable
• Ready-to-use tools are scarce
• SNP Calling + Annotation is difficult to get running, long runtime (chr-by-chr works)
• Supports VCF
GATK – SNP Calling<br />
Genome Analysis Toolkit, McKenna et al.
SAMtools mpileup
• Call & Filter SNPs and INDELs
• Very fast compared to GATK
• Output is in VCF
• SNP Calling
samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf
• SNP Filtering
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
Annovar Annotation
Easy to use SNP annotation tool
• Set of Perl scripts
• Based on the UCSC annotation database
• Generic across different species
• SNP Calling + Annotation is easy to get running
• VCF is supported as input
Generic Annotation
• Drosophila
chrX.fa 2527302 A T hom 222 48 60 exonic CG10260 nonsynonymous SNV CG10260:NM_130658:exon10:c.6509A>T:p.Q2170L,
• Human
chr10.fa 93603 C T hom 43.3 17 42 exonic TUBB8 synonymous SNV TUBB8:NM_177987:exon4:c.729G>A:p.P243P,
• Format
chr pos ref alt hom/het snp qual. depth mq …
Gene-based Annotation
• Human
chr10.fa 93603 C T hom 43.3 17 42 exonic TUBB8 synonymous SNV TUBB8:NM_177987:exon4:c.729G>A:p.P243P,
• Type
– exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic
• Gene, Neighboring gene
– If the variant is exonic/intronic/ncRNA: Gene name
– Otherwise: Neighboring genes
Exonic Annotation
• Human
chr10.fa 93603 C T hom 43.3 17 42 exonic TUBB8 synonymous SNV TUBB8:NM_177987:exon4:c.729G>A:p.P243P,
• Type
– frameshift insertion, frameshift deletion, frameshift block substitution, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion, nonframeshift block substitution, nonsynonymous SNV, synonymous SNV, unknown
• Gene, transcript, sequence change in the corresponding transcript and amino acid change
Additional Annotation for Human<br />
chr15.fa 45404066 G A hom 104 16 60 snp132<br />
rs2001616 exonic DUOX2 nonsynonymous SNV<br />
DUOX2:NM_014080:exon5:c.413C>T:p.P138L, avsift 0.02 ljb_pp2 0.474<br />
segdup Score=0.973842;Name=chr15:45427292 tfbs<br />
Score=818;Name=V$MYCMAX_03 mce46way Score=317;Name=lod=26 band<br />
15q21.1 dgv Name=3961<br />
• Further Information (in bold)<br />
– dbSNP (v132), SIFT and PolyPhen2<br />
– Segmental duplication<br />
– Transcription factor binding site<br />
– Conserved element<br />
– Cytogenetic band<br />
– Structural variants in DGV and GWAS
Variant Calling by Consensus
Mpileup/Annovar<br />
SNP Calls, chr19<br />
Sample #Total #dbSNPs #Novel_SNVs #Synonymous_SNVs #Nonsynonymous_SNVs #Novel_synonymous_SNVs #Novel_nonsynonymous_SNVs<br />
Sample1 69682 66122 3560 895 880 31 62<br />
Sample2 69092 65590 3502 897 869 28 56<br />
GATK<br />
Sample #Total #dbSNPs #Novel_SNPs #Missense_SNPs #Nonsense_SNPs #Novel_missense_SNPs #Novel_nonsense_SNPs<br />
Sample1 80188 71113 9075 879 6 86 1<br />
Sample2 80008 71133 8875 884 5 76 1
SNP Call Overlap
• Sample1
– 69682 Mpileup/Annovar Calls
– 80188 GATK Calls
– 67149 calls in common (96% of the Mpileup/Annovar calls, 84% of the GATK calls)
• Sample2
– 69092 Mpileup/Annovar Calls
– 80008 GATK Calls
– 66855 calls in common (97% of the Mpileup/Annovar calls, 84% of the GATK calls)
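The overlap percentages can be reproduced from the call counts. The sketch below assumes that a call is identified by chromosome, position and alternative allele, which is an assumption of this example and not stated on the slide; `call_overlap` is a hypothetical helper.

```python
def call_overlap(calls_a, calls_b):
    """Return (#common, fraction of A that is shared, fraction of B that is shared).

    calls_a and calls_b are sets of hashable call identifiers, for example
    (chrom, pos, alt) tuples.
    """
    common = calls_a & calls_b
    return len(common), len(common) / len(calls_a), len(common) / len(calls_b)


# Toy call sets in place of real VCF records.
mpileup = {("chr19", 100, "T"), ("chr19", 250, "G"), ("chr19", 900, "A")}
gatk = {("chr19", 100, "T"), ("chr19", 250, "G"), ("chr19", 300, "C"), ("chr19", 512, "A")}
n_common, frac_mpileup, frac_gatk = call_overlap(mpileup, gatk)

# The same arithmetic on the Sample1 counts from the slide.
sample1 = (round(100 * 67149 / 69682), round(100 * 67149 / 80188))
print(n_common, frac_mpileup, frac_gatk, sample1)
```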
Somatic changes
• Call breakdown
– GATK: good quality, tumor-specific: 2263 calls (407 not in dbSNP)
– Mpileup/Annovar: good quality, tumor-specific: 1105 calls (116 not in dbSNP)
– Common calls: 445 (43 not in dbSNP)
• Common exonic, somatic, novel calls: 0
• Common exonic, somatic: 7
Study Design
Raw SNP Calls, Human Genome<br />
Sample #Total #dbSNPs #Novel_SNPs #Missense_SNPs #Nonsense_SNPs #Novel_missense_SNPs #Novel_nonsense_SNPs<br />
Sample1 3994410 3589636 404774 10970 93 1227 25<br />
Sample2 3783797 3400926 382871 10416 90 1142 23<br />
Even focusing only on the novel missense and nonsense SNPs leaves a list of more than a thousand SNPs per sample (1,252 and 1,165 here)!
Cross-comparisons!<br />
At least Tumor vs. Germline, even better multiple matched<br />
tumor-germline pairs showing a similar phenotype.
Comparing / Annotating Variant Calls
• Matched Tumor – Normal data
• Multiple Tumor Samples
• Time-course data
– 1) Initial Tumor
– 2) Remission
– 3) Relapse
– 4) Remission
– 5) Secondary Tumor
• Pedigree data
• Population Scale
Constraints
• Tumor heterogeneity
• Tumor purity
• Aneuploidy
• Sample availability
Computational Methods to Detect Genomic Rearrangements using Next-Generation Sequencing Data
Tobias Rausch
July 2011
Genomic Rearrangements
• 1 Kb to several Mb in size
• Copy number variants
– Deletion
– Duplication
• Insertion, Inversion, Translocation
• More abundant than SNPs
• Either neutral or non-neutral in function
• Non-neutral mechanisms
– Disrupting genes
– Creating fusion genes
– Copy number changes of dosage-sensitive genes
[Figure: rearrangements relative to a reference with segments α β γ]
Reference: α β γ
Deletion: α γ
Duplication: α β β γ
Insertion: α β δ γ
Inversion: γ β α
SNP, for comparison: …ACGATACG… vs. …ACGAGACG…
Technologies to Discover Genomic Rearrangements
Technologies
• Fluorescence in situ hybridization (FISH)
– Fluorescent probes (≈100kb) detect and localize the presence or absence of specific DNA sequence
– Perry et al. (2007)
• Comparative Genomic Hybridization (CGH)
– Test vs. reference sample
– 2.1 million probes
– Different types: Whole-Genome Tiling Arrays, Whole-Genome Exon-Focused Arrays, CNV Arrays
• Genome-Wide Human SNP Array 6.0
– 1.8 million genetic markers: 906,600 SNPs and 946,000 probes for CNVs
• Human 1M-Duo DNA Analysis BeadChip
– 1.2 million genetic markers for SNPs and CNV regions
– Targeted studies: 60,800 additional custom SNPs and 60,000 custom CNV targets
• Next-Generation Sequencing (NGS)
– Whole-genome sequencing
– Targeted, e.g. RNA-Seq
Focus on NGS
• Limitations of Arrays
– Lower resolution for genomic rearrangements
– Balanced events (e.g., inversions) cannot be detected using signal intensity differences
– No breakpoint information
[Figure: detectable event size (in bp) on an axis from 10^0 to 10^8 for Sanger sequencing, Next-Generation Sequencing, Arrays, FISH and Karyotype]
Computational Methods
Detecting Genomic Rearrangements
• Mate-pair or paired-end mapping abnormalities
• Read-depth signals
• Split-read alignments
• Local assembly of unmapped or single-anchored reads
Mate-pair or paired-end mapping abnormalities
[Figure: paired-end mapping signatures of insertions and deletions against the reference]
Lee et al. (2009)
Korbel et al. (2007)
Mate-pair or paired-end mapping abnormalities
[Figure: paired-end mapping signatures of a tandem duplication, a translocation (chrA, chrB) and a large insertion, each comparing the reference sequence with the newly sequenced genome]
Read-depth signals
Reads per window: 3 3 1 8 7
Copy number: 1 1 0 2 2
Chiang et al. (2009)
Read-depth signals: Log2 ratio
• Down syndrome
– Partial trisomy 21
[Figure: $\log_2 \frac{\#\text{Reads}_{\text{Disease}}}{\#\text{Reads}_{\text{Normal}}}$ along chr21]
Xie et al. (2009)
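The log2 ratio of read counts between a disease and a normal sample can be computed per window as sketched below. The sketch assumes both samples were already normalized to the same sequencing depth and adds a small pseudocount to avoid division by zero; both choices, and the name `log2_ratios`, are assumptions of this example.

```python
import math


def log2_ratios(disease_counts, normal_counts, pseudocount=0.5):
    """Per-window log2(disease / normal) read-count ratio.

    Assumes the two count vectors use the same windows and are
    depth-normalized. The pseudocount keeps empty windows finite.
    """
    return [
        math.log2((d + pseudocount) / (n + pseudocount))
        for d, n in zip(disease_counts, normal_counts)
    ]


# Toy example: the last two windows carry three copies instead of two.
ratios = log2_ratios([100, 100, 150, 150], [100, 100, 100, 100], pseudocount=0.0)
print([round(r, 3) for r in ratios])
```

A gain from two to three copies corresponds to a log2 ratio of log2(3/2), roughly 0.58, and a heterozygous loss to log2(1/2) = -1.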
Split-read alignments
[Figure: a read spanning a deletion, aligned in two parts against the reference sequence]
• Initialize top row with 0s (0, 0, 0, 0, …)
• Favor long gaps
• Search last row for maximum score
Ye et al. (2009)
Ameur et al. (2010)
Local assembly
[Figure: a large insertion in the reference sequence, flanked by mapped reads on both sides and reconstructed by local assembly]
Rausch et al. (2009)
• Anchoring assembled contigs of all unmapped reads
Hajirasouliha et al. (2010)
Computational Methods for De Novo Genomic Rearrangement Detection
Methods (columns): Paired-end mapping, Read-depth, Split-read, Local assembly
Event types (rows): Deletion, Short insertion (< Insert Size), Large insertion (> Insert Size), Inversion, Tandem duplication, Translocation, Gain/Loss (CNVs)
Region / Breakpoint resolution: Paired-end mapping: Region; Read-depth: Region; Split-read: Breakpoint; Local assembly: Breakpoint
Structural Variant Detection Tools
• Read-depth tools
– CNVer, CNVnator, etc.
• Paired-end mapping
– PEMer, BreakDancer
• Split-read
– AGE, Pindel
Validations
• Technical Validations
– PCR, Amplicon sequencing, Exon capture
• Recurrent Validation
– What is the variant allele frequency in a larger cohort?
  • Target Enrichment
  • Array studies
Variant Aware <strong>Read</strong> <strong>Mapping</strong><br />
<strong>Read</strong> to reference alignment<br />
dbSNP<br />
<strong>Read</strong> depth<br />
Reference<br />
Database of Genomic<br />
Variants<br />
Mate-pair or paired end<br />
information<br />
CNV or SV studies
… CATTTT [C/ T] TTTGAA …<br />
…CATTTT C TTTGAA…<br />
Known SNPs<br />
…CATTTT N TTTGAA…<br />
…CATTTT T TTTGAA…
Deletion<br />
30bp<br />
…GGCT AGCTCC…CTTACT TCAA…<br />
30bp 30bp<br />
30bp<br />
30bp<br />
Insertion<br />
Known Rearrangements<br />
Junction C (60bp)<br />
…GGCTTCAA…<br />
…GGCTAGCTCC… …CTTACTTCAA …<br />
Junction A (60bp) Junction B (60bp)<br />
30bp<br />
Lam et al. (2010)
Target Enrichment
Tobias Rausch
July 2011
Technologies
Target Enrichment Analysis<br />
• Quality Control<br />
• On-target / Off-target Analysis<br />
• Coverage Analysis<br />
• GC-Content, Allelic Balance, etc.<br />
• Data Analysis<br />
• SNP, Short Indel and SV Calling<br />
• Relating the Variant Calls to Public Databases
Individual Baits vs. Target Regions
On-target / Off-target Analysis<br />
• How many reads<br />
are on-target?
GA Lane, 50MB Kit<br />
• 37 million reads, 32 million mapped (86%)<br />
• 105bp reads
HiSeq Lane, 50MB Kit<br />
• 185 million reads, 160 million mapped (86%)<br />
• 105bp paired-end reads
Independent Comparison to RefSeq
1000 Genomes Project
Allelic Balance<br />
chr1, whole genome seq.
Allelic Balance<br />
chr1, exon capture
Allelic Balance for InDels 1bp-10bp<br />
(Whole Genome Seq.)
Allelic Balance for InDels 1bp-5bp<br />
(Target Enrichment)<br />
Caveats:<br />
(i) Short indels tend to occur around tandem repeats<br />
(ii) Alignment is much harder in these regions<br />
(iii) High false discovery rate of all InDel Callers (10% - 40% FDR)
Copy Number Variants
Reads per window: 3 3 1 8 7
Copy number: 1 1 0 2 2
Copy Number Variants: Log2 ratio
• Down syndrome
– Partial trisomy 21
[Figure: $\log_2 \frac{\#\text{Reads}_{\text{Disease}}}{\#\text{Reads}_{\text{Normal}}}$ along chr21]
Large Genomic Rearrangements
Read-depth approach
• Sample1
• Sample2
• Sample3
• Sample4
Source: Thomas Zichner
Case Study: HapMap Trio
• NA12878 and NA12891 were sequenced
Individual Baits vs. Target Regions
• All bait and target region coordinates are hg18
• Total length of target regions: 37,806,033 bp (≈38 Mb)
• Total length of bait sequences: 38,235,516 bp (≈38 Mb)
• Approximately 1% of the human genome
Exon Length Distribution
• All baits are of length 120bp
Target Region Length Distribution
• All baits are of length 120bp
Mapped Reads
• Mapped SOLiD reads
– NA12878: 133,915,955 mapped reads
– NA12891: 108,092,260 mapped reads
On-target / Off-target Analysis
Avg. Coverage for each Target<br />
Illumina SOLiD<br />
• SOLiD distribution is shifted to the right due to<br />
higher sequencing coverage
Avg. Coverage for each Target
• 3490 Targets without any mapped base
• 1280 Targets without any mapped base
• Overlap: 825
GC-Content Distribution
Histogram of GC-Content of<br />
Unmapped Targets - SOLiD
Histogram of GC-Content of<br />
Unmapped Targets - Illumina
53%<br />
On-target Ratio<br />
79%
59%<br />
On-target Ratio<br />
84%
Insert Size is very important!
Uniform Coverage across Targets?
• Subdivide each target into 10 non-overlapping windows
• Calculate coverage for each window
• Get the fractional coverage for each window compared to the total coverage of the target
• Add up all fractions for the same window across all targets
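The four steps above can be sketched as follows. The example assumes each target is given as a list of per-base coverage values and that the coverage of a window is the sum of its per-base coverage; both are assumptions of this sketch, as are the function names.

```python
def window_fractions(per_base_coverage, n_windows=10):
    """Fraction of a target's total coverage that falls into each window."""
    length = len(per_base_coverage)
    total = sum(per_base_coverage)
    fractions = []
    for w in range(n_windows):
        start = w * length // n_windows
        end = (w + 1) * length // n_windows
        window_cov = sum(per_base_coverage[start:end])
        fractions.append(window_cov / total if total else 0.0)
    return fractions


def summed_profile(targets, n_windows=10):
    """Add up the window fractions of all targets, window by window."""
    profile = [0.0] * n_windows
    for coverage in targets:
        for w, fraction in enumerate(window_fractions(coverage, n_windows)):
            profile[w] += fraction
    return profile


# Two toy targets: one uniformly covered, one covered only in its first half.
targets = [[1] * 20, [2] * 10 + [0] * 10]
profile = summed_profile(targets)
print([round(p, 2) for p in profile])
```

A flat profile indicates uniform coverage along the targets; a profile that drops towards the first and last windows indicates weaker coverage at target edges.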
Uniform Coverage across Targets?<br />
NA12878 - Illumina
Uniform Coverage across Targets?<br />
NA12891 - Illumina
50 random targets on chrX
Fraction of Reads on chrX
• Illumina data
– NA12878: 0.02475
– NA12891: 0.01329
• SOLiD data
– NA12878: 0.02995
– NA12891: 0.01162
To be or not to be<br />
… the same sample
Sample Swaps are very<br />
hard to detect
Creating a SNP Profile
[Figure: EnsemblDB API calls provide the SNP positions; consensus calls on the aligned reads give the consensus letter at the SNP locations]
Sample-specific SNP Profile
rs11510117 A
rs56311179 Y
rs4579752 R
rs11510118 G
…
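A sample-specific SNP profile and a comparison between two profiles could look like the sketch below. Letters such as Y and R are IUPAC ambiguity codes for heterozygous positions (Y = C/T, R = A/G). The 20% allele-support threshold, the concordance measure and all function names are assumptions of this example, not the procedure used for the slides.

```python
# IUPAC codes for one or two observed alleles.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("GT"): "K", frozenset("AC"): "M",
}


def consensus_letter(base_counts, min_fraction=0.2):
    """IUPAC letter for all bases supported by at least min_fraction of the reads."""
    total = sum(base_counts.values())
    if total == 0:
        return "N"
    alleles = frozenset(b for b, c in base_counts.items() if c / total >= min_fraction)
    return IUPAC.get(alleles, "N")


def profile_concordance(profile_a, profile_b):
    """Fraction of shared, called SNP positions with identical consensus letters."""
    shared = [rs for rs in profile_a
              if rs in profile_b and "N" not in (profile_a[rs], profile_b[rs])]
    if not shared:
        return None
    return sum(profile_a[rs] == profile_b[rs] for rs in shared) / len(shared)


sample1 = {"rs11510117": "A", "rs56311179": "Y", "rs4579752": "R", "rs11510118": "G"}
sample2 = {"rs11510117": "T", "rs56311179": "Y", "rs4579752": "R", "rs11510118": "G"}
print(consensus_letter({"C": 6, "T": 5}), profile_concordance(sample1, sample2))
```

Two libraries from the same individual should give a concordance close to 1, whereas unrelated samples differ at many polymorphic positions.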
Comparing SNP Profiles among Samples
[Figure: EnsemblDB API calls provide the SNP positions; consensus calls on the aligned reads of each sample give the consensus letter at the SNP locations; the resulting sample-specific SNP profiles (rs11510117, rs56311179, rs4579752, rs11510118, ENSSNP11715197, …) are compared across samples]
Read-depth Plot Comparisons
• Are characteristic amplifications/deletions reoccurring?
General Hints
• Blood/Tumor sample pairs should have the same sex (chrX/Y read count)
• Germline samples should show fewer aberrations than tumor samples
• SNP profiles can possibly be limited to SNPs that are known to be highly polymorphic
Thomas Zichner
Multiplexing
Illumina HiSeq 2000<br />
DNA sequencer<br />
Exponential Data Increase<br />
Source: Science Special Online Collection: Dealing with Data.<br />
Genome sequencing … Transcriptome sequencing … Protein-DNA binding (ChIP-Seq)<br />
Epigenetic analyses … Exome sequencing … Structural variation detection
Number of Single-End Reads per Lane
[Figure: millions of reads passing filter (PF) per lane, on an axis from 0 to 140, plotted by month from the arrival of the Illumina GA II in 2008 to June 2011, annotated with pipeline and software versions and the transitions from GA II to GA IIx to HiSeq 2000 (Genecore)]
Data Processing and Alignment: Custom Pipeline Script
• Illumina Pipeline lacks support for EMBL PBS cluster (no distributed make)
• HiSeq Flowcell requires a distributed alignment to be finished in 2-3 days
• Splitting by lane using our pipeline script
– Each lane on one node with 8 CPUs
– 64 CPUs handle alignment
[Figure: Lane 1, Lane 2, Lane 3, … Lane 8, each processed separately]
Multiplexing further complicates automated alignments
• With increased data volumes there is a high demand for multiplexing
• Current status of demultiplexing efforts:
– Split by barcode
– Allow barcode mismatches (enumerating a barcode’s neighborhood)
– Trim off barcodes
– Provide sets of balanced barcodes (important for fewer than 10 barcodes per lane)
[Figure: Lanes 1 to 8, each split into Barcode 1 to Barcode n, with samples aligned against different references (hg19, dm3)]
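The demultiplexing steps listed above (split by barcode, allow mismatches by enumerating a barcode's neighborhood, trim off barcodes) can be sketched as follows. The sketch assumes the barcode sits at the start of the read and that sequences reachable from two different barcodes are discarded as ambiguous; both are assumptions of this example, and the function names are hypothetical.

```python
from itertools import combinations, product


def neighborhood(barcode, max_mismatches=1, alphabet="ACGTN"):
    """All sequences within Hamming distance max_mismatches of the barcode."""
    result = {barcode}
    for d in range(1, max_mismatches + 1):
        for positions in combinations(range(len(barcode)), d):
            for letters in product(alphabet, repeat=d):
                seq = list(barcode)
                for p, letter in zip(positions, letters):
                    seq[p] = letter
                result.add("".join(seq))
    return result


def build_lookup(barcodes, max_mismatches=1):
    """Map every neighborhood sequence to its barcode, dropping ambiguous ones."""
    lookup, ambiguous = {}, set()
    for bc in barcodes:
        for seq in neighborhood(bc, max_mismatches):
            if seq in lookup and lookup[seq] != bc:
                ambiguous.add(seq)
            lookup[seq] = bc
    for seq in ambiguous:
        del lookup[seq]
    return lookup


def demultiplex(read, lookup, barcode_length):
    """Return (barcode, trimmed read), or (None, read) if no barcode matches."""
    barcode = lookup.get(read[:barcode_length])
    if barcode is None:
        return None, read
    return barcode, read[barcode_length:]


lookup = build_lookup(["ACGT", "TGCA"], max_mismatches=1)
print(demultiplex("ACGAGGGG", lookup, 4))  # one mismatch in the barcode
```

Precomputing the neighborhood turns barcode assignment into a single dictionary lookup per read, at the cost of a lookup table that grows quickly with the number of allowed mismatches.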
Computed balanced barcode distributions for<br />
various sample sizes
Barcode Performance
Barcode Set Criteria
• Balanced base composition
• Pairwise Hamming distance >= 3
• Non-extreme GC content
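The three criteria can be checked programmatically. In the sketch below, the GC-content range of 25% to 75% and the rule that no base may exceed 50% at any barcode position are illustrative thresholds chosen for this example, not values from the slides; `check_barcode_set` is a hypothetical helper.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_base_fraction(barcodes):
    """Largest fraction any single base reaches at any barcode position."""
    worst = 0.0
    for column in zip(*barcodes):
        for base in "ACGT":
            worst = max(worst, column.count(base) / len(column))
    return worst


def check_barcode_set(barcodes, min_distance=3, gc_range=(0.25, 0.75), max_fraction=0.5):
    """Evaluate the three criteria; all thresholds are assumptions of this sketch."""
    return {
        "distance_ok": min(hamming(a, b) for a, b in combinations(barcodes, 2)) >= min_distance,
        "gc_ok": all(gc_range[0] <= gc_content(bc) <= gc_range[1] for bc in barcodes),
        "balanced": max_base_fraction(barcodes) <= max_fraction,
    }


good = ["ACGTAC", "CGTACG", "GTACGT", "TACGTA"]
bad = ["AAAAAA", "AAAAAT"]
print(check_barcode_set(good), check_barcode_set(bad))
```

A pairwise Hamming distance of at least 3 means that a single sequencing error in a barcode can still be corrected unambiguously.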
Work-in-progress
• Everything split and aligned, with barcode-specific statistics
• This implies parallelization on the barcode level (for both Illumina-barcoded and custom-barcoded samples)
• Unified handling of custom barcoding and Illumina barcoding
• Requires significant changes to the Genecore database, the Illumina CASAVA pipeline, the tracking sheets for the current projects, …
[Figure: Lanes 1 to 8, each split into Barcode 1 to Barcode n, with samples aligned against different references (hg19, dm3)]
Variant Calling<br />
SNPs Short Indels Structural Variants<br />
(i) ELANDv2e Dindel<br />
(ii) ELANDv2e Mpileup
Short InDels
• Index-based read mappers are notoriously bad at gapped alignments
• What is obvious to you when looking at the multiple alignment is not obvious to the read mapper, because it has only a local view
– One read against the reference at a time
• Hence, almost all short InDel callers start with local realignment
– Time-consuming (depending on the number of realignment windows)
Short InDels - Tools<br />
• Open-source tools<br />
– Dindel<br />
– Pindel<br />
– MoDIL<br />
– GATK<br />
• Commercial packages<br />
– Maybe CLC Bio and others<br />
• Indel calling has a higher false positive rate<br />
than SNP calling
Short InDels<br />
Dindel, Albers et al.
Raw Short InDel Calls<br />
Sample #Total #Known_InDels(1kGP) #Novel_InDels #Coding_InDels #Coding_Novel_InDels<br />
Sample1 590642 142354 448288 386 143<br />
Sample2 584975 135403 449572 397 157<br />
Sample3 578649 137349 441300 380 138<br />
Sample4 566162 134539 431623 377 141<br />
Sample5 550717 133800 416917 355 127<br />
Sample6 570955 138643 432312 370 131<br />
Far fewer Indel calls than SNP calls. However, known Indels are scarce in public databases, and thus cross-comparisons are still very helpful.
Realignment<br />
Reference: ..GACTG--TACT..<br />
<strong>Read</strong> 1: GACTGACTACT<br />
<strong>Read</strong> 2: GACTGC-TACT<br />
<strong>Read</strong> 3: GACTGC-TACT<br />
<strong>Read</strong> 4: GACTGC-TACT<br />
<strong>Read</strong> 5: GACTGACTACT<br />
<strong>Read</strong> 6: GACTGC-TACT<br />
GACTG--TACT<br />
GACTGACTACT<br />
GACTG-CTACT<br />
GACTG-CTACT<br />
GACTG-CTACT<br />
GACTGACTACT<br />
GACTG-CTACT
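The effect of the realignment shown above can be checked with a small sketch. It counts, per column, how many reads disagree with the column majority and sums these counts. This simple disagreement count is an assumption used for illustration only; it is not claimed to be the exact score used by any realignment tool. The function name `column_disagreements` is hypothetical, and the read rows are copied from the slide (reference row omitted).

```python
from collections import Counter

def column_disagreements(rows):
    """Sum, over all columns, of the rows that differ from the column majority."""
    width = len(rows[0])
    assert all(len(r) == width for r in rows), "rows must have equal length"
    total = 0
    for column in zip(*rows):
        counts = Counter(column)
        total += len(column) - max(counts.values())
    return total

# Reads 1-6 before realignment (gaps placed independently per read).
before = [
    "GACTGACTACT",
    "GACTGC-TACT",
    "GACTGC-TACT",
    "GACTGC-TACT",
    "GACTGACTACT",
    "GACTGC-TACT",
]
# The same reads after realignment (gaps shifted into a common column).
after = [
    "GACTGACTACT",
    "GACTG-CTACT",
    "GACTG-CTACT",
    "GACTG-CTACT",
    "GACTGACTACT",
    "GACTG-CTACT",
]

print(column_disagreements(before), column_disagreements(after))  # 4 2
```

Shifting the gap by one column makes the C bases line up, so only the presence or absence of the inserted A remains as a disagreement.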
Realignment<br />
• Improves a crude, initial alignment<br />
• Usually employs some scheme of multiple<br />
alignment, although a full k-dimensional DP<br />
(k reads of length n) is prohibitive: O(n^k)<br />
• To save time, carried out only on “suspicious”<br />
regions
ReAligner: Anson and Myers<br />
• Objective: Minimize the consensus score for<br />
Σ = {a,c,g,t}:
ReAlignment Algorithm
Derive a consensus
Select a read for realignment
Cut it out of the consensus<br />
Bandwidth
ReAlignment<br />
• Using a weighted score<br />
– Consensus score<br />
– Fractional score<br />
• Iterate for all other reads<br />
• Repeat this realignment loop until the score no longer decreases
Two Scoring Functions<br />
• Consensus Score<br />
• Fractional Score<br />
• Final Score
Consensus Score
Fractional Score
Assembly Algorithms<br />
Tobias Rausch<br />
July 2011
De novo vs. Reference-based Assembly<br />
• De novo assemblers<br />
– Classical assemblers<br />
• Overlap - Layout - Consensus assembler<br />
– Short-read assemblers<br />
• De Bruijn graph assembler<br />
• Reference-based assembler
Assembly Types<br />
• Whole genome assembly<br />
• Transcriptome assembly<br />
• Metagenome assembly
History<br />
• 1995: Bacterium H. influenzae, 1.8 Mb genome<br />
• 2000: Drosophila, 130Mb genome<br />
• 2001: Human genome, 3Gb genome<br />
• Currently >6000 sequenced genomes
Whole Genome Sequencing Costs
Overlap - Layout – Consensus Assembler<br />
Overlap<br />
Layout<br />
Consensus
Assembly<br />
• Input: Set of paired-end, mate-pair and/or long<br />
reads<br />
• Output: Set of assembled contigs
Overlap-Layout-Consensus Assembly<br />
• Overlap phase<br />
– For each pair of reads get a potential overlap<br />
– The number of possible relations doubles when we<br />
also consider the reverse complement of each read
Dynamic programming
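A minimal sketch of a dynamic-programming overlap computation, assuming a simple scoring scheme (match +1, mismatch -1, gap -1) that is not taken from the slides. It scores the best alignment of some suffix of read `a` with some prefix of read `b`; the function name `overlap_score` is hypothetical. Time is O(m*n) per read pair, which is why this step becomes the bottleneck for large read sets.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning some suffix of a with some prefix of b."""
    m = len(b)
    # Row 0: a prefix of b aligned against nothing costs one gap per base.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        # Column 0 stays 0: the overlap may start at any position in a.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The last row covers alignments that end at the end of a;
    # the overlap may end at any position in b.
    return max(prev)

print(overlap_score("ACGTTGCA", "TGCAGGA"))  # 4 (overlap TGCA)
```

Only two rows of the table are kept, so memory is linear in the length of `b`; recovering the alignment itself would need the full table or a traceback structure.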
Overlap-Layout-Consensus Assembly<br />
Human genome assembly:<br />
About 27 million reads<br />
Number of required comparisons:
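The figure from the slide is not preserved here. As a rough sketch, assuming a naive all-against-all comparison in which every unordered pair of reads is compared once, the count can be computed as follows (the function name and the `both_strands` flag are hypothetical):

```python
def pairwise_comparisons(num_reads, both_strands=False):
    """Number of unordered read pairs; doubled if each pair is also
    compared with one read reverse-complemented."""
    pairs = num_reads * (num_reads - 1) // 2
    return 2 * pairs if both_strands else pairs

reads = 27_000_000
print(f"{pairwise_comparisons(reads):.2e}")                     # 3.64e+14
print(f"{pairwise_comparisons(reads, both_strands=True):.2e}")  # 7.29e+14
```

Hundreds of trillions of dynamic-programming alignments are out of reach, which motivates a filtering step before any exact overlap computation.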
Seed and Extend Approach
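A minimal sketch of the seed step, assuming exact k-mers are used as seeds: reads are indexed by their k-mers, and only read pairs sharing at least one k-mer become overlap candidates. The extend step (verifying each candidate with an alignment) is left out, and the function name `candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one exact k-mer."""
    index = defaultdict(set)  # k-mer -> indices of reads containing it
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = ["ACGTTGCA", "TGCAGGAT", "CCCCCCCC"]
print(candidate_pairs(reads, 4))  # {(0, 1)}: both contain TGCA
```

Only the candidate pairs are passed on to the expensive alignment, so unrelated reads such as the third one are never compared.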
Layout Phase<br />
• Potential overlaps are stored in a graph
Layout Phase<br />
• A simple heuristic: Spanning tree<br />
• In fact, it will be a spanning forest due to repeats<br />
or low-coverage regions
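The spanning-tree heuristic can be sketched with a Kruskal-style pass over the overlap graph. Preferring longer overlaps as edge weights is an assumption made here for illustration, as are the function name `spanning_forest` and the edge format `(read_i, read_j, overlap_length)`.

```python
def spanning_forest(num_reads, overlaps):
    """Pick overlaps greedily (longest first) without closing cycles.

    Returns the chosen edges and the number of connected components.
    """
    parent = list(range(num_reads))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for i, j, length in sorted(overlaps, key=lambda e: e[2], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two separate trees
            parent[root_i] = root_j
            chosen.append((i, j, length))
    components = len({find(x) for x in range(num_reads)})
    return chosen, components

overlaps = [(0, 1, 30), (1, 2, 25), (0, 2, 10), (3, 4, 40)]
edges, components = spanning_forest(5, overlaps)
print(edges, components)
```

In this toy input, reads 0 to 2 and reads 3 to 4 are never connected by an overlap, so the result is a forest with two trees, each of which would yield its own contig.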
Layout Phase<br />
• Lay out the reads<br />
• Result: A set of contigs
Consensus Phase<br />
• Scaffold the contigs with the help of mate-pairs<br />
• Result: A set of scaffolds
Summary: Classical Assembly<br />
• Overlap<br />
• Layout<br />
• Consensus
Celera Assembler: Some Statistics
Short <strong>Read</strong> Assembly<br />
• Classical assemblers are inappropriate for short<br />
read assembly because<br />
1. <strong>Read</strong>s are too short to compute a reliable pairwise<br />
overlap<br />
2. <strong>Read</strong>s are too numerous to compute all pairwise<br />
overlaps efficiently<br />
• Open question: How long do short reads need<br />
to be to apply the classical assemblers?
de Bruijn Graph<br />
• <strong>Read</strong>s are too short to compute a reliable<br />
pairwise overlap
de Bruijn Graph Construction<br />
• D_k = (V, E)<br />
– V = All length-k subfragments<br />
– E = Directed edges between consecutive<br />
subfragments<br />
• Nodes overlap by k-1 characters
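The construction above can be sketched as follows, using the node-centric definition from the slide: every length-k substring of a read is a node, and consecutive substrings within a read are joined by a directed edge. Reverse complements and sequencing errors are ignored in this sketch, and the function name `de_bruijn` is hypothetical.

```python
def de_bruijn(reads, k):
    """Map each k-mer (node) to the set of k-mers that directly follow it."""
    graph = {}
    for seq in reads:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # register nodes without successors too
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # consecutive k-mers share k-1 characters
    return graph

graph = de_bruijn(["ACGTC", "GTCA"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The two reads are never compared with each other; they are joined implicitly because both contain the node GTC.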
de Bruijn Graph<br />
- Local Sequence Similarity -
de Bruijn Graph<br />
• Assembly quality greatly depends on the chosen<br />
k-mer size<br />
• Assembly quality measure: N50<br />
– Sort contigs by length, longest first<br />
– Add up the contig lengths until the running sum reaches at least 50% of the total length of all contigs<br />
– N50 is the length of the last contig added<br />
– Example:<br />
• 100+80+70+50=300<br />
• N50=80 because 100+80 > 300/2
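The N50 computation can be sketched in a few lines. This version uses the common convention that the running sum must reach at least half of the total length; the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Length of the contig at which the running sum of the sorted
    (longest first) contig lengths reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

print(n50([100, 80, 70, 50]))  # 80
```

A larger N50 indicates that more of the assembly sits in long contigs, but it says nothing about whether those contigs are correct.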
Choice of k<br />
• “It’s a trade-off between specificity and<br />
sensitivity. Longer k-mers bring you more<br />
specificity (i.e. less spurious overlaps) but<br />
lower coverage … so there’s a sweet spot to be<br />
found with time and experience.” (D. Zerbino,<br />
Velvet)
Obvious Hints<br />
• k must be smaller than the read length<br />
• Usually an odd number, so that no k-mer equals its own reverse complement<br />
(a DNA palindrome; compare text palindromes such as Otto or “Ein Esel lese nie”)<br />
• Sometimes
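The reason for an odd k can be checked by brute force. In DNA, a palindrome is a k-mer that equals its own reverse complement; with odd k the middle base would have to be its own complement, which is impossible. The sketch below enumerates all k-mers for small k; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def count_self_complementary(k):
    """Count k-mers that are identical to their reverse complement."""
    return sum(
        1
        for bases in product("ACGT", repeat=k)
        if "".join(bases) == reverse_complement("".join(bases))
    )

for k in (2, 3, 4, 5):
    print(k, count_self_complementary(k))
```

For even k there are 4^(k/2) such k-mers (for example ACGT), while for odd k there are none, so a k-mer and its reverse complement are always distinct.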
K-mer Uniqueness Ratio
Whole Genome Assembly<br />
• Short-read assemblers<br />
– EULER-SR<br />
– Velvet<br />
– ABySS<br />
– SOAPdenovo<br />
– ALLPATHS-LG<br />
– Newbler (454 Data)<br />
• Overlap-layout-consensus assemblers<br />
– Celera assembler<br />
– Arachne<br />
– PCAP<br />
– Sanger String Graph Assembler
Transcriptome Assembly<br />
• Oases<br />
– http://www.ebi.ac.uk/~zerbino/oases/<br />
• Trinity<br />
– http://trinityrnaseq.sourceforge.net/<br />
• Rnnotator
Metagenomics Assembly<br />
• Highly heterogeneous data<br />
– Unequal representation of member species<br />
– Clusters of closely related strains<br />
– Very difficult to disentangle into one assembly for<br />
one member species
Reference-based Assembly<br />
• MIRA3<br />
– www.chevreux.org/mira_downloads.html<br />
• AMOScmp<br />
– amos.sourceforge.net/docs/pipeline/AMOScmp.html
Thanks!