03.04.2013 Views

Read Mapping - EMBL


<strong>Read</strong> <strong>Mapping</strong><br />

Tobias Rausch<br />

July 2011


Data Analysis<br />

<strong>Read</strong><br />

<strong>Mapping</strong><br />

Assembly<br />

Reference<br />

Overlap<br />

Layout<br />

Consensus


Target<br />

genome<br />

Fragments<br />

Paired-End Sequencing<br />

...<br />

Sequenced<br />

reads<br />

Paired-end


Paired-End Libraries<br />

R1 R2


Mate-Pair Libraries<br />

R1 R2<br />

R2 R1


Read Mapping<br />

• Dynamic Programming: Quadratic algorithm<br />

– Requires O(m*n) time and space<br />

• Infeasible for millions of short reads<br />

[Figure: dynamic programming matrix aligning a read of length m against a reference genome of length n]


Filtering<br />

• Preprocess: Genome → Index<br />

• Filtration Phase: Read and Index → Filter Algorithm → Potential Matches<br />

• Verification Phase: Potential Matches → Exact Algorithm → True Matches, False Matches


Simple k-mer Index, k=3<br />

S = ACGAAAACTCGATTACTCGACC<br />

k-mer Hitlist k-mer Hitlist k-mer Hitlist<br />

AAA 3,4 ACC 19 CGA 1,9,17<br />

AAC 5 ACG 0 … …<br />

AAG Empty ACT 6,14 GAA 2<br />

AAT Empty AGA … … …<br />

ACA Empty … … TTT Empty<br />

• Size of that table: $|\Sigma|^k = 4^3 = 64$ entries
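The hit lists above can be reproduced with a few lines of Python. This is a minimal sketch, not the lecture's implementation: the function name `build_kmer_index` is made up for illustration, and a dictionary is used instead of a fixed table of 64 slots, so k-mers that never occur simply have no key.

```python
from collections import defaultdict


def build_kmer_index(text, k):
    """Map each k-mer occurring in `text` to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(text) - k + 1):
        index[text[pos:pos + k]].append(pos)
    return dict(index)


S = "ACGAAAACTCGATTACTCGACC"
index = build_kmer_index(S, 3)

for kmer in ("AAA", "AAC", "ACT", "ACC", "AAG"):
    # k-mers that never occur have no key, which corresponds to an empty hit list
    print(kmer, index.get(kmer, []))
```

Positions are appended in increasing order, so every hit list comes out sorted.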


Searching a Read<br />

k-mer Hitlist k-mer Hitlist k-mer Hitlist<br />

AAA 3,4 ACC 19 CGA 1,9,17<br />

AAC 5 ACG 0 … …<br />

AAG Empty ACT 6,14 GAA 2<br />

AAT Empty AGA … … …<br />

ACA Empty … … TTT Empty<br />

• Read Sequence: ACTG<br />

– The 3-mer ACT yields potential matches at positions 6 and 14
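The lookup for the read ACTG can be sketched as follows. The helper names are hypothetical; each k-mer hit is shifted by its offset in the read so that all seeds vote for the implied start position of the read in the reference.

```python
from collections import defaultdict


def build_kmer_index(text, k):
    """Map each k-mer of `text` to its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(text) - k + 1):
        index[text[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Collect reference start positions implied by every k-mer hit of the read."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            if hit >= offset:  # the implied read start must not be negative
                candidates.add(hit - offset)
    return sorted(candidates)


S = "ACGAAAACTCGATTACTCGACC"
index = build_kmer_index(S, 3)
print(candidate_positions("ACTG", index, 3))  # [6, 14]
```

Both candidates are only potential matches: the reference reads ACTC at positions 6 and 14, so the verification phase still has to decide whether one mismatch is acceptable.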


Verification Algorithm<br />

• Banded Dynamic Programming<br />

[Figure: dynamic programming matrix of a read (length m) against the reference genome (length n), restricted to a diagonal band of fixed bandwidth]
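A banded verification step might look like the sketch below. It computes a unit-cost edit distance and only fills cells within `band` of the main diagonal; the function name, the unit costs and the choice to return `None` when the band cannot reach the final cell are assumptions made for illustration.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b using only cells with |i - j| <= band.

    The result equals the true edit distance whenever that distance is <= band.
    Returns None if the length difference already exceeds the band.
    """
    inf = float("inf")
    m, n = len(a), len(b)
    if abs(m - n) > band:
        return None
    prev = [j if j <= band else inf for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [inf] * (n + 1)
        lo, hi = max(0, i - band), min(n, i + band)
        if lo == 0:
            cur[0] = i  # first column is still inside the band
        for j in range(max(1, lo), hi + 1):
            cur[j] = min(prev[j - 1] + (a[i - 1] != b[j - 1]),  # match / mismatch
                         prev[j] + 1,                            # gap in b
                         cur[j - 1] + 1)                         # gap in a
        prev = cur
    return prev[n]


print(banded_edit_distance("ACTG", "ACTC", 1))  # 1
```

Each row touches at most 2 * band + 1 cells, so the work is proportional to the read length times the bandwidth rather than to m * n.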


Pairwise Alignment<br />

- Global Alignment -<br />

• Needleman-Wunsch<br />

• Constant gap penalty s
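As a small illustration of the global alignment recurrence, the sketch below computes the Needleman-Wunsch score with a constant penalty `s` per gap position. The scoring values (match +1, mismatch -1, s = 2) are assumptions chosen for the example, not values from the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, s=2):
    """Optimal global alignment score with a constant penalty s per gap position."""
    n = len(b)
    prev = [-s * j for j in range(n + 1)]  # first row: b aligned against gaps only
    for i in range(1, len(a) + 1):
        cur = [-s * i] + [0] * n           # first column: a aligned against gaps only
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] - s, cur[j - 1] - s)
        prev = cur
    return prev[n]


print(needleman_wunsch_score("ACGT", "ACGT"))  # 4
print(needleman_wunsch_score("ACGT", "AGT"))   # 1: three matches and one gap
```

Only two rows are kept in memory here because the sketch returns the score alone; recovering the alignment itself would need the full matrix or a traceback structure.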


Global Alignment<br />

Reference


Pairwise Alignment<br />

- Local Alignment -<br />

• Smith-Waterman<br />

• Constant gap penalty s
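The local variant differs from the global one in two places: every cell is clamped at zero, and the result is the maximum over the whole matrix. A minimal sketch, again with assumed scoring values (match +1, mismatch -1, s = 2):

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, s=2):
    """Best local alignment score with a constant penalty s per gap position."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] - s, cur[j - 1] - s)  # 0 restarts the alignment
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 3, from the shared substring ACG
```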


Pairwise Alignments - Variants<br />

• Overlap alignments<br />

ACGTTAGTTAGC<br />

ACTTAGCAACTTG<br />

• Semi-global alignments<br />

...ACGGCCAACTTAGTTAGC...<br />

AA-TTTGT<br />

• Banded alignments<br />

[Figure: dynamic programming matrix of sequences S0 (length n) and S1 (length m), restricted to a band of width k]


Banded Alignment


Banded vs. non-banded Alignment


<strong>Read</strong> <strong>Mapping</strong><br />

Reference<br />

Source: illumina


Techniques<br />

• Index<br />

– Hash tables, k-mer Index<br />

– Suffix trees, suffix arrays<br />

– Burrows-Wheeler transform (BWT), built from a suffix array<br />

• Filtering Algorithms<br />

– Single or multiple seeds<br />

– Pigeonhole principle<br />

– q-gram filtering<br />

• Verification<br />

– Simple seed-and-extend<br />

– Banded dynamic programming<br />

– Quality-based dynamic programming
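The pigeonhole principle in the list above says that a read matching with at most e errors, cut into e + 1 non-overlapping pieces, must contain at least one piece that matches exactly. The sketch below uses that idea for filtering and then verifies candidates by counting mismatches only. All function names are hypothetical, indels are ignored for simplicity, and the read is assumed to be longer than the number of allowed errors.

```python
def pigeonhole_seeds(read, max_errors):
    """Split the read into max_errors + 1 non-overlapping seeds as (offset, seed)."""
    parts = max_errors + 1
    base, extra = divmod(len(read), parts)
    seeds, offset = [], 0
    for p in range(parts):
        length = base + (1 if p < extra else 0)
        seeds.append((offset, read[offset:offset + length]))
        offset += length
    return seeds


def find_all(text, pattern):
    """Yield every start position of pattern in text, including overlapping ones."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


def map_with_mismatches(read, reference, max_errors):
    """Filter with exact seed hits, then verify each candidate by Hamming distance."""
    hits = set()
    for offset, seed in pigeonhole_seeds(read, max_errors):
        for pos in find_all(reference, seed):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if sum(x != y for x, y in zip(read, window)) <= max_errors:
                hits.add(start)
    return sorted(hits)


S = "ACGAAAACTCGATTACTCGACC"
print(pigeonhole_seeds("ACTG", 1))        # [(0, 'AC'), (2, 'TG')]
print(map_with_mismatches("ACTG", S, 1))  # [6, 14]
```

In a real mapper the exact seed search would go through an index rather than `str.find`, and verification would use an alignment that also allows indels.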


Variant Calling<br />

SNPs Short Indels Structural Variants


SNP Calling<br />

[Figure: set of reads, forward and reverse, aligned to the reference together with their consensus sequence]<br />

• Variations: Indels & SNPs<br />

• Sequencing errors: Insertions, deletions & basecalling errors


Variant Calling<br />

SNPs Short Indels Structural Variants<br />

(i) ELANDv2e → GATK<br />

(ii) ELANDv2e → Mpileup → Annovar


SNP Calling<br />

• Tools<br />

– GATK (Genome Analysis Toolkit)<br />

– SAMtools mpileup (MAQ SNP Caller)<br />

– CASAVA SNP Caller<br />

– Pyrobayes (454)<br />

– GigaBayes<br />

– Commercial packages (CLC Bio, Genomatix, etc.)<br />

• My rough guess for the open-source community is that about 80% to 90% use GATK or SAMtools mpileup


SNP Annotation<br />

• Differentiating<br />

– Coding/Non-coding SNPs<br />

– Known/Unknown SNPs (dbSNP, 1kGP)<br />

– Homozygous/Heterozygous<br />

• Annotating coding SNPs<br />

– Affected Gene/Transcript names<br />

– Silent/Non-silent mutations<br />

– Amino acid change<br />

• Predicting possible impacts of an amino acid substitution (SIFT, PolyPhen, …)


Annotation Tools<br />

• Annotation of coding SNPs<br />

– GATK (Genome Analysis Toolkit)<br />

– ANNOVAR<br />

– Commercial packages (CLC Bio, Genomatix, etc.)<br />

• Predicting possible impacts of an amino acid substitution<br />

– PolyPhen2, MutationTaster, SIFT, etc.


GATK<br />

Software library for tool development<br />

• Java library<br />

• Easily extendable<br />

• Ready-to-use tools are scarce<br />

• SNP Calling + Annotation is difficult to get running and has a long runtime (running chromosome by chromosome works)<br />

• Supports VCF


GATK – SNP Calling<br />

Genome Analysis Toolkit, McKenna et al.


SAMtools mpileup<br />

• Call & Filter SNPs and INDELs<br />

• Very fast compared to GATK<br />

• Output is in VCF<br />

• SNP Calling<br />

samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf<br />

• SNP Filtering<br />

bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf


Annovar Annotation<br />

Easy to use SNP annotation tool<br />

• Set of Perl scripts<br />

• Based on the UCSC annotation database<br />

• Generic across different species<br />

• SNP Calling + Annotation is easy to get running<br />

• VCF is supported as input


Generic Annotation<br />

• Drosophila<br />

chrX.fa 2527302 A T hom 222 48 60 exonic CG10260 nonsynonymous SNV CG10260:NM_130658:exon10:c.6509A>T:p.Q2170L,<br />

• Human<br />

chr10.fa 93603 C T hom 43.3 17 42 exonic TUBB8 synonymous SNV TUBB8:NM_177987:exon4:c.729G>A:p.P243P,<br />

• Format<br />

chr pos ref alt hom/het snp qual. depth mq …


Gene-based Annotation<br />

• Human<br />

chr10.fa 93603 C T hom 43.3 17 42 exonic TUBB8 synonymous SNV TUBB8:NM_177987:exon4:c.729G>A:p.P243P,<br />

• Type<br />

– exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic<br />

• Gene, Neighboring gene<br />

– If the variant is exonic/intronic/ncRNA: Gene name<br />

– Otherwise: Neighboring genes


Exonic Annotation<br />

• Human<br />

chr10.fa 93603 C T hom 43.3 17 42 exonic TUBB8 synonymous SNV TUBB8:NM_177987:exon4:c.729G>A:p.P243P,<br />

• Type<br />

– frameshift insertion, frameshift deletion, frameshift block substitution, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion, nonframeshift block substitution, nonsynonymous SNV, synonymous SNV, unknown<br />

• Gene, transcript, sequence change in the corresponding transcript and amino acid change


Additional Annotation for Human<br />

chr15.fa 45404066 G A hom 104 16 60 snp132 rs2001616 exonic DUOX2 nonsynonymous SNV DUOX2:NM_014080:exon5:c.413C>T:p.P138L, avsift 0.02 ljb_pp2 0.474 segdup Score=0.973842;Name=chr15:45427292 tfbs Score=818;Name=V$MYCMAX_03 mce46way Score=317;Name=lod=26 band 15q21.1 dgv Name=3961<br />

• Further Information (fields snp132, avsift, ljb_pp2, segdup, tfbs, mce46way, band, dgv)<br />

– dbSNP (v132), SIFT and PolyPhen2<br />

– Segmental duplication<br />

– Transcription factor binding site<br />

– Conserved element<br />

– Cytogenetic band<br />

– Structural variants in DGV and GWAS


Variant Calling by Consensus


SNP Calls, chr19<br />

Mpileup/Annovar<br />

Sample #Total #dbSNPs #Novel_SNVs #Synonymous_SNVs #Nonsynonymous_SNVs #Novel_synonymous_SNVs #Novel_nonsynonymous_SNVs<br />

Sample1 69682 66122 3560 895 880 31 62<br />

Sample2 69092 65590 3502 897 869 28 56<br />

GATK<br />

Sample #Total #dbSNPs #Novel_SNPs #Missense_SNPs #Nonsense_SNPs #Novel_missense_SNPs #Novel_nonsense_SNPs<br />

Sample1 80188 71113 9075 879 6 86 1<br />

Sample2 80008 71133 8875 884 5 76 1


SNP Call Overlap<br />

• Sample1<br />

– 69682 Mpileup/Annovar Calls<br />

– 80188 GATK Calls<br />

– 67149 calls in common (96% of Mpileup/Annovar calls, 84% of GATK calls)<br />

• Sample2<br />

– 69092 Mpileup/Annovar Calls<br />

– 80008 GATK Calls<br />

– 66855 calls in common (97% of Mpileup/Annovar calls, 84% of GATK calls)
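The two percentages per sample are the shared calls divided by each caller's total. A short check of the arithmetic, using the call counts from the slide (the function name is only illustrative):

```python
def shared_percentages(calls_a, calls_b, common):
    """Percentage of each call set that is also present in the other set."""
    return round(100 * common / calls_a, 1), round(100 * common / calls_b, 1)


# (Mpileup/Annovar calls, GATK calls, calls in common)
print(shared_percentages(69682, 80188, 67149))  # Sample1: (96.4, 83.7)
print(shared_percentages(69092, 80008, 66855))  # Sample2: (96.8, 83.6)
```

In both samples nearly all Mpileup/Annovar calls are also made by GATK, while roughly one in six GATK calls has no Mpileup/Annovar counterpart.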


Somatic changes<br />

• Call breakdown<br />

– GATK: good quality, tumor-specific: 2263 calls (407 not in dbSNP)<br />

– Mpileup/Annovar: good quality, tumor-specific: 1105 calls (116 not in dbSNP)<br />

– Common calls: 445 (43 not in dbSNP)<br />

• Common exonic, somatic, novel calls: 0<br />

• Common exonic, somatic: 7


Study Design


Raw SNP Calls, Human Genome<br />

Sample #Total #dbSNPs #Novel_SNPs #Missense_SNPs #Nonsense_SNPs #Novel_missense_SNPs #Novel_nonsense_SNPs<br />

Sample1 3994410 3589636 404774 10970 93 1227 25<br />

Sample2 3783797 3400926 382871 10416 90 1142 23<br />

Even focusing only on the novel missense and nonsense SNPs still leaves a list of more than a thousand SNPs per sample!


Cross-comparisons!<br />

At least tumor vs. germline; even better are multiple matched tumor-germline pairs showing a similar phenotype.


Comparing / Annotating Variant Calls<br />

• Matched Tumor – Normal data<br />

• Multiple Tumor Samples<br />

• Time-course data<br />

1) Initial Tumor<br />

2) Remission<br />

3) Relapse<br />

4) Remission<br />

5) Secondary Tumor<br />

• Pedigree data<br />

• Population Scale<br />

Constraints<br />

• Tumor heterogeneity<br />

• Tumor purity<br />

• Aneuploidy<br />

• Sample availability


Computational Methods to Detect Genomic<br />

Rearrangements<br />

using Next-Generation Sequencing Data<br />

Tobias Rausch<br />

July 2011


Genomic Rearrangements<br />

• 1 Kb to several Mb in size<br />

• Copy number variants<br />

– Deletion<br />

– Duplication<br />

• Insertion, Inversion, Translocation<br />

• More abundant than SNPs (SNP example: …ACGATACG… vs. …ACGAGACG…)<br />

• Either neutral or non-neutral in function<br />

• Non-neutral mechanisms<br />

– Disrupting genes<br />

– Creating fusion genes<br />

– Copy number changes of dosage-sensitive genes<br />

[Figure: reference α β γ; Deletion α γ; Duplication α β β γ; Insertion α β δ γ; Inversion γ β α]


Technologies to Discover<br />

Genomic Rearrangements


Technologies<br />

• Fluorescent in situ hybridization (FISH)<br />

– Fluorescent probes (≈100kb) detect and localize the presence or absence of specific DNA sequence<br />

– Perry et al. (2007)<br />

• Comparative Genomic Hybridization (CGH)<br />

– Test vs. reference sample<br />

– 2.1 million probes<br />

– Different types<br />

• Whole-Genome Tiling Arrays<br />

• Whole-Genome Exon-Focused Arrays<br />

• CNV Arrays<br />

• Genome-Wide Human SNP Array 6.0<br />

– 1.8 million genetic markers<br />

• 906,600 SNPs<br />

• 946,000 probes for CNVs<br />

• Human 1M-Duo DNA Analysis BeadChip<br />

– 1.2 million genetic markers<br />

• Markers for SNPs and CNV regions<br />

– Targeted studies<br />

• 60,800 additional custom SNPs<br />

• 60,000 custom CNV-targets<br />

• Next-Generation Sequencing (NGS)<br />

– Whole-genome sequencing<br />

– Targeted, e.g. RNA-Seq


Focus on NGS<br />

• Limitations of Arrays<br />

– Lower resolution for genomic rearrangements<br />

– Balanced events (e.g., inversions) cannot be detected using signal intensity differences<br />

– No breakpoint information<br />

[Figure: detectable event size (in bp, axis from 10^0 to 10^8) for Sanger sequencing, Next-Generation Sequencing, Arrays, FISH and Karyotype]


Computational Methods


Detecting Genomic Rearrangements<br />

• Mate-pair or paired-end mapping abnormalities<br />

• Read-depth signals<br />

• Split-read alignments<br />

• Local assembly of unmapped or single-anchored reads<br />

[Figure: the four signal types shown relative to the reference]


Mate-pair or paired-end mapping abnormalities<br />

[Figure: paired-end mapping signatures of insertions and deletions]<br />

Lee et al. (2009)<br />

Korbel et al. (2007)
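One common way to turn paired-end mapping abnormalities into insertion and deletion candidates is to compare the mapped span of each read pair with the insert size distribution of the library. The sketch below is a simplified illustration, not the method of the cited papers: the function name and the cutoff of three standard deviations are assumptions.

```python
def classify_pair(span, mean_insert, sd_insert, cutoff=3.0):
    """Classify a read pair by the distance its two mates span on the reference.

    A span much larger than expected suggests that sequence present in the
    reference is missing in the sample (deletion); a span much smaller than
    expected suggests extra sequence in the sample (insertion).
    """
    if span > mean_insert + cutoff * sd_insert:
        return "deletion"
    if span < mean_insert - cutoff * sd_insert:
        return "insertion"
    return "concordant"


for span in (200, 1000, 100):
    print(span, classify_pair(span, mean_insert=200, sd_insert=10))
```

A single discordant pair is weak evidence; callers typically require a cluster of pairs that support the same event.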


Mate-pair or paired-end mapping abnormalities<br />

[Figure: reference sequence vs. newly sequenced genome for a tandem duplication, a translocation (chrA, chrB) and a large insertion]


Read-depth signals<br />

[Figure: read counts per window (3, 3, 1, 8, 7) and the inferred copy numbers (1, 1, 0, 2, 2)]<br />

Chiang et al. (2009)


Read-depth signals<br />

• Log2 ratio: $\log_2 \frac{\#\text{Reads}_{\text{Disease}}}{\#\text{Reads}_{\text{Normal}}}$<br />

• Down syndrome<br />

– Partial trisomy 21<br />

[Figure: log2 ratio along chr21]<br />

Xie et al. (2009)
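The log2 ratio for one window can be computed as below. Dividing by the total number of mapped reads per sample is an assumption added here so that libraries of different size are comparable; the function name is illustrative.

```python
import math


def log2_ratio(disease_reads, normal_reads, disease_total, normal_total):
    """log2 ratio of library-size-normalised read counts in one window."""
    disease = disease_reads / disease_total
    normal = normal_reads / normal_total
    return math.log2(disease / normal)


# Three copies instead of two should give about log2(3/2) = 0.585
print(round(log2_ratio(150, 100, 1_000_000, 1_000_000), 3))  # 0.585
print(log2_ratio(100, 100, 1_000_000, 1_000_000))            # 0.0
```

Windows with no reads in the normal sample would need special handling, for example a pseudocount, which this sketch leaves out.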


Split-read alignments<br />

[Figure: a read split across a deletion in the reference sequence]<br />

• Initialize top row with 0s<br />

• Favor long gaps<br />

• Search last row for maximum score<br />

Ye et al. (2009)<br />

Ameur et al. (2010)
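Two of the three steps above, the zero-initialised top row and the search for the maximum in the last row, can be shown with a small semi-global alignment. This is a sketch under stated assumptions: the scoring values are made up, and the gap cost is a plain per-position penalty, so the "favor long gaps" step of a real split-read aligner is not implemented here.

```python
def semi_global(read, reference, match=1, mismatch=-1, gap=-1):
    """Align the whole read to any substring of the reference.

    Returns (best score, end position in the reference, exclusive).
    """
    n = len(reference)
    prev = [0] * (n + 1)  # top row of 0s: the alignment may start anywhere
    for i in range(1, len(read) + 1):
        cur = [prev[0] + gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if read[i - 1] == reference[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # last row: the alignment may end anywhere, so take the (first) maximum
    end = max(range(n + 1), key=prev.__getitem__)
    return prev[end], end


S = "ACGAAAACTCGATTACTCGACC"
print(semi_global("GATTAC", S))  # (6, 16)
```

To favor one long gap over several short ones, the usual approach is an affine scheme with separate gap-open and gap-extend costs, which needs additional matrices.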


Local assembly<br />

[Figure: a large insertion relative to the reference sequence; the mapped reads on both sides anchor a local assembly of the inserted sequence]<br />

Rausch et al. (2009)<br />

• Anchoring assembled contigs of all unmapped reads<br />

Hajirasouliha et al. (2010)


Computational Methods for De Novo Genomic Rearrangement Detection<br />

Methods (columns): Paired-end mapping, Read-depth, Split-read, Local assembly<br />

Region / Breakpoint: Region, Region, Breakpoint, Breakpoint<br />

Event types (rows): Deletion, Short insertion (< Insert Size), Large insertion (> Insert Size), Inversion, Tandem duplication, Translocation, Gain/Loss (CNVs)<br />

[Table: per-event detection marks not preserved]


Structural Variant Detection Tools<br />

• Read-depth tools<br />

– CNVer, CNVnator, etc.<br />

• Paired-end mapping<br />

– PEMer, BreakDancer<br />

• Split-read<br />

– AGE, Pindel


Validations<br />

• Technical Validations<br />

– PCR, Amplicon sequencing, Exon capture<br />

• Recurrent Validation<br />

– What is the variant allele frequency in a larger cohort?<br />

• Target Enrichment<br />

• Array studies


Variant Aware <strong>Read</strong> <strong>Mapping</strong><br />

<strong>Read</strong> to reference alignment<br />

dbSNP<br />

<strong>Read</strong> depth<br />

Reference<br />

Database of Genomic<br />

Variants<br />

Mate-pair or paired end<br />

information<br />

CNV or SV studies


Known SNPs<br />

…CATTTT [C/T] TTTGAA…<br />

…CATTTT C TTTGAA…<br />

…CATTTT T TTTGAA…<br />

…CATTTT N TTTGAA…


Known Rearrangements<br />

Reference with a known deletion or insertion: …GGCT AGCTCC…CTTACT TCAA…<br />

Junction A (60bp): …GGCTAGCTCC… (30bp on each side of the first breakpoint)<br />

Junction B (60bp): …CTTACTTCAA… (30bp on each side of the second breakpoint)<br />

Junction C (60bp): …GGCTTCAA… (30bp before the first breakpoint joined to 30bp after the second)<br />

Lam et al. (2010)
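Building such junction sequences from known breakpoints is a matter of slicing the reference. The sketch below uses a toy reference and 4bp flanks so the result can be compared with the sequences on the slide; the function names, the toy sequence and the 0-based, end-exclusive coordinates are assumptions.

```python
def breakpoint_junctions(reference, start, end, flank=30):
    """Junctions A and B: flank bp on each side of the two breakpoints."""
    junction_a = reference[max(0, start - flank):start + flank]
    junction_b = reference[max(0, end - flank):end + flank]
    return junction_a, junction_b


def joined_junction(reference, start, end, flank=30):
    """Junction C: flank bp before `start` joined to flank bp after `end`."""
    return reference[max(0, start - flank):start] + reference[end:end + flank]


# Toy reference: the segment between the breakpoints occupies positions 8..23
reference = "AAAAGGCT" + "AGCTCCTTTTCTTACT" + "TCAAGGGG"
print(breakpoint_junctions(reference, 8, 24, flank=4))  # ('GGCTAGCT', 'TACTTCAA')
print(joined_junction(reference, 8, 24, flank=4))       # GGCTTCAA
```

Adding these junction sequences to the mapping target lets reads that span a known rearrangement align end to end instead of being left unmapped or soft-clipped.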


Target Enrichment<br />

Tobias Rausch<br />

July 2011


Technologies


Target Enrichment Analysis<br />

• Quality Control<br />

– On-target / Off-target Analysis<br />

– Coverage Analysis<br />

– GC-Content, Allelic Balance, etc.<br />

• Data Analysis<br />

– SNP, Short Indel and SV Calling<br />

– Relating the Variant Calls to Public Databases


Individual Baits vs. Target Regions


On-target / Off-target Analysis<br />

• How many reads are on-target?
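The question can be answered by intersecting read alignments with the target intervals. The sketch below counts a read as on-target if it overlaps a target by at least one base, which is an assumption; some analyses require full containment or extend targets by a flanking margin. Names and data layout are illustrative.

```python
from bisect import bisect_right


def on_target_fraction(reads, targets):
    """Fraction of reads overlapping a target by at least one base.

    reads:   iterable of (chrom, start, end), half-open coordinates
    targets: {chrom: sorted list of non-overlapping (start, end)}, half-open
    """
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in targets.items()}
    on_target = total = 0
    for chrom, start, end in reads:
        total += 1
        intervals = targets.get(chrom, [])
        # index of the last target that begins before the read ends
        i = bisect_right(starts.get(chrom, []), end - 1) - 1
        if i >= 0 and intervals[i][1] > start:
            on_target += 1
    return on_target / total if total else 0.0


targets = {"chr1": [(100, 200), (300, 400)]}
reads = [("chr1", 150, 160), ("chr1", 190, 210), ("chr1", 200, 300), ("chr2", 150, 160)]
print(on_target_fraction(reads, targets))  # 0.5
```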


GA Lane, 50MB Kit<br />

• 37 million reads, 32 million mapped (86%)<br />

• 105bp reads


HiSeq Lane, 50MB Kit<br />

• 185 million reads, 160 million mapped (86%)<br />

• 105bp paired-end reads


Independent Comparison to RefSeq


1000 Genomes Project


Allelic Balance<br />

chr1, whole genome seq.


Allelic Balance<br />

chr1, exon capture


Allelic Balance for InDels 1bp-10bp<br />

(Whole Genome Seq.)


Allelic Balance for InDels 1bp-5bp<br />

(Target Enrichment)<br />

Caveats:<br />

(i) Short indels tend to occur around tandem repeats<br />

(ii) Alignment is much harder in these regions<br />

(iii) High false discovery rate of all InDel Callers (10% - 40% FDR)


Copy Number Variants<br />

[Figure: read counts per window (3, 3, 1, 8, 7) and the inferred copy numbers (1, 1, 0, 2, 2)]<br />

• Log2 ratio: $\log_2 \frac{\#\text{Reads}_{\text{Disease}}}{\#\text{Reads}_{\text{Normal}}}$<br />

• Down syndrome<br />

– Partial trisomy 21<br />

[Figure: log2 ratio along chr21]


Large Genomic Rearrangements<br />

Read-depth approach<br />

[Figure: read-depth plots for Sample1, Sample2, Sample3 and Sample4]<br />

Source: Thomas Zichner


Case Study<br />

HapMap Trio<br />

• NA12878 and NA12891 were sequenced


Individual Baits vs. Target Regions<br />

• All bait and target region coordinates are hg18<br />

• Total length of target regions: 37,806,033 bp (≈38 Mb)<br />

• Total length of bait sequences: 38,235,516 bp (≈38 Mb)<br />

• Approximately 1% of the human genome


Exon Length Distribution<br />

• All baits are of length 120bp


Target Region Length Distribution<br />

• All baits are of length 120bp


Mapped Reads<br />

• Mapped SOLiD reads<br />

– NA12878: 133,915,955 mapped reads<br />

– NA12891: 108,092,260 mapped reads


On-target / Off-target Analysis


Avg. Coverage for each Target<br />

Illumina SOLiD<br />

• SOLiD distribution is shifted to the right due to higher sequencing coverage


Avg. Coverage for each Target<br />

• 3490 Targets without any mapped base<br />

• 1280 Targets without any mapped base<br />

• Overlap: 825


GC-Content Distribution


Histogram of GC-Content of<br />

Unmapped Targets - SOLiD


Histogram of GC-Content of<br />

Unmapped Targets - Illumina


53%<br />

On-target Ratio<br />

79%


59%<br />

On-target Ratio<br />

84%


Insert Size is very important!


Uniform Coverage across Targets?<br />

• Subdivide each target into 10 non-overlapping windows<br />

• Calculate coverage for each window<br />

• Get the fractional coverage for each window compared to the total coverage of the target<br />

• Add up all fractions for the same window across all targets
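The four steps above can be sketched as follows. The input format (one list of per-base coverage values per target) and the decision to skip targets that are uncovered or shorter than the number of windows are assumptions made for this example.

```python
def window_coverage_profile(coverage_by_target, windows=10):
    """Sum, per window index, the fraction of each target's coverage in that window."""
    profile = [0.0] * windows
    for cov in coverage_by_target:
        total = sum(cov)
        if total == 0 or len(cov) < windows:
            continue  # no coverage to distribute, or target too short to subdivide
        for w in range(windows):
            lo = w * len(cov) // windows
            hi = (w + 1) * len(cov) // windows
            profile[w] += sum(cov[lo:hi]) / total
    return profile


# Two uniformly covered targets: every window receives 0.1 from each target
uniform = window_coverage_profile([[5] * 20, [8] * 40])
print([round(x, 3) for x in uniform])

# Coverage concentrated in the second half of a target
print(window_coverage_profile([[1, 1, 3, 3]], windows=2))  # [0.25, 0.75]
```

A flat profile indicates uniform coverage along the targets; lower values in the first and last windows would point to coverage dropping towards the target edges.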


Uniform Coverage across Targets?<br />

NA12878 - Illumina


Uniform Coverage across Targets?<br />

NA12891 - Illumina


50 random targets on chrX


Fraction of Reads on chrX<br />

• Illumina data<br />

– NA12878: 0.02475<br />

– NA12891: 0.01329<br />

• SOLiD data<br />

– NA12878: 0.02995<br />

– NA12891: 0.01162
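Dividing the two chrX fractions gives a quick sex check. NA12878 is a female sample and NA12891 a male sample, so a ratio near 2 would be expected from two X copies versus one; the snippet below just does that division with the values from the slide.

```python
chrx_fraction = {
    "Illumina": {"NA12878": 0.02475, "NA12891": 0.01329},
    "SOLiD": {"NA12878": 0.02995, "NA12891": 0.01162},
}

ratios = {
    platform: round(values["NA12878"] / values["NA12891"], 2)
    for platform, values in chrx_fraction.items()
}
print(ratios)  # {'Illumina': 1.86, 'SOLiD': 2.58}
```

Both ratios are far from 1, so the two samples are clearly distinguishable by chrX read fraction on either platform.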


To be or not to be<br />

… the same sample


Sample Swaps are very<br />

hard to detect


Creating a SNP Profile<br />

• EnsemblDB API Calls → SNP Positions<br />

• Aligned reads → Consensus calls<br />

• Consensus letter at SNP locations → Sample-specific SNP Profile<br />

rs11510117 A<br />

rs56311179 Y<br />

rs4579752 R<br />

rs11510118 G<br />

…
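The letters Y and R in the profile are IUPAC ambiguity codes (Y = C or T, R = A or G), which is how a heterozygous consensus call fits into one character. A consensus letter could be derived from the bases observed at a SNP position roughly as follows; the function name and the 20% minimum allele fraction are assumptions for illustration.

```python
from collections import Counter

IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def consensus_letter(bases, min_fraction=0.2):
    """IUPAC consensus letter from the bases observed at one position."""
    counts = Counter(b for b in bases if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return "N"  # no usable coverage
    # keep at most the two most frequent alleles that pass the fraction threshold
    alleles = [b for b, c in counts.most_common(2) if c / total >= min_fraction]
    return IUPAC.get(frozenset(alleles), "N")


print(consensus_letter("CCCTTT"))      # Y
print(consensus_letter("AAAAAAAAAG"))  # A, the single G is treated as an error
```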


Comparing SNP Profiles among Samples<br />

[Figure: the same workflow (EnsemblDB API calls → SNP positions; aligned reads → consensus calls → consensus letter at SNP locations) repeated for several samples; the resulting sample-specific SNP profiles (rs11510117, rs56311179, rs4579752, rs11510118, ENSSNP11715197, …) are compared with each other]
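Comparing two profiles then reduces to counting agreeing letters at the SNPs both samples cover. In the sketch below the SNP identifiers come from the slide, but the letters of the second profile are invented for illustration, as is the function name.

```python
def profile_concordance(profile_a, profile_b):
    """Fraction of shared SNP positions at which two profiles carry the same letter."""
    shared = [snp for snp in profile_a if snp in profile_b]
    if not shared:
        return None  # nothing to compare
    same = sum(profile_a[snp] == profile_b[snp] for snp in shared)
    return same / len(shared)


sample_1 = {"rs11510117": "A", "rs56311179": "Y", "rs4579752": "R", "rs11510118": "G"}
sample_2 = {"rs11510117": "T", "rs56311179": "Y", "rs4579752": "R", "rs11510118": "G"}

print(profile_concordance(sample_1, sample_1))  # 1.0
print(profile_concordance(sample_1, sample_2))  # 0.75
```

Two libraries from the same individual should agree at nearly all informative positions, while unrelated individuals agree far less often, which is what makes the profile useful for spotting sample swaps.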


Read-depth Plot Comparisons<br />

• Are characteristic amplifications/deletions recurring?


General Hints<br />

• Blood/Tumor sample pairs should have the same sex (chrX/Y read count)<br />

• Germline samples should show fewer aberrations than tumor samples<br />

• SNP profiles can possibly be limited to SNPs that are known to be very polymorphic<br />

Thomas Zichner


Multiplexing


Illumina HiSeq 2000<br />

DNA sequencer<br />

Exponential Data Increase<br />

Source: Science Special Online Collection: Dealing with Data.<br />

Genome sequencing … Transcriptome sequencing … Protein-DNA binding (ChIP-Seq)<br />

Epigenetic analyses … Exome sequencing … Structural variation detection


Number of Single-End Reads per Lane<br />

[Figure: millions of reads PF per lane (y-axis, 0 to 140) by month from 2008 to 2011, starting with the arrival of the Illumina instrument and annotated with instrument and software versions for GA II, GA IIx and HiSeq2000]


Data Processing and Alignment: Custom Pipeline Script<br />

• Illumina Pipeline lacks support for EMBL PBS cluster (no distributed make)<br />

• HiSeq Flowcell requires a distributed alignment to be finished in 2-3 days<br />

• Splitting by lane using our pipeline script<br />

– Each lane on one node with 8 CPUs<br />

– 64 CPUs handle alignment<br />

[Figure: Lane 1, Lane 2, Lane 3, …, Lane 8 processed in parallel]


Multiplexing further complicates automated alignments<br />

• With increased data volumes there is a high demand for multiplexing<br />

• Current status of demultiplexing efforts:<br />

– Split by barcode<br />

– Allow barcode mismatches (enumerating a barcode’s neighborhood)<br />

– Trim off barcodes<br />

– Providing sets of balanced barcodes (important for fewer than 10 barcodes per lane)<br />

[Figure: Lane 1, Lane 2, Lane 3, …, Lane 8, each split into Barcode 1, Barcode 2, …, Barcode n and aligned to hg19 or dm3]
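Splitting by barcode with mismatch tolerance can be sketched by enumerating each barcode's neighborhood once and using it as a lookup table. This is not the pipeline script described on the slide: the function names are hypothetical, barcodes are assumed to have equal length and to sit at the start of the read, and only one mismatch is allowed.

```python
def neighborhood(barcode, alphabet="ACGTN"):
    """Yield the barcode itself and every sequence at Hamming distance 1."""
    yield barcode
    for i, original in enumerate(barcode):
        for c in alphabet:
            if c != original:
                yield barcode[:i] + c + barcode[i + 1:]


def build_lookup(barcodes):
    """Map every unambiguous neighbor to its barcode."""
    lookup, ambiguous = {}, set()
    for barcode in barcodes:
        for variant in neighborhood(barcode):
            if variant in lookup and lookup[variant] != barcode:
                ambiguous.add(variant)  # reachable from two different barcodes
            lookup[variant] = barcode
    for variant in ambiguous:
        del lookup[variant]
    return lookup


def demultiplex(reads, barcodes):
    """Assign reads to barcodes and trim the barcode; unassigned reads go to None."""
    lookup = build_lookup(barcodes)
    length = len(barcodes[0])
    bins = {barcode: [] for barcode in barcodes}
    bins[None] = []
    for read in reads:
        barcode = lookup.get(read[:length])
        bins[barcode].append(read[length:] if barcode else read)
    return bins


bins = demultiplex(["ACGTAAAA", "ACGAAAAA", "TGCATTTT", "GGGGCCCC"], ["ACGT", "TGCA"])
print(bins)
```

Dropping ambiguous neighbors matters only when two barcodes are fewer than three mismatches apart, which is one reason for the minimum pairwise distance used when designing barcode sets.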


Computed balanced barcode distributions for<br />

various sample sizes


Barcode Performance


Barcode Set Criteria<br />

• Balanced base composition<br />

• Pairwise Hamming distance >= 3<br />

• Non-extreme GC content
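The three criteria can be checked mechanically for a candidate set. Only the minimum Hamming distance of 3 comes from the slide; the GC range of 25% to 75% and the rule that no base may exceed 50% at any barcode position are assumed thresholds for this sketch, as are the function names.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def satisfies_criteria(barcodes, min_distance=3, gc_range=(0.25, 0.75),
                       max_base_fraction=0.5):
    """Check pairwise distance, per-barcode GC content and per-position balance."""
    distance_ok = all(hamming(a, b) >= min_distance
                      for a, b in combinations(barcodes, 2))
    gc_ok = all(gc_range[0] <= gc_content(b) <= gc_range[1] for b in barcodes)
    balanced = all(max(Counter(column).values()) / len(barcodes) <= max_base_fraction
                   for column in zip(*barcodes))
    return distance_ok and gc_ok and balanced


print(satisfies_criteria(["AACC", "CCAA", "GGTT", "TTGG"]))  # True
print(satisfies_criteria(["AAAA", "AAAT"]))                  # False
```

With a pairwise distance of at least 3, any single sequencing error in a barcode still leaves it closer to its own barcode than to any other, so one mismatch can be corrected unambiguously.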


Work-in-progress<br />

– Everything split and aligned, with barcode-specific statistics<br />

– This implies a parallelization on the barcode level (for both Illumina-barcoded and custom-barcoded samples)<br />

– Unified handling of custom barcoding and Illumina barcoding<br />

– Requires significant changes with regard to the Genecore database, Illumina CASAVA pipeline, tracking sheets for the current projects, …<br />

[Figure: Lane 1, Lane 2, Lane 3, …, Lane 8, each split into Barcode 1, Barcode 2, …, Barcode n and aligned to hg19 or dm3]


Variant Calling<br />

SNPs Short Indels Structural Variants<br />

(i) ELANDv2e Dindel<br />

(ii) ELANDv2e Mpileup


Short InDels<br />

• Index-based read mappers are notoriously bad at gapped alignments


Short InDels<br />

• What is obvious to you when looking at the multiple alignment isn’t obvious to the read mapper, because it has only a local view<br />

– One read against the reference at a time<br />

• Hence, almost all short InDel callers start with local realignment<br />

– Time-consuming (depending on the number of realignment windows)


Short InDels - Tools<br />

• Open-source tools<br />

– Dindel<br />

– Pindel<br />

– MoDIL<br />

– GATK<br />

• Commercial packages<br />

– Maybe CLC Bio and others<br />

• Indel calling has a higher false positive rate than SNP calling


Short InDels<br />

Dindel, Albers et al.


Raw Short InDel Calls<br />

Sample #Total #Known_InDels(1kGP) #Novel_InDels #Coding_InDels #Coding_Novel_InDels<br />

Sample1 590642 142354 448288 386 143<br />

Sample2 584975 135403 449572 397 157<br />

Sample3 578649 137349 441300 380 138<br />

Sample4 566162 134539 431623 377 141<br />

Sample5 550717 133800 416917 355 127<br />

Sample6 570955 138643 432312 370 131<br />

Far fewer indel calls than SNP calls. However, known indels are scarce in public databases, and thus cross-comparisons are still very helpful.


Realignment<br />

Before realignment:<br />

Reference: ..GACTG--TACT..<br />

Read 1: GACTGACTACT<br />

Read 2: GACTGC-TACT<br />

Read 3: GACTGC-TACT<br />

Read 4: GACTGC-TACT<br />

Read 5: GACTGACTACT<br />

Read 6: GACTGC-TACT<br />

After realignment:<br />

Reference: GACTG--TACT<br />

Read 1: GACTGACTACT<br />

Read 2: GACTG-CTACT<br />

Read 3: GACTG-CTACT<br />

Read 4: GACTG-CTACT<br />

Read 5: GACTGACTACT<br />

Read 6: GACTG-CTACT
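
To see why the second alignment on this slide is the better one, it helps to count how many read characters disagree with a column-wise majority consensus. The sketch below uses that simple mismatch count as a stand-in score; it is an assumption for illustration and not necessarily the scoring used by any particular realigner.

```python
from collections import Counter


def consensus(rows):
    """Column-wise majority character of equally long aligned rows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def mismatch_score(rows):
    """Number of characters that differ from the majority consensus."""
    cons = consensus(rows)
    return sum(ch != c for row in rows for ch, c in zip(row, cons))


# The six reads from the slide, first as initially mapped ...
before = [
    "GACTGACTACT",
    "GACTGC-TACT",
    "GACTGC-TACT",
    "GACTGC-TACT",
    "GACTGACTACT",
    "GACTGC-TACT",
]
# ... and then with the gap shifted by one column.
after = [
    "GACTGACTACT",
    "GACTG-CTACT",
    "GACTG-CTACT",
    "GACTG-CTACT",
    "GACTGACTACT",
    "GACTG-CTACT",
]

score_before = mismatch_score(before)  # two disagreeing columns
score_after = mismatch_score(after)    # one disagreeing column
```

Shifting the gap makes the C column unanimous, so the disagreement is confined to a single column that cleanly separates reads with and without the inserted A.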


Realignment<br />

• Improves a crude, initial alignment<br />

• Usually employs some scheme of multiple alignment, although a full k-dimensional DP over k reads of length n is prohibitive: O(n^k)<br />

• To save time, carried out only on “suspicious” regions


ReAligner: Anson and Myers<br />

• Objective: Minimize the consensus score for Σ = {a,c,g,t}:


ReAlignment Algorithm


Derive a consensus


Select a read for realignment


Cut it out of the consensus<br />

Bandwidth


ReAlignment<br />

• Using a weighted score<br />

– Consensus score<br />

– Fractional score<br />

• Iterate for all other reads<br />

• Repeat this realignment loop until the score no longer decreases


Two Scoring Functions<br />

• Consensus Score<br />

• Fractional Score<br />

• Final Score


Consensus Score


Fractional Score


Assembly Algorithms<br />

Tobias Rausch<br />

July 2011


De novo vs. Reference-based Assembly<br />

• De novo assemblers<br />

– Classical assemblers<br />

• Overlap - Layout - Consensus assembler<br />

– Short-read assemblers<br />

• De Bruijn graph assembler<br />

• Reference-based assembler


Assembly Types<br />

• Whole genome assembly<br />

• Transcriptome assembly<br />

• Metagenome assembly


History<br />

• 1995: Bacterium H. influenzae, 1.8 Mb genome<br />

• 2000: Drosophila, 130Mb genome<br />

• 2001: Human genome, 3Gb genome<br />

• Currently >6000 sequenced genomes


Whole Genome Sequencing Costs


Overlap-Layout-Consensus Assembler<br />

Overlap<br />

Layout<br />

Consensus


Assembly<br />

• Input: Set of paired-end, mate-pair and/or long reads<br />

• Output: Set of assembled contigs


Overlap-Layout-Consensus Assembly<br />

• Overlap phase<br />

– For each pair of reads, compute a potential overlap<br />

– The number of possible relations doubles when we also consider the reverse complement of each read


Dynamic programming


Overlap-Layout-Consensus Assembly<br />

Human genome assembly:<br />

About 27 million reads<br />

Number of required comparisons:
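
The comparison count can be estimated with a short calculation. This sketch assumes a plain all-against-all scheme, in which every unordered pair of reads is compared once, and optionally doubles the count when the reverse complement is considered as well; the function name is hypothetical.

```python
def pairwise_comparisons(n_reads, both_orientations=False):
    """Number of unordered read pairs, n * (n - 1) / 2.

    With both_orientations=True the count is doubled, since each pair
    is also compared with one read reverse-complemented.
    """
    pairs = n_reads * (n_reads - 1) // 2
    return 2 * pairs if both_orientations else pairs


n_reads = 27_000_000
pairs = pairwise_comparisons(n_reads)  # roughly 3.6e14
pairs_both = pairwise_comparisons(n_reads, both_orientations=True)
```

At several hundred trillion dynamic-programming alignments, the all-against-all approach is clearly impractical, which motivates the seed-and-extend filtering on the next slide.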


Seed and Extend Approach
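
A seed-and-extend overlap search can be sketched as follows. The simplifications are assumptions made for brevity: seeds are exact k-mers taken from the start of each read, the extension step requires an exact suffix-prefix match rather than a banded alignment, and only the forward strand is considered.

```python
from collections import defaultdict


def find_overlaps(reads, k=4, min_overlap=4):
    """Return (a, b, length) where a suffix of read a equals a prefix of read b.

    Seed: the first k-mer of every read is indexed.
    Extend: a seed hit inside read a is verified over the whole suffix.
    """
    if min_overlap < k:
        raise ValueError("min_overlap must be at least k")

    prefix_index = defaultdict(list)
    for idx, read in enumerate(reads):
        if len(read) >= k:
            prefix_index[read[:k]].append(idx)

    overlaps = []
    for a_idx, a in enumerate(reads):
        # start = 0 would be a containment, not a proper overlap
        for start in range(1, len(a) - k + 1):
            seed = a[start:start + k]
            for b_idx in prefix_index.get(seed, ()):
                if b_idx == a_idx:
                    continue
                b = reads[b_idx]
                length = len(a) - start
                if min_overlap <= length <= len(b) and a[start:] == b[:length]:
                    overlaps.append((a_idx, b_idx, length))
    return overlaps


reads = ["ACGTTGCA", "TTGCAGGA", "AGGATCCA"]
overlaps = find_overlaps(reads)
```

Only read pairs that share a seed are ever verified, so the quadratic number of pairwise comparisons is replaced by index lookups plus a small number of extensions.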


Layout Phase<br />

• Potential overlaps are stored in a graph


Layout Phase<br />

• A simple heuristic: Spanning tree<br />

• In fact, it will be a spanning forest due to repeats or low-coverage regions
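
The spanning-forest heuristic can be illustrated with a Kruskal-style pass over the overlap graph. This is a sketch under assumptions: edges are `(read_a, read_b, overlap_length)` tuples, longer overlaps are preferred, and a small union-find structure tracks which reads are already connected.

```python
def spanning_forest(n_reads, overlaps):
    """Greedy maximum-weight spanning forest over the overlap graph.

    Returns the chosen edges and the connected components; each
    component corresponds to one candidate contig.
    """
    parent = list(range(n_reads))

    def find(x):
        # Union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for a, b, length in sorted(overlaps, key=lambda edge: -edge[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate trees
            parent[root_a] = root_b
            chosen.append((a, b, length))

    components = {}
    for read in range(n_reads):
        components.setdefault(find(read), []).append(read)
    return chosen, sorted(components.values())


edges = [(0, 1, 5), (1, 2, 4), (0, 2, 3), (3, 4, 6)]
chosen, components = spanning_forest(5, edges)
```

Reads 0 to 2 and reads 3 to 4 end up in separate trees because no overlap connects them, which is the situation the slide describes for low-coverage regions.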


Layout Phase<br />

• Lay out the reads<br />

• Result: A set of contigs


Consensus Phase<br />

• Scaffold the contigs with the help of mate-pairs<br />

• Result: A set of scaffolds


Summary: Classical Assembly<br />

• Overlap<br />

• Layout<br />

• Consensus


Celera Assembler: Some Statistics


Short Read Assembly<br />

• Classical assemblers are inappropriate for short-read assembly because<br />

1. Reads are too short to compute a reliable pairwise overlap<br />

2. Reads are too numerous to compute all pairwise overlaps efficiently<br />

• Open question: How long do short reads need to be to apply the classical assemblers?


de Bruijn Graph<br />

• Reads are too short to compute a reliable pairwise overlap


de Bruijn Graph Construction<br />

• D_k = (V, E)<br />

– V = all length-k subfragments (k-mers)<br />

– E = directed edges between consecutive subfragments<br />

• Connected nodes overlap by k-1 characters
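
The construction above translates almost directly into code. The following is a minimal sketch following the node-centric definition on this slide (k-mers as nodes, edges between consecutive k-mers of a read); it ignores reverse complements and sequencing errors, and `build_de_bruijn` is an illustrative name.

```python
def build_de_bruijn(reads, k):
    """Map each k-mer (node) to the set of k-mers that directly follow it."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        # Consecutive k-mers share k-1 characters by construction
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
```

Both reads contribute the k-mers CGT and GTC, yet each appears only once in the graph: the overlap between the reads is found implicitly, without any pairwise read comparison.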


de Bruijn Graph<br />

- Local Sequence Similarity -


de Bruijn Graph<br />

• Assembly quality greatly depends on the chosen k-mer size<br />

• Assembly quality measure: N50<br />

– Sort contigs by length, longest first<br />

– Add up the contig lengths until the summed length reaches or exceeds 50% of the total length of all contigs; N50 is the length of the last contig added<br />

– Example:<br />

• Contig lengths 100, 80, 70, 50: total 100+80+70+50=300<br />

• N50=80 because 100+80 ≥ 300/2
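
The N50 computation is short enough to write out in full. A minimal sketch, using the convention that the running sum must reach at least half of the total length:

```python
def n50(contig_lengths):
    """Length of the contig at which the running sum, taken over contigs
    sorted longest first, reaches at least half of the total length."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


example = n50([100, 80, 70, 50])  # the example from the slide
```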


Choice of k<br />

• “It’s a trade-off between specificity and sensitivity. Longer k-mers bring you more specificity (i.e. less spurious overlaps) but lower coverage … so there’s a sweet spot to be found with time and experience.” (D. Zerbino, Velvet)


Obvious Hints<br />

• k must be smaller than the read length<br />

• Usually an odd number, so that no k-mer is a palindrome, i.e. identical to its own reverse complement (natural-language palindromes: Otto or “Ein Esel lese nie”)<br />

• Sometimes
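
The odd-k hint can be verified by brute force. In DNA terms, a palindrome is a k-mer that equals its own reverse complement; for odd k, the middle base would have to be its own complement, which is impossible. The sketch below simply enumerates all k-mers for small k; the helper names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def count_palindromic_kmers(k):
    """Number of k-mers that equal their own reverse complement."""
    count = 0
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


counts = {k: count_palindromic_kmers(k) for k in (3, 4, 5)}
```

For even k, the first half of the k-mer determines the second half, so there are 4^(k/2) palindromic k-mers; for odd k there are none.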


K-mer Uniqueness Ratio


Whole Genome Assembly<br />

• Short-read assemblers<br />

– EULER-SR<br />

– Velvet<br />

– ABySS<br />

– SOAPdenovo<br />

– ALLPATHS-LG<br />

– Newbler (454 Data)<br />

• Overlap-layout-consensus assemblers<br />

– Celera assembler<br />

– Arachne<br />

– PCAP<br />

– Sanger String Graph Assembler


Transcriptome Assembly<br />

• Oases<br />

– http://www.ebi.ac.uk/~zerbino/oases/<br />

• Trinity<br />

– http://trinityrnaseq.sourceforge.net/<br />

• Rnnotator


Metagenomics Assembly<br />

• Highly heterogeneous data<br />

– Unequal representation of member species<br />

– Clusters of closely related strains<br />

– Very difficult to disentangle into one assembly per member species


Reference-based Assembly<br />

• MIRA3<br />

– www.chevreux.org/mira_downloads.html<br />

• AMOScmp<br />

– amos.sourceforge.net/docs/pipeline/AMOScmp.html


Thanks!
