Read Mapping - EMBL

Read Mapping 

Tobias Rausch 

July 2011

Data Analysis 

Read 

Mapping 

Assembly 

Reference 

Overlap 

Layout 

Consensus

Target 

genome 

Fragments 

Paired-End Sequencing 

... 

Sequenced 

reads 

Paired-end

Paired-End Libraries

Paired-End Libraries 

R1 R2

Mate-Pair Libraries

Mate-Pair Libraries 

R1 R2 

R2 R1


Reference



Reference 

Reference Genome 

0 n 

m



• Dynamic Programming: Quadratic algorithm 

– Requires O(m*n) time and space 

• Infeasible for millions of short reads 

Reference 


0 n 

m

Genome 

Filtering 

Preprocess 

Index


Genome 

Filter 

Algorithm 

Filtering 

Preprocess 

Index 

Filtration Phase 

Potential Matches


Genome 

Filter 

Algorithm 

Exact 

Algorithm 

Filtering 

Preprocess 

Index 

Filtration Phase 

Potential Matches 

Verification Phase 

True Matches 

False Matches

Simple k-mer Index, k=3 

S = ACGAAAACTCGATTACTCGACC 

Hitlist Hitlist Hitlist 

AAA ACC CGA 

AAC ACG … 

AAG ACT GAA 

AAT AGA … 

ACA … TTT 

• Size of that table: 4^3 = 64 entries = |Σ|^k




AAA ACC CGA 

AAC ACG 0 … 

AAG ACT GAA 

AAT AGA … 

ACA … TTT 





AAA ACC CGA 1 

AAC ACG 0 … 

AAG ACT GAA 

AAT AGA … 

ACA … TTT 





AAA ACC CGA 1 

AAC ACG 0 … 

AAG ACT GAA 2 

AAT AGA … 

ACA … TTT 





AAA 3,4 ACC 19 CGA 1,9,17 

AAC 5 ACG 0 … … 

AAG Empty ACT 6,14 GAA 2 

AAT Empty AGA … … … 

ACA Empty … … TTT Empty

Searching a Read 


AAA 3,4 ACC 19 CGA 1,9,17 

AAC 5 ACG 0 … … 

AAG Empty ACT 6,14 GAA 2 

AAT Empty AGA … … … 

ACA Empty … … TTT Empty 

• Read Sequence: ACTG 

– Potential match at position 6 and 14
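The index construction and read lookup above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular read mapper; the function names `build_kmer_index`, `candidate_positions` and `verify` are hypothetical, positions are 0-based as on the slide, and only the first k-mer of the read is used as a seed.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_positions(read, index, k):
    """Filtration phase: look up only the first k-mer of the read."""
    return list(index.get(read[:k], []))

def verify(read, reference, pos, max_mismatches=1):
    """Verification phase (simplified): count mismatches at a candidate position."""
    window = reference[pos:pos + len(read)]
    if len(window) < len(read):
        return False
    return sum(a != b for a, b in zip(read, window)) <= max_mismatches

S = "ACGAAAACTCGATTACTCGACC"
index = build_kmer_index(S, 3)
hits = candidate_positions("ACTG", index, 3)  # seed ACT
```

Both candidate positions for the read ACTG carry ACTC in the reference, so they survive verification only if one mismatch is allowed.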


Verification Algorithm 

Banded Dynamic Programming 


0 n 

m 

Bandwidth
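As a rough sketch of banded dynamic programming, the following Python function computes an edit distance while filling only the cells within `band` diagonals of the main diagonal. The unit costs, the function name `banded_edit_distance` and the `band` parameter are assumptions made for illustration; the slides do not specify a scoring scheme.

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to cells with |i - j| <= band.

    Returns None if the end cell lies outside the band. The value is the
    true edit distance whenever it is <= band; otherwise it is an upper bound.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            diag = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])
            up = prev.get(j, INF) + 1        # gap in b
            left = cur.get(j - 1, INF) + 1   # gap in a
            cur[j] = min(diag, up, left)
        prev = cur
    return prev.get(m)
```

Only O(band * n) cells are filled instead of O(m * n), which is what makes verification of many candidate positions affordable.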

• Needleman-Wunsch 

Pairwise Alignment 

- Global Alignment - 

• Constant gap penalty s
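A minimal Needleman-Wunsch sketch in Python, assuming a simple match/mismatch scheme and a per-gap-character penalty. The parameter `gap` plays the role of the slide's penalty s; the concrete values (+1, -1, -1) are illustrative assumptions, not values from the lecture.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping only two rows of the DP matrix."""
    # First row: the empty prefix of a aligned to prefixes of b (all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # first column: prefix of a aligned to nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]  # bottom-right cell: both sequences fully aligned
```

For example, aligning ACGT against AGT gives three matches and one gap, for a score of 2 under these assumed values.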

Global Alignment 

Reference

• Smith-Waterman 

Pairwise Alignment 

- Local Alignment - 

• Constant gap penalty s
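A corresponding Smith-Waterman sketch, under the same assumed scoring values as a plain illustration. The two differences from the global case are that every cell is floored at 0 and that the result is the maximum over all cells rather than the bottom-right cell.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row is all zeros
    for i in range(1, len(a) + 1):
        cur = [0]  # first column is all zeros
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best
```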

Pairwise Alignments - Variants 

• Overlap alignments 

• Semi-global alignments 

• Banded alignments 

ACGTTAGTTAGC 

ACTTAGCAACTTG 

...ACGGCCAACTTAGTTAGC... 

AA-TTTGT 

0 

S 1 

m 

0 

S 0 

k 

n

Banded Alignment

Banded vs. non-banded Alignment


Reference 

Source: illumina

Techniques 

• Index 

– Hash tables, k-mer Index 

– Suffix trees, suffix arrays 

– Burrows-Wheeler transform (BWT), built from a suffix array 

• Filtering Algorithms 

– Single or multiple seeds 

– Pigeonhole principle 

– q-gram filtering 

• Verification 

– Simple seed-and-extend 

– Banded dynamic programming 

– Quality-based dynamic programming
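The pigeonhole principle mentioned under filtering can be illustrated as follows: if a read matches with at most e errors and is cut into e + 1 non-overlapping pieces, at least one piece must match exactly, so exact lookups of the pieces suffice as a filter. The helper `pigeonhole_seeds` below is a hypothetical name used only for this sketch.

```python
def pigeonhole_seeds(read, max_errors):
    """Split a read into max_errors + 1 non-overlapping seeds.

    Returns (offset, seed) pairs; offsets let a seed hit be converted
    back into a candidate start position of the whole read.
    """
    pieces = max_errors + 1
    bounds = [i * len(read) // pieces for i in range(pieces + 1)]
    return [(bounds[i], read[bounds[i]:bounds[i + 1]]) for i in range(pieces)]
```

A seed hit at reference position p with offset o suggests a read start near p - o, which is then passed to the verification phase.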

Variant Calling 

SNPs Short Indels Structural Variants

SNP Calling

Set of reads 

SNP Calling

Set of reads 

SNP Calling 

Forward

Set of reads 

SNP Calling 

Reverse

SNP Calling 

Consensus sequence

SNP Calling 

Variations: Indels & SNPs

SNP Calling 

Sequencing errors: Insertions, deletions & basecalling errors


SNPs Short Indels Structural Variants 

(i) ELANDv2e GATK 

(ii) ELANDv2e Mpileup Annovar

• Tools 

SNP Calling 

– GATK (Genome Analysis Toolkit) 

– SAMtools mpileup (MAQ SNP Caller) 

– CASAVA SNP Caller 

– Pyrobayes (454) 

– GigaBayes 

– Commercial packages (CLC Bio, Genomatix, etc.) 

• My rough guess for the open-source community 

is that about 80% or 90% use GATK or SAMtools 

mpileup

• Differentiating 

SNP Annotation 

– Coding/Non-coding SNPs 

– Known/Unknown SNPs (dbSNP, 1kGP) 

– Homozygous/Heterozygous 

• Annotating coding SNPs 

– Affected Gene/Transcript names 

– Silent/Non-silent mutations 

– Amino acid change 

• Predicting possible impacts of an amino acid 

substitution (Sift, Polyphen, …)

Annotation Tools 

• Annotation of coding SNPs 

• GATK (Genome Analysis Toolkit) 

• ANNOVAR 

• Commercial packages (CLC Bio, Genomatix, etc.) 

• Predicting possible impacts of an amino acid 

substitution 

• PolyPhen2, Mutation Taster, Sift, etc.

GATK 

• Java library 

• Easily extendable 

• Ready-to-use tools are scarce 

• SNP Calling + Annotation is difficult to get 

running, long runtime (chr-by-chr works) 

• Supports VCF 

Software library for tool development

GATK – SNP Calling 

Genome Analysis Toolkit, McKenna et al.

SAMtools mpileup 

• Call & Filter SNPs and INDELs 

• Very fast compared to GATK 

• Output is in VCF 

• SNP Calling 

samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf 

• SNP Filtering 

bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

• Set of perl scripts 

Annovar Annotation 

• Based on the UCSC annotation database 

• Generic across different species 

• SNP Calling + Annotation is easy to get running 

• VCF is supported as input 

Easy to use SNP annotation tool

• Drosophila 

Generic Annotation 

chrX.fa 2527302 A T hom 222 48 60 exonic 

CG10260 nonsynonymous SNV CG10260:NM_130658:exon10:c.6509A>T:p.Q2170L, 

• Human 

chr10.fa 93603 C T hom 43.3 17 42 exonic 

TUBB8 synonymous SNV TUBB8:NM_177987:exon4:c.729G>A:p.P243P, 

• Format 

chr pos ref alt hom/het snp qual. depth mq …

• Human 

Gene-based Annotation 



• Type 

– exonic, splicing, ncRNA, UTR5, UTR3, intronic, 

upstream, downstream, intergenic 

• Gene, Neighboring gene 

– If the variant is exonic/intronic/ncRNA 

• Gene name 

– Otherwise 

• Neighboring genes

• Human 

Exonic Annotation 



• Type 

– frameshift insertion, frameshift deletion, frameshift block 

substitution, stopgain, stoploss, nonframeshift insertion, 

nonframeshift deletion, nonframeshift block substitution, 

nonsynonymous SNV, synonymous SNV, unknown 

• Gene, transcript, sequence change in the 

corresponding transcript and amino acid 

change

Additional Annotation for Human 

chr15.fa 45404066 G A hom 104 16 60 snp132 

rs2001616 exonic DUOX2 nonsynonymous SNV 

DUOX2:NM_014080:exon5:c.413C>T:p.P138L, avsift 0.02 ljb_pp2 0.474 

segdup Score=0.973842;Name=chr15:45427292 tfbs 

Score=818;Name=V$MYCMAX_03 mce46way Score=317;Name=lod=26 band 

15q21.1 dgv Name=3961 

• Further Information (in bold) 

– dbSNP (v132), SIFT and PolyPhen2 

– Segmental duplication 

– Transcription factor binding site 

– Conserved element 

– Cytogenetic band 

– Structural variants in DGV and GWAS

Variant Calling by Consensus

Mpileup/Annovar 

SNP Calls, chr19 

Sample #Total #dbSNPs #Novel_SNVs #Synonymous_SNVs #Nonsynonymous_SNVs #Novel_synonymous_SNVs #Novel_nonsynonymous_SNVs 

Sample1 69682 66122 3560 895 880 31 62 

Sample2 69092 65590 3502 897 869 28 56 

GATK 

Sample #Total #dbSNPs #Novel_SNPs #Missense_SNPs #Nonsense_SNPs #Novel_missense_SNPs #Novel_nonsense_SNPs 

Sample1 80188 71113 9075 879 6 86 1 

Sample2 80008 71133 8875 884 5 76 1

SNP Call Overlap 

• Sample1 

– 69682 Mpileup/Annovar Calls 

– 80188 GATK Calls 

– 67149 calls in common (96%, 83%) 

• Sample2 

– 69092 Mpileup/Annovar Calls 

– 80008 GATK Calls 

– 66855 calls in common (97%, 84%)

• Call breakdown 

Somatic changes 

– GATK: good quality, tumor-specific: 2263 calls (407 not in dbSNP) 

– Mpileup/Annovar: good quality, tumor-specific: 1105 calls (116 not in dbSNP) 

– Common calls 445 (43 not in dbSNP) 

• Common exonic, somatic, novel calls: 0 

• Common exonic, somatic: 7

Study Design

Raw SNP Calls, Human Genome 


Sample1 3994410 3589636 404774 10970 93 1227 25 

Sample2 3783797 3400926 382871 10416 90 1142 23 

Even focusing only on the novel missense and nonsense 

SNPs leaves a list of several hundreds of SNPs!

Raw SNP Calls, Human Genome 


Sample1 3994410 3589636 404774 10970 93 1227 25 

Sample2 3783797 3400926 382871 10416 90 1142 23 

Cross-comparisons! 

At least Tumor vs. Germline, even better multiple matched 

tumor-germline pairs showing a similar phenotype.

Comparing / Annotating Variant Calls 

Matched 

Tumor – Normal 

data 

Multiple Tumor 

Probes 

Time-course 

data 

1) Initial Tumor 

2) Remission 

3) Relapse 

4) Remission 

5) Secondary 

Tumor 

Pedigree data 

Population Scale Constraints 

• Tumor heterogeneity 

• Tumor purity 

• Aneuploidy 

• Sample availability

Computational Methods to Detect Genomic 

Rearrangements 

using Next-Generation Sequencing Data 

Tobias Rausch 

July 2011

Genomic Rearrangements 

• 1 Kb to several Mb in size 

Deletion 

α β γ 

α γ



• Copy number variants 

– Deletion 

– Duplication 

Deletion 

Duplication 

α β γ 

α γ 

α β β 

γ




– Deletion 


• Insertion 

Deletion 

Duplication 

Insertion 

α β γ 

α γ 

α β β 

γ 

α β δ γ




– Deletion 


• Insertion, Inversion 

Deletion 

Duplication 

Insertion 

α β γ 

α γ 

α β β 

γ 

α β δ γ 

Inversion γ β α




– Deletion 


• Insertion, Inversion, Translocation 

Deletion 

Duplication 

Insertion 

α β γ 

α γ 

α β β 

γ 

α β δ γ 

Inversion γ β α




– Deletion 



• More abundant than SNPs 

…ACGATACG… 

…ACGAGACG… 

Deletion 

Duplication 

Insertion 

α β γ 

α γ 

α β β 

γ 

α β δ γ 

Inversion γ β α




– Deletion 



• More abundant than SNPs 

• Either neutral or non-neutral in function 

• Non-neutral mechanisms 

– Disrupting genes 

– Creating fusion genes 

– Copy number changes of dosage-sensitive genes 

Deletion 

Duplication 

Insertion 

α β γ 

α γ 

α β β 

γ 

α β δ γ 

Inversion γ β α

Technologies to Discover 

Genomic Rearrangements

Technologies 

• Fluorescent in situ hybridization (FISH) 

– Fluorescent 

probes (≈100kb) 

detect and 

localize the 

presence or 

absence of 

specific DNA 

sequence 

Perry et al. (2007)

Technologies 


• Comparative Genomic Hybridization (CGH) 

– Test vs. reference sample 

– 2.1 million probes 

– Different types 

• Whole-Genome Tiling Arrays 

• Whole-Genome Exon-Focused Arrays 

• CNV Arrays

Technologies 



• Genome-Wide Human SNP Array 6.0 

– 1.8 million genetic markers 

• 906,600 SNPs 

• 946,000 probes for CNVs

Technologies 




• Human 1M-Duo DNA Analysis BeadChip 

– 1.2 million genetic markers 

• Markers for SNPs and CNV regions 

– Targeted studies 

• 60,800 additional custom SNPs 

• 60,000 custom CNV-targets

Technologies 




• Human 1M-Duo DNA Analysis BeadChip 

• Next-Generation Sequencing (NGS) 

– Whole-genome sequencing 

– Targeted, e.g. RNA-Seq

Focus on NGS 

• Limitations of Arrays 

– Lower resolution for genomic rearrangements 

– Balanced events (e.g., inversions) cannot be detected using signal intensity differences 

– No breakpoint information 

[Figure: detectable event size (in bp), on a scale from 10^0 to 10^8, for Sanger sequencing, next-generation sequencing, arrays, FISH and karyotype]

Computational Methods

Mate-pair or paired-end 

mapping abnormalities 

Detecting 


Read depth signals 

Reference 

Split-Read alignments



Detecting 


Read depth signals 

Reference 

Split-Read alignments 

Unmapped or 

single-anchored 

reads 

Local assembly



Read-depth 

signals 

Split-read 

alignments 

Local assembly




signals 

Split-read 

alignments 

Local assembly




signals 

Split-read 

alignments 

Local assembly 

Insertions Deletions




signals 

Split-read 

alignments 

Local assembly




signals 

Split-read 

alignments 


Lee et al. (2009) 

Korbel et al. (2007)



Tandem 

Duplication 

Translocation 

Large Insertion 

Reference 

Sequence 

Newly 

Sequenced 

Genome 

Reference 

Sequences 

Newly 

Sequenced 

Genome 

Newly 

Sequenced 

Genome 

Reference 

Sequence 


signals 

Split-read 

alignments 

Tandem 

Duplication 

Translocation 


chrA 

chrB 

Local assembly



Tandem 

Duplication 

Translocation 


Reference 

Sequence 

Newly 

Sequenced 

Genome 

Reference 

Sequences 

Newly 

Sequenced 

Genome 

Newly 

Sequenced 

Genome 

Reference 

Sequence 


signals 

Split-read 

alignments 

Tandem 

Duplication 

Translocation 


chrA 

chrB 

Local assembly



Tandem 

Duplication 

Translocation 


Reference 

Sequence 

Newly 

Sequenced 

Genome 

Reference 

Sequences 

Newly 

Sequenced 

Genome 

Newly 

Sequenced 

Genome 

Reference 

Sequence 


signals 

Split-read 

alignments 

Tandem 

Duplication 

Translocation 


chrA 

chrB 

Local assembly




signals 

Split-read 

alignments 

Local assembly




signals 

Split-read 

alignments 

Local assembly




signals 

Split-read 

alignments 


3 3 1 8 7 

1 Copy 1 Copy 0 Copy 2 Copy 2 Copy 

Chiang et al. (2009)



Log2 ratio 

• Down syndrome 


signals 

– Partial trisomy 21 

log 2 

# 

# 

chr21 

Reads 


Split-read 

alignments 

Disease 

Normal 


Xie et al. (2009)



Reference 

Sequence 


signals 

Deletion 

Split-read 

alignments 

Local assembly



Reference 

Sequence 


signals 

Deletion 

Split-read 

alignments 

Local assembly



Reference 

Sequence 


signals 

Deletion 

Split-read 

alignments 

0, 0, 0, 0… Initialize top row with 0s 

Local assembly



Reference 

Sequence 


signals 

Deletion 

Split-read 

alignments 


Favor long gaps 

Local assembly



Reference 

Sequence 


signals 

Deletion 

Split-read 

alignments 


Favor long gaps 

Search last row for maximum score 


Ye et al. (2009) 

Ameur et al. (2010)



Reference 

Sequence 


signals 

Split-read 

alignments 



Mapped Reads Mapped Reads 

Local Assembly 

Rausch et al. (2009)



Reference 

Sequence 


signals 



Local 

Assembly 

Split-read 

alignments 

Local 

Assembly 

Local assembly



Reference 

Sequence 


signals 



Local 

Assembly 

Split-read 

alignments 

Anchoring assembled 

contigs of all 

unmapped reads 

Local 

Assembly 


Hajirasouliha et al. (2010)

Deletion 

Short insertion 

(< Insert Size) 

Large insertion 

(> Insert Size) 

Inversion 

Tandem duplication 

Translocation 

Gain/Loss (CNVs) 

Computational Methods for De Novo 

Genomic Rearrangement Detection 

Paired-end mapping Read-depth Split-read Local assembly 

Region / Breakpoint Region Region Breakpoint Breakpoint

Structural Variant Detection 

• Read-depth tools 

– CNVer, CNVnator, etc. 

• Paired-end mapping 

– PEMer, Breakdancer 

• Split-read 

– Age, Pindel 

Tools

• Technical Validations 

Validations 

– PCR, Amplicon sequencing, Exon capture 

• Recurrent Validation 

– What is the variant allele frequency in a larger 

cohort 

• Target Enrichment 

• Array studies

Variant Aware Read Mapping 

Read to reference alignment 

dbSNP 

Read depth 

Reference 

Database of Genomic 

Variants 

Mate-pair or paired end 

information 

CNV or SV studies

… CATTTT [C/ T] TTTGAA … 

…CATTTT C TTTGAA… 

Known SNPs 

…CATTTT N TTTGAA… 

…CATTTT T TTTGAA…

Deletion 

30bp 

…GGCT AGCTCC…CTTACT TCAA… 

30bp 30bp 

30bp 

30bp 

Insertion 

Known Rearrangements 

Junction C (60bp) 

…GGCTTCAA… 

…GGCTAGCTCC… …CTTACTTCAA … 

Junction A (60bp) Junction B (60bp) 

30bp 

Lam et al. (2010)

Target Enrichment 

Tobias Rausch 

July 2011

Technologies

Technologies

Target Enrichment Analysis 

• Quality Control 

• On-target / Off-target Analysis 

• Coverage Analysis 

• GC-Content, Allelic Balance, etc. 

• Data Analysis 

• SNP, Short Indel and SV Calling 

• Relating the Variant Calls to Public Databases

Individual Baits vs. Target Regions

On-target / Off-target Analysis 

• How many reads 

are on-target?

GA Lane, 50MB Kit 

• 37 million reads, 32 million mapped (86%) 

• 105bp reads

GA Lane, 50MB Kit

GA Lane, 50MB Kit

HiSeq Lane, 50MB Kit 

• 185 million reads, 160 million mapped (86%) 

• 105bp paired-end reads

HiSeq Lane, 50MB Kit

HiSeq Lane, 50MB Kit

Independent Comparison to RefSeq

Independent Comparison to RefSeq

1000 Genomes Project

Allelic Balance 

chr1, whole genome seq.

Allelic Balance 

chr1, exon capture

Allelic Balance for InDels 1bp-10bp 

(Whole Genome Seq.)


(Target Enrichment)


(Target Enrichment) 

Caveats: 

(i) Short indels tend to occur around tandem repeats 

(ii) Alignment is much harder in these regions 

(iii) High false discovery rate of all InDel Callers (10% - 40% FDR)

Copy Number Variants

Copy Number Variants 

3 3 1 8 7 

1 Copy 1 Copy 0 Copy 2 Copy 2 Copy

Log2 ratio 

Copy Number Variants 

• Down syndrome 

– Partial trisomy 21 

log 2 

# 

# 

chr21 



Disease 

Normal
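A hedged sketch of how a per-window log2 ratio between a disease and a normal sample could be computed from read counts. The library-size normalization, the `pseudocount` and the function name `log2_ratios` are assumptions for illustration; they are not taken from the slides. Both samples are assumed to have at least one read in total.

```python
import math

def log2_ratios(case_counts, control_counts, pseudocount=0.5):
    """log2(case / control) per window after scaling by total read count."""
    case_total = sum(case_counts)
    control_total = sum(control_counts)
    ratios = []
    for case, control in zip(case_counts, control_counts):
        # The pseudocount avoids log(0) and division by zero in empty windows.
        case_frac = (case + pseudocount) / case_total
        control_frac = (control + pseudocount) / control_total
        ratios.append(math.log2(case_frac / control_frac))
    return ratios

control = [100] * 10
case = [100] * 8 + [150] * 2   # last two windows carry an extra copy
ratios = log2_ratios(case, control)
```

A gain from two to three copies corresponds to a ratio of 3/2, i.e. a log2 ratio of about 0.58 relative to the unaffected windows; the normalization by total read count shifts all windows by the same constant.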

• Sample1 

• Sample2 

• Sample3 

• Sample4 

Large Genomic Rearrangements 

Read-depth approach 

Source: Thomas Zichner

Case Study 

HapMap Trio

HapMap Trio

HapMap Trio 

• NA12878 and NA12891 were sequenced

Individual Baits vs. Target Regions 

• All bait and target region coordinates are hg18 

• Total length of target regions: 37806033 (≈38MB) 

• Total length of bait sequences: 38235516 (≈38MB) 

• Approximately 1% of the human genome

Exon Length Distribution 

• All baits are 

of length 

120bp

Target Region Length Distribution 

• All baits are 

of length 

120bp

NA12878 NA12891 

• Mapped SOLiD reads 

Mapped Reads 

• NA12878: 133,915,955 mapped reads 

• NA12891: 108,092,260 mapped reads

On-target / Off-target Analysis

Avg. Coverage for each Target 

Illumina SOLiD 

• SOLiD distribution is shifted to the right due to 

higher sequencing coverage

Avg. Coverage for each Target 

• 3490 Targets 

without any 

mapped base 

Overlap: 825 

• 1280 Targets 

without any 

mapped base

GC-Content Distribution



Histogram of GC-Content of 

Unmapped Targets - SOLiD

Histogram of GC-Content of 

Unmapped Targets - Illumina

53% 

On-target Ratio 

79%

59% 

On-target Ratio 

84%

Insert Size is very important!

Insert Size is very important!

Uniform Coverage across Targets? 

• Subdivide each target into 10 non-overlapping 

windows 

• Calculate coverage for each window 

• Get the fractional coverage for each window 

compared to the total coverage of the target 

• Add up all fractions for the same window 

across all targets
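The four steps above can be sketched as follows, assuming the per-base coverage of each target is already available as a list of integers. The function name `window_profile` and the choice to skip targets that have no coverage or are shorter than the number of windows are assumptions of this sketch.

```python
def window_profile(targets_coverage, n_windows=10):
    """Sum, per window index, the fraction of each target's coverage in that window."""
    profile = [0.0] * n_windows
    for cov in targets_coverage:
        total = sum(cov)
        if total == 0 or len(cov) < n_windows:
            continue  # nothing to distribute, or target too short to subdivide
        for w in range(n_windows):
            start = w * len(cov) // n_windows
            end = (w + 1) * len(cov) // n_windows
            profile[w] += sum(cov[start:end]) / total
    return profile
```

With perfectly uniform coverage every window receives one tenth of each target, so the profile is flat; lower values in the first and last windows would indicate coverage dropping towards target boundaries.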


NA12878 - Illumina


NA12891 - Illumina

50 random targets on chrX

• Illumina data 

Fraction of Reads on chrX 

• NA12878: 0.02475 

• NA12891: 0.01329 

• SOLiD data 

• NA12878: 0.02995 

• NA12891: 0.01162

To be or not to be 

… the same sample

Sample Swaps are very 

hard to detect

Creating a SNP Profile 

EnsemblDB 

API Calls 

SNP 

Positions 

Consensus 

calls 

Consensus 

letter at SNP 

locations 

rs11510117 A 

rs56311179 Y 

rs4579752 R 

rs11510118 G 

… 

Sample-specific SNP Profile 

Aligned reads

Comparing SNP Profiles among Samples 

[Figure: for each sample, Ensembl DB API calls provide the SNP positions and consensus calls on the aligned reads provide the consensus letter at each SNP location; the resulting sample-specific SNP profiles (e.g., rs11510117, rs56311179, rs4579752, rs11510118, ENSSNP11715197) are compared among samples]

Read-depth Plot Comparisons 

• Are characteristic amplifications/deletions recurring?

Read-depth Plot Comparisons 

• Are characteristic amplifications/deletions recurring?

General Hints 

• Blood/Tumor sample pairs should have the 

same sex (chrX/Y read count) 

• Germline samples should show less 

aberrations than tumor samples 

• SNP profiles can possibly be limited to SNPs 

that are known to be very polymorphic 

Thomas Zichner

Multiplexing

Illumina HiSeq 2000 

DNA sequencer 

Exponential Data Increase 

Source: Science Special Online Collection: Dealing with Data. 

Genome sequencing … Transcriptome sequencing … Protein-DNA binding (ChIP-Seq) 

Epigenetic analyses … Exome sequencing … Structural variation detection

Number of Single-End Reads per Lane 

[Figure: millions of reads PF per lane (axis 0 to 140), plotted by month from mid-2008 to mid-2011, annotated with the GA II, GA IIx and HiSeq2000 instruments and the software and kit releases in use]

Data Processing and Alignment: 

Custom Pipeline Script 

Lane 1 

Lane 2 

Lane 3 

Lane 8 

• Illumina Pipeline lacks support 

for EMBL PBS cluster (no 

distributed make) 

• HiSeq Flowcell requires a 

distributed alignment to be 

finished in 2-3 days 

• Splitting by lane using our 

pipeline script 

• Each lane on one node 

with 8 CPUs 

• 64 CPUs handle alignment

Multiplexing further complicates 

automated alignments 

Lane 1 

Lane 2 

Lane 3 

Lane 8 

Barcode 1 

Barcode 2 

Barcode n 

• With increased data volumes there is a high demand for multiplexing 

• Current Status of demultiplexing efforts: 

– Split by barcode 

– Allow barcode mismatches (enumerating a barcode’s neighborhood) 

– Trim-off barcodes 

– Providing sets of balanced barcodes (important for less than 10 barcodes per lane) 
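A minimal sketch of barcode splitting with mismatch tolerance by enumerating each barcode's one-mismatch neighborhood, followed by trimming. It assumes the barcode sits at the start of the read; the function names and the inclusion of N in the alphabet are illustrative assumptions, not a description of the actual pipeline.

```python
def neighborhood(barcode, alphabet="ACGTN"):
    """Yield the barcode itself and every sequence at Hamming distance 1."""
    yield barcode
    for i, base in enumerate(barcode):
        for sub in alphabet:
            if sub != base:
                yield barcode[:i] + sub + barcode[i + 1:]

def build_lookup(barcodes):
    """Map each neighborhood sequence to its barcode, dropping ambiguous ones."""
    lookup, ambiguous = {}, set()
    for bc in barcodes:
        for variant in neighborhood(bc):
            if variant in lookup and lookup[variant] != bc:
                ambiguous.add(variant)
            lookup[variant] = bc
    for variant in ambiguous:
        del lookup[variant]
    return lookup

def demultiplex(read, lookup, barcode_len):
    """Return (barcode, trimmed read), or (None, read) if unassigned."""
    barcode = lookup.get(read[:barcode_len])
    if barcode is None:
        return None, read
    return barcode, read[barcode_len:]
```

Precomputing the neighborhood turns each read assignment into a single dictionary lookup, which matters at tens of millions of reads per lane.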

hg19 

dm3

Computed balanced barcode distributions for 

various sample sizes

Barcode Performance

• Balanced base composition 

• Pairwise Hamming distance >= 3 

• Non-extreme GC content 

Barcode Set Criteria
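Two of these criteria are easy to check programmatically. The sketch below tests the minimum pairwise Hamming distance and the GC content of each barcode; the GC bounds of 25% to 75% are an assumed placeholder for "non-extreme", and the balanced base composition per sequencing cycle is not covered here.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def is_valid_barcode_set(barcodes, min_dist=3, gc_range=(0.25, 0.75)):
    """Check the pairwise distance and GC criteria for a set of barcodes."""
    distances_ok = all(hamming(a, b) >= min_dist
                       for a, b in combinations(barcodes, 2))
    gc_ok = all(gc_range[0] <= gc_content(bc) <= gc_range[1] for bc in barcodes)
    return distances_ok and gc_ok
```

A minimum distance of 3 means a single sequencing error can always be corrected unambiguously, because a barcode with one error is still at least two mismatches away from every other barcode.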

• Work-in-progress 

Work-in-progress 

Lane 1 

Lane 2 

Lane 3 

Lane 8 

Barcode 1 

Barcode 2 

Barcode n 

– Everything is split and aligned, with barcode-specific statistics 

– This implies parallelization at the barcode level (for both Illumina-barcoded and custom-barcoded samples) 

– Unified handling of custom barcoding and Illumina barcoding 

hg19 

dm3 

– Requires significant changes to the Genecore database, the Illumina CASAVA pipeline, tracking sheets for the current projects, …


SNPs Short Indels Structural Variants


SNPs Short Indels Structural Variants 

(i) ELANDv2e Dindel 

(ii) ELANDv2e Mpileup

Short InDels 

• Index-based read mappers are notoriously bad 

for gapped alignments

Short InDels 

• What is obvious to you when looking at the multiple alignment is not obvious to the read mapper, because it has only a local view 

– One read against the reference at a time 

• Hence, almost all short InDel callers start with 

local realignment 

– Time consuming (depending on the number of 

realignment windows)

Short InDels - Tools 

• Open-source tools 

– Dindel 

– Pindel 

– MoDIL 

– GATK 

• Commercial packages 

– Maybe CLC Bio and others 

• Indel calling has a higher false positive rate 

than SNP calling

Short InDels 

Dindel, Albers et al.

Raw Short InDel Calls 

Sample #Total #Known_InDels(1kGP) #Novel_InDels #Coding_InDels #Coding_Novel_InDels 

Sample1 590642 142354 448288 386 143 

Sample2 584975 135403 449572 397 157 

Sample3 578649 137349 441300 380 138 

Sample4 566162 134539 431623 377 141 

Sample5 550717 133800 416917 355 127 

Sample6 570955 138643 432312 370 131 

Far fewer indel calls than SNP calls. However, known indels are scarce in public databases, so cross-comparisons are still very helpful.

Realignment

Realignment 

Reference: ..GACTG--TACT.. 

Read 1: GACTGACTACT 

Read 2: GACTGC-TACT 



Read 5: GACTGACTACT 


GACTG--TACT 

GACTGACTACT 

GACTG-CTACT 

GACTG-CTACT 

GACTG-CTACT 

GACTGACTACT 

GACTG-CTACT

Realignment 

• Improves a crude, initial alignment

Realignment 

• Improves a crude, initial alignment 

• Usually employs some scheme of multiple 

alignment although a full k-dimensional DP is 

prohibitive 

O(n^k)

Realignment 

• Improves a crude, initial alignment 

• Usually employs some scheme of multiple 

alignment although a full k-dimensional DP is 

prohibitive 

• To save time, carried out only on “suspicious” 

regions

ReAligner: Anson and Myers 

• Objective: Minimize the consensus score for Σ = {a,c,g,t}:

ReAlignment Algorithm

Derive a consensus

Select a read for realignment

Cut it out of the consensus

Cut it out of the consensus 

Bandwidth

ReAlignment 

• Using a weighted score 

– Consensus score 

– Fractional score

ReAlignment 



– Fractional score 

• Iterate for all other reads

ReAlignment 



– Fractional score 

• Iterate for all other reads 

• Redo this realignment loop until the score no longer decreases

Two Scoring Functions 

• Consensus Score 

• Fractional Score 

• Final Score

Consensus Score

Fractional Score

Assembly Algorithms 

Tobias Rausch 

July 2011

De novo vs. Reference-based Assembly 

• De novo assemblers 

– Classical assemblers 

• Overlap - Layout - Consensus assembler 

– Short-read assemblers 

• De Bruijn graph assembler 

• Reference-based assembler

Assembly Types 

• Whole genome assembly 

• Transcriptome assembly 

• Metagenome assembly

History 

• 1995: Bacterium H. influenzae, 1.8Mb genome 

• 2000: Drosophila, 130Mb genome 

• 2001: Human genome, 3Gb genome 

• Currently >6000 sequenced genomes

Whole Genome Sequencing Costs

Overlap - Layout – Consensus Assembler 

Overlap 

Layout 

Consensus

Assembly 

• Input: Set of paired-end, mate-pair and/or long 

reads 

• Output: Set of assembled contigs

Overlap-Layout-Consensus Assembly 

• Overlap phase 

– For each pair of reads get a potential overlap 

– The number of possible relations doubles when we also consider the reverse-complemented read

Dynamic programming

Overlap-Layout-Consensus Assembly 

Human genome assembly: 

About 27 million reads 

Number of required comparisons:
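The figure from the original slide is not preserved here. As a generic illustration only, the number of unordered read pairs among n reads is n(n-1)/2, which the sketch below evaluates for about 27 million reads; the function name and the `both_strands` flag are assumptions of this example.

```python
def pairwise_comparisons(n_reads, both_strands=False):
    """Number of unordered read pairs, optionally doubled for the reverse strand."""
    pairs = n_reads * (n_reads - 1) // 2
    return 2 * pairs if both_strands else pairs

pairs = pairwise_comparisons(27_000_000)
```

That is roughly 3.6 x 10^14 pairs, which is why a full dynamic programming alignment of every pair is out of reach and a seed-and-extend filter is used instead.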

Seed and Extend Approach

• Potential overlaps 

• Are stored in a graph 

Layout Phase

Layout Phase 

• A simple heuristic: Spanning tree 

• In fact, it will be a spanning forest due to repeats 

or low-coverage regions

• Layout the reads 

• Result: A set of contigs 

Layout Phase

Consensus Phase 

• Scaffold the contigs with the help of mate-pairs 

• Result: A set of scaffolds

• Overlap 

• Layout 

Summary: Classical Assembly 

• Consensus

Celera Assembler: Some Statistics

Short Read Assembly 

• Classical assemblers are inappropriate for short 

read assembly because 

1. Reads are too short to compute a reliable pairwise 

overlap 

2. Reads are too numerous to compute all pairwise 

overlaps efficiently 

• Open question: How long do short reads need 

to be to apply the classical assemblers?

de Bruijn Graph 

• Reads are too short to compute a reliable 

pairwise overlap

de Bruijn Graph Construction 

• Dk = (V,E) 

– V = All length-k subfragments 

– E = Directed edges between consecutive 

subfragments 

• Nodes overlap by k-1 characters
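The definition above translates into a short construction, sketched here with a dictionary of successor sets. This is a simplified illustration: it ignores reverse complements, k-mer multiplicities and error correction, all of which real short-read assemblers handle, and the function name is hypothetical.

```python
def de_bruijn_graph(reads, k):
    """Nodes are the length-k substrings; edges join consecutive k-mers of a read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # register the node
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)         # left[1:] == right[:-1]
    return graph

g = de_bruijn_graph(["ACGTACG"], 3)
```

Because reads are only ever cut into k-mers and never compared with each other, the construction is linear in the total read length, which avoids the pairwise overlap computation.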


- Local Sequence Similarity -


• Assembly quality greatly depends on the chosen 

k-mer size 

• Assembly quality measure: N50 

– Sort contigs by length 

– Add the contig’s lengths until the summed length 

exceeds 50% of the total length of all contigs 

– Example: 

• 100+80+70+50=300 

• N50=80 because 100+80 > 300/2
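The N50 computation can be written directly from this description. The sketch below uses the common convention that the running sum must reach at least half of the total length (the slide says "exceeds"; the two only differ when the sum lands exactly on 50%).

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum reaches half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```

Applied to the example above, `n50([100, 80, 70, 50])` returns 80.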

Choice of k 

• “It’s a trade-off between specificity and 

sensitivity. Longer k-mers bring you more 

specificity (i.e. less spurious overlaps) but 

lower coverage … so there’s a sweet spot to be 

found with time and experience.” (D. Zerbino, 

Velvet)

Obvious Hints 

• k must be smaller than the read length 

• Usually an odd number to avoid palindromes 

(e.g. Otto or “Ein Esel lese nie”) 

• Sometimes

K-mer Uniqueness Ratio

Whole Genome Assembly 

• Short-read assemblers 

– EULER-SR 

– Velvet 

– ABySS 

– SOAPdenovo 

– ALLPATHS-LG 

– Newbler (454 Data) 

• Overlap-layout-consensus assemblers 

– Celera assembler 

– Arachne 

– PCAP 

– Sanger String Graph Assembler

• Oases 

Transcriptome Assembly 

– http://www.ebi.ac.uk/~zerbino/oases/ 

• Trinity 

– http://trinityrnaseq.sourceforge.net/ 

• Rnnotator

Metagenomics Assembly 

• Highly heterogeneous data 

– Unequal representation of member species 

– Clusters of closely related strains 

– Very difficult to disentangle into one assembly for 

one member species

• MIRA3 

Reference-based Assembly 

– www.chevreux.org/mira_downloads.html 

• AMOScmp 

– amos.sourceforge.net/docs/pipeline/AMOScmp.html

Thanks!
