next-generation sequencing & analysis - Chagall

Next-gen. sequencing and analysis

Paul Zumbo, Laboratory of Christopher E. Mason, Ph.D. (04/18/2013) 

Credit: www.biocomicals.com, Alper Uzun, PhD.

Outline 

I. Next-generation sequencing technologies

II. Alignment

III. Brief overview of a few applications

   I. Whole genome sequencing

   II. Chromatin immunoprecipitation sequencing

   III. Methylation sequencing

   IV. Transcriptome sequencing

V. Closing

Two major (interconnected) themes 

1) Bioinformatics is not divorced from molecular biology: an understanding of molecular biology is a necessary condition for being able to effectively perform bioinformatics.

2) Methodological relativism – there is a dependency of the result on the biochemical or bioinformatic method (or both) employed in obtaining the result.

[Diagram: biochemical/bioinformatic manipulation → output]

Sequencing is still new

Chain termination 

Deoxynucleotides (dATP)

Dideoxynucleotide (ddATP)

Wikipedia

Sanger Sequencing

Sanger sequencing – simple example 

Template: AGCT 

RXN tube A: A 

RXN tube G: AG 

RXN tube C: AGC 

RXN tube T: AGCT 

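[Figure: gel with one lane per reaction tube (A, G, C, T); fragments separate by size, so the bands read A, G, C, T from shortest to longest]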
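A minimal sketch of the simple example above, assuming (as the slide does) that each tube's fragments are written as prefixes of the sequence being read. The function names `sanger_fragments` and `read_gel` are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def sanger_fragments(strand: str) -> Dict[str, List[str]]:
    """Group every chain-terminated fragment by the base it ends on.

    Each reaction tube holds one dideoxynucleotide, so a tube collects
    exactly the fragments that stop at that base.
    """
    tubes: Dict[str, List[str]] = {base: [] for base in "AGCT"}
    for end in range(1, len(strand) + 1):
        fragment = strand[:end]
        tubes[fragment[-1]].append(fragment)
    return tubes


def read_gel(tubes: Dict[str, List[str]]) -> str:
    """Read the sequence by ordering all fragments from shortest to longest."""
    fragments = sorted(
        (frag for frags in tubes.values() for frag in frags), key=len
    )
    return "".join(fragment[-1] for fragment in fragments)


tubes = sanger_fragments("AGCT")
sequence = read_gel(tubes)
```

Sorting by fragment length plays the role of the gel: the terminal base of each successively longer fragment spells out the sequence.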

Metzker, Nat Rev Genetics, 11(1):31-46 (2010)

Template Immobilization Strategies

Sequencing by ligation (ABI SOLiD)

Di-base encoding


Sequencing by synthesis


Reversible chain-terminators

Ion semiconductor sequencing 

(Ion torrent) 

Lindsay, J Phys Condens Matter, 24(16):164201 (2012)


Pyrosequencing (454)

Single-molecule real-time (SMRT) sequencing

Pacific Biosciences

Metzker, Nat Rev Genetics, 11(1):31-46 (2010)

Glenn, Molecular Ecology Resources, 11(5):759-769 (2011)

Throughput and cost

Paired-end sequencing

[Figure: paired-end workflow on the flowcell: cluster amplification; linearize DNA (1st cut); sequence 1st strand (Read 1); strand re-synthesis; linearize DNA (2nd cut); sequence 2nd strand (Read 2)]

© Illumina

Paired-end sequencing is good for…

• Assembly

• Detecting gene fusions

• Characterizing novel splice isoforms

Barcoding

© Illumina

Auer et al., Genetics, 185(2):405-16 (2010)

Barcoding balances technical effects

Different chemistries yield different error profiles

Substitution errors are platform dependent

Wang et al., BMC Systems Biology, 6(Suppl 3):S21 (2012)

(Pre-)Phasing a source of errors

Nakamura et al., Nucleic Acids Research, 39(13):e90 (2011)

Glenn, Molecular Ecology Resources, 11(5):759-769 (2011)

Sequencing platforms

CLC Bio, Annual Survey (2012) 

Illumina dominates the market

CLC Bio, Annual Survey (2012) 

Primary application focus is WGS

NGS mostly applied toward basic research 

CLC Bio, Annual Survey (2012)

Aligners and alignment 


Image reproduced by permission of Dr. Anthony Fejes

FASTQ format 

@SEQ_ID 

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC 

+ 

!''*((((***+))%%%++)(%%%%).1***-+*''))**
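A minimal sketch of reading the four-line FASTQ record shown above (identifier, sequence, separator, quality string). The names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes unwrapped records, i.e. exactly four lines each.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**\n"
)
records = list(read_fastq(io.StringIO(EXAMPLE)))
```

The length check matters because every base must have exactly one quality character.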

Illumina sequence identifiers (CASAVA 1.8.2) 

@PC140529:266:C1THYACXX:6:2213:9313:32696 1:N:0:TAATGCGCGGCTCTGA 

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>

Illumina CASAVA 1.8.2 User Guide

Element            Description
PC140529           Instrument name
266                Run # on instrument
C1THYACXX          Flowcell ID
6                  Lane #
2213               Tile #
9313               X coordinate of cluster
32696              Y coordinate of cluster
1                  Read number
N                  Y if the read is filtered, N otherwise
0                  0 unless read identified as control
TAATGCGCGGCTCTGA   Barcode sequence
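A minimal sketch of splitting a CASAVA 1.8-style identifier into the fields listed above. The names `IlluminaId` and `parse_illumina_id` are hypothetical; the sketch assumes the two colon-separated groups are divided by a single space, as in the example identifier.

```python
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read_number: int
    is_filtered: bool
    control_number: int
    barcode: str


def parse_illumina_id(line: str) -> IlluminaId:
    """Parse '@instrument:run:flowcell:lane:tile:x:y read:filtered:control:barcode'."""
    location, flags = line.strip().lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, barcode = flags.split(":")
    return IlluminaId(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), barcode,
    )


parsed = parse_illumina_id(
    "@PC140529:266:C1THYACXX:6:2213:9313:32696 1:N:0:TAATGCGCGGCTCTGA"
)
```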

Phred quality scores are linked to error probabilities

Phred quality score   Probability that the base is called wrong   Accuracy of the base call
10                    1 in 10                                     90%
20                    1 in 100                                    99%
30                    1 in 1,000                                  99.9%
40                    1 in 10,000                                 99.99%
50                    1 in 100,000                                99.999%
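The relationship behind these values is Q = -10 · log10(P), where P is the probability that the base call is wrong. A minimal sketch follows; the function names are hypothetical, and the ASCII offset of 33 is an assumption (Sanger-style encoding) that does not hold for every platform or pipeline version.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Probability that a base with Phred score q is called wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores.

    offset=33 is an assumption; check which encoding your data uses.
    """
    return [ord(char) - offset for char in quality]


scores = decode_quality("!''*((((***+))%%%++)(%%%%).1***-+*''))**")
error_probs = [phred_to_error_prob(q) for q in scores]
```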

Quality scores useful for…

• Assessment of sequence quality

• Recognition and removal of low-quality sequence

• Determination of accurate consensus sequence

Wikipedia 

Quality scores vary by platform

Two main types of alignment algorithms

1. Algorithms based on hash tables

2. Algorithms based on suffix/prefix tries

Hash table

[Figure: a hash function maps each key (Key 1, Key 2, Key 3) to a slot holding its value]

Bananas 

ananas 

nanas 

anas 

nas 

as 

s 

Suffix trie
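A minimal sketch of the idea: insert every suffix of the text into a trie, after which any substring can be found by walking down from the root. Nested dictionaries stand in for trie nodes; `build_suffix_trie` and `contains` are hypothetical names. This naive construction uses memory quadratic in the text length, which is why real aligners use compressed indexes instead.

```python
from typing import Dict, List

TrieNode = Dict[str, "TrieNode"]


def suffixes(text: str) -> List[str]:
    """All suffixes of text, longest first."""
    return [text[start:] for start in range(len(text))]


def build_suffix_trie(text: str) -> TrieNode:
    """Insert every suffix of text into a trie of nested dicts."""
    root: TrieNode = {}
    for suffix in suffixes(text):
        node = root
        for char in suffix:
            node = node.setdefault(char, {})
    return root


def contains(trie: TrieNode, pattern: str) -> bool:
    """True if pattern occurs anywhere in the indexed text."""
    node = trie
    for char in pattern:
        if char not in node:
            return False
        node = node[char]
    return True


trie = build_suffix_trie("bananas")
```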

Spaced-seed

• Cut each position in the reference into equal-sized pieces called “seeds”

• Store pairs of spaced seeds in a lookup table (index)

• Look up spaced seeds for each read

• For each match, confirm remaining positions

• Report alignment to user
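A simplified sketch of the seed-and-extend idea behind hash-table aligners. As a stated simplification, it indexes contiguous seeds rather than pairs of spaced seeds, and it verifies candidates by counting mismatches only (no indels). All names and the `max_mismatches` parameter are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List


def build_seed_index(reference: str, seed_len: int) -> DefaultDict[str, List[int]]:
    """Lookup table from every seed_len-mer in the reference to its positions."""
    index: DefaultDict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def align_read(read: str, reference: str, index, seed_len: int,
               max_mismatches: int = 1) -> List[int]:
    """Return reference start positions where the read matches."""
    hits = set()
    # Look up each non-overlapping seed of the read in the index.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Confirm the remaining positions of the read.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_seed_index(reference, seed_len=4)
positions = align_read("TAGCCGAA", reference, index, seed_len=4)
```

Because a read with one mismatch still has at least one intact seed, the lookup finds the candidate position and the verification step tolerates the mismatch.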

Burrows-Wheeler transform

• Initially developed for data compression

• Store entire BW-transformed reference in memory
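A naive sketch of the transform and its inverse, for illustration only: it sorts all rotations explicitly, which is far too slow and memory-hungry for a genome. Production aligners build the transform from a suffix array and search it with an FM-index. The `$` terminator is an assumption of this sketch and must not occur in the text.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform tends to group identical characters into runs, which is what made it attractive for compression in the first place.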

Ambiguous reads 

read 

reference

Sequence Alignment/Map (SAM) format 

The SAM Format Specification (v1.4-r985)

0-based vs. 1-based coordinate systems

System    First base   Interval type   Notation   Meaning                         Examples
0-based   0            half-open       [a,b)      coordinate does not include b   BAM, BED
1-based   1            closed          [a,b]      coordinate includes b           SAM, GTF/GFF, wiggle, VCF
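A minimal sketch of converting between the two conventions. Only the start coordinate shifts; the end stays the same because the exclusive end of a half-open interval equals the inclusive end of the corresponding closed interval. The function names are hypothetical.

```python
from typing import Tuple


def zero_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end


def one_to_zero_based(start: int, end: int) -> Tuple[int, int]:
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def length_half_open(start: int, end: int) -> int:
    """Number of bases in a 0-based half-open interval."""
    return end - start


def length_closed(start: int, end: int) -> int:
    """Number of bases in a 1-based closed interval."""
    return end - start + 1


# The first ten bases of a chromosome in both systems.
bed_interval = (0, 10)
gff_interval = zero_to_one_based(*bed_interval)
```

Mixing the two systems without converting is a common source of off-by-one errors when moving between BED and GTF/GFF files.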

70 short-read aligners

• BWA
• BarraCUDA
• Bfast
• BioScope
• Bowtie/Bowtie2
• CLC bio
• CloudBurst
• cross_match
• Eland/ELAND2
• GEM
• GMAP/GSNAP
• GenomeMapper
• GensearchNGS
• GnuMap
• Karma
• LAST
• MAQ
• MOM
• Mosaik
• MrFast/MrsFast
• NovoAlign
• PASS
• PerM
• RMAP
• RazerS
• SHRiMP
• SHrec
• SOAP/SOAP2
• SSAHA2
• STAR
• SToRM
• Segemehl
• SeqMap
• Slider/SliderII
• Srprism
• Stampy
• Vmatch
• ZOOM
• rNA
• subread

Direct comparison of aligners difficult

• Not all aligners have the same options

• Comparisons are often on the basis of default parameters

• Synthetic reads often used, which do not possess the same error profile as ‘real’ reads

• Benchmarking common

Koboldt, MassGenomics 

10 aligners compared

Simulated 2 million C. elegans read pairs

• Used MAQ read simulator

• From a ‘mutated’ version of the C. elegans genome that contained ~90,000 SNPs and ~10,000 indels

• Seq IDs contained ‘real’ read location:

>chrI_7433167_7433425_1/1

TCGTATTATAACGACGAATACGCGTCAAGTGGGAGT

Koboldt, MassGenomics

Koboldt, MassGenomics 

CPU time

cross_match & NovoAlign PE place the highest number of reads correctly

SNPs have a greater influence than indels on accuracy

Repeated with real data from the 1000 Genomes Project

Read Placement Results for ~2 Million Reads


Top Performing Aligners

Alignment Need         Recommendation
Proven Functionality   Maq
Speed                  Bowtie, SOAP
Sensitivity            Novoalign, cross_match
Usability              CLC bio, SeqMap
Flexibility            BFAST
Variant Detection      SOAP, Novoalign


Mark Watson, Roslin Institute

The ‘Watson square’

Bioo Scientific Corp.

Whole genome sequencing

Fragmentation method biases coverage

Read depth as a function of 

fragmentation method

Commonly Used Framework for Calling Mutations – the GATK

Genotype Calling - GATK

Genotype calling - Illumina

• CASAVA (ELANDv2, multi-seeded aligner):

• SNVs

• Base calls are ignored where more than 2 mismatches to the reference sequence occur within 20 bases of the call. Note that this filter treats each insertion or deletion as a single mismatch.

• If the call occurs within the first or last 20 bases of a read then the mismatch limit is applied to the 41 base window at the corresponding end of the read.

• The mismatch limit is applied to the entire read when the read length is 41 or shorter.

• Indels: callSmallVariants module (local realignment)

• GROUPER (large indels, CNVs)

• Translocations/Duplications:

• ClusterMerger (chimeric read pair filter)

• ReadBroker (builds evidence by merging clusters)

• AlignContig (contig re-alignment for re-arrangements)

GATK vs. Illumina

SNP calling – best practices

1) Sufficient coverage (20x)

2) Reads from forward and reverse strands

3) Variable start sites

4) ? High Q-scores ?

AGCTTTCGTACGATACCCATGACTATACTA

ChIP sequencing

Motif finding in tag enriched 

regions

Two fundamental types of peaks 

Kidder et al., Nature Immunology, 12:918-922 (2011)

ChIP-seq peak calling programs

Wilbanks et al., PLoS One, 5(7):e11471 (2010)

Different number of peaks identified

Wilbanks et al., PLoS One, 5(7):e11471 (2010)

Wilbanks et al., PLoS One, 5(7):e11471 (2010)

Sensitivity assessment

Lister et al., Genome Research, 19:959-966 (2009)

Methylation sequencing

Bisulfite conversion

Conversion yields up to four potentially different DNA fragments

Krueger et al., Nature Methods, 9:145-151 (2012)
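A minimal in-silico sketch of why conversion can yield four different fragments. Unmethylated cytosines read as T after bisulfite treatment and PCR, the two genomic strands are converted independently, and each converted strand gets its own PCR complement. The strand labels (OT, CTOT, OB, CTOB) follow the naming used in the bisulfite mapping literature; the function names and the way methylated positions are passed in are assumptions of this sketch.

```python
from typing import Dict, FrozenSet

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def bisulfite_convert(seq: str, methylated: FrozenSet[int] = frozenset()) -> str:
    """Convert every unmethylated C to T; methylated positions are protected."""
    return "".join(
        "T" if base == "C" and pos not in methylated else base
        for pos, base in enumerate(seq)
    )


def four_strands(top: str,
                 methylated_top: FrozenSet[int] = frozenset(),
                 methylated_bottom: FrozenSet[int] = frozenset()) -> Dict[str, str]:
    """Fragments expected after conversion and PCR, all written 5'->3'.

    Positions in methylated_bottom index into the bottom strand (5'->3').
    """
    bottom = reverse_complement(top)
    ot = bisulfite_convert(top, methylated_top)        # original top
    ob = bisulfite_convert(bottom, methylated_bottom)  # original bottom
    return {
        "OT": ot,
        "CTOT": reverse_complement(ot),  # complementary to original top
        "OB": ob,
        "CTOB": reverse_complement(ob),  # complementary to original bottom
    }


unmethylated = four_strands("ACGTCG")
partly_methylated = four_strands("ACGTCG", methylated_top=frozenset({1}))
```

After conversion the two genomic strands are no longer complementary to each other, which is why a bisulfite-aware mapper has to consider all four possibilities.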

Krueger & Andrews, Bioinformatics, 27(11):1571-1572 (2011)

Bismark for mapping

Akalin et al., Genome Biology, 13:R87 (2012) 

methylKit for the analysis

Transcriptome sequencing

a.k.a. RNA-seq

Image reproduced by permission of Dr. Anthony Fejes

Definitions

• RNA = class of nucleic acids characterized by the sugar ribose and the pyrimidine uracil; involved in protein synthesis and in the transmission of genetic information

• Transcriptome = complete set of RNA molecules produced in one or a population of cells, and their quantity, for a specific developmental stage or physiological condition

• RNA-seq = sequencing of the transcriptome

Technique workflow

RNA isolation

• phenol-chloroform extraction

• silica membrane

RNA fractionation (optional)

• poly(A)+

• ribo-minus

• DSN

Library preparation

• fragmentation

• directional

Sequencing

• Illumina

• SOLiD

• Helicos

Analysis

• reference-based

• assembly

Most typical RNA-seq

1. fragmentation of RNA

2. random priming to make sscDNA (first-strand synthesis)

3. construction of dscDNA (second-strand synthesis)

4. size selection

5. sequencing

6. mapping

[Figure: RNA molecules → RNA fragments → sscDNA → dscDNA → gel cutout (short to long) → paired-end reads mapped to the RNA sequence in sense and anti-sense orientation]

Roberts et al., Genome Biology, 12(3):R22 (2011)

Major classes of RNA

Type (% in cell)      Example     Function                                                  Size (nt)
Transfer RNA (15%)    tRNA ala    Transferring of alanine during protein synthesis.         73
                      tRNA leu    Transferring of leucine during protein synthesis.         83
Ribosomal RNA (80%)   5S rRNA     Components of a ribosome in eukaryotes                    121
                      5.8S rRNA                                                             156
                      18S rRNA                                                              1869
                      28S rRNA                                                              5070
Messenger RNA (5%)    GAPDH       glyceraldehyde-3-phosphate dehydrogenase, var. 1, mRNA    1401
                      ACTB        beta actin, mRNA                                          1852

Lowe et al., Nucleic Acids Research 25(5), 955-964 (1997)

NCBI Reference Sequences: NR_023379.1, NR_003285.2, NR_003286.2, NR_003287.2

NCBI RefSeqGene IDs: 2597 (NM_002046.4), 60 (NM_001101.3)

Lodish et al., Molecular Cell Biology. 4th ed. New York: W. H. Freeman (2000)

Composition of RNA depends upon method of preparation

In view of the apparent dependence of the composition of ribonucleic acids upon the methods of preparation as well as the certain considerations appear necessary for a proper assessment of physical and chemical descriptions of ribonucleic acid preparations: (a) The preparative procedure and source, (b) a statement of the physical-chemical homogeneity of the ribonucleic acid, (c) a description of the degradation of the ribonucleic acid by ribonuclease, and (d) a statement of the mononucleotide composition of the ribonucleic acid. It is with reference to these considerations that mammalian ribonucleic acids prepared by the guanidine salt procedure hereinafter described have been studied.

Volkin & Carter, J. of American Chemical Society 73(4), 1516-1519 (1951)

Different extraction techniques

GT:phenol:chloroform extraction

Silica membrane

Silica membrane enriches for RNA molecules > 200 nt

Mraz et al., Biochem Biophys Res Commun., 390(1): 1-4 (2009)

Haimov-Kochman et al., Clin Chem., 52(1):159-60 (2006)

“RNA was extracted from the cell lines using standard phenol:chloroform 

extraction followed by ethanol precipitation” 

Kedzierski & Porter, BioTechniques 10(2), 210-214 (1991)

Zeugin JA & Hartley JL, Focus 7(4), 1-4 (1985)

Ethanol Precipitation

Schroeder et al., BMC Mol Biol. 7(3) (2006) 

RNA integrity

Genebody coverage across degraded poly(A)+ samples

[Figure: coverage (%) along the gene body, 5' to 3', for samples RIN0, RIN3, RIN6 and RIN9]

Removal of ribosomal RNA 

Indirect 

Direct 

Credit: Invitrogen

Gene-expression comparison of protein-coding genes

Cui et al., Genomics 96(5), 259-265 (2010)

Abundance of non-coding transcripts

Indirect

Direct

Cui et al., Genomics 96(5), 259-265 (2010)

Capturing stem-loop mRNA

[Figure: log2(gene count + 1) for histone genes (HIST1H*, HIST2H*, HIST3H*, HIST4H4), ribo- vs. poly(A)+ libraries]

Poly(A)+ vs. ribo-: genebody coverage (RIN ~ 2)

[Figure: coverage (%) along the gene body, 5' to 3', for ribo- and poly(A)+ libraries]

Bogdanova et al., Molecular BioSystems, 4: 205–212 (2008)

Duplex-specific nuclease

Yi et al., Nucleic Acids Research, 39(20):e140 (2011)

DSN effective at removing rRNA

Abundant transcript measurements could be affected by DSN treatment

Yi et al., Nucleic Acids Research, 39(20):e140 (2011)

Divalent cations promote RNA degradation

Mg2+

Silverman, Nucleic Acids Research, 33(19): 6151–6163 (2005)

Fragmentation of oligo-dT primed cDNA is more biased towards the 3' end of the transcript

Wang et al., Nature Reviews Genetics 10, 57-63 (2009)

Not-so-random ‘random’ hexamers

[Figure: nucleotide frequency (0.0 to 0.5) of A, T, G, C and N by read position (0 to 100) for sample SEQC-ILM-NVS-A-1_AD0902ACXX_ATCACG_L01]

Positional bias caused by choice of primers

Hansen et al., Nucleic Acids Research, 38(12):e131 (2010)

Levin et al., Nature Methods 7, 709–715 (2010)

Methods for strand-specific RNA-seq

Quality assessment

Levin et al., Nature Methods 7, 709–715 (2010)

dUTP most correlated with arrays

Levin et al., Nature Methods 7, 709–715 (2010)

Decision Tree

A guide to your analytical approach.

1. Did you QC the data?
   • no: QC the data (RSeQC, FastQC, FASTX-toolkit)
   • yes: continue
2. Do you have a reference genome?
   • no: Assemble the data (Velvet/Oases, Trans-ABySS, Trinity)
   • yes: continue
3. Are you interested in novel genes/junctions/isoforms?
   • yes: Align to genome with a de novo splice aligner (TopHat, STAR, SoapSplice)
   • no: continue
4. Do you just want gene expression values?
   • no: Align to genome (TopHat with GTF)
   • yes: Align to transcriptome (rSeq)

Trapnell et al., Bioinformatics, 25(9):1105-1111 (2009)

The TopHat pipeline

Dobin et al., Bioinformatics, 29(1):15-21 (2013)

STAR

STAR vs. TopHat

• STAR maps 17% more reads than TopHat

• Of the reads mapped in common, 80% concordance

• Differences often due to 1-offs

Martin et al., Nature Reviews Genetics, 12:671-682 (2011)

Reference-based transcriptome assembly

Martin et al., Nature Reviews Genetics, 12:671-682 (2011)

De novo transcriptome assembly

TopHat manual 

Read the docs!

Methods of quantifying gene expression

Wilhelm et al., Methods 48(3), 249-257 (2009)

Credit: Simon Anders (HTSeq)

Overlap resolution modes

Accurate quantification requires significant depth

Toung et al., Genome Res., 21(6):991-998 (2011)

Reads per kilobase of exon model per million mapped reads

Assumption: the sensitivity of RNA-Seq will be a function of both molar concentration and transcript length

RPKM: where C = number of mappable reads that fell onto the gene's exons, N = total number of mappable reads in the experiment, and L is the sum of the exons in base pairs:

$$\mathrm{RPKM} = \frac{10^{9} \cdot C}{N \cdot L}$$

“When these RNA standards are used in conjunction with information on cellular RNA content, absolute transcript levels per cell can also be calculated. For example, on the basis of literature values for the mRNA content of a liver cell and the RNA standards, we estimated that 3 RPKM corresponds to about one transcript per liver cell. For C2C12 tissue culture cells, for which we know the starting cell number and RNA preparation yields needed to make the calculation, a transcript of 1 RPKM corresponds to approximately one transcript per cell.”

Mortazavi et al., Nature Methods, 5: 621-628 (2008)
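A minimal sketch of the RPKM calculation, using the definitions of C, N and L given on this slide: reads on the gene's exons are scaled per kilobase of exon (L / 10^3) and per million mapped reads (N / 10^6), which gives RPKM = 10^9 · C / (N · L). The function name and the example numbers are hypothetical.

```python
def rpkm(exon_reads: int, total_mapped_reads: int, exon_length_bp: int) -> float:
    """Reads per kilobase of exon model per million mapped reads.

    exon_reads         -- C, mappable reads that fell onto the gene's exons
    total_mapped_reads -- N, total mappable reads in the experiment
    exon_length_bp     -- L, summed length of the gene's exons in base pairs
    """
    if total_mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("N and L must be positive")
    return 1e9 * exon_reads / (total_mapped_reads * exon_length_bp)


# Hypothetical gene: 500 reads on 2 kb of exons in a 20-million-read library.
value = rpkm(exon_reads=500, total_mapped_reads=20_000_000, exon_length_bp=2_000)
```

Doubling the exon length or the library size while holding the read count fixed halves the RPKM, which is the length and depth correction the measure is meant to provide.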

More sophisticated normalization methods are necessary

• Imagine we sequenced two RNA populations, A and B.

• Suppose every gene that is expressed in B is expressed in A with the same number of transcripts.

• Assume that sample A also contains a set of genes equal in number and expression that are not expressed in B. Thus, sample A has twice as many total expressed genes as sample B, that is, its RNA production is twice the size of sample B.

• Suppose that each sample is then sequenced to the same depth. Without any additional adjustment, a gene expressed in both samples will have, on average, half the number of reads from sample A, since the reads are spread over twice as many genes.

[Figure: reads from samples A and B; R(green) = 1/4 = 0.25 in A, R(green) = 2/4 = 0.5 in B]

Robinson & Oshlack, Genome Biology, 11(3): R25 (2010)
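The thought experiment can be checked with a few lines of arithmetic. This sketch computes expected read counts under the idealized assumption that reads are sampled in proportion to transcript abundance; the gene names, transcript numbers and depth are hypothetical.

```python
from typing import Dict


def expected_reads(transcripts: Dict[str, int], depth: int) -> Dict[str, float]:
    """Expected reads per gene when depth reads are sampled proportionally."""
    total = sum(transcripts.values())
    return {gene: depth * count / total for gene, count in transcripts.items()}


# Every gene expressed in B is expressed at the same level in A,
# and A additionally expresses an equally sized set of genes absent from B.
sample_b = {"green": 10, "blue": 10}
sample_a = {"green": 10, "blue": 10, "red": 10, "orange": 10}

depth = 1_000  # both samples sequenced to the same depth
reads_a = expected_reads(sample_a, depth)
reads_b = expected_reads(sample_b, depth)

# Share of reads assigned to the shared gene in each sample.
share_a = reads_a["green"] / depth
share_b = reads_b["green"] / depth
```

The green gene has identical expression in both samples, yet it receives half as many reads in A. Scaling by library size alone would therefore report it as two-fold down in A.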

Normalization procedure has a profound influence on 

which genes are detected as differentially expressed 

Hoffman et al., Genome Biology, 3(7):research0033.1-research0033.11 (2002)

Different normalization methods

1. Total count (TC): Gene counts are divided by the total number of mapped reads (or library size) associated with their lane and multiplied by the mean total count across all the samples of the dataset.

2. Upper Quartile (UQ): Very similar in principle to TC, the total counts are replaced by the upper quartile of counts different from 0 in the computation of the normalization factors.

3. Median (Med): Also similar to TC, the total counts are replaced by the median counts different from 0 in the computation of the normalization factors.

4. DESeq: This normalization method is included in the DESeq Bioconductor package (version 1.6.0) and is based on the hypothesis that most genes are not DE.

5. Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (version 2.4.0). It is also based on the hypothesis that most genes are not DE.

6. Quantile (Q): First proposed in the context of microarray data, this normalization method consists in matching distributions of gene counts across lanes.

7. Reads Per Kilobase per Million mapped reads (RPKM): This approach was initially introduced to facilitate comparisons between genes within a sample and combines between- and within-sample normalization.
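A rough sketch of the three simplest scaling methods (TC, UQ, Med) as described above, written with the standard library only. It is not the implementation used by any of the cited packages; the function names, the choice of the "inclusive" quartile method and the toy counts are assumptions.

```python
from statistics import mean, median, quantiles
from typing import Dict, List


def size_statistic(counts: List[int], method: str) -> float:
    """Per-lane statistic that the gene counts are divided by."""
    nonzero = [count for count in counts if count > 0]
    if method == "TC":   # total count (library size)
        return sum(counts)
    if method == "UQ":   # upper quartile of counts different from 0
        return quantiles(nonzero, n=4, method="inclusive")[2]
    if method == "Med":  # median of counts different from 0
        return median(nonzero)
    raise ValueError(f"unknown method: {method}")


def normalize(samples: Dict[str, List[int]], method: str = "TC") -> Dict[str, List[float]]:
    """Divide each lane by its statistic, then multiply by the mean statistic."""
    stats = {name: size_statistic(counts, method) for name, counts in samples.items()}
    target = mean(stats.values())
    return {
        name: [count * target / stats[name] for count in counts]
        for name, counts in samples.items()
    }


# Toy data: lane2 is lane1 sequenced twice as deeply.
samples = {"lane1": [10, 20, 0, 70], "lane2": [20, 40, 0, 140]}
tc = normalize(samples, "TC")
uq = normalize(samples, "UQ")
med = normalize(samples, "Med")
```

On this toy data all three methods remove the depth difference. They diverge when a few highly expressed genes dominate the total count, which is the situation UQ and Med are meant to handle better than TC.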

Dillies et al., Brief Bioinformatics, Epub (2012)

Comparison of normalization methods

DEGs in common for each of the normalization methods

Dillies et al., Brief Bioinformatics, Epub (2012)

Summary of the normalization methods

Dillies et al., Brief Bioinformatics, Epub (2012)

GAPDH isn’t much of a ‘housekeeper’

Barber et al., Physiol Genomics, 21:389-395 (2005)

Soneson & Delorenzi, BMC Bioinformatics, 14:91 (2013)

DEG dependent upon package

‘The property of being one and the property of being many are contraries.’

RNA-SEQ

Protocol-1 Protocol-2 Protocol-n

Broadly useful Linux tools / shell features

Linux tools

• awk - pattern scanning and processing language

• sed - stream editor for filtering and transforming text

• tr - translate or delete characters

• diff - find differences between two files

• comm - compare two sorted files line by line

• grep - print lines matching a pattern

• cut - remove sections from each line of files

• rename - rename files

• uniq - report or omit repeated lines

• sort - sort lines of text files

• parallel - execute shell commands in parallel

Shell features

• process substitution

• comm -12

Script limitations 

1) Linear execution of commands

2) Aborted scripts leave truncated files 

3) No convenient way to resume a script 

4) Poor audit trail

Structure of a makefile

General structure

target …: dependency …
	command
	…
	…

Example

%.sam: %.fastq.gz
	bowtie --sam reference $< $@

www.gnu.org/software/make

Maintain a bioinformatics journal

The problem

Paul: “I was just playing around with the data and ….”

… 8 months later:

Chris: Do you remember when you sent me …? Can we do that again on …?

Paul:

The solution

• Maintain a journal

• Record where and when you retrieved the data

• annotation databases, e.g., update often

• Describe actions in enough detail that others can replicate the result using the same steps

• If you think it’s trivial, log it anyway:

• Removed a header? Log it.

• Sorted a file? Log it.

• Created an alignment index? Log it.

• Fooling around? Log it.

More basic research to be done

Empirical dialectical method

p

negation

-p

empiricism

q

Closing

Knowledge is not a series of self-consistent theories that converges toward an ideal view; it is rather an ever increasing ocean of mutually incompatible (and perhaps even incommensurable) alternatives, each single theory, each fairy tale, each myth that is part of the collection forcing the others into greater articulation and all of them contributing, via this process of competition, to the development of our consciousness.

Paul Feyerabend
