14.04.2015 Views

next-generation sequencing & analysis - Chagall


Next-gen. sequencing and analysis
Paul Zumbo, Laboratory of Christopher E. Mason, Ph.D. (04/18/2013)
Credit: www.biocomicals.com, Alper Uzun, PhD.


Outline
I. Next-generation sequencing technologies
II. Alignment
III. Brief overview of a few applications
I. Whole genome sequencing
II. Chromatin immunoprecipitation sequencing
III. Methylation sequencing
IV. Transcriptome sequencing
V. Closing


Two major (interconnected) themes
1) Bioinformatics is not divorced from molecular biology: an understanding of molecular biology is a necessary condition for being able to effectively perform bioinformatics.
2) Methodological relativism – there is a dependency of the result on the biochemical or bioinformatic method (or both) employed in obtaining the result.
[Diagram labels: biochemical/bioinformatic manipulation; output]


Sequencing is still new


Chain termination
Deoxynucleotides (dATP)
Dideoxynucleotide (ddATP)
Wikipedia


Sanger Sequencing


Sanger sequencing – simple example
Template: AGCT
RXN tube A: A
RXN tube G: AG
RXN tube C: AGC
RXN tube T: AGCT
[Gel diagram: lanes A, G, C, T; bands ordered by fragment size, from A (shortest) to T (longest)]
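The toy example above can be mimicked in a few lines. This is only a sketch under the slide's own simplification (each reaction tube holds the fragments that end in its base, and the gel is read from shortest to longest band); the names `sanger_fragments` and `read_gel` are invented for illustration.

```python
def sanger_fragments(sequence):
    """Group every prefix of `sequence` by the base it ends in (one 'tube' per base)."""
    tubes = {base: [] for base in "AGCT"}
    for end in range(1, len(sequence) + 1):
        fragment = sequence[:end]
        tubes[fragment[-1]].append(fragment)
    return tubes


def read_gel(tubes):
    """Read the sequence back by sorting all bands by fragment length."""
    bands = [(len(fragment), base)
             for base, fragments in tubes.items()
             for fragment in fragments]
    return "".join(base for _, base in sorted(bands))


tubes = sanger_fragments("AGCT")
recovered = read_gel(tubes)
```

Because every fragment length occurs exactly once, sorting the bands by size is enough to recover the order of the bases.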


Metzker, Nat Rev Genetics, 11(1):31-46 (2010)
Template Immobilization Strategies


Sequencing by ligation (ABI SOLiD)


Di-base encoding
AT


Metzker, Nat Rev Genetics, 11(1):31-46 (2010)
Sequencing by synthesis


Metzker, Nat Rev Genetics, 11(1):31-46 (2010)
Reversible chain-terminators


Ion semiconductor sequencing (Ion Torrent)
Lindsay, J Phys Condens Matter, 24(16):164201 (2012)


Metzker, Nat Rev Genetics, 11(1):31-46 (2010)
Pyrosequencing (454)


Single-molecule real-time (SMRT) sequencing
Pacific Biosciences
Metzker, Nat Rev Genetics, 11(1):31-46 (2010)


Glenn, Molecular Ecology Resources, 11(5):759-769 (2011)
Throughput and cost


Paired-end sequencing
[Flowcell diagram, © Illumina]
1. Cluster amplification
2. Linearize DNA (1st cut)
3. Read 1: sequence 1st strand
4. Strand re-synthesis
5. Linearize DNA (2nd cut)
6. Read 2: sequence 2nd strand


Paired-end sequencing is good for…
• Assembly
• Detecting gene fusions
• Characterizing novel splice isoforms


Barcoding
© Illumina


Auer et al., Genetics, 185(2):405-16 (2010)
Barcoding balances technical effects


Different chemistries yield different error profiles


Substitution errors are platform dependent <br />

Wang et al., BMC Systems Biology, 6(Suppl 3):S21 (2012)


(Pre-)Phasing a source of errors
Nakamura et al., Nucleic Acids Research, 39(13):e90 (2011)


Glenn, Molecular Ecology Resources, 11(5):759-769 (2011)
Sequencing platforms


CLC Bio, Annual Survey (2012) <br />

Illumina dominates the market


CLC Bio, Annual Survey (2012) <br />

Primary application focus is WGS


NGS mostly applied toward basic research <br />

CLC Bio, Annual Survey (2012)


Aligners and alignment
Image reproduced by permission of Dr. Anthony Fejes


FASTQ format
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**
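To make the four-line record layout concrete, here is a minimal reader, assuming the simple unwrapped layout shown above (one line each for identifier, sequence, `+` separator, and quality). The function name `read_fastq` is illustrative and not taken from any particular library.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (a blank line also stops the reader)
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**\n"
)
records = list(read_fastq(example))
```

The length check reflects the one-to-one pairing between bases and quality characters.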


Illumina sequence identifiers (CASAVA 1.8.2)
@PC140529:266:C1THYACXX:6:2213:9313:32696 1:N:0:TAATGCGCGGCTCTGA
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>

Field            | Meaning
PC140529         | Instrument name
266              | Run # on instrument
C1THYACXX        | Flowcell ID
6                | Lane #
2213             | Tile #
9313             | X coordinate of cluster
32696            | Y coordinate of cluster
1                | Read number
N                | Y if the read is filtered
0                | 0 unless read identified as control
TAATGCGCGGCTCTGA | Barcode sequence

Illumina CASAVA 1.8.2 User Guide
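The field layout above maps naturally onto a small parser. The sketch below assumes exactly the colon-separated layout of the example identifier; `IlluminaId` and `parse_illumina_id` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    is_filtered: bool
    control: int
    barcode: str


def parse_illumina_id(line):
    """Split a CASAVA 1.8-style identifier line into its named fields."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError("identifier lines start with '@'")
    location, _, flags = line[1:].partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, barcode = flags.split(":")
    return IlluminaId(instrument, int(run), flowcell, int(lane), int(tile),
                      int(x), int(y), int(read), filtered == "Y",
                      int(control), barcode)


parsed = parse_illumina_id(
    "@PC140529:266:C1THYACXX:6:2213:9313:32696 1:N:0:TAATGCGCGGCTCTGA"
)
```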


Phred quality scores are linked to error probabilities

Phred quality score | Probability that the base is called wrong | Accuracy of the base call
10                  | 1 in 10                                   | 90%
20                  | 1 in 100                                  | 99%
30                  | 1 in 1,000                                | 99.9%
40                  | 1 in 10,000                               | 99.99%
50                  | 1 in 100,000                              | 99.999%
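The table follows the standard Phred definition, Q = -10 · log10(P), where P is the probability that the base call is wrong. The sketch below turns scores into probabilities and decodes a FASTQ quality string; the ASCII offset of 33 is an assumption (it holds for Sanger-style and recent Illumina files, but older encodings used other offsets).

```python
def phred_to_error_probability(q):
    """Probability that a base with Phred score `q` was called incorrectly."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Convert FASTQ quality characters to Phred scores (offset 33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


scores = decode_quality("!5I")
error_probabilities = [phred_to_error_probability(q) for q in (10, 20, 30, 40, 50)]
```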


Quality scores useful for…
• Assessment of sequence quality
• Recognition and removal of low-quality sequence
• Determination of accurate consensus sequence


Wikipedia <br />

Quality scores vary by platform


Two main types of alignment algorithms
1. Algorithms based on hash tables
2. Algorithms based on suffix/prefix tries


Hash table
[Diagram: a hash function maps keys (Key 1, Key 2, Key 3) to values (Value 1, Value 2, Value 3)]


Suffix trie
Suffixes of “Bananas”: Bananas, ananas, nanas, anas, nas, as, s
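A suffix trie for the word above can be built by inserting every suffix into a nested dictionary. This is a naive sketch for illustration (it uses memory quadratic in the text length, which real aligners avoid with compressed structures); the function names and the `END` marker are assumptions of this example.

```python
END = None  # marker key: a suffix of the text ends at this node


def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(trie, pattern):
    """True if `pattern` occurs anywhere in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def is_suffix(trie, pattern):
    """True if `pattern` is a suffix of the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


trie = build_suffix_trie("bananas")
```

Any substring of the text is a prefix of some suffix, which is why a walk from the root answers substring queries in time proportional to the pattern length.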


Spaced-seed
• Cut each position in reference into equal-sized pieces called “seeds”
• Store pairs of spaced seeds in a lookup table (index)
• Look up spaced seeds for each read
• For each match, confirm remaining positions
• Report alignment to user
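The steps above can be sketched with a hash table keyed by spaced seeds. This is a deliberately simplified illustration, not how any specific aligner works: it uses a single seed taken from the start of each read (rather than pairs of seeds), a mask string such as `"1101"` in which `1` marks sampled positions, and a plain mismatch count for the confirmation step. All names and the default of one allowed mismatch are assumptions.

```python
from collections import defaultdict


def seed_key(text, start, mask):
    """Concatenate the characters of `text` at the positions where mask has '1'."""
    return "".join(text[start + i] for i, bit in enumerate(mask) if bit == "1")


def build_index(reference, mask):
    """Lookup table: spaced-seed key -> list of reference start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - len(mask) + 1):
        index[seed_key(reference, start, mask)].append(start)
    return index


def align(read, reference, index, mask, max_mismatches=1):
    """Return (position, mismatches) for every seed hit confirmed over the full read."""
    hits = []
    if len(read) < len(mask):
        return hits
    for start in index.get(seed_key(read, 0, mask), []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "GATTACAGATTTCA"
mask = "1101"
index = build_index(reference, mask)
exact = align("GATTTCA", reference, index, mask)
with_error = align("GAGTTCA", reference, index, mask)
```

The second read carries a mismatch at the unsampled seed position, so it is still found through the index; that tolerance is the point of spacing the seed.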


Burrows-Wheeler transform
• Initially developed for data compression
• Store entire BW-transformed reference in memory
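For intuition, the transform and its inverse can be written naively by sorting rotations. This sketch is for small strings only (real aligners build the transform via suffix arrays and pair it with an index for searching); the `$` terminator and the function names are assumptions of this example.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform: last column of the sorted rotations of text + terminator."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Rebuild the original text by repeatedly prepending the transform and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform tends to group identical characters together, which is what made it attractive for compression in the first place.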


Ambiguous reads <br />

read <br />

reference


Sequence Alignment/Map (SAM) format
The SAM Format Specification (v1.4-r985)


0-based vs. 1-based coordinate systems

System  | First base | Interval type | Notation | Meaning                       | Examples
0-based | 0          | half-open     | [a,b)    | coordinate does not include b | BAM, BED
1-based | 1          | closed        | [a,b]    | coordinate includes b         | SAM, GTF/GFF, wiggle, VCF
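Converting between the two conventions only shifts the start coordinate, as this small sketch shows. The function names are illustrative.

```python
def zero_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    return start + 1, end


def one_to_zero_based(start, end):
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def length_zero_based(start, end):
    return end - start


def length_one_based(start, end):
    return end - start + 1


bed_interval = (0, 10)                        # first ten bases, BED-style
gff_interval = zero_to_one_based(*bed_interval)
```

Both intervals describe the same ten bases; only the arithmetic for the length differs.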


70 short-read aligners
BWA, BarraCUDA, Bfast, BioScope, Bowtie/Bowtie2, CLC bio, CloudBurst, cross_match, Eland/ELAND2, GEM, GMAP/GSNAP, GenomeMapper, GensearchNGS, GnuMap,
Karma, LAST, MAQ, MOM, Mosaik, MrFast/MrsFast, NovoAlign, PASS, PerM, RMAP, RazerS, SHRiMP, SHrec, SOAP/SOAP2,
SSAHA2, STAR, SToRM, Segemehl, SeqMap, Slider/SliderII, Srprism, Stampy, Vmatch, ZOOM, rNA, subread


Direct comparison of aligners difficult
• Not all aligners have the same options
• Comparisons are often on the basis of default parameters
• Synthetic reads often used, which do not possess the same error profile as ‘real’ reads
• Benchmarking common


Koboldt, MassGenomics <br />

10 aligners compared


Simulated 2 million C. elegans read pairs
• Used MAQ read simulator
• From a ‘mutated’ version of the C. elegans genome that contained ~90,000 SNPs and ~10,000 indels
• Seq IDs contained ‘real’ read location:
>chrI_7433167_7433425_1/1
TCGTATTATAACGACGAATACGCGTCAAGTGGGAGT
Koboldt, MassGenomics


Koboldt, MassGenomics <br />

CPU time


cross_match & NovoAlign PE place the <br />

highest number of reads correctly <br />

Koboldt, MassGenomics


SNPs have a greater influence than indels on accuracy

Koboldt, MassGenomics


Repeated with real data from the 1000 <br />

Genomes Project <br />

Koboldt, MassGenomics


Read Placement Results for ~2 Million <br />

Reads <br />

Koboldt, MassGenomics


Top Performing Aligners

Alignment Need       | Recommendation
Proven Functionality | Maq
Speed                | Bowtie, SOAP
Sensitivity          | Novoalign, cross_match
Usability            | CLC bio, SeqMap
Flexibility          | BFAST
Variant Detection    | SOAP, Novoalign

Koboldt, MassGenomics


Mark Watson, Roslin Institute
The ‘Watson square’


Bioo Scientific Corp.
Whole genome sequencing


Fragmentation method biases coverage


Read depth as a function of <br />

fragmentation method


Commonly Used Framework for Calling Mutations – the GATK


Genotype calling - GATK


Genotype calling - Illumina
• CASAVA (ELANDv2, multi-seeded aligner):
  • SNVs
    • Base calls are ignored where more than 2 mismatches to the reference sequence occur within 20 bases of the call. Note that this filter treats each insertion or deletion as a single mismatch.
    • If the call occurs within the first or last 20 bases of a read then the mismatch limit is applied to the 41 base window at the corresponding end of the read.
    • The mismatch limit is applied to the entire read when the read length is 41 or shorter.
  • Indels: callSmallVariants module (local realignment)
  • GROUPER (large indels, CNVs)
• Translocations/Duplications:
  • ClusterMerger (chimeric read pair filter)
  • ReadBroker (builds evidence by merging clusters)
  • AlignContig (contig re-alignment for re-arrangements)
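The mismatch-density rule for SNV base calls can be sketched as a window check. This is an illustration of the rule as worded on the slide, not CASAVA's actual code: positions are 0-based offsets within the read, each indel is represented by a single position, and counting a mismatch at the call position itself toward the limit is an assumption. The function names are hypothetical.

```python
def window_for_call(pos, read_length, flank=20):
    """Half-open [start, end) window in which mismatches are counted for a call at `pos`."""
    size = 2 * flank + 1                      # 41 bases for the default flank
    if read_length <= size:
        return 0, read_length                 # short read: use the entire read
    start = min(max(pos - flank, 0), read_length - size)
    return start, start + size                # clamped to the read ends


def call_is_usable(pos, mismatch_positions, read_length, max_mismatches=2, flank=20):
    """False if more than `max_mismatches` mismatches fall inside the window."""
    start, end = window_for_call(pos, read_length, flank)
    in_window = sum(start <= m < end for m in mismatch_positions)
    return in_window <= max_mismatches


middle = window_for_call(50, 100)
left_end = window_for_call(5, 100)
right_end = window_for_call(95, 100)
short_read = window_for_call(10, 36)
```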


GATK vs. Illumina


SNP calling – best practices
1) Sufficient coverage (20x)
2) Reads from forward and reverse strands
3) Variable start sites
4) ? High Q-scores ?
AGCTTTCGTACGATACCCATGACTATACTA


ChIP sequencing


Motif finding in tag enriched regions


Two fundamental types of peaks
Kidder et al., Nature Immunology, 12:918-922 (2011)


ChIP-seq peak calling programs
Wilbanks et al., PLoS One, 5(7):e11471 (2010)


Different number of peaks identified
Wilbanks et al., PLoS One, 5(7):e11471 (2010)


Wilbanks et al., PLoS One, 5(7):e11471 (2010)
Sensitivity assessment


Lister et al., Genome Research, 19:959-966 (2009)
Methylation sequencing


Bisulfite conversion


Conversion yields up to four potentially different DNA fragments
Krueger et al., Nature Methods, 9:145-151 (2012)


Krueger & Andrews, Bioinformatics, 27(11):1571-1572 (2011)
Bismark for mapping


Akalin et al., Genome Biology, 13:R87 (2012)
methylKit for the analysis


Transcriptome sequencing
a.k.a. RNA-seq
Image reproduced by permission of Dr. Anthony Fejes


Definitions
• RNA = class of nucleic acids characterized by the sugar ribose and the pyrimidine uracil; involved in protein synthesis and in the transmission of genetic information
• Transcriptome = complete set of RNA molecules produced in one or a population of cells, and their quantity, for a specific developmental stage or physiological condition
• RNA-seq = sequencing of the transcriptome


Technique workflow
RNA isolation
• phenol-chloroform extraction
• silica membrane
RNA fractionation (optional)
• poly(A)+
• ribo-minus
• DSN
Library preparation
• fragmentation
• directional
Sequencing
• Illumina
• SOLiD
• Helicos
Analysis
• reference-based
• assembly


Most typical RNA-seq
1. fragmentation of RNA (RNA molecules to RNA fragments)
2. random priming to make sscDNA (first-strand synthesis)
3. construction of dscDNA (second-strand synthesis)
4. size selection (gel cutout between short and long fragments)
5. sequencing (paired-end reads)
6. mapping (sense and anti-sense reads against the RNA sequence)
Roberts et al., Genome Biology, 12(3):R22 (2011)


Major classes of RNA

Type (% in cell)    | Example   | Function                                               | Size (nt)
Transfer RNA (15%)  | tRNA ala  | Transferring of alanine during protein synthesis.      | 73
Transfer RNA (15%)  | tRNA leu  | Transferring of leucine during protein synthesis.      | 83
Ribosomal RNA (80%) | 5S rRNA   | Components of a ribosome in eukaryotes                 | 121
Ribosomal RNA (80%) | 5.8S rRNA | Components of a ribosome in eukaryotes                 | 156
Ribosomal RNA (80%) | 18S rRNA  | Components of a ribosome in eukaryotes                 | 1869
Ribosomal RNA (80%) | 28S rRNA  | Components of a ribosome in eukaryotes                 | 5070
Messenger RNA (5%)  | GAPDH     | glyceraldehyde-3-phosphate dehydrogenase, var. 1, mRNA | 1401
Messenger RNA (5%)  | ACTB      | beta actin, mRNA                                       | 1852

Lowe et al., Nucleic Acids Research 25(5), 955-964 (1997)
NCBI Reference Sequences: NR_023379.1, NR_003285.2, NR_003286.2, NR_003287.2
NCBI RefSeqGene IDs: 2597 (NM_002046.4), 60 (NM_001101.3)
Lodish et al., Molecular Cell Biology. 4th ed. New York: W. H. Freeman (2000)


Composition of RNA depends upon method of preparation
In view of the apparent dependence of the composition of ribonucleic acids upon the methods of preparation as well as the certain considerations appear necessary for a proper assessment of physical and chemical descriptions of ribonucleic acid preparations: (a) The preparative procedure and source, (b) a statement of the physical-chemical homogeneity of the ribonucleic acid, (c) a description of the degradation of the ribonucleic acid by ribonuclease, and (d) a statement of the mononucleotide composition of the ribonucleic acid. It is with reference to these considerations that mammalian ribonucleic acids prepared by the guanidine salt procedure hereinafter described have been studied.
Volkin & Carter, J. of American Chemical Society 73(4), 1516-1519 (1951)


Different extraction techniques
GT:phenol:chloroform extraction
Silica membrane


Silica membrane enriches for RNA molecules > 200 nt
Mraz et al., Biochem Biophys Res Commun., 390(1): 1-4 (2009)
Haimov-Kochman et al., Clin Chem., 52(1):159-60 (2006)


“RNA was extracted from the cell lines using standard phenol:chloroform extraction followed by ethanol precipitation”
Kedzierski & Porter, BioTechniques 10(2), 210-214 (1991)


Zeugin JA & Hartley JL, Focus 7(4), 1-4 (1985)
Ethanol Precipitation


Schroeder et al., BMC Mol Biol. 7(3) (2006)
RNA integrity


Genebody coverage across degraded poly(A)+ samples
[Plot: coverage (%) along the gene body, 5' to 3', for samples RIN0, RIN3, RIN6, and RIN9]


Removal of ribosomal RNA <br />

Indirect <br />

Direct <br />

Credit: Invitrogen


Gene-expression comparison of protein-coding genes
Cui et al., Genomics 96(5), 259-265 (2010)


Abundance of non-coding transcripts
Indirect
Direct
Cui et al., Genomics 96(5), 259-265 (2010)


Capturing stem-loop mRNA
[Bar plot: log2(gene count + 1) for ribo- vs. poly(A)+ libraries across histone genes (HIST1H*, HIST2H*, HIST3H*, HIST4H4)]


Poly(A)+ vs. ribo-: genebody coverage (RIN ~ 2)
[Plot: coverage (%) along the gene body, 5' to 3', for ribo- and poly(A)+ libraries]


Bogdanova et al., Molecular BioSystems, 4: 205–212 (2008)
Duplex-specific nuclease


Yi et al., Nucleic Acids Research, 39(20):e140 (2011)
DSN effective at removing rRNA


Abundant transcript measurements could be affected by DSN treatment
Yi et al., Nucleic Acids Research, 39(20):e140 (2011)


Divalent cations promote RNA degradation
Mg2+
Silverman, Nucleic Acids Research, 33(19): 6151–6163 (2005)


Fragmentation of oligo-dT primed cDNA is more biased towards the 3' end of the transcript
Wang et al., Nature Reviews Genetics 10, 57-63 (2009)


Not-so-random ‘random’ hexamers
[Plot: nucleotide frequency (A, T, G, C, N; axis 0.0 to 0.5) by read position (0 to 100) for library SEQC-ILM-NVS-A-1_AD0902ACXX_ATCACG_L01]
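A per-position nucleotide frequency profile like the one plotted on this slide can be computed directly from the reads. The sketch below is a generic illustration (not the code used to make the figure) and assumes the reads are plain strings over A, T, G, C, and N; the function name is made up.

```python
from collections import Counter


def nucleotide_frequency_by_position(reads, alphabet="ATGCN"):
    """Return one {base: frequency} dict per read position."""
    length = max(len(read) for read in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for position, base in enumerate(read):
            counts[position][base] += 1
    return [
        {base: column[base] / sum(column.values()) for base in alphabet}
        for column in counts
    ]


profile = nucleotide_frequency_by_position(["ACGT", "ACGA", "TCGN"])
```

With unbiased priming the four curves would be roughly flat from the first base; a skew confined to the first positions of the read is the signature the slide title refers to.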


Positional bias caused by choice of primers
Hansen et al., Nucleic Acids Research, 38(12):e131 (2010)


Levin et al., Nature Methods 7, 709–715 (2010)
Methods for strand-specific RNA-seq


Quality assessment
Levin et al., Nature Methods 7, 709–715 (2010)


dUTP most correlated with arrays
Levin et al., Nature Methods 7, 709–715 (2010)


Decision Tree
A guide to your analytical approach.
1. Did you QC the data?
   • no: QC the data (RSeQC, FastQC, FASTX-toolkit)
   • yes: go to 2
2. Do you have a reference genome?
   • no: Assemble the data (Velvet/Oases, Trans-ABySS, Trinity)
   • yes: go to 3
3. Are you interested in novel genes/junctions/isoforms?
   • yes: Align to genome with a de novo splice aligner (TopHat, STAR, SoapSplice)
   • no: go to 4
4. Do you just want gene expression values?
   • no: Align to genome (TopHat with GTF)
   • yes: Align to transcriptome (rSeq)


Trapnell et al., Bioinformatics, 25(9):1105-1111 (2009)
The TopHat pipeline


Dobin et al., Bioinformatics, 29(1):15-21 (2013)
STAR


STAR vs. TopHat


STAR vs. TopHat
• STAR maps 17% more reads than TopHat
• Of the reads mapped in common, 80% concordance
• Differences often due to 1-offs


Martin et al., Nature Reviews Genetics, 12:671-682 (2011)
Reference-based transcriptome assembly


Martin et al., Nature Reviews Genetics, 12:671-682 (2011)
De novo transcriptome assembly


TopHat manual
Read the docs!


Methods of quantifying gene expression
Wilhelm et al., Methods 48(3), 249-257 (2009)


Credit: Simon Anders (HTSeq)
Overlap resolution modes


Accurate quantification requires significant depth
Toung et al., Genome Res., 21(6):991-998 (2011)


Reads per kilobase of exon model per million mapped reads
Assumption: the sensitivity of RNA-Seq will be a function of both molar concentration and transcript length
RPKM: where C = number of mappable reads that fell onto the gene's exons, N = total number of mappable reads in the experiment, and L is the sum of the exons in base pairs:
\[ \mathrm{RPKM} = \frac{10^{9}\, C}{N\, L} \]
“When these RNA standards are used in conjunction with information on cellular RNA content, absolute transcript levels per cell can also be calculated. For example, on the basis of literature values for the mRNA content of a liver cell and the RNA standards, we estimated that 3 RPKM corresponds to about one transcript per liver cell. For C2C12 tissue culture cells, for which we know the starting cell number and RNA preparation yields needed to make the calculation, a transcript of 1 RPKM corresponds to approximately one transcript per cell.”
Mortazavi et al., Nature Methods, 5: 621-628 (2008)
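As a worked example of the definition, using the symbols C, N, and L from the slide, RPKM scales the read count by exon length in kilobases and by library size in millions, which together give the factor 10^9. The function name and the numbers below are illustrative only.

```python
def rpkm(c, n, l):
    """Reads per kilobase of exon model per million mapped reads.

    c: mappable reads that fell onto the gene's exons
    n: total mappable reads in the experiment
    l: summed exon length in base pairs
    """
    return 1e9 * c / (n * l)


# 500 reads on a gene with 2,000 bp of exons, in a library of 10 million mapped reads
value = rpkm(500, 10_000_000, 2_000)
```

Here 500 reads over 2 kb is 250 reads per kilobase, and dividing by 10 (million reads) gives 25 RPKM.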


More sophisticated normalization methods are necessary
• Imagine we sequenced two RNA populations, A and B.
• Suppose every gene that is expressed in B is expressed in A with the same number of transcripts.
• Assume that sample A also contains a set of genes equal in number and expression that are not expressed in B. Thus, sample A has twice as many total expressed genes as sample B, that is, its RNA production is twice the size of sample B.
• Suppose that each sample is then sequenced to the same depth. Without any additional adjustment, a gene expressed in both samples will have, on average, half the number of reads from sample A, since the reads are spread over twice as many genes.
[Illustration: in sample A, R(green) = 1/4 = 0.25; in sample B, R(green) = 2/4 = 0.5]
Robinson & Oshlack, Genome Biology, 11(3): R25 (2010)
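The thought experiment above is easy to reproduce numerically. The sketch below assumes reads are distributed in exact proportion to transcript abundance (no sampling noise); the gene names, transcript counts, and sequencing depth are invented for illustration.

```python
def expected_reads(transcripts, depth):
    """Expected read count per gene when `depth` reads are spread proportionally."""
    total = sum(transcripts.values())
    return {gene: depth * count / total for gene, count in transcripts.items()}


# Sample B expresses two genes; sample A expresses the same two at the same
# level plus an equal-sized set of genes that B lacks.
sample_b = {"g1": 100, "g2": 100}
sample_a = {"g1": 100, "g2": 100, "g3": 100, "g4": 100}

reads_a = expected_reads(sample_a, depth=1000)
reads_b = expected_reads(sample_b, depth=1000)
ratio = reads_a["g1"] / reads_b["g1"]
```

Gene g1 has the same number of transcripts in both samples, yet it receives half as many reads in A, so scaling by library size alone would wrongly call it down-regulated.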


Normalization procedure has a profound influence on which genes are detected as differentially expressed
Hoffman et al., Genome Biology, 3(7):research0033.1-research0033.11 (2002)


Different normalization methods
1. Total count (TC): Gene counts are divided by the total number of mapped reads (or library size) associated with their lane and multiplied by the mean total count across all the samples of the dataset.
2. Upper Quartile (UQ): Very similar in principle to TC, the total counts are replaced by the upper quartile of counts different from 0 in the computation of the normalization factors.
3. Median (Med): Also similar to TC, the total counts are replaced by the median counts different from 0 in the computation of the normalization factors.
4. DESeq: This normalization method is included in the DESeq Bioconductor package (version 1.6.0) and is based on the hypothesis that most genes are not DE.
5. Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (version 2.4.0). It is also based on the hypothesis that most genes are not DE.
6. Quantile (Q): First proposed in the context of microarray data, this normalization method consists in matching distributions of gene counts across lanes.
7. Reads Per Kilobase per Million mapped reads (RPKM): This approach was initially introduced to facilitate comparisons between genes within a sample and combines between- and within-sample normalization.
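The first three methods differ only in the per-sample scaling factor, which a short sketch can make explicit. This is a plain-Python illustration of the descriptions above, not the code of any Bioconductor package: the count matrix is a list of gene rows with one column per sample, the rescaling by the mean factor follows the TC description and is assumed to carry over to UQ and Med, and the quartile uses the `inclusive` method of `statistics.quantiles` (other quartile definitions give slightly different factors).

```python
from statistics import mean, median, quantiles


def size_factors(counts, method="TC"):
    """One scaling factor per sample (column) of a genes x samples count matrix."""
    factors = []
    for sample in zip(*counts):
        nonzero = [c for c in sample if c > 0]
        if method == "TC":
            factors.append(sum(sample))
        elif method == "UQ":
            factors.append(quantiles(nonzero, n=4, method="inclusive")[2])
        elif method == "Med":
            factors.append(median(nonzero))
        else:
            raise ValueError("unknown method: " + method)
    return factors


def normalize(counts, method="TC"):
    """Divide each count by its sample's factor, then rescale by the mean factor."""
    factors = size_factors(counts, method)
    target = mean(factors)
    return [[c / f * target for c, f in zip(row, factors)] for row in counts]


# Second sample is the first one sequenced twice as deeply.
counts = [[10, 20], [30, 60], [0, 0], [60, 120]]
normalized = {method: normalize(counts, method) for method in ("TC", "UQ", "Med")}
```

In this toy matrix the two samples differ only in depth, so all three methods bring the columns into agreement; they diverge once a few highly expressed genes dominate the totals.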


Dillies et al., Brief Bioinformatics, Epub (2012)
Comparison of normalization methods


DEGs in common for each of the normalization methods
Dillies et al., Brief Bioinformatics, Epub (2012)


Summary of the normalization methods
Dillies et al., Brief Bioinformatics, Epub (2012)


GAPDH isn’t much of a ‘housekeeper’
Barber et al., Physiol Genomics, 21:389-395 (2005)


Soneson & Delorenzi, BMC Bioinformatics, 14:91 (2013)
DEG dependent upon package


‘The property of being one and the property of being many are contraries.’
RNA-SEQ
Protocol-1 Protocol-2 Protocol-n


Broadly useful Linux tools / shell features
Linux tools
• awk - pattern scanning and processing language
• sed - stream editor for filtering and transforming text
• tr - translate or delete characters
• diff - find differences between two files
• comm - compare two sorted files line by line
• grep - print lines matching a pattern
• cut - remove sections from each line of files
• rename - rename files
• uniq - report or omit repeated lines
• sort - sort lines of text files
• parallel - execute shell commands in parallel
Shell features
• process substitution
• comm -12


Script limitations
1) Linear execution of commands
2) Aborted scripts leave truncated files
3) No convenient way to resume a script
4) Poor audit trail


Structure of a makefile
General structure
target …: dependency …
	command
	…
	…
Example
%.sam: %.fastq.gz
	bowtie --sam reference $< $@
www.gnu.org/software/make


Maintain a bioinformatics journal
The problem
Paul: “I was just playing around with the data and ….”
… 8 months later:
Chris: Do you remember when you sent me …? Can we do that again on …?
Paul:
The solution
• Maintain a journal
• Record where and when you retrieved the data
  • annotation databases, e.g., update often
• Describe actions in enough detail that others can replicate the result using the same steps
• If you think it’s trivial, log it anyway:
  • Removed a header? Log it.
  • Sorted a file? Log it.
  • Created an alignment index? Log it.
  • Fooling around? Log it.


More basic research to be done


Empirical dialectical method
p
negation
-p
empiricism
q


Closing
Knowledge is not a series of self-consistent theories that converges toward an ideal view; it is rather an ever increasing ocean of mutually incompatible (and perhaps even incommensurable) alternatives, each single theory, each fairy tale, each myth that is part of the collection forcing the others into greater articulation and all of them contributing, via this process of competition, to the development of our consciousness.
Paul Feyerabend
