next-generation sequencing & analysis - Chagall
Next-gen. sequencing and analysis <br />
Paul Zumbo, Laboratory of Christopher E. Mason, Ph.D. (04/18/2013) <br />
Credit: www.biocomicals.com, Alper Uzun, PhD.
Outline <br />
I. Next-generation sequencing technologies <br />
II. Alignment <br />
III. Brief overview of a few applications <br />
    I. Whole genome sequencing <br />
    II. Chromatin immunoprecipitation sequencing <br />
    III. Methylation sequencing <br />
    IV. Transcriptome sequencing <br />
V. Closing
Two major (interconnected) themes <br />
1) Bioinformatics is not divorced from molecular <br />
biology: an understanding of molecular biology is a <br />
necessary condition for being able to effectively <br />
perform bioinformatics. <br />
2) Methodological relativism: there is a dependency <br />
of the result on the biochemical or bioinformatic <br />
method (or both) employed in obtaining the result. <br />
Biochemical/bioinformatic manipulation → output
Sequencing is still new
Chain termination <br />
Deoxynucleotides (dATP) <br />
Dideoxynucleotide (ddATP) <br />
Wikipedia
Sanger Sequencing
Sanger sequencing: simple example <br />
Template: AGCT <br />
RXN tube A: A <br />
RXN tube G: AG <br />
RXN tube C: AGC <br />
RXN tube T: AGCT <br />
[Gel diagram: lanes A, G, C, T; fragments ordered by size, from A (shortest) to T (longest)]
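The simple example above can be sketched in a few lines of Python. This is only an illustration under the slide's own simplification (each fragment is written as a prefix of the template, ignoring complementarity); the function names are hypothetical.

```python
def sanger_fragments(template):
    """Group chain-terminated fragments by the ddNTP reaction tube that produced them."""
    tubes = {base: [] for base in "ACGT"}
    for end in range(1, len(template) + 1):
        fragment = template[:end]
        # The fragment terminates in the tube whose ddNTP matches its last base.
        tubes[fragment[-1]].append(fragment)
    return tubes


def read_gel(tubes):
    """Read the sequence off the gel: shortest fragment first, one base per band."""
    bands = sorted((len(frag), base) for base, frags in tubes.items() for frag in frags)
    return "".join(base for _, base in bands)


tubes = sanger_fragments("AGCT")
print(tubes)
print(read_gel(tubes))
```

Sorting the bands by length mirrors reading the gel from the bottom (smallest fragments) upward.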
Metzker, Nat Rev Genetics, 11(1):31-46 (2010) <br />
Template Immobilization Strategies
Sequencing by ligation (ABI SOLiD)
Di-base encoding <br />
AT
Metzker, Nat Rev Genetics, 11(1):31-46 (2010) <br />
Sequencing by synthesis
Metzker, Nat Rev Genetics, 11(1):31-46 (2010) <br />
Reversible chain-terminators
Ion semiconductor sequencing <br />
(Ion Torrent) <br />
Lindsay, J Phys Condens Matter, 24(16):164201 (2012)
Metzker, Nat Rev Genetics, 11(1):31-46 (2010) <br />
Pyrosequencing (454)
Single-molecule real-time (SMRT) sequencing <br />
Pacific Biosciences <br />
Metzker, Nat Rev Genetics, 11(1):31-46 (2010)
Glenn, Molecular Ecology Resources, 11(5):759-769 (2011) <br />
Throughput and cost
Paired-end sequencing <br />
1. Cluster amplification on the flowcell <br />
2. Linearize DNA (1st cut) <br />
3. Read 1: sequence 1st strand <br />
4. Strand re-synthesis <br />
5. Linearize DNA (2nd cut) <br />
6. Read 2: sequence 2nd strand <br />
© Illumina
Paired-end sequencing is good for… <br />
• Assembly <br />
• Detecting gene fusions <br />
• Characterizing novel splice isoforms
Barcoding <br />
© Illumina
Auer et al., Genetics, 185(2):405-16 (2010) <br />
Barcoding balances technical <br />
effects
Different chemistries yield different <br />
error profiles <br />
Substitution errors are platform dependent <br />
Wang et al., BMC Systems Biology, 6(Suppl 3):S21 (2012)
(Pre-)Phasing is a source of errors <br />
Nakamura et al., Nucleic Acids Research, 39(13):e90 (2011)
Glenn, Molecular Ecology Resources, 11(5):759-769 (2011) <br />
Sequencing platforms
CLC Bio, Annual Survey (2012) <br />
Illumina dominates the market
CLC Bio, Annual Survey (2012) <br />
Primary application focus is WGS
NGS mostly applied toward basic research <br />
CLC Bio, Annual Survey (2012)
Aligners and alignment <br />
Image reproduced by permission of Dr. Anthony Fejes
FASTQ format <br />
@SEQ_ID<br />
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC<br />
+<br />
!''*((((***+))%%%++)(%%%%).1***-+*''))**
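A minimal sketch of reading such a record in Python. It assumes strict four-line records and the Phred+33 quality encoding; other encodings exist, so the offset is an assumption, and `read_fastq` is a hypothetical helper rather than part of any library.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        # Assumption: Phred+33 encoding, so '!' (ASCII 33) means quality 0.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


record = """@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**
"""

for name, seq, quals in read_fastq(io.StringIO(record)):
    print(name, len(seq), min(quals), max(quals))
```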
Illumina sequence identifiers (CASAVA 1.8.2) <br />
@PC140529:266:C1THYACXX:6:2213:9313:32696 1:N:0:TAATGCGCGGCTCTGA <br />
@instrument:run:flowcell:lane:tile:x:y read:filtered:control:barcode <br />
Illumina CASAVA 1.8.2 User Guide <br />
PC140529: Instrument name <br />
266: Run # on instrument <br />
C1THYACXX: Flowcell ID <br />
6: Lane # <br />
2213: Tile # <br />
9313: X coordinate of cluster <br />
32696: Y coordinate of cluster <br />
1: Read number <br />
N: Y if the read is filtered <br />
0: 0 unless read identified as control <br />
TAATGCGCGGCTCTGA: Barcode sequence
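The identifier layout above can be unpacked with a short parser. This is a sketch that assumes the two space-separated, colon-delimited groups shown on the slide; the class and field names are hypothetical labels, not Illumina's.

```python
from typing import NamedTuple


class IlluminaReadId(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    filtered: bool
    control: int
    barcode: str


def parse_read_id(line):
    """Split a CASAVA 1.8-style identifier into its eleven fields."""
    location, flags = line.lstrip("@").split(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, barcode = flags.split(":")
    return IlluminaReadId(instrument, int(run), flowcell, int(lane), int(tile),
                          int(x), int(y), int(read), filtered == "Y",
                          int(control), barcode)


rid = parse_read_id("@PC140529:266:C1THYACXX:6:2213:9313:32696 1:N:0:TAATGCGCGGCTCTGA")
print(rid)
```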
Phred quality scores are linked to error probabilities <br />
Phred quality score | Probability that the base is called wrong | Accuracy of the base call <br />
10 | 1 in 10 | 90% <br />
20 | 1 in 100 | 99% <br />
30 | 1 in 1,000 | 99.9% <br />
40 | 1 in 10,000 | 99.99% <br />
50 | 1 in 100,000 | 99.999%
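The table follows from the standard Phred definition, Q = -10 log10(P), where P is the probability that the base call is wrong. A small sketch (function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Probability that a base with Phred score q was called wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(q, p, f"accuracy {(1 - p) * 100:.3f}%")
```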
Quality scores useful for… <br />
• Assessment of sequence quality <br />
• Recognition and removal of low-quality sequence <br />
• Determination of accurate consensus sequence
Wikipedia <br />
Quality scores vary by platform
Two main types of alignment algorithms <br />
1. Algorithms based on hash tables <br />
2. Algorithms based on suffix/prefix tries
Hash table <br />
Hash Function <br />
Key 1 <br />
Value 3 <br />
Key 2 <br />
Value 2 <br />
Key 3 <br />
Value 1
Bananas <br />
ananas <br />
nanas <br />
anas <br />
nas <br />
as <br />
s <br />
Suffix trie
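A suffix trie for the word above can be sketched with nested dictionaries. This is a toy illustration only (real aligners use far more compact structures); `$` is an assumed end-of-string marker and the function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text (plus a '$' terminator) into a nested-dict trie."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:] + "$":
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """A pattern occurs in the text if it spells a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("bananas")
print(contains(trie, "nana"), contains(trie, "nab"))
```

Every substring of the text is a prefix of some suffix, which is why a lookup only has to walk from the root.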
Spaced-seed <br />
• Cut each position in reference into <br />
equal-sized pieces called “seeds” <br />
• Store pairs of spaced seeds in a lookup <br />
table (index) <br />
• Look up spaced seeds for each read <br />
• For each match, confirm remaining <br />
positions <br />
• Report alignment to user
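The steps above can be sketched as a toy seed-and-verify aligner. This is a simplification, not any particular aligner's method: it uses a single spaced seed taken from the start of the read rather than pairs of seeds, and the mask `1101`, the function names, and the mismatch limit are all assumptions for illustration.

```python
from collections import defaultdict

MASK = "1101"  # hypothetical seed shape: 1 = position must match, 0 = ignored


def spaced_key(window, mask=MASK):
    """Keep only the bases at the positions the mask marks with 1."""
    return "".join(base for base, m in zip(window, mask) if m == "1")


def build_index(reference, mask=MASK):
    """Hash table from spaced-seed key to every reference position producing it."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(mask) + 1):
        index[spaced_key(reference[pos:pos + len(mask)], mask)].append(pos)
    return index


def align(read, reference, index, mask=MASK, max_mismatches=1):
    """Look up the read's seed, then confirm the remaining positions."""
    hits = []
    for pos in index.get(spaced_key(read[:len(mask)], mask), []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTTGCAACGTAGGC"
index = build_index(reference)
print(align("TTCCAAC", reference, index))
```

Because the third seed position is ignored, the read is still found even though it carries a mismatch inside the seed window.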
Burrows-Wheeler <br />
transform <br />
• Initially developed for data <br />
compression <br />
• Store entire BW-transformed <br />
reference in memory
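For intuition, the transform itself can be written naively by sorting all rotations of the text. This sketch is quadratic in memory and only suitable for toy strings; real aligners build the transform through suffix arrays and add an FM-index on top. `$` is an assumed terminator that sorts before the alphabet.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Rebuild the text by repeatedly prepending the transform and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))
print(inverse_bwt(bwt("GATTACA")))
```

The transform is reversible, which is what lets an aligner keep only the transformed reference in memory.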
Ambiguous reads <br />
read <br />
reference
Sequence Alignment/Map (SAM) format <br />
The SAM Format Specification (v1.4-r985)
0-based vs. 1-based coordinate systems <br />
System | First base | Interval type | Notation | Meaning | Examples <br />
0-based | 0 | half-open | [a,b) | coordinate does not include b | BAM, BED <br />
1-based | 1 | closed | [a,b] | coordinate includes b | SAM, GTF/GFF, wiggle, VCF
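Converting between the two systems only moves the start coordinate; the end stays the same. A small sketch (helper names are hypothetical):

```python
def zero_based_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based closed [start, end]."""
    return start + 1, end


def one_based_to_zero_based(start, end):
    """1-based closed [start, end] -> 0-based half-open [start, end)."""
    return start - 1, end


def length_zero_based(start, end):
    """Interval length in the 0-based half-open system."""
    return end - start


def length_one_based(start, end):
    """Interval length in the 1-based closed system."""
    return end - start + 1


# The first ten bases of a chromosome in each system.
print(zero_based_to_one_based(0, 10))
print(one_based_to_zero_based(1, 10))
```

Off-by-one errors typically appear when a BED-style interval is written into a GFF-style file without this shift.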
70 short-read aligners <br />
BWA, BarraCUDA, Bfast, BioScope, Bowtie/Bowtie2, CLC bio, CloudBurst, cross_match, <br />
Eland/ELAND2, GEM, GMAP/GSNAP, GenomeMapper, GensearchNGS, GnuMap, Karma, LAST, <br />
MAQ, MOM, Mosaik, MrFast/MrsFast, NovoAlign, PASS, PerM, RMAP, RazerS, SHRiMP, <br />
SHrec, SOAP/SOAP2, SSAHA2, STAR, SToRM, Segemehl, SeqMap, Slider/SliderII, <br />
Srprism, Stampy, Vmatch, ZOOM, rNA, subread
Direct comparison of aligners difficult <br />
• Not all aligners have the same options <br />
• Comparisons are often on the basis of default <br />
parameters <br />
• Synthetic reads often used, which do not possess the <br />
same error profile as ‘real’ reads <br />
• Benchmarking common
Koboldt, MassGenomics <br />
10 aligners compared
Simulated 2 million C. elegans read pairs <br />
• Used MAQ read simulator <br />
• From a ‘mutated’ version of the C. elegans genome <br />
that contained ~90,000 SNPs and ~10,000 indels <br />
• Seq IDs contained ‘real’ read location: <br />
>chrI_7433167_7433425_1/1 <br />
TCGTATTATAACGACGAATACGCGTCAAGTGGGAGT <br />
Koboldt, MassGenomics
Koboldt, MassGenomics <br />
CPU time
cross_match & NovoAlign PE place the <br />
highest number of reads correctly <br />
Koboldt, MassGenomics
SNPs have a greater influence <br />
than indels on accuracy <br />
Koboldt, MassGenomics
Repeated with real data from the 1000 <br />
Genomes Project <br />
Koboldt, MassGenomics
Read Placement Results for ~2 Million <br />
Reads <br />
Koboldt, MassGenomics
Top Performing Aligners <br />
Alignment Need | Recommendation <br />
Proven Functionality | Maq <br />
Speed | Bowtie, SOAP <br />
Sensitivity | Novoalign, cross_match <br />
Usability | CLC bio, SeqMap <br />
Flexibility | BFAST <br />
Variant Detection | SOAP, Novoalign <br />
Koboldt, MassGenomics
Mark Watson, Roslin Institute <br />
The ‘Watson square’
Bioo Scientific Corp. <br />
Whole genome sequencing
Fragmentation method biases coverage
Read depth as a function of <br />
fragmentation method
Commonly Used Framework for <br />
Calling Mutations – the GATK <br />
Genotype Calling - GATK
Genotype calling - Illumina <br />
• CASAVA (ELANDv2, multi-seeded aligner): <br />
• SNVs <br />
• Base calls are ignored where more than 2 mismatches to the reference <br />
sequence occur within 20 bases of the call. Note that this filter treats each <br />
insertion or deletion as a single mismatch. <br />
• If the call occurs within the first or last 20 bases of a read then the mismatch <br />
limit is applied to the 41 base window at the corresponding end of the read. <br />
• The mismatch limit is applied to the entire read when the read length is 41 or <br />
shorter. <br />
• Indels: callSmallVariants module (local realignment) <br />
• GROUPER (large indels, CNVs) <br />
• Translocations/Duplications: <br />
• ClusterMerger (chimeric read pair filter) <br />
• ReadBroker (builds evidence by merging clusters) <br />
• AlignContig (contig re-alignment for re-arrangements)
GATK vs. Illumina
SNP calling – best practices <br />
1) Sufficient coverage (20x) <br />
2) Reads from forward and reverse strands <br />
3) Variable start sites <br />
4) ? High Q-scores ? <br />
AGCTTTCGTACGATACCCATGACTATACTA
ChIP sequencing
Motif finding in tag enriched <br />
regions
Two fundamental types of peaks <br />
Kidder et al., Nature Immunology, 12:918-922 (2011)
ChIP-seq peak calling programs <br />
Wilbanks et al., PLoS One, 5(7):e11471 (2010)
Different number of peaks identified <br />
Wilbanks et al., PLoS One, 5(7):e11471 (2010)
Wilbanks et al., PLoS One, 5(7):e11471 (2010) <br />
Sensitivity assessment
Lister et al., Genome Research, 19:959-966 (2009) <br />
Methylation sequencing
Bisulfite conversion <br />
Conversion yields up to four <br />
potentially different DNA fragments <br />
Krueger et al., Nature Methods, 9:145-151 (2012)
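Why four fragments: bisulfite converts unmethylated C to T on each strand independently, so the two converted strands are no longer complementary, and amplification adds a complement for each. A minimal in-silico sketch, assuming complete conversion and using hypothetical function names and a made-up six-base example:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def bisulfite_convert(strand, methylated=frozenset()):
    """Convert unmethylated C to T; positions listed in `methylated` are protected."""
    return "".join("T" if base == "C" and i not in methylated else base
                   for i, base in enumerate(strand))


def four_fragments(top, methylated_top=frozenset(), methylated_bottom=frozenset()):
    """Return the four strands that can result from converting and amplifying a duplex."""
    bottom = reverse_complement(top)
    converted_top = bisulfite_convert(top, methylated_top)
    converted_bottom = bisulfite_convert(bottom, methylated_bottom)
    return {
        "original top": converted_top,
        "complementary to original top": reverse_complement(converted_top),
        "original bottom": converted_bottom,
        "complementary to original bottom": reverse_complement(converted_bottom),
    }


# Position 1 of the top strand (a CpG cytosine) is methylated in this toy example.
for name, seq in four_fragments("ACGTCG", methylated_top={1}).items():
    print(f"{name}: {seq}")
```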
Krueger & Andrews, Bioinformatics, 27(11):1571-1572 (2011) <br />
Bismark for mapping
Akalin et al., Genome Biology, 13:R87 (2012) <br />
methylKit for the analysis
Transcriptome sequencing <br />
a.k.a. RNA-seq <br />
Image reproduced by permission of Dr. Anthony Fejes
Definitions <br />
• RNA = class of nucleic acids characterized by the <br />
sugar ribose and the pyrimidine uracil; involved in <br />
protein synthesis and in the transmission of genetic <br />
information <br />
• Transcriptome = complete set of RNA molecules <br />
produced in one or a population of cells, and their <br />
quantity, for a specific developmental stage or <br />
physiological condition <br />
• RNA-seq = sequencing of the transcriptome
Technique workflow <br />
RNA isolation <br />
• phenol-chloroform extraction <br />
• silica membrane <br />
RNA fractionation (optional) <br />
• poly(A)+ <br />
• ribo-minus <br />
• DSN <br />
Library preparation <br />
• fragmentation <br />
• directional <br />
Sequencing <br />
• Illumina <br />
• SOLiD <br />
• Helicos <br />
Analysis <br />
• reference-based <br />
• assembly
Most typical RNA-seq <br />
RNA molecules <br />
1. fragmentation of RNA (RNA fragments) <br />
2. random priming to make sscDNA (first-strand synthesis) <br />
3. construction of dscDNA (second-strand synthesis) <br />
4. size selection (gel cutout between short and long fragments) <br />
5. sequencing (paired-end read) <br />
6. mapping (sense and anti-sense reads against the RNA sequence) <br />
Roberts et al., Genome Biology, 12(3):R22 (2011)
Major classes of RNA <br />
Type (% in cell) | Example | Function | Size (nt) <br />
Transfer RNA (15%) | tRNA ala | Transferring of alanine during protein synthesis. | 73 <br />
Transfer RNA (15%) | tRNA leu | Transferring of leucine during protein synthesis. | 83 <br />
Ribosomal RNA (80%) | 5S rRNA | Components of a ribosome in eukaryotes | 121 <br />
Ribosomal RNA (80%) | 5.8S rRNA | Components of a ribosome in eukaryotes | 156 <br />
Ribosomal RNA (80%) | 18S rRNA | Components of a ribosome in eukaryotes | 1869 <br />
Ribosomal RNA (80%) | 28S rRNA | Components of a ribosome in eukaryotes | 5070 <br />
Messenger RNA (5%) | GAPDH | glyceraldehyde-3-phosphate dehydrogenase, var. 1, mRNA | 1401 <br />
Messenger RNA (5%) | ACTB | beta actin, mRNA | 1852 <br />
Lowe et al., Nucleic Acids Research 25(5), 955-964 (1997) <br />
NCBI Reference Sequences: NR_023379.1, NR_003285.2, NR_003286.2, NR_003287.2 <br />
NCBI RefSeqGene IDs: 2597 (NM_002046.4), 60 (NM_001101.3) <br />
Lodish et al., Molecular Cell Biology. 4th ed. New York: W. H. Freeman (2000)
Composition of RNA depends upon method of preparation <br />
In view of the apparent dependence of the composition <br />
of ribonucleic acids upon the methods of preparation as <br />
well as the certain considerations appear necessary for a <br />
proper assessment of physical and chemical descriptions <br />
of ribonucleic acid preparations: (a) The preparative <br />
procedure and source, (b) a statement of the physical-chemical <br />
homogeneity of the ribonucleic acid, (c) a <br />
description of the degradation of the ribonucleic acid by <br />
ribonuclease, and (d) a statement of the mononucleotide <br />
composition of the ribonucleic acid. It is with reference <br />
to these considerations that mammalian ribonucleic <br />
acids prepared by the guanidine salt procedure <br />
hereinafter described have been studied. <br />
Volkin & Carter, J. of American Chemical Society 73(4), 1516-1519 (1951)
Different extraction techniques <br />
GT:phenol:chloroform <br />
extraction <br />
Silica membrane
Silica membrane enriches for RNA molecules > 200 nt <br />
Mraz et al., Biochem Biophys Res Commun., 390(1): 1-4 (2009) <br />
Haimov-Kochman et al., Clin Chem., 52(1):159-60 (2006)
“RNA was extracted from the cell lines using standard phenol:chloroform <br />
extraction followed by ethanol precipitation” <br />
Kedzierski & Porter, BioTechniques 10(2), 210-214 (1991)
Zeugin JA & Hartley JL, Focus 7(4), 1-4 (1985) <br />
Ethanol Precipitation
Schroeder et al., BMC Mol Biol. 7(3) (2006) <br />
RNA integrity
Genebody coverage across degraded poly(A)+ samples <br />
[Plot: coverage (%) along the gene body, 5' to 3', for samples RIN0, RIN3, RIN6 and RIN9]
Removal of ribosomal RNA <br />
Indirect <br />
Direct <br />
Credit: Invitrogen
Gene-expression comparison of protein-coding genes <br />
Cui et al., Genomics 96(5), 259-265 (2010)
Abundance of non-coding transcripts <br />
Indirect <br />
Direct <br />
Cui et al., Genomics 96(5), 259-265 (2010)
Capturing stem-loop mRNA <br />
[Bar plot: log2(gene count + 1), axis 0 to 20, for ribo- and poly(A)+ libraries across histone genes of the HIST1, HIST2, HIST3 and HIST4 clusters]
Poly(A)+ vs. ribo-: genebody coverage (RIN ~ 2) <br />
[Plot: coverage (%) along the gene body, 5' to 3', for ribo- and poly(A)+ libraries]
Bogdanova et al., Molecular BioSystems, 4: 205–212 (2008) <br />
Duplex-specific nuclease
Yi et al., Nucleic Acids Research, 39(20):e140 (2011) <br />
DSN effective at removing rRNA
Abundant transcript measurements could be <br />
affected by DSN treatment <br />
Yi et al., Nucleic Acids Research, 39(20):e140 (2011)
Divalent cations promote RNA degradation <br />
Mg²⁺ <br />
Silverman, Nucleic Acids Research, 33(19): 6151–6163 (2005)
Fragmentation of oligo-dT primed cDNA is more <br />
biased towards the 3' end of the transcript <br />
Wang et al., Nature Reviews Genetics 10, 57-63 (2009)
Not-so-random ‘random’ hexamers <br />
[Plot: nucleotide frequency (0.0 to 0.5) of A, T, G, C and N by read position (0 to 100) for library SEQC-ILM-NVS-A-1_AD0902ACXX_ATCACG_L01]
Positional bias caused by choice of primers <br />
Hansen et al., Nucleic Acids Research, 38(12):e131 (2010)
Levin et al., Nature Methods 7, 709–715 (2010) <br />
Methods for strand-specific RNA-seq
Quality assessment <br />
Levin et al., Nature Methods 7, 709–715 (2010)
dUTP most correlated with arrays <br />
Levin et al., Nature Methods 7, 709–715 (2010)
Decision Tree <br />
A guide to your analytical approach. <br />
1. Did you QC the data? <br />
• no: QC the data (RSeQC, FastQC, FASTX-toolkit) <br />
• yes: continue <br />
2. Do you have a reference genome? <br />
• no: assemble the data (Velvet/Oases, Trans-ABySS, Trinity) <br />
• yes: continue <br />
3. Are you interested in novel genes/junctions/isoforms? <br />
• yes: align to genome with a de novo splice aligner (TopHat, STAR, SoapSplice) <br />
• no: continue <br />
4. Do you just want gene expression values? <br />
• no: align to genome (TopHat with GTF) <br />
• yes: align to transcriptome (rSeq)
Trapnell et al., Bioinformatics, 25(9):1105-1111 (2009) <br />
The TopHat pipeline
Dobin et al., Bioinformatics, 29(1):15-21 (2013) <br />
STAR
STAR vs. TopHat
STAR vs. TopHat <br />
• STAR maps 17% more reads <br />
than TopHat <br />
• Of the reads mapped in <br />
common, 80% concordance <br />
• Differences often due to 1-offs
Martin et al., Nature Reviews Genetics, 12:671-682 (2011) <br />
Reference-based transcriptome assembly
Martin et al., Nature Reviews Genetics, 12:671-682 (2011) <br />
De novo transcriptome assembly
TopHat manual <br />
Read the docs!
Methods of quantifying gene expression <br />
Wilhelm et al., Methods 48(3), 249-257 (2009)
Credit: Simon Anders (HTSeq) <br />
Overlap resolution modes
Accurate quantification requires <br />
significant depth <br />
Toung et al., Genome Res., 21(6):991-998 (2011)
Reads per kilobase of exon model per million mapped reads <br />
Assumption: the sensitivity of RNA-Seq will be a function of both molar <br />
concentration and transcript length <br />
RPKM = 10^9 × C / (N × L), where C = number of mappable reads that fell onto the gene's exons, N <br />
= total number of mappable reads in the experiment, and L is the sum of the <br />
exons in base pairs. <br />
“When these RNA standards are used in conjunction with information on <br />
cellular RNA content, absolute transcript levels per cell can also be calculated. <br />
For example, on the basis of literature values for the mRNA content of a liver <br />
cell and the RNA standards, we estimated that 3 RPKM corresponds to about <br />
one transcript per liver cell. For C2C12 tissue culture cells, for which we know <br />
the starting cell number and RNA preparation yields needed to make the <br />
calculation, a transcript of 1 RPKM corresponds to approximately one <br />
transcript per cell.” <br />
Mortazavi et al., Nature Methods, 5: 621-628 (2008)
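As a worked example of the definition, RPKM scales a gene's read count by exon length in kilobases and by library size in millions, i.e. 10^9 × C / (N × L). A minimal sketch; the function name and the example numbers are made up for illustration.

```python
def rpkm(gene_reads, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon model per million mapped reads."""
    return 1e9 * gene_reads / (total_mapped_reads * exon_length_bp)


# Hypothetical gene: 500 reads on 2,000 bp of exons in a library of 20 million mapped reads.
print(rpkm(500, 20_000_000, 2_000))
```

Doubling either the exon length or the library size, with the read count unchanged, halves the value.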
More sophisticated normalization methods are <br />
necessary <br />
• Imagine we sequenced two RNA populations, <br />
A and B. <br />
• Suppose every gene that is expressed in B is <br />
expressed in A with the same number of <br />
transcripts. <br />
• Assume that sample A also contains a set of <br />
genes equal in number and expression that <br />
are not expressed in B. Thus, sample A has <br />
twice as many total expressed genes as <br />
sample B, that is, its RNA production is twice <br />
the size of sample B. <br />
• Suppose that each sample is then sequenced <br />
to the same depth. Without any additional <br />
adjustment, a gene expressed in both <br />
samples will have, on average, half the <br />
number of reads from sample A, since the <br />
reads are spread over twice as many genes. <br />
A: R(green) = 1/4 = 0.25 <br />
B: R(green) = 2/4 = 0.5 <br />
Robinson & Oshlack, Genome Biology, 11(3): R25 (2010)
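The thought experiment above can be reproduced numerically. This sketch uses made-up transcript counts (100 per gene) and hypothetical gene names; it only illustrates why equal sequencing depth halves the read share of a shared gene in sample A.

```python
def read_fractions(expression):
    """Expected share of reads per gene when reads are sampled in proportion to abundance."""
    total = sum(expression.values())
    return {gene: count / total for gene, count in expression.items()}


# Genes 1-4 are expressed equally in both samples; genes 5-8 only in sample A.
shared = {f"gene{i}": 100 for i in range(1, 5)}
only_a = {f"gene{i}": 100 for i in range(5, 9)}
sample_b = dict(shared)
sample_a = {**shared, **only_a}

frac_a = read_fractions(sample_a)
frac_b = read_fractions(sample_b)
print(frac_a["gene1"], frac_b["gene1"], frac_a["gene1"] / frac_b["gene1"])
```

Scaling by total reads alone would therefore report gene1 as two-fold lower in A although its true expression is identical, which is the motivation for composition-aware methods such as TMM.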
Normalization procedure has a profound influence on <br />
which genes are detected as differentially expressed <br />
Hoffman et al., Genome Biology, 3(7):research0033.1-‐research0033.11 (2002)
Different normalization methods <br />
1. Total count (TC): Gene counts are divided by the total number of mapped reads (or library size) <br />
associated with their lane and multiplied by the mean total count across all the samples of the <br />
dataset. <br />
2. Upper Quartile (UQ): Very similar in principle to TC, the total counts are replaced by the upper <br />
quartile of counts different from 0 in the computation of the normalization factors. <br />
3. Median (Med): Also similar to TC, the total counts are replaced by the median counts different <br />
from 0 in the computation of the normalization factors. <br />
4. DESeq: This normalization method is included in the DESeq Bioconductor package (version <br />
1.6.0) and is based on the hypothesis that most genes are not DE. <br />
5. Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR <br />
Bioconductor package (version 2.4.0). It is also based on the hypothesis that most genes are not <br />
DE. <br />
6. Quantile (Q): First proposed in the context of microarray data, this normalization method <br />
consists in matching distributions of gene counts across lanes. <br />
7. Reads Per Kilobase per Million mapped reads (RPKM): This approach was initially introduced to <br />
facilitate comparisons between genes within a sample and combines between- and within-sample <br />
normalization.
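The first three methods differ only in the per-sample summary used as the scaling factor, which a short sketch makes concrete. This follows the descriptions above rather than any package's implementation; the quartile definition used by `statistics.quantiles` may differ from that of R, and the sample counts are made up.

```python
from statistics import mean, median, quantiles


def nonzero(counts):
    return [c for c in counts if c > 0]


# One summary statistic per method; each is computed per sample (lane).
SUMMARIES = {
    "TC": sum,
    "UQ": lambda counts: quantiles(nonzero(counts), n=4)[2],
    "Med": lambda counts: median(nonzero(counts)),
}


def scale_factors(samples, summary):
    """Per-sample summary divided by the mean summary across all samples."""
    per_sample = {name: summary(counts) for name, counts in samples.items()}
    overall = mean(list(per_sample.values()))
    return {name: value / overall for name, value in per_sample.items()}


def normalize(samples, method):
    """Divide each sample's counts by its scale factor for the chosen method."""
    factors = scale_factors(samples, SUMMARIES[method])
    return {name: [c / factors[name] for c in counts]
            for name, counts in samples.items()}


# Hypothetical lanes: lane2 is the same library sequenced twice as deeply.
samples = {"lane1": [10, 0, 20, 30, 40], "lane2": [20, 0, 40, 60, 80]}
for method in SUMMARIES:
    print(method, normalize(samples, method))
```

DESeq, TMM and quantile normalization need more machinery and are best taken from their Bioconductor packages.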
Dillies et al., Brief Bioinformatics, Epub (2012) <br />
Comparison of normalization <br />
methods
DEGs in common for each of the normalization <br />
methods <br />
Dillies et al., Brief Bioinformatics, Epub (2012)
Summary of the normalization methods <br />
Dillies et al., Brief Bioinformatics, Epub (2012)
GAPDH isn’t much of a ‘housekeeper’ <br />
Barber et al., Physiol Genomics, 21:389-395 (2005)
Soneson & Delorenzi, BMC Bioinformatics, 14:91 (2013) <br />
DEG dependent upon package
‘The property of being one and the property of <br />
being many are contraries.’ <br />
RNA-SEQ <br />
Protocol-1 Protocol-2 Protocol-n
Broadly useful Linux tools / shell features <br />
Linux tools <br />
• awk - pattern scanning and processing language <br />
• sed - stream editor for filtering and transforming text <br />
• tr - translate or delete characters <br />
• diff - find differences between two files <br />
• comm - compare two sorted files line by line <br />
• grep - print lines matching a pattern <br />
• cut - remove sections from each line of files <br />
• rename - rename files <br />
• uniq - report or omit repeated lines <br />
• sort - sort lines of text files <br />
• parallel - execute shell commands in parallel <br />
Shell features <br />
• process substitution <br />
• comm -12
Script limitations <br />
1) Linear execution of commands <br />
2) Aborted scripts leave truncated files <br />
3) No convenient way to resume a script <br />
4) Poor audit trail
Structure of a makefile <br />
General structure <br />
target …: dependency … <br />
	command <br />
	… <br />
	… <br />
Example <br />
%.sam: %.fastq.gz <br />
	bowtie --sam reference $< $@ <br />
www.gnu.org/software/make
Maintain a bioinformatics journal <br />
The problem <br />
Paul: “I was just playing around <br />
with the data and ….” <br />
… 8 months later: <br />
Chris: “Do you remember when <br />
you sent me …? Can we do that <br />
again on …?” <br />
Paul: <br />
The solution <br />
• Maintain a journal <br />
• Record where and when you <br />
retrieved the data <br />
• annotation databases, e.g., <br />
update often <br />
• Describe actions in enough detail <br />
that others can replicate the <br />
result using the same steps <br />
• If you think it’s trivial, log it <br />
anyway: <br />
• Removed a header? Log it. <br />
• Sorted a file? Log it. <br />
• Created an alignment index? Log <br />
it. <br />
• Fooling around? Log it.
More basic research to be done
Empirical dialectical method <br />
p <br />
negation <br />
-p <br />
empiricism <br />
q
Closing <br />
Knowledge is not a series of self-consistent theories <br />
that converges toward an ideal view; it is rather an <br />
ever increasing ocean of mutually incompatible <br />
(and perhaps even incommensurable) alternatives, <br />
each single theory, each fairy tale, each myth that is <br />
part of the collection forcing the others into greater <br />
articulation and all of them contributing, via this <br />
process of competition, to the development of our <br />
consciousness. <br />
Paul Feyerabend