10.07.2015 Views

Transcript - silico.biotoul.frsilico.biotoul.fr

Transcript - silico.biotoul.frsilico.biotoul.fr

Transcript - silico.biotoul.frsilico.biotoul.fr

SHOW MORE
SHOW LESS
  • No tags were found...

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

RNA-Seq data analysisChristophe Klopp1


Summary●What is transcription?●How variable is it?●Spliced alignment●<strong>Transcript</strong> calling●Visualization in IGV2


<strong>Transcript</strong>ion<strong>Transcript</strong>ion is the process of creating a complementary RNAcopy of a sequence of DNA. <strong>Transcript</strong>ion is the first step leadingto gene expression.http://en.wikipedia.org/wiki/<strong>Transcript</strong>ion_(genetics)3


<strong>Transcript</strong>ion productsProtein coding gene: transcribed in mRNAncRNA : highly abundant and functionally important RNA• tRNA,• rRNA,• snoRNAs,• microRNAs,• siRNAs,• piRNAs....http://en.wikipedia.org/wiki/User:Amarchais/RsaOG_RNA4


Isoforms / transcriptsIsoform : produced by alternative splicingAlternative splicing occurs as a normal phenomenon ineukaryotes, where it greatly increases the diversity of proteinsthat can be encoded by the genome in humans, ~95% ofmultiexonic genes are alternatively spliced5


Alternative splicingAlternative splicing (or differential splicing) isa process by which the exons of the RNAproduced by transcription of a gene (aprimary gene transcript or pre-mRNA) arereconnected in multiple ways during RNAsplicing. The resulting different mRNAs maybe translated into different protein isoforms;thus, a single gene may code for multipleproteins.Post-transcriptional modification is aprocess in cell biology by which, in eukaryoticcells, primary transcript RNA is converted intomature RNA. A notable example is theconversion of precursor messenger RNA intomature messenger RNA (mRNA), whichincludes splicing and occurs prior to proteinsynthesis.http://en.wikipedia.org/wiki/Alternative_splicing6


VocabularyGene : functional units of DNA that contain the instructions forgenerating a functional product.UTRIntron Intron UTRPromoter regionTSSExon1Exon2Exon3Exon : coding region of mRNA included in the transcriptIntron : non coding regionTSS : <strong>Transcript</strong>ion Start Site ≠ 1 st amino acid<strong>Transcript</strong> : stretch of DNA transcribed into an RNA molecule7


<strong>Transcript</strong> degradationAfter export to the cytoplasm, mRNA is protected <strong>fr</strong>om degradation by a5’ cap structure and a 3’ poly adenine tail. In the deadenylationdependent mRNA decay pathway, the polyA tail is graduallyshortened by exonucleases. This ultimately attracts the degradationmachinery that rapidly degrades the mRNA in both in the 5’ to 3’direction and in the 3’ to 5’ direction. Additional mechanisms, includingthe nonsense mediated decay pathway, bypass the need fordeadenylation and can remove the mRNA <strong>fr</strong>om the transcriptionalpool independently. Interestingly, the same enzymes are responsiblefor the actual degradation of the mRNA independent of the pathwaytaken (see figure).http://www.eb.tuebingen.mpg.de/research-groups/remco-sprangers8


Cis-natural antisense transcript• Natural antisense transcripts (NATs) are a group of RNAs encodedwithin a cell that have transcript complementarity to other RNAtranscripts.http://en.wikipedia.org/wiki/Cis-natural_antisense_transcript9


Fusion genes• A fusion gene is a hybrid gene formed <strong>fr</strong>om two previouslyseparate genes. It can occur as the result of a translocation,interstitial deletion, or chromosomal inversion. Often, fusion genesare oncogenes.• They often come <strong>fr</strong>om trans-splicing : Trans-splicing is a specialform of RNA processing in eukaryotes where exons <strong>fr</strong>om twodifferent primary RNA transcripts are joined end to end andligated.http://en.wikipedia.org/wiki/Fusion_genehttp://en.wikipedia.org/wiki/Trans-splicing10


<strong>Transcript</strong>ome variability• Number of transcripts⁕possible variation factor between transcripts: 10 6 or more,⁕expression variation between samples.• Many types of transcripts⁕mRNA, ncRNA,...• Isoforms (with non canonical splice sites)• Intron retention⁕The splicing is not always completed⁕Is a new isoform or a transcription error• <strong>Transcript</strong> decay (degradation)• Allele specific expression11http://www.nature.com/emboj/journal/v25/n5/fig_tab/7601023a_F2.html


What is RNA-Seq ?• use of high-throughput sequencing technologies tosequence cDNA in order to get information about a sample'sRNA content• Thanks to the deep coverage and base level resolution providedby next-generation sequencing instruments, RNA-seqprovides researchers with efficient ways to measuretranscriptome data experimentallyhttp://en.wikipedia.org/wiki/RNA-Seq12


What is different with RNA-Seq ?• No prior knowledge of sequence needed• Specificity of what is measured• Increased dynamic range of measure, moresensitive detection• Direct quantification• Good reproducibility• Different levels : genes, transcripts, allelespecificity, structure variations• New feature discovery: transcripts,isoforms, ncRNA, structures (fusion...)• Possible detection of SNPs, ...13


What are we looking for?Identify genes• List new genesIdentify transcripts• List new alternative splice formsQuantify these elements → differential expression14


Analysis workflowDataDataqualityqualitycontrolcontrolSpliced mappingGene and transcript discoveryQuantification15


Hexamer random priming bias• There is a strong distinctive patternin the nucleotide <strong>fr</strong>equencies ofthe first 13 positions at the 50-endof mapped RNA-Seq reads:⁕sequence specificity ofthe polymerase⁕due to the end repairperformed• Reads beginning with a hexamer over-represented in thehexamer distribution at the beginning relative to theend are down-weighted16


Hexamer random priming biasPer base sequence content :Proportion of each base at each positionIn a random library : no differences,parallel linesdifference between A and T, or G and C is greater than 10%in any positiondifference between A and T, or G and C is greater than 20%in any position17


Analyse workflowData quality controlSplicedSplicedmappingmappingGene and transcript discoveryQuantification18


Where to find a reference genome?Fasta fileRetrieving the genome file:• The Genome Reference Consortiumhttp://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/• ! NCBI chromosome naming with « | » not wellsupported by mapping software• Prefer EMBL:http://www.ensembl.org/info/data/ftp/index.htmlThe chromosome name should be the same in the gtffile and fasta file19


Reference transcriptome fileWhat is a GTF file ?:• derived <strong>fr</strong>om GFF (General Feature Format, for description ofgenes and other features)• Gene Transfer Format:http://genome.ucsc.edu/FAQ/FAQformat.html#format4 [attributes] [comments]The [attribute] list must begin with:gene_id value : unique identifier for the genomic source of the sequence.transcript_id value : unique identifier for the predicted transcript.20


Splice sites• Canonical splice site:which accounts for more than 99% of splicingGT and AG for donor and acceptor sites• Non-canonical site:GC-AG splice site pairs, AT-AC pairshttp://en.wikipedia.org/wiki/RNA_splicing• Trans-splicing :splicing that joins two exons that are not within thesame RNA transcript21


Spliced alignment• The recognition of exon/intron junctions can be inferred<strong>fr</strong>om the reads that overlap the splicing sites. Theresulting spliced reads can produce very shortalignments, part of the read will not map contiguouslyto the reference.→ therefore this approach requires a dedicatedalgorithm• Generation :⁕Sim4⁕Seqanswer : http://seqanswers.com/wiki/Software/list• Idea :⁕Database of potential splice junction sequences (known)⁕splice canonical / non canonical site search (seed thenmapping)22


Alignment ToolsTools for splice-mapping:- Tophat- Mapsplice:http://www.netlab.uky.edu/p/bioinfo/MapSplice23


TopHathttp://tophat.cbcb.umd.edu/• Aligns RNA-Seq reads to a reference genomewith Bowtie• splice junction mapper for reads withoutknowledges• identify splice junctions between exons.24


TopHat• TopHat finds junctions by mappingreads to the reference:⁕ all reads are mapped to thereference genome using Bowtie⁕ reads not mapped to the genome are set aside asIUM (initially unmapped)⁕low complexity reads are discarded⁕for each read : allow until 10 alignments25Trapnell C et al. Bioinformatics 2009;25:1105-1111


TopHat• TopHat then assembles themapped reads• Define island: aggregates mappedreads in islands of candidate exons⁕Generate potentialdonor/acceptor splice sitesusing neighbouring exons• Extend islands to cover eventually splice junctions⁕+/‐45 bp <strong>fr</strong>om reference on either side of island26Trapnell C et al. Bioinformatics 2009;25:1105-1111


TopHatTo map reads to splice junction :• Enumerate all canonical donor andacceptor sites in islands⁕long (>= 75 bp) reads:"GT‐ AG","GC‐ AG" and "AT‐ AC" introns⁕Shorter reads:only "GT‐ AG" introns• Find all pairings which produceGT‐AG introns between islands⁕ 70 bp < Intron size < 20,000 bp27Trapnell C et al. Bioinformatics 2009;25:1105-1111


TopHat• Each possible intron is checkedagainst the IUM_→ seed and extend alignment28Trapnell C et al. Bioinformatics 2009;25:1105-1111


TopHat InputsInputs :• Bowtie index of the genomeftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/http://bowtie-bio.sourceforge.net/index.shtml• file fasta (.fa) of the reference or will be build by bowtie,in the index directory• File fastq of the reads! the GTF file and the Bowtie index should have same nameof chromosome or contigCommand line :tophat [options] 29


TopHat OptionsYour own junctions :-G/--GTF -j/--raw-juncs --no-novel-juncs (ignored without -G/-j)Your own insertions/deletions:--insertions/--deletions --no-novel-indels31


TopHat OutputsOutputs :⁕accepted_hits.bam : list of read alignments in SAMformat compressed⁕Junctions.bed : track of junctions,scores : number of alignments spanning thejunction⁕Insertions.bed and deletions.bed : tracks ofinsertions and deletions⁕Logs files32


TopHat Alignment Strategies• Default options :⁕No reference genome• Alignment with reference genome without discovery:⁕Options -G and –no-novel-juncs• Alignment with reference genome and novels transcriptsdiscovery:⁕options33


Sequence alignment and map• SAM (Sequence Alignment/Map) format:⁕Capture all of the critical information about NGS data in asingle indexed and compressed file⁕Sharing : data across and tools⁕Generic alignment format⁕SAMTOOLS: provide variousutilities for manipulating alignments in the SAM format:sorting,merging, indexing...http://samtools.sourceforge.net/http://picard.sourceforge.net/explain-flags.html34


Spliced cigar line• Extend CIGAR strings• Example: intron de 81 basesERR022486.8388510 81 22 32099 255 58M81N18M = 27484 -4772CCTTGGTCTTGCCGAAGTAGATCTCATTGAGAGTGGAGCGGATCTTGTTCTCCATTTCCTCCACCAGGCGTCCGAT :9=>>??GGGG?GGGGGGGG NM:i:0 XS:A:- NH:i:135


Visualizing alignments on IGVhttp://www.broadinstitute.org/igv/home36


Visualizing alignments on IGV• Import your BAM Files37


Visualizing alignments on IGV• Exemple of visualisation of bam and bed files38


ExercisesVisualize transcriptomic data in IGVhttp://bioinfo.genotoul.<strong>fr</strong>/index.php?id=16039


Analyse workflowData quality controlSpliced mappingGeneGeneandandtranscripttranscriptdiscoverydiscoveryQuantificationQuantification40


<strong>Transcript</strong> reconstructionWhat mapping give us?• Bam file with splicing localisation and pairing readsWhat can we do then?• <strong>Transcript</strong> assembly : structure determination• Novel transcript discovery• quantificationhttp://www.fml.tuebingen.mpg.de/raetsch/lectures/RECOMB-mTiM.pdf41


Cufflinks in generalhttp://cufflinks.cbcb.umd.edu/• assembles transcripts• estimates their abundances : based on how many readssupport each one• tests for differential expression in RNA-Seq samples42


Cufflinks tools• Cufflinks (1 bam at the time)⁕Assembling : gene and transcript• Cuffdiff (2 bam together)⁕Differential expressionof⁕Differential promoter usage, splicing• Cuffcompare: (n cufflinks results)⁕Comparing assembled transcripts to referenceannotation⁕Comparing transcripts across time points• Cuffmerge : (n cufflinks results)⁕Merge together several cufflinks assemblies⁕Run Cuffcompare43


Trapnell C et al. Nature Biotechnology 2010;28:511-515Cufflinks transcript assembly• <strong>Transcript</strong>s assembly :⁕Fragments are divided into nonoverlappingloci⁕each locus is assembledindependently :• Cufflinks assembler⁕find the mini nb of transcripts thatexplain the reads⁕find a minimum path cover( Dilworth's theorem) :– nb incompatible read = mini nbof transcripts needed– each path = set of mutuallycompatible <strong>fr</strong>agmentsoverlapping each other44


Trapnell C et al. Nature Biotechnology 2010;28:511-515Cufflinks transcript assembly• <strong>Transcript</strong>s assembly :⁕Identification incompatibles<strong>fr</strong>agments: distinct isoforms⁕Compatibles <strong>fr</strong>agmentsare connected: graphe construction45


Cufflinks expression measurement• <strong>Transcript</strong>s abundances:⁕statistical model to derive a likelihood functionfor the abundances⁕bayesian inference for isoform with lowerabundances, or nb of read very small→ can estimate the abundances of isoforms• Fragment length distribution :⁕single-end read : no way, use approximateGaussian distribution⁕paired-end reads +assembly : learn thedistribution <strong>fr</strong>om reads mapped to singleisoform genes⁕paired-end – no assembly : genomic length ofde paired-end reads (without splice) orGaussian approximation46Trapnell C et al. Nature Biotechnology 2010;28:511-515


Cufflinks read attribution• Violet <strong>fr</strong>agment: <strong>fr</strong>om which transcript?⁕Use of Fragment length distribution47


Trapnell C et al. Nature Biotechnology 2010;28:511-515Cufflinks expression measurement• Fragments attribution• Isoforms abundances estimation:⁕RPKM for single reads⁕FPKM for paired-end reads48


RPKM / FPKM• <strong>Transcript</strong> length bias• RPKM : Reads per kilobase of exon per million mapped reads⁕1kb transcript with 1000 alignments in a sample of 10million reads (out of which 8 million reads can bemapped) will have:RPKM = 1000/(1 * 8) = 125• the transcript length depends on isoform inference• FPKM : for paired-end sequencing⁕A pair of reads constitute one <strong>fr</strong>agment49


Cufflinks inputs and options• Command line:⁕cufflinks [options]* • Some options :-h/--help-o/--output-dir-p/--num-threads-G/--GTF : estimateisoform expression, no assembly novel transcripts-g/--GTF-guide : guideRABT assembly50


Cufflinks RABT assembly option• Some options :-g/--GTF-guide : guideRABT assemblyRoberts A et al. Bioinformatics 2011;27:2325-232951


Cufflinks outputs• transcripts.gtf : contains assembled isoforms(coordinates and abundances)• genes.fpkm_tracking: contains the genes FPKM• isoforms.fpkm_tracking: contains the isoforms FPKM52


Cufflinks GTF description• transcripts.gtf (coordinates and abundances): contains assembledisoforms: can be visualized with a genome viewer⁕GTF format + attributes (ids, FPKM, confidence inteval bounds,depth or read coverage, all introns and exons covered)GTF formatAttributesChr Source Feature Start EndstrandFrameScore:Most abundant isoform = 1000Minor : ratio=minor Fpkm/major FPKMWhether or not all introns andexons were fully covered byReads (with -g)53


ExercisesVisualize cufflinks results in IGV :http://bioinfo.genotoul.<strong>fr</strong>/index.php?id=16054

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!