1 Expression Analysis of RNA-seq Data
Manuel Corpas, Plant and Animal Genomes Project Leader
2 Generation of Sequence
Mapping Reads
Identification of splice junctions
Assembly of Transcripts
Statistical Analysis
  1. Summarization (by exon, by transcript, by gene)
  2. Normalization (within sample and between sample)
  3. Differential expression testing (Poisson test, negative binomial test)
3 The Tuxedo Tools
Developed by Institute of Genetic Medicine at Johns Hopkins University / University of California, Berkeley / Harvard University
157 PubMed citations
TopHat
  Fast short read aligner (Bowtie)
  Spliced read identification (TopHat)
Cufflinks package
  Cufflinks: transcript assembly
  Cuffmerge: merges multiple transcript assemblies
  Cuffcompare: compares transcript assemblies to reference annotation
  Cuffdiff: identifies differentially expressed genes and transcripts
CummeRbund
  Visualisation of differential expression results
4 RNA-seq Experimental design
Sequencing technology (SOLiD, Illumina)
  HiSeq 2000: 150 million read pairs per lane, 100 bp
Single end (SE), paired end (PE), strand specific
  SE: quantification against known genes
  PE: novel transcripts, transcript-level quantification
Read length (50-100 bp)
  Greater read length aids mapping accuracy, splice variant assignment and identification of novel junctions
Number of replicates
  RNA-seq is often noted to have substantially less technical variability
  Biological replicates should be included (at least 3 and preferably more)
Sequencing depth
  Dependent on experimental aims
5 RNA-seq Experimental design
Labaj et al. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics (2011)
Toung et al. RNA-sequence analysis of human B-cells. Genome Research (2011)
Extrapolation of the sigmoid shape suggests 20% of transcripts are not expressed
First saturation effects set in at ~40 million read alignments
~240 million reads achieve 84% transcript recall
[Figure axis labels: Fragments per kilobase of exon per million fragments (FPKM); Reads aligned per kilobases mapped]
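The FPKM unit named on this slide (fragments per kilobase of exon per million fragments) can be written out as plain arithmetic. The sketch below is only the definitional ratio; the function name and the example numbers are illustrative assumptions, and Cufflinks itself estimates abundances with a statistical model rather than this direct division.

```python
def fpkm(fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of exon per million mapped fragments.

    Definitional ratio only (hypothetical helper, not Cufflinks' estimator).
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and total fragments must be positive")
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions


# Illustrative numbers: 500 fragments on a 2 kb transcript
# in a library of 20 million mapped fragments.
example_fpkm = fpkm(fragments=500, exon_length_bp=2_000,
                    total_mapped_fragments=20_000_000)
print(example_fpkm)
```

Dividing by transcript length and by library size is what makes values comparable between long and short transcripts and between deeply and shallowly sequenced samples.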
6 RNA-seq Experimental design
General guide
  Quantify expression of high to moderately expressed known genes
    ~20 million mapped reads, PE, 2 x 50 bp
  Assess expression of alternative splice variants, novel transcripts, and strong quantification including low copy transcripts
    In excess of 50 million reads, PE, 2 x 100 bp
Example
  Examine gene expression in 6 different conditions with 3 biological replicates (18 samples)
  Multiplexing 6 samples per lane on 3 lanes of the HiSeq (50 bp PE)
  Generates ~25 M reads per sample
  Assuming ~80% of reads map/pass additional QC (20 M mapped reads per sample)
  Cost: 3 lanes (£978 x 3) + 18 libraries (£105 x 18), total £4824
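The worked example on this slide can be reproduced as a short calculation. This is a minimal sketch: the function and variable names are hypothetical, and the lane yield of 150 million read pairs is the HiSeq 2000 figure quoted on slide 4.

```python
READ_PAIRS_PER_LANE = 150_000_000  # HiSeq 2000 figure quoted on slide 4


def plan_run(samples: int, samples_per_lane: int, usable_fraction: float,
             lane_cost: int, library_cost: int) -> dict:
    """Estimate lanes, per-sample depth and cost for a multiplexed run."""
    lanes = -(-samples // samples_per_lane)  # ceiling division
    reads_per_sample = READ_PAIRS_PER_LANE // samples_per_lane
    mapped_per_sample = round(reads_per_sample * usable_fraction)
    total_cost = lanes * lane_cost + samples * library_cost
    return {
        "lanes": lanes,
        "reads_per_sample": reads_per_sample,
        "mapped_per_sample": mapped_per_sample,
        "total_cost_gbp": total_cost,
    }


# 6 conditions x 3 biological replicates, 6 samples per lane,
# ~80% of reads mapping or passing QC, prices as given on the slide.
plan = plan_run(samples=18, samples_per_lane=6, usable_fraction=0.8,
                lane_cost=978, library_cost=105)
print(plan)
```

Changing `samples_per_lane` shows the trade-off directly: fewer samples per lane raises depth per sample but also the number of lanes and therefore the cost.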
10 Step 3 Tuxedo Protocol - TOPHAT
[Workflow diagram: paired reads (Read 1, Read 2; FASTQ) from Leaf (SAM1), Leaf (SAM2), Flower (SAM3) and Flower (SAM4), plus GTF + Genome, go into TOPHAT (Read Mapping), which produces Alignments (BAM)]
Non-spliced reads mapped by Bowtie
  Reads mapped directly to transcriptome sequence
Spliced reads identified by TopHat
  Initial mapping used to build a database of spliced junctions
  Input reads split into smaller segments
  Coverage islands
  Paired end reads map to distinct regions
  Segments map in distinct regions
  Long reads (>=75 bp) used to identify GT-AG, GC-AG and AT-AC splicings
Bowtie: extremely efficient program for aligning non-spliced reads; generates a data structure (index) to store the genomic sequence and enable fast searching.
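The three intron motifs named on this slide (GT-AG, GC-AG, AT-AC) are the dinucleotides at the start and end of an intron. The sketch below only illustrates that check on a forward-strand sequence; the function names, the coordinate convention and the toy sequence are assumptions made for illustration, and this is not TopHat's junction-finding code.

```python
# Donor/acceptor dinucleotide pairs listed on the slide.
CANONICAL_SPLICE_SITES = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_motif(genome: str, intron_start: int, intron_end: int) -> tuple:
    """Return (first two bases, last two bases) of a candidate intron.

    Coordinates are 0-based, start inclusive, end exclusive, forward strand.
    """
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry both splice signals")
    return intron[:2], intron[-2:]


def has_known_motif(genome: str, intron_start: int, intron_end: int) -> bool:
    """True if the candidate intron is flanked by one of the listed motifs."""
    return intron_motif(genome, intron_start, intron_end) in CANONICAL_SPLICE_SITES


# Toy sequence: exon (6 bp) + intron (13 bp) + exon (6 bp).
toy_genome = "AAACCC" + "GTAAGTTTTTCAG" + "GGGTTT"
print(intron_motif(toy_genome, 6, 19), has_known_motif(toy_genome, 6, 19))
```

A read segment that aligns on one side of such an intron while its neighbouring segment aligns on the other side is the kind of evidence the slide describes as "segments map in distinct regions".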
13 Step 4 Tuxedo Protocol - CUFFLINKS
[Workflow diagram: paired reads (Read 1, Read 2; FASTQ) from Leaf (SAM1), Leaf (SAM2), Flower (SAM3) and Flower (SAM4), plus GTF + Genome, go into TOPHAT (Read Mapping), which produces Alignments (BAM); the alignments and a GTF go into CUFFLINKS (Transcript Assembly), which produces Assemblies (GTF)]
Accurate quantification of a gene requires identifying which isoform produced each read.
Reference Annotation Based Transcript (RABT) assembly
Sequence bias correction: -b/--frag-bias-correct <genome.fa>
Multi-mapped read correction: -u/--multi-read-correct
Because a sample may contain reads from multiple splice variants for a given gene, Cufflinks must be able to infer the splicing structure of each gene. However, genes sometimes have multiple alternative splicing events, and there may be many possible reconstructions of the gene model that explain the sequencing data. In fact, it is often not obvious how many splice variants of the gene may be present. Thus, Cufflinks reports a parsimonious transcriptome assembly of the data. The algorithm reports as few full-length transcript fragments or ‘transfrags’ as are needed to ‘explain’ all the splicing event outcomes in the input data.
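As a rough illustration of how the options on this slide fit together, the sketch below assembles a Cufflinks command line as a Python list without running it. The helper name and all file names are hypothetical; `-b` and `-u` are the flags given on the slide, while `-g` (reference-guided RABT assembly), `-o` (output directory) and `-p` (threads) are standard Cufflinks options that the slide does not spell out.

```python
from typing import List, Optional


def build_cufflinks_command(bam: str, output_dir: str,
                            genome_fasta: Optional[str] = None,
                            guide_gtf: Optional[str] = None,
                            multi_read_correct: bool = True,
                            threads: int = 1) -> List[str]:
    """Build (but do not run) a Cufflinks invocation for one sample."""
    cmd = ["cufflinks", "-p", str(threads), "-o", output_dir]
    if guide_gtf is not None:
        cmd += ["-g", guide_gtf]      # RABT assembly guided by the annotation
    if genome_fasta is not None:
        cmd += ["-b", genome_fasta]   # sequence (fragment) bias correction
    if multi_read_correct:
        cmd.append("-u")              # multi-mapped read correction
    cmd.append(bam)                   # TopHat alignments for this sample
    return cmd


# Hypothetical file names for one leaf sample.
example_command = build_cufflinks_command(
    bam="leaf_sam1/accepted_hits.bam",
    output_dir="leaf_sam1_cufflinks",
    genome_fasta="genome.fa",
    guide_gtf="genes.gtf",
    threads=8,
)
print(" ".join(example_command))
```

The list form can be handed to `subprocess.run` once the paths point at real files; building it separately makes the chosen options easy to log per sample.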
16 CUFFDIFF - Summarisation
[Figure: exon structures of three transcripts, A, B and C]
A + B + C: grouped at gene level
B + C: grouped at CDS level
A + C: grouped at primary transcript level
A, B, C: no grouping at the transcript level
Cuffdiff output (11 files)
  FPKM tracking files (Transcript, Gene, CDS, Primary transcript)
  Differential expression tests (Transcript, Gene, CDS, Primary transcript)
  Differential splicing tests: splicing.diff
  Differential coding output: cds.diff
  Differential promoter use: promoters.diff
Differential splicing is tested at the primary transcript level: each group of transcripts that share the same TSS (more precisely, that derive from the same pre-mRNA, so the group clusters different splicing isoforms) is tested for a change in the mix of splicing isoforms. The statistical test is based on the Jensen-Shannon divergence, which measures the difference between distributions, so it is sensitive when one or more splicing isoforms make up a larger share of that primary transcript's output in one sample than in the other. The test is not sensitive to differences in the total volume of the primary transcript (use the differential expression tests for that).
Differential CDS output looks at the different coding sequences produced after splicing, i.e. the different combinations of exons; it is a proxy for protein output, but it does not take into account anything after mRNA processing. The test is at the gene level, not at the primary transcript level, so it also factors in alternative TSS usage and alternative promoter usage. If one primary transcript shows differential splicing but does not account for the lion's share of the gene's transcription output, it will scarcely affect the CDS output difference. Transcripts that do not differ in their coding exon sequence but differ in their UTRs will not contribute, as there is no difference in coding sequence. The statistical test is again based on the Jensen-Shannon divergence, so it is not sensitive to differences in total gene transcription (use the differential expression tests for that).
In summary: differential CDS and splicing output tests look at differences in distribution over the possible isoforms (of spliced transcripts or coding sequences), whereas differential expression tests look at differences in total level.
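The point that the Jensen-Shannon divergence responds to the mix of isoforms and not to the total level can be checked numerically. The sketch below is a plain implementation of the divergence (base-2 logarithm) on made-up isoform abundances; it is not Cuffdiff's code, and the square-root variant is included only because the distance form is commonly reported.

```python
import math
from typing import List, Sequence


def _proportions(abundances: Sequence[float]) -> List[float]:
    """Convert isoform abundances to relative proportions summing to 1."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return [a / total for a in abundances]


def _entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two isoform abundance vectors."""
    if len(sample_a) != len(sample_b):
        raise ValueError("both samples must list the same isoforms")
    p, q = _proportions(sample_a), _proportions(sample_b)
    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    value = _entropy(mixture) - (_entropy(p) + _entropy(q)) / 2
    return max(0.0, value)  # guard against tiny negative rounding error


def js_distance(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Square root of the divergence."""
    return math.sqrt(js_divergence(sample_a, sample_b))


# Made-up abundances for three isoforms of one primary transcript.
same_mix_more_expression = js_divergence([10, 30, 60], [100, 300, 600])
switched_isoforms = js_divergence([10, 30, 60], [60, 30, 10])
print(same_mix_more_expression, switched_isoforms)
```

A tenfold rise in total expression with an unchanged isoform mix gives a divergence of essentially zero, while swapping the dominant isoform at the same total gives a clearly positive value, which is why total-level changes have to be picked up by the differential expression tests.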
17 A test case: Ricinus communis (Castor bean)
5 tissues. Aim: identify differences in lipid-metabolic pathways
18 A test case: Ricinus communis (Castor bean)
Cufflinks / Cuffcompare results
RNA-Seq reads assembled into transcripts corresponding to ‘genes’
Compares to the genes in version 0.1 of the JCVI assembly
35587 share at least one splice junction (possible novel splice variant)
2847 were located intergenic to the JCVI annotation and hence may represent novel genes
splice junctions were identified, supported by at least 10 reads, >300,000 distinct to the JCVI annotation
19 Visualisation
BAM files can be converted to wiggle plots
CummeRbund for visualisation of Cuffdiff output
BAM, Wiggle and GTF files viewed in IGV
CummeRbund volcano and scatter plots
20 Thanks
David Swarbreck (Genome Analysis Team Leader, TGAC)
Mario Caccamo (Head Bioinformatics Division, TGAC)