Expression Analysis of RNA-seq Data


1 Expression Analysis of RNA-seq Data
Manuel Corpas, Plant and Animal Genomes Project Leader

2 Generation of Sequence
Mapping reads
Identification of splice junctions
Assembly of transcripts
Statistical analysis
  1 Summarization (by exon, by transcript, by gene)
  2 Normalization (within sample and between sample)
  3 Differential expression testing (Poisson test, negative binomial test)

3 The Tuxedo Tools
Developed by the Institute of Genetic Medicine at Johns Hopkins University, the University of California, Berkeley, and Harvard University
157 PubMed citations
TopHat
  Fast short read aligner (Bowtie)
  Spliced read identification (TopHat)
Cufflinks package
  Cufflinks: transcript assembly
  Cuffmerge: merges multiple transcript assemblies
  Cuffcompare: compares transcript assemblies to reference annotation
  Cuffdiff: identifies differentially expressed genes and transcripts
CummeRbund
  Visualisation of differential expression results

4 RNA-seq Experimental design
Sequencing technology (SOLiD, Illumina)
  HiSeq 2000: 150 million read pairs per lane, 100 bp
Single end (SE), paired end (PE), strand specific
  SE: quantification against known genes
  PE: novel transcripts, transcript-level quantification
Read length (50-100 bp)
  Greater read length aids mapping accuracy, splice variant assignment and identification of novel junctions
Number of replicates
  RNA-seq is often noted to have substantially less technical variability
  Biological replicates should be included (at least 3 and preferably more)
Sequencing depth
  Dependent on experimental aims

5 RNA-seq Experimental design
References
  Labaj et al. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics (2011)
  Toung et al. RNA-sequence analysis of human B-cells. Genome Research (2011)
Extrapolation of the sigmoid shape suggests 20% of transcripts are not expressed
First saturation effects set in at ~40 million read alignments
~240 million reads achieve 84% transcript recall
[Figure axis labels: fragments per kilobase of exon per million fragments; reads aligned per kilobases mapped]

6 RNA-seq Experimental design
General guide
  Quantify expression of highly to moderately expressed known genes: ~20 million mapped reads, PE, 2 x 50 bp
  Assess expression of alternative splice variants and novel transcripts, with strong quantification including low copy transcripts: in excess of 50 million reads, PE, 2 x 100 bp
Example
  Examine gene expression in 6 different conditions with 3 biological replicates (18 samples)
  Multiplexing 6 samples per lane on 3 lanes of the HiSeq (50 bp PE)
  Generates ~25 M reads per sample
  Assuming ~80% of reads map/pass additional QC (20 M mapped reads per sample)
  Cost: 3 lanes (£978 x 3) + 18 libraries (£105 x 18), total £4824
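The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the lane yield of 150 million read pairs is taken from the HiSeq 2000 figure quoted earlier in the talk, and the constant and function names are made up for illustration.

```python
# Worked arithmetic for the multiplexing example. All figures come from the
# slides; the names below are illustrative only.
READ_PAIRS_PER_LANE = 150_000_000  # HiSeq 2000 yield quoted on the design slide
SAMPLES_PER_LANE = 6
LANES = 3
PASS_PERCENT = 80                  # assumed share of reads that map / pass QC
LANE_COST_GBP = 978
LIBRARY_COST_GBP = 105


def reads_per_sample():
    """Read pairs each sample receives when a lane is shared equally."""
    return READ_PAIRS_PER_LANE // SAMPLES_PER_LANE


def mapped_reads_per_sample():
    """Read pairs expected to survive mapping and QC."""
    return reads_per_sample() * PASS_PERCENT // 100


def total_cost_gbp():
    """Sequencing lanes plus one library preparation per sample."""
    n_samples = SAMPLES_PER_LANE * LANES
    return LANES * LANE_COST_GBP + n_samples * LIBRARY_COST_GBP


print(reads_per_sample(), mapped_reads_per_sample(), total_cost_gbp())
```

Changing `SAMPLES_PER_LANE` shows the trade-off directly: more samples per lane lowers the cost per sample but also lowers the mapped depth each sample reaches.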

7 Step 1 – Preprocessing reads
Sequence data provided as FASTQ files
QC analysis: sequence quality, adapter contamination (FastQC)
Quality trimming, adapter removal (FASTX, PRINSEQ, Sickle)
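To make the idea of quality trimming concrete, here is a minimal Python sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm used by FASTX, PRINSEQ or Sickle (those tools offer more elaborate strategies such as sliding windows); the threshold of 20 and the function name are assumptions chosen for illustration.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q.

    seq    -- the read sequence
    qual   -- the FASTQ quality string (same length as seq)
    offset -- ASCII offset of the quality encoding (33 for Phred+33)
    """
    end = len(seq)
    # Walk backwards from the 3' end until a base meets the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
print(trim_3prime("ACGTACGT", "IIIII###"))
```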

8 Step 2 – Data sources
Reads (FASTQ, Phred+33)
Genomic reference (FASTA, TAIR10), or pre-built Bowtie index
GTF/GFF file of gene calls (TAIR10)
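"Phred 33" refers to how base qualities are stored in the FASTQ file: each quality character is the Phred score plus 33, written as an ASCII character. A short Python sketch of the decoding follows; the function names are illustrative.

```python
def phred33_scores(quality_line):
    """Decode a Phred+33 quality string into integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred33_scores("!5I")
print(scores)                        # one score per base
print(error_probability(scores[1]))  # error probability of the second base
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 40 to 1 in 10,000.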

9 Tuxedo Protocol
[Workflow diagram]
Samples: Leaf (SAM1), Leaf (SAM2), Flower (SAM3), Flower (SAM4), each with paired reads (Read 1, Read 2)
Reads (FASTQ) + GTF + genome -> TOPHAT (read mapping) -> Alignments (BAM)
Alignments (BAM) + GTF -> CUFFLINKS (transcript assembly) -> Assemblies (GTF)
Assemblies (GTF) -> CUFFMERGE (final transcript assembly) -> Assembly (GTF)
Assembly (GTF) + GTF -> CUFFCOMPARE (comparison to reference)
Assembly (GTF) + alignments (BAM) -> CUFFDIFF (differential expression results)
CUFFDIFF results -> CUMMERBUND (expression plots) -> Visualisation (PDF)

10 Step 3 Tuxedo Protocol - TOPHAT
Reads (FASTQ) + GTF + genome -> TOPHAT (read mapping) -> Alignments (BAM)
Non-spliced reads mapped by Bowtie
  Reads mapped directly to transcriptome sequence
Spliced reads identified by TopHat
  Initial mapping used to build a database of splice junctions
  Input reads split into smaller segments
  Coverage islands
  Paired end reads map to distinct regions
  Segments map in distinct regions
  Long reads (>=75 bp) used to identify GT-AG, GC-AG and AT-AC splicings
Bowtie: extremely efficient program for aligning non-spliced reads; it generates a data structure (index) to store the genomic sequence and enable fast searching.
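Two of the ideas on this slide, splitting reads into segments and recognising the GT-AG, GC-AG and AT-AC intron boundaries, can be illustrated with a toy Python sketch. This is not TopHat's implementation; the segment length of 25, the coordinates and the function names are assumptions for illustration only.

```python
# Donor/acceptor dinucleotide pairs named on the slide.
SPLICE_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def split_read(read, segment_length=25):
    """Cut a read into consecutive segments (the last one may be shorter)."""
    return [read[i:i + segment_length] for i in range(0, len(read), segment_length)]


def intron_motif(genome, intron_start, intron_end):
    """Return the (donor, acceptor) dinucleotides of a candidate intron.

    Coordinates are 0-based; intron_end is exclusive.
    """
    return genome[intron_start:intron_start + 2], genome[intron_end - 2:intron_end]


def is_known_motif(genome, intron_start, intron_end):
    """True if the candidate intron starts and ends with a listed motif."""
    return intron_motif(genome, intron_start, intron_end) in SPLICE_MOTIFS


# Toy genome: exon "AAAC", intron "GTAAGTTTTCAG", exon "GGTT".
genome = "AAAC" + "GTAAGTTTTCAG" + "GGTT"
print(intron_motif(genome, 4, 16), is_known_motif(genome, 4, 16))
print([len(s) for s in split_read("A" * 60)])
```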

11 Step 3 Tuxedo Protocol - TOPHAT
Reads (FASTQ) + GTF + genome -> TOPHAT (read mapping) -> Alignments (BAM)
TopHat options:
  -i/--min-intron-length <int>   40
  -I/--max-intron-length <int>   5000
  -a/--min-anchor-length <int>   10
  -g/--max-multihits <int>       20
  -G/--GTF <GTF/GFF3 file>
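As a sketch of how these options fit into one TopHat call, the Python snippet below assembles the command line for a single paired-end sample without running it. The option values are those on the slide; the file names, the index prefix and the output directory (given with TopHat's -o option) are hypothetical placeholders.

```python
import shlex


def tophat_command(index_prefix, reads_1, reads_2, gtf, out_dir):
    """Build (but do not run) a TopHat command line for one paired-end sample."""
    return [
        "tophat",
        "-o", out_dir,   # output directory (hypothetical name)
        "-i", "40",      # --min-intron-length
        "-I", "5000",    # --max-intron-length
        "-a", "10",      # --min-anchor-length
        "-g", "20",      # --max-multihits
        "-G", gtf,       # known gene models (GTF/GFF3)
        index_prefix, reads_1, reads_2,
    ]


# Hypothetical file names for the first leaf sample.
cmd = tophat_command("TAIR10", "SAM1_R1.fastq", "SAM1_R2.fastq",
                     "TAIR10.gtf", "tophat_SAM1")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where TopHat is installed.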

13 Step 4 Tuxedo Protocol - CUFFLINKS
Alignments (BAM) + GTF -> CUFFLINKS (transcript assembly) -> Assemblies (GTF)
Accurate quantification of a gene requires identifying which isoform produced each read.
Reference Annotation Based Transcript (RABT) assembly
Sequence bias correction: -b/--frag-bias-correct <genome.fa>
Multi-mapped read correction: -u/--multi-read-correct
Because a sample may contain reads from multiple splice variants for a given gene, Cufflinks must be able to infer the splicing structure of each gene. However, genes sometimes have multiple alternative splicing events, and there may be many possible reconstructions of the gene model that explain the sequencing data. In fact, it is often not obvious how many splice variants of the gene may be present. Thus, Cufflinks reports a parsimonious transcriptome assembly of the data. The algorithm reports as few full-length transcript fragments or ‘transfrags’ as are needed to ‘explain’ all the splicing event outcomes in the input data.
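A possible Cufflinks call combining the options on this slide is sketched below in Python; the command is only assembled, not executed. The -b and -u options are from the slide. The use of -g/--GTF-guide to request RABT assembly and -o for the output directory are not shown on the slide and are included as assumptions, and all file names are hypothetical.

```python
def cufflinks_command(bam, genome_fasta, out_dir, guide_gtf=None):
    """Build (but do not run) a Cufflinks command line for one alignment file."""
    cmd = [
        "cufflinks",
        "-o", out_dir,       # output directory (hypothetical name)
        "-b", genome_fasta,  # --frag-bias-correct: sequence bias correction
        "-u",                # --multi-read-correct: multi-mapped read correction
    ]
    if guide_gtf is not None:
        # Assumption: reference annotation used as a guide (RABT assembly).
        cmd += ["-g", guide_gtf]
    cmd.append(bam)
    return cmd


# Hypothetical file names for the first leaf sample.
cufflinks_cmd = cufflinks_command("tophat_SAM1/accepted_hits.bam", "TAIR10.fa",
                                  "cufflinks_SAM1", guide_gtf="TAIR10.gtf")
print(" ".join(cufflinks_cmd))
```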

15 Tuxedo Protocol - CUFFDIFF
Inputs: alignments (BAM) from TOPHAT; final assembly (GTF) from CUFFLINKS and CUFFMERGE; GTF + genome; GTF mask file
CUFFDIFF output: FPKM (fragments per kilobase of transcript per million fragments mapped) values, fold change, test statistic, p-value, significance statement.
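The FPKM definition translates directly into a formula. The Python sketch below shows only that definitional calculation; Cuffdiff itself estimates abundances with a statistical model, so its reported values will not come from a one-line division like this. The function name and example numbers are illustrative.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million fragments mapped.

    fragments / (length in kb) / (mapped fragments in millions),
    written as a single expression.
    """
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# 500 fragments on a 2 kb transcript in a library of 20 million mapped fragments.
example_fpkm = fpkm(500, 2000, 20_000_000)
print(example_fpkm)
```

Dividing by transcript length makes long and short transcripts comparable within a sample; dividing by library size makes samples of different depth comparable.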

16 CUFFDIFF - Summarisation
[Figure: exon diagrams for isoforms A, B and C, with exons numbered 1 to 4]
Grouping
  A + B + C: grouped at gene level
  B + C: grouped at CDS level
  A + C: grouped at primary transcript level
  A, B, C: no grouping at the transcript level
Cuffdiff output (11 files)
  FPKM tracking files (transcript, gene, CDS, primary transcript)
  Differential expression tests (transcript, gene, CDS, primary transcript)
  Differential splicing tests: splicing.diff
  Differential coding output: cds.diff
  Differential promoter use: promoters.diff
Look at difference in distribution (rather than total level)

Differential splicing is tested at the primary transcript level: Cuffdiff looks at each group of transcripts that share the same TSS (more precisely, that derive from the same pre-mRNA, so the group clusters different splicing isoforms) and tests whether the mix of splicing isoforms differs. The statistical test is based on the Jensen-Shannon divergence, which measures the difference between distributions, so it is sensitive when one or more splicing isoforms make up a larger share of that primary transcript's output in one sample than in the other. It is not sensitive to differences in the total volume of the primary transcript (use the differential expression tests for that).

Differential CDS output looks at the different coding sequences produced after splicing, i.e. the different combinations of exons; it is a proxy for protein output, but it does not take into account anything after mRNA processing. The test is at the gene level, not the primary transcript level, so it also factors in alternative TSS usage and alternative promoter usage. If one primary transcript is differentially spliced but does not account for the lion's share of the gene's transcription output, it will scarcely affect the CDS output difference. Transcripts that do not differ by their exon sequence but differ by their UTRs contribute no difference (as there is no difference in coding sequence). The statistical test is again based on the Jensen-Shannon divergence, so it is not sensitive to differences in total gene transcription (use the differential expression tests for that).

In summary: differential CDS and splicing output tests look at the difference in distribution over the possible isoforms (of spliced transcripts or coding sequences), whereas differential expression tests look at the difference in total level.
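The point that a Jensen-Shannon test sees changes in isoform mix but not in total level can be demonstrated with a small Python sketch. It computes the plain Jensen-Shannon divergence (in bits) between two vectors of isoform abundances after converting them to proportions; it illustrates the quantity only and is not Cuffdiff's test procedure. The abundance values are invented.

```python
import math


def proportions(abundances):
    """Convert isoform abundances into shares of the total."""
    total = sum(abundances)
    return [a / total for a in abundances]


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(abundances_a, abundances_b):
    """Jensen-Shannon divergence between two isoform mixes (0 = identical)."""
    p, q = proportions(abundances_a), proportions(abundances_b)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


leaf = [10, 30, 60]               # invented abundances of three isoforms
flower_scaled = [100, 300, 600]   # same mix, ten times the total level
flower_shifted = [60, 30, 10]     # same total level, different mix

print(js_divergence(leaf, flower_scaled))
print(js_divergence(leaf, flower_shifted))
```

The first comparison gives a divergence of zero even though expression rose tenfold, which is why the differential expression tests are needed alongside the splicing and CDS tests.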

17 A test case – Ricinus communis (Castor bean)
5 tissues. Aim: identify differences in lipid-metabolic pathways

18 A test case – Ricinus communis (Castor bean)
Cufflinks – Cuffcompare results
RNA-Seq reads assembled into transcripts corresponding to ‘genes’
Compared to the genes in version 0.1 of the JCVI assembly:
  35587 share at least one splice junction (possible novel splice variant)
  2847 were located intergenic to the JCVI annotation and hence may represent novel genes
Splice junctions were identified, supported by at least 10 reads; >300,000 are distinct from the JCVI annotation

19 Visualisation
BAM files can be converted to wiggle plots
CummeRbund for visualisation of Cuffdiff output
BAM, wiggle and GTF files viewed in IGV
CummeRbund volcano and scatter plots
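A volcano plot places each gene at x = log2 fold change and y = -log10(p-value), so strongly changed and highly significant genes sit in the upper corners. CummeRbund is an R package; the Python sketch below only shows how the coordinates are derived from values like those Cuffdiff reports, and the function name and numbers are invented for illustration.

```python
import math


def volcano_point(fpkm_a, fpkm_b, p_value):
    """Return (log2 fold change of b over a, -log10 p-value) for one gene."""
    return math.log2(fpkm_b / fpkm_a), -math.log10(p_value)


# Invented example: FPKM rises from 10 to 40 with p = 0.001.
x, y = volcano_point(10.0, 40.0, 0.001)
print(x, y)
```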

20 Thanks
David Swarbreck (Genome Analysis Team Leader, TGAC)
Mario Caccamo (Head Bioinformatics Division, TGAC)

