Expression Analysis of RNA-seq Data


Expression Analysis of RNA-seq Data Manuel Corpas Plant and Animal Genomes Project Leader manuel.corpas@tgac.ac.uk

- Generation of sequence
- Mapping reads
- Identification of splice junctions
- Assembly of transcripts
- Statistical analysis
  1. Summarization (by exon, by transcript, by gene)
  2. Normalization (within sample and between samples)
  3. Differential expression testing (Poisson test, negative binomial test)

The Tuxedo Tools
- Developed by the Institute of Genetic Medicine at Johns Hopkins University / University of California, Berkeley / Harvard University
- 157 PubMed citations
- TopHat
  - Fast short read aligner (Bowtie)
  - Spliced read identification (TopHat)
- Cufflinks package
  - Cufflinks: transcript assembly
  - Cuffmerge: merges multiple transcript assemblies
  - Cuffcompare: compares transcript assemblies to the reference annotation
  - Cuffdiff: identifies differentially expressed genes and transcripts
- CummeRbund
  - Visualisation of differential expression results

RNA-seq Experimental design
- Sequencing technology (SOLiD, Illumina)
  - HiSeq 2000: 150 million read pairs per lane, 100 bp
- Single end (SE), paired end (PE), strand specific
  - SE: quantification against known genes
  - PE: novel transcripts, transcript-level quantification
- Read length (50-100 bp)
  - Greater read length aids mapping accuracy, splice variant assignment and identification of novel junctions
- Number of replicates
  - RNA-seq is often noted to have substantially less technical variability
  - Biological replicates should be included (at least 3 and preferably more)
- Sequencing depth
  - Dependent on experimental aims

RNA-seq Experimental design
- Labaj et al. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics (2011)
- Toung et al. RNA-sequence analysis of human B-cells. Genome Research (2011)
- Extrapolation of the sigmoid shape suggests 20% of transcripts are not expressed
- First saturation effects set in at ~40 million read alignments
- ~240 million reads achieve 84% transcript recall
[Figure axis labels: fragments per kilobase of exon per million fragments; reads aligned per kilobases mapped]

RNA-seq Experimental design
General guide
- Quantify expression of highly to moderately expressed known genes: ~20 million mapped reads, PE, 2 x 50 bp
- Assess expression of alternative splice variants and novel transcripts, with strong quantification including low-copy transcripts: in excess of 50 million reads, PE, 2 x 100 bp
Example
- Examine gene expression in 6 different conditions with 3 biological replicates (18 samples)
- Multiplex 6 samples per lane on 3 lanes of the HiSeq (50 bp PE)
- Generates ~25 M reads per sample
- Assuming ~80% of reads map/pass additional QC: 20 M mapped reads per sample
- Cost: 3 lanes (£978 x 3) + 18 libraries (£105 x 18), total £4824
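
The arithmetic behind the example above can be reproduced with a few lines of Python. This is only a sketch: the function name and its parameters are hypothetical, the per-lane yield of 150 million read pairs is taken from the HiSeq 2000 figure quoted earlier in the deck, and the prices are the ones given on the slide.

```python
import math


def plan_run(n_samples, samples_per_lane, read_pairs_per_lane,
             usable_fraction, lane_cost, library_cost):
    """Estimate lanes, per-sample depth and cost for a multiplexed run.

    All argument names are illustrative; nothing here is tied to a real tool.
    """
    lanes = math.ceil(n_samples / samples_per_lane)
    reads_per_sample = read_pairs_per_lane / samples_per_lane
    mapped_per_sample = reads_per_sample * usable_fraction
    total_cost = lanes * lane_cost + n_samples * library_cost
    return {
        "lanes": lanes,
        "reads_per_sample": reads_per_sample,
        "mapped_per_sample": mapped_per_sample,
        "total_cost": total_cost,
    }


# Values from the slide: 18 samples, 6 per lane, ~80% usable reads,
# GBP 978 per lane and GBP 105 per library.
plan = plan_run(18, 6, 150_000_000, 0.8, 978, 105)
print(plan)
```

Changing `samples_per_lane` shows the trade-off directly: fewer samples per lane means deeper coverage per sample but more lanes to pay for.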

Step 1 – Preprocessing reads
- Sequence data provided as FASTQ files
- QC analysis: sequence quality, adapter contamination (FastQC)
- Quality trimming, adapter removal (FASTX, PRINSEQ, Sickle)
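
To make the quality-trimming step concrete, here is a minimal Python sketch that decodes Phred+33 quality strings and trims low-quality bases from the 3' end of a read. It is not the algorithm used by FASTX, PRINSEQ or Sickle (Sickle, for example, uses a sliding window); the function names and the threshold of 20 are assumptions chosen for illustration.

```python
def phred33_scores(quality_string):
    """Convert a Phred+33 encoded quality string to integer scores."""
    return [ord(char) - 33 for char in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop trailing bases whose quality is below min_quality.

    Trimming stops at the first base (scanning from the 3' end)
    that meets the threshold; internal low-quality bases are kept.
    """
    scores = phred33_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 in Phred+33.
seq, qual = trim_3prime("ACGTACGT", "IIIII#!#")
print(seq, qual)
```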

Step 2 – Data sources
- Reads (FASTQ, Phred+33)
- Genomic reference (FASTA, TAIR10), or pre-built Bowtie index
- GTF/GFF file of gene calls (TAIR10)
- http://tophat.cbcb.umd.edu/igenomes.html

Tuxedo Protocol
- Samples: Leaf (SAM1), Leaf (SAM2), Flower (SAM3), Flower (SAM4), each with Read 1 and Read 2
- Reads (FASTQ) + GTF + Genome -> TOPHAT (Read Mapping) -> Alignments (BAM)
- Alignments (BAM) + GTF -> CUFFLINKS (Transcript Assembly) -> Assemblies (GTF)
- Assemblies (GTF) -> CUFFMERGE (Final Transcript Assembly) -> Assembly (GTF)
- Assembly (GTF) + GTF -> CUFFCOMPARE (Comparison to reference)
- Assembly (GTF) + Alignments (BAM) -> CUFFDIFF (Differential expression results)
- CUFFDIFF results -> CUMMERBUND (Expression Plots) -> Visualisation (PDF)

Step 3 Tuxedo Protocol - TOPHAT
Inputs: Reads (FASTQ), GTF + Genome. Output: Alignments (BAM).
- Non-spliced reads mapped by Bowtie
- Reads mapped directly to transcriptome sequence
- Spliced reads identified by TopHat
- Initial mapping used to build a database of splice junctions
- Input reads split into smaller segments
- Coverage islands
- Paired-end reads map to distinct regions
- Segments map in distinct regions
- Long reads (>=75 bp) used to identify GT-AG, GC-AG and AT-AC splicings
- Bowtie: extremely efficient program for aligning non-spliced reads; generates a data structure (index) to store the genomic sequence and enable fast searching

Step 3 Tuxedo Protocol - TOPHAT
Inputs: Reads (FASTQ), GTF + Genome. Output: Alignments (BAM).
TopHat options (with the values shown on the slide):
  -i/--min-intron-length <int>    40
  -I/--max-intron-length <int>    5000
  -a/--min-anchor-length <int>    10
  -g/--max-multihits <int>        20
  -G/--GTF <GTF/GFF3 file>
Reference: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3209707/pdf/1471-2164-12-516.pdf
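
As a sketch of how the options above fit together on one command line, the Python snippet below assembles a TopHat invocation as an argument list. The option values are those on the slide; the index prefix, file names and output directory are hypothetical placeholders, and the command is only printed here, not executed.

```python
import shlex


def build_tophat_command(index_prefix, reads_1, reads_2,
                         annotation_gtf, output_dir):
    """Return a TopHat argument list for one paired-end sample.

    Paths are placeholders supplied by the caller; option values
    follow the slide (intron length 40-5000, anchor 10, multihits 20).
    """
    return [
        "tophat",
        "-i", "40",            # --min-intron-length
        "-I", "5000",          # --max-intron-length
        "-a", "10",            # --min-anchor-length
        "-g", "20",            # --max-multihits
        "-G", annotation_gtf,  # --GTF reference annotation
        "-o", output_dir,      # --output-dir
        index_prefix, reads_1, reads_2,
    ]


# Hypothetical file names for the first leaf sample.
command = build_tophat_command(
    "TAIR10", "leaf1_R1.fastq", "leaf1_R2.fastq",
    "TAIR10.gtf", "tophat_leaf1",
)
print(shlex.join(command))
```

Keeping the command as a list (rather than a single string) means it could be passed to `subprocess.run` without shell quoting problems, should you choose to run it.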

Step 4 Tuxedo Protocol - CUFFLINKS
Inputs: Alignments (BAM), GTF. Output: Assemblies (GTF).
- Accurate quantification of a gene requires identifying which isoform produced each read.
- Reference Annotation Based Transcript (RABT) assembly
- Sequence bias correction: -b/--frag-bias-correct <genome.fa>
- Multi-mapped read correction: -u/--multi-read-correct
Because a sample may contain reads from multiple splice variants for a given gene, Cufflinks must be able to infer the splicing structure of each gene. However, genes sometimes have multiple alternative splicing events, and there may be many possible reconstructions of the gene model that explain the sequencing data. In fact, it is often not obvious how many splice variants of the gene may be present. Thus, Cufflinks reports a parsimonious transcriptome assembly of the data. The algorithm reports as few full-length transcript fragments or 'transfrags' as are needed to 'explain' all the splicing event outcomes in the input data.
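
The point that quantification depends on knowing which isoform produced each read can be illustrated with a toy expectation-maximisation loop. This is a minimal sketch and not Cufflinks' actual statistical model: it ignores transcript length, fragment size and sequence bias, and the function name and the read-compatibility input format are hypothetical.

```python
def estimate_isoform_fractions(read_compatibility, n_isoforms, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    read_compatibility: one non-empty set per read, holding the indices
    of the isoforms that read could have come from.
    """
    fractions = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: share each read among compatible isoforms
        # in proportion to the current abundance estimates.
        for compatible in read_compatibility:
            total = sum(fractions[i] for i in compatible)
            for i in compatible:
                expected[i] += fractions[i] / total
        # M-step: re-estimate abundances from the expected read counts.
        n_reads = sum(expected)
        fractions = [count / n_reads for count in expected]
    return fractions


# 6 reads unique to isoform 0, 2 unique to isoform 1, 4 ambiguous.
reads = [{0}] * 6 + [{1}] * 2 + [{0, 1}] * 4
print(estimate_isoform_fractions(reads, n_isoforms=2))
```

In this example the ambiguous reads end up split 3:1, the same ratio as the uniquely assigned reads, which is the intuition behind sharing multi-isoform reads by estimated abundance.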

Tuxedo Protocol - CUFFDIFF
Inputs: Alignments (BAM), Assembly (GTF) from CUFFMERGE, GTF + Genome, GTF mask file. Output: differential expression results.
CUFFDIFF output: FPKM (fragments per kilobase of transcript per million fragments mapped) values, fold change, test statistic, p-value, significance statement.
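
The FPKM definition above translates directly into a formula: fragments assigned to a transcript, divided by transcript length in kilobases and by millions of mapped fragments. The Python sketch below applies that definition literally; the function name and the example numbers are made up for illustration, and Cuffdiff's own estimates additionally involve its isoform model and normalisation choices.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# 500 fragments on a 2 kb transcript in a library of 20 M mapped fragments.
print(fpkm(500, 2_000, 20_000_000))  # 12.5
```

Dividing by length corrects for longer transcripts yielding more fragments; dividing by library size makes samples sequenced to different depths comparable.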

CUFFDIFF - Summarisation
Example isoforms (from the slide figure): (A) exons 1 2 3 4; (B) exons 1 2 4; (C) exons 1 2 4
- A + B + C: grouped at the gene level
- B + C: grouped at the CDS level
- A + C: grouped at the primary transcript level
- A, B, C: no grouping at the transcript level
Cuffdiff output (11 files)
- FPKM tracking files (transcript, gene, CDS, primary transcript)
- Differential expression tests (transcript, gene, CDS, primary transcript)
- Differential splicing tests: splicing.diff
- Differential coding output: cds.diff
- Differential promoter use: promoters.diff
Differential splicing is tested at the primary transcript level: each group of transcripts that share the same TSS (more precisely, that derive from the same pre-mRNA, so the group clusters different splicing isoforms) is examined to see whether the mix of splicing isoforms differs between samples. The statistical test is based on the Jensen-Shannon divergence, which measures the difference between distributions. It is therefore sensitive when one or more splicing isoforms make up a larger share of the primary transcript output in one sample than in the other, but it is not sensitive to differences in total primary transcript volume (use the differential expression tests for that).
Differential CDS output looks at the different coding sequences produced after splicing, i.e. the different combinations of exons; it is a proxy for protein output, but it does not take into account anything after mRNA processing. The test is at the gene level, not at the primary transcript level, so it also factors in alternative TSS usage and alternative promoter usage. If one primary transcript shows differential splicing but does not account for the lion's share of the gene's transcription output, it will scarcely affect the CDS output difference. Transcripts that share the same coding exons but differ in their UTRs are not distinguished, as there is no difference in coding sequence. The statistical test is again based on the Jensen-Shannon divergence, so it is not sensitive to differences in total gene transcription (use the differential expression tests for that).
In summary: differential CDS and splicing output tests look at differences in distribution over the possible isoforms (of spliced transcripts or coding sequences), whereas differential expression tests look at differences in total level.
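
To see why a test based on the Jensen-Shannon divergence responds to changes in isoform mix but not to changes in total level, the sketch below computes the divergence (in bits) between two isoform abundance profiles after normalising each to a distribution. This illustrates the divergence only, not Cuffdiff's test statistic or its significance calculation; the function names and abundance values are hypothetical.

```python
import math


def normalise(abundances):
    """Scale non-negative abundances so that they sum to 1."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must have a positive sum")
    return [value / total for value in abundances]


def entropy(distribution):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions, in bits."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mixture) - (entropy(p) + entropy(q)) / 2


# Same isoform mix at a tenfold higher total level: divergence is 0.
same_mix = js_divergence(normalise([80, 20]), normalise([800, 200]))
# Same total level but a shifted isoform mix: divergence is positive.
shifted = js_divergence(normalise([80, 20]), normalise([20, 80]))
print(same_mix, shifted)
```

The square root of this quantity is the Jensen-Shannon distance, which is a true metric and is the form often used when comparing distributions.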

A test case – Ricinus communis (Castor bean)
- 5 tissues
- Aim: identify differences in lipid-metabolic pathways

A test case – Ricinus communis (Castor bean)
Cufflinks – Cuffcompare results
- RNA-Seq reads assembled into 75090 transcripts corresponding to 29759 'genes'
- Compares to the 31221 genes in version 0.1 of the JCVI assembly
- 35587 share at least one splice junction (possible novel splice variant)
- 2847 were located intergenic to the JCVI annotation and hence may represent novel genes
- 218147 splice junctions were identified, 112337 supported by at least 10 reads, >300,000 distinct to the JCVI annotation

Visualisation
- BAM files can be converted to wiggle plots
- CummeRbund for visualisation of Cuffdiff output
- BAM, wiggle and GTF files viewed in IGV
- CummeRbund volcano and scatter plots

Thanks
- David Swarbreck (Genome Analysis Team Leader, TGAC)
- Mario Caccamo (Head Bioinformatics Division, TGAC)