1
Differential expression analysis with RNA-Seq
G-OnRamp Beta Users Workshop Wilson Leung 07/2017
2
Outline
- Design considerations for RNA-Seq experiments
- Interpret FastQC results
- Optimize alignment parameters for HISAT
- Assess alignment statistics with CollectRnaSeqMetrics
- Assemble transcripts with StringTie
- Differential expression analysis with DESeq2
3
RNA-Seq overview
Griffith M et al. Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud. PLoS Comput Biol Aug 6;11(8):e
4
Use the RNA Integrity Number (RIN) to assess the quality of the RNA sample
- Electropherograms produced by the Agilent Bioanalyzer
- Traditionally, RNA integrity is measured by the ratio of 28S to 18S rRNA (expect a 2:1 ratio)
- RIN values range from 1 (most degraded) to 10 (least degraded)
- Signal in the fast region increases as the RNA degrades
- Prefer RIN >= 8 for Illumina RNA-Seq sequencing
[Figure: electropherograms (fluorescence vs. length) for RIN = 10 and RIN = 1, with the marker, 5S, fast region, 18S, and 28S peaks labeled]
Gallego Romero I et al. RNA-Seq: impact of RNA degradation on transcript quantification. BMC Biol May 30;12:42.
5
Common applications of RNA-Seq
- Transcriptome profiling
  - Identify novel transcripts (e.g., gene annotations)
  - Quantify expression levels
- Differential expression
  - Different developmental stages; treatment versus control
- Alternative splicing
- Visualization and integration with other datasets
  - Correlate with epigenomic landscape
  - Histone modifications, DNA methylation, etc.
Conesa A et al. A survey of best practices for RNA-Seq data analysis. Genome Biol Jan 26;17:13.
6
Using RNA-Seq to identify chimeric transcripts
- Often found in cell lines and cancer genomes
- Structural changes to transcripts would affect the differential expression analysis
- Many DE tools assume that most genes have similar expression levels across conditions
Maher CA et al. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci U S A Jul 28;106(30):
7
The optimal RNA-Seq sequencing and analysis protocols depend on the goals of the study
8
Design considerations for RNA-Seq
- Experimental design
  - Number of samples, number of biological and technical replicates
- Sequencing design
  - Spike-in controls, randomization of library prep and sequencing
- Quality control
  - Sequencing quality, mapping bias
Conesa A et al. A survey of best practices for RNA-Seq data analysis. Genome Biol Jan 26;17:13.
9
Biological and technical replicates
- Biological replicates
  - RNA from independent growth of cells and tissues
  - Account for biological variations
- Technical replicates
  - Different library preparations of the same RNA-Seq sample
  - Account for batch effects from library preparations
  - Sample loading, cluster amplifications, etc.
ENCODE long RNA-Seq standards:
Blainey P et al. Points of significance: replication. Nat Methods Sep;11(9):
10
Recommended RNA-Seq sequencing depth based on genome size
Genome size                       Differential expression   Detect rare transcripts /   Read length
                                  analysis (# reads)        de novo assembly
Small (bacteria / fungi)          5 M                       30 – 65 M                   50 bp
Intermediate (D. melanogaster)    10 M                      70 – 130 M                  50 – 100 bp
Large (Human)                     15 – 25 M                 100 – 200 M                 > 100 bp

HiSeq 3000 generates 313 million reads per lane
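The depth targets and the per-lane yield above can be combined into a rough multiplexing estimate. The sketch below is only arithmetic on the numbers from the slide; the helper name `samples_per_lane` is hypothetical, and real yields vary between runs.

```python
# Hypothetical helper: estimate how many samples can share one sequencing lane.
def samples_per_lane(reads_per_lane, reads_per_sample):
    """Return how many samples fit in one lane at the target depth."""
    return reads_per_lane // reads_per_sample


HISEQ_3000_READS_PER_LANE = 313_000_000  # per-lane yield quoted on the slide

# Differential expression depth targets from the table
# (upper bound of the range is used for the large genome).
TARGETS = {
    "small": 5_000_000,
    "intermediate": 10_000_000,
    "large": 25_000_000,
}

for label, depth in TARGETS.items():
    print(label, samples_per_lane(HISEQ_3000_READS_PER_LANE, depth))
```

Under these assumptions, a single lane would hold roughly a dozen human samples at 25 M reads each.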
11
How many biological replicates?
- As many as possible…
- Analysis of 48 biological replicates in two conditions
  - Requires 20 biological replicates to detect > 85% of all differentially expressed genes
  - Three replicates detect only 20-40% of differentially expressed genes
  - Three biological replicates per condition can usually detect genes with ≥ 2-fold difference in expression
- Recommend at least six biological replicates per condition
  - Twelve biological replicates are needed to detect smaller fold changes (≥ 0.3-fold difference in expression)
- Use edgeR (exact) if there are fewer than 12 replicates
- Use DESeq if there are more than 12 replicates
Schurch NJ et al. How many biological replicates are needed in an RNA-Seq experiment and which differential expression tool should you use? RNA Jun;22(6):
12
Power curves for number of biological replicates in each condition
Web interface for RnaSeqSampleSize:
- Statistical power = ability to detect an effect (1 - Type II error); aim for power >= 0.8
- Sequencing more samples (multiplexing) means lower coverage per sample
[Figure: power (0.0 to 0.8) vs. number of samples in each condition (10 to 60) at FDR = 0.05 and FDR = 0.01; Bottomly mouse strain comparison, LFC = 0.99, dispersion = 0.035]
13
Using Galaxy to perform RNA-Seq analysis
- Quality control with FastQC
- Trim low quality bases with Trimmomatic
- Read mapping with HISAT
- Transcriptome assembly with StringTie
- Differential expression analysis with DESeq2
Based on RNA-Seq tutorial developed by Mo Heydarian and Mallory Freeberg at Johns Hopkins University:
14
Study design for the differential expression (DE) analysis walkthrough
- Goal: Identify transcripts regulated by GATA1
  - Gata1 is a transcription factor
- G1E mouse cell line
  - ES cells hemizygous for Gata1 knockout
  - Differentiation to erythrocytes (red blood cells) or myeloid (bone marrow)
- Megakaryocytes (Mk)
  - Large bone marrow cell
  - Produces platelets for blood clotting
- Part of a larger study that also examined the ChIP-Seq data for multiple transcription factors
Wu W et al. Genome Res Dec; 24(12):
15
Quality control with FastQC
- Determine quality encoding of fastq files
- Assess quality of sequencing sample
- Identify overrepresented sequences
  - Adapters, potential contamination, etc.
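The "quality encoding" step amounts to deciding whether the quality characters use an ASCII offset of 33 or 64. The sketch below is a simplified heuristic, not FastQC's implementation; the function name and the cutoffs (characters below `;` imply offset 33, characters above `J` suggest offset 64) are assumptions that hold for common Illumina data only.

```python
def guess_phred_offset(quality_strings):
    """Guess the Phred ASCII offset (33 or 64) from FASTQ quality strings.

    Returns 33, 64, or None when the observed character range fits both.
    Assumes at least one non-empty quality string is supplied.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' (ASCII 59) cannot occur with an offset of 64.
        return 33
    if highest > 74:
        # Characters above 'J' (ASCII 74) are unusual for offset-33 Illumina data.
        return 64
    return None


print(guess_phred_offset(["IIIIHHGF#"]))   # contains '#', so offset 33
print(guess_phred_offset(["hhhhggffdd"]))  # only high characters, so offset 64
```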
16
FastQC: Per base sequence quality (G1E_R1_f_ds_SRR549355)
van Gurp TP et al. Consistent errors in first strand cDNA due to random hexamer mispriming. PLoS One Dec 30;8(12):e85583.
17
Sequence bias at 5’ end caused by random hexamer priming
[Figure: sequence content across all bases (%T, %C, %A, %G)]
- Higher G + C content in coding regions
- chr19 average: 29% each for A and T, 21% each for G and C
Hansen KD et al. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res Jul;38(12):e131.
18
GC distribution over all sequences
- Use peaks in the Per sequence GC content panel to identify potential contamination
- FTH1: storage of iron in a soluble and non-toxic state
- RNA-Seq signal (peak) overlaps with STS marker RH94403
[Figure: GC distribution over all sequences (G1E_R1_f_ds_SRR549355); GC count per read vs. theoretical distribution, showing multiple peaks]
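The per-sequence GC panel is essentially a histogram of GC percentage per read. The sketch below shows one way to compute such a histogram; it is not FastQC's code, and the function names and the 5% bin width are assumptions for illustration.

```python
from collections import Counter


def gc_percent(read):
    """Percentage of G and C among the called (A/C/G/T) bases of a read."""
    read = read.upper()
    called = sum(read.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (read.count("G") + read.count("C")) / called


def gc_histogram(reads, bin_width=5):
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    hist = Counter()
    for read in reads:
        lower_edge = int(gc_percent(read) // bin_width) * bin_width
        # Reads with 100% GC are placed in the last bin.
        hist[min(lower_edge, 100 - bin_width)] += 1
    return dict(sorted(hist.items()))


print(gc_histogram(["ATGC", "GGCC", "AAAA"]))
```

A second peak far from the main one in such a histogram would be the cue to look for contamination or a dominant transcript.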
19
Investigate overrepresented sequences
[Figure: G1E and Mk tracks at the Fth1 locus]
20
High RNA-Seq coverage at 5’ UTR of Fth1 overlaps with an STS marker
- STS = Sequence-Tagged Sites
  - Single occurrence in the genome, used for constructing genetic and physical maps

Sequence   %A      %C      %G      %T
RH94403    15.3%   45.9%   23.0%   15.8%
21
RNA-Seq read mapping with HISAT
Many alignment parameters available… Which parameters should be changed?
22
Strand-specific RNA-Seq libraries
- Standard libraries do not preserve strand information
- Prefer strand-specific RNA-Seq
  - Transcript reconstruction and quantification
  - Detect antisense transcripts
- Library prep kits available from Illumina:
  - TruSeq Stranded Total RNA Sample Prep Kit
  - TruSeq Stranded mRNA Sample Prep Kit
Zhao S et al. Comparison of stranded and non-stranded RNA-Seq transcriptome profiling and investigation of gene overlap. BMC Genomics Sep 3;16:675.
23
Orientations of RNA-Seq reads depend on the paired-end protocol
- TruSeq Strand-Specific Total RNA: First Strand (R/RF); fr-firststrand (F2R1)
- NuGEN Encore: Second Strand (F/FR); fr-secondstrand (F1R2)
- NuGEN Ovation V2: FR Unstranded; fr-unstranded
Use the infer_ RSeQC
Key: I = inward, O = outward, M = matching; S = stranded, U = unstranded; F = forward, R = reverse
F2R1 = second read in forward strand, first read in reverse strand (ISR), RF
See Supplemental Table S5 in Griffith M et al. PLoS Comput Biol Aug 6;11(8):e
24
Use infer_experiment.py to infer the design of the RNA-Seq experiment
- Part of the RSeQC package:
  1. Map RNA-Seq reads using default parameters
  2. Run infer_experiment.py to infer experimental design
  3. Map RNA-Seq reads using the inferred parameters
- Available through the Galaxy Tool Shed (rseqc):
- Can also try to infer RIN (RNA integrity)
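For paired-end data, infer_experiment.py reports the fraction of reads explained by "1++,1--,2+-,2-+" and by "1+-,1-+,2++,2--". The sketch below turns those two fractions into a library type. The function name and the 0.75 cutoff are assumptions for illustration, not part of RSeQC; borderline values deserve a manual look.

```python
def infer_library_type(frac_read1_sense, frac_read1_antisense, threshold=0.75):
    """Map infer_experiment.py fractions to a library type label.

    frac_read1_sense:     fraction explained by "1++,1--,2+-,2-+"
    frac_read1_antisense: fraction explained by "1+-,1-+,2++,2--"
    threshold:            hypothetical cutoff for calling a stranded library
    """
    if frac_read1_sense >= threshold:
        return "fr-secondstrand"  # F1R2: read 1 on the transcript strand
    if frac_read1_antisense >= threshold:
        return "fr-firststrand"   # F2R1: read 2 on the transcript strand
    return "fr-unstranded"


print(infer_library_type(0.02, 0.96))  # dUTP-style stranded library
print(infer_library_type(0.49, 0.50))  # no strand preference
```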
25
Common changes to HISAT alignment parameters
- Minimum and maximum intron lengths
- Specify strand-specific information
- GTF file with known splice sites
  - Use known gene annotations to guide read mapping if available
- Transcriptome assembly reporting
26
Use splice site information during read mapping to improve alignment accuracy
- Recommend running STAR and TopHat2 twice: round 1 to discover junctions; round 2 to use these junctions in read mapping
- HISAT by default makes use of splice sites found during the alignment process so that it does not have to run twice (compare HISATx1, HISAT, and HISATx2)
Kim D et al. HISAT: a fast spliced aligner with low memory requirements. Nat Methods Apr;12(4):
27
RNA-Seq alignment strategy for multiple samples
1. Map reads from each RNA-Seq sample separately
   - Use the --novel-splicesite-outfile option to report splice sites identified in each sample
2. Combine splice junctions from all samples under all conditions into a single splice junction file
   - Filter splice junctions by score
   - Retain junctions that appear in multiple biological replicates
3. Map reads from each RNA-Seq sample using the combined splice junctions file
   - Use the --novel-splicesite-infile option
For model organisms / human, could use --known-splicesite-infile directly
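The "combine and retain" step can be sketched as follows. This assumes each splice site file has four tab-separated columns (chromosome, left flanking position, right flanking position, strand); the function names and the default of two supporting samples are hypothetical, and score-based filtering is not shown.

```python
from collections import Counter


def parse_splice_site_line(line):
    """Parse one line, assuming 4 tab-separated columns: chrom, left, right, strand."""
    chrom, left, right, strand = line.rstrip("\n").split("\t")
    return chrom, int(left), int(right), strand


def combine_splice_sites(per_sample_sites, min_samples=2):
    """Keep junctions that are reported in at least `min_samples` samples.

    per_sample_sites: one iterable of (chrom, left, right, strand) per sample.
    """
    support = Counter()
    for sites in per_sample_sites:
        support.update(set(sites))  # count each junction once per sample
    return sorted(site for site, n in support.items() if n >= min_samples)


replicate_1 = [("chr1", 100, 200, "+"), ("chr1", 300, 400, "+")]
replicate_2 = [("chr1", 100, 200, "+")]
print(combine_splice_sites([replicate_1, replicate_2]))
```

The combined list could then be written back out in the same four-column layout and supplied to the second mapping round.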
28
PCR amplification biases in Illumina data
Major contributors to bias:
- Fragment size: bias toward smaller fragments
- Base composition: problem with GC-rich sequences (secondary structures, higher melting temperature)
- Bridge amplification cycles
[Figure: relative abundance (log scale, 0.1 to 100) vs. GC content of amplicon (20 to 80), comparing Illumina and qPCR]
Aird D et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011;12(2):R18.
29
Identify duplicate reads based on sequence alignments
- Assumption: it is rare for different sequenced fragments to have the same start and end positions
  - Tools such as Picard MarkDuplicates only examine the 5’ position (after accounting for clipping)
  - 3’ ends typically have lower quality
- RNA-Seq data violates this assumption:
  - Reads map to a smaller portion of the genome
  - Probability of reads with the same start position depends on gene length and expression levels
  - Examples: ribosomal, mitochondrial, housekeeping genes
- Recommendation:
  - Mark potential duplicates to verify that all samples have similar levels of “duplication”
  - Retain duplicate reads in differential expression analyses
Williams AG et al. RNA-Seq Data: Challenges in and Recommendations for Experimental Design and Analysis. Curr Protoc Hum Genet Oct 1;83:
30
DEMO: Mark optical and PCR duplicates using the Picard tool MarkDuplicates
- MarkDuplicatesWithMateCigar does not work with RNA-Seq data (it uses too much memory for large gaps) and is not used in the production pipeline at the Broad
- The histogram tries to estimate the additional benefits of further sequencing (ROI)
31
Assess RNA-Seq read alignments with CollectRnaSeqMetrics
- Requires gene annotations:
  - Gene annotations in GTF / GFF format
  - UCSC Genes in refFlat format (from the UCSC Table Browser)
- CollectRnaSeqMetrics reports median coverage, 5’/3’ biases, number of reads assigned to the correct strand, etc.
- Inputs: BAM dataset collection; reference genome (mm10); gene annotations in refFlat format; rRNAs in interval list format
- Strand specificity: second read transcription strand
32
DEMO: Use CollectRnaSeqMetrics to assess RNA-Seq alignments
33
Identify coverage bias along the transcript
[Figure: RNA-Seq coverage vs. transcript position (G1E_R1); x-axis: normalized distance along transcript, y-axis: normalized coverage]
[Figure: RNA-Seq read count along the gene span (5’ to 3’, 5,099 genes) for different fragmentation and priming methods]
- cDNA fragmentation: DNase I treatment or sonication
- RNA fragmentation: RNA hydrolysis or nebulization
- oligo-dT primed cDNA
Wang Z et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet Jan;10(1):57-63.
34
Two common approaches to RNA-Seq assembly
- Reference-based assembly
  - Map RNA-Seq reads against a reference genome (examples: TopHat2, HISAT)
  - Assemble transcripts from mapped RNA-Seq reads (examples: Cufflinks, StringTie)
- De novo transcriptome assembly
  - Assemble transcripts from RNA-Seq reads (examples: Oases, Trinity)
  - More computationally expensive
  - Merge assemblies produced by different parameters
  - Advantage: does not require a reference genome
35
Augment mapped RNA-Seq reads with pre-assembled super-reads (SR)
Pertea M et al. StringTie enables improved reconstruction of a transcriptome from RNA-Seq reads. Nat Biotechnol Mar;33(3):290-5.
36
Transcriptome assembly remains an active area of research
Korf I. Genomics: the state of the art in RNA-Seq analysis. Nat Methods Dec;10(12):
Steijger T et al. Assessment of transcript reconstruction methods for RNA-Seq. Nat Methods Dec;10(12):
37
DEMO: Assemble transcripts from mapped RNA-Seq reads using StringTie
38
Metrics for quantifying gene expression levels
- RPKM: Reads Per Kilobase per Million mapped reads
  - Normalize relative to sequencing depth and gene length
- FPKM: Fragments Per Kilobase per Million mapped fragments
  - Similar to RPKM but counts DNA fragments instead of reads
  - Used in paired-end RNA-Seq experiments to avoid bias: both paired-end reads are derived from the same transcript
- TPM: Transcripts Per Million
  - Normalize for gene length, then normalize by sequencing depth
  - Number of transcripts that you would see if you sequenced 1M transcripts (given the abundance of other transcripts in the sample)
  - Normalize FPKM so that the total sum is 1M => TPM
Wagner GP et al. Measurement of mRNA abundance using RNA-Seq data: RPKM measure is inconsistent among samples. Theory Biosci Dec;131(4):281-5.
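The difference between FPKM and TPM is only the order of the two normalization steps, which a small sketch makes concrete. The function names are hypothetical, and the total number of mapped fragments is approximated here by the sum of the counts supplied, which is an assumption (real pipelines may use all mapped fragments).

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million counted fragments."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (total * lengths_bp[g]) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    total_rate = sum(rate.values())
    return {g: 1e6 * rate[g] / total_rate for g in counts}


# Made-up counts and lengths, for illustration only.
counts = {"geneA": 100, "geneB": 300, "geneC": 50}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 250}

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)

# Rescaling FPKM so that the values sum to 1e6 gives the same numbers as TPM.
fpkm_total = sum(fpkm_values.values())
rescaled = {g: 1e6 * v / fpkm_total for g, v in fpkm_values.items()}
print(tpm_values)
print(rescaled)
```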
39
Most differential expression analysis tools use read counts as input
- RPKM, FPKM, and TPM compare relative gene expression levels within a sample
- edgeR uses CPM (counts per million) for filtering and TMM normalization for differential expression analysis
- DESeq2 uses rlog for clustering
40
Create a transcriptome library with StringTie merge
- Combine transcripts from multiple samples into a single transcriptome database
- Use the unique exon to try to quantify transcripts instead of genes
[Figure: assemblies from Samples 1 to 4 and the reference annotation combined into merged assemblies (panels A and B)]
Pertea M et al. Transcript-level expression analysis of RNA-Seq experiments with HISAT, StringTie and Ballgown. Nat Protoc Sep;11(9):
41
DEMO: Use StringTie merge to construct a transcriptome database for G1E and Mk
42
Differential expression analysis tools
- Compare gene expression levels: DESeq2, edgeR
- Compare transcript expression levels: Cuffdiff, Ballgown, MISO, Salmon
- Compare both gene and transcript expression levels: RSEM + EBSeq
- DESeq was originally designed for differential expression analysis at the gene level
- Michael Love started to examine the behavior of DESeq2 at the transcript level
- Recommend using DEXSeq to study differential expression of exons instead of focusing on the isoforms
43
Count the number of reads that overlap with each gene using htseq-count
- Three modes of overlap resolution:
  - union
  - intersection_strict
  - intersection_nonempty
- htseq-count ignores multi-mapped reads
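The three overlap resolution modes differ only in how the per-base feature sets of a read are combined. The sketch below illustrates that logic on plain Python sets; it is not HTSeq's code, and the function name and input layout (one set of overlapping features per aligned base) are assumptions.

```python
def assign_read(position_feature_sets, mode="union"):
    """Resolve a read to a feature, 'no_feature', or 'ambiguous'.

    position_feature_sets: one set of overlapping feature names per aligned base.
    """
    sets = [set(s) for s in position_feature_sets]
    if not sets:
        return "no_feature"
    if mode == "union":
        candidates = set().union(*sets)
    elif mode == "intersection_strict":
        candidates = set.intersection(*sets)
    elif mode == "intersection_nonempty":
        nonempty = [s for s in sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# A read that lies partly inside gene A and partly outside any feature.
partial = [{"A"}, {"A"}, set()]
for mode in ("union", "intersection_strict", "intersection_nonempty"):
    print(mode, assign_read(partial, mode))
```

For a read that hangs off the end of a gene, `union` and `intersection_nonempty` still assign it to the gene, while `intersection_strict` does not.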
44
featureCounts is a faster alternative to htseq-count
C versus Python… Liao Y et al. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics Apr 1;30(7):
45
Issues with the GTF files produced by the UCSC Table Browser
- The gene_id and transcript_id fields in the GTF file have the same values
  - GTF format stipulates that different isoforms derived from the same gene should have the same gene_id
- Two potential solutions:
  - Download the genePred file from the UCSC Genome Browser web site, then use genePredToGtf to create the GTF file
  - Use the GTF files from Ensembl
- The GTF files produced by the UCSC Table Browser also contain duplicate entries (including the GTF file that is part of the Data Library)
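A quick way to spot the gene_id problem is to measure how often gene_id equals transcript_id in a GTF file. The sketch below is a minimal check, assuming standard 9-column GTF lines with `key "value";` attributes; the function names are hypothetical.

```python
import re

# Matches attribute pairs of the form: key "value"
_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Parse the 9th GTF column into a dict of attribute name -> value."""
    return dict(_ATTRIBUTE.findall(field))


def fraction_gene_id_equals_transcript_id(gtf_lines):
    """Fraction of records whose gene_id is identical to their transcript_id."""
    same = total = 0
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        attrs = parse_attributes(columns[8])
        if "gene_id" in attrs and "transcript_id" in attrs:
            total += 1
            same += attrs["gene_id"] == attrs["transcript_id"]
    return same / total if total else 0.0
```

A fraction close to 1.0 would suggest that the file carries transcript identifiers in the gene_id field, so gene-level counting would treat each isoform as its own gene.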
46
DEMO: Calculate the read count for each transcript using featureCounts
47
Use Poisson distribution to model the distribution of read counts
- Probability of the number of “events” in a fixed amount of time or space
  - Events = RNA-Seq reads
- Mean = variance = λ
- Probability distribution is derived from the binomial distribution (the limit of many trials, each with a small probability of success)
- Models noise within the same condition
- Assumptions:
  - Events are independent
  - Rate of events is constant
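The "mean = variance = λ" property can be checked numerically. The sketch below evaluates the Poisson probability mass function in log space (to avoid overflow for large counts) and sums over a truncated range; the function names and the truncation point are assumptions for illustration.

```python
import math


def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with rate lam > 0."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_mean_variance(lam, max_k=500):
    """Numerically compute the mean and variance over k = 0 .. max_k - 1."""
    probs = [poisson_pmf(k, lam) for k in range(max_k)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance


mean, variance = poisson_mean_variance(10.0)
print(round(mean, 6), round(variance, 6))
```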
48
Overdispersion in RNA-Seq data
- More highly expressed genes show higher variance
- Model the overdispersion with a negative binomial distribution
- Another problem with RNA-Seq data is heteroscedasticity: more highly expressed genes show more variability
- Overdispersion is caused primarily by biological noise
- Noise
  - Low expression: shot noise dominates (solution = sequence deeper)
  - High expression: biological noise dominates (solution = more replicates)
[Figure: log(variance) vs. log(mean), comparing the Poisson expectation with the actual (overdispersed) data]
Zhou YH et al. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics Oct 1;27(19):
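One common way to write the mean-variance relationship of the negative binomial model is shown below. The symbols are assumptions for this note: K is the read count for a gene, μ is its mean, and α ≥ 0 is the dispersion. This is the parameterization used by tools such as DESeq2 and edgeR, and it may differ from the form shown on the original slide.

```latex
\operatorname{Var}(K) = \mu + \alpha \mu^{2}, \qquad \alpha \ge 0
```

The first term is the Poisson (shot noise) contribution, which dominates at low expression; the second term grows with the square of the mean and dominates at high expression. Setting α = 0 recovers the Poisson case, where variance = mean.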
49
Use biological replicates to control Type I errors
- Use biological replicates to estimate variance within a condition
- Identify differentially expressed genes under different conditions
- If there is no variation among biological replicates, then expect a log2 fold change of 0
[Figure: log2 fold change (-5 to 5) vs. mean expression (10^-1 to 10^6); yellow line = Poisson; purple line = Poisson + local regression (DESeq; local regression for gamma family GLM with correction for library size)]
Huber W. Differential Expression for RNA-Seq. Slides available through the Bioconductor web site:
50
DEMO: Differential expression analysis using DESeq2
51
Verify results using multiple differential expression analysis tools
- Impact of read depth (5M, 10M, 20M, 30M) on differential expression analysis
- Use the intersection of differentially expressed genes identified by multiple tools in downstream analyses
- Datasets:
  - MAQC = MicroArray Quality Control project: human brain reference + universal human reference
  - K_N = mouse neurosphere cells treated with potassium chloride and norepinephrine
  - LCL = lymphoblastoid cell lines: human samples from Nigeria from the HapMap project
Zhang ZH et al. A comparative study of techniques for differential expression analysis on RNA-Seq data. PLoS One Aug 13;9(8):e
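Taking the intersection of calls from several tools is a set operation once each tool's results are reduced to gene names and adjusted p-values. The sketch below assumes that layout; the function name, the 0.05 cutoff, and the example values are made up for illustration.

```python
def consensus_de_genes(results_by_tool, alpha=0.05):
    """Genes called significant (adjusted p-value < alpha) by every tool.

    results_by_tool: {tool name: {gene name: adjusted p-value or None}}
    """
    significant = [
        {gene for gene, padj in results.items() if padj is not None and padj < alpha}
        for results in results_by_tool.values()
    ]
    return set.intersection(*significant) if significant else set()


# Made-up adjusted p-values, for illustration only.
example = {
    "DESeq2": {"Gata1": 0.001, "Fth1": 0.20, "Hbb": 0.01},
    "edgeR": {"Gata1": 0.004, "Fth1": 0.03, "Hbb": 0.02},
}
print(sorted(consensus_de_genes(example)))
```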
52
Tools for analyzing differentially expressed genes
- Gene Ontology (GO) terms enrichment: topGO, goSTAG, DAVID
- Pathway analysis: GAGE, Reactome
- Sample walkthrough: From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline
53
Additional resources
- Analysis of RNA-Seq data: gene-level exploratory analysis and differential expression
- Informatics for RNA-Seq: A web resource for analysis on the cloud
- UC Davis Bioinformatics Core training course
- So you want to do a: RNA-Seq experiment, Differential Gene Expression Analysis
  - Specific course from UC Davis on RNA-Seq and differential gene expression analysis
54
Summary
- Statistical power of RNA-Seq depends on the number of biological replicates and sequencing depth
- Assess quality of RNA-Seq reads with FastQC
- Assess RNA-Seq alignments with Picard tools
- Perform transcriptome assembly with StringTie
- Perform differential expression analyses with DESeq2
55
https://flic.kr/p/bhyT8B
Questions?
57
Power curves for number of biological replicates in each condition
Web interface for RnaSeqSampleSize:
- Statistical power = ability to detect an effect (1 - Type II error); aim for power >= 0.8
- Sequencing more samples (multiplexing) means lower coverage per sample
[Figure: power (0.0 to 0.8) vs. number of samples in each condition (10 to 60) at FDR = 0.05 and FDR = 0.01; Bottomly mouse strain comparison, LFC = 0.99, dispersion = 0.035]
Ching T et al. Power analysis and sample size estimation for RNA-Seq differential expression. RNA Nov;20(11):