Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2.3 RNA-Seq Analysis Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2.3.

Slides:



Advertisements
Similar presentations
IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy
Advertisements

RNAseq.
12/04/2017 RNA seq (I) Edouard Severing.
Visualising and Exploring BS-Seq Data
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Peter Tsai Bioinformatics Institute, University of Auckland
DEG Mi-kyoung Seo.
RNA-seq: the future of transcriptomics ……. ?
Data Analysis for High-Throughput Sequencing
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Transcriptomics Jim Noonan GENE 760.
MCB Lecture #21 Nov 20/14 Prokaryote RNAseq.
Ribosomal Profiling Data Handling and Analysis
RNA-seq Analysis in Galaxy
RNA-Seq and RNA Structure Prediction
Brief workflow RNA is isolated from cells, fragmented at random positions, and copied into complementary DNA (cDNA). Fragments meeting a certain size specification.
Li and Dewey BMC Bioinformatics 2011, 12:323
Expression Analysis of RNA-seq Data
Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.
Introduction to DESeq and edgeR packages Peter A.C. ’t Hoen.
RNAseq analyses -- methods
June 11, 2013 Intro to Bioinformatics – Assembling a Transcriptome Tom Doak Carrie Ganote National Center for Genome Analysis Support.
Lecture 11. Microarray and RNA-seq II
Next Generation DNA Sequencing
Schedule change Day 2: AM - Introduction to RNA-Seq (and a touch of miRNA-Seq) Day 2: PM - RNA-Seq practical (Tophat + Cuffdiff pipeline on Galaxy) Day.
RNA-Seq Analysis Simon V4.1.
Transcriptome Analysis
Eran Yanowski, Eran Hornstein’s: Monitor drug impact on the transcriptome of mouse beta cells (primary and cell-line) using Transeq/RNA-Seq Report.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
RNA surveillance and degradation: the Yin Yang of RNA RNA Pol II AAAAAAAAAAA AAA production destruction RNA Ribosome.
RNA-seq workshop COUNTING & HTSEQ Erin Osborne Nishimura.
1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:
Statistics for Differential Expression Naomi Altman Oct. 06.
Introduction to RNAseq
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
The iPlant Collaborative
RNA-seq: Quantifying the Transcriptome
The iPlant Collaborative
No reference available
Lecture 12 RNA – seq analysis.
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
+ RNAseq for differential gene expression analysis Molly Hammell, PhD
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Differential Methylation Analysis
Statistics Behind Differential Gene Expression
Simon v RNA-Seq Analysis Simon v
RNA Quantitation from RNAseq Data
Placental Bioinformatics
Moderní metody analýzy genomu
Biases and their Effect on Biological Interpretation
Gene expression from RNA-Seq
RNA-Seq analysis in R (Bioconductor)
Understanding and Validating Experimental Expectations
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
The FASTQ format and quality control
Kallisto: near-optimal RNA seq quantification tool
Analysing ChIP-Seq Data
Transcriptome analysis
ChIP-Seq Data Processing and QC
Exploring and Understanding ChIP-Seq data
Visualising and Exploring BS-Seq Data
Learning to count: quantifying signal
Assessing changes in data – Part 2, Differential Expression with DESeq2
Quantitative analyses using RNA-seq data
Sequence Analysis - RNA-Seq 2
RNA-Seq Data Analysis UND Genomics Core.
Presentation transcript:

Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2.3 RNA-Seq Analysis Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2.3

RNA-Seq Libraries rRNA depleted mRNA Fragment Random prime + RT 2nd strand synthesis (+ U) A-tailing Adapter Ligation (U strand degradation) Sequencing NNNN u u u u u A T u A T A T

RNA-Seq Analysis QC (Trimming) Mapping Mapped QC Statistical Analysis Quantitation Mapped QC

QC: Raw Data Sequence call quality

QC: Raw Data Sequence bias

QC: Raw Data Duplication level

Mapping Exon 1 Exon 2 Exon 3 Genome Simple mapping within exons Mapping between exons Spliced mapping Can simplify by aligning first to a transcriptome and then translate back to genomic coordinates. Can map unmatched reads to the whole genome.

Spliced Mapping Software Tophat (http://tophat.cbcb.umd.edu/) Star (http://code.google.com/p/rna-star/)

TopHat pipeline Reference FastQ files Indexed Genome Reference GTF Models Indexed Transcriptome Reads Maps to transcriptome? Yes Translate coords and report No Maps to genome? Yes Report No Split map to genome Yes Build consensus and report No Discard

Post Mapping QC Mapping statistics Proportion of reads which are in transcripts Proportion of reads in transcripts in exons Strand specificity Consistency of coverage SeqMonk (RNA-Seq QC plot) RNASeqQC (easiest through GenePattern)

SeqMonk Mapping QC (good)

SeqMonk Mapping QC (bad)

Quantitation Splice form 1 Exon 1 Exon 2 Exon 3 Splice form 2 Exon 1 Definitely splice form 1 Definitely splice form 2 Ambiguous

Options for handling splice variants Ignore them – combine exons and analyse at gene level Simple, powerful, inaccurate in some cases DE-Seq, SeqMonk Assign ambiguous reads based on unique ones – quantitate transcripts and optionally merge to gene level Potentially cleaner more powerful signal High degree of uncertainty Cufflinks, bitSeq, RSEM

Read counting Simple (exon or transcript) Complex (re-assignment) HTSeq (htseq-count) BEDTools (multicov) featureCounts SeqMonk (graphical) Complex (re-assignment) Cufflinks, bitSeq, RSEM

Count Corrections Size of library Length of transcript Other factors

RPKM / FPKM / TPM RPKM (Reads per kilobase of transcript per million reads of library) Corrects for total library coverage Corrects for gene length Comparable between different genes within the same dataset FPKM (Fragments per kilobase of transcript per million reads of library) Only relevant for paired end libraries Pairs are not independent observations RPKM/2 TPM (transcripts per million) Normalises to transcript copies instead of reads Corrects for cases where the average transcript length differs between samples

Normalisation

Normalisation

Filtering Genes Remove things which are uninteresting or shouldn’t be measured Reduces noise – easier to achieve significance Non-coding (miRNA, snoRNA etc) in RNA-Seq Known mis-spliced forms (exon skipping etc) Mitochonidrial genes X/Y chr genes in mixed sex populations Unknown/Unannotated genes

Filtering Mouse mRNAs

Visualising Expression Comparing the same gene in different samples Normalised log2 RPM values Comparing different genes in the same sample Normalised log2 RPKM values

Linear Log2 CD74 Eef1a1 Actb Lars2 Eef2

Differential Expression Microarrays traditionally used continuous statistical tests (t-test ANOVA etc) RNA-Seq differs in that it is count based data, so continuous tests fail at low counts Most differential tests use count based distribution tests, usually based on a negative binomial distribution

Negative binomial tests (DE-Seq) Are the counts we see for gene X in condition 1 consistent with those for gene X in condition 2? Initially modelled using simple Poisson distribution using mean expression as the only parameter Doesn’t model real data very well

Poisson vs Negative binomial DESeq

Parameters Size factors Variance – required for NB distribution Estimator of library sampling depth More stable measure than total coverage Based on median ratio between conditions Variance – required for NB distribution Custom variance distribution fitted to real data Insufficient observation to allow direct measure Smooth distribution assumed to allow fitting

Dispersion shrinkage Plot observed per gene dispersion Calculate average dispersion for genes with similar observation Individual dispersions regressed towards the mean. Weighted by Distance from mean Number of observations Points more than 2SD above the mean are not regressed

Other filters Cook’s Distance – Identifies high variance Effect on mean of removal of one replicate For n<3 test not performed For n = 3-6 failures are removed For n>6 outliers removed to make trimmed mean Disable with cooksCutoff=FALSE Hit count optimisation Low intensity reads are removed Limits multiple-testing to give max significant hits Disable with independentFiltering=FALSE

Replicates Compared to arrays, RNA-Seq is a very clean technical measure of expression Generally don’t run technical replicates Some statistics can be run on single replicates, but they can only tell you about technical noise (how likely is it that this change is due to a technical issue) Assessing biological variation requires biological replicates

Replicates Traditional statistics require min 3x3 DESeq can operate at 2x2, but this is minimum, not recommended True number of replicates required will depend on your biology and requirements 4x4 design is fairly common Always expect at least one sample to fail Randomise samples during sample prep

The problem of power… Gene A (1kb) Gene B (8kb) In a library Gene B is much better observed for the same copy number Power to detect DE is proportional to length

5x5 Replicates 5,000 out of 22,000 genes (23%) identified as DE using DESeq (p<0.05)

Intensity difference test Different approach to differential expression Doesn’t aim to find every differentially expressed gene Conservative test Guaranteed to never return large numbers of hits

Assumptions Noise is related to observation level Similar to DESeq Differences between conditions are either A direct response to stimulus Noise, either technical or biological Find points whose differences aren’t explained by general disruption

Method

Results

Exercises Look at raw QC Mapping with tophat Small test data Quantitation and visualisation with SeqMonk Larger replicated data Differential expression with DESeq Review in SeqMonk

Useful links FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Tophat http://tophat.cbcb.umd.edu/ SeqMonk http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/ Cufflinks http://cufflinks.cbcb.umd.edu/ DESeq http://www-huber.embl.de/users/anders/DESeq/ Bioconductor http://www.bioconductor.org/ RSEM http://deweylab.biostat.wisc.edu/rsem/