Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.

Biases in RNA-Seq data
Aim: to provide you with a brief overview of the literature on biases in RNA-seq data, so that you become aware of this potential problem.

Biases in RNA-Seq data
Experimental (and computational) biases affect expression estimates and, therefore, subsequent data analysis:
– Differential expression analysis
– Study of alternative splicing
– Transcript assembly
– Gene set enrichment analysis
– Other downstream analyses
We must attempt to avoid, detect and correct these biases.

Bias and Variance
[Figure: panels labelled Bias, No bias, Weak noise, Strong noise]

Sources of Bias and Variance
Sources listed on the slide: ACTG-signal detection; amplification efficiency; DNA quality; background fluorescence; efficiency of RNA extraction and reverse transcription; tissue contamination; RNA degradation.
Systematic (Bias)
– Similar effects on many (all) data of one sample
– Correction can be estimated and removed from the data (normalization)
Stochastic (Variance)
– Effects on single data points of a sample
– Correction cannot be estimated; noise can only be quantified and taken into account (error model)

RNASeq-specific Sources of Bias
– Gene length
– Mappability of reads, e.g., due to sequence complexity (AAAAAAA vs. CTGATGT)
– Position: fragments are preferentially located towards either the beginning or end of transcripts
– Sequence-specific: biased likelihood for fragments being selected; %GC

Normalization for gene length and library size
transcript 1 (size = L): Count = 6
transcript 2 (size = 2L): Count = 12
You cannot conclude that gene 2 has a higher expression than gene 1.

Normalization for gene length and library size
transcript 1 (sample 1): Count = 6, library size = 600
transcript 1 (sample 2): Count = 12, library size = 1200
You cannot conclude that gene 1 has a higher expression in sample 2 compared to sample 1.

RPKM / FPKM
Normalization for gene length and library size
RPKM: Reads per kilobase per million mapped reads
FPKM: Fragments per kilobase per million mapped reads

Within sample normalization
RPKM: Reads per kilobase per million mapped reads
Mortazavi et al (2008) Nature Methods, 5(7), 621
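The definition of RPKM can be checked against the two toy situations from the preceding slides. The following is a minimal sketch, assuming the common form RPKM = 10^9 · count / (library size · length in bp); the function name `rpkm` and the length of 1,000 bp standing in for "size = L" are illustrative assumptions, not values from the slides.

```python
def rpkm(count, length_bp, mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (length_bp * mapped_reads)

L = 1_000  # hypothetical transcript length in bp, standing in for "size = L"

# Same sample, transcript 2 is twice as long and has twice the count.
gene1 = rpkm(6, L, 600)
gene2 = rpkm(12, 2 * L, 600)

# Same transcript, sample 2 has twice the count and twice the library size.
sample1 = rpkm(6, L, 600)
sample2 = rpkm(12, L, 1200)

print(gene1, gene2)      # equal after length normalization
print(sample1, sample2)  # equal after library-size normalization
```

In both cases the raw counts differ by a factor of two, while the normalized values coincide, which is why the raw counts alone do not support a conclusion about expression.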

Examples

FPKM: Fragments per kilobase per million mapped reads
What's the difference between FPKM and RPKM? Paired-end RNA-Seq experiments produce two reads per fragment, but that doesn't necessarily mean that both reads will be mappable. For example, the second read may be of poor quality. If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value. Thus, FPKM is calculated by counting fragments, not reads.
Trapnell et al (2010) Nature Biotechnology, 28(5), 511
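The double-counting issue can be made concrete with a toy example. This is only a sketch: the fragment identifiers and the list of mapped reads are made up for illustration, and `frag2` plays the role of a fragment whose second read could not be mapped.

```python
# Each mapped read is recorded as (fragment id, mate number); hypothetical data.
mapped_reads = [
    ("frag1", 1), ("frag1", 2),   # both mates mapped
    ("frag2", 1),                 # second mate unmappable (e.g. poor quality)
    ("frag3", 1), ("frag3", 2),   # both mates mapped
]

read_count = len(mapped_reads)                                # basis for RPKM
fragment_count = len({frag_id for frag_id, _ in mapped_reads})  # basis for FPKM

print(read_count, fragment_count)
```

Counting reads gives frag1 and frag3 twice the weight of frag2, although all three are single fragments; counting fragments treats them equally.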

Gene Length Bias

Gene Length Bias in Diff. expression
[Figure legend: 33% of highest expressed genes; 33% of lowest expressed genes]
This bias (a) affects comparison between genes or isoforms within one sample and (b) results in more power to detect longer transcripts.
Oshlack and Wakefield (2009) Biology Direct, 4, 14

Question: does this bias disappear when we use RPKM?

Mean-Variance Dependence
Sample variance across lanes in the liver sample from Marioni et al (2008) Genome Research, 18(9), 1509–17.
– Red line: the one third of shortest genes
– Blue line: the longest genes
– Black line: line of equality

Mean-Variance-Length Dependence
Plot B: counts divided by gene length (which you do when using RPKM). The short genes have higher variance for a given expression level than long genes. Because of this change in variance we are still left with a gene length dependency. Thus, RPKM does not fully correct for the gene length bias.
[Figure axes: mean and mean / length versus variance (log 2)]

Reminder Testing

Length Bias affects power
More power to detect longer differentially expressed transcripts.
Effect size: δ. Power is related to δ.
L = gene length
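Why δ grows with gene length can be illustrated numerically. The sketch below rests on an assumption that is not spelled out on the slide: read counts are Poisson distributed with a mean proportional to expression times length, so that for the difference D of two counts, δ = E(D) / SE(D). The function name and the per-base rates are hypothetical.

```python
import math

def effect_size(rate_a, rate_b, length):
    """delta = E(D) / SE(D) for D = X_a - X_b with independent Poisson counts.

    rate_a and rate_b are expected reads per base (hypothetical values);
    the Poisson variance equals the mean, so Var(D) = mean_a + mean_b.
    """
    mean_a, mean_b = rate_a * length, rate_b * length
    return (mean_a - mean_b) / math.sqrt(mean_a + mean_b)

# Same two-fold expression change, transcript lengths 1 kb and 4 kb.
delta_short = effect_size(0.02, 0.01, 1_000)
delta_long = effect_size(0.02, 0.01, 4_000)

print(delta_short, delta_long, delta_long / delta_short)
```

Under this model δ scales with the square root of L: a four-fold longer transcript has twice the effect size for the same fold change, and therefore more power in a test.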

Mappability Bias
Uniquely mapping reads are typically summarized over genomic regions
– Regions with lower sequence complexity will tend to end up with lower sequence coverage
Test: generate all 32nt fragments from hg18 and align them back to hg18
– Each fragment that cannot be uniquely aligned is considered unmappable
– Expect: uniform distribution
Schwartz et al (2011) PLoS One, 6(1), e16685
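The mappability test can be mimicked on a toy sequence. This is a simplified sketch, not the procedure of Schwartz et al: it uses exact string matching on one strand instead of a read aligner, a made-up 16nt "genome", and 4nt fragments instead of 32nt ones.

```python
from collections import Counter

def mappability(genome, k):
    """For each start position, report whether the k-mer there occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]

# Low-complexity stretch followed by a more complex stretch (toy data).
toy_genome = "AAAAAAAA" + "CTGATGTC"
flags = mappability(toy_genome, 4)

print(flags)
print(sum(flags) / len(flags))  # fraction of uniquely mappable start positions
```

The fragments that start inside the poly-A stretch are all identical and therefore unmappable, while every fragment overlapping the complex stretch is unique, which is the pattern behind the lower coverage of low-complexity regions.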

Mappability Bias
Since in RNA-seq we align reads prior to further analysis, this step may already introduce a (slight) bias.
Unexpected because introns are assumed to have lower sequence complexity in general.

Mappability-Length Bias
Reads corresponding to longer transcripts have a higher mappability.

Mappability, Conservation, Expression
[Figure panels: evolutionary conservation; expression level in lung fibroblasts]

Sequence-specific Bias 1
RNA-Seq protocol
Current sequencers require that cDNA molecules represent partial fragments of the RNA
cDNA fragments are obtained by a series of steps (e.g., priming, fragmentation, size selection)
– Some of these steps are inherently random
– Therefore: we expect fragments with starting points approximately uniformly distributed over transcripts

Sequence-specific Bias 1
Biases in Illumina RNA-seq data caused by hexamer priming
Generation of double-stranded complementary DNA (dscDNA) involves priming by random hexamers
– To generate reads across entire transcript length
Turns out to give a bias in the nucleotide composition at the start (5’-end) of sequencing reads
This bias influences the uniformity of coverage along the transcript
Hansen et al (2010) NAR, 38(12), e131
Li et al (2010) Genome Biology, 11(5), R50
Schwartz et al (2011) PLoS One, 6(1), e16685
Roberts et al (2011) Genome Biology, 12, R22

Sequence-specific Bias 1
[Figure: a read (35nt) aligned to a transcript; position 1 in the read corresponds to position x in the transcript]
Determine the relative frequencies of the nucleotides (A, C, T, G) at each position in the read, considering all reads. What do we expect?
We would expect the frequency of each nucleotide to be about equal across the different positions.
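Tabulating the nucleotide frequencies per read position takes only a few lines. The sketch below uses four made-up 4nt reads; the function name `position_frequencies` is illustrative, and real data would of course come from a FASTQ or alignment file.

```python
from collections import Counter

def position_frequencies(reads, n_positions):
    """Relative frequency of A, C, G, T at each read position (0-based)."""
    table = []
    for pos in range(n_positions):
        counts = Counter(read[pos] for read in reads if len(read) > pos)
        total = sum(counts.values())
        table.append({nt: (counts[nt] / total if total else 0.0) for nt in "ACGT"})
    return table

reads = ["ACGT", "AGGT", "ACTT", "CCGT"]  # toy reads
freqs = position_frequencies(reads, 4)

for pos, row in enumerate(freqs, start=1):
    print(pos, row)
```

With unbiased priming, each row of such a table would resemble the overall nucleotide composition of the transcriptome; systematic deviations in the first positions are the signature discussed on the following slides.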

Sequence-specific Bias 1
[Figure: nucleotide frequencies by read position; first hexamer = shaded]
– Frequencies slightly deviate between experiments but show identical relative behavior
– Distribution after position 13 reflects nt composition of transcriptome
– Effect is independent of study, laboratory, and organisms
– Apparently, hexamer priming is not completely random
– These patterns do not reflect sequencing error and cannot be removed by 5’ trimming!

Sequence-specific Bias correction
Aim
– Adjust biased nucleotide frequencies at the beginning of the reads to make them similar to the distribution at the end of the reads (which is assumed to be representative for the transcriptome)
Approach
1. Associate a weight with each read
2. Choose the weights such that they down- or up-weight reads with a heptamer at the beginning of the read that is over- or under-represented
Determine the expression level of a region by adjusting the counts, i.e. multiplying them with the weights.

Sequence-specific Bias correction
Read (assume read of at least 35nt)
[Figure: log(w), comparing heptamers (h) at positions i = 1, 2 of reads with heptamers (h) at positions i = (values missing in the source) of reads; box = heptamer]
Weights are determined over all possible 4^7 = 16384 heptamers.
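The reweighting idea can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the weight of a heptamer is taken as its mean proportion at "reference" start positions divided by its mean proportion at the "biased" start positions. Start positions are 0-based in the code (the slide's i = 1, 2 become 0, 1), the reference positions are a free parameter because they are not recoverable from the slide, and the toy reads use a single position of each kind.

```python
from collections import Counter

K = 7  # heptamer length

def heptamer_proportions(reads, start):
    """Proportion of each heptamer among reads, taken at a 0-based start position."""
    counts = Counter(r[start:start + K] for r in reads if len(r) >= start + K)
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}

def mean_proportions(reads, starts):
    """Average the heptamer proportions over several start positions."""
    acc = Counter()
    for s in starts:
        for h, p in heptamer_proportions(reads, s).items():
            acc[h] += p / len(starts)
    return acc

def heptamer_weights(reads, biased_starts, reference_starts):
    """w(h) = mean proportion at reference positions / mean proportion at biased positions."""
    biased = mean_proportions(reads, biased_starts)
    reference = mean_proportions(reads, reference_starts)
    return {h: reference[h] / biased[h] for h in biased if reference[h] > 0}

# Toy reads: ACGTACG is over-represented at the read start.
reads = ["ACGTACGTACG", "ACGTACGTTTT", "TTTTTTTTTTT", "ACGTACGGGGG"]
weights = heptamer_weights(reads, biased_starts=(0,), reference_starts=(4,))

# Adjusted count of a region: sum the weights of the reads that start in it.
adjusted_count = sum(weights.get(r[:K], 1.0) for r in reads)
print(weights, adjusted_count)
```

In the toy data the heptamer ACGTACG opens three of four reads but accounts for only a quarter of the heptamers further inside the reads, so reads starting with it are down-weighted to one third each.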

Example: gene YOL086C in yeast
[Figure: base level counts at each position; unmappable bases marked]
Extreme expression values are removed but coverage is still far from uniform.

Synthetic Spike-in standards
External RNA Control Consortium (ERCC)
ERCC RNA standards
– range of GC content and length
– minimal sequence homology with endogenous transcripts from sequenced eukaryotes
FPKM normalization

Read counts for each ERCC transcript in two different libraries of human RNA-seq with 2% ERCC spike-ins.
Count versus concentration*length (mass) per ERCC. Pool of 44 2% ERCC spike-in H. sapiens libraries.
Results suggest systematic bias: better agreement between the observed read counts from replicates than between the observed read counts and expected concentration of ERCCs within a given library.

Transcript-specific sources of error
Transcript-specific biases affect comparisons of read counts between different RNAs in one library.
Fold deviation between observed and expected depends on
– Read count
– GC content
– Transcript length
[Figure: fold deviation between observed and expected read count for each ERCC in the 100% ERCC library]

Position effect
Read coverage biases: single ERCC RNA

Position effect
Read coverage biases: 96 ERCC RNAs
Average relative coverage along all control RNAs for ERCC spiked in 44 H. sapiens libraries. Dashed lines represent 1 SD around the average across different libraries.
Suggested: drop in coverage at 3’-end due to the inherently reduced number of priming positions at the end of the transcript.

Technical bias: cDNA library preparation

DNA library preparation: RNA fragmentation and cDNA fragmentation compared.
Fragmentation of oligo-dT primed cDNA (blue line) is more biased towards the 3′ end of the transcript. RNA fragmentation (red line) provides more even coverage along the gene body, but is relatively depleted for both the 5′ and 3′ ends.
Wang et al (2009) Nature Reviews Genetics, 10(1), 57–63.

RNAseq may detect other RNA species Tarazona et al (2011) Genome research, 21(12), 2213–23

Normalization

Between sample normalization
– Housekeeping genes (Bullard et al, 2010)
– Quantile normalization (Irizarry et al, 2003)
– Variance stabilization (v. Heydebreck et al, 2003; Anders et al, 2010)
– Loess normalization (Dudoit et al, 2002)

Potential problems with RPKM
[Figure: scatterplot of Log Green versus Log Red, and the corresponding M-A plot with A = (Log Green + Log Red) / 2 and M = Log Red - Log Green]
Rescaling is a global method: all transcripts of a sample are scaled by the same number.

Quantile Normalization
Idea: "All expression distributions look identical"
This assumption is stronger than for RPKM normalization, which only assumes that the means of the expression distributions are identical.
Algorithm
– Order transcripts by their RPKM in each sample
– Replace the n-th largest value (in each sample) by the mean of all n-th largest values (in all samples). Do this for all n = 1, 2, …
[Figure: boxplot of expression values after quantile normalization]
Drawback: Differentially expressed transcripts in the low / high expressed tails tend to be overlooked.
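The two algorithm steps translate almost directly into code. The following is a minimal sketch on made-up values for two samples with three transcripts each; it sorts in ascending order, which is equivalent to working with the n-th largest value, and it breaks ties by position rather than averaging them.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (equal-length lists of expression values)."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean of the values sharing the same rank across all samples.
    rank_means = [sum(s[r] for s in sorted_samples) / len(samples) for r in range(n)]

    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # transcript indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

samples = [[5.0, 2.0, 3.0],   # toy values, sample 1
           [4.0, 1.0, 6.0]]   # toy values, sample 2
normalized = quantile_normalize(samples)
print(normalized)
```

After normalization both samples contain exactly the same set of values, only assigned to different transcripts, which is the sense in which all expression distributions are forced to look identical.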

Variance Stabilization
Let X_μ, μ ∈ [a,b], be a set of gene expression measurements with expectation value E(X_μ) = μ and variance Var(X_μ) = v(μ).
[Figure: v(μ) plotted against μ, and X_μ plotted against μ]
We search for a transformation T such that Var(T(X_μ)) ≈ const.

Variance Stabilization
How does variance stabilization help?
After variance stabilization, the data are homoscedastic, i.e. the variance of all measurements T(X_μ), μ ∈ [a,b], is (approximately) equal.
Homoscedastic data allow the application of more reliable tests, or increase the power of tests (e.g., t-test).

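Variance Stabilization
[Figure: graph of y = T(x) with its tangent at (μ, T(μ)); a deviation δ around μ on the x-axis maps to approximately T´(μ)·δ around T(μ) on the y-axis]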
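The tangent picture corresponds to a first-order (delta method) approximation. The derivation below is a sketch in the notation of the preceding slides, with X_μ, v(μ) and T as defined there; it is the standard argument rather than a formula recovered from the slide.

```latex
T(X_\mu) \approx T(\mu) + T'(\mu)\,(X_\mu - \mu)
\quad\Longrightarrow\quad
\operatorname{Var}\bigl(T(X_\mu)\bigr) \approx T'(\mu)^2 \, v(\mu)

T'(\mu)^2 \, v(\mu) = \text{const}
\quad\Longrightarrow\quad
T(x) = \int^{x} \frac{du}{\sqrt{v(u)}}
```

For purely multiplicative noise, v(u) proportional to u^2, the integral gives the logarithm. For additive plus multiplicative noise, v(u) = σ^2 + u^2 up to a constant factor, it gives asinh(x/σ).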

Variance Stabilization
[Figure: same vs. same experiment shown on absolute scale and on log scale; regions of additive noise and multiplicative noise are marked]

Variance Stabilization
[Figure legend: dashed line f(x) = log(x); solid line h_σ(x) = asinh(x/σ)]
P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002
W. Huber et al., ISMB 2002
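How the two curves relate can be checked numerically. The sketch below uses only the standard identity asinh(z) = log(z + sqrt(z^2 + 1)); the value σ = 1 and the test intensities are arbitrary choices for illustration.

```python
import math

def h(x, sigma):
    """Variance-stabilizing transformation h_sigma(x) = asinh(x / sigma)."""
    return math.asinh(x / sigma)

sigma = 1.0  # arbitrary illustrative value

# For large intensities asinh(x / sigma) behaves like log(2 * x / sigma).
large = 1e4
gap_large = h(large, sigma) - math.log(2 * large / sigma)

# Near zero it is approximately linear and stays defined for x <= 0.
at_zero = h(0.0, sigma)
at_negative = h(-1.0, sigma)

print(gap_large, at_zero, at_negative)
```

For high intensities the transformation is a shifted logarithm, so log-ratio interpretations carry over, while background-subtracted values at or below zero, where log(x) is undefined, remain usable.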

Fold change shrinkage
Replace the log ratio by the "glog" ratio (c_1, c_2 experiment-specific parameters).
[Figure: estimated log fold versus expression]
The glog ratio is a shrinkage estimator: genes with low expression are biased to a fold of 1 (log fold of 0). This, however, reduces the variance of this estimator. Such an estimator is useful if you want to guard against false positives.

Loess Normalization
A = (Log Green + Log Red) / 2
M = Log Red - Log Green
Assumption: There exists an expression-dependent bias f.
Task: Find f, replace y_j by y_j - f(x_j).
The idea of local regression (loess) is that f can be described locally by a line or by a polynomial of small degree, which can be estimated from the data.
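The task "find f, replace y_j by y_j - f(x_j)" can be sketched with a hand-written local linear fit. This is a simplified illustration, not a full loess implementation: it uses tricube weights over a nearest-neighbour window, no robustness iterations, base-2 logarithms as an assumption, and made-up data with a purely linear bias. The function names and the `frac` parameter are illustrative.

```python
import math

def ma_values(red, green):
    """A and M values from red and green intensities (log base 2 assumed)."""
    a = [(math.log2(g) + math.log2(r)) / 2 for r, g in zip(red, green)]
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    return a, m

def local_linear_fit(x, y, x0, frac=0.5):
    """Tricube-weighted least-squares line around x0, evaluated at x0.

    Assumes x0 is one of the data points, so the total weight is positive.
    """
    k = max(2, int(frac * len(x)))
    bandwidth = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
    w = [(1 - min(abs(xi - x0) / bandwidth, 1.0) ** 3) ** 3 for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def loess_normalize(a, m, frac=0.5):
    """Replace each M_j by M_j - f(A_j), with f estimated by local regression."""
    return [mj - local_linear_fit(a, m, aj, frac) for aj, mj in zip(a, m)]

# Toy M-A data with an expression-dependent bias f(A) = 0.5 * A - 1.
a_demo = [float(i) for i in range(1, 11)]
m_demo = [0.5 * ai - 1.0 for ai in a_demo]
m_norm = loess_normalize(a_demo, m_demo)
print(m_norm)
```

Because the toy bias is exactly linear, the local fits recover it and the normalized M values are zero up to rounding. With real data the window fraction controls how flexible the estimated f is.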

Loess Normalization
[Figure: lowess regression fitted to external spike-in controls versus lowess regression fitted to all genes]

Summary
Be aware of different types of bias
– Try to avoid
– Try to detect
– Try to correct

Acknowledgements
Antoine van Kampen, Academic Medical Center Amsterdam
Wolfgang Huber, EMBL Heidelberg