Presentation is loading. Please wait.

Presentation is loading. Please wait.

Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.

Similar presentations


Presentation on theme: "Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data."— Presentation transcript:

1 Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

2 Biases in RNA-Seq data Aim: to provide you with a brief overview of literature about biases in RNA-seq data such that you become aware of this potential problem.

3 Experimental (and computational) biases affect expression estimates and, therefore, subsequent data analysis: – Differential expression analysis – Study of alternative splicing – Transcript assembly – Gene set enrichment analysis – Other downstream analyses We must attempt to avoid, detect and correct these biases Biases in RNA-Seq data

4 Bias No bias Weak noise Strong noise x Bias and Variance

5 ACTG-signal detection Amplification efficiency DNA QualityDNA quality Background fluorescence Efficiency of RNA Extraction, Reverse Transcription Tissue contamination Efficiency of RNA Extraction, Reverse Transcription Tissue contamination RNA Degradation Similar effects on many (all) data of one sample Correction can be estimated and removed from the data (normalization) Systematic (Bias)Stochastic (Variance) Effects on single data points of a sample Correction can not be estimated, noise can only be quantified and taken into account (error model) Sources of Bias and Variance

6 Gene length Mappability of reads – e.g., due to sequence complexity (AAAAAAA vs. CTGATGT) Position – Fragments are preferentially located towards either the beginning or end of transcripts Sequence-specific – biased likelihood for fragments being selected – %GC RNASeq-specific Sources of Bias

7 transcript 1 (size = L) transcript 2 (size=2L) Count =6 Count = 12 You cannot conclude that gene 2 has a higher expression than gene 1 Normalization for gene length and library size

8 transcript 1 (sample 1) Count =6, library size = 600 You cannot conclude that gene 1 has a higher expression in sample 2 compared to sample 1 transcript 1 (sample 2) Count =12, library size = 1200 Normalization for gene length and library size

9 RPKM / FPKM Normalization for gene length and library size RPKM: Reads per kilobase per million mapped reads FPKM: Fragments per kilobase per million mapped reads

10 Mortazavi et al (2008) Nature Methods, 5(7), 621 RPKM: Reads per kilobase per million mapped reads Within sample normalization

11 Examples

12 Examples

13 Examples

14 Examples

15 What's the difference between FPKM and RPKM? Paired-end RNA-Seq experiments produce two reads per fragment, but that doesn't necessarily mean that both reads will be mappable. For example, the second read is of poor quality. If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value. Thus, FPKM is calculated by counting fragments, not reads. Trapnell et al (2010) Nature Biotechnology, 28(5), 511 RPKM: Fragments per kilobase per million mapped reads

16 Gene Length Bias

17 Oshlack and Wakefield (2009) Biology Direct, 16, 4 33% of highest expressed genes 33% of lowest expressed genes This bias (a) affects comparison between genes or isoforms within one sample and (b) results in more power to detect longer transcripts Gene Length Bias in Diff. expression

18 Question: does this bias disappear when we use RPKM?

19 Sample variance across lanes in the liver sample from the Marioni et al (2008) Genome research, 18(9), 1509–17. Red line: for the one third of shortest genes Blue line: for the longest genes. Black line: line of equality Mean-Variance Dependence

20 Plot B: Counts divided by gene length (which you do when using RPKM). The short genes have higher variance for a given expression level than long genes. Because of the change in variance we are still left with a gene length dependency.  Thus, RPKM does not fully correct mean mean / length variance (log 2) Mean-Variance-Length Dependence

21 Reminder Testing

22 More power to detect longer differentially expressed transcripts Effect size. Power is related to δ L = gene length Length Bias affects power

23 Uniquely mapping reads are typically summarized over genomic regions – Regions with lower sequence complexity will tend to end up with lower sequence coverage Test: generate all 32nt fragments from hg18 and align them back to hg18 – Each fragment that cannot be uniquely aligned is considered unmappable – Expect: uniform distribution Schwartz et al (2011) PLoS One, 6(1), e16685 Mappability Bias

24 Since in RNA-seq we align reads prior to further analysis, this step may already introduce a (slight) bias. Unexpected because introns are assumed to have lower sequence complexity in general Mappability Bias

25 Reads corresponding to longer transcripts have a higher mappability Mappability-Length Bias

26 Mappability, Conservation, Expression Evolutionary conservation Expression level in lung fibroblasts

27 RNA-Seq protocol Current sequencers require that cDNA molecules represent partial fragments of the RNA cDNA fragments are obtained by a series of steps (e.g., priming, fragmentation, size selection) – Some of these steps are inherently random – Therefore: we expect fragments with starting points approximately uniformly distributed over transcripts Sequence-specific Bias 1

28 Biases in Illumina RNA-seq data caused by hexamer priming Generation of double-stranded complementary DNA (dscDNA) involves priming by random hexamers – To generate reads across entire transcript length Turns out to give a bias in the nucleotide composition at the start (5’-end) of sequencing reads This bias influences the uniformity of coverage along transcript Hansen et al (2010) NAR, 38(12), e131 Li et al (2010) Genome Biology, 11(5), R50 Schwartz et al (2011) PLoS One, 6(1), e16685 Roberts et al(2011) Genome Biology, 12: R22 Sequence-specific Bias 1

29 transcript read (35nt) A A position 1 in read position x in transcript Determine the nucleotide frequencies considering all reads: What do we expect?? Position in read Nucl 1 2 3 4 5......... 35 We would expect that the frequencies for these nucleotides at the different positions are about equal Sequence-specific Bias 1 ACTGACTG relative frequencies

30 Frequencies slightly deviate between experiments but show identical relative behavior Distribution after position 13 reflects nt composition of transcriptome Effect is independent of study, laboratory, and organisms Apparently, hexamer priming is not completely random These patterns do not reflect sequencing error and cannot be removed by 5’ trimming! first hexamer = shaded Sequence-specific Bias 1

31 Aim – Adjust biased nucleotide frequencies at the beginning of the reads to make them similar to distribution of the end of the reads (which is assumed to be representative for the transcriptome) Approach 1.Associate weight with each read such that they 2.down- or up-weight reads with heptamer at beginning of read that is over/under-represented  Determine expression level of region by adjusting counts by multiplying them with the weights Sequence-specific Bias correction

32 Read (assume read of at least 35nt) = heptamer heptamers (h) at positions i=1,2 of reads heptamers (h) at positions i=24..29 of reads log(w) Weights are determined over all possible 4 7 =16.384 heptamers Sequence-specific Bias correction

33  Base level counts at each position Extreme expression values are removed but coverage is still far from uniform unmapple bases Example: gene YOL086C in yeast

34 External RNA Control Consortium (ERCC) ERCC RNA standards – range of GC content and length – minimal sequence homology with endogenous transcripts from sequenced eukaryotes FPKM normalization Synthetic Spike-in standards

35 Results suggest systematic bias: better agreement between the observed read counts from replicates than between the observed read counts and expected concentration of ERCC’s within a given library. Count versus concentration*length (mass) per ERCC. Pool of 44 2% ERCC spike-in H. Sapiens libraries Read counts for each ERCC transcript in two different libraries of human RNA-seq with 2% ERCC spike-ins

36 Transcript-specific sources of error Transcript specific biases affect comparisons of read counts between different RNAs in one library Fold deviation between observed and expected depends on Read count GC content Transcript length Fold deviation between observed and expected read count for each ERCC in the 100% ERCC library

37 Read coverage biases: single ERCC RNA Position effect

38 Read coverage biases: 96 ERCC RNAs Average relative coverage along all control RNAs for ERCC spiked in 44 H. sapiens libraries. Dashed lines represent 1 SD around the average across different libraries. Suggested: drop in coverage at 3’-end due to the inherently reduced number of priming positions at the end of the transcript Position effect

39 Technical bias: cDNA library preparation

40 DNA library preparation: RNA fragmentation and cDNA fragmentation compared. Fragmentation of oligo-dT primed cDNA (blue line) is more biased towards the 3′ end of the transcript. RNA fragmentation (red line) provides more even coverage along the gene body, but is relatively depleted for both the 5′ and 3′ ends. Wang et al (2009) Nature reviews. Genetics, 10(1), 57–63.

41 RNAseq may detect other RNA species Tarazona et al (2011) Genome research, 21(12), 2213–23

42 Normalization

43 Housekeeping genes (Bullard et al, 2010) Quantile normalization (Irizarry et al, 2003) Variance stabilization (v. Heydebreck et al, 2003, Anders et al. 2010) Loess normalization (Dudoit et al, 2002) Between sample normalization

44 Log Grün Log Rot Scatterplot A = (Log Grün + Log Rot) / 2 M = Log Rot - Log Grün M-A Plot Rescaling is a global method: all transcripts of a sample are scaled by the same number. Potential problems with RPKM

45 Idea: „All expression distributions look identical “ This assumption is stronger than for RPKM normalization, which only assumes that the mean of the expression distributions are identical. Algorithm Order transcripts by their RPKM in each sample Replace the n-th largest value (in each sample) by the mean of all n-th largest values (in all samples). Do this for all n=1,2,… Boxplot of expression values after quantile normalization Drawback: Differentially expressed transcripts in the low / high expressed tails tend to be overlooked. Quantile Normalization

46 Let X μ, μ є [a,b], be a set of gene expression measurements with expectation value E(X μ ) = μ. And variance Var(X μ ) = v(μ). v( μ) XμXμ μ μ We search a transformationT such that Var(T(X μ )) ≈ const. Variance Stabilization

47 What does variance stabilization help? After variacne stabilization, the data are homoschedastic, i.e. the variance of all measurements T(X μ ), μ є [a,b], is (approximately) equal. Homoschedastic data allow the applicatino of more reliable tests, or increase the power of tests (e.g., t- test). Variance Stabilization

48 X T(X) y=T(x) μ δδ Tangent in (X,T(X)) T(μ) T´(μ)·δ Variance Stabilization

49 Additive noise Multiplicative noise Absolute scaleLog scale Same vs. Same Experiment Variance Stabilization

50 - - - f(x) = log(x) ——— h σ (x) = asinh(x/σ) P. Munson, 2001 D. Rocke & B. Durbin, ISMB 2002 W. Huber et al., ISMB 2002 Variance Stabilization

51 Replace the log ratio by the „glog“ ratio (c 1,c 2 experiment-specific parameters) Estimated log fold Expression The glog ratio is a shrinkage estimator: Genes with low expression are biased to a fold of 1 (log fold of 0). This however reduces the variance of this estiator. Such an estimator is useful if you want to guard against false positives. Fold change shrinkage

52 A = (Log Green + Log Red) / 2 M = Log Red - Log Green Assumption: There exists an expression-dependent bias Taks: Find f, replace y j by y j - f(x j ). f The idea of local regression (loess) is that f can be described locally by a line or by a polynomial of small degree, which can be estimated from the data Loess Normalization

53 External Spike- In controls Lowess Regression fitted to spike- ins Lowess Regression fitted to all genes External Spike- In controls Loess Normalization

54 Be aware of different types of bias Try to avoid Try to detect Try to correct Summary

55 Antoine van Kampen Academic Medical Center Amsterdam Acknowledgements Wolfgang Huber EMBL Heidelberg


Download ppt "Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data."

Similar presentations


Ads by Google