Edouard Severing RNA seq (I). A typical heat stress experiment ( climate change ) Control Heat stress ( convection ) 5 days 85 minutes How does the frog.


1 Edouard Severing RNA seq (I)

2 A typical heat stress experiment (climate change). Figure labels: Control, Heat stress (convection), 5 days, 85 minutes. How does the frog adapt and survive? Economically important frog.

3 Coping with heat stress. The frog likely has to change several processes in order to cope with the heat stress: adapting metabolic pathways; preventing water loss through the skin; changing the concentration of several enzymes, other proteins and molecules. We want to determine these changes in molecule concentration, starting with proteins.

4 Changes at the molecular level. We could measure protein concentration directly, but this is not often done on a large scale. Alternatively, we could measure changes in the expression of the genes that encode these proteins. Gene expression can be approximated by measuring the amount of mRNA molecules that are produced by the gene.

5 Gene count and complexity (figure comparing gene counts; the numbers are not preserved in the transcript)

6 From genes to proteins (I). Initial assumption: N protein-coding genes, N mRNA molecules, N proteins. The assumption is based on studies that were performed on bacterial systems.

7 From genes to proteins (II). Current view: N protein-coding genes, X × N mRNA molecules, ? × N proteins. What happens here?

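8 Splicing. Figure: a gene with exons and introns is transcribed into pre-mRNA (5’ to 3’); splicing removes the introns and yields the mRNA (5’ to 3’).

9 Alternative splicing. Figure: one pre-mRNA (5’ to 3’) is spliced in different ways, giving different mRNAs (5’ to 3’).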

10 Gene count and complexity. About 60% of plant genes and about 90% of human genes undergo alternative splicing (AS). The average number of transcripts produced by human genes is also higher than the average number of transcripts produced by plant genes.

11 An extreme case: the Dscam gene produces over 38,000 different transcripts.

12 Major alternative splicing event types: exon skipping and intron retention. In humans, exon skipping is the most frequent AS event type. In plants, intron retention is the most common AS event type.

13 RNA editing. Primary transcript (predicted sequence): U A C G A C. After RNA editing (observed sequence): U A U G A C. Difficulty: distinguishing genuine RNA editing from sequencing errors.

14 Not everything is translated. A large fraction (>30%) of the transcripts of protein-coding genes is degraded by the nonsense-mediated decay (NMD) pathway. The position of the stop codon is used to predict whether a transcript is likely to be degraded by the NMD pathway.

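15 Detecting putative NMD candidates. Figure: the pre-mRNA (5’ to 3’) and the spliced mRNA (5’ to 3’) with its exon/exon junctions and the open reading frame from the start codon (M) to the stop codon. A transcript is a putative NMD candidate when the stop codon lies more than 50-55 nt upstream of the last exon/exon junction (d > 50-55 nt).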
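
The distance rule on this slide can be sketched as a small check. This is only an illustration under stated assumptions: positions are mRNA coordinates, d is measured from the end of the stop codon to the last exon/exon junction, and the function name `is_nmd_candidate`, its inputs and the default threshold of 50 nt are hypothetical choices that do not come from the slides.

```python
def is_nmd_candidate(stop_codon_end, exon_lengths, min_distance=50):
    """Flag a transcript as a putative NMD target.

    stop_codon_end: 1-based mRNA position of the last base of the stop codon.
    exon_lengths:   exon lengths in 5'->3' order (hypothetical input format).
    min_distance:   threshold for d in nt (the slide gives 50-55 nt).
    """
    if len(exon_lengths) < 2:
        return False  # no exon/exon junction, so the rule cannot apply
    # The last junction sits after the combined length of all but the last exon.
    last_junction = sum(exon_lengths[:-1])
    d = last_junction - stop_codon_end
    return d > min_distance


# Three exons of 200, 150 and 300 nt: the last junction is at position 350.
early_stop = is_nmd_candidate(250, [200, 150, 300])   # d = 100
normal_stop = is_nmd_candidate(500, [200, 150, 300])  # stop in the last exon
```

A stop codon in the last exon gives a negative d and is never flagged, which matches the idea that only premature stop codons trigger the pathway.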

16 Remember: The number of unique mRNA molecules is much larger than the number of genes. A large fraction of the mRNA molecules is degraded by the NMD pathway. NMD provides a means to regulate gene expression at the post-transcriptional level.

17 Process the frogs into reads for analysis: freeze (N2), grind, prepare for sequencing, sequencing, bioinformatics. Example reads: >s1 ATCGTAGGGTA >s2 ATGGCCTAGGT

18 Basic transcriptome analysis steps. Many research questions require the following steps: reconstruction of the transcriptome (we usually only have fragments); quantification of the transcriptome; differential expression analysis; other fun stuff.

19 de novo transcriptome reconstruction (I)

20 de novo transcriptome reconstruction (II)

21 Genome-guided transcriptome reconstruction Genome 5’- -3’ mRNA

22 Genome-guided transcriptome reconstruction

23

24 Remember. De novo transcriptome assembly: used when no reference genome is available, or for finding features that are not on the reference genome (T-DNA insertion). Programs: Trinity, Trans-ABySS, Velvet/Oases. Genome-guided transcriptome reconstruction: used when a reference genome is available, with or without annotation. Mapping programs: TopHat, GSNAP. Transcriptome reconstruction: Scripture, Cufflinks.

25 Edouard Severing RNA seq (II) Quantification

26 A typical heat stress experiment Control Heat stress (convection)

27 Raw counts: counting the number of reads/fragments that fall within the exonic regions of a gene. Example program: htseq-count (HTSeq). Figure: a gene with Exon 1, Exon 2, Exon 3 and Exon 4.
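
Raw counting can be illustrated with a few lines of code. This is a minimal sketch, not the algorithm used by htseq-count: it assumes reads and exons are given as 1-based inclusive (start, end) intervals on the same chromosome and strand, and it counts a read as soon as it overlaps any exon. The function name and the example coordinates are made up for illustration.

```python
def count_reads_in_exons(reads, exons):
    """Count reads that overlap at least one exon of a gene.

    reads: list of (start, end) alignment intervals, 1-based inclusive.
    exons: list of (start, end) exon intervals of one gene, same convention.
    """
    count = 0
    for r_start, r_end in reads:
        # Two closed intervals overlap when each starts before the other ends.
        if any(r_start <= e_end and e_start <= r_end for e_start, e_end in exons):
            count += 1
    return count


exons = [(100, 200), (300, 400)]
reads = [(90, 120), (210, 250), (350, 380), (390, 420), (500, 530)]
raw_count = count_reads_in_exons(reads, exons)
```

Here the reads at 210-250 (intronic) and 500-530 (outside the gene) are not counted.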

28 The same fragment count yet different expression levels: library size matters. Figure: reads on Exon 1 in different libraries.

29 The same fragment count yet different expression levels: transcript/gene length matters. Figure: a gene with Exon 1 to Exon 4 next to a gene with a single exon (Exon 1).

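30 Normalizing/correcting for feature length and library size: $\mathrm{RPKM} = \dfrac{10^{9} \times \text{reads mapped to region}}{\text{all mapped reads} \times \text{feature length (nt)}}$. The worked example on the slide (input values only partly preserved: "10,000, nt") gives RPKM ≈ 1.7.

31 Normalizing/correcting for feature length. FPKM is analogous to RPKM, but counts fragments instead of reads. A different picture emerges from raw counts and from RPKM/FPKM values. Figure labels: RPKM = 2, FPKM = 1, RPKM = 1.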
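
The normalization can be written as a one-line function. The sketch below uses the standard RPKM definition (reads per kilobase of feature per million mapped reads); the function name and the input values are hypothetical, since the numbers of the slide's own example are not fully preserved.

```python
def rpkm(reads_in_region, total_mapped_reads, feature_length_nt):
    """Reads per kilobase of feature per million mapped reads."""
    # 1e9 = 1e3 (nt per kilobase) * 1e6 (reads per million)
    return 1e9 * reads_in_region / (total_mapped_reads * feature_length_nt)


# Hypothetical example: 500 reads on a 2,000 nt gene, 20 million mapped reads.
value = rpkm(500, 20_000_000, 2_000)

# The same read count on a gene twice as long halves the RPKM.
value_longer_gene = rpkm(500, 20_000_000, 4_000)
```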

32 Counting method issues. What should we do with reads that map to multiple isoforms (alternative splicing) or to multiple genes (figure: Gene 1 and Gene 2, with Isoform 1 and Isoform 2)? Pure random assignment? No, expression levels can differ. Count them multiple times? No, each read has been derived from a single transcript.

33 Count issues: Back to the gene level (I)

34 Count issues: Back to the gene level (II)

35 Statistical methods: Expression levels of transcripts

36 Fishing in the dark lake experiment Question: What fraction (t) of the fish in the lake is green? Method: We catch a number of fish and determine what fraction is green. Caution: Fish have to be immediately thrown back in the water.

37 Fishing in the dark lake results (I). Sane people would do: fraction of fish that is green = number of green fish in the sample (X) divided by the number of fish in the sample, here t = 1/3.

38 Fishing in the dark lake results: maximum likelihood estimate of t. Figure: the sample (X) and a plot of P(t) against t; the maximum likelihood estimate of t is the value at which the curve peaks.
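
The maximum likelihood idea can be tried out numerically. The sketch below assumes a sample of 3 fish of which 1 is green (the slides only give the resulting fraction t = 1/3, so the sample size is an assumption) and uses a binomial likelihood, which is appropriate because every fish is thrown back before the next catch. The function names and the simple grid search are illustrative choices.

```python
from math import comb


def binomial_likelihood(t, n_green, n_total):
    """Probability of catching n_green green fish in n_total catches
    when a fraction t of the fish in the lake is green."""
    return comb(n_total, n_green) * t**n_green * (1 - t) ** (n_total - n_green)


def ml_estimate(n_green, n_total, steps=3000):
    """Grid search for the value of t with the highest likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: binomial_likelihood(t, n_green, n_total))


t_hat = ml_estimate(1, 3)  # close to 1/3, the observed fraction
```

For this simple model the maximum likelihood estimate coincides with what "sane people would do": the observed fraction of green fish.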

39 Fishing in a complex dark lake. Transcript quantification using RNA-seq is like fishing in a dark lake with fragmented fish. We are also forced to determine the possible origin(s) of the fish fragments. Figure caption: only lost an eye and a fin but not its tail.

40 Estimating relative transcript abundances. Target: Transcript 1 and Transcript 2 with relative abundances α1 and α2. Observation: after fragmentation, sequencing and read mapping we have reads such as >s1 ATCGTAGGGTA >s2 ATGGCCTAGGT. Which values of α1 and α2 give the highest probability of observing these reads (α1 + α2 = 1)?

41 Maximum likelihood alignments. The likelihood of our observation ($\mathcal{L}$) corresponds to the product of the probabilities of observing each of the individual mapped reads ($r_j$) in our set ($R$): $\mathcal{L} = \prod_{r_j \in R} P(r_j)$

42 Probability of observing a read. The probability of observing a read $r_j$ is the sum of the individual probabilities that the read originates from each transcript ($t$) in our transcript set ($T$): $P(r_j) = \sum_{t \in T} P(r_j, t)$, where $P(r_j, t)$ is the probability that $r_j$ originated from transcript $t$.

43 Component 1: Compatibility. Does read j map to transcript t? Example with three transcripts (t = 1, 2, 3): $K_{j1} = 1$, $K_{j2} = 1$, $K_{j3} = 0$.

44 Component 2: Sequencing a read from a specific transcript. The probability of "sequencing" a read from transcript t is the product of the relative expression level ($\alpha_t$) and the length ($l_t$) of transcript t, normalized over all transcripts: $\dfrac{\alpha_t l_t}{\sum_{k \in T} \alpha_k l_k}$

45 Component 2: Sequencing a read from a specific transcript. Why $\alpha_t l_t$ and not just $\alpha_t$? Longer transcripts produce more fragments than shorter transcripts at equal expression levels. Figure: two transcripts with α1 = α2; after fragmentation the longer one yields more fragments.

46 Component 2 (example): $l_1 = 200$, $\alpha_1 = 0.3$; $l_2 = 150$, $\alpha_2 = 0.2$; $l_3 = 50$, $\alpha_3 = 0.5$. Adjust for length, then normalize.

47 Component 3: Probability of originating from position q on transcript t. In the case of no bias, every position on the transcript is equally likely.

48 Components: Fragment comes from a certain position of the transcript (I). Figure labels: 300 nt, 200 nt, more likely; axis: occurrence.

49 Components: Fragment comes from a certain position of the transcript (II). Not all regions are equally covered. Figure axis: frequency.
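
The "adjust for length, then normalize" step can be checked with the numbers on this slide. The function name below is a hypothetical choice; the computation simply multiplies each relative expression level by the transcript length and divides by the total.

```python
def read_origin_probabilities(lengths, alphas):
    """Probability that a sequenced read comes from each transcript:
    alpha_t * l_t, normalized so that the values sum to 1."""
    weights = [l * a for l, a in zip(lengths, alphas)]  # adjust for length
    total = sum(weights)
    return [w / total for w in weights]                 # normalize


# Values from the slide: l = (200, 150, 50), alpha = (0.3, 0.2, 0.5).
probs = read_origin_probabilities([200, 150, 50], [0.3, 0.2, 0.5])
# Length-adjusted weights are 60, 30 and 25 (total 115).
```

Transcript 3 has the highest relative expression level (0.5) but, being the shortest, contributes the fewest reads (25/115), while transcript 1 contributes the most (60/115).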

50 Search for abundances that best explain the observed fragments (Trapnell et al). The method used to find the optimum differs per program.

51 Uncertainty in expression estimate. The statistical methods can also provide an indication of the uncertainty in the expression estimates. One of the sources of that uncertainty is reads that do not map uniquely. Figure axes: FPKM, occurrence.

52 Remember: The statistical methods calculate the expression level of each transcript. The gene expression can then be obtained by simply summing the expression levels of its isoforms. Example: Isoform 1 (RPKM = 5) + Isoform 2 (RPKM = 6) gives Gene (RPKM = 11).

53 Programs employing statistical models. Cufflinks: genome annotation based; reports FPKM values; uses a numerical method for finding the maximum likelihood optimum. RSEM: de novo transcriptome; reports counts and RPKM values; uses expectation maximization for finding the optimum. BitSeq: de novo transcriptome; reports counts and RPKM values; uses Markov chain Monte Carlo for sampling from the posterior distribution.
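
One way to search for the abundances is expectation maximization (EM). The sketch below is a simplified illustration and not the implementation of any particular program: it assumes a 0/1 compatibility matrix (component 1), the length-weighted abundance (component 2) and no positional bias (component 3, taken here as 1 / transcript length), and it ignores fragment-length effects. The function name and the toy data are hypothetical.

```python
def estimate_abundances(compat, lengths, iterations=200):
    """EM estimate of relative transcript abundances (alpha).

    compat:  one row per read; compat[j][t] is 1 if read j maps to
             transcript t and 0 otherwise.
    lengths: transcript lengths in nt.
    """
    n_tx = len(lengths)
    theta = [1.0 / n_tx] * n_tx  # fraction of reads coming from each transcript
    for _ in range(iterations):
        expected = [0.0] * n_tx
        for row in compat:
            # E-step: share the read among its compatible transcripts.
            w = [k * th / l for k, th, l in zip(row, theta, lengths)]
            s = sum(w)
            if s == 0:
                continue  # read is compatible with no transcript
            for t in range(n_tx):
                expected[t] += w[t] / s
        # M-step: new read fractions from the expected read counts.
        total = sum(expected)
        theta = [e / total for e in expected]
    # Convert read fractions to abundances: alpha_t is proportional to theta_t / l_t.
    per_length = [th / l for th, l in zip(theta, lengths)]
    norm = sum(per_length)
    return [p / norm for p in per_length]


# Toy data: 6 reads unique to transcript 1, 2 unique to transcript 2, 4 shared.
compat = [[1, 0]] * 6 + [[0, 1]] * 2 + [[1, 1]] * 4
alpha = estimate_abundances(compat, lengths=[100, 100])
```

With equal transcript lengths the shared reads end up divided in the same 3:1 ratio as the unique reads, so the estimate converges to alpha = (0.75, 0.25).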

54 Edouard Severing RNA seq (III) Differential expression

55 A typical heat stress experiment. Figure: expression level of HSP38 in Control and Heat stress (convection), shown for a single measurement and for many measurements. Is this gene really important?

56 Sources of variation. Workflow: treatment (convection), freezer, N2, grind, RNA extraction procedure, sequencing, bioinformatics. The variation is either biological or technical.

57 Determining expression variation. Accurately determining the variation requires many biological samples (replicates). Unfortunately, in most cases we only have two or three replicates. Other methods are needed to approximate/model the variation.

58 Determining within condition fragment count variation: modeling (DESeq/Cuffdiff) (Trapnell et al). Main assumption: the variance depends on the mean. Objective: find a function that best describes the relationship between the mean and the variance.
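
The objective on this slide can be illustrated with a very small fit. The sketch below assumes one simple parametric form, variance = mean + phi × mean², and estimates phi by least squares across genes; this form and the function names are assumptions for illustration and are not necessarily what DESeq or Cuffdiff fit. The replicate counts are made up.

```python
from statistics import mean, variance


def fit_dispersion(counts_per_gene):
    """Least-squares fit of phi in: variance = mean + phi * mean**2.

    counts_per_gene: list of replicate count lists, one list per gene.
    """
    num = 0.0
    den = 0.0
    for counts in counts_per_gene:
        m = mean(counts)
        v = variance(counts)  # sample variance across the replicates
        # (v - m) = phi * m**2: regression through the origin on x = m**2
        num += (v - m) * m**2
        den += m**4
    return max(num / den, 0.0)  # a negative phi would be meaningless here


def predicted_variance(m, phi):
    """Variance predicted for a gene with mean count m."""
    return m + phi * m**2


# Two hypothetical genes with two replicates each.
phi = fit_dispersion([[8, 12], [90, 110]])
var_at_100 = predicted_variance(100, phi)
```

Pooling genes in this way is what makes it possible to estimate a variance for each gene even though only two or three replicates are available.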

59 Building the within condition fragment count distribution (DESeq). At this stage DESeq and Cuffdiff have determined for each transcript/gene: 1. the within condition fragment count mean; 2. the within condition fragment count variance. With these parameters a fragment count probability distribution can be determined. DESeq uses a negative binomial (NB) distribution. For the NB distribution, the variance is larger than the mean.

60 Building the within condition fragment count distribution (Cuffdiff). In addition to the NB distribution used by DESeq, Cuffdiff also takes the uncertainty in the assignment of fragments to transcripts into account. The resulting count distribution is called a beta negative binomial (BNB) distribution. This count distribution is in the end transformed to a distribution of FPKM values.

61 Differential gene expression. Now that the count or FPKM distributions are known, we can start testing for significant differences in expression levels. There are several ways to test whether the gene/transcript levels in two conditions are significantly different. DESeq uses an exact test that is analogous to the Fisher exact test (the hypergeometric distribution is replaced by the NB distribution). Cuffdiff uses a t-test (covered on the next slide).

62 Testing for differential expression (Cuffdiff). Figure: expression in Control and Heat, and the resulting log FC distribution. Many measurements lead to many fold changes (log FC).

63 Cuffdiff p-value (log FC distribution). The authors of Cufflinks state that the quantity T is approximately normally distributed. The unadjusted p-value of Cuffdiff indicates the probability of observing a T that is more extreme than the current T (red areas in the figure, below -|T| and above |T|).
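
The two red tail areas can be computed directly. The sketch below assumes that T has already been standardized, so that it follows a standard normal distribution under the null hypothesis; it illustrates the two-sided p-value only and is not Cuffdiff's code. The function name is hypothetical.

```python
from math import erfc, sqrt


def two_sided_p_value(t_stat):
    """Probability of a standard normal value more extreme than t_stat,
    that is, the combined area below -|T| and above |T|."""
    return erfc(abs(t_stat) / sqrt(2))


p_no_change = two_sided_p_value(0.0)   # T = 0: no evidence at all
p_borderline = two_sided_p_value(1.96)  # close to the usual 0.05 cut-off
```

Because this p-value is unadjusted, it still has to be corrected for multiple testing when thousands of genes are tested.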

64 Samtools view SRR Chr M * 0 0 AGCGAGAGAGATCGACGGCGAAGCTCTTTACCCGCT %"&'"(&&#'$&$)+$#,%83%0&1250'III+$' AS:i:-4 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:34G0A0 YT:Z:UU XS:A:+ NH:i:1 Sw accepted_hits.bam /mnt/geninf15/work/course_2013/day2/mapped_reads

