2 A typical heat stress experiment (climate change)
[Figure: an economically important frog, heat stress (convection) versus control; time labels: 85 minutes, 5 days]
How does the frog adapt and survive?
3 Coping with heat stress
The frog likely has to change several processes in order to cope with the heat stress:
  Adaptation of metabolic pathways.
  Prevention of water loss through the skin.
  Changes in the concentration of several enzymes, other proteins and molecules.
We want to determine these changes in molecule concentration, starting with proteins.
4 Changes at the molecular level
We could measure protein concentrations directly, but this is not often done on a large scale.
We could instead measure changes in the expression of the genes that encode these proteins.
Gene expression can be approximated by measuring the amount of mRNA molecules that are produced by the gene.
10 Gene count and complexity
[Figure: two panels, one labelled "90% of genes have AS" and the other "60% of genes have AS" (AS = alternative splicing)]
The average number of transcripts produced by human genes is also higher than the average number of transcripts produced by plant genes.
11 An extreme case
The Dscam gene can produce over 38,000 different transcripts.
12 Major alternative splicing event types
In humans, exon skipping is the most frequent AS event type.
In plants, intron retention is the most common AS event type.
14 Not everything is translated
A large fraction (>30%) of the transcripts of protein-coding genes are degraded by the nonsense-mediated decay (NMD) pathway.
The position of the stop codon is used to predict whether a transcript is likely to be degraded by the NMD pathway.
16 Remember
The number of unique mRNA molecules is much larger than the number of genes.
A large fraction of the mRNA molecules is degraded by the NMD pathway.
NMD provides a means to regulate gene expression at the post-transcriptional level.
17 Process the frogs into reads for analysis
[Figure: workflow with the steps N2, grind, prepare for sequencing, sequencing and bioinformatics; example reads: >s1 ATCGTAGGGTA, >s2 ATGGCCTAGGT]
18 Basic transcriptome analysis steps
Many research questions require the following steps:
  Reconstruction of the transcriptome (we usually only have fragments).
  Quantification of the transcriptome.
  Differential expression analysis.
  Other fun stuff.
24 Remember
De novo transcriptome assembly
  Used when no reference genome is available.
  Also used for finding features which are not on the reference genome (e.g. a T-DNA insertion).
  Programs: Trinity, Trans-ABySS, Velvet/Oases.
Genome-guided transcriptome reconstruction
  Used when a reference genome is available, with or without annotation.
  Mapping programs: TopHat, GSNAP.
  Transcriptome reconstruction programs: Scripture, Cufflinks.
25 RNA seq (II) Quantification
12/04/2017, Edouard Severing
27 Raw counts
Count the number of reads/fragments falling within the exonic regions of a gene.
Example program: htseq-count (HTSeq).
[Figure: gene model with Exon 1, Exon 2, Exon 3 and Exon 4]
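The counting idea can be sketched in a few lines of Python. This is only an illustration and not how htseq-count is implemented: the function name `count_exonic_fragments`, the exon coordinates and the fragment positions are all hypothetical, and a fragment is counted only when it lies completely inside one exon (real tools offer several overlap rules).

```python
def count_exonic_fragments(exons, fragments):
    """Count fragments that lie completely inside one exon.

    exons and fragments are lists of (start, end) tuples,
    0-based and end-exclusive. All coordinates are made up.
    """
    count = 0
    for f_start, f_end in fragments:
        inside = any(e_start <= f_start and f_end <= e_end
                     for e_start, e_end in exons)
        if inside:
            count += 1
    return count


# Hypothetical gene with four exons and five mapped fragments.
exons = [(100, 200), (300, 400), (500, 650), (800, 900)]
fragments = [(110, 160), (320, 370), (250, 290), (640, 690), (810, 860)]

raw_count = count_exonic_fragments(exons, fragments)
print(raw_count)
```

In this toy input the fragment at (250, 290) is intronic and the fragment at (640, 690) runs past the end of the third exon, so neither is counted.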
28 The same fragment count yet different expression levels
Library size matters.
[Figure: Exon 1 shown in two different libraries]
29 The same fragment count yet different expression levels
Transcript/gene length matters.
[Figure: a gene with Exon 1 to Exon 4 next to a gene with only Exon 1]
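30 Normalizing/correcting for feature length and library size
RPKM: reads per kilobase of feature length per million mapped reads.
[Figure: worked example combining the reads mapped to the region, a feature length of 300 nt and 10,000,000 mapped reads in total, giving RPKM ≈ 1.7]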
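The normalization can be written out as a small function. The feature length (300 nt) and the total number of mapped reads (10,000,000) are taken from the slide; the number of reads mapped to the region did not survive in the text, so the value of 5 used here is an assumption, chosen because it reproduces RPKM ≈ 1.7.

```python
def rpkm(reads_in_feature, feature_length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    kilobases = feature_length_nt / 1_000
    millions = total_mapped_reads / 1_000_000
    return reads_in_feature / (kilobases * millions)


# 5 reads is an assumed count; 300 nt and 10,000,000 come from the slide.
value = rpkm(reads_in_feature=5, feature_length_nt=300,
             total_mapped_reads=10_000_000)
print(round(value, 2))
```

Doubling the library size or the feature length while keeping the count fixed halves the RPKM, which is the point made on the two preceding slides.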
31 Normalizing/correcting for feature length
FPKM is analogous to RPKM, but counts fragments instead of reads.
[Figure: examples labelled RPKM = 1, RPKM = 2 and FPKM = 1]
A different picture emerges from raw counts than from RPKM/FPKM values.
32 Counting method issues
What should be done with reads that map to multiple isoforms (alternative splicing) or to multiple genes?
  Purely random assignment? No, the expression levels can differ.
  Count the read multiple times? No, it has been derived from a single transcript.
[Figure: Gene 1 and Gene 2; Isoform 1 and Isoform 2]
35 Statistical methods: Expression levels of transcripts
36 Fishing in the dark lake experiment
Question: What fraction (t) of the fish in the lake is green?
Method: We catch a number of fish and determine what fraction is green.
Caution: Fish have to be thrown back into the water immediately.
37 Fishing in the dark lake results (I)
Sane people would do:
  Take the sample (X) and compute the fraction of fish that is green: t = 1/3.
[Figure: the sample X]
38 Fishing in the dark lake results: maximum likelihood estimate of t
The probability of observing our sample X given a certain t:
$P(X) = \frac{3!}{2! \cdot 1!} \cdot t \cdot (1 - t)^{2}$
Find the t that maximizes the probability of our observation.
[Figure: P(X) plotted against t]
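A brute-force search shows that the maximum likelihood estimate agrees with the "sane" answer of t = 1/3. The sketch below evaluates the likelihood from the slide (one green fish and two other fish in a sample of three) on a grid of candidate values; the grid resolution and the names `likelihood` and `t_hat` are arbitrary choices made for this illustration.

```python
from math import comb


def likelihood(t, n_green=1, n_other=2):
    """P(X | t) for a sample with n_green green and n_other other fish."""
    n = n_green + n_other
    # comb(3, 1) equals 3! / (2! * 1!), the coefficient on the slide.
    return comb(n, n_green) * t ** n_green * (1 - t) ** n_other


# Candidate values of t between 0 and 1 in steps of 0.001.
grid = [i / 1000 for i in range(1001)]
t_hat = max(grid, key=likelihood)

print(t_hat, likelihood(t_hat))
```

The same result follows analytically: setting the derivative of t(1 − t)^2 to zero gives (1 − t)(1 − 3t) = 0, so the maximum inside the interval lies at t = 1/3.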
39 Fishing in a complex dark lake
Transcript quantification using RNA-seq is like fishing in a dark lake with fragmented fish.
We are also forced to determine the possible origin(s) of the fish fragments.
[Figure caption: "Only lost an eye and a fin but not its tail"]
40 Estimating relative transcript abundances
[Figure: Transcript 1 and Transcript 2 with relative abundances α1 and α2 (the target) go through fragmentation and sequencing; the reads (>s1 ATCGTAGGGTA, >s2 ATGGCCTAGGT) and their mapping are the observation]
Which values of α1 and α2 give the highest probability of observing these reads (α1 + α2 = 1)?
41 Maximum likelihood alignments
The likelihood of our observation (λ) corresponds to the product of the probabilities of observing each of the individual mapped reads (r_j) in our set (R):
$\lambda = \prod_{j=1}^{R} P(r_j)$
42 Probability of observing a read
The probability of observing a read r_j is the sum of the individual probabilities that the read originates from each transcript (t) in our transcript set (T):
$P(r_j) = \sum_{t=1}^{T} K_{jt} \cdot \frac{\alpha_t l_t}{\sum_{i=1}^{T} \alpha_i l_i} \cdot P_{jt}(q)$
Each term of the sum is the probability that read j (r_j) originated from transcript t.
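44 Component 2: Sequencing a read from a specific transcript
$P(r_j) = \sum_{t=1}^{T} K_{jt} \cdot \frac{\alpha_t l_t}{\sum_{i=1}^{T} \alpha_i l_i} \cdot P_{jt}(q)$
The fraction $\frac{\alpha_t l_t}{\sum_{i=1}^{T} \alpha_i l_i}$ is the probability of "sequencing" a read from transcript t.
Its numerator $\alpha_t l_t$ is the product of the relative expression level and the length of transcript t.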
45 Component 2: Sequencing a read from a specific transcript
Why $\alpha_t l_t$ and not just $\alpha_t$?
Longer transcripts produce more fragments than shorter transcripts at equal expression levels.
[Figure: fragmentation of two transcripts with α1 = α2 and the resulting fragments]
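A small numerical example makes the length effect concrete. It evaluates the fraction α_t l_t / Σ α_i l_i for two transcripts with equal relative expression; the transcript lengths of 1,000 and 3,000 nt and the function name `fragment_shares` are assumptions made for this illustration.

```python
def fragment_shares(alphas, lengths):
    """Expected share of fragments per transcript: alpha_t * l_t, normalized."""
    weights = [a * l for a, l in zip(alphas, lengths)]
    total = sum(weights)
    return [w / total for w in weights]


# Equal expression (alpha1 = alpha2), hypothetical lengths in nt.
shares = fragment_shares(alphas=[0.5, 0.5], lengths=[1000, 3000])
print(shares)
```

Although both transcripts are equally abundant, the transcript that is three times longer is expected to contribute three times as many fragments, which is why α_t alone would misrepresent the sampling probability.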
47 Component 3
$P(r_j) = \sum_{t=1}^{T} K_{jt} \cdot \frac{\alpha_t l_t}{\sum_{i=1}^{T} \alpha_i l_i} \cdot P_{jt}(q)$
$P_{jt}(q)$ is the probability of originating from position q on transcript t.
In the case of no bias: $P_{jt}(q) = \frac{1}{l_t}$
48 Components: Fragment comes from a certain position of the transcript (I)
[Figure: distribution with axis label "Occurrence" and the annotation "More likely"]
49 Components: Fragment comes from a certain position of the transcript (II)
Not all regions are equally covered.
[Figure: coverage panels with axis label "Frequency"]
50 Search for abundances that best explain the observed fragments
The method used to find the optimum differs per program.
[Figure source: Trapnell et al. 2010]
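Expectation maximization is one way to carry out this search. The sketch below is a toy version and not the implementation of any particular program. It assumes that K_jt is 1 when read j is compatible with transcript t and 0 otherwise (the slide defining K_jt is not part of this section), that there is no positional bias so P_jt(q) = 1/l_t, and that every read is compatible with at least one transcript. The function name `em_abundances` and the example data are hypothetical.

```python
def em_abundances(compat, lengths, n_iter=200):
    """Estimate relative abundances alpha_t with a simple EM loop.

    compat[j][t] is 1 if read j is compatible with transcript t, else 0
    (assumed meaning of K_jt). lengths[t] is the transcript length l_t.
    """
    n_tx = len(lengths)
    # theta_t = alpha_t * l_t / sum_i(alpha_i * l_i): share of reads from t.
    theta = [1.0 / n_tx] * n_tx
    for _ in range(n_iter):
        expected = [0.0] * n_tx
        for row in compat:
            # E-step: divide read j over its compatible transcripts in
            # proportion to K_jt * theta_t * P_jt(q), with P_jt(q) = 1 / l_t.
            weights = [k * th / l for k, th, l in zip(row, theta, lengths)]
            total = sum(weights)
            for t in range(n_tx):
                expected[t] += weights[t] / total
        # M-step: the new read shares are the expected read counts, normalized.
        theta = [e / len(compat) for e in expected]
    # Convert read shares back to abundances: alpha_t is proportional
    # to theta_t / l_t.
    per_length = [th / l for th, l in zip(theta, lengths)]
    norm = sum(per_length)
    return [p / norm for p in per_length]


# Hypothetical gene: isoform 1 is 1000 nt, isoform 2 contains isoform 1
# plus 1000 nt of its own. 20 reads fit both isoforms, 10 fit isoform 2 only.
compat = [[1, 1]] * 20 + [[0, 1]] * 10
alphas = em_abundances(compat, lengths=[1000, 2000])
print([round(a, 3) for a in alphas])
```

In this toy data set the two isoforms come out as equally abundant: the ten reads unique to isoform 2 imply about ten more of its reads in the shared region, which leaves about ten reads for isoform 1 over the same 1000 nt.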
51 Uncertainty in expression estimate
The statistical methods can also provide an indication of the uncertainty in the expression estimates.
One of the sources of that uncertainty is reads that do not map uniquely.
[Figure: distribution of an expression estimate; axes: FPKM, Occurrence]
52 Remember
The statistical methods calculate the expression level of each transcript.
The expression level of a gene can then be obtained by simply summing the expression levels of its isoforms.
Example: Isoform 1 (RPKM = 6) + Isoform 2 (RPKM = 5) gives Gene (RPKM = 11).
53 Programs employing statistical models
Cufflinks
  Genome annotation based.
  FPKM values.
  Numerical method for finding the maximum likelihood optimum.
RSEM
  De novo transcriptome.
  Counts and RPKM values.
  Expectation maximization for finding the optimum.
BitSeq
  Markov chain Monte Carlo for sampling from the posterior distribution.
55 A typical heat stress experiment (convection)
[Figure: expression level of HSP38 in the heat stress and control conditions; panels: single measurement, many measurements]
Is this gene really important?
56 RNA extraction procedure: sources of variation
[Figure: the steps treatment (convection), freezer, N2, grind, RNA extraction procedure, sequencing and bioinformatics, marked as biological or technical sources of variation]
57 Determining expression variation
Accurately determining the variation requires many biological samples (replicates).
Unfortunately, in most cases we only have two or three replicates.
Other methods are needed to approximate/model the variation.
58 Modeling the within-condition fragment count variation (DESeq/Cuffdiff)
Main assumption: the variance depends on the mean.
Objective: find a function that best describes the relationship between the mean and the variance.
[Figure source: Trapnell et al.]
59 Building the within-condition fragment count distribution (DESeq)
At this stage DESeq and Cuffdiff have determined for each transcript/gene:
  The within-condition fragment count mean.
  The within-condition fragment count variance.
With these parameters a fragment count probability distribution can be determined. DESeq uses a negative binomial (NB) distribution.
NB: the variance is larger than the mean.
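How a mean and a variance pin down an NB distribution can be shown with a short sketch. It uses the standard parameterization with size r and probability p, where mean = r(1 − p)/p and variance = mean/p; this is textbook NB algebra, not code from DESeq. The mean of 10 and variance of 30 are made-up values, and `nb_pmf` is a hypothetical helper name.

```python
from math import exp, lgamma, log


def nb_pmf(k, mean, variance):
    """Negative binomial probability of count k, given mean and variance."""
    if variance <= mean:
        raise ValueError("the NB distribution requires variance > mean")
    p = mean / variance                # success probability
    r = mean ** 2 / (variance - mean)  # size parameter
    log_pmf = (lgamma(k + r) - lgamma(r) - lgamma(k + 1)
               + r * log(p) + k * log(1 - p))
    return exp(log_pmf)


# Made-up within-condition parameters for one gene.
mean, variance = 10.0, 30.0
probs = [nb_pmf(k, mean, variance) for k in range(1000)]

# Recover mean and variance from the distribution as a sanity check.
est_mean = sum(k * pk for k, pk in enumerate(probs))
est_var = sum((k - est_mean) ** 2 * pk for k, pk in enumerate(probs))
print(round(est_mean, 3), round(est_var, 3))
```

A Poisson distribution would force the variance to equal the mean, so it cannot represent the extra variation between biological replicates that the NB distribution allows for.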
60 Building the within-condition fragment count distribution (Cuffdiff)
In addition to the NB distribution used by DESeq, Cuffdiff also takes the uncertainty in the assignment of fragments to transcripts into account.
The resulting count distribution is called a beta negative binomial (BNB) distribution.
This count distribution is in the end transformed into a distribution of FPKM values.
61 Differential gene expression
Now that the count or FPKM distributions are known, we can start testing for significant differences in expression levels.
There are several ways to test whether the gene/transcript levels in two conditions are significantly different.
DESeq uses an exact test that is analogous to Fisher's exact test (the hypergeometric distribution is replaced by the NB distribution).
Cuffdiff uses a t-test (we will look at that in the next slides).
62 Testing for differential expression (Cuffdiff)
Many measurements lead to many fold changes.
[Figure: expression in the control and heat conditions, the log fold change (log FC) between them, and the resulting log FC distribution]
63 Cuffdiff p-value
$T = \frac{E[\log FC]}{Var[\log FC]}$
The authors of Cufflinks state that the quantity T is approximately normally distributed.
The unadjusted p-value of Cufflinks indicates the probability of observing a T that is more extreme than the current T (the red areas beyond -|T| and |T|).
[Figure: log FC distribution with the tails beyond -|T| and |T| shaded in red]
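The two red tail areas can be computed directly once T is known. The sketch below assumes that T follows a standard normal distribution (mean 0, variance 1) when there is no differential expression; that scaling is an assumption of this example, and the code is not taken from Cufflinks. The function name `two_sided_p_value` and the example value of T are hypothetical.

```python
from math import erfc, sqrt


def two_sided_p_value(t_stat):
    """Probability that a standard normal value lies beyond -|T| or |T|."""
    # For standard normal Z: P(|Z| >= x) = erfc(x / sqrt(2)).
    return erfc(abs(t_stat) / sqrt(2))


# Hypothetical test statistic for one gene.
p = two_sided_p_value(1.96)
print(round(p, 4))
```

Because both tails are counted, a positive and a negative T of the same size give the same p-value, and T = 0 gives a p-value of 1.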