Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett.


1 Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett

2 The Vision: RNA-Seq can be used for transcript discovery and abundance estimation. What’s missing: algorithms that aren’t restricted by prior gene annotations (which are often incomplete) and that account for alternative transcription and splicing. Hence, Cufflinks.

3 The Need: Evidence of ambiguous assignment of isoforms. TSS/promoter changes and splice-site changes were found previously by the authors. Longer reads and paired-end reads are not enough on their own.

4 The Biology: General assumption of randomization of reads; Central Dogma; Transcription Start Site (TSS); Splice site; Isoform

5 Central Dogma and Regulation

6 Splicing: Thisblahblahblahblahblahblahisblahblahimportant. “This”, “is”, “important” - Exons; “blah” - Introns (Intrusions)

7 Major Change 1

8 Major Change 2

9 Why it matters: Isoforms. Not only different sizes, but different shapes. Shape determines function. Isoforms map to the same section of the genome, so they go undetected without Cufflinks. Separating transcripts into isoforms gives a more realistic representation of what is happening.

10

11 TopHat: Mapping short reads. Trapnell et al., Bioinformatics, 2009

12 TopHat: No genome reference annotations are needed. The output of TopHat is the input of Cufflinks. Input: reads and genome. Output: read mappings. Short reads present computational challenges (Bowtie).

13 How does TopHat Work?! Big Idea: “Exon Inference”!!

14 Step 1: Initial Mapping via Bowtie. Group 1: Mapped Reads (Segments). Group 2: Initially Unmapped (IUM) Reads, possibly intron-spanning. Based on Group 1, we want to recover the intron-spanning reads from Group 2. [Figure labels: Reference, Mapped Reads]

15 Step 2: Generate Putative Exons

16 Step 3: Look for Potential Splice Signals [Figure label: Putative Exons]
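
The slide gives no detail for this step. A minimal sketch, assuming the splice signal is the canonical GT...AG donor/acceptor pair and that the search is limited to a small window around each putative exon boundary. The function name and the `flank` and `min_intron` parameters are hypothetical and are not taken from TopHat itself.

```python
def candidate_introns(genome, exon_a, exon_b, flank=2, min_intron=4):
    """Return candidate introns as half-open (start, end) intervals.

    exon_a and exon_b are half-open (start, end) putative exons,
    with exon_a upstream of exon_b on the forward strand.
    """
    # Donor: intron begins with "GT", searched near the end of exon_a.
    donors = [
        i
        for i in range(max(exon_a[1] - flank, 0), exon_a[1] + flank + 1)
        if genome[i:i + 2] == "GT"
    ]
    # Acceptor: intron ends with "AG", searched near the start of exon_b.
    acceptors = [
        j
        for j in range(max(exon_b[0] - flank, 2), exon_b[0] + flank + 1)
        if genome[j - 2:j] == "AG"
    ]
    # Pair every donor with every acceptor that leaves a long enough intron.
    return [(d, a) for d in donors for a in acceptors if a - d >= min_intron]


genome = "AAAACC" + "GT" + "TTTTTT" + "AG" + "CCAAAA"
introns = candidate_introns(genome, (0, 6), (16, 22))
```

Each candidate pair would then be checked against the initially unmapped reads in the seed-and-extend step.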

17 Step 4: Seed-and-Extend

18

19 Cufflinks: Isoform/Transcript Detection and Quantification. Trapnell et al., Nature Biotech, 2010

20 Step 5: Identify Compatible Reads Two reads are compatible if their overlap contains the exact same implied introns (or none). If two reads are not compatible they are incompatible.
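
The compatibility rule can be sketched in a few lines. This is an illustration under stated assumptions, not the Cufflinks implementation: each read is represented by a hypothetical pair of its genomic span and the list of introns its alignment implies, all as half-open intervals, and reads that do not overlap at all are treated as compatible.

```python
def introns_within(introns, lo, hi):
    """Introns that intersect the half-open region [lo, hi)."""
    return {(s, e) for (s, e) in introns if s < hi and e > lo}


def compatible(span_a, introns_a, span_b, introns_b):
    """True if both reads imply the same introns inside their overlap."""
    lo = max(span_a[0], span_b[0])
    hi = min(span_a[1], span_b[1])
    if lo >= hi:
        # No genomic overlap: nothing to disagree about (assumption).
        return True
    return introns_within(introns_a, lo, hi) == introns_within(introns_b, lo, hi)


read_a = ((0, 100), [(40, 60)])
read_b = ((30, 120), [(40, 60)])   # same intron in the overlap
read_c = ((30, 120), [])           # exonic where read_a implies an intron
read_d = ((70, 150), [(110, 130)]) # intron lies outside the overlap with read_a

ab = compatible(*read_a, *read_b)
ac = compatible(*read_a, *read_c)
ad = compatible(*read_a, *read_d)
```

Here `ab` and `ad` are compatible pairs, while `ac` is not, because `read_c` covers positions 40 to 60 that `read_a` skips as an intron.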

21 Step 6: Less BIOLOGY, and NOW it is time for some GRAPH THEORY… “We emphasize that the definition of a transcription locus is not biological…” - Authors

22 Step 6: Create Overlap Graph. Connect compatible reads in order. Create a DAG (directed acyclic graph).

23 A path in this graph corresponds to a transcript isoform

24 Theory: 1. A minimum path cover of the overlap graph gives the fewest transcripts (isoforms) necessary to explain the reads. 2. By Dilworth's Theorem, the size of a minimum path cover equals the size of the largest set of reads in which no two are compatible. 3. The cover itself can be computed by finding a maximum matching in a bipartite graph.

25 Step 8: Convert a DAG into a Bipartite Graph

26 Step 9: Look for a Maximum Matching in the bipartite graph via the BIPARTITE-MATCHING algorithm: find an augmenting path via BFS and add it to the matching, repeating until no augmenting path remains.
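
A minimal sketch of Steps 8 and 9, assuming the DAG is given as a vertex count and an edge list (names are hypothetical). Each vertex is split into a left and a right copy, every edge u→v becomes a bipartite edge from left u to right v, and a minimum cover by vertex-disjoint paths then has n minus (matching size) paths. To obtain the chain cover that Dilworth's Theorem refers to, the edge list would first need to be transitively closed; that step is omitted here.

```python
from collections import deque


def max_bipartite_matching(adj, n):
    """adj[u] lists right-side vertices adjacent to left vertex u."""
    match_left = [-1] * n   # right partner of each left vertex
    match_right = [-1] * n  # left partner of each right vertex
    for root in range(n):
        parent = {}             # right vertex -> left vertex it was reached from
        queue = deque([root])
        end = -1
        while queue and end == -1:
            u = queue.popleft()
            for v in adj[u]:
                if v in parent:
                    continue
                parent[v] = u
                if match_right[v] == -1:
                    end = v     # free right vertex: augmenting path found
                    break
                queue.append(match_right[v])
        # Flip matched/unmatched edges along the augmenting path.
        v = end
        while v != -1:
            u = parent[v]
            prev = match_left[u]
            match_left[u], match_right[v] = v, u
            v = prev
    return match_left


def min_path_cover(n, edges):
    """Cover all vertices of a DAG with the fewest vertex-disjoint paths."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    match_left = max_bipartite_matching(adj, n)
    has_predecessor = {v for v in match_left if v != -1}
    paths = []
    for start in range(n):
        if start in has_predecessor:
            continue
        path = [start]
        while match_left[path[-1]] != -1:
            path.append(match_left[path[-1]])
        paths.append(path)
    return paths


paths = min_path_cover(5, [(0, 1), (1, 2), (0, 3), (3, 4)])
```

In the toy graph, five reads are explained by two paths, each of which would correspond to one assembled isoform.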

27 A path in this graph corresponds to a transcript isoform

28 Projective normalization underestimates expression. [Figure: isoform a, isoform b] Project all isoforms into genome coordinates. R reads total, r reads for the gene: r_a for isoform a and r_b for isoform b. [The “but … so …” equations were not preserved in this transcript.]

29 How should expression levels be estimated? A-B are distinguished by the presence of splice junction (a) or (b). A-C are distinguished by the presence of splice junction (a) and a change in UTR. B-C are distinguished by the presence of splice junction (b) and a change in UTR. [Figure labels: (a), (b)]

30 How should expression levels be estimated? Longer transcripts contain more reads. Reads that could have originated from multiple transcripts are informative. Relative abundance estimation requires “discriminatory reads”. [Figure labels: (a), (b)]

31 A model for RNA-Seq: [symbol not preserved] = transcript proportions for assignment of reads to transcripts; L = likelihood of this assignment; R = all reads

32 A model for RNA-Seq [equation not preserved; annotation: all transcripts]

33 A model for RNA-Seq. Define: [symbol not preserved] = expected number of possible positions for an arbitrary fragment in transcript t; F(i) = Pr(random fragment has length i); l(t) = full length of transcript t
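
The formula on this slide was not preserved. A sketch of one common way to compute such an "expected number of positions", assuming a fragment of length i can start at l(t) - i + 1 positions in transcript t and that the counts are averaged over F. The function name and the dictionary representation of F are hypothetical.

```python
def effective_length(transcript_len, frag_len_prob):
    """Expected number of start positions for a random fragment.

    frag_len_prob maps a fragment length i to F(i); fragments longer
    than the transcript contribute no positions.
    """
    return sum(
        p * (transcript_len - i + 1)
        for i, p in frag_len_prob.items()
        if 1 <= i <= transcript_len
    )


F = {200: 0.5, 300: 0.5}
long_tx = effective_length(1000, F)   # 0.5 * 801 + 0.5 * 701
short_tx = effective_length(250, F)   # only 200-length fragments fit
```

Short transcripts lose proportionally more positions than long ones, which is why dividing read counts by the full length l(t) alone would bias their abundance estimates.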

34 A model for RNA-Seq

35 A model for RNA-Seq: I_t(r) = implied length of r’s fragment if r is assigned to transcript t. Recall: F(i) = Pr(fragment length = i)

36 Projective normalization underestimates expression. [Figure: isoform a, isoform b] Project all isoforms into genome coordinates. R reads total, r reads for the gene: r_a for isoform a and r_b for isoform b. [The “but … so …” equations were not preserved in this transcript.]

37 A model for RNA-Seq: Now we have a maximum likelihood function in terms of [symbol not preserved], the distribution of reads among transcripts. Non-negative linear model.

38 Inference with the sequencing model: The likelihood function is concave, so it can be optimized using the EM algorithm. Asymptotic MLE theory leads to a covariance matrix for the estimator in the form of the inverse of the observed Fisher information matrix. Importance sampling from the posterior distribution is used to estimate the abundances (as the posterior expectation) and 95% confidence intervals for the estimates. This approach extends the log linear model of H. Jiang and W. Wong, Bioinformatics 2009 to a linear model for paired end reads. For more background see Li et al., Bioinformatics, 2010 and Bullard et al., BMC Bioinformatics,
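
A simplified sketch of EM for this kind of mixture model, not the Cufflinks implementation. It assumes each read comes with a hypothetical row of conditional probabilities P(read | transcript), zero where the read is incompatible, and it estimates only the proportion of reads drawn from each transcript. Fragment-length terms, the Fisher information step, and importance sampling are left out.

```python
def em_read_proportions(read_likelihoods, n_transcripts, n_iter=200):
    """Estimate the fraction of reads originating from each transcript."""
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among transcripts by posterior weight.
        for lik in read_likelihoods:
            weights = [theta[t] * lik[t] for t in range(n_transcripts)]
            total = sum(weights)
            if total == 0:
                continue  # read compatible with no transcript
            for t in range(n_transcripts):
                expected[t] += weights[t] / total
        # M-step: renormalize expected counts into proportions.
        norm = sum(expected)
        if norm == 0:
            break
        theta = [c / norm for c in expected]
    return theta


# 3 reads unique to transcript 0, 1 unique to transcript 1, 4 ambiguous.
reads = [[1, 0]] * 3 + [[0, 1]] + [[1, 1]] * 4
theta = em_read_proportions(reads, 2)
```

In this toy case the ambiguous reads carry no information about the split, so the estimate settles at the 3:1 ratio set by the discriminatory reads.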

39 Utility of Cufflinks: mRNA as a proxy for gene expression & action. Control points: transcriptional vs. post-transcriptional. Does isoform-level discovery & quantification matter? Apparently, yes: about 12K new isoforms were putatively discovered while about 13K known ones were recovered. Plus other stuff…

40 The skeletal myogenesis transcriptome: RNA-Seq (2x75bp GAIIx) along time course of mouse C2C12 differentiation (starting at 0 hours).
[Figure: time course with labels “hours” (value not preserved), 60 hours, 120 hours, 168 hours; stage labels: fusion, myotube, myocyte. Illustration based on: Ohtake et al, J. Cell Sci., 2006; 119:]
Counts for the four samples, in the order they appear on the slide:
Reads: […],369,078; 140,384,062; 82,138,212; 123,575,666
Alignments: 66,541,668; 103,681,081; 47,431,271; 89,162,512
Alignments to junctions: 10,754,363; 19,194,697; 9,015,806; 17,449,848
Transfrags: 58,008; 69,716; 55,241; 63,664
Slide courtesy of Hector Corrada Bravo

41 Validation

42 Validation

43 Projective normalization underestimates expression. [Figure: isoform a, isoform b] Project all isoforms into genome coordinates. R reads total, r reads for the gene: r_a for isoform a and r_b for isoform b. [The “but … so …” equations were not preserved in this transcript.] Slide courtesy of Hector Corrada Bravo

44 Discovery is necessary for accurate abundance estimates. Slide courtesy of Hector Corrada Bravo

45 Some Questions… Do isoforms of a given gene have interesting temporal patterns (increasing, decreasing, more complex…)? What does this mean biologically? What about transcriptional versus post-transcriptional regulation (differential transcription, differential splicing)?

46 Dynamics of Myc expression. Slide courtesy of Hector Corrada Bravo

47 Overloading Metric using Jensen-Shannon Divergence. Metric: [equation not preserved; its terms were labeled “Average Entropy” and “Entropy of Average”]. Test: one-sided t-test under the null hypothesis that there is no difference in abundance; Type I errors controlled with Benjamini-Hochberg correction (FDR).
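
The slide's equation is missing, but its two labels match the standard definition of the Jensen-Shannon divergence: the entropy of the average distribution minus the average of the individual entropies. A minimal sketch of that definition, assuming each input is a vector of relative isoform abundances that sums to 1 (function names are hypothetical):

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jensen_shannon_divergence(dists):
    """Entropy of the average distribution minus the average entropy."""
    n = len(dists)
    k = len(dists[0])
    mean = [sum(d[i] for d in dists) / n for i in range(k)]
    return entropy(mean) - sum(entropy(d) for d in dists) / n


same = jensen_shannon_divergence([[0.7, 0.3], [0.7, 0.3]])
switch = jensen_shannon_divergence([[1.0, 0.0], [0.0, 1.0]])
```

Identical isoform mixes give a divergence of 0, and a complete isoform switch between two conditions gives the two-distribution maximum of 1 bit. The square root of the divergence is a true metric, which is what makes it usable as a distance between conditions.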

48 Regulatory Overloading [Figure: # genes (FDR < 0.05) showing differential splicing and differential TSS preference; example genes listed: Fibronectin, Tropomyosin 1, Mef2d, …; Fhl3, Fhl1, Myl1, …]

49 Dynamics of Myc expression [Figure annotation: d( , ), arguments not preserved]. Slide courtesy of Hector Corrada Bravo

50 New TSS = New Points of Regulation (TSS = Transcription Start Site). What would a “collapsed” RNA-seq alignment look like? Microarray?

51 Questions?

52 I am the DNA, and I want a protein! The DNA wants a protein. [Figure labels: Transcription, Translation, mRNA, Protein]

