
1 Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data. Alex Zelikovsky, Department of Computer Science, Georgia State University. Joint work with Serghei Mangul, Sahar Al Seesi, Adrian Caciula, Dumitru Brinza, Ion Mandoiu

2 Advances in Next Generation Sequencing (http://www.economist.com/node/16349358). Roche/454 FLX Titanium: 400-600 million bases/run, 400 bp avg. length. Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100 bp read length. SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50 bp read length. Ion Proton Sequencer.

3 RNA-Seq. Make cDNA & shatter into fragments; sequence fragment ends; map reads. Analyses: Gene Expression, Transcriptome Reconstruction, Isoform Expression. [Figure: gene with exons A B C D E and isoforms ABC, AC, DE]

4 Transcriptome Assembly. Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.

5 Transcriptome Assembly Types. Genome-independent reconstruction (de novo) – de Bruijn k-mer graph. Genome-guided reconstruction (ab initio) – Spliced read mapping – Exon identification – Splice graph. Annotation-guided reconstruction – Use existing annotation (known transcripts) – Focus on discovering novel transcripts.
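As an illustration of the de Bruijn k-mer graph used by genome-independent assemblers, here is a minimal sketch. The function name `de_bruijn_graph`, the toy reads, and the choice k=4 are assumptions for the example; real assemblers add error correction, coverage tracking, and graph simplification that are not shown.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Link each (k-1)-mer prefix to the (k-1)-mer suffix of every k-mer seen in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads, not taken from the presentation.
toy_reads = ["ACGTAC", "GTACGA"]
toy_graph = de_bruijn_graph(toy_reads, k=4)
```

A node with more than one outgoing edge (here "ACG") marks a point where the reads diverge, which in a transcriptome can reflect alternative splicing as well as repeats or errors.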

6 Previous approaches. Genome-independent reconstruction – Trinity (2011), Velvet (2008), TransABySS (2008). Genome-guided reconstruction – Scripture (2010): reports “all” transcripts – Cufflinks (2010), IsoLasso (2011), SLIDE (2012), CLIIQ (2012), TRIP (2012), Traph (2013): minimize the set of transcripts explaining the reads. Annotation-guided reconstruction – RABT (2011), DRUT (2011).

7 Gene representation. Pseudo-exons – regions of a gene between consecutive transcriptional or splicing events. Gene – set of non-overlapping pseudo-exons. [Figure: exons e1-e6 of transcripts Tr1, Tr2, Tr3 are split into pseudo-exons pse1-pse7, each with a start S_pse and an end E_pse]
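One way to read the pseudo-exon definition is as a partition of the exonic part of a gene at every exon boundary. The sketch below follows that reading; the function name `pseudo_exons`, the half-open `(start, end)` coordinates, and the toy transcripts are assumptions, not the authors' code.

```python
def pseudo_exons(transcripts):
    """Split exonic regions at every exon start/end seen in any transcript.

    transcripts: list of transcripts, each a list of (start, end) exons,
    with half-open coordinates (an assumption of this sketch).
    """
    boundaries = sorted({b for exons in transcripts for exon in exons for b in exon})
    segments = zip(boundaries, boundaries[1:])
    # Keep only segments that lie inside at least one exon; the rest are introns.
    return [
        (s, e)
        for s, e in segments
        if any(xs <= s and e <= xe for exons in transcripts for xs, xe in exons)
    ]


# Hypothetical gene with three transcripts.
tr1 = [(0, 100), (200, 300)]
tr2 = [(0, 60), (200, 300)]   # alternative end of the first exon
tr3 = [(0, 100), (250, 300)]  # alternative start of the last exon
pse = pseudo_exons([tr1, tr2, tr3])
```

Each transcript can then be written as a sequence of consecutive pseudo-exons, which is the representation the splice graph uses.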

8 Splice Graph. [Figure: splice graph drawn over the genome, with pseudo-exon nodes 1-9 between the TSS and the TES]

9 MaLTA: Maximum Likelihood Transcriptome Assembly. Map the RNA-Seq reads to genome. Construct Splice Graph G(V,E) – V: exons – E: splicing events. Candidate transcripts – depth-first-search (DFS). Select candidate transcripts – IsoEM – greedy algorithm.
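The candidate-generation step can be sketched as a depth-first enumeration of source-to-sink paths in the splice graph. The adjacency-dict representation, the function name `enumerate_candidates`, and the toy graph are assumptions; the sketch also assumes the graph is acyclic, which holds when edges always point from an upstream exon to a downstream one.

```python
def enumerate_candidates(splice_graph, sources, sinks):
    """Return every path from a source (TSS) node to a sink (TES) node.

    splice_graph: dict mapping a node to the list of nodes it splices to.
    """
    paths = []

    def dfs(node, path):
        path.append(node)
        if node in sinks:
            paths.append(tuple(path))
        for nxt in splice_graph.get(node, ()):
            dfs(nxt, path)
        path.pop()  # backtrack before exploring sibling branches

    for source in sources:
        dfs(source, [])
    return paths


# Hypothetical gene: exon 2 is optionally skipped.
toy_splice_graph = {1: [2, 3], 2: [3], 3: [4], 4: []}
candidates = enumerate_candidates(toy_splice_graph, sources=[1], sinks={4})
```

The number of paths can grow exponentially with the number of alternative events, which is one reason a later selection step over the candidates is needed.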

10 How to select? Select the smallest set of candidate transcripts covering all transcript variants. Transcript: set of transcript variants. [Figure: types of transcript variants: alternative first exon, alternative last exon, exon skipping, intron retention, alternative 5' splice junction, splice junction. Source: Sharmistha Pal, Ravi Gupta, Hyunsoo Kim, et al., Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development, Genome Res. 2011 21: 1260-1272]

11 IsoEM: Isoform Expression Level Estimation Expectation-Maximization algorithm Unified probabilistic model incorporating – Single and/or paired reads – Fragment length distribution – Strand information – Base quality scores – Repeat and hexamer bias correction

12 Read-isoform compatibility graph
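To make the expectation-maximization idea concrete, here is a heavily simplified sketch over a read-isoform compatibility structure. It estimates only the fraction of reads originating from each isoform; the fragment length, strand, quality-score, and bias terms of the IsoEM model are collapsed into a single per-edge weight, and length normalization is omitted. The function name, the data layout, and the toy reads are assumptions.

```python
def em_read_fractions(compat, isoforms, iterations=200):
    """Estimate the fraction of reads coming from each isoform.

    compat: one dict per read, mapping each compatible isoform to a weight.
    """
    freq = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(iterations):
        counts = {iso: 0.0 for iso in isoforms}
        # E-step: split each read among its compatible isoforms.
        for weights in compat:
            total = sum(freq[iso] * w for iso, w in weights.items())
            if total == 0.0:
                continue
            for iso, w in weights.items():
                counts[iso] += freq[iso] * w / total
        # M-step: renormalize the expected read counts.
        assigned = sum(counts.values())
        if assigned == 0.0:
            break
        freq = {iso: counts[iso] / assigned for iso in isoforms}
    return freq


# Hypothetical reads: 6 map only to ABC, 2 only to AC, 2 to both equally well.
toy_compat = [{"ABC": 1.0}] * 6 + [{"AC": 1.0}] * 2 + [{"ABC": 1.0, "AC": 1.0}] * 2
toy_freq = em_read_fractions(toy_compat, ["ABC", "AC"])
```

In this toy case the ambiguous reads are shared in proportion to the current estimates, so the iteration settles at three quarters of the reads for ABC and one quarter for AC.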

13 Fragment length distribution. [Figure: the same read pair aligned to isoforms ABC and AC; labels: fragment lengths i and j, F_a(i) and F_a(j)]

14 Greedy algorithm. 1. Sort transcripts by inferred IsoEM expression levels in decreasing order. 2. Traverse transcripts – Select a transcript if it contains a novel transcript variant – Continue traversing until all transcript variants are covered.

15 Greedy algorithm 15 Transcript Variants: Transcripts sorted by expression levels

25 Greedy algorithm. Transcript Variants: Transcripts sorted by expression levels. STOP. All transcript variants are covered.
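The greedy selection above can be sketched as follows. The function name `greedy_select`, the dict-based inputs, and the toy expression values are assumptions for illustration; in MaLTA the expression levels would come from IsoEM.

```python
def greedy_select(expression, variants_of):
    """Pick transcripts in decreasing expression order until all variants are covered.

    expression: dict transcript -> estimated expression level.
    variants_of: dict transcript -> set of transcript variants it contains.
    """
    uncovered = set().union(*variants_of.values())
    selected = []
    for transcript in sorted(expression, key=expression.get, reverse=True):
        if not uncovered:
            break  # STOP: all transcript variants are covered
        novel = variants_of[transcript] & uncovered
        if novel:
            selected.append(transcript)
            uncovered -= novel
    return selected


# Hypothetical candidates and variants.
toy_expression = {"t1": 50.0, "t2": 30.0, "t3": 20.0, "t4": 5.0}
toy_variants = {
    "t1": {"v1", "v2"},
    "t2": {"v2"},
    "t3": {"v3"},
    "t4": {"v1", "v3"},
}
chosen = greedy_select(toy_expression, toy_variants)
```

Here t2 is skipped because its only variant is already covered by t1, and the traversal stops before t4 is considered. The result is a heuristic answer to the covering problem and is not guaranteed to be the smallest possible set.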

26 MaLTA results on GOG-350 dataset. 4.5M single Ion reads with average read length 121 bp, aligned using TopHat2. Number of assembled transcripts – MaLTA: 15385 – Cufflinks: 17378. Number of transcripts matching annotations – MaLTA: 4555 (26%) – Cufflinks: 2031 (13%).

27 Expression Estimation on Ion Torrent reads Squared correlation – IsoEM / Cufflinks FPKMs vs qPCR values for 800 genes – 2 MAQC samples : Human Brain and Universal
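For reference, the squared correlation used in this kind of comparison can be computed as the square of the Pearson coefficient between estimated FPKMs and qPCR values. The sketch below uses only the standard formula; the function name and the toy numbers are assumptions and are not data from the presentation, and any transformation applied to the values before correlating (for example, a log scale) is not specified here.

```python
def squared_pearson(x, y):
    """Return r^2, the squared Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical values for three genes.
toy_fpkm = [1.0, 2.0, 3.0]
toy_qpcr = [1.0, 3.0, 2.0]
toy_r2 = squared_pearson(toy_fpkm, toy_qpcr)
```

A value of 1 means the estimates are a perfect linear function of the qPCR measurements; the toy vectors above give 0.25.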

28 Conclusions. Novel method for transcriptome assembly. Validated on Ion Torrent RNA-Seq data. Compared with Cufflinks: – similar number of assembled transcripts – 2x more previously annotated transcripts. Transcript quantification is useful for transcript assembly → better quantification?
