Presentation is loading. Please wait.

Presentation is loading. Please wait.

An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia.

Similar presentations


Presentation on theme: "An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia."— Presentation transcript:

1 An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia State University Joint work with Adrian Caciula, Sahar Al Seesi, Dumitru Brinza, Abdul Rouf Banday, Rahul N. Kanadia, Ion Mandoiu and Alex Zelikovsky

2 Advances in Next Generation Sequencing http://www.economist.com/node/16349358 Roche/454 FLX Titanium 400-600 million reads/run 400bp avg. length Illumina HiSeq 2000 Up to 6 billion PE reads/run 35-100bp read length SOLiD 4/5500 1.4-2.4 billion PE reads/run 35-50bp read length Ion Proton Sequencer 2

3 RNA-Seq ABCDE Make cDNA & shatter into fragments Sequence fragment ends Map reads Gene Expression ABC AC DE Transcriptome Reconstruction Isoform Expression 3

4 Transcriptome Reconstruction Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data. 4

5 Transcriptome Reconstruction Types Genome-independent reconstruction (de novo) – de Brujin k-mer graph Genome-guided reconstruction (ab initio) – Spliced read mapping – Exon identification – Splice graph Annotation-guided reconstruction – Use existing annotation (known transcripts) – Focus on discovering novel transcripts 5

6 Previous approaches Genome-independent reconstruction – Trinity(2011), Velvet(2008), TransABySS(2008) Genome-guided reconstruction – Scripture(2010) Reports “all” transcripts – Cufflinks(2010), IsoLasso(2011), SLIDE(2012) Minimizes set of transcripts explaining reads Annotation-guided reconstruction – RABT(2011), DRUT(2011) 6

7 Challenges and Solutions Read length is currently much shorter then transcripts length Paired-end reads Fragment length distribution 7

8 1742365 t 1 : 174365 t 2 : 174235 t 3 :t 4 : 174351742365 Exon 2 and 6 are “distant” exons : how to phase them?

9 Map the RNA-Seq reads to genome Construct Splice Graph - G(V,E) – V : exons – E: splicing events Candidate transcripts – depth-first-search (DFS) Filter candidate transcripts – fragment length distribution (FLD) – integer programming 9 Genome TRIP Transciptome Reconstruction using Integer Programming

10 Gene representation Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events Gene - set of non-overlapping pseudo-exons e1e1 e3e3 e5e5 e2e2 e4e4 e6e6 S pse1 E pse1 S pse2 E pse2 S pse3 E pse3 S pse4 E pse4 S pse5 E pse5 S pse6 E pse6 S pse7 E pse7 Pseudo- exons: e1e1 e5e5 pse 1 pse 2 pse 3 pse 4 pse 5 pse 6 pse 7 Tr 1 : Tr 2 : Tr 3 : 10

11 Splice Graph Genome 1 42 3 5 67 8 9 TSS pseudo-exons TES 11

12 How to select? Select the smallest set of candidate transcripts that yields a good statistical fit between – the fragment length distribution empirically determined during library preparation – fragment lengths implied by mapped read pairs 12 500 300 123 200 13 Mean : 500; Std. dev. 50

13 Simplified IP Formulation Objective Constraints T(p) - set of candidate transcripts on which paired-end read p can be mapped y(t) - 1 if a candidate transcript t is selected, 0 otherwise x(p) - 1 if the pe read p is selected to be mapped 13 for each pe read at least one transcript is selected

14 IP Formulation Objective Constraints 14 for each pe read from every category of std.dev. at least one transcript is selected restricts the number of pe reads mapped within different std. dev. each pe read is mapped no more then with one category of std. dev. every splice junction to be covered

15 Comparison on Simulated Data 100x coverage, 2x100bp pe reads, 500 mean fragment length, 10% sd 15

16 Influence of Sequencing Parameters 100x coverage, 2x100bp pe reads, 500 mean fragment length, 10%-100% sd 16 TRIP-L : individual fragment lengths estimates

17 Results on Real RNA-Seq Data CD1 mouse retina RNA samples specific gene that has 33 annotated transcripts in Ensembl TRIP : 5 out of 10 transcripts, confirmed using qPCR. Cufflinks : 3 out of 10 transcripts 6906 alignments for 22346 read pairs with read length of 68 17

18 Conclusions We introduced a novel method for transcriptome reconstruction from paired-end RNA-Seq reads. Our method : – exploits distribution of fragment lengths – additional experimental data TSS/TES (TRIP with TSS/TES) individual fragment lengths estimates (TRIP-L) 18

19 19


Download ppt "An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia."

Similar presentations


Ads by Google