JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* Scaffolding Large Genomes Using Integer Linear Programming University of Connecticut*Georgia.


1 James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*: Scaffolding Large Genomes Using Integer Linear Programming. *University of Connecticut; Georgia State University

2 De-novo Assembly Paradigm: The Genome -> (Sequencing) -> The Reads -> (Assembly) -> The Contigs -> (Scaffolding) -> The Scaffolds

3 Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation! [Figure labels. Scaffold: gene XYZ, 3’ UTR, 5’ UTR. No scaffold: gene XYZ.]

4 Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation! [Figure labels: gene XYZ, 3’ UTR, 5’ UTR; Sanger sequencing; gene XYZ, 3’ UTR, 5’ UTR.] Biologist: "There are holes in my genes!"

5 Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation!

6 Read Pairs. Paired read construction: reads R1 and R2 are 2kb apart, on the same strand and in the same orientation. Informative reads: align each read against the contigs; only accept uniquely mapped reads (use the non-unique reads later); both reads in a pair must map to different contigs.
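The filtering rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the input format (a list of `(pair_id, read_no, contig)` alignment hits) and the function name `informative_pairs` are assumptions made for the example.

```python
from collections import Counter

def informative_pairs(alignments):
    """Keep pairs whose two reads each map uniquely, and to different contigs.

    `alignments` is a hypothetical list of (pair_id, read_no, contig) hits,
    one entry per alignment of a read; read_no is 1 or 2.
    """
    # Count how many alignments each individual read produced.
    hits = Counter((pair_id, read_no) for pair_id, read_no, _ in alignments)

    # Record the contig of every uniquely mapped read, grouped by pair.
    unique = {}
    for pair_id, read_no, contig in alignments:
        if hits[(pair_id, read_no)] == 1:
            unique.setdefault(pair_id, {})[read_no] = contig

    # A pair is informative if both reads are unique and hit different contigs.
    result = {}
    for pair_id, reads in unique.items():
        if len(reads) == 2 and reads[1] != reads[2]:
            result[pair_id] = (reads[1], reads[2])
    return result

example = [
    ("p1", 1, "c1"), ("p1", 2, "c2"),                   # informative
    ("p2", 1, "c1"), ("p2", 2, "c1"),                   # both reads on one contig
    ("p3", 1, "c1"), ("p3", 1, "c3"), ("p3", 2, "c2"),  # read 1 is non-unique
]
print(informative_pairs(example))
```

Non-unique reads are simply dropped here; the slide says they are used later, which this sketch does not model.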

7 Linkage Information. Two contigs are adjacent if a read pair spans the contigs. Possible states (A, B, C, D): the state depends on the orientation of the read; the order of the contigs is arbitrary. Each read pair can be “consistent” with one of the four states. [Figure labels: 5’, 3’, contig i, contig j, R1, R2, states A, B, C, D]

8 The Scaffolding Problem. Given: contigs, paired reads. Find: orientation, ordering, relative distance. Goal: recreate the true scaffolds. Possible objectives: un-weighted (maximize the number of consistent read pairs) or weighted (each state is weighted by, e.g., overlap with a repeat, deviation from the expected distance, …).

9 Graph Representation. Using the input we can define a scaffolding graph. This is an undirected multi-graph. Assume it is connected.
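The slide's graph definition did not survive in the transcript. One plausible reading, assumed here, is that contigs are the vertices and every informative read pair contributes one edge labelled with its state, so parallel edges arise naturally. The sketch below builds such a multigraph and checks the connectivity assumption; the names `build_graph` and `is_connected` are illustrative only.

```python
from collections import defaultdict, deque

def build_graph(links):
    """Build an undirected multigraph from (contig_i, contig_j, state) links.

    Each link adds one edge, so several read pairs between the same two
    contigs show up as parallel edges.
    """
    adj = defaultdict(list)
    for ci, cj, state in links:
        adj[ci].append((cj, state))
        adj[cj].append((ci, state))
    return adj

def is_connected(adj):
    """Breadth-first search from an arbitrary vertex; True if all are reached."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr, _state in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(adj)

links = [("c1", "c2", "A"), ("c1", "c2", "A"), ("c2", "c3", "D")]
graph = build_graph(links)
print(len(graph["c1"]), is_connected(graph))
```

If the graph is not connected, each connected component can be treated as its own instance.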

10 Integer Linear Program Formulation. Variables: contig pair state; contig orientation; pairwise contig consistency. Objective: maximize the weight of consistent pairs.

11 Constraints: pairwise orientation; mutual exclusivity; forbid 2- and 3-cycles explicitly.
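To make the objective and the first two constraints concrete, here is a toy brute-force search over contig orientations. It is a stand-in for an ILP solver, not the authors' formulation, and it rests on two labelled assumptions: states A and D are taken to require both contigs in the same orientation, and B and C opposite orientations. Ordering and the cycle constraints are not modelled, and the search is exponential in the number of contigs.

```python
from itertools import product

# Assumption for this sketch: A and D need equal orientations, B and C opposite.
SAME_ORIENTATION_STATES = {"A", "D"}

def best_orientation(contigs, weights):
    """Try every 0/1 orientation assignment and keep the heaviest one.

    `weights` maps a contig pair (ci, cj) to {state: weight}.
    """
    best_score, best_assign = -1, None
    for bits in product((0, 1), repeat=len(contigs)):
        orient = dict(zip(contigs, bits))
        score = 0
        for (ci, cj), by_state in weights.items():
            same = orient[ci] == orient[cj]
            # Pairwise orientation: only states compatible with the chosen
            # orientations may be selected for this pair.
            allowed = [w for s, w in by_state.items()
                       if (s in SAME_ORIENTATION_STATES) == same]
            # Mutual exclusivity: at most one state per pair, so take the best.
            score += max(allowed, default=0)
        if score > best_score:
            best_score, best_assign = score, orient
    return best_score, best_assign

contigs = ["c1", "c2", "c3"]
weights = {
    ("c1", "c2"): {"A": 5, "B": 1},
    ("c2", "c3"): {"C": 2},
    ("c1", "c3"): {"D": 4},
}
score, orient = best_orientation(contigs, weights)
print(score, orient)
```

In the example the three pairs cannot all be satisfied at once, so the search gives up the weight-2 state on (c2, c3) to keep the two heavier ones.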

12 Graph Decomposition: Articulation Points [Figure labels: solve, solve, solve, solve; articulation point]
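An articulation point is a vertex whose removal disconnects the graph, which is what lets the parts on either side be solved separately. The standard depth-first search for finding them can be sketched as follows. The adjacency format (`{node: set of neighbours}`) is an assumption for the example, and the recursive form would hit Python's recursion limit on very large graphs.

```python
def articulation_points(adj):
    """Return the articulation points of an undirected graph.

    `adj` maps each node to an iterable of its neighbours. Parallel edges
    can be collapsed beforehand; they do not change the answer.
    """
    disc, low, points = {}, {}, set()
    counter = 0

    def dfs(node, parent):
        nonlocal counter
        disc[node] = low[node] = counter
        counter += 1
        children = 0
        for nbr in adj[node]:
            if nbr == parent:
                continue
            if nbr in disc:
                # Back edge: node can reach an earlier-discovered vertex.
                low[node] = min(low[node], disc[nbr])
            else:
                children += 1
                dfs(nbr, node)
                low[node] = min(low[node], low[nbr])
                # The subtree under nbr cannot climb above node.
                if parent is not None and low[nbr] >= disc[node]:
                    points.add(node)
        # The DFS root is a cut vertex only if it has several DFS children.
        if parent is None and children > 1:
            points.add(node)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return points

# Two triangles joined by the single edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(sorted(articulation_points(adj)))
```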

13 Graph Decomposition: 2-cuts [Figure labels: 2-cut; sign pairs + +, + -, - +, - -]

14 Non-Serial Dynamic Programming (NSDP). An SPQR-tree is used to schedule the decomposition. Traverse the tree using DFS. NSDP utilizes the solutions of the previous stage in the current stage.

15 Largest Connected Component

16 Largest Biconnected Component

17 Largest Triconnected Component

18 Post Processing. The ILP solution may have cycles and is not a total ordering for each connected component. [Figure: the ILP solution over contigs A to F, redrawn as a bipartite graph with outgoing nodes A B C D E F and incoming nodes A B C D E F.] Bipartite matching objectives: max weight; max cardinality; max cardinality / max weight.
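The max-cardinality objective can be illustrated with the classic augmenting-path algorithm for bipartite matching. In this sketch each contig appears once on the "outgoing" side and once on the "incoming" side, so a matching gives every contig at most one successor and at most one predecessor. The input format and function name are assumptions, weights are ignored, and the matching alone does not guarantee that no cycles remain.

```python
def max_cardinality_matching(edges):
    """Match outgoing contigs to incoming contigs via augmenting paths.

    `edges` maps contig u to the contigs that may directly follow it.
    Returns {u: v}, meaning v is chosen as the successor of u.
    """
    match_in = {}  # incoming contig v -> outgoing contig u matched to it

    def try_augment(u, visited):
        for v in edges.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can move on.
            if v not in match_in or try_augment(match_in[v], visited):
                match_in[v] = u
                return True
        return False

    for u in edges:
        try_augment(u, set())
    return {u: v for v, u in match_in.items()}

edges = {
    "A": ["C", "B"],  # A first grabs C, then is rerouted to B
    "B": ["C"],
    "C": ["D"],
    "D": ["E"],
    "E": ["F"],
}
print(max_cardinality_matching(edges))
```

Following successor links from a contig with no predecessor then reads off a path, here A, B, C, D, E, F.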

19 Testing Framework: Venter Genome

Read Type     Total Reads   Total Bases   Avg length   Coverage
Sanger        31,861,976    2.79E+10      875          9.930637
SOLiD pairs   4.85E+08      2.42E+10      50           8.623028

4x Assembly
# Reads      # Bases in reads   # Contigs   # Bases in contigs   N50
112,00,000   1.1E+10            422,837     2.26E+09             7704

20 Testing Metrics. Computer scientists: finding scaffolds = binary classification test; with n contigs, try to predict n-1 adjacencies; TP, FP, TN, FN, sensitivity, PPV. Biologists (main focus): N50 (basically average scaffold size, ignoring gaps); TP50 (break each scaffold at incorrect edges, then find N50).
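These metrics are easy to state in code. The sketch below uses the usual definition of N50 (the length L such that scaffolds of length at least L hold at least half of all bases), computes TP50 as described on the slide, and derives sensitivity and PPV from sets of adjacencies. The data layout, with scaffolds as lists of `(contig, length)` and adjacencies as `frozenset` pairs, is an assumption for the example; gaps are ignored, as on the slide.

```python
def n50(lengths):
    """Length at which the running sum of sorted lengths reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def break_at_incorrect(scaffold, correct):
    """Split one scaffold wherever two neighbours are not a true adjacency."""
    pieces, current, prev = [], 0, None
    for name, length in scaffold:
        if prev is not None and frozenset((prev, name)) not in correct:
            pieces.append(current)
            current = 0
        current += length
        prev = name
    if scaffold:
        pieces.append(current)
    return pieces

def tp50(scaffolds, correct):
    """N50 after breaking every scaffold at its incorrect edges."""
    pieces = []
    for scaffold in scaffolds:
        pieces.extend(break_at_incorrect(scaffold, correct))
    return n50(pieces)

def sensitivity_ppv(predicted, truth):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    tp = len(predicted & truth)
    fn = len(truth - predicted)
    fp = len(predicted - truth)
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sens, ppv

scaffolds = [[("c1", 100), ("c2", 200), ("c3", 300)], [("c4", 150)]]
truth = {frozenset(("c1", "c2")), frozenset(("c3", "c4"))}
predicted = {frozenset(("c1", "c2")), frozenset(("c2", "c3"))}

sizes = [sum(length for _, length in s) for s in scaffolds]
print(n50(sizes), tp50(scaffolds, truth), sensitivity_ppv(predicted, truth))
```

In the example the wrong c2-c3 join inflates N50 to 600, while TP50 drops to 300 once the scaffold is broken at that edge.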

21 Results

test case   method   bundle size   sensitivity   ppv      N50      TP50
10%         opera    2             81.13%        99.26%   27,567   27,327
10%         mip      2             59.01%        98.94%   19,988   19,755
10%         ilp      1             79.86%        98.58%   26,814   26,459
25%         opera    2             80.44%        98.27%   27,296   26,849
25%         mip      2             58.95%        97.56%   19,842   19,518
25%         ilp      1             79.30%        96.93%   26,684   26,079
100%        opera    3             pending…      …        …
100%        mip      3             failed        n/a
100%        ilp      1             68.25%        89.90%   20,538   19,006

22 Conclusions. Success: ILP solves the scaffolding problem! NSDP works. Improvements: finalize large test cases (then publish?!); practical considerations (read style, multi-libraries, merge ctgs). Future work: Where else can I apply NSDP? Scaffold before assembly?? Structural variation??

23 Questions?

