JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKY, ION MANDOIU*. Scaffolding Large Genomes Using Integer Linear Programming. *University of Connecticut; Georgia State University
De-novo Assembly Paradigm. The Genome -> (Sequencing) -> The Reads -> (Assembly) -> The Contigs -> (Scaffolding) -> The Scaffolds
Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation! [Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown with a scaffold and with no scaffold.]
Why Scaffolding? Annotation; comparative biology; re-sequencing and gap filling; structural variation! [Figure: Sanger sequencing of gene XYZ with its 5’ UTR and 3’ UTR.] Biologist: "There are holes in my genes!"
Read Pairs. Paired read construction: 2kb, same strand and orientation (R1, R2). Informative reads: align each read against the contigs; only accept uniquely mapped reads (use the non-unique reads later); both reads in a pair must map to different contigs.
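The filtering rules on this slide can be sketched in a few lines of Python. This is only an illustration: the alignment tuple layout `(pair_id, mate, contig, position, strand)` and the function name `informative_pairs` are assumptions made for the example, not the format used by the authors.

```python
from collections import Counter

def informative_pairs(alignments):
    """Keep read pairs whose two mates each map uniquely, to different contigs.

    alignments: list of (pair_id, mate, contig, position, strand) tuples,
    with mate equal to 1 or 2 (hypothetical layout for this sketch).
    Returns {pair_id: (hit_r1, hit_r2)} where hit = (contig, position, strand).
    """
    # Count how many alignments each individual read has.
    hits_per_read = Counter((pair_id, mate) for pair_id, mate, *_ in alignments)

    # Keep only uniquely mapped reads, grouped by pair.
    unique = {}
    for pair_id, mate, contig, pos, strand in alignments:
        if hits_per_read[(pair_id, mate)] == 1:
            unique.setdefault(pair_id, {})[mate] = (contig, pos, strand)

    # Both mates must survive and must land on different contigs.
    return {
        pair_id: (mates[1], mates[2])
        for pair_id, mates in unique.items()
        if len(mates) == 2 and mates[1][0] != mates[2][0]
    }

alignments = [
    ("p1", 1, "c1", 100, "+"), ("p1", 2, "c2", 50, "+"),    # informative
    ("p2", 1, "c1", 200, "+"), ("p2", 2, "c1", 2200, "+"),  # same contig
    ("p3", 1, "c2", 10, "-"), ("p3", 2, "c3", 5, "+"),
    ("p3", 2, "c4", 7, "+"),                                # R2 maps twice
    ("p4", 1, "c3", 10, "+"),                               # mate missing
]
result = informative_pairs(alignments)
```

Non-unique reads are simply dropped here; the slide notes that they are used later, which this sketch does not attempt.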
Linkage Information. Two contigs are adjacent if a read pair spans the contigs. Possible states (A, B, C, D): the state depends on the orientation of the read; the order of the contigs is arbitrary; each read pair can be “consistent” with one of the four states. [Figure: contig i and contig j drawn 5’ to 3’, with reads R1 and R2 shown in states A, B, C, D.]
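As a rough sketch of how a read pair could be assigned to one of the four states, the snippet below labels a pair by the strands of its two reads. The transcript does not preserve which strand combination corresponds to which letter, so the mapping in `STATE_BY_STRANDS` is a purely hypothetical convention, and a fuller treatment would also track which mate lands on which contig.

```python
# Hypothetical convention (not from the slides): the state is named after the
# strands of the reads on contig i and contig j, in that order.
STATE_BY_STRANDS = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}

def pair_state(hit_r1, hit_r2):
    """Label one informative read pair.

    hit = (contig, position, strand). Returns (contig_i, contig_j, state).
    """
    # The order of the contigs is arbitrary, so fix it by sorting on contig
    # name; every pair linking the same two contigs then shares one frame.
    (ci, _, si), (cj, _, sj) = sorted([hit_r1, hit_r2])
    return ci, cj, STATE_BY_STRANDS[(si, sj)]

link = pair_state(("c2", 50, "-"), ("c1", 100, "+"))
```

Counting how many read pairs fall into each `(contig_i, contig_j, state)` bucket gives the per-state support that the later slides optimize over.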
The Scaffolding Problem. Given: contigs, paired reads. Find: orientation, ordering, relative distance. Goal: recreate true scaffolds. Possible objectives: un-weighted (max number of consistent read pairs) or weighted (each state is weighted: overlap with repeat, deviation of expected distance, …).
Graph Representation. Using the input we can define a scaffolding graph: this is an undirected multi-graph; assume it is connected.
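One possible in-memory form of such a multi-graph is sketched below: contigs are nodes and every read pair contributes one parallel edge. The `(contig_i, contig_j, state)` link format and both function names are assumptions for illustration. Because the slide assumes a connected graph, the sketch also splits the graph into connected components so each can be handled on its own.

```python
from collections import defaultdict

def build_scaffolding_graph(links):
    """links: iterable of (contig_i, contig_j, state), one entry per read pair.

    Returns {contig: {neighbor: [state, ...]}}; parallel edges are kept as a
    list. For simplicity the state is stored unchanged in both directions.
    """
    graph = defaultdict(lambda: defaultdict(list))
    for ci, cj, state in links:
        graph[ci][cj].append(state)
        graph[cj][ci].append(state)
    return graph

def connected_components(graph):
    """Return the node sets of the connected components (iterative DFS)."""
    seen, components = set(), []
    for start in list(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(n for n in graph[node] if n not in component)
        seen |= component
        components.append(component)
    return components

links = [("c1", "c2", "A"), ("c1", "c2", "A"), ("c2", "c3", "B"), ("c4", "c5", "D")]
graph = build_scaffolding_graph(links)
components = connected_components(graph)
```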
Integer Linear Program Formulation. Variables: contig pair state, contig orientation, pairwise contig consistency. Objective: maximize weight of consistent pairs.
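The exact constraints are not preserved in this transcript, so the following is not the authors' ILP. It is a deliberately simplified stand-in that replaces the ILP solver with exhaustive search over contig orientations, to show what "maximize weight of consistent pairs" means on a toy instance. It ignores contig ordering entirely, and the `STATE_ORIENTATIONS` table is a hypothetical convention.

```python
from itertools import product

# Hypothetical convention: each state names the orientation pair
# (contig i, contig j) that makes a read pair consistent.
# 0 = forward, 1 = reversed.
STATE_ORIENTATIONS = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}

def best_orientations(contigs, pair_weights):
    """Exhaustively pick the orientation per contig with maximum weight.

    pair_weights: {(contig_i, contig_j, state): weight}
    Returns (best_weight, {contig: orientation}). Runs in O(2^n) time, so it
    is only usable on toy inputs; an ILP solver replaces this loop in practice.
    """
    best_weight, best_assignment = -1, None
    for bits in product((0, 1), repeat=len(contigs)):
        orient = dict(zip(contigs, bits))
        # Sum the weight of every state that agrees with this assignment.
        weight = sum(
            w
            for (ci, cj, state), w in pair_weights.items()
            if (orient[ci], orient[cj]) == STATE_ORIENTATIONS[state]
        )
        if weight > best_weight:
            best_weight, best_assignment = weight, orient
    return best_weight, best_assignment

pair_weights = {
    ("c1", "c2", "A"): 5,
    ("c1", "c2", "B"): 1,
    ("c2", "c3", "B"): 3,
    ("c1", "c3", "A"): 2,
}
best_weight, assignment = best_orientations(["c1", "c2", "c3"], pair_weights)
```

In this toy instance the three links cannot all be satisfied at once, so the search gives up the lightest conflicting one.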
Post Processing. The ILP solution may have cycles and is not a total ordering for each connected component. [Figure: ILP solution on contigs A, B, C, D, E, F, redrawn as a bipartite graph with outgoing and incoming copies of each contig.] Bipartite matching objectives: max weight; max cardinality; max cardinality / max weight.
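A minimal sketch of the max-weight variant of the matching step follows, assuming the ILP output is available as weighted directed edges between contigs (the edge weights below are invented for the example). Each contig appears once on the outgoing side and once on the incoming side, so a matching gives every contig at most one successor and at most one predecessor. The solver is a small bitmask dynamic program chosen to keep the example dependency-free; it is exponential in the number of contigs, and a real pipeline would use a polynomial-time matching algorithm. Positive weights are assumed.

```python
from functools import lru_cache

def max_weight_matching(nodes, edge_weights):
    """Match outgoing copies of contigs to incoming copies, maximizing weight.

    nodes: list of contig names.
    edge_weights: {(u, v): weight} for a directed edge u -> v.
    Returns (total_weight, {u: v}).
    """
    index = {name: k for k, name in enumerate(nodes)}

    @lru_cache(maxsize=None)
    def solve(i, used_mask):
        # Best result for outgoing nodes i.., given incoming nodes already used.
        if i == len(nodes):
            return 0, ()
        # Option 1: leave nodes[i] without a successor.
        best = solve(i + 1, used_mask)
        u = nodes[i]
        # Option 2: match nodes[i] to any still-free incoming node.
        for (a, v), w in edge_weights.items():
            bit = 1 << index[v]
            if a == u and not used_mask & bit:
                sub_weight, sub_pairs = solve(i + 1, used_mask | bit)
                if sub_weight + w > best[0]:
                    best = (sub_weight + w, ((u, v),) + sub_pairs)
        return best

    total, pairs = solve(0, 0)
    return total, dict(pairs)

edges = {("A", "B"): 4, ("A", "C"): 3, ("B", "C"): 5, ("C", "D"): 2, ("B", "D"): 1}
total, successor = max_weight_matching(["A", "B", "C", "D"], edges)
```

Here the matching selects the path A, B, C, D. A matching alone does not rule out cycles, so any that remain would still need to be broken separately.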
Testing Framework: Venter Genome
Read Type    Total Reads  Total Bases  Avg Length  Coverage
Sanger       31,861,976   2.79E+10     875         9.930637
SOLiD pairs  4.85E+08     2.42E+10     50          8.623028
4x Assembly
# Reads     # Bases in Reads  # Contigs  # Bases in Contigs  N50
112,00,000  1.1E+10           422,837    2.26E+09            7704
Testing Metrics. Computer scientists: finding a scaffold = binary classification test; with n contigs, try to predict n-1 adjacencies; TP, FP, TN, FN, sensitivity, PPV. Biologists (main focus): N50 (basically average scaffold size, ignoring gaps); TP50 (break scaffolds at incorrect edges, then find N50).
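These metrics are easy to state in code. The sketch below uses the standard definition of N50 (the piece length at which the longest pieces together hold at least half of all bases) and measures scaffold size as the sum of contig lengths, so gaps are ignored. The scaffold representation, a list of contig lengths plus one correctness flag per adjacency, is an assumption made for the example.

```python
def n50(lengths):
    """Length L such that pieces of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0

def tp50(scaffolds):
    """Break each scaffold at its incorrect edges, then take N50 of the pieces.

    scaffolds: list of (contig_lengths, edge_is_correct), where edge_is_correct
    has one boolean per adjacency between consecutive contigs.
    """
    pieces = []
    for contig_lengths, edge_is_correct in scaffolds:
        current = contig_lengths[0]
        for length, ok in zip(contig_lengths[1:], edge_is_correct):
            if ok:
                current += length        # correct adjacency: extend the piece
            else:
                pieces.append(current)   # incorrect adjacency: cut here
                current = length
        pieces.append(current)
    return n50(pieces)

def sensitivity_ppv(tp, fp, fn):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    return tp / (tp + fn), tp / (tp + fp)

scaffolds = [([5000, 3000, 2000], [True, False]), ([4000, 1000], [True])]
scaffold_n50 = n50([sum(lengths) for lengths, _ in scaffolds])
scaffold_tp50 = tp50(scaffolds)
```

In this example the first scaffold contains one wrong adjacency, so its 10,000 bases are split into pieces of 8,000 and 2,000 before N50 is recomputed, which is why TP50 comes out lower than N50.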