Scaffolding Large Genomes Using Integer Linear Programming
James Lindsay*, Hamed Salooti, Alex Zelikovsky, Ion Mandoiu*
*University of Connecticut; Georgia State University

Presented at ACM-BCB 2012.
Presentation transcript:

Scaffolding Large Genomes Using Integer Linear Programming
James Lindsay*, Hamed Salooti, Alex Zelikovsky, Ion Mandoiu*
*University of Connecticut; Georgia State University

De-novo Assembly Paradigm
The Genome -> (Sequencing) -> The Reads -> (Assembly) -> The Contigs -> (Scaffolding) -> The Scaffolds

Why Scaffolding?
- Annotation
  - Comparative biology
- Re-sequencing and gap filling
- Structural variation!
[Figure: gene XYZ with its 5' UTR and 3' UTR, labeled "Scaffold"; gene XYZ alone, labeled "No scaffold"]

Why Scaffolding? (continued)
[Figure: gene XYZ with its 5' UTR and 3' UTR, labeled "Sanger Sequencing"; a second copy of gene XYZ with its 5' UTR and 3' UTR]
Biologist: There are holes in my genes!

Read Pairs
Paired Read Construction
- 2kb, same strand and orientation (R1, R2)
Informative Reads
- Align each read against the contigs
- Only accept uniquely mapped reads
  - Use the non-unique reads later
- Both reads in a pair must map to different contigs
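
The filtering rules on this slide can be sketched in a few lines of Python. This is only an illustration: the input format (one `(pair_id, mate, contig)` tuple per reported alignment) and the function name `informative_pairs` are assumptions, not something taken from the talk.

```python
from collections import Counter


def informative_pairs(alignments):
    """Keep read pairs whose two mates map uniquely to different contigs.

    alignments: list of (pair_id, mate, contig) tuples, where mate is 1 or 2
    and each tuple is one reported alignment (hypothetical input format).
    Returns {pair_id: (contig_of_R1, contig_of_R2)}.
    """
    # Count how many alignments each individual read has.
    hits = Counter((pair_id, mate) for pair_id, mate, _ in alignments)

    # Keep only reads with exactly one alignment (uniquely mapped).
    unique = {}
    for pair_id, mate, contig in alignments:
        if hits[(pair_id, mate)] == 1:
            unique.setdefault(pair_id, {})[mate] = contig

    # Both mates must be uniquely mapped, and to different contigs.
    keep = {}
    for pair_id, mates in unique.items():
        if len(mates) == 2 and mates[1] != mates[2]:
            keep[pair_id] = (mates[1], mates[2])
    return keep
```

Pairs whose mates land on the same contig carry no linkage information between contigs, which is why they are dropped here; the non-unique reads that the slide sets aside for later are simply ignored in this sketch.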

Linkage Information
Two contigs are adjacent if:
- A read pair spans the contigs
State (A, B, C, D)
- Depends on orientation of the read
- Order of contigs is arbitrary
Each read pair can be "consistent" with one of the four states
[Figure: Possible States. Contig i and contig j, drawn 5' to 3', with reads R1 and R2 placed in each of the states A, B, C, D]
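
As a rough illustration of how a read pair could be assigned to one of the four states, the sketch below looks only at the strand on which each mate aligned. The slide does not say which orientation combination corresponds to which letter, so the mapping used here (and the function name `pair_state`) is an assumption made purely for the example.

```python
# Assumed labeling (not taken from the slides):
#   A: R1 forward, R2 forward     B: R1 forward, R2 reverse
#   C: R1 reverse, R2 forward     D: R1 reverse, R2 reverse
STATE_BY_STRANDS = {
    (True, True): "A",
    (True, False): "B",
    (False, True): "C",
    (False, False): "D",
}


def pair_state(r1_forward, r2_forward):
    """Return the (assumed) state label for a read pair.

    r1_forward / r2_forward: True if the mate aligned to the forward
    strand of its contig, False if it aligned to the reverse strand.
    """
    return STATE_BY_STRANDS[(bool(r1_forward), bool(r2_forward))]
```

Whatever labeling is used, the point from the slide carries over: each read pair votes for exactly one of the four states of the contig pair it spans.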

The Scaffolding Problem
Given
- Contigs
- Paired reads
Find
- Orientation
- Ordering
- Relative distance
Goal
- Recreate true scaffolds
Possible Objectives
- Un-weighted: max number of consistent read pairs
- Weighted: each state is weighted by
  - Overlap with repeat
  - Deviation from expected distance
  - …

Graph Representation
Using the input we can define a scaffolding graph:
- This is an undirected multi-graph
- Assume it is connected
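
A minimal sketch of such a graph, assuming each informative read pair has already been reduced to a `(contig_i, contig_j, state)` link with the contigs in a fixed canonical order. The bundling of parallel edges by state and both function names are illustrative choices, not the authors' implementation.

```python
from collections import defaultdict


def build_scaffolding_graph(links):
    """Bundle read-pair links into a multigraph keyed by contig pair.

    links: iterable of (contig_i, contig_j, state) tuples. Assumption: each
    link is reported with its contigs in a canonical order (i < j), so the
    state label is unambiguous.
    Returns {(i, j): {state: number_of_supporting_read_pairs}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, j, state in links:
        graph[(i, j)][state] += 1
    return {pair: dict(states) for pair, states in graph.items()}


def is_connected(graph):
    """Check the 'assume it is connected' condition with a simple DFS."""
    adjacency = defaultdict(set)
    for i, j in graph:
        adjacency[i].add(j)
        adjacency[j].add(i)
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(adjacency)
```

If the graph is not connected, each connected component can be handled as its own scaffolding instance.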

Integer Linear Program Formulation
Variables
- Contig Pair State:
- Contig Orientation:
- Pairwise Contig Consistency:
Objective
- Maximize weight of consistent pairs
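
To make the objective concrete, here is a tiny exhaustive search that stands in for an ILP solver on toy instances. It assumes, for illustration only, that states A and D require the two contigs to have the same orientation and that B and C require opposite orientations; it also ignores ordering and cycle constraints entirely. The names `best_orientation`, `SAME` and `OPPOSITE` are hypothetical.

```python
from itertools import product

# Assumption for this sketch: which states are compatible with which
# relative orientation of the two contigs.
SAME = ("A", "D")
OPPOSITE = ("B", "C")


def best_orientation(contigs, weights):
    """Brute-force the orientation assignment with maximum consistent weight.

    contigs: list of contig names.
    weights: {(i, j): {state: weight}} for each linked contig pair.
    Returns (best_score, {contig: 0 or 1}), with 0 = forward, 1 = reversed.
    Exponential in the number of contigs, so only usable on toy inputs.
    """
    best_score, best_assign = -1.0, None
    for bits in product((0, 1), repeat=len(contigs)):
        orient = dict(zip(contigs, bits))
        score = 0.0
        for (i, j), state_weights in weights.items():
            allowed = SAME if orient[i] == orient[j] else OPPOSITE
            # At most one state may be selected per contig pair.
            score += max(state_weights.get(s, 0.0) for s in allowed)
        if score > best_score:
            best_score, best_assign = score, orient
    return best_score, best_assign
```

An ILP solver explores the same space implicitly through binary variables and linear constraints instead of enumerating every assignment.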

Constraints
- Pairwise Orientation
- Mutual Exclusivity
- Forbid 2 and 3 Cycles Explicitly
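
The formulas on these slides did not survive in the transcript. One way the first two constraint families could be written is sketched below. The symbols are assumptions introduced for this sketch: binary $S_i$ for the orientation of contig $i$, binary $S_{ij}$ equal to 1 when contigs $i$ and $j$ have opposite orientations, and binary $A_{ij}, B_{ij}, C_{ij}, D_{ij}$ for the selected state with weights $w^{X}_{ij}$. The sketch also assumes that states A and D require equal orientations. The cycle constraints are not shown.

```latex
\begin{align}
\max \quad & \sum_{(i,j)} \; \sum_{X \in \{A,B,C,D\}} w^{X}_{ij} \, X_{ij} \\
\text{s.t.} \quad
  & S_{ij} \le S_i + S_j, \qquad S_{ij} \le 2 - S_i - S_j, \\
  & S_{ij} \ge S_i - S_j, \qquad S_{ij} \ge S_j - S_i, \\
  & A_{ij} + D_{ij} \le 1 - S_{ij}, \qquad B_{ij} + C_{ij} \le S_{ij}, \\
  & S_i,\; S_{ij},\; A_{ij},\; B_{ij},\; C_{ij},\; D_{ij} \in \{0, 1\}.
\end{align}
```

The four inequalities on $S_{ij}$ force it to equal the exclusive-or of $S_i$ and $S_j$, and the last pair of inequalities allows at most one state per contig pair, chosen from those compatible with the pair's relative orientation.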

Graph Decomposition: Articulation Points
[Figure: the graph is split at an articulation point and each resulting part is solved separately]
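
Articulation points can be found with the standard depth-first search low-link technique. The sketch below is a generic textbook version and not code from the talk; the adjacency-dictionary input format is an assumption.

```python
def articulation_points(adj):
    """Return the set of articulation points of an undirected graph.

    adj: {node: iterable of neighbours}. Parallel edges are harmless here
    because they do not change which vertices disconnect the graph.
    Recursive, so very large graphs may need an iterative rewrite.
    """
    disc, low, result = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier vertex.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # The subtree under v cannot reach above u: u is a cut vertex.
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
        # The DFS root is a cut vertex only if it has several DFS children.
        if parent is None and children > 1:
            result.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return result
```

Splitting the scaffolding graph at these vertices yields biconnected pieces that can be solved one at a time.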

Graph Decomposition: 2-cuts
[Figure: a graph separated by a 2-cut, a pair of vertices whose removal disconnects it]
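
For intuition, a 2-cut can be found by brute force: remove every pair of vertices in turn and test whether the rest of the graph stays connected. This quadratic enumeration is only a sketch for small, biconnected inputs and is not how a production decomposition would be computed; the function names are hypothetical.

```python
from itertools import combinations


def is_connected_without(adj, removed):
    """Check connectivity of the graph after deleting the `removed` vertices."""
    nodes = [n for n in adj if n not in removed]
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def two_cuts(adj):
    """List vertex pairs whose removal disconnects the graph.

    Assumes the input is biconnected; otherwise pairs that contain an
    articulation point are reported as well.
    """
    return [
        pair
        for pair in combinations(sorted(adj), 2)
        if not is_connected_without(adj, set(pair))
    ]
```

On a 4-cycle a-b-c-d, for example, the two pairs of opposite vertices are the only 2-cuts.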

Non-Serial Dynamic Programming
- SPQR-tree to schedule decomposition
- Traverse tree using DFS
- NSDP utilizes solutions of previous stage in current stage
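
The scheduling idea can be sketched as a post-order traversal of a decomposition tree in which every component is solved only after its children, and the children's solutions are passed in. The tree representation and the `solve_component` callback are hypothetical stand-ins: building a real SPQR-tree and solving the per-component ILP are outside this sketch.

```python
def solve_bottom_up(tree, root, solve_component):
    """Solve components of a decomposition tree in DFS post-order.

    tree: {node: [child nodes]} (leaves may be omitted).
    solve_component(node, child_solutions) -> solution for that node,
        a caller-supplied function (hypothetical interface).
    Returns (solution_at_root, order_in_which_nodes_were_solved).
    """
    solutions, order = {}, []
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            # All children are solved by now; reuse their solutions.
            child_solutions = [solutions[c] for c in tree.get(node, [])]
            solutions[node] = solve_component(node, child_solutions)
            order.append(node)
        else:
            stack.append((node, True))
            for child in reversed(tree.get(node, [])):
                stack.append((child, False))
    return solutions[root], order
```

Using an explicit stack rather than recursion avoids recursion-depth limits on deep decomposition trees.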

Largest Connected Component

Largest Biconnected Component

Largest Triconnected Component

Post Processing
ILP Solution
- May have cycles
- Not a total ordering for each connected component
[Figure: ILP solution on contigs A to F, redrawn as a bipartite graph with an "outgoing" copy and an "incoming" copy of A, B, C, D, E, F]
Bipartite matching
Objectives:
- Max weight
- Max cardinality
- Max cardinality / Max weight
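
The max-cardinality variant can be sketched with the classic augmenting-path algorithm for bipartite matching. Each contig appears once on the "outgoing" side and once on the "incoming" side, so a matching gives every contig at most one successor and at most one predecessor. This is a generic illustration under an assumed input format; it does not handle weights and does not break any cycles that remain in the matched edges.

```python
def max_cardinality_matching(successors):
    """Pick at most one successor and one predecessor per contig.

    successors: {contig: [candidate next contigs]}, taken from the directed
    edges of the ILP solution (hypothetical input format).
    Returns {contig: chosen_next_contig}.
    """
    matched_pred = {}  # incoming-side contig -> outgoing-side contig

    def try_assign(u, visited):
        for v in successors.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current predecessor can be
            # re-routed to a different successor (augmenting path).
            if v not in matched_pred or try_assign(matched_pred[v], visited):
                matched_pred[v] = u
                return True
        return False

    for u in successors:
        try_assign(u, set())
    return {u: v for v, u in matched_pred.items()}
```

Following the chosen successor links from any contig without a predecessor then spells out one linear scaffold.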

Testing Framework
Venter Genome
Columns: Read Type, Total Reads, Total Bases, Avg length, Coverage
- Sanger: 31,861, E
- SOLiD pairs: 4.85E E
Assembly
Columns: # Reads, # Bases in reads, # Contigs, # Bases in contigs, N50
- 112,00,000 1.1E+10 422, E x

Testing Metrics
Computer Scientists
- Finding Scaffold = Binary Classification Test
- n contigs, try to predict n-1 adjacencies
- TP, FP, TN, FN, Sensitivity, PPV
Biologists (main focus)
- N50 (basically average scaffold size, ignore gaps)
- TP50
  - Break scaffold at incorrect edges, then find N50
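
These metrics are simple to compute once scaffolds are represented as ordered lists of contigs. The sketch below assumes that representation, ignores gaps as the slide suggests, and uses hypothetical function names; N50 is computed in the usual way, as the length at which the sorted pieces first cover half of the total.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def tp50(scaffolds, contig_len, true_adjacencies):
    """Break scaffolds at incorrect adjacencies, then take N50 of the pieces.

    scaffolds: list of non-empty lists of contig names, in scaffold order.
    contig_len: {contig: length}; gaps between contigs are ignored.
    true_adjacencies: set of frozenset({left, right}) pairs known to be correct.
    """
    pieces = []
    for scaffold in scaffolds:
        current = contig_len[scaffold[0]]
        for left, right in zip(scaffold, scaffold[1:]):
            if frozenset((left, right)) in true_adjacencies:
                current += contig_len[right]
            else:
                pieces.append(current)  # cut the scaffold at the wrong edge
                current = contig_len[right]
        pieces.append(current)
    return n50(pieces)


def sensitivity_ppv(predicted, truth):
    """Sensitivity = TP / |truth|, PPV = TP / |predicted| over adjacency sets."""
    tp = len(predicted & truth)
    sensitivity = tp / len(truth) if truth else 0.0
    ppv = tp / len(predicted) if predicted else 0.0
    return sensitivity, ppv
```

Because TP50 is computed after cutting at wrong joins, a scaffolder cannot inflate it by joining contigs aggressively, which is what makes it more informative than N50 alone.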

Results
test case  method  bundle size  sensitivity  ppv     N50     TP50
10%        opera   2            81.13%       99.26%  27,567  27,327
10%        mip     2            59.01%       98.94%  19,988  19,755
10%        ilp     1            79.86%       98.58%  26,814  26,459
25%        opera   2            80.44%       98.27%  27,296  26,849
25%        mip     2            58.95%       97.56%  19,842  19,518
25%        ilp     1            79.30%       96.93%  26,684  26,
100%       opera   3            pending…     …       …
100%       mip     3            failed       n/a
100%       ilp     1            68.25%       89.90%  20,538  19,006

Conclusions
Success
- ILP solves scaffolding problem!
- NSDP works.
Improvements
- Finalize large test cases (then publish?!)
- Practical considerations (read style, multi-libraries, merge contigs)
Future Work
- Where else can I apply NSDP?
- Scaffold before assembly??
- Structural Variation??

Questions?