Fragment Assembly 7/30/2019.

Presentation transcript:

Fragment Assembly 7/30/2019

Introduction: Fragments are typically 200-700 bp long. The "target" string is about 30k-100k bp long. Problem: given a set of fragments, reconstruct the target.

Introduction: The fragments are multiply aligned, ignoring spaces at the ends. The alignment is called the "layout"; the output is called the "consensus sequence". This is an optimization problem.

Complications: Base-call errors. Substitution errors [p 107]; insertion errors (possibly from the host sequence) [p 108, fig 4.3]; deletion errors [fig 4.4]. Majority voting (or some other form of optimization) resolves them.
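Majority voting over the columns of a layout can be sketched as follows. This is a minimal illustration, not the book's procedure: the layout representation (equal-length rows, ' ' for positions outside a fragment, '-' for a gap inside a fragment) and the name `consensus` are assumptions made for this example.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus over a layout (hypothetical representation).

    Each row is one fragment padded to the layout width; ' ' means the
    fragment does not cover that column, '-' is a gap inside the fragment.
    """
    width = len(layout[0])
    result = []
    for col in range(width):
        # Only characters that belong to a fragment get a vote.
        votes = Counter(row[col] for row in layout if row[col] != ' ')
        if not votes:
            continue  # no fragment covers this column
        winner, _ = votes.most_common(1)[0]
        if winner != '-':  # a winning gap means the column is dropped
            result.append(winner)
    return ''.join(result)


# The third fragment has a substitution error (A instead of G) in column 3.
layout = [
    "ACCGT    ",
    "  CGTGC  ",
    "  CATGCTA",
]
print(consensus(layout))
```

Ties are broken arbitrarily here; a real assembler would typically weigh base-call quality as well.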

Complications: Chimeras. Two non-contiguous fragments get joined into a single fragment [p 109, fig 4.5]. Chimeras need to be weeded out in a preprocessing step. Similarly, contaminant fragments (possibly from the host) need to be filtered out as well.

Complications: Unknown orientation. Fragments may come from either strand. If a fragment comes from the opposite strand, its reverse complement must be in the target string. Consequence: try both the forward and the reverse-complement form of each fragment (2^n combinations in the worst case, for n fragments) [p 109, fig 4.6].
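For reference, the reverse complement can be computed as below. A small sketch, assuming an uppercase A/C/G/T alphabet; the function names are chosen for this example only.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(COMPLEMENT)[::-1]


def orientations(fragment):
    """Both forms of a fragment that an assembler has to consider."""
    return (fragment, reverse_complement(fragment))


print(orientations("ACCGT"))
```

Since each of the n fragments has two candidate orientations, trying all combinations gives the 2^n worst case mentioned on the slide.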

Complications: Repeats. Regions (superstrings of some fragments) may repeat in the target. Consequent problem: under approximate alignment, which copy do the fragments really come from? [p 110, fig 4.7] Problem 2: where should the inter-repeat fragments go? [p 111, fig 4.8, fig 4.9] Inverted repeats: a repeat of the reverse complement [fig 4.10].

Complications: Insufficient coverage. The chance of coverage increases with redundancy (a heuristic: cover 8 times the target length). The chance of covering a gap drops once it has remained uncovered even after many fragments are aligned, so random sampling is not a good solution here.

Complications: Insufficient coverage. With insufficient coverage you get multiple "contigs" rather than one contig. A "t-contig" is one where we expect an overlap of length t between pairs of fragments. Expected number of contigs: [p 112, formula 4.1]. A lower t means fewer contigs (more aligned segments) but a weaker consensus.

Reconstruction: Shortest common superstrings are not the best solution. Compare Fig 4.12 with Fig 4.13 (p 115-116).

Reconstruction: A superstring is to be reconstructed out of the fragments. This is an alignment problem with no end penalty. d_s is the edit-distance score without end penalty: the minimum of the edit distance d over the substrings of the superstring. Fig 4.14 (p 117) shows the best approximate substring match. Note that, as a "distance" rather than a "similarity", a matched character is charged 0, a mismatch 1, and a gap 2. We will write d for d_s.
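The distance without end penalty can be computed with a standard dynamic program in which starting and ending anywhere in S is free. The sketch below uses the slide's charges (match 0, mismatch 1, gap 2); the function name `substring_distance` and its parameters are assumptions for this illustration.

```python
def substring_distance(f, s, mismatch=1, gap=2):
    """Smallest edit distance between f and any substring of s."""
    # Row 0 is all zeros: skipping a prefix of s costs nothing.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        # Column 0: the first i characters of f aligned against nothing.
        curr = [i * gap] + [0] * len(s)
        for j in range(1, len(s) + 1):
            diagonal = prev[j - 1] + (0 if f[i - 1] == s[j - 1] else mismatch)
            curr[j] = min(diagonal, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    # Taking the minimum over the last row makes the suffix of s free too.
    return min(prev)


print(substring_distance("CGT", "ACCGTGC"))   # exact substring
print(substring_distance("CAT", "ACCGTGC"))   # one substitution away from CGT
```

Only two rows are kept, so the space used is proportional to |S| rather than |f||S|.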

Reconstruction: f is an approximate substring of S at error level e if d(f, S) <= e|f|. e = 0 means no error is allowed; e > 0 allows insertion, deletion, and substitution errors. Both f and its reverse complement f- should be tried.

Reconstruction: Problem. Input: a set F of strings (the fragments) and an error level e. Output: a shortest possible string S such that, for all f in F, min(d(f, S), d(f-, S)) <= e|f|.

Reconstruction: Multicontig. How much overlap do we require between strings? Ideally, every column of the layout L, from 1 through |L|, should contain a single character. Fig 4.4 (p 118): t-contigs for t = 3, 2, 1. There is a trade-off between t and the number of t-contigs.

Reconstruction: Multicontig. S is an e-consensus sequence, for 0 <= e <= 1, if the edit distance satisfies d(f, S) <= e|f| for each fragment f. Multicontig problem. Input: a set F, an integer t >= 0, and 0 <= e <= 1. Output: a minimum partition of F in which each part C_i is a t-contig with an e-consensus.

Reconstruction: Overlap Multigraph (OMG). Nodes are the fragments. A directed arc is labeled with the length t of an overlap between its two nodes: the t-suffix of the source equals the t-prefix of the destination. There are arcs between all pairs of nodes, but no self-loops. Fig 4.15 (p 121): example. Length of the superstring created from a path = total length of all fragments involved - total weight (overlap) along the path. We are looking for a maximum-weight Hamiltonian path in this graph, which gives the superstring with maximum overlap.
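The relation between path weight and superstring length can be checked on a small example. This sketch assumes exact overlaps and keeps only the largest overlap per ordered pair; `overlap` and `superstring_from_path` are names made up for this illustration.

```python
def overlap(a, b):
    """Length of the longest exact suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def superstring_from_path(path):
    """Merge the fragments in path order, sharing each overlap once."""
    result = path[0]
    for left, right in zip(path, path[1:]):
        result += right[overlap(left, right):]
    return result


path = ["ACCGT", "CGTGC", "GCTA"]
s = superstring_from_path(path)
total_length = sum(len(f) for f in path)
path_weight = sum(overlap(a, b) for a, b in zip(path, path[1:]))
print(s, len(s), total_length - path_weight)
```

Because the total fragment length is fixed, maximizing the path weight is the same as minimizing the length of the resulting superstring.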

Reconstruction: Fragments that are substrings of other fragments in the set are noise: remove them. Draw the OMG of the substring-free set of fragments. A shortest common superstring always corresponds to a Hamiltonian path in this graph.

Reconstruction: OMG. Thm 4.1 (p 123): if F is substring-free, then for every common superstring S there is a Hamiltonian path P such that S(P) is contained in S. Fragments are strictly ordered along S: the order of their left endpoints equals the order of their right endpoints (otherwise one fragment would be a substring of another). The path follows the same order of fragments in the OMG as in S. S may contain extra material, so S(P) is contained in S.

Reconstruction: OMG. If S is a shortest common superstring, then S must also be contained in S(P), so S = S(P). In other words, a shortest common superstring of the fragment set F corresponds to a Hamiltonian path in the OMG of the substring-free collection F'.

Reconstruction: OMG. Think of an algorithm for weeding out substrings from F. Also weed out multi-edges by keeping only the largest-weight edge between any pair of nodes. If the weight of an edge is below a threshold t, treat it as 0.
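One straightforward answer to the substring exercise is a quadratic scan from longest to shortest fragment. This is only a sketch under simplifying assumptions (exact containment, forward orientation only); the name `remove_substrings` is hypothetical.

```python
def remove_substrings(fragments):
    """Return the fragments that are not substrings of another fragment."""
    # Duplicates are collapsed first; longer fragments are examined first,
    # so anything that can contain f has already been kept or is itself
    # contained in a kept fragment.
    candidates = sorted(set(fragments), key=len, reverse=True)
    kept = []
    for f in candidates:
        if not any(f in g for g in kept):
            kept.append(f)
    return kept


print(sorted(remove_substrings(["ACCGT", "CGT", "CCG", "GTGCA", "ACCGT"])))
```

For large fragment sets a suffix-tree or suffix-array index would avoid the pairwise scan, and a fuller version would also test each fragment's reverse complement.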

Reconstruction: OMG. Greedy algorithm to build a Hamiltonian path (p 125). It collects edges from largest to smallest weight, subject to: (1) no cycle is created (union-find); (2) the indegree of each node is <= 1 (the first node has 0); (3) the outdegree of each node is <= 1 (the last node has 0). [It does not return the Hamiltonian path itself. Can you modify it to do so?] The algorithm is NOT optimal. Example (p 126): it returns weight 3, while the optimal weight is 4.
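The three rules above can be sketched as follows. This is an illustrative version, not the book's pseudocode: it assumes a non-empty list of fragments, exact overlaps, and one arc per ordered pair, and it also walks the chosen arcs to return the path order (one possible answer to the exercise). The names `overlap` and `greedy_path` are assumptions.

```python
def overlap(a, b):
    """Length of the longest exact suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def greedy_path(fragments):
    """Return (order of fragment indices, total overlap) chosen greedily."""
    n = len(fragments)
    # One arc per ordered pair, heaviest first (zero-weight arcs included).
    arcs = sorted(
        ((overlap(fragments[i], fragments[j]), i, j)
         for i in range(n) for j in range(n) if i != j),
        key=lambda arc: -arc[0],
    )
    parent = list(range(n))  # union-find forest over fragment indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    succ, has_in = {}, set()
    for _, i, j in arcs:
        if i in succ or j in has_in:  # outdegree or indegree already 1
            continue
        ri, rj = find(i), find(j)
        if ri == rj:  # the arc would close a cycle
            continue
        parent[ri] = rj
        succ[i] = j
        has_in.add(j)
        if len(succ) == n - 1:
            break
    # Walk the chosen arcs, starting at the node with no incoming arc.
    node = next(k for k in range(n) if k not in has_in)
    order = [node]
    while node in succ:
        node = succ[node]
        order.append(node)
    total = sum(overlap(fragments[a], fragments[b])
                for a, b in zip(order, order[1:]))
    return order, total


print(greedy_path(["GCTA", "ACCGT", "CGTGC"]))
```

Because every ordered pair has an arc, the loop always ends with n - 1 arcs forming a single path, though not necessarily the maximum-weight one.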

Reconstruction: OMG. Subinterval: a fragment that can be embedded within another one in the set. A subinterval-free and repeat-free graph that is connected at level t has a Hamiltonian path that generates the target string.

Reconstruction: OMG. If a repeat exists in the original string, then the graph will have a cycle. False positive: fragments from two different portions of the target have a t-overlap. If a cycle exists in the graph, then there must be a false positive (Thm 4.4, p 129). Proof by contradiction: otherwise the subinterval-free fragments could be totally ordered.

Reconstruction: OMG. If there are no repeats in a subinterval-free graph, then there exists a unique Hamiltonian path. If a cycle exists, it may not come from a repeat.

Reconstruction: OMG. Example 4.6 (p 130): the greedy algorithm finds the wrong string, but the Hamiltonian path approach finds the correct one. Greedy does not care about linkage (it optimizes total overlap, aiming at the shortest common superstring). The Hamiltonian path approach accepts any t-overlap connection: it cares about linkage only.

Parameters in aligning for fragment assembly: Score on a column: traditionally {0, -1, -2} in sum-of-pairs. Entropy: the sum, over each character c of the alphabet plus the space, of -p_c log p_c, where p_c is the probability of c in the column. If all characters are the same, p_c = 1 and entropy = 0. For {a, t, c, g, -} all different, p_c = 1/5 and entropy = log 5. Entropy measures uniformity alone, which makes it a better metric.
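The two entropy values quoted above can be reproduced with a few lines. A sketch only: the slide does not state the base of the logarithm, so the natural logarithm is assumed here, and p_c is estimated as the relative frequency of c in the column.

```python
import math
from collections import Counter


def column_entropy(column):
    """Entropy of one layout column, given as a string of characters."""
    total = len(column)
    probabilities = [count / total for count in Counter(column).values()]
    return sum(-p * math.log(p) for p in probabilities)


print(column_entropy("aaaaa"))               # all the same character
print(column_entropy("atcg-"), math.log(5))  # all different
```

A lower value means a more uniform column, so a good layout is one whose columns have low entropy.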

Parameters in aligning for fragment assembly: Coverage: by how many fragments is each column covered? (Average, min, max.) This is different from the concept of t-overlap. If a column of the target is covered by 0 fragments, then the layout is disconnected. Expecting coverage > 1 for all columns conflicts with the requirement of a subinterval-free collection.
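Per-column coverage is easy to tabulate once each fragment's position in the layout is known. In this sketch a placement is a hypothetical (start column, length) pair, and the function names are chosen for the example.

```python
def coverage(placements, target_length):
    """Number of fragments covering each column of the target."""
    per_column = [0] * target_length
    for start, length in placements:
        for col in range(start, min(start + length, target_length)):
            per_column[col] += 1
    return per_column


def summarize(per_column):
    """Return (min, max, average) coverage over all columns."""
    return (min(per_column), max(per_column),
            sum(per_column) / len(per_column))


cov = coverage([(0, 5), (2, 5), (5, 4)], target_length=9)
print(cov, summarize(cov))
print("disconnected" if 0 in cov else "connected")
```

A zero anywhere in the list marks a column no fragment reaches, which is where the layout falls apart into separate contigs.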

Parameters in aligning for fragment assembly: Coverage is not enough; we also need good linkage. Example: p 133. The Hamiltonian path algorithm provides that.

Steps in assembly: Step 1: Overlap finding. Approximate overlaps (deletions, insertions, and replacements allowed) are found by a semi-global DP algorithm with an appropriate end-gap penalty, run pairwise over the fragments and their reverse complements.

Steps in assembly: Step 2: Construct the overlap graph over (F union F-bar) for the fragment set F, where F-bar holds the reverse complements (after eliminating substrings?). Construct a Hamiltonian path in this graph. Cycles and unbalanced coverage may indicate repeats.

Steps in assembly: Step 3: Fine-tune the multiple alignment to get a consensus target. This can be manual or algorithmic. Examples on p 137-138.