Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya.


Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya Georgia State University David Campo Centers for Disease Control Yury Khudyakov Centers for Disease Control Piotr Berman Pennsylvania State University Ion Mandoiu University of Connecticut

Outline. 454 Sequencing of Virus Genome; Quasispecies Assembly; The Read Graph; Network Flow Formulations; Phasing Flow Problem; Maximum Likelihood; Simulation Results.

HCV Quasispecies. HCV is a small, enveloped, positive-sense single-stranded RNA virus that is responsible for Hepatitis C infection. Over the course of infection, the mutations made in replication are passed down to descendants, eventually producing a family of related variants of the ancestral genome referred to as quasispecies. Due to HCV's high mutation rate, in time the quasispecies in an infected person can become very diverse. A better understanding of HCV quasispecies diversity could potentially lead to new treatments. The ultimate objective of our work is to develop a method of computationally inferring the quasispecies sequences in an HCV-infected individual.

454 Sequencing. 454 Sequencing is one promising technology that may prove useful for quasispecies sequencing. It is a massively parallel pyrosequencing system developed by the biotechnology firm 454 Life Sciences for genome sequencing. The system fragments the source genetic material to be sequenced into pieces called reads. Then, each read is sequenced and the original genome is reconstructed via software. Since this system was originally designed to sequence genetic material from a single organism, the software assembles all of the reads into a single genome. In order to use 454 for quasispecies sequencing, new methods and software are needed to correctly infer the sequences of the quasispecies present in an infected person and their population frequencies directly from 454 read data.

Problem Formulations. Given a collection of 454 reads taken from a quasispecies population of unknown size and distribution, reconstruct the quasispecies population, i.e. the sequences and their frequencies. Figure: original quasispecies, 454 reads, reconstructed quasispecies.

Parsimonious/Min Cost Quasispecies Assembly. Parsimonious: given a set of reads, find the minimum-size set of quasispecies covering all reads. Min cost: given a set of reads with costs on read overlaps, find the minimum-cost set of quasispecies covering all reads. The cost of an assembly should be inversely related to the likelihood that the assembly is the correct one: if Reconstruction 1 has cost C_1, Reconstruction 2 has cost C_2, and C_1 < C_2, then favor Reconstruction 1 over Reconstruction 2.

Read Alignment. Before beginning assembly, we first find the genome offset of each read. We assume that the consensus sequence for the particular strain of HCV that the quasispecies came from is available to us. We simply align each read to the consensus sequence, choosing the offset that yields the best alignment (i.e. the lowest Hamming distance). Because HCV quasispecies don't contain repeats as long as a 454 read, the alignment is both fast and extremely accurate.
Example consensus: GUCUCAUCGGAACAGCAAAACACUUGCCCCGAACGCUAGCGGUUGGGGUACUAUUCAAUGGCUGUAG
Example read: AACAGCAAAACACUUGCUCCGAACGCUAGCGGUUGGGGAACUAU
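The alignment step described above can be sketched as an exhaustive scan over ungapped offsets. This is a minimal illustration, not the authors' implementation: the function names `hamming` and `best_offset` are hypothetical, and ties are broken in favor of the leftmost offset, which the slides do not specify.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_offset(read: str, consensus: str) -> tuple[int, int]:
    """Return (offset, distance) for the ungapped placement with fewest mismatches."""
    if len(read) > len(consensus):
        raise ValueError("read is longer than the consensus")
    best = (0, len(read) + 1)  # sentinel distance larger than any real one
    for off in range(len(consensus) - len(read) + 1):
        d = hamming(read, consensus[off:off + len(read)])
        if d < best[1]:  # strict '<' keeps the leftmost offset on ties (assumption)
            best = (off, d)
    return best


if __name__ == "__main__":
    print(best_offset("ACCT", "TTACGTTT"))
```

The scan costs O(|read| · |consensus|) per read, which is acceptable for a short viral genome; an indexed aligner would be the natural replacement for larger references.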

The Read Graph. The data structure: a directed acyclic graph that contains every possible quasispecies reconstruction. An aligned read can be contained within another aligned read. Find the subset of reads that are not contained within any other read. We call these reads “superreads”; all other reads are “subreads”. The superreads are the vertices of the read graph.
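A small sketch of the superread/subread split follows. It assumes each aligned read is represented as an `(offset, sequence)` pair and that "contained" means the inner read lies within the outer read's span and agrees with it base by base; both the representation and the names `contains` and `split_reads` are illustrative assumptions.

```python
def contains(outer, inner):
    """True if aligned read `inner` lies inside `outer` and matches it exactly."""
    (o_off, o_seq), (i_off, i_seq) = outer, inner
    start = i_off - o_off
    if start < 0 or start + len(i_seq) > len(o_seq):
        return False
    return o_seq[start:start + len(i_seq)] == i_seq


def split_reads(reads):
    """Partition the distinct aligned reads into (superreads, subreads)."""
    unique = sorted(set(reads))  # identical reads collapse to one candidate
    superreads, subreads = [], []
    for r in unique:
        if any(r != other and contains(other, r) for other in unique):
            subreads.append(r)
        else:
            superreads.append(r)
    return superreads, subreads


if __name__ == "__main__":
    reads = [(0, "ACGTAC"), (1, "CGT"), (2, "GTACGG"), (3, "TAG")]
    print(split_reads(reads))
```

The pairwise check is quadratic in the number of distinct reads; sorting by offset and end position would allow an early exit, which is omitted here for clarity.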

Edges of the Read Graph. Put an edge from read X to read Y if X overlaps Y in the alignment and some suffix of X equals some prefix of Y. Example of five successively overlapping reads:
UGGACUAGAUGUGGUGGGUGCUCUCCGGAAUACCUUGGUGGCGGGU
GAUGUGGUGGGUGCUCUCCGGAAUACCUUGGUGGCGGGUUAGAGA
GGGUGCUCUCCGGAAUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCA
CUCCGGAAUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCAAGUGUCA
AUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCAAGUGUCAACGCCUA
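The edge rule can be sketched as follows, again assuming reads are `(offset, sequence)` pairs already placed on the consensus. The helper names are hypothetical, and the sketch requires exact agreement on the overlapping region, which is one reasonable reading of "some suffix of X = some prefix of Y".

```python
def overlap_edge(x, y):
    """True if x starts before y, overlaps it, and agrees with it on the overlap."""
    (x_off, x_seq), (y_off, y_seq) = x, y
    x_end, y_end = x_off + len(x_seq), y_off + len(y_seq)
    # x must start first, y must start inside x, and y must extend past x
    if not (x_off < y_off < x_end < y_end):
        return False
    k = x_end - y_off  # length of the overlapping region
    return x_seq[-k:] == y_seq[:k]


def read_graph_edges(superreads):
    """All directed edges (x, y) between superreads that satisfy the overlap rule."""
    return [(x, y) for x in superreads for y in superreads if overlap_edge(x, y)]


if __name__ == "__main__":
    reads = [(0, "ACGTA"), (2, "GTACC"), (4, "ACCTT"), (3, "TTCCA")]
    for edge in read_graph_edges(reads):
        print(edge)
```

Because every edge goes from a smaller offset to a larger one, the resulting graph is acyclic by construction.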

Quasispecies in the Read Graph. We then add two extra vertices: a new vertex with outgoing edges to all vertices with indegree 0 (the “source”) and a new vertex with incoming edges from all vertices with outdegree 0 (the “sink”). Any path from the source to the sink represents a potential sequence in the quasispecies population! Figure: the read graph with source and sink.
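To make the source/sink construction concrete, the sketch below attaches the two extra vertices and counts source-to-sink paths, i.e. the number of candidate sequences the graph encodes. The labels `"SOURCE"` and `"SINK"` and both function names are placeholders chosen for this illustration.

```python
def add_source_sink(vertices, edges, source="SOURCE", sink="SINK"):
    """Return the edge list extended with source and sink edges."""
    has_in = {v for _, v in edges}
    has_out = {u for u, _ in edges}
    extra = [(source, v) for v in vertices if v not in has_in]
    extra += [(v, sink) for v in vertices if v not in has_out]
    return list(edges) + extra


def count_paths(edges, source="SOURCE", sink="SINK"):
    """Count source-to-sink paths in a DAG by memoized depth-first search."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    memo = {}

    def walk(u):
        if u == sink:
            return 1
        if u not in memo:
            memo[u] = sum(walk(v) for v in succ.get(u, []))
        return memo[u]

    return walk(source)


if __name__ == "__main__":
    full = add_source_sink("abcd", [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
    print(count_paths(full))
```

The path count can grow exponentially with graph size, which is why the later slides select paths through a flow rather than enumerating them.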

Transitive Reduction. Edge u → w logically follows from edges u → v and v → w. Drop edge u → w from consideration: it carries no information, since any quasispecies sequence containing u and w will also contain v. The transitive reduction of a graph is the smallest subgraph that maintains all reachability relationships. The graph is partially closed, so the transitive reduction can be found in O(δ|E|) time, where δ is the read degree.
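The sketch below computes a transitive reduction of a DAG with a plain reachability test per edge. It is a straightforward O(|E| · (|V| + |E|)) illustration and does not implement the O(δ|E|) method mentioned on the slide, which exploits the partial closure of the read graph.

```python
def transitive_reduction(edges):
    """Drop every edge (u, w) of a DAG that is implied by a longer u-to-w path."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)

    def reachable(start, target):
        """Iterative depth-first search from start looking for target."""
        stack, seen = [start], set()
        while stack:
            x = stack.pop()
            if x == target:
                return True
            if x in seen:
                continue
            seen.add(x)
            stack.extend(succ.get(x, ()))
        return False

    kept = []
    for u, w in edges:
        # (u, w) is redundant if w is reachable through another successor of u
        if not any(v != w and reachable(v, w) for v in succ[u]):
            kept.append((u, w))
    return kept


if __name__ == "__main__":
    print(transitive_reduction([(1, 2), (2, 3), (1, 3)]))
```

For a DAG the transitive reduction is unique, so testing each edge against the original graph (rather than the progressively reduced one) gives the same result.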

Estimating Read Frequencies. In general, a superread may be contained in several quasispecies sequences. Thus, each superread has an associated frequency: the sum of the population frequencies of the quasispecies that contain the superread. Although the true read frequencies are unknown to us, we may estimate them by counting the number of subreads contained within each superread. By definition, the read frequencies of the source and sink vertices are 0.
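A counting sketch of this estimate is shown below. Two choices here are assumptions not stated on the slide: each superread's own copies are included in its count, and the raw counts are returned without normalization, since the slides do not say how counts are scaled to frequencies.

```python
def contains(outer, inner):
    """True if aligned read `inner` lies inside `outer` and matches it exactly."""
    (o_off, o_seq), (i_off, i_seq) = outer, inner
    start = i_off - o_off
    if start < 0 or start + len(i_seq) > len(o_seq):
        return False
    return o_seq[start:start + len(i_seq)] == i_seq


def support_counts(superreads, reads):
    """For each superread, count the observed reads (with multiplicity) it contains."""
    return {s: sum(contains(s, r) for r in reads) for s in superreads}


if __name__ == "__main__":
    reads = [(0, "ACGTAC"), (1, "CGT"), (1, "CGT"), (2, "GTACGG"), (2, "GTA")]
    print(support_counts([(0, "ACGTAC"), (2, "GTACGG")], reads))
```

Note that a read contained in two superreads, such as `(2, "GTA")` in the demo, contributes to both counts.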

Probability of a True Overlap. Given N reads over Q sequences, each read with L possible starting positions, the probability that a given position is the starting position b_u of some read u is N/(LQ). Let (u, v) be an edge in the transitive reduction. The probability that b_v − b_u > Δ is proportional to exp(−ΔN/(LQ)), because each of the Δ intervening positions must be free of read starts. The probability of an edge from the source or to the sink is 0. Example: u = GUGGGGGCAGCGGACGUAUGC starts at b_u, v = GACGUAUGCAGAACUCUAGGCA starts at b_v, and Δ is the distance between b_u and b_v.

Network Flow Through Vertices. Replace the vertex for each read r with two vertices r_b and r_e joined by the edge (r_b, r_e), so that the flow through read r becomes the flow on an edge.
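The vertex-splitting construction can be sketched as a small graph transformation. The `(r, "b")` and `(r, "e")` tuples stand in for r_b and r_e, and attaching the read frequency to the internal arc as a lower bound is an assumption made for this illustration, motivated by the inflow requirement stated on the later slides.

```python
def split_vertices(edges, freq, source="SOURCE", sink="SINK"):
    """Return {(tail, head): lower_bound} for the graph with every read vertex split."""
    terminals = (source, sink)

    def begin(v):
        return v if v in terminals else (v, "b")

    def end(v):
        return v if v in terminals else (v, "e")

    arcs = {}
    for v, f in freq.items():
        if v not in terminals:
            arcs[(begin(v), end(v))] = f  # internal arc carries the read's demand
    for u, v in edges:
        arcs[(end(u), begin(v))] = 0.0    # original edges have no lower bound
    return arcs


if __name__ == "__main__":
    edges = [("SOURCE", "r1"), ("r1", "r2"), ("r2", "SINK")]
    for arc, low in split_vertices(edges, {"r1": 0.5, "r2": 0.25}).items():
        print(arc, low)
```

After the split, a standard edge-based flow solver can enforce per-read requirements without any special handling of vertices.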

Network Flows. Observe that the true quasispecies sequences in the read graph can be represented as a flow: each vertex has a frequency proportional to the number and frequencies of the quasispecies that contain its associated superread. When we solve the flow, we require that the inflow into each vertex is at least its frequency.

Min Cost Flow. We define the cost of a flow in the following manner: the flow cost of an edge is that edge's cost multiplied by the amount of flow that traverses the edge, and the cost of a flow through the graph is the sum of all of the edge flow costs. Out of all possible flows that go through all of the vertices in the graph, we seek the flow with the minimum cost. After solving min cost flow for the graph, all of the edges that have flow > 0 are assumed to participate in true quasispecies. The remaining edges can be dropped from the graph.

LP for Min Cost Flow. Although there are fast combinatorial algorithms for solving min cost flow, we opted to solve the flow using a linear program. For each edge e, create a real-valued, nonnegative variable $f_e$ to represent the flow across that edge. Figure: the read graph with source and sink.

Linear Program for Min Cost Quasispecies Assembly. Objective: minimize the sum of cost(e) · $f_e$ over all edges e in the read graph. Subject to, for all vertices v other than the source and sink: (1) the sum of $f_e$ over incoming edges to v equals the sum of $f_e$ over outgoing edges from v; (2) the sum of $f_e$ over incoming edges to v is greater than or equal to the frequency of v.
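Written out symbolically, the program above might look as follows. The notation is introduced here for illustration and does not appear on the slides: $\delta^-(v)$ and $\delta^+(v)$ denote the incoming and outgoing edges of $v$, $s$ and $t$ are the source and sink, and $\mathrm{freq}(v)$ is the estimated read frequency of $v$.

```latex
\begin{align*}
\min_{f}\quad & \sum_{e \in E} \mathrm{cost}(e)\, f_e \\
\text{s.t.}\quad
  & \sum_{e \in \delta^-(v)} f_e \;=\; \sum_{e \in \delta^+(v)} f_e
      && \forall v \in V \setminus \{s, t\} \\
  & \sum_{e \in \delta^-(v)} f_e \;\ge\; \mathrm{freq}(v)
      && \forall v \in V \setminus \{s, t\} \\
  & f_e \;\ge\; 0 && \forall e \in E
\end{align*}
```

Flow conservation is imposed only on internal vertices; since the source and sink have frequency 0 by definition, the demand constraint is vacuous for them.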

Splitting Flow in Quasispecies. Five quasispecies share a common long segment [a, b] and differ to its left and to its right in the value of a SNP. When b − a exceeds the read length, the resulting read graph with network flow has multiple feasible solutions. Figure: positions l, a, b, r along the genome, the multisets L and R of SNP alleles on either side of the common segment, and edge flows f = 2, f = 3, f = 1.

Quasispecies Matching Problem. Given two multisets of haplotypes, L on the segment [l, b] and R on the segment [a, r], such that all haplotypes are indistinguishable on the common segment [a, b] (l < a < b < r) and |L| = |R|, find the matching between the multisets L and R such that the concatenation of the matched haplotypes corresponds to the original quasispecies.

Decomposing the flow into paths. General strategy: ➡ Find a source-to-sink path with positive flow f ➡ Subtract f from all of the edge flows in the path ➡ Repeat until no positive flow remains. How to find paths? ➡ Pick the shortest path ⇒ “most likely quasispecies” ➡ Pick the maximum bandwidth path ⇒ “most frequent quasispecies”. In the example figure, 1→3→5→6 is the shortest path and 1→2→4→6 is the maximum bandwidth path.
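The general strategy can be sketched as a loop that peels off one path at a time. For brevity this version follows an arbitrary positive-flow edge at each vertex instead of choosing the shortest or maximum bandwidth path, and it assumes the input is an acyclic flow that satisfies conservation; the function name and the flow representation `{(u, v): amount}` are illustrative.

```python
def decompose_flow(flow, source, sink, eps=1e-12):
    """Split an acyclic, conserved source-to-sink flow into (path, amount) pairs."""
    flow = dict(flow)  # work on a copy
    paths = []
    while True:
        path, u = [source], source
        while u != sink:
            nxt = next((v for (a, v), f in flow.items() if a == u and f > eps), None)
            if nxt is None:
                break
            path.append(nxt)
            u = nxt
        if u != sink:  # no positive-flow path is left
            return paths
        arcs = list(zip(path, path[1:]))
        amount = min(flow[a] for a in arcs)  # bottleneck of this path
        for a in arcs:
            flow[a] -= amount
        paths.append((path, amount))


if __name__ == "__main__":
    f = {(1, 2): 3, (2, 4): 3, (4, 6): 3, (1, 3): 1, (3, 5): 1, (5, 6): 1}
    print(decompose_flow(f, 1, 6))
```

Each iteration drives at least one edge to zero, so the loop ends after at most |E| paths.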

Finding Max Bandwidth Paths. A single iteration of the Bellman-Ford algorithm gives an efficient method for finding the maximum bandwidth path from the source to the sink:
  Initialize: for each vertex i, W(i) ← 0 and p(i) ← 0; for the source s, W(s) ← +infinity.
  Relax: for each edge (i, j), taken in the order (i, j) < (k, l) if i < k, or i = k and j < l:
    if W(j) < min { W(i), cap(i, j) }, where cap(i, j) is the capacity of (i, j):
      W(j) ← min { W(i), cap(i, j) }
      p(j) ← i
  Return the path p(sink), p(p(sink)), ..., source = 0.
Finding minimum cost paths is simple: just grow a shortest-path tree from the source using costs for weights.
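A runnable sketch of the single-pass relaxation follows. It relies on an assumption that the slide's edge ordering suggests but does not state outright: vertices are numbered 0..n-1 in topological order, so every edge (i, j) has i < j and all edges into a vertex are relaxed before any edge leaving it. The function name and the `cap` dictionary layout are hypothetical.

```python
import math


def max_bandwidth_path(n, cap, source=0, sink=None):
    """Widest source-to-sink path in a DAG whose edges (i, j) all satisfy i < j.

    cap maps (i, j) to the capacity of that edge. Returns (path, width),
    or (None, 0) when the sink cannot be reached with positive width.
    """
    sink = n - 1 if sink is None else sink
    width = [0] * n
    pred = [None] * n
    width[source] = math.inf
    for i, j in sorted(cap):  # lexicographic order, as on the slide
        w = min(width[i], cap[(i, j)])
        if w > width[j]:
            width[j] = w
            pred[j] = i
    if width[sink] == 0:
        return None, 0
    path = [sink]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return path[::-1], width[sink]


if __name__ == "__main__":
    cap = {(0, 1): 5, (1, 3): 4, (3, 5): 6, (0, 2): 2, (2, 4): 9, (4, 5): 9}
    print(max_bandwidth_path(6, cap))
```

On a graph that is not topologically numbered, a single pass is not sufficient and the relaxation would need to be repeated as in the full Bellman-Ford algorithm.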

Maximum Likelihood Choice. After path decomposition, we have a set of candidate quasispecies sequences, but we don’t know what their frequencies are. Given a set of candidate quasispecies and the observed reads, we estimate the frequencies by Expectation Maximization, which alternates between two steps, Expectation (E) and Maximization (M), until convergence. E step: calculates the expected likelihood using the current estimate of the latent variables. M step: computes maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the previous E step.

EM Implementation. Create a bipartite graph with the quasispecies on the left side and the superreads on the right side; put an edge (q, r) if quasispecies q contains superread r. Keep three sets of numbers: for each quasispecies q, its estimated frequency $f_q$; for each superread r, its frequency $n_r$; for each edge (q, r), the value $p_{qr}$. E step: compute $p_{qr} = n_r f_q / \sum_{q' \ni r} f_{q'}$ for every edge, where the sum runs over the quasispecies containing r. M step: compute $f_q = \sum_{r \in q} p_{qr} / \sum_{r} n_r$ for every quasispecies.
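The two update rules can be sketched in a few lines. The sketch assumes a uniform starting estimate, a fixed iteration count in place of a convergence test, and that the E-step denominator sums over the quasispecies that contain the read; the function name `em_frequencies` and the `members` mapping are illustrative.

```python
def em_frequencies(members, n, iters=100):
    """Estimate quasispecies frequencies by EM.

    members: {q: set of superreads contained in quasispecies q}
    n:       {r: observed frequency (or count) of superread r}
    """
    qs = list(members)
    f = {q: 1.0 / len(qs) for q in qs}  # uniform start (assumption)
    total = sum(n.values())
    for _ in range(iters):
        # E step: share each read's weight among the quasispecies containing it
        denom = {r: sum(f[q] for q in qs if r in members[q]) for r in n}
        p = {(q, r): n[r] * f[q] / denom[r]
             for q in qs for r in n
             if r in members[q] and denom[r] > 0}
        # M step: renormalize the weight collected by each quasispecies
        f = {q: sum(p.get((q, r), 0.0) for r in n) / total for q in qs}
    return f


if __name__ == "__main__":
    members = {"q1": {"r1", "r2"}, "q2": {"r2", "r3"}}
    print(em_frequencies(members, {"r1": 3, "r2": 2, "r3": 1}))
```

As long as every superread belongs to at least one candidate, the estimates sum to 1 after each M step.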

Validation. Real quasispecies population, simulated reads. Real data: 44 sequences from the E1E2 region of HCV. Three populations consisting of 10 sequences each: uniformly distributed frequencies (the “uniform” population), geometrically distributed frequencies (the “geometric” population), and a highly skewed distribution (the “skewed” population).

Instance Generation. Inputs: a quasispecies population Q, the number of reads desired n, and the read length mean μ and variance σ².
  S ← ∅
  While |S| < n:
    Randomly select a quasispecies q of length l_q according to the population frequency distribution
    Generate a read length l_r using the normal distribution (μ, σ²)
    Generate an offset o using the uniform distribution on [0, l_q − l_r]
    Extract the substring of length l_r starting at position o from quasispecies q and add it to S
  Return: S
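A minimal simulation sketch of this procedure is given below. Several details are assumptions made for the illustration: the population is a `{sequence: frequency}` dictionary, the sampled read length is rounded and clamped to [1, l_q] (the slide does not say how out-of-range lengths are handled), reads are kept in a list so duplicates survive, and the `seed` parameter is added only for reproducibility.

```python
import random


def simulate_reads(population, n, mu, sigma2, seed=None):
    """Draw n reads from a quasispecies population given as {sequence: frequency}."""
    rng = random.Random(seed)
    seqs = list(population)
    weights = [population[s] for s in seqs]
    sigma = sigma2 ** 0.5  # random.gauss expects a standard deviation
    reads = []
    while len(reads) < n:
        q = rng.choices(seqs, weights=weights, k=1)[0]
        length = round(rng.gauss(mu, sigma))
        length = max(1, min(length, len(q)))  # clamp to a valid length (assumption)
        offset = rng.randint(0, len(q) - length)  # uniform on [0, l_q - l_r]
        reads.append(q[offset:offset + length])
    return reads


if __name__ == "__main__":
    pop = {"ACGTACGTAC": 0.7, "ACGTTCGTAC": 0.3}
    print(simulate_reads(pop, 5, mu=5, sigma2=1.0, seed=1))
```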

Quality Measures. Percentage of correctly predicted sequences: takes into account frequencies (symmetric difference between multisets). “Switching error”: a generalization of “switching error” from the haplotype phasing community, defined as the average number of times each path corresponding to a predicted quasispecies switches between paths in the read graph corresponding to real quasispecies.

Initial Results. Table: % correctly predicted, by read length and number of reads, for the uniform, geometric, and skewed populations (numeric values missing from the transcript).

Results – Geometric Instances

Results – Uniform Instances