# Assembling Algorithms and Techniques Upmanyu Misra Computational Issues in Molecular Biology CSE 397-497.


## Motivation

- Direct sequencing of full stretches of DNA base pairs is not possible (as of now).
- Hence, we “sample” DNA stretches by breaking them into fragments.
- Sequencing of the fragments is done by previously discussed algorithms.
- The fragments must then be “assembled” back to make sense of the full DNA stretches.

## Approach

- Primitive (root) techniques
- Models used
- Algorithms: Greedy / Multicontig
- Heuristics
- Greedy Path Merge heuristics
- Basics, examples, and definitions wherever required

## Refreshing Basics

Consider 4 strings (fragments) to be assembled:

- CTAGG
- AGGTA
- GAACCT
- TAAC

We know that the final sequence should be around 10 base pairs long.

## How Assembling works…

    _ _ C T A G G _ _
    _ _ _ _ A G G T A
    G A C T A _ _ _ _
    _ _ _ T A G G _ _
    G A C T A G G T A   CONSENSUS
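The layout above can be turned into a consensus by taking the most frequent base in every column. The following is a minimal sketch, assuming each fragment is described by a hypothetical `(offset, fragment)` pair read off the layout and that every column is covered by at least one fragment; the function name `consensus` is illustrative only.

```python
from collections import Counter

def consensus(layout):
    """layout: list of (offset, fragment) pairs; returns the column-wise majority string."""
    width = max(offset + len(frag) for offset, frag in layout)
    columns = [Counter() for _ in range(width)]
    for offset, frag in layout:
        for i, base in enumerate(frag):
            columns[offset + i][base] += 1
    # most_common(1) yields the majority base of a column (assumes no empty column)
    return "".join(col.most_common(1)[0][0] for col in columns)

# Offsets read off the layout on the slide
layout = [(2, "CTAGG"), (4, "AGGTA"), (0, "GACTA"), (3, "TAGG")]
print(consensus(layout))  # GACTAGGTA
```

Because every column in this small example is coherent, the majority vote is unanimous; with sequencing errors the vote would decide between conflicting bases.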

## Trivial, eh?

- The consensus sequence was easily achieved.
- It is very close to our “goal” of finding a 10 base pair chain.
- A combination was found such that each column was coherent.
- The consensus contained each fragment as an exact substring.

## Size does matter…

Errors creep in when we consider the real-life scenario:

- Size of the sequence increases
- Size of the fragments increases
- Noise
- Loss of data
- Larger consensus goals

## Pandora’s Box

Errors might be of the following types (and many more):

- Base-call errors: insertion, deletion, substitution
- Chimeras
- Unknown orientations
- Repeated regions
- Lack of coverage

## Primitive Solutions

- Assembling for shotgun fragments
- Directed sequencing or walking
- Dual-end sequencing
- Non-shotgun (based on experimental data)
- Sequencing by hybridization

Good news: they provide headway into the assembly problem.

Bad news: none of them, by itself, is good enough, not even close.

## Usable Models

Real-world problems:

- Contamination
- Chimeras
- Experiment errors
- Orientation errors
- Repeated sections
- Lack of coverage

Models (NP-hard):

- Shortest Common Superstring
- Reconstruction
- Multicontig

## Algorithms

- Again, these algorithms are of little value by themselves.
- Good supporting heuristics help to make them usable.
- Two different approaches: GREEDY and MULTICONTIG (acyclic overlap graphs).

## Getting Started

Linking overlaps to graphs… Why?

- Superstrings can be mapped onto graphs.
- We (the Computer Science community!) come across graphs frequently.
- Graphs are comfortable (logically, psychologically, cognitively).

## Definitions

[def.] The Overlap Multigraph OM(F) of a collection F is the directed, weighted multigraph defined as follows.

- The set V of nodes is the set F.
- A directed edge from a ∈ F to a different fragment b ∈ F with weight t ≥ 0 exists if the suffix of a with t characters is a prefix of b: suffix(a, t) = prefix(b, t).
- k is the Killer Agent, which deletes the character that follows it, such that kATCG = TCG, |k| = -1.

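## Hence...

Consider F = {a, b, c, d}:

- a = TACGA
- b = ACCC
- c = CTAAAG
- d = GACA

(Figure: overlap graph on the nodes a, b, c, d. Its positive-weight edges carry the weights 2, 1, 1, 1, 1; the edge of weight t = 2 goes from TACGA to GACA.)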
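The definition of OM(F) can be checked mechanically on this four-fragment example. The sketch below is an illustration under stated assumptions: the helper names `overlap_lengths` and `overlap_multigraph` are hypothetical, and edges are stored as plain `(a, b, t)` tuples rather than in any particular graph library.

```python
def overlap_lengths(a, b):
    """All t >= 0 such that the last t characters of a equal the first t of b."""
    return [t for t in range(min(len(a), len(b)) + 1)
            if a[len(a) - t:] == b[:t]]

def overlap_multigraph(fragments):
    """Edges (a, b, t) for every ordered pair of distinct fragments."""
    return [(a, b, t)
            for a in fragments for b in fragments if a != b
            for t in overlap_lengths(a, b)]

F = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
names = {seq: name for name, seq in F.items()}
edges = overlap_multigraph(list(F.values()))

# Print only the positive-weight edges; every ordered pair also has a t = 0 edge
for a, b, t in edges:
    if t > 0:
        print(f"{names[a]} -> {names[b]}  t = {t}")
```

The zero-weight edges are kept in `edges` on purpose, since a Hamiltonian path may need them when no positive overlap exists.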

## Properties of the Graph

- A node must not have an edge to itself (no self-loops).
- We keep the ‘zero’ weight edges too (hence at least n(n-1) edges).

We denote the path P1 = dbc, laid out as:

    G A C A _ _ _ _ _ _ _ _
    _ _ _ A C C C _ _ _ _ _
    _ _ _ _ _ _ C T A A A G

where a = TACGA, b = ACCC, c = CTAAAG, d = GACA.

## Properties of the Graph

- If A is the set of fragments involved in P, we will have exactly $|A| - 1$ edges in P.
- The common superstring will be called S(P).
- The relationship between the total length of A, the path’s weight, and the superstring length is $\|A\| = w(P) + |S(P)|$.

## Picture is worth ….

    G A C A _ _ _ _ _ _ _ _
    _ _ _ A C C C _ _ _ _ _
    _ _ _ _ _ _ C T A A A G

S(P) = G A C A C C C T A A A G, so $|S(P)| = 12$ and $w(P) = 2$, giving $w(P) + |S(P)| = 2 + 12 = 14 = \|A\|$.
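The arithmetic on this slide can be reproduced in a few lines. This is a small sketch, assuming the path is given as a list of fragments together with the overlap weight of each consecutive edge; `superstring_of_path` is a hypothetical helper name.

```python
def superstring_of_path(path, weights):
    """Glue consecutive fragments, dropping the first t characters of each successor."""
    s = path[0]
    for nxt, t in zip(path[1:], weights):
        s += nxt[t:]
    return s

path = ["GACA", "ACCC", "CTAAAG"]    # P = d b c
weights = [1, 1]                     # overlaps on the edges d -> b and b -> c
S = superstring_of_path(path, weights)
total = sum(len(f) for f in path)    # ||A||

print(S, len(S), sum(weights), total)  # GACACCCTAAAG 12 2 14
```

Each overlapped character is counted once in S(P) but twice in the total fragment length, which is why the path weight accounts exactly for the difference.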

## Computing length

Mathematical representation: rearranging the relationship above, we may say $|S(P)| = \|A\| - w(P)$.

## Hamiltonian Path

- Each path we traverse gives a common superstring of the fragments involved.
- If there is a path that passes through each vertex, we have a common superstring containing all fragments of F. Hence $|S(P)| = \|F\| - w(P)$.
- Such a path is the Hamiltonian path we have been talking about.
- Observation: since $\|F\|$ is constant, minimizing $|S(P)|$ is equivalent to maximizing $w(P)$.
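For a handful of fragments, the equivalence between minimizing the superstring length and maximizing the path weight can be checked by brute force. The sketch below simply tries every ordering, so it is only usable for tiny inputs; it assumes a substring-free collection and uses only the largest overlap between each pair. The names `max_overlap`, `best_hamiltonian_path`, and `spell` are illustrative.

```python
from itertools import permutations

def max_overlap(a, b):
    """Largest t such that the last t characters of a equal the first t of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def best_hamiltonian_path(fragments):
    """Exhaustive search over all orderings; returns (weight, order)."""
    best = None
    for order in permutations(fragments):
        w = sum(max_overlap(x, y) for x, y in zip(order, order[1:]))
        if best is None or w > best[0]:
            best = (w, order)
    return best

def spell(order):
    """Superstring S(P) obtained by following the path in the given order."""
    s = order[0]
    for prev, nxt in zip(order, order[1:]):
        s += nxt[max_overlap(prev, nxt):]
    return s

F = ["TACGA", "ACCC", "CTAAAG", "GACA"]
weight, order = best_hamiltonian_path(F)
S = spell(order)
print(weight, order, S, len(S))
```

The number of orderings grows factorially, which is consistent with the hardness results mentioned elsewhere in these slides.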

## Hamiltonian Graph

http://mathworld.wolfram.com/HamiltonianGraph.html

## Backtracking

- Every path corresponds to a superstring.
- Is the converse true? NO.
- Consider shortest superstrings… YES, they will always correspond to a path.

## Important results

[def.] A collection of fragments is substring-free if there are no two distinct strings in the collection such that one of them is a substring of the other.

For example, F = {GAA, CTGA, AGTCACGCAA} is substring-free, but F = {ACTGAA, TGCA, CTGA, ACGA} is not, because CTGA is a substring of ACTGAA.
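The definition translates directly into a pairwise check. A minimal sketch, with the hypothetical helper name `is_substring_free`, applied to the two example collections:

```python
def is_substring_free(fragments):
    """True if no fragment is a substring of a different fragment."""
    return not any(a != b and a in b for a in fragments for b in fragments)

print(is_substring_free(["GAA", "CTGA", "AGTCACGCAA"]))      # True
print(is_substring_free(["ACTGAA", "TGCA", "CTGA", "ACGA"]))  # False
```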

## Important results

THEOREM 1: Let F be a substring-free collection. Then for every common superstring S of F there is a Hamiltonian path P in OM(F) such that S(P) is a subsequence of S.

Example: F = {tag, cat, gac}

- S(P) = tagacat / tagcatgac / catagac / …
- S = any superstring (with or without faulty characters)

## Important results

(Figure: overlap graph on the nodes TAG, CAT, GAC, with edge weights 0 and 1.)

S(P) = tagacat / tagcatgac / catagac / …

## Important results

[def.] F dominates another collection G if every element of G is a substring of some element of F.

For example, F = {ATGA, CATAGAA, GTACTAA} dominates G = {TAG, CAT, TACT}.

## Important results

[def.] F and G are equivalent if F dominates G and G dominates F.

For example, F = {TGA, CTTGAA, ACTTGAA} and G = {TTGA, ACTTGAA} are equivalent.
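Both definitions can be expressed as short predicates. The following sketch uses the hypothetical names `dominates` and `equivalent` and checks them against the example collections from these two slides.

```python
def dominates(F, G):
    """F dominates G if every element of G is a substring of some element of F."""
    return all(any(g in f for f in F) for g in G)

def equivalent(F, G):
    """F and G are equivalent if each dominates the other."""
    return dominates(F, G) and dominates(G, F)

print(dominates(["ATGA", "CATAGAA", "GTACTAA"], ["TAG", "CAT", "TACT"]))  # True
print(equivalent(["TGA", "CTTGAA", "ACTTGAA"], ["TTGA", "ACTTGAA"]))     # True
```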

## Important results

LEMMA: Two equivalent substring-free collections are identical.

F = {TGA, CTTGAA, ACTTGAA} and G = {TTGA, ACTTGAA} are equivalent; removing the strings that are substrings of other elements leaves F = {ACTTGAA} and G = {ACTTGAA}.

## Important results

THEOREM 2: Let F be a collection of strings. There is a unique substring-free collection G equivalent to F.

This result implies that if we are looking for common superstrings, we only have to look at substring-free collections, since every collection has one equivalent to it. Hence, removing all strings from F that are substrings of other elements of F serves our purpose.

(Diagram labels: SCS, substring-free, Hamiltonian path.)
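The reduction described here, dropping every string that is a substring of another element, might be sketched as follows. The function name `substring_free_core` is an assumption made for this illustration; duplicates are collapsed first by using a set.

```python
def substring_free_core(F):
    """Keep only the strings that are not substrings of a different string in F."""
    unique = set(F)
    return {f for f in unique if not any(f != g and f in g for g in unique)}

F = ["TGA", "CTTGAA", "ACTTGAA"]
G = ["TTGA", "ACTTGAA"]
print(substring_free_core(F))  # {'ACTTGAA'}
print(substring_free_core(G))  # {'ACTTGAA'}
```

Both equivalent collections reduce to the same substring-free collection, as the lemma and Theorem 2 predict.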

## GREEDY ALGORITHM

- Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
- Since only the heaviest edges are required, we may prune the weaker ones, saving time and space.
- This is a greedy attempt.

## GREEDY ALGORITHM

Let's look at an example, considering the previous graph.

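## Hence...

Consider F = {a, b, c, d}: a = TACGA, b = ACCC, c = CTAAAG, d = GACA.

(Figure: the overlap graph on the nodes TACGA, ACCC, CTAAAG, GACA, with positive edge weights 2, 1, 1, 1, 1; the edge of weight t = 2 goes from TACGA to GACA.)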

## IMPLEMENTATION

- Graphs are easier to understand, but not necessary.
- We can implement the greedy algorithm by repeatedly applying the following procedure until only one fragment remains.

## IMPLEMENTATION

1. Take the pair (f, g) of fragments with the largest overlap, say T.
2. Remove both fragments from F and add f k^T g to F (f followed by g with its first T characters removed).

Assumption: F is substring-free.
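The two-step procedure can be sketched as a loop that keeps merging the pair with the largest overlap. This is an illustrative sketch, not a tuned implementation: it recomputes all pairwise overlaps in every round, breaks ties by taking the first pair found, and assumes the input is substring-free. The names `max_overlap` and `greedy_superstring` are hypothetical.

```python
def max_overlap(a, b):
    """Largest t such that the last t characters of a equal the first t of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_superstring(fragments):
    frags = list(fragments)
    while len(frags) > 1:
        # Step 1: ordered pair (i, j) with the largest overlap t
        t, i, j = max(
            ((max_overlap(frags[i], frags[j]), i, j)
             for i in range(len(frags)) for j in range(len(frags)) if i != j),
            key=lambda cand: cand[0],
        )
        # Step 2: remove both fragments and add the merged string
        merged = frags[i] + frags[j][t:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags[0]

F = ["TACGA", "ACCC", "CTAAAG", "GACA"]
S = greedy_superstring(F)
print(S, len(S))
```

On this collection the greedy result happens to have the shortest possible length, but that is not guaranteed in general.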

## Greed seldom pays

Consider F = {a, b, c}:

- a = ATGC
- b = TGCAT
- c = GCC

(Figure: overlap graph on a, b, c with edge weights 3, 2, 2, 0.)

- The greedy algorithm follows the path a → b → c, giving a total weight of 3, including a zero-weight edge.
- The best path would have been b → a → c, giving a weight of 4.
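The gap between the greedy path and the best path can be confirmed numerically. The sketch below assumes the same largest-overlap edge weights as the figure and compares the greedy ordering with an exhaustive search over all orderings; `max_overlap` and `path_weight` are hypothetical helper names.

```python
from itertools import permutations

def max_overlap(a, b):
    """Largest t such that the last t characters of a equal the first t of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def path_weight(order):
    """Sum of the overlaps along consecutive fragments of the ordering."""
    return sum(max_overlap(x, y) for x, y in zip(order, order[1:]))

a, b, c = "ATGC", "TGCAT", "GCC"
greedy_order = (a, b, c)                       # heaviest edge a -> b first, then b -> c
best_order = max(permutations((a, b, c)), key=path_weight)
total = len(a) + len(b) + len(c)               # ||F|| = 12

print(path_weight(greedy_order), total - path_weight(greedy_order))  # 3 9
print(path_weight(best_order), total - path_weight(best_order))      # 4 8
```

One extra unit of path weight translates into a superstring that is one character shorter (TGCATGCC instead of ATGCATGCC).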

## HEURISTICS

Motivation:

- Algorithms like Greedy do not guarantee optimal solutions.
- Solving Shortest Common Superstrings through Hamiltonian paths is NP-complete.
- The problem is close to the multiple alignment problem.

## HEURISTICS

The following heuristics mostly aim at solving two major problems:

1. Fragments can participate with either the direct or the reverse-complemented sequence.
2. Fragments themselves are usually much shorter than the alignment.
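The first problem means that an assembler has to consider two candidate strings per fragment. A minimal sketch of generating both orientations, assuming plain uppercase A/C/G/T input; the names `reverse_complement` and `orientations` are illustrative.

```python
# Translation table for the standard base pairing A-T and C-G
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def orientations(fragment):
    """Both strings a fragment may contribute to a layout."""
    return {fragment, reverse_complement(fragment)}

print(reverse_complement("TACGA"))  # TCGTA
print(sorted(orientations("GACA")))
```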

## HEURISTICS

- Point 1 calls for discriminating between the treatment of different types of gaps (internal, external) to bind the fragment characters.
- Point 2 urges us to consider other criteria, besides the score, to assess the quality of an alignment.

## Assessing Criteria

SCORING

- A measure of the participation of each aligned column in a multiple alignment.
- Entropy may be used for that.
- Entropy measures the uniformity of the alignment.
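One way to read the entropy suggestion is to compute the Shannon entropy of the characters in each column: a perfectly uniform column scores 0, and mixed columns score higher. The slides do not fix a formula, so the choice of Shannon entropy in bits, the function name `column_entropy`, and the three-row alignment below are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    n = len(column)
    return sum((c / n) * log2(n / c) for c in Counter(column).values())

# Hypothetical alignment rows of equal length
rows = ["ACGT", "ACGA", "ACCT"]
columns = ["".join(col) for col in zip(*rows)]
for col in columns:
    print(col, round(column_entropy(col), 3))
```

A layout score could then be taken as the sum of the column entropies, with lower totals indicating more coherent columns.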

## Assessing Criteria

COVERAGE

- A fragment covers a column i if it participates in this column either with a character or with an internal space.
- Minimum, maximum, and mean coverage can be calculated for a layout.
- The lower the coverage, the weaker the connection and the greater the independence.
- Coverage bolsters or weakens the notion of a consensus sequence.
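Coverage statistics are easy to compute once the layout is known. The sketch below reuses the four-fragment layout from the earlier assembly example and assumes each fragment is given as a hypothetical `(offset, fragment)` pair, with any internal spaces already included in the fragment string so that they count as covered columns.

```python
def coverage(layout):
    """Number of fragments covering each column of the layout."""
    width = max(offset + len(frag) for offset, frag in layout)
    cov = [0] * width
    for offset, frag in layout:
        for i in range(offset, offset + len(frag)):
            cov[i] += 1
    return cov

layout = [(2, "CTAGG"), (4, "AGGTA"), (0, "GACTA"), (3, "TAGG")]
cov = coverage(layout)
print(cov)                                   # [1, 1, 2, 3, 4, 3, 3, 1, 1]
print(min(cov), max(cov), sum(cov) / len(cov))
```

The columns at both ends are covered by a single fragment, so the consensus there rests on one read only.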

## Assessing Criteria

LINKAGE

- A measure of the way individual fragments are linked to each other in the layout.
- Fragments should have overlapping ends to show some linkage.

## Summing up

For practical implementation we may divide the process into three phases:

1. Finding overlaps
2. Building a layout
3. Computing the consensus

## Greedy Path Merge Algorithm

- Clone-by-Clone approach (Human Genome Project)
- Whole Genome Shotgun (Celera Genomics)
- Preliminary data
- Heuristic
- Greedy Path Merge

## Greedy Path Merge Algorithm

HGP’s clone-by-clone approach:

- Constructs a tiling of the genome by overlapping pieces (~150k bp).
- Concentrates on determining the sequence of each such piece.
- The pieces are called BAC clones or simply BACs (Bacterial Artificial Chromosomes).
- Each BAC is randomly broken into many smaller fragments that are cloned and sequenced.
- The fragments are assembled, resulting in contigs.

## Greedy Path Merge Algorithm

WGS strategy:

- The whole genome is randomly broken into smaller pieces that are sequenced.
- Due to the size of the genome, a pure overlap-based approach is not feasible.
- Additionally, Celera uses “mate-pairs”. These are fragments with a known relative distance and a standard deviation (a sort of experimental data from sampling and sequencing large chunks of DNA).

## Greedy Path Merge Algorithm

[def.] Bactigs are pieces of DNA sequence, obtained by shotgun sequencing and assembly, that come from a common source region of ~150k bp of contiguous DNA of the human genome.

BACs start as phase-0 BACs, which means several bactigs of small size. They evolve up to phase-3 BACs, each of which is one bactig fully representing its source sequence.

## Greedy Path Merge Algorithm

This approach takes the BACs and Celera’s fragments and primarily aims at increasing the level of assembly of the BACs using the information given by fragments and mate links, hence evolving phase-1/2 BACs into phase-3 BACs.

## Greedy Path Merge Algorithm

INITIAL BACTIG GRAPH

- A BAC is a set of bactigs, $B = \{B_1, B_2, \ldots, B_n\}$.
- A fragment f hits (or is embedded in) a bactig $B_i$ if it (or its reverse complement) aligns with the bactig with a high density.
- The bactig graph is a weighted, undirected multigraph without self-loops.

(Figure label: Mate edge.)

## Greedy Path Merge Algorithm

Functions performed on the initial bactig graph:

- Edge bundling: for more than one mate pair
- Transitive reduction: transitively reducing long mate edges

Hence the final bactig graph is the initial one with edge bundling and transitive reduction performed on it.

## Greedy Path Merge Algorithm

BACTIG ORDERING PROBLEM

- Given a bactig graph G, find an ordering of G that maximizes the weight of the valid (happy) mate edges.
- The problem is NP-complete.
- From here on we proceed almost exactly as in the normal greedy algorithm.

## Greedy Path Merge Algorithm

- Throughout the implementation it is maintained that every node is adjacent to at most two selected edges. These edges form the selected paths.
- The ordering of bactigs induced by such a path is called a scaffolding of the bactigs.
- The deviation from the normal greedy algorithm is that it can introduce new edges to the bactig graph, which are called inferred edges.
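The invariant that every node touches at most two selected edges can be illustrated with a generic greedy edge selection. The sketch below is not the Greedy Path Merge algorithm itself: it ignores orientations, mate-pair distances, the happiness test, and inferred edges, and only shows heaviest-first selection with a degree limit and a union-find cycle check. The graph, its weights, and the name `greedy_paths` are hypothetical.

```python
def greedy_paths(nodes, edges):
    """edges: list of (weight, u, v). Returns selected edges forming node-disjoint paths."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = {n: 0 for n in nodes}
    selected = []
    for w, u, v in sorted(edges, key=lambda e: e[0], reverse=True):
        # Accept an edge only if both ends still have a free slot and no cycle is closed
        if degree[u] < 2 and degree[v] < 2 and find(u) != find(v):
            parent[find(u)] = find(v)
            degree[u] += 1
            degree[v] += 1
            selected.append((w, u, v))
    return selected

# Hypothetical bactig graph with bundled mate-edge weights
nodes = ["B1", "B2", "B3", "B4"]
edges = [(5, "B1", "B2"), (4, "B2", "B3"), (3, "B1", "B3"),
         (2, "B3", "B4"), (1, "B2", "B4")]
print(greedy_paths(nodes, edges))
```

Here the selected edges form the single path B1, B2, B3, B4; the edge B1 to B3 is rejected because it would close a cycle, and B2 to B4 because B2 already has two selected edges.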

## Greedy Path Merge Algorithm

- Given a bactig graph G, the output of the algorithm is a node-disjoint covering of G by selected paths, each one defining an ordering of the bactigs whose edges it covers.
- The algorithm runs in $O(mn + m^2)$ time, where m is the number of mate-pair edges and n is the number of bactig edges.

THANK YOU
