
1 Bioinformatics Algorithms and Data Structures
Chapter: Multiple Alignment. Lecturer: Dr. Rose. Slides by: Dr. Rose. February 28, 2003

2 Sum-of-Pairs Defn. The sum-of-pairs (SP) score of a multiple alignment is the sum of the scores of all the pairwise alignments it induces. From the previous example:
1 A A T - G G T T T
2 A A - C G T T A T
3 T A T C G - A A T
SP = 4 + 5 + 4 = 13
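The SP score above can be checked mechanically. A minimal sketch in Python, assuming the unit costs that reproduce the example (match = 0, mismatch = 1, character vs. space = 1, space vs. space = 0):

```python
# Sum-of-pairs score of a multiple alignment under simple unit costs.
from itertools import combinations

def pair_cost(a, b, match=0, mismatch=1, space=1):
    """Cost of one aligned character pair; '-' denotes a space."""
    if a == '-' and b == '-':
        return 0
    if a == '-' or b == '-':
        return space
    return match if a == b else mismatch

def sp_score(alignment):
    """Sum of the scores of all pairwise alignments induced by the rows."""
    return sum(pair_cost(a, b)
               for r1, r2 in combinations(alignment, 2)
               for a, b in zip(r1, r2))

alignment = ["AAT-GGTTT", "AA-CGTTAT", "TATCG-AAT"]
print(sp_score(alignment))  # 13
```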

3 Sum-of-Pairs Q: What theoretical justification is there for adopting the SP score? Wait for response... A: None. Or rather, none more than for any other multiple alignment scoring scheme. In practice it is a good heuristic and is popular. Q: How can we compute a global alignment M with minimum sum-of-pairs score? A: Why, dynamic programming, of course!

4 Sum-of-Pairs Assuming that we want to align k strings
Q: What is the time complexity of the DP solution? A: Θ(n^k); moreover, exact SP alignment has been shown to be NP-complete. Q: So what should we do? A: Choose a small k. In practice, the NP-completeness of a problem often does not mean that the sky is falling.

5 Sum-of-Pairs Q: How will k affect the recurrence relation?
The recurrence relation for k = 3 is:
D(i, j, k) = min[
  D(i-1, j-1, k-1) + ?,
  D(i-1, j-1, k) + ?,
  D(i-1, j, k-1) + ?,
  D(i, j-1, k-1) + ?,
  D(i-1, j, k) + ?,
  D(i, j-1, k) + ?,
  D(i, j, k-1) + ? ]

6 Sum-of-Pairs Let’s consider each term of the recurrence in turn:
D(i-1, j-1, k-1) is the diagonal cell in all three dimensions. Q: What should be the SP transition cost for D(i-1, j-1, k-1) → D(i, j, k)? Recall that for k = 2, if S1(i) = S2(j) the cost is the match cost; otherwise S1(i) ≠ S2(j) and we incur the mismatch cost. A: the sum of the pairwise character comparisons, i.e., for the index pairs (i, j), (j, k), and (i, k).

7 Sum-of-Pairs Let m(i, j) denote the pairwise character match function defined as:
m(i, j) = matchCost if the characters match
m(i, j) = mismatchCost if the characters mismatch
(Here m(i, j) compares S1(i) with S2(j), m(j, k) compares S2(j) with S3(k), and m(i, k) compares S1(i) with S3(k).) Then the SP transition cost for D(i-1, j-1, k-1) → D(i, j, k) is m(i, j) + m(j, k) + m(i, k). Hence the term cost is: D(i-1, j-1, k-1) + m(i, j) + m(j, k) + m(i, k)

8 Sum-of-Pairs The next term:
D(i-1, j-1, k) is the diagonal cell in the first two dimensions. Q: What should be the SP transition cost for D(i-1, j-1, k) → D(i, j, k)? We have two types of cases to consider:
The pairwise diagonal case: (i-1, j-1) → (i, j)
The two pairwise space-insertion cases: (i-1, k) → (i, k) and (j-1, k) → (j, k)

9 Sum-of-Pairs The cost is the sum of the pairwise match and space-insertion costs: m(i, j) for (i-1, j-1) → (i, j), spacecost for (i-1, k) → (i, k), and spacecost for (j-1, k) → (j, k). Then the SP transition cost for D(i-1, j-1, k) → D(i, j, k) is m(i, j) + 2 * spacecost. Hence the term cost is: D(i-1, j-1, k) + m(i, j) + 2 * spacecost

10 Sum-of-Pairs Similarly, the third and fourth term costs are:
D(i-1, j, k-1) + m(i, k) + 2 * spacecost, D(i, j-1, k-1) + m(j, k) + 2 * spacecost
Note the similarity among the fifth, sixth, and seventh terms: D(i-1, j, k) + ?, D(i, j-1, k) + ?, D(i, j, k-1) + ?
Q: What should be the cost for transitions from them?

11 Sum-of-Pairs For D(i-1, j, k) we have two types of cases to consider:
The pairwise no-change case: (j, k) remains (j, k)
The two pairwise space-insertion cases: (i-1, j) → (i, j) and (i-1, k) → (i, k)
Then the SP transition cost for D(i-1, j, k) → D(i, j, k) is 2 * spacecost. Hence the term cost is: D(i-1, j, k) + 2 * spacecost

12 Sum-of-Pairs Similarly, the sixth and seventh term costs are:
D(i, j-1, k) + 2 * spacecost, D(i, j, k-1) + 2 * spacecost
Hence D(i, j, k) = min[
  D(i-1, j-1, k-1) + m(i, j) + m(j, k) + m(i, k),
  D(i-1, j-1, k) + m(i, j) + 2 * spacecost,
  D(i-1, j, k-1) + m(i, k) + 2 * spacecost,
  D(i, j-1, k-1) + m(j, k) + 2 * spacecost,
  D(i-1, j, k) + 2 * spacecost,
  D(i, j-1, k) + 2 * spacecost,
  D(i, j, k-1) + 2 * spacecost ]

13 Sum-of-Pairs Q: What about the boundary cells on the 3 faces of the table? D(i, j, 0), D(i, 0, k), D(0, j, k) Observation: Each case degenerates into the familiar two-string alignment distance + space costs for the empty string argument. Approach: represent these cases in terms of pair-wise distance + space costs.

14 Sum-of-Pairs Let D1,2(i, j) denote the pairwise distance between S1[1..i] and S2[1..j]. D1,3(i, k) and D2,3(j, k) are defined analogously. Consider D(i, j, 0): D(i, j, 0) = D1,2(i, j) + ? * spaceCost. Q: What is the space cost, i.e., how many spaces? A: i for S1 and j for S2, hence: D(i, j, 0) = D1,2(i, j) + (i + j) * spaceCost

15 Sum-of-Pairs By this argument, the boundary cells are given by:
D(i, j, 0) = D1,2(i, j) + (i + j) * spaceCost
D(i, 0, k) = D1,3(i, k) + (i + k) * spaceCost
D(0, j, k) = D2,3(j, k) + (j + k) * spaceCost
D(0, 0, 0) = 0
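Putting the recurrence and these boundary faces together, here is a minimal sketch of the three-string SP DP, assuming the unit costs from the earlier example (match = 0, mismatch = 1, space = 1); pairwise_dist is an ordinary two-string edit-distance DP used only to fill the boundary faces.

```python
def pairwise_dist(A, B, match=0, mismatch=1, space=1):
    """Standard two-string edit-distance DP table."""
    n, m = len(A), len(B)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                D[i][j] = (i + j) * space
            else:
                sub = match if A[i - 1] == B[j - 1] else mismatch
                D[i][j] = min(D[i - 1][j - 1] + sub,
                              D[i - 1][j] + space,
                              D[i][j - 1] + space)
    return D

def sp_align_3(S1, S2, S3, match=0, mismatch=1, space=1):
    """Optimal SP distance of three strings via the seven-term recurrence."""
    n1, n2, n3 = len(S1), len(S2), len(S3)
    m = lambda a, b: match if a == b else mismatch
    D12, D13, D23 = pairwise_dist(S1, S2), pairwise_dist(S1, S3), pairwise_dist(S2, S3)
    D = [[[0] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    # Boundary faces: one string is empty, so its induced pairs contribute only spaces.
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            D[i][j][0] = D12[i][j] + (i + j) * space
    for i in range(n1 + 1):
        for k in range(n3 + 1):
            D[i][0][k] = D13[i][k] + (i + k) * space
    for j in range(n2 + 1):
        for k in range(n3 + 1):
            D[0][j][k] = D23[j][k] + (j + k) * space
    # Interior cells: the seven-term recurrence.
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            for k in range(1, n3 + 1):
                a, b, c = S1[i - 1], S2[j - 1], S3[k - 1]
                D[i][j][k] = min(
                    D[i-1][j-1][k-1] + m(a, b) + m(b, c) + m(a, c),
                    D[i-1][j-1][k]   + m(a, b) + 2 * space,
                    D[i-1][j][k-1]   + m(a, c) + 2 * space,
                    D[i][j-1][k-1]   + m(b, c) + 2 * space,
                    D[i-1][j][k]     + 2 * space,
                    D[i][j-1][k]     + 2 * space,
                    D[i][j][k-1]     + 2 * space)
    return D[n1][n2][n3]

# The three example strings with their gaps removed.
print(sp_align_3("AATGGTTT", "AACGTTAT", "TATCGAAT"))
```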

16 Sum-of-Pairs Speedup Q: How can we speed up our DP approach?
A: Use forward dynamic programming. Note: so far we have used backward dynamic programming, i.e., cell (i, j, k) looks back to the seven cells that can influence its value. In contrast: forward DP sends the result of cell (i, j, k) forward to the seven cells whose value it could influence.

17 Sum-of-Pairs Speedup Q: How does this speed things up?
A: It doesn't, if we always send cell (i, j, k)'s value forward. The only significant way to speed up the Θ(n^k) computation is to avoid computing all n^k cells in the DP table. We will use forward DP to reduce the number of cells that we compute in the DP table.

18 Sum-of-Pairs Speedup Let’s rethink this problem:
View the optimal alignment problem as finding the shortest path through the weighted edit-distance graph. We are looking for the shortest path from (0,0,0) to (n,n,n). When node (i, j, k) is computed, we have the shortest path from (0,0,0) to (i, j, k). The value of node (i, j, k) is sent forward to the seven neighboring nodes that it can influence.

19 Sum-of-Pairs Speedup Let w be a node reached by an outgoing edge from (i, j, k). The true shortest distance from (0,0,0) to w is known only after w has been updated by every node with an incoming edge to it. A queue is used to order the nodes for processing: the final shortest distance of the node v at the head of the queue is fixed, and v is removed. Every neighbor w of v is then updated, and w is placed in the queue if it is not already there.
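The queue discipline described here is essentially Dijkstra's algorithm on the edit graph. A minimal, generic sketch, assuming neighbors(v) yields the (up to seven) nodes reachable from v and edge_cost(v, w) is the SP transition cost; a priority queue keyed by current distance guarantees that the node at the head already has its final distance.

```python
import heapq

def forward_dp(source, target, neighbors, edge_cost):
    """Forward DP / shortest path: fix the head node, push its value forward."""
    dist = {source: 0}
    settled = set()
    queue = [(0, source)]
    while queue:
        d, v = heapq.heappop(queue)
        if v in settled:
            continue
        settled.add(v)          # d is now the true shortest distance to v
        if v == target:
            return d
        for w in neighbors(v):  # send v's value forward to the cells it influences
            nd = d + edge_cost(v, w)
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(queue, (nd, w))
    return dist.get(target)
```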

20 Sum-of-Pairs Speedup At this point we borrow an A*-like idea:
IF (i, j, k) is not on the shortest path from (0,0,0) to (n,n,n), THEN avoid passing its value forward. More importantly, avoid putting its neighbors that are not already in the queue into the queue. The trick is deciding that (i, j, k) is not on the shortest path from (0,0,0) to (n,n,n). Q: How do we pull this rabbit out of our hat?

21 Sum-of-Pairs Speedup Define d1,2(i, j) to be the edit distance between the suffixes S1[i..n] and S2[j..n]. Define d1,3(i, k) and d2,3(j, k) analogously. Note: these edit distances can be computed in O(n^2) time via DP on the reversed strings. Observation: any shortest path from (i, j, k) to (n,n,n) must have distance at least d1,2(i, j) + d1,3(i, k) + d2,3(j, k).

22 Sum-of-Pairs Speedup Suppose we have an alignment (from somewhere) with an SP distance score z. Core idea: if D(i, j, k) + d1,2(i, j) + d1,3(i, k) + d2,3(j, k) > z, then node (i, j, k) cannot be on any shortest path. Do not pass its value forward. Do not put its neighbors reached by outgoing edges onto the queue.
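A sketch of this pruning test, assuming z is the SP score of an alignment obtained elsewhere (for example, the center-star alignment discussed later). The suffix tables are ordinary pairwise edit-distance DPs run on the reversed strings; rest(i, j) returned by suffix_dist is the distance of the portions of the two strings not yet consumed at cell (i, j, ·).

```python
def edit_table(A, B, mismatch=1, space=1):
    """Plain pairwise edit-distance DP table with unit-style costs."""
    n, m = len(A), len(B)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                D[i][j] = (i + j) * space
            else:
                sub = 0 if A[i - 1] == B[j - 1] else mismatch
                D[i][j] = min(D[i - 1][j - 1] + sub,
                              D[i - 1][j] + space,
                              D[i][j - 1] + space)
    return D

def suffix_dist(A, B):
    """rest(i, j) = edit distance between the unaligned suffixes of A and B."""
    rev = edit_table(A[::-1], B[::-1])
    n, m = len(A), len(B)
    return lambda i, j: rev[n - i][m - j]

def can_prune(D_ijk, i, j, k, d12, d13, d23, z):
    """True if cell (i, j, k) cannot lie on any optimal path from (0,0,0)."""
    return D_ijk + d12(i, j) + d13(i, k) + d23(j, k) > z
```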

23 Sum-of-Pairs Speedup Benefits of being able to prune cell (i, j, k):
We automatically prune many of its descendants. We don't process all n^k cells in a k-string problem. Big win! The computation is still exact and will find the optimal alignment.

24 Sum-of-Pairs Speedup The program called MSA implements the speedup we are discussing. Cold shower: MSA can align about 6 strings of length n ≈ 200; it is unlikely to be able to align tens or hundreds of strings. Still, that is 200^6 ≈ 6.4 × 10^13 cells, which would otherwise be impossible.

25 Bounded-Error Approximation for SP-Alignment
Q: Where do we get z from? A: We will use a bounded-error approximation method. Properties of the specific method we will discuss: Polynomial worst-case time complexity The SP-score is less than twice the optimal value.

26 Bounded-Error Approximation for SP-Alignment
Idea: focus on alignments consistent with a tree. Q: What do we mean by “consistent with a tree”?

27 Bounded-Error Approximation for SP-Alignment
Informal explanation: A graph edge denotes a relation between two nodes. Recall that D(Si, Sj) is the optimal weighted edit distance between Si and Sj. We could let D(Si, Sj) be the edge relation between the node labeled Si and the node labeled Sj.

28 Bounded-Error Approximation for SP-Alignment
Informal explanation continued: Suppose we have a multiple alignment M. Suppose we construct an unrooted tree from a subset of such edges, between nodes labeled with strings from M. We call the alignment of the strings represented in the tree consistent with the tree. (Recall that D(Si, Sj) is the edge relation.)

29 Bounded-Error Approximation for SP-Alignment
Example from text: A X X _ Z A X _ _ Z A _ X _ Z A Y _ _ Z A Y X X Z

30 Bounded-Error Approximation for SP-Alignment
Defn. More formally, let: S be a set of distinct strings; T be an unrooted tree whose nodes are labeled with strings from the set S; M be a multiple alignment of the strings in S. M is consistent with T if the induced pairwise alignment of Si and Sj has score D(Si, Sj) for each pair of strings (Si, Sj) that label adjacent nodes in T.

31 Bounded-Error Approximation for SP-Alignment
Thm. For any set of strings S and for any tree T whose nodes are labeled by distinct strings from the set S, we can efficiently find a multiple alignment M(T) of S that is consistent with T. Proof sketch: construct M(T) of S one string at a time. Base case: Pick two strings Si and Sj labeling adjacent nodes in T. Create M2(T), a two-string alignment with distance D(Si, Sj).

32 Bounded-Error Approximation for SP-Alignment
Inductive Hypothesis: Assume the theorem holds for k ≥ 2 strings, i.e., Mk(T) is consistent with T. Inductive Step: show that the theorem holds for k + 1 strings. Pick a string Sj not in Mk(T) such that it labels a node adjacent to a node labeled Si already in Mk(T). Optimally align Sj with Si (Si with spaces, as it appears in Mk(T)). Add Sj (Sj with spaces) to Mk(T), creating Mk+1(T). Look at the detailed proof (pg. 348) to see how the issue of inserted spaces is handled.

33 Bounded-Error Approximation for SP-Alignment
By construction: Sj and Si have distance D(Si, Sj) Mk+1(T) is consistent with T. By induction, M(T) of S is consistent with T and is efficiently computed.

34 Bounded-Error Approximation for SP-Alignment
We need some more definitions at this point: Defn. the center string Sc ∈ S, where S is a set of k strings, is the string that minimizes M = ∑_{Sj ∈ S} D(Sc, Sj). Defn. the center star is a star tree of k nodes, with the center node labeled Sc and each of the k−1 remaining nodes labeled by a distinct string in S − {Sc}.

35 Bounded-Error Approximation for SP-Alignment
Defn. the multiple alignment Mc of the strings in S is the multiple alignment consistent with the center star. Defn. let d(Si, Sj) denote the score of the pairwise alignment of strings Si and Sj induced by Mc. Defn. let d(M) denote the score of the alignment M. Observations: d(Si, Sj) ≥ D(Si, Sj), and d(Mc) = ∑_{i<j} d(Si, Sj).
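A minimal sketch of the center-string computation that the center star is built around, assuming dist(Si, Sj) is any pairwise edit-distance routine (for example, a standard DP):

```python
def center_string(strings, dist):
    """Return the string of S minimizing the summed distance to all others."""
    return min(strings,
               key=lambda s: sum(dist(s, t) for t in strings if t is not s))

# Usage sketch: Sc = center_string(S, dist); the center star joins Sc to every
# other string, and Mc is built by aligning each string optimally against Sc.
```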

36 Bounded-Error Approximation for SP-Alignment
Defn. the triangle inequality w.r.t. a scoring scheme is the relation s(x, z) ≤ s(x, y) + s(y, z) for any three characters x, y, and z. We can extend the triangle inequality from the scoring scheme for characters to string alignment.

37 Bounded-Error Approximation for SP-Alignment
Lemma. If a 2-string scoring scheme that satisfies the triangle inequality is used, then for any Si and Sj: d(Si, Sj) ≤ d(Si, Sc) + d(Sc, Sj) = D(Si, Sc) + D(Sc, Sj). Proof sketch: Notice that for each column we have s(x, z) ≤ s(x, y) + s(y, z). The inequality in the lemma follows immediately. The equality holds since all strings are optimally aligned with Sc in Mc.

38 Bounded-Error Approximation for SP-Alignment
We can now establish the bounded-error approximation: Defn. Let M* denote the optimal alignment of the k strings of S. Defn. Let d*(Si, Sj) denote the pairwise alignment score of the strings Si and Sj induced by M*.

39 Bounded-Error Approximation for SP-Alignment
Thm. d(Mc)/d(M*) ≤ 2(k − 1)/k < 2. See the proof on page 350 for details (it depends mainly on the previous lemma). Corollary: ∑_{i<j} D(Si, Sj) ≤ d(M*) ≤ d(Mc) ≤ [2(k − 1)/k] ∑_{i<j} D(Si, Sj). Recall that M = ∑_{Sj ∈ S} D(Sc, Sj); the alignment score D(Si, Sj) is not based on Mc or M*. Observation: d(Mc)/∑_{i<j} D(Si, Sj) gives a measure of the goodness of Mc and is guaranteed to be less than 2.

40 Consensus Objective Functions
First fact of consensus representations: there is no consensus as to how to define consensus. Consequently, we will look at several definitions. Steiner consensus strings: Defn. Given a set of strings S and a string S′, the consensus error of S′ relative to S is E(S′) = ∑_{Sj ∈ S} D(S′, Sj). S′ is not required to be a member of S.

41 Consensus Objective Functions
Defn. Given a set of strings S, an optimal Steiner string S* for S minimizes the consensus error E(S*). S* is not required to be a member of S. Observations: with S* we are trying to capture the essential common features of the strings in S. Computing S* (and hence E(S*)) appears to be a hard problem.

42 Consensus Objective Functions
There is no known efficient method for finding S*. We will consider an approximation method. Lemma: Assume that S contains k strings and that the scoring scheme satisfies the triangle inequality. Then there exists a string S′ ∈ S such that E(S′)/E(S*) ≤ 2. Q: What does this lemma say? (Proof sketch on the next slide.)

43 Consensus Objective Functions
Proof sketch: For any Sj, D(S′, Sj) ≤ D(S′, S*) + D(S*, Sj). So E(S′) = ∑_{Sj ∈ S} D(S′, Sj) ≤ ∑_{Sj ≠ S′} [D(S′, S*) + D(S*, Sj)] = (k − 1) D(S′, S*) + E(S*) − D(S*, S′) = (k − 2) D(S′, S*) + E(S*). Therefore E(S′) ≤ (k − 2) D(S′, S*) + E(S*).

44 Consensus Objective Functions
Q: Where do we find a good candidate for S′? A: Sc, the center string. Recall that Sc minimizes ∑_{Sj ∈ S} D(Sc, Sj). Thm. E(Sc)/E(S*) ≤ 2 − 2/k, assuming the scoring scheme satisfies the triangle inequality. Proof. Follows immediately from the previous lemma and the observation that E(Sc) ≤ E(S′).
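Spelling out the "follows immediately" step, here is a short derivation in the spirit of the lemma's proof. It assumes S′ is chosen as the string of S closest to S*, so that D(S′, S*) ≤ E(S*)/k (the minimum of k distances is at most their average):

$$E(S_c) \;\le\; E(S') \;\le\; (k-2)\,D(S', S^*) + E(S^*) \;\le\; (k-2)\,\frac{E(S^*)}{k} + E(S^*) \;=\; \Bigl(2 - \frac{2}{k}\Bigr)\,E(S^*).$$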

45 Consensus Objective Functions
Consensus strings from multiple alignments. Defn. Let M be a multiple alignment of the strings in S; the consensus character of column i of M is the character that minimizes the summed distance to all the characters in column i. Note: the summed distance depends on the pairwise scoring scheme. The plurality character is the consensus character for some scoring schemes.

46 Consensus Objective Functions
Defn. Let d(i) denote the minimum such sum in column i. Defn. The consensus string SM derived from alignment M is the concatenation of the consensus characters of the columns of M. Q: How can we evaluate the goodness of SM? A: One possibility is Goodness(SM) = ∑_i D(SM, Si), i.e., see how good a Steiner string SM is. Consider a different approach...

47 Consensus Objective Functions
Defn. The alignment error of SM, a consensus string containing q characters, is ∑_{i=1..q} d(i). Defn. The alignment error of M is defined as the alignment error of SM, its consensus string. Example:
1 A A T - G - T T T
2 A A - C G T T A T
3 T A T C G - A A T
  A A T C G - T A T   Consensus (alignment error of ?)
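A minimal sketch of deriving the consensus string and its alignment error from a multiple alignment, assuming the unit costs used earlier and allowing the space character '-' as a consensus choice (as in the example above):

```python
def column_cost(column, c, mismatch=1, space=1):
    """Summed distance from candidate character c to all characters in the column."""
    total = 0
    for x in column:
        if x != c:
            total += space if '-' in (x, c) else mismatch
    return total

def consensus(alignment):
    """Return (consensus string SM, alignment error = sum of d(i) over columns)."""
    alphabet = set("".join(alignment))
    chars, error = [], 0
    for col in zip(*alignment):
        best = min(alphabet, key=lambda c: column_cost(col, c))
        chars.append(best)
        error += column_cost(col, best)   # this is d(i) for the column
    return "".join(chars), error

SM, err = consensus(["AAT-G-TTT", "AA-CGTTAT", "TATCG-AAT"])
print(SM, err)
```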

48 Consensus Objective Functions
Defn. The optimal consensus multiple alignment is a multiple alignment M whose consensus string SM has the smallest alignment error over all possible multiple alignments of S.

49 Consensus Objective Functions
The 3 notions of consensus we have discussed are: the Steiner string S* defined from S; the consensus string SM derived from M, with goodness measured by how well it serves as a Steiner string; and the consensus string SM derived from M, with goodness measured by its ability to reflect the column-wise properties of M. Surprisingly (or not), they lead to the same multiple alignment.

50 Consensus Objective Functions
Let's investigate the assertion that these concepts result in the same multiple alignment. Let S be a set of k strings. Let T be the star tree with the Steiner string S* at the center and each of the k strings of S at a distinct leaf of T. Then: Defn. the multiple alignment consistent with S* is the multiple alignment of S ∪ {S*} consistent with T.

51 Consensus Objective Functions
Thm. Let SM denote the consensus string of the optimal consensus multiple alignment. Removing the spaces from SM yields the optimal Steiner string S*. Conversely, removing S* from the multiple alignment consistent with S* yields the optimal consensus multiple alignment of S. Proof on page 353.

52 Consensus Objective Functions
Q: Why should we care about this theorem? A: The theorem stating E(Sc)/E(S*) ≤ 2 − 2/k plus this theorem can be used to approximate the optimal consensus alignment: Find the center string Sc. Recall that the center string Sc ∈ S, where S is a set of k strings, is the string that minimizes ∑_{Sj ∈ S} D(Sc, Sj). Place Sc at the center of a k-node star.

53 Consensus Objective Functions
Label each leaf with a string from S. Construct the multiple alignment M consistent with this tree T. Recall: M is consistent with T if the induced pairwise alignment of Si and Sj has score D(Si, Sj) for each pair of strings (Si, Sj) that label adjacent nodes in T.

54 Consensus Objective Functions
Revelation: The multiple alignment M is the same as Mc, the alignment used to approximate the SP objective function. Thm. The multiple alignment Mc created by the center-star method has: an SP score ≤ (2 − 2/k) times the score of the optimal SP alignment; a consensus alignment error ≤ (2 − 2/k) times the alignment error of the optimal consensus multiple alignment.

55 Phylogenetic trees: Multiple alignment
Phylogenetic tree: a depiction of the evolutionary history of a set of taxa. The leaves of the tree are labeled by taxa names. Convention: each edge (u, v) denotes an ancestor-descendant relation. This relation may be on the basis of morphological attributes or sequence similarity. The internal nodes represent extinct taxa. The leaves represent currently existing taxa.

56 Phylogenetic trees: Multiple alignment
Two related problems: Problem 1: find a multiple alignment for a tree. Given a phylogenetic tree, deduce sequences for the internal nodes that optimize some objective function; find the multiple alignment consistent with the tree; delete the deduced sequences (the internal node labels). Problem 2: find a tree from a set of leaf sequences (Chapter 17).

57 Phylogenetic trees: Multiple alignment
Let T be a tree with leaf nodes labeled with distinct strings from a set S. Defn. a phylogenetic alignment for T is an assignment of one string to each internal node. Note: strings labeling internal nodes need not come from S.

58 Phylogenetic trees: Multiple alignment
Recall that D(S1, S2) denotes the edit distance between strings S1 and S2. Defn. The edge distance of edge (i, j) is D(Si, Sj) where Si and Sj are the strings labeling nodes i and j, respectively. Defn. Path distance is the sum of edge distances along the path. Defn. Phylogenetic alignment distance is the sum of all edge distances in the tree.
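The phylogenetic alignment distance just defined is easy to express in code. A minimal sketch, assuming the tree is given as a list of edges (u, v), label maps every node (leaf or internal, once some assignment has been made) to its string, and dist is any pairwise edit-distance routine:

```python
def phylo_alignment_distance(edges, label, dist):
    """Sum of the edge distances D(label(u), label(v)) over all tree edges."""
    return sum(dist(label[u], label[v]) for u, v in edges)
```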

59 Phylogenetic trees: Multiple alignment
Phylogenetic alignment problem for T: Find an assignment of strings to internal nodes of T that minimizes the distance of the alignment.

60 Phylogenetic trees: Multiple alignment
Phylogenetic alignment problem for T: The general problem is too hard (NP-complete). We will consider a heuristic approximate solution. The solution is within twice the minimal distance. The approach has polynomial time complexity.

61 Phylogenetic trees: Multiple alignment
Defn. A lifted alignment is a phylogenetic alignment in which the string assigned to each internal node is also assigned to one of its children.

62 Phylogenetic trees: Multiple alignment
Lifted Alignment Observation: Each internal node v is labeled by a leaf label appearing in the subtree rooted at v.

63 Phylogenetic trees: Multiple alignment
Plan: Construct a lifted alignment TL. Initial approach: conceptually transform the optimal phylogenetic alignment. Q: Why do we say “conceptually”? A: Because we don’t have T*, the optimal phylogenetic alignment. Demonstrate property of TL: total distance < twice optimal phylogenetic alignment distance. Next: show how to compute TL efficiently using DP.

64 Phylogenetic trees: Multiple alignment
Creating TL: Start with the input tree T, with leaves labeled by distinct strings. Let T* denote the optimal phylogenetic alignment for T. (This is the assignment of strings to internal nodes of T that minimizes the total of all edge distances.) Successively lift each internal node. An internal node can only be lifted if all of its children have been lifted. Leaf nodes are defined to be lifted.

65 Phylogenetic trees: Multiple alignment
Q: How do we “lift” a node? Let S*v denote the label of node v in T*. Assume that v's children have been lifted. WLOG let the labels of v's children be S1, S2, ..., Sk from S.

66 Phylogenetic trees: Multiple alignment
Find the string Sj among the children's labels that is closest to S*v, i.e., the string Sj such that D(S*v, Sj) ≤ D(S*v, Si) for all i from 1 to k. Replace S*v with Sj.
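Conceptually, one lift step is just an argmin over the children's labels. A tiny sketch (conceptual only, since in practice T* and hence S*v are not available; this is the transformation used in the proof):

```python
def lift(S_star_v, child_labels, dist):
    """Replace the (hypothetical) optimal label S*v by the closest child label."""
    return min(child_labels, key=lambda Sj: dist(S_star_v, Sj))
```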

67 Phylogenetic trees: Multiple alignment
Claim: The lifted alignment TL has total distance less than or equal to twice that of the optimal phylogenetic alignment T* of T. Sketch of proof: Suppose e(v, w) (v the parent of w) is a nonzero-length edge in TL. Suppose v is labeled Sj ∈ S and w is labeled Si ∈ S. If Sj ≠ Si then the distance of e in TL is D(Sj, Si) ≤ D(Sj, S*v) + D(S*v, Si). But D(Sj, S*v) + D(S*v, Si) ≤ 2 * D(S*v, Si). Q: Why is this true? A: Because D(Sj, S*v) ≤ D(S*v, Si), since Sj was chosen as the child label closest to S*v.

68 Phylogenetic trees: Multiple alignment
Sketch of proof (continued): What about paths? Let Pe denote the path in T* from v to the leaf labeled Si. By the triangle inequality, D(S*v, Si) is at most the sum of the edge distances along Pe. So if e is a nonzero-length edge in TL, its distance is at most twice the distance of Pe.

69 Phylogenetic trees: Multiple alignment
The lifted alignment can be computed with DP. Let Tv be the subtree of T rooted at node v. Defn. d(v, S) denotes the distance of the best lifted alignment of Tv where v is labeled with S. Obviously, S must be the label of a leaf in Tv.

70 Phylogenetic trees: Multiple alignment
d(v, S) is computed from the leaves up. The leaves are already considered “lifted”. d(v, S) for a parent of leaves is computed by: d(v, S) = ∑_{S′} D(S, S′), where S′ ranges over the labels of the children of v. The general recurrence for an internal node is: d(v, S) = ∑_{v′} min_{S′} [D(S, S′) + d(v′, S′)], where v′ ranges over the children of v and S′ over the leaf labels in Tv′.
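A sketch of this bottom-up DP, implementing the recurrence exactly as stated above. It assumes the tree is given as a dict children mapping each node to its list of children (leaves map to []), leaf_label gives each leaf's string, and dist(S, S2) is a pairwise edit distance (in practice precomputed, as the next slide assumes):

```python
def lifted_alignment(root, children, leaf_label, dist):
    """d[v][S] = best lifted-alignment distance of Tv with node v labeled S."""
    d = {}

    def solve(v):
        if not children[v]:                 # a leaf is already "lifted"
            d[v] = {leaf_label[v]: 0}
            return
        for c in children[v]:
            solve(c)
        # Candidate labels for v: the leaf labels appearing in Tv.
        candidates = {S for c in children[v] for S in d[c]}
        d[v] = {}
        for S in candidates:
            d[v][S] = sum(min(dist(S, S2) + d[c][S2] for S2 in d[c])
                          for c in children[v])

    solve(root)
    return d

# Usage sketch: the optimal lifted distance is min(lifted_alignment(...)[root].values()).
```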

71 Phylogenetic trees: Multiple alignment
Time analysis: Assume that T has k leaves. Assume that all pairwise distances have been computed. Q: How long does this take? A: O(N^2), where N is the total length of all k strings. Why is this true? How can we explain it?

72 Phylogenetic trees: Multiple alignment
Time analysis: The processing at an internal node is O(k^2). Why is this true? Then the total time is O(N^2 + k^3). Why O(N^2 + k^3) and not O(N^2 + k^2)? Bottom line: we can compute the optimal lifted alignment in time that is polynomial in the length of the strings and the size of the tree.

