UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12: Refining Core String.


1 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12: Refining Core String Edits and Alignments Lecturer: Dr. Rose Slides by: Dr. Rose February 13, 2003

2 Homework: due 2/20/03
Chapter 11 questions: #1, #4, #7, #8
Additional question for grad students: #10

3 Linear Space Alignments
Dynamic programming takes Θ(nm) space for alignments.
Can alignments be computed in linear space?
Hirschberg's method
– Good news: reduces space from Θ(nm) to O(n), where n < m.
– Bad news: doubles the worst-case time bound.

4 Linear Space for Similarity
Recall: similarity is expressed as a single scalar.
There is an alignment that corresponds to this scalar.
– i.e., the optimal alignment whose value is this scalar.
We've needed the O(nm) table for the alignment.
Q: If we only want the similarity value, do we need the table?
A: No. We only need the space required to compute the value.

5 Linear Space for Similarity
Q: How much space is that?
A: Two rows.
– Recall: to compute cell (i, j) we need cells (i-1, j-1), (i-1, j), and (i, j-1).
– Cells (i-1, j-1) and (i-1, j) are on the previous row.
– Cells (i, j-1) and (i, j) are on the current row.
– We only need the current row, C, and the previous row, P.
– When the current row is done, copy it to the previous row for the next iteration, i.e., C → P.

6 Linear Space for Similarity
– After n iterations, row C holds row n, the last row of the full table.
– The similarity value V(n, m) is in the last cell of C.
– The time complexity is still O(nm), but the space is now O(m).
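
The two-row scheme can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the scoring values (+1 match, -1 mismatch, -1 for a space) and the names `score`, `GAP`, and `similarity` are assumptions chosen for the example.

```python
def score(a: str, b: str) -> int:
    """Hypothetical scoring scheme: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


GAP = -1  # hypothetical score for a character opposite a space


def similarity(s1: str, s2: str) -> int:
    """Return V(n, m) while keeping only two rows of the table."""
    m = len(s2)
    prev = [j * GAP for j in range(m + 1)]  # row 0 of the table (row P)
    for i in range(1, len(s1) + 1):
        cur = [i * GAP] + [0] * m  # row C; column 0 is i spaces
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + score(s1[i - 1], s2[j - 1]),  # cell (i-1, j-1)
                prev[j] + GAP,                              # cell (i-1, j)
                cur[j - 1] + GAP,                           # cell (i, j-1)
            )
        prev = cur  # the finished current row becomes the previous row
    return prev[m]
```

Under this assumed scoring, `similarity("kitten", "sitting")` evaluates to 1 (four matches, two mismatches, one space), and only two lists of length m + 1 are alive at any time.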

7 Linear Space Alignments
Q: How can we find the actual alignment in linear space?
Consider an alignment solution path in the table computation.

8 Linear Space Alignments
Imagine that we knew that the optimal alignment solution path went through cell (n/2, k*).
Knowing this, we could solve the problem by piecing together solution paths for the diagonal quadrants (upper left and lower right).

9 Linear Space Alignments
Just as important, we could ignore the antidiagonal quadrants (upper right and lower left).
We could repeat this process, reducing the amount of space needed to find the optimal alignment.

10 Linear Space Alignments
Repeating this process would reduce the amount of space needed to find the optimal alignment.
Q: How far can we go?

11 Linear Space Alignments
Q: How can we find the cell (n/2, k*)?
A: Stay tuned!
Defn. Let α^r denote the reverse of string α.
Defn. V^r(i, j) is the similarity of the first i characters of S1^r with the first j characters of S2^r.

12 Linear Space Alignments
An equivalent formulation: V^r(i, j) is the similarity of the last i characters of S1 with the last j characters of S2.
It should be obvious how V^r(i, j) can be computed in O(nm) time and O(m) space.
Furthermore, any row can be computed in O(m) space.

13 Linear Space Alignments
Lemma: V(n, m) = max_{0 ≤ k ≤ m} [V(n/2, k) + V^r(n/2, m-k)]
Q: What does this lemma say?
A: The alignment value V(n, m) is the sum of the values of the smaller alignment problems V(n/2, k) and V^r(n/2, m-k), where k is chosen to yield the largest sum.

14 Linear Space Alignments
Defn: Let k* be the position k that maximizes V(n/2, k) + V^r(n/2, m-k).
Defn: Let L denote the solution path from cell (0, 0) to cell (n, m).
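
The lemma and the definition of k* can be checked numerically with a short Python sketch. The scoring values and the names `score`, `GAP`, `last_row`, and `best_split` are assumptions made for this illustration. When n is odd, the sketch gives the reverse table the remaining n - n/2 rows, a small generalization of the V^r(n/2, m-k) written on the slide.

```python
def score(a: str, b: str) -> int:
    """Hypothetical scoring scheme: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


GAP = -1  # hypothetical score for a character opposite a space


def last_row(s1: str, s2: str) -> list:
    """Row len(s1) of the similarity table for s1 vs s2, using O(len(s2)) space."""
    prev = [j * GAP for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * GAP]
        for j in range(1, len(s2) + 1):
            cur.append(max(prev[j - 1] + score(s1[i - 1], s2[j - 1]),
                           prev[j] + GAP,
                           cur[j - 1] + GAP))
        prev = cur
    return prev


def best_split(s1: str, s2: str):
    """Return (k*, V(n, m)), where k* maximizes the forward + reverse sum."""
    h, m = len(s1) // 2, len(s2)
    fwd = last_row(s1[:h], s2)              # V(n/2, k) for every k
    rev = last_row(s1[h:][::-1], s2[::-1])  # V^r over the remaining rows
    sums = [fwd[k] + rev[m - k] for k in range(m + 1)]
    k_star = max(range(m + 1), key=lambda k: sums[k])
    return k_star, sums[k_star]
```

The second component returned by `best_split` should always equal the last entry of `last_row(s1, s2)`, which is exactly the statement of the lemma.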

15 Linear Space Alignments
Defn: Let L_{n/2} be the subpath of L that starts with the last node of L in row n/2-1 and ends with the first node of L in row n/2+1.

16 Linear Space Alignments
Lemma:
1. Position k* can be found in row n/2 in time O(nm) and space O(m).
2. The subpath L_{n/2} can be found and stored in the same bounds.
Proof sketch:
1. Process the first n/2 rows of the S1 and S2 alignment, saving row n/2 with its traceback pointers.
2. Process the first n/2 rows of the S1^r and S2^r alignment, saving row n/2 with its traceback pointers.

17 Linear Space Alignments
Proof sketch continued:
3. For each k, add V(n/2, k) and V^r(n/2, m-k).
4. Set k* to be the k that maximizes V(n/2, k) + V^r(n/2, m-k).
Steps 1 & 2 take O(nm) time and O(m) space. Steps 3 & 4 take O(m) time.
5. One set of traceback pointers leads from k* to k1 in row n/2-1.
6. The other leads from k* to k2 in row n/2+1.
Steps 5 & 6 give the subpath L_{n/2}.

18 Linear Space Alignments
Summary: In O(nm) time and O(m) space we have:
1. Found the value V(n, m)
2. Found k*, k1, and k2
3. Found the subpath L_{n/2}
4. Created two subproblems:
   1. Aligning S1[1..n/2-1] with S2[1..k1]
   2. Aligning S1[n/2+1..n] with S2[k2..m]

19 Linear Space Alignments
Aligning S1[1..n/2-1] with S2[1..k1] is the top problem, labeled A.
Aligning S1[n/2+1..n] with S2[k2..m] is the bottom problem, labeled B.

20 Linear Space Alignments
Q: What is the dynamic programming time for a p by q table?
A: cpq, where c is some constant.
Q: Determining the n/2th row of an n by m table takes how long?
A: cnm/2
⇒ Thus cnm time to process the two rows (V & V^r).
We can solve problems A and B in time proportional to their total size.
– The middle row of A can be determined in c·k*·n/2.
– The middle row of B can be determined in c·(m-k*)·n/2.
– Altogether this is cnm/2.

21 Linear Space Alignments
Q: How are we going to find the optimal alignment in linear space?
A: Use recursion!

22 Linear Space Alignments
Hirschberg's Algorithm
Procedure OPTA(l, l´, r, r´) {
  h = l + (l´ - l)/2; /* midpoint of first substring */
  Find k*, k1, k2, & L_h in space O(l´ - l) = O(m)
  OPTA(l, h-1, r, k1); /* new top problem */
  output subpath L_h;
  OPTA(h+1, l´, k2, r´); /* new bottom problem */
}
The first call is: OPTA(1, n, 1, m)
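
A runnable Python sketch of the recursion is given below. It is a simplified variant, not a transcription of OPTA: it splits at cell (n/2, k*) and recurses on the two halves instead of extracting the subpath L_h with k1 and k2, and it adds explicit base cases. The scoring values and the names `score`, `GAP`, `last_row`, and `hirschberg` are assumptions for this example. String slices are used for readability, so the sketch does not strictly meet the O(m) space bound that index-based calls such as OPTA(l, l´, r, r´) would.

```python
def score(a: str, b: str) -> int:
    """Hypothetical scoring scheme: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


GAP = -1  # hypothetical score for a character opposite a space


def last_row(s1: str, s2: str) -> list:
    """Row len(s1) of the similarity table for s1 vs s2, using O(len(s2)) space."""
    prev = [j * GAP for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * GAP]
        for j in range(1, len(s2) + 1):
            cur.append(max(prev[j - 1] + score(s1[i - 1], s2[j - 1]),
                           prev[j] + GAP,
                           cur[j - 1] + GAP))
        prev = cur
    return prev


def hirschberg(s1: str, s2: str):
    """Return an optimal global alignment as two equal-length strings ('-' = space)."""
    if not s1:
        return "-" * len(s2), s2
    if not s2:
        return s1, "-" * len(s1)
    if len(s1) == 1:
        # Base case: put the single character opposite its best partner in s2,
        # unless leaving it opposite a space scores higher.
        j = max(range(len(s2)), key=lambda q: score(s1, s2[q]))
        if score(s1, s2[j]) + (len(s2) - 1) * GAP >= (len(s2) + 1) * GAP:
            return "-" * j + s1 + "-" * (len(s2) - j - 1), s2
        return s1 + "-" * len(s2), "-" + s2
    h, m = len(s1) // 2, len(s2)
    fwd = last_row(s1[:h], s2)              # forward values V(h, k)
    rev = last_row(s1[h:][::-1], s2[::-1])  # reverse values for the other rows
    k = max(range(m + 1), key=lambda q: fwd[q] + rev[m - q])  # k*
    top1, top2 = hirschberg(s1[:h], s2[:k])  # top problem
    bot1, bot2 = hirschberg(s1[h:], s2[k:])  # bottom problem
    return top1 + bot1, top2 + bot2
```

Calling `hirschberg("AGTACGCA", "TATGC")` returns two gapped strings of equal length; removing the `-` characters recovers the inputs, and the alignment's score matches the last entry of `last_row` on the same pair.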

23 Linear Space Alignments
Analysis:
– The first call uses cnm time.
– The second level uses cnm/2 time for 2 subproblems.
– The ith level of recursion entails 2^(i-1) subproblems.
– There are n/2^(i-1) rows in each of the level i problems.
– The total time at level i is cnm/2^(i-1).
Thm. Hirschberg's optimal alignment algorithm takes time ∑_{i=1}^{1+log n} cnm/2^(i-1) ≤ 2cnm and space O(m).
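
The 2cnm bound in the theorem is a geometric series. A short derivation, assuming for simplicity that n is a power of 2 so that the recursion has exactly 1 + log n levels:

```latex
\sum_{i=1}^{1+\log n} \frac{cnm}{2^{\,i-1}}
  = cnm \sum_{j=0}^{\log n} \frac{1}{2^{j}}
  < cnm \sum_{j=0}^{\infty} \frac{1}{2^{j}}
  = 2cnm .
```

Each level costs half as much as the one before it, so all levels together cost less than twice the first call, which is where the "doubles the worst-case time bound" statement comes from.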

24 Linear Space Alignments
Q: What about computing local alignment?
Recall:
– This is solved by finding the cell (i*, j*) with maximum value v.
– i* and j* are the end indices of substrings α and β.
We can compute v row-wise (recall v(i, j) is the optimal suffix alignment, Chapter 11)
⇒ use only linear space.
Q: How do we find the start indices of α and β without the full table?
A: Author suggests reverse dynamic programming. Huh? 'Reverse the polarity'? Where is Dr. Who?

25 Linear Space Alignments
Finding the start indices of α and β:
Extend the algorithm for v to set pointer h(i, j) for each cell (i, j):
– If v(i, j) = 0, then set h(i, j) to (i, j).
– If v(i, j) > 0 and the normal traceback pointer would be to cell (p, q), then set h(i, j) to h(p, q).
Consequently, h(i*, j*) specifies the starting cell, i.e., the starting positions of α and β.
Finding α and β can be done in linear space.
⇒ Local alignment can be done in O(nm) time & O(m) space.
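
The h(i, j) bookkeeping can be sketched in Python with two rows of values and two rows of start pointers. The scoring values and the names `score`, `GAP`, and `local_alignment_span` are assumptions for this illustration. The sketch reports only where α and β start and end; producing the alignment itself would still require a linear-space global alignment of those two substrings.

```python
def score(a: str, b: str) -> int:
    """Hypothetical scoring scheme: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


GAP = -1  # hypothetical score for a character opposite a space


def local_alignment_span(s1: str, s2: str):
    """Return (v, start_cell, end_cell) for a best local alignment.

    With start_cell = (p, q) and end_cell = (i, j), the aligned substrings
    are s1[p:i] and s2[q:j] in Python slice notation.
    """
    m = len(s2)
    prev_v = [0] * (m + 1)
    prev_h = [(0, j) for j in range(m + 1)]  # row 0: every cell has v = 0
    best = (0, (0, 0), (0, 0))
    for i in range(1, len(s1) + 1):
        cur_v = [0] * (m + 1)
        cur_h = [(i, 0)] * (m + 1)  # column 0 has v = 0; the rest is overwritten
        for j in range(1, m + 1):
            candidates = [
                (prev_v[j - 1] + score(s1[i - 1], s2[j - 1]), prev_h[j - 1]),
                (prev_v[j] + GAP, prev_h[j]),
                (cur_v[j - 1] + GAP, cur_h[j - 1]),
            ]
            v, h = max(candidates, key=lambda c: c[0])
            if v <= 0:
                v, h = 0, (i, j)  # v(i, j) = 0: the cell points to itself
            cur_v[j], cur_h[j] = v, h  # otherwise inherit h(p, q)
            if v > best[0]:
                best = (v, h, (i, j))
        prev_v, prev_h = cur_v, cur_h
    return best
```

For example, with `s1 = "xxxACGTyyy"` and `s2 = "zzACGTww"` the function returns `(4, (3, 2), (7, 6))`, so α is `s1[3:7]` and β is `s2[2:6]`, both equal to `"ACGT"`.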

26 Bounded Differences
Imagine problems in which there is a bound on the number of expected differences.
Q: Can we solve the alignment faster than O(nm)?
A: Yes. If the alignment contains at most k differences, O(km) is possible.
Core Idea:
The main diagonal is comprised of cells (i, i), i ≤ n ≤ m.
No k-difference alignment can stray into cells (i, i + l) or (i, i – l), l > k.

27 Bounded Differences
Core Idea:
The main diagonal is comprised of cells (i, i), i ≤ n ≤ m.
No k-difference alignment can stray into cells (i, i + l) or (i, i – l), l > k.

28 Bounded Differences
Recall: a solution path must extend from cell (0, 0), ending along or to the right of the main diagonal in cell (n, m).
Observation: k ≥ m – n is required for a k-difference solution to exist.

29 Bounded Differences
Q: How can we achieve time complexity O(km) in a table with O(nm) entries?
A: Only fill the O(km) cells, out of the O(nm), that straddle the main diagonal.

30 Bounded Differences
Algorithm: Fill in the table in strips 2k+1 cells wide, centered on the main diagonal.
Note: The recurrence requires the three neighboring cells.
Q: How do we handle neighbors in the forbidden zone?
A: Ignore them.

31 Bounded Differences
Thm. There is a global alignment of S1 and S2 with at most k differences IFF the algorithm from the previous slide assigns cell (n, m) the value k or less.
⇒ The k-difference global alignment problem can be solved in time O(km) and space O(km).
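
The strip computation can be sketched in Python as follows. This is an illustration under assumptions: differences are counted as unit-cost mismatches and spaces (edit distance), the name `banded_distance` is hypothetical, and only the value in cell (n, m) is computed, so two strip rows of width 2k+1 suffice. Storing traceback pointers for the whole strip, as needed to recover the alignment, would take the O(km) space stated in the theorem.

```python
def banded_distance(s1: str, s2: str, k: int):
    """Return the number of differences if it is at most k, otherwise None.

    Only cells (i, j) with |i - j| <= k are filled. Each strip row is stored
    at offset d = j - i + k, so it has 2k + 1 entries.
    """
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None  # cell (n, m) lies outside the strip
    inf = float("inf")  # stands in for neighbors in the forbidden zone
    width = 2 * k + 1
    prev = [inf] * width
    for j in range(min(m, k) + 1):
        prev[j + k] = j  # row 0: cell (0, j) costs j spaces
    for i in range(1, n + 1):
        cur = [inf] * width
        for d in range(width):
            j = i + d - k
            if j < 0 or j > m:
                continue
            if j == 0:
                cur[d] = i  # column 0: cell (i, 0) costs i spaces
                continue
            diag = prev[d] + (s1[i - 1] != s2[j - 1])    # cell (i-1, j-1)
            up = prev[d + 1] + 1 if d + 1 < width else inf  # cell (i-1, j)
            left = cur[d - 1] + 1 if d >= 1 else inf        # cell (i, j-1)
            cur[d] = min(diag, up, left)
        prev = cur
    result = prev[m - n + k]
    return result if result <= k else None
```

With these assumptions, `banded_distance("kitten", "sitting", 3)` returns 3, while the same pair with k = 2 returns `None` because no alignment with at most 2 differences exists.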

32 Bounded Differences
Q: What if we don't know the value of k?
Q: How can we decide on a k value?
Soln. Start with k = 1. If no solution is found, let k = 2 * k. Repeat the doubling of k until a solution is found.
We double k to find the optimal value k*.
We stop doubling k when a solution is found.
k* will be the number of differences in the best alignment found with the current value of k.
Since we have been doubling k, k* ≤ k.

33 Bounded Differences
Thm. The doubling of k, starting from 1, yields the edit distance k* and its alignment in O(k*m) time and space.
Proof: Let k_L be the largest value of k used for a given pair of strings. Then k_L ≤ 2k*.
The effort involved is O(k_L m + k_L m/2 + k_L m/4 + ... + m) = O(k_L m).
But O(k_L m) = O(k* m).
Q: Why do we state k_L ≤ 2k* instead of k_L < 2k*?
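
The doubling strategy can be sketched in Python on top of a banded computation. As before, this is an illustration under assumptions: unit-cost differences, hypothetical names `banded_distance` and `distance_by_doubling`, and value-only computation without traceback.

```python
def banded_distance(s1: str, s2: str, k: int):
    """Number of differences if it is at most k, otherwise None (strip of width 2k+1)."""
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    width = 2 * k + 1
    prev = [inf] * width
    for j in range(min(m, k) + 1):
        prev[j + k] = j  # row 0
    for i in range(1, n + 1):
        cur = [inf] * width
        for d in range(width):
            j = i + d - k  # strip offset d corresponds to column j
            if j < 0 or j > m:
                continue
            if j == 0:
                cur[d] = i
                continue
            diag = prev[d] + (s1[i - 1] != s2[j - 1])
            up = prev[d + 1] + 1 if d + 1 < width else inf
            left = cur[d - 1] + 1 if d >= 1 else inf
            cur[d] = min(diag, up, left)
        prev = cur
    result = prev[m - n + k]
    return result if result <= k else None


def distance_by_doubling(s1: str, s2: str):
    """Return (k*, k_L): the edit distance and the last strip bound tried."""
    k = 1
    while True:
        d = banded_distance(s1, s2, k)
        if d is not None:
            return d, k  # a solution exists within the current strip
        k *= 2  # no solution with at most k differences: double k
```

For `("kitten", "sitting")` the strips k = 1 and k = 2 fail and k = 4 succeeds, so the function returns `(3, 4)`. The loop always terminates, because once k reaches max(n, m) the strip covers the whole table.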

34 Homework Due 2/27/03
Part 1
#24. Show how to solve the alphabet-weight alignment problem with affine weights in O(nm) time.
#27. The recurrence relations we developed for the affine gap model follow the logic of paying Wg + Ws when a gap is initiated and then paying Ws for each additional space used in that gap. An alternative logic is to pay Wg + Ws at the point when the gap is "completed". Write recurrence relations for the affine gap model that follow that logic. The recurrences should compute the alignment in O(nm) time.
Continued on next page.

35 Homework Part 2
#1. Show how to compute the value V(n, m) of the optimal alignment using only min(n, m) + 1 space in addition to the space needed to represent the two input strings.
#4. Show how to reduce the size of the strip needed in the method of Section 12.2.3, when |m - n| < k.
Continued on next page.

36 Homework Part 2 continued
Grad students only:
#5. Fill in the details of how to find the actual alignments of P in T that occur with at most k differences. The method uses the O(km) values stored during the k differences algorithm. The solution is somewhat simpler if the k differences algorithm also stores a sparse set of pointers recording how each farthest-reaching d-path extends a farthest-reaching (d-1)-path. These pointers only take O(km) space and are a sparse version of the standard dynamic programming pointers. Fill in the details of this approach as well.
Required portion of question.
Optional, extra credit portion of question.

