1 Fast Parallel and Serial Approximate String Matching
G. Landau and U. Vishkin, Journal of Algorithms, Vol. 10 (1989), pp. 157-169.
Advisor: Prof. R. C. T. Lee. Speaker: L. Y. Huang.

2 Problem
Given two arrays, the pattern $P = p_1 p_2 \ldots p_m$ and the text $T = t_1 t_2 \ldots t_n$, and an integer $k$ ($k \ge 1$), find all occurrences of the pattern in the text with edit distance at most $k$.

3 This algorithm improves on the alternative dynamic programming computation. First, we introduce the basic dynamic programming computation.

4 The Dynamic Programming Algorithm [S80]
In the dynamic programming approach, we construct an $(n+1) \times (m+1)$ matrix $D$, where $D_{i,j}$ is the minimum edit distance between $P(1, j)$ and any substring of $T$ that ends at $T_i$.
Example: T = gggtcta, P = gttc, k = 2. Columns are text positions $i$, rows are pattern positions $j$:
          i    0  1  2  3  4  5  6  7
                  g  g  g  t  c  t  a
    j=0        0  0  0  0  0  0  0  0
    j=1  g     1  0  0  0  1  1  1  1
    j=2  t     2  1  1  1  0  1  1  2
    j=3  t     3  2  2  2  1  1  1  2
    j=4  c     4  3  3  3  2  1  2  2
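
The matrix above can be reproduced with a few lines of Python. This is a minimal sketch, not the presenters' code: the function names `sellers_matrix` and `end_positions` are made up for illustration, and the recurrence is the standard one (diagonal step with a 0/1 mismatch cost, or a step of cost 1 from the left or upper neighbor).

```python
def sellers_matrix(text, pattern):
    """Return D, where D[i][j] is the minimum edit distance between
    pattern[:j] and any substring of text that ends at position i (1-based)."""
    n, m = len(text), len(pattern)
    D = [[0] * (m + 1) for _ in range(n + 1)]  # D[i][0] = 0: empty pattern prefix
    for j in range(1, m + 1):
        D[0][j] = j                            # no text consumed yet
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if text[i - 1] == pattern[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + mismatch,  # match or substitution
                          D[i][j - 1] + 1,
                          D[i - 1][j] + 1)
    return D


def end_positions(text, pattern, k):
    """Text positions i (1-based) where an occurrence with at most k differences ends."""
    D = sellers_matrix(text, pattern)
    m = len(pattern)
    return [i for i in range(1, len(text) + 1) if D[i][m] <= k]


print(end_positions("gggtcta", "gttc", 2))  # [4, 5, 6, 7]
```

Filling the whole matrix costs O(mn) time, which is the cost the later slides set out to avoid.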

5 We found the following occurrences (read from row $j = 4$ of the matrix $D$ on slide 4):
(1) gt, ending at $T_4$: distance 2.
(2) gtc, ending at $T_5$: distance 1.

6 (3) and (4) gtct, ending at $T_6$: distance 2 (two alignments with gttc are shown).
(5) gtcta, ending at $T_7$: distance 2.

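7 An Alternative Dynamic Programming Computation
We make heavy use of the concept of a diagonal. Diagonal $d$ consists of all entries $D_{i,j}$ with $d = i - j$.
Example with T = abc, P = bc:
          i    0  1  2  3
                  a  b  c
    j=0        0  0  0  0
    j=1  b     1  1  0  1
    j=2  c     2  2  1  0
Diagonal 0 consists of $D_{0,0}, D_{1,1}, D_{2,2}$; Diagonal 2 consists of $D_{2,0}, D_{3,1}$.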
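
To make the diagonal view concrete, the sketch below reads diagonals out of the matrix. The helper names `edit_matrix` and `diagonal` are illustrative assumptions. It also checks a property the later slides rely on implicitly: walking down a diagonal, the values never decrease and grow by at most 1 per step.

```python
def edit_matrix(text, pattern):
    """D[i][j]: min edit distance between pattern[:j] and a substring of text ending at i."""
    n, m = len(text), len(pattern)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if text[i - 1] == pattern[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + mismatch, D[i][j - 1] + 1, D[i - 1][j] + 1)
    return D


def diagonal(D, d):
    """Values D[i][j] with i - j = d, listed by increasing row j."""
    n, m = len(D) - 1, len(D[0]) - 1
    return [D[j + d][j] for j in range(m + 1) if 0 <= j + d <= n]


D = edit_matrix("abc", "bc")
print(diagonal(D, 0))  # [0, 1, 1]
print(diagonal(D, 2))  # [0, 1]
```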

8 We first have the following:
(a) If $T_i = P_j$, then $D_{i,j} = D_{i-1,j-1}$.
(b) Otherwise, $D_{i,j}$ is the minimum of $D_{i-1,j-1} + 1$ (substitution), $D_{i,j-1} + 1$ (deletion) and $D_{i-1,j} + 1$ (insertion).

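9 Consider any diagonal $d$. Let us find the largest $j$, if it exists, such that $(i, j)$ is on Diagonal $d$ ($i - j = d$) and $D_{i,j} = 0$. Let us now label all of these locations (a dot marks an entry that is not yet labeled):
          i    0  1  2  3  4  5  6  7
                  g  g  g  t  c  t  a
    j=0        0  0  0  0  0  0  0  0
    j=1  g     .  0  0  0  .  .  .  .
    j=2  t     .  .  .  .  0  .  .  .
    j=3  t     .  .  .  .  .  .  .  .
    j=4  c     .  .  .  .  .  .  .  .
The slide highlights Diagonals 0, 1 and 2.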

10 Having found the above locations $(i, j)$ where $D_{i,j} = 0$, we can next find the largest $j$, if it exists, such that $(i, j)$ is on Diagonal $d$ and $D_{i,j} = 1$. To do this, we use the following observation: each element on Diagonal $d$ can only influence elements on Diagonals $d-1$, $d$ and $d+1$.

11 Let us consider any location $(i, j)$ on Diagonal $d$. Why can $D_{i,j}$ suddenly become 1? It can only be influenced by three neighbors:
$D_{i-1,j-1}$ on Diagonal $d$ (substitution),
$D_{i,j-1}$ on Diagonal $d+1$ (deletion),
$D_{i-1,j}$ on Diagonal $d-1$ (insertion).
Thus, we conclude that we only need to consider Diagonals $d-1$, $d$ and $d+1$.

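12 Let us consider the following table. Question: what is the value of $D_{4,3}$? It cannot be 0, because we have already determined that the largest $j$ on Diagonal 1 with $D_{i,j} = 0$ is 1. It is at most 1, because $D_{3,2} \le D_{2,1} + 1 = 1$ and $T_4 = P_3$ gives $D_{4,3} = D_{3,2}$. Thus $D_{4,3} = 1$.
(Table: the partially labeled matrix for T = gggtcta, P = gttc, with Diagonal $d = 1$ highlighted and the entry $D_{4,3}$ marked "?".)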

13 Question: what is the value of $D_{5,4}$? Since $T_5 = P_4$, $D_{5,4} = D_{4,3} = 1$.
(Table: the same matrix with Diagonal $d = 1$ highlighted, $D_{4,3} = 1$ filled in and the entry $D_{5,4}$ marked "?".)

14 Based upon the above discussion, we can find all locations $(i, j)$ where $D_{i,j} = 1$ after finding all locations where $D_{i,j} = 0$. In fact, after finding all locations where $D_{i,j} = e$, we can find all locations where $D_{i,j} = e + 1$. Thus the full dynamic programming table does not have to be computed. In the following, we give the alternative dynamic programming computation formally.

15 Let $L_{d,e}$ denote the largest row $j$ such that $D_{i,j}$ is on Diagonal $d$ ($i - j = d$) and $D_{i,j} = e$. By this definition, $e$ is the minimum edit distance between $P(1, L_{d,e})$ and any substring of $T$ ending at $T_{L_{d,e}+d}$, and $P_{L_{d,e}+1} \ne T_{L_{d,e}+d+1}$.
Example (T = gggtcta, P = gttc): let $d = 3$. Then $L_{3,0} = 0$, $L_{3,1} = 3$, $L_{3,2} = 4$.
(Table: the matrix $D$ with Diagonal 3 highlighted.)
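
As a sanity check on the definition, $L_{d,e}$ can be read directly off the full matrix. The sketch below does exactly that and nothing clever: it is a brute-force reference under the assumption that the whole matrix is available, and the name `L_by_definition` is made up for illustration.

```python
def L_by_definition(text, pattern, d, e):
    """Largest row j on diagonal d (i - j = d) with D[i][j] == e, or None if there is none.
    Brute force: builds the whole matrix first."""
    n, m = len(text), len(pattern)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if text[i - 1] == pattern[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + mismatch, D[i][j - 1] + 1, D[i - 1][j] + 1)
    best = None
    for j in range(m + 1):
        i = j + d
        if 0 <= i <= n and D[i][j] == e:
            best = j  # rows are scanned in increasing order, so the last hit is the largest
    return best


print([L_by_definition("gggtcta", "gttc", 3, e) for e in range(3)])  # [0, 3, 4]
```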

16 Example: T = gggtcta, P = gttc, k = 2.
Now, $L_{3,1} = 3$. It means that we have found a substring $A = T(3,6)$ = gtct, ending at $T_{L_{d,e}+d} = T_{3+3} = T_6$, such that the edit distance between $A$ and $P(1,3)$ = gtt is 1. Moreover, $P_{L_{d,e}+1} \ne T_{L_{d,e}+d+1}$, that is, $P_{3+1} \ne T_{3+3+1}$.
          i    0  1  2  3  4  5  6  7
                  g  g  g  t  c  t  a
    j=0        0  0  0  0  0  0  0  0
    j=1  g     1  0  0  0  1  1  1  1
    j=2  t     2  1  1  1  0  1  1  2
    j=3  t     3  2  2  2  1  1  1  2
    j=4  c     4  3  3  3  2  1  2  2

17 Example: T = gggtcta, P = gttc, k = 2.
Now, $L_{1,1} = 4 = m$. It means that we have found a substring $A = T(2,5)$ = ggtc, ending at $T_{L_{d,e}+d} = T_{4+1} = T_5$, such that the edit distance between $A$ and $P(1,4)$ = gttc is 1. In other words, T(2,5) = ggtc is an occurrence of P = gttc with one difference.
(Table: the matrix $D$ with Diagonal 1 highlighted.)

18 The alternative dynamic programming computation computes the values $L_{d,e}$.

19 An Alternative Dynamic Programming Computation
First, we set the initial values. Example: T = gggtcta, P = gttc.
(Table: the matrix for this example with only the initial entries filled in.)

20 $e = 0$: for $d = 0$ to $d = n$, find the largest $j$ such that $P(1, j)$ equals $T(d+1, i)$, where $i = d + j$, and set $L_{d,0} = j$.
$d = 0$: $P_1 = T_1$, so $L_{0,0} = 1$.
(Table: the matrix with the zero entries on Diagonal 0 labeled.)

21 $e = 0$, $d = 1$: $P_1 = T_2$, so $L_{1,0} = 1$.
(Table: the matrix with the zero entries on Diagonal 1 labeled.)

22 $e = 0$, $d = 2$: $P_1 = T_3$ and $P_2 = T_4$, so $L_{2,0} = 2$.
(Table: the matrix with the zero entries on Diagonal 2 labeled.)
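
The $e = 0$ pass shown on the last three slides amounts to measuring, for each diagonal $d \ge 0$, how far the pattern matches the text starting at $T_{d+1}$. A minimal sketch with direct character comparison (the function name `L_zero` is an assumption for illustration):

```python
def L_zero(text, pattern):
    """Map each diagonal d in 0..n to L[d, 0]: the length of the longest
    prefix of pattern that matches text starting at 0-based offset d."""
    n, m = len(text), len(pattern)
    result = {}
    for d in range(n + 1):
        j = 0
        while j < m and d + j < n and pattern[j] == text[d + j]:
            j += 1
        result[d] = j
    return result


print(L_zero("gggtcta", "gttc"))
# {0: 1, 1: 1, 2: 2, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0}
```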

23 Our approach is based upon Rule 1 proposed by Professor Lee. Consider two strings $A_1$ and $A_2$, where $A_1$ consists of a prefix $P_1$ followed by a suffix $S_1$, and $A_2$ consists of a prefix $P_2$ followed by a suffix $S_2$.
If $d(A_1, A_2) \le k$ and $S_1 = S_2$, then $d(P_1, P_2) \le k$.

24 Observe the following: if $d(A_1, A_2) = k$, $S_1 = S_2$ and $x \ne y$, then $d(A_1 + S_1 + x, A_2 + S_2 + y) \le k + 1$.

25 For $e \ge 0$, we search through $d = -e$ to $d = n$. Let
$row = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)]$,
where the three terms are labeled substitution, deletion and insertion, respectively.
Find the largest $j$, if it exists, such that $P(row+1, j) = T(row+1+d, i) = T(row+1+i-j, i)$, and set $L_{d,e} = j$. If no such $j$ exists, set $L_{d,e} = row$.
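
One step of this rule can be sketched as a small function that takes the values $L_{\cdot,e-1}$ of the three neighboring diagonals and returns $L_{d,e}$. The name `next_L` and the dictionary `prev` are assumptions for illustration, and the extension is done by plain character comparison. The input values below are the $e = 0$ values of the running example T = gggtcta, P = gttc.

```python
def next_L(prev, d, text, pattern):
    """Compute L[d, e] from prev, a dict mapping diagonal -> L[diagonal, e-1]."""
    n, m = len(text), len(pattern)
    row = max(prev[d] + 1,      # same diagonal, one row further
              prev[d - 1],      # from diagonal d-1, same row
              prev[d + 1] + 1)  # from diagonal d+1, one row further
    row = min(row, m)
    # Slide along diagonal d while pattern and text agree (0-based indexing).
    while row < m and row + d < n and pattern[row] == text[row + d]:
        row += 1
    return row


L0 = {0: 1, 1: 1, 2: 2, 3: 0, 4: 0}          # L[d, 0] for d = 0..4
print(next_L(L0, 1, "gggtcta", "gttc"))      # 4
print(next_L(L0, 3, "gggtcta", "gttc"))      # 3
```

For $d = 1$ the starting row is $\max[2, 1, 3] = 3$ and one more character matches, giving 4. For $d = 3$ the starting row is $\max[1, 2, 1] = 2$ and one more character matches, giving 3.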

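26 Let $row = \max[(L_{d,e-1} + 1), (L_{d-1,e-1}), (L_{d+1,e-1} + 1)]$, with the terms labeled substitution, deletion and insertion, respectively.
(Figure: $L_{d,e-1}$ on Diagonal $d$ contributes the substitution term, $L_{d-1,e-1}$ on Diagonal $d-1$ the deletion term, and $L_{d+1,e-1}$ on Diagonal $d+1$ the insertion term.)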

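27 $d = -1$, $e = 1$:
$row = \max[(L_{-1,0} + 1), (L_{-2,0}), (L_{0,0} + 1)] = \max[0+1, 0, 1+1] = \max[1, 0, 2] = 2$, taking the boundary values $L_{-1,0} = 0$ and $L_{-2,0} = 0$.
$P(row+1, j) \ne T(row+1+d, i)$, because $P_3 \ne T_2$. Hence $L_{-1,1} = 2$.
(Table: the partially labeled matrix with Diagonal $d = -1$ highlighted.)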

28 $d = 0$, $e = 1$:
$row = \max[(L_{0,0} + 1), (L_{-1,0}), (L_{1,0} + 1)] = \max[1+1, 0, 1+1] = \max[2, 0, 2] = 2$, taking the boundary value $L_{-1,0} = 0$.
$P(row+1, j) \ne T(row+1+d, i)$, because $P_3 \ne T_3$. Hence $L_{0,1} = 2$.
(Table: the partially labeled matrix with Diagonal $d = 0$ highlighted.)

29 $d = 1$, $e = 1$:
$row = \max[(L_{1,0} + 1), (L_{0,0}), (L_{2,0} + 1)] = \max[1+1, 1, 2+1] = \max[2, 1, 3] = 3$.
$P(row+1, j) = T(row+1+d, i)$ holds for $P_4 = T_5$ = c. Hence $L_{1,1} = 4 = m$.
We find an occurrence of the pattern in the text with edit distance at most 1 that ends at $T_{d+m} = T_{1+4} = T_5$.
(Table: the partially labeled matrix with Diagonal $d = 1$ highlighted.)

30 $d = 3$, $e = 1$:
$row = \max[(L_{3,0} + 1), (L_{2,0}), (L_{4,0} + 1)] = \max[0+1, 2, 0+1] = \max[1, 2, 1] = 2$.
$P_3 = T_6$, but $P_4 \ne T_7$. Hence $L_{3,1} = 3$.
(Table: the partially labeled matrix with Diagonal $d = 3$ highlighted.)

31 $d = 3$, $e = 2$:
$row = \max[(L_{3,1} + 1), (L_{2,1}), (L_{4,1} + 1)] = \max[3+1, 3, 2+1] = \max[4, 3, 3] = 4$.
Hence $L_{3,2} = 4 = m$.
We find an occurrence of the pattern in the text with edit distance at most 2 that ends at $t_{d+m} = t_{3+4} = t_7$.
(Table: the partially labeled matrix with Diagonal $d = 3$ highlighted.)

32 An Alternative Dynamic Programming Computation
Initialization:
    for all $d$, $0 \le d \le n$: $L_{d,-1} = -1$
    for all $d$, $-(k+1) \le d \le -1$: $L_{d,|d|-1} = |d|-1$, $L_{d,|d|-2} = |d|-2$
    for all $e$, $-1 \le e \le k$: $L_{n+1,e} = -1$
For e = 0 to k do
    For d = -e to n do
        row = max[($L_{d,e-1}$ + 1), ($L_{d-1,e-1}$), ($L_{d+1,e-1}$ + 1)]
        row = min(row, m)
        while row < m and row + d ... (the rest of this slide is cut off in the transcript)
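
A runnable sketch of the whole diagonal computation follows. Several parts are assumptions rather than text from the slides: the loop body after the cut-off point (compare characters while they agree, then store the row and report when it reaches $m$), the boundary values $L_{d,|d|-1} = |d|-1$ and $L_{d,|d|-2} = |d|-2$ for negative diagonals, the extra clamp `n - d` that keeps each diagonal inside the matrix, and the name `diagonal_search`. Extension is done by direct comparison here, so this is the version without the suffix tree.

```python
def diagonal_search(text, pattern, k):
    """Return {end position (1-based): smallest e <= k} for every text position
    where an occurrence of pattern with at most k differences ends."""
    n, m = len(text), len(pattern)
    L = {}
    # Boundary values (see the assumptions stated above).
    for d in range(0, n + 1):
        L[d, -1] = -1
    for d in range(-(k + 1), 0):
        L[d, abs(d) - 1] = abs(d) - 1
        L[d, abs(d) - 2] = abs(d) - 2
    for e in range(-1, k + 1):
        L[n + 1, e] = -1

    ends = {}
    for e in range(k + 1):
        for d in range(-e, n + 1):
            row = max(L[d, e - 1] + 1, L[d - 1, e - 1], L[d + 1, e - 1] + 1)
            row = min(row, m, n - d)  # stay inside the (n+1) x (m+1) matrix
            while row < m and row + d < n and pattern[row] == text[row + d]:
                row += 1
            L[d, e] = row
            if row == m and 1 <= d + m <= n:
                ends.setdefault(d + m, e)  # keep the first (smallest) e seen
    return ends


print(sorted(diagonal_search("gggtcta", "gttc", 2).items()))
# [(4, 2), (5, 1), (6, 2), (7, 2)]
```

The reported end positions and distances agree with the last row of the full matrix for this example (2, 1, 2, 2 at columns 4 to 7).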

33 How this algorithm differs
In the alternative dynamic programming computation, we must search for the $j$ such that $P(row+1, j) = T(row+1+d, i) = T(row+1+i-j, i)$. Essentially, we are looking for matching substrings $S_1$ and $S_2$ in $T$ and $P$, respectively.
(Figure: the substrings $S_1$ in $T$ and $S_2$ in $P$.)
This paper uses the LCA (lowest common ancestor) to speed up this search.

34 Algorithm
This algorithm has two steps:
- Concatenate the text and the pattern into one string $t_1, \ldots, t_n, p_1, \ldots, p_m$, and compute the suffix tree of this string.
- Find all occurrences of the pattern in the text with edit distance at most $k$.

35 Example: T = ABCDEA, P = DDBE, S = ABCDEADDBE.
The suffix tree of a string of length $n$ can be constructed in $O(n)$ time (Weiner, 1973; McCreight, 1976; Ukkonen, 1995).

36 The lowest common ancestor of two leaf nodes can be found in $O(1)$ time after $O(n)$ preprocessing of the tree (Harel and Tarjan, 1984).

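37 To find such an $S$, if it exists, we concatenate $T$ and $P$ to form a new string. On the suffix tree of this string, the suffixes starting at $S_1$ and $S_2$ have a common ancestor that spells out $S$.
(Figure: $T$ followed by $P$, with $S_1$ marked in $T$ and $S_2$ marked in $P$.)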
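
The extension query can be illustrated in code, with one substitution stated plainly: instead of a suffix tree with constant-time LCA, the sketch below uses a suffix array, an LCP array and a sparse table for range minima, which answers the same "longest common extension" question in $O(1)$ per query. The suffix array is built naively by sorting suffixes, so the preprocessing here is not linear time. The separator character `#` (assumed to occur in neither string) and the name `make_extension_query` are illustrative assumptions.

```python
def make_extension_query(text, pattern, sep="#"):
    """Return extend(row, d): the length of the longest common prefix of
    pattern[row:] and text[row + d:] (0-based offsets)."""
    s = text + sep + pattern
    n_s, n, m = len(s), len(text), len(pattern)
    offset = n + len(sep)                            # where the pattern starts in s
    sa = sorted(range(n_s), key=lambda i: s[i:])     # naive suffix array
    rank = [0] * n_s
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai's algorithm: lcp[r] = common prefix length of suffixes sa[r-1] and sa[r].
    lcp = [0] * n_s
    h = 0
    for i in range(n_s):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n_s and j + h < n_s and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    # Sparse table: table[l][i] = min(lcp[i .. i + 2**l - 1]).
    table = [lcp]
    span = 1
    while 2 * span <= n_s:
        prev = table[-1]
        table.append([min(prev[i], prev[i + span]) for i in range(n_s - 2 * span + 1)])
        span *= 2

    def extend(row, d):
        a, b = row + d, offset + row
        if row >= m or a >= n or a < 0:
            return 0
        lo, hi = sorted((rank[a], rank[b]))
        lo += 1                                      # minimum over lcp[lo+1 .. hi]
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return extend


extend = make_extension_query("gggtcta", "gttc")
print(extend(2, 3))  # 1
```

In the running example, diagonal 3 starts at row 2 for $e = 1$, and the query returns 1, so $L_{3,1} = 2 + 1 = 3$, matching the value obtained earlier by direct comparison.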

38 If we want to compute $L_{3,1}$, we use $L_{2,0}$, $L_{3,0}$ and $L_{4,0}$ to determine the row value (row = 2). In this paper, we find that the length of $LCA_{2,3}$ is 2, so $q = 2$ and $L_{3,1} = row + 2 = 4$.
(Table: the partially labeled matrix for this example with Diagonal $d = 3$ highlighted, together with the matching substrings $S_1$ and $S_2$.)

39 S = gggtctacgttac, where gggtctac is the text and gttac is the pattern.

40 Time Complexity
The alternative dynamic programming computation takes $O(mn)$ time.
The suffix tree has $O(n)$ nodes, and each LCA query is answered in $O(1)$ time.
For each of the $n+k+1$ diagonals, we evaluate $k+1$ values $L_{d,e}$.
Therefore this algorithm takes $O(nk)$ time.

41 References
[AHU-74] A. V. AHO, J. W. HOPCROFT, AND J. D. ULLMAN, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
[AILSV-88] A. APOSTOLICO, C. ILIOPOULOS, G. M. LANDAU, B. SCHIEBER, AND U. VISHKIN, Parallel construction of a suffix tree with applications, Algorithmica 3 (1988), 347-365.
[BM-77] R. S. BOYER AND J. S. MOORE, A fast string searching algorithm, Comm. ACM 20 (1977), 762-772.
[CS-85] M. T. CHEN AND J. SEIFERAS, Efficient and elegant subword tree construction, in Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, Eds.), NATO ASI Series F: Computer and System Sciences Vol. 12, pp. 97-107, Springer-Verlag, New York/Berlin, 1985.
[G-84] Z. GALIL, Optimal parallel algorithms for string matching, in Proceedings, 16th ACM Symposium on Theory of Computing, 1984, pp. 240-248; Inform. and Control 67 (1985), 144-157.
[GG-86] Z. GALIL AND R. GIANCARLO, Improved string matching with k mismatches, SIGACT News 17, No. 4 (1986), 52-54.
[GG-87] Z. GALIL AND R. GIANCARLO, Parallel string matching with k mismatches, Theoret. Comput. Sci. 51 (1987), 341-348.
[GS-83] Z. GALIL AND J. I. SEIFERAS, Time-space-optimal string matching, J. Comput. System Sci. 26 (1983), 280-294.
[HT-84] D. HAREL AND R. E. TARJAN, Fast algorithms for finding nearest common ancestors, SIAM J. Comput. 13, No. 2 (1984), 338-355.
[KMP-77] D. E. KNUTH, J. H. MORRIS, AND V. R. PRATT, Fast pattern matching in strings, SIAM J. Comput. 6 (1977), 323-350.
[KR-87] R. KARP AND M. O. RABIN, Efficient randomized pattern-matching algorithms, IBM J. Res. Develop. 31, No. 2 (1987), 249-260.

42 References (continued)
[LSV-87] G. M. LANDAU, B. SCHIEBER, AND U. VISHKIN, Parallel construction of a suffix tree, in Proceedings 14th ICALP, Lecture Notes in Computer Science Vol. 267, pp. 314-325, Springer-Verlag, New York/Berlin, 1987.
[LV-86a] G. M. LANDAU AND U. VISHKIN, Introducing efficient parallelism into approximate string matching, in Proc. 18th ACM Symposium on Theory of Computing, 1986, pp. 220-230.
[LV-86b] G. M. LANDAU AND U. VISHKIN, Efficient string matching with k mismatches, Theoret. Comput. Sci. 43 (1986), 239-249.
[LV-88] G. M. LANDAU AND U. VISHKIN, Fast string matching with k differences, J. Comput. System Sci. 37, No. 1 (1988), 63-78.
[S80] P. H. SELLERS, The theory and computation of evolutionary distances: Pattern recognition, Journal of Algorithms, Vol. 20, No. 1, 1980, pp. 359-373.
[SK-83] D. SANKOFF AND J. B. KRUSKAL (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, 1983.
[SV-88] B. SCHIEBER AND U. VISHKIN, Parallel computation of lowest common ancestor in trees, SIAM J. Comput., in press.
[U-83] E. UKKONEN, On approximate string matching, in Proceedings Int. Conf. Found. Comput. Theory, Lecture Notes in Computer Science Vol. 158, pp. 487-495, Springer-Verlag, Berlin/New York, 1983.
[U-85] E. UKKONEN, Finding approximate patterns in strings, J. Algorithms 6 (1985), 132-137.
[V-83] U. VISHKIN, Synchronous parallel computation: A survey, TR-71, Department of Computer Science, Courant Institute, NYU, 1983.
[V-85] U. VISHKIN, Optimal parallel pattern matching in strings, in Proceedings 12th ICALP, Lecture Notes in Computer Science Vol. 194, pp. 497-508, Springer-Verlag, New York/Berlin; Inform. and Control 67 (1985), 91-113.

43 Thank you

