String Matching with Errors

1 String Matching with Errors
Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.
Speaker: C. C. Lin
Adviser: R. C. T. Lee

2 In the following, we will present a problem related to the notion of edit distance. First, let us introduce edit distance.

3 In edit distance, there are three types of differences between two strings X and Y, each with cost 1:
Insertion: a symbol of X is missing in Y at the corresponding position. X: G C A, Y: G – A
Substitution: the symbols at corresponding positions are distinct. X: A C C, Y: T C C
Deletion: a symbol of Y is missing in X at the corresponding position. X: A – T, Y: A G T

4 Given two strings X and Y, the edit distance between X and Y is the minimum number of insertions, deletions and substitutions needed to transform Y to X.

5 String X: ATGAATCTTACCGCCTCG
String Y: ATGAGGCTCTGGCCCCTG
Transformation (from string Y to string X):
String X: A T G A A – – T C T T A C C G C C T C G
String Y: A T G A G G C T C T G G C C – C C T – G
EDIT(X, Y)=7 (2 insertions, 2 deletions and 3 substitutions).
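The operation counts for the alignment above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the function name `count_operations` and the ASCII hyphen used as the gap symbol are assumptions made for illustration, and operations are named from the point of view of transforming Y into X.

```python
def count_operations(x_aligned, y_aligned, gap="-"):
    """Count the edit operations in a gapped alignment (transforming Y into X)."""
    assert len(x_aligned) == len(y_aligned), "aligned strings must have equal length"
    insertions = deletions = substitutions = 0
    for x_sym, y_sym in zip(x_aligned, y_aligned):
        if y_sym == gap:
            insertions += 1       # X has a symbol that must be inserted into Y
        elif x_sym == gap:
            deletions += 1        # Y has a symbol that must be deleted
        elif x_sym != y_sym:
            substitutions += 1    # distinct symbols in the same column
    return insertions, deletions, substitutions


# The alignment from the slide, with "-" standing in for the gap symbol.
x_aligned = "ATGAA--TCTTACCGCCTCG"
y_aligned = "ATGAGGCTCTGGCC-CCT-G"
counts = count_operations(x_aligned, y_aligned)
print(counts, sum(counts))  # (2, 2, 3) 7
```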

6 Next, we will introduce a dynamic programming method to compute the edit distance between two strings X and Y.

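7 Dynamic Programming for Edit Distance: let EDIT[i, j] be the edit distance between the first i symbols of Y and the first j symbols of X, with EDIT[i, 0]=i and EDIT[0, j]=j. For i, j ≥ 1, EDIT[i, j] is the minimum of:
EDIT[i-1, j] + 1 (Delete)
EDIT[i, j-1] + 1 (Insert)
EDIT[i-1, j-1] + c, where c=0 if the i-th symbol of Y equals the j-th symbol of X and c=1 otherwise (Substitute)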

8 Given X=abcabba Y=cbabac. Row 0 and column 0 are initialized:

      a b c a b b a
    0 1 2 3 4 5 6 7
  c 1
  b 2
  a 3
  b 4
  a 5
  c 6

9 Given X=abcabba Y=cbabac. Row 1 (c) so far: 1 1

10 Given X=abcabba Y=cbabac. Row 1 (c) so far: 1 1 2

11 Given X=abcabba Y=cbabac. Row 1 (c) so far: 1 1 2 2

12 Given X=abcabba Y=cbabac. Row 1 (c) so far: 1 1 2 2 3

13 Given X=abcabba Y=cbabac. The completed table:

      a b c a b b a
    0 1 2 3 4 5 6 7
  c 1 1 2 2 3 4 5 6
  b 2 2 1 2 3 3 4 5
  a 3 2 2 2 2 3 4 4
  b 4 3 2 3 3 2 3 4
  a 5 4 3 3 3 3 3 3
  c 6 5 4 3 4 4 4 4
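The table above can be reproduced with a few lines of code. This is a minimal Python sketch under the layout used on the slides (rows indexed by Y, columns by X, unit cost for every operation); the function name `edit_table` is an assumption made for illustration.

```python
def edit_table(x, y):
    """Fill the edit distance table with rows indexed by y and columns by x."""
    rows, cols = len(y) + 1, len(x) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j                      # boundary: EDIT[0, j] = j
    for i in range(rows):
        table[i][0] = i                      # boundary: EDIT[i, 0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # move down one row
                table[i][j - 1] + 1,         # move right one column
                table[i - 1][j - 1] + cost,  # diagonal: match or substitute
            )
    return table


table = edit_table("abcabba", "cbabac")
for row in table:
    print("".join(str(value) for value in row))
print("EDIT(X, Y) =", table[-1][-1])  # EDIT(X, Y) = 4
```

The printed rows run from 01234567 to 65434444, the same seven rows as on the slide.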

14 Given X=abcabba Y=cbabac, EDIT(X, Y)=4 is the bottom-right entry of the table. Tracing back from that entry builds an alignment from right to left, one column per slide:

      a b c a b b a
    0 1 2 3 4 5 6 7
  c 1 1 2 2 3 4 5 6
  b 2 2 1 2 3 3 4 5
  a 3 2 2 2 2 3 4 4
  b 4 3 2 3 3 2 3 4
  a 5 4 3 3 3 3 3 3
  c 6 5 4 3 4 4 4 4

X: a
Y: c
Substitute

15 Same table, EDIT(X, Y)=4.
X: ba
Y: ac
Substitute

16 Same table, EDIT(X, Y)=4.
X: bba
Y: bac
Match

17 Same table, EDIT(X, Y)=4.
X: abba
Y: abac
Match

18 Same table, EDIT(X, Y)=4.
X: cabba
Y: –abac
Insert

19 Same table, EDIT(X, Y)=4.
X: bcabba
Y: b–abac
Match

20 Same table, EDIT(X, Y)=4.
X: abcabba
Y: cb–abac
Substitute

21 Same table, EDIT(X, Y)=4. A different path through the table gives another optimal alignment:
X: abcabba–
Y: cb–ab–ac
Substitute, Match, Insert, Match, Match, Insert, Match, Delete

22 Same table, EDIT(X, Y)=4. A third optimal alignment:
X: abcabba–
Y: cb–a–bac
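The traceback shown on the slides can be sketched in code as follows. This is an illustrative Python sketch, not the presenter's implementation: the function name `align`, the ASCII hyphen as gap symbol, and the tie-breaking order (diagonal first, then left, then up) are all assumptions. A different tie-breaking order yields other optimal alignments, such as the ones on slides 21 and 22.

```python
def align(x, y, gap="-"):
    """Return one optimal alignment of x and y by tracing back through the table."""
    rows, cols = len(y) + 1, len(x) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        table[i][0] = i
        for j in range(1, cols):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + cost)

    i, j = len(y), len(x)
    top, bottom = [], []                      # alignment columns, built right to left
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + (y[i - 1] != x[j - 1]):
            top.append(x[j - 1]); bottom.append(y[i - 1])   # match or substitute
            i, j = i - 1, j - 1
        elif j > 0 and table[i][j] == table[i][j - 1] + 1:
            top.append(x[j - 1]); bottom.append(gap)        # symbol of x against a gap
            j -= 1
        else:
            top.append(gap); bottom.append(y[i - 1])        # symbol of y against a gap
            i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


x_row, y_row = align("abcabba", "cbabac")
print(x_row)  # abcabba
print(y_row)  # cb-abac
```

With this tie-breaking order the result is the alignment of slide 20.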

23 The above algorithm computes the edit distance in O(mn) time and O(mn) space, where n and m are the lengths of the text and the pattern (here X and Y), respectively.
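The O(mn) space bound applies when the whole table is kept, which the traceback needs. If only the distance value is wanted, each row depends only on the row before it, so two rows suffice. The sketch below illustrates this well-known variation; it is not taken from the slides, and the function name `edit_distance` is an assumption.

```python
def edit_distance(x, y):
    """Edit distance using two rows of the table instead of the full table."""
    previous = list(range(len(x) + 1))        # row 0: EDIT[0, j] = j
    for i, y_sym in enumerate(y, start=1):
        current = [i]                          # column 0: EDIT[i, 0] = i
        for j, x_sym in enumerate(x, start=1):
            cost = 0 if x_sym == y_sym else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current                     # the older row is no longer needed
    return previous[-1]


print(edit_distance("abcabba", "cbabac"))  # 4
```

The running time is still O(mn); only the memory use drops to O(n), at the price of losing the information needed to recover an alignment.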

24 Next, we introduce the main topic: the string matching with errors problem.

25 The definition of the problem: Given a pattern P of length m and a text T of length n, find a substring S of T such that EDIT(S, P) is minimal.
Given: T=abcabba P=cbabac
Find: S=cabba, EDIT(S, P)=3
P: cbabac
S: c–abba
For comparison, another substring of T, K=bcabb, has EDIT(K, P)=4:
P: –cbabac
K: bc–ab–b
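The definition can be checked directly by trying every substring of T. The following Python sketch does exactly that; it is a brute-force illustration of the problem statement, not the dynamic programming method of the paper, and the names `edit_distance` and `best_substrings` are assumptions.

```python
def edit_distance(x, y):
    """Unit-cost edit distance between x and y (row-by-row dynamic programming)."""
    previous = list(range(len(x) + 1))
    for i, y_sym in enumerate(y, start=1):
        current = [i]
        for j, x_sym in enumerate(x, start=1):
            cost = 0 if x_sym == y_sym else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def best_substrings(text, pattern):
    """Return the minimal EDIT(S, P) over all non-empty substrings S, and those S."""
    best, winners = None, set()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            distance = edit_distance(candidate, pattern)
            if best is None or distance < best:
                best, winners = distance, {candidate}
            elif distance == best:
                winners.add(candidate)
    return best, winners


best, winners = best_substrings("abcabba", "cbabac")
print(best)                 # 3
print("cabba" in winners)   # True
```

Enumerating all substrings is much slower than a single table computation, which is the motivation for the dynamic programming formulation that follows.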

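26 Dynamic Programming for the String Matching with Errors Problem: let SE[i, j] be the minimum edit distance between the first i symbols of P and any substring of T ending at position j, with SE[i, 0]=i and SE[0, j]=0. For i, j ≥ 1, SE[i, j] is the minimum of SE[i-1, j] + 1, SE[i, j-1] + 1 and SE[i-1, j-1] + c, where c=0 if the i-th symbol of P equals the j-th symbol of T and c=1 otherwise.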

27 The only difference between EDIT[i, j] and SE[i, j] is the boundary condition: EDIT[0, j]=j for the edit distance problem, whereas SE[0, j]=0 for the string matching with errors problem. Otherwise, the dynamic programming approach is the same as for the edit distance problem.

28 In the edit distance problem, we have EDIT[0, j]=j. In the string matching with errors problem, we set SE[0, j]=0, so that a match may start at any position of T at no cost.
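In code, the change of boundary condition is a one-line difference. The Python sketch below fills the SE table for the running example and reads the answer off the last row; the function name `se_table` is an assumption made for illustration.

```python
def se_table(text, pattern):
    """Fill the SE table: rows indexed by the pattern, columns by the text."""
    rows, cols = len(pattern) + 1, len(text) + 1
    table = [[0] * cols for _ in range(rows)]   # row 0 stays 0: SE[0, j] = 0
    for i in range(1, rows):
        table[i][0] = i                         # column 0: SE[i, 0] = i
        for j in range(1, cols):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + cost)
    return table


table = se_table("abcabba", "cbabac")
last_row = table[-1]
best = min(last_row)
end_positions = [j for j, value in enumerate(last_row) if value == best]
print(last_row)        # [6, 5, 4, 3, 4, 3, 3, 3]
print(best)            # 3
print(end_positions)   # [3, 5, 6, 7]
```

Each entry of `end_positions` is a position in T where some substring at distance 3 from P ends.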

29 T=abcabba P=cbabac

      a b c a b b a
    0 0 0 0 0 0 0 0
  c 1 1 1 0 1 1 1 1
  b 2 2 1 1 1 1 1 2
  a 3 2 2 2 1 2 2 1
  b 4 3 2 3 2 1 2 2
  a 5 4 3 3 3 2 2 2
  c 6 5 4 3 4 3 3 3

The traced path starts at the bottom row, at an entry with value 3, and ends at the top row, where SE[0, j]=0. This shows that there exists a substring S in T such that EDIT(S, P)=3.

30 We find the lowest value in the last row and trace back from that entry. Since several entries and several paths may attain this value, the output may consist of several substrings.

31 T=abcabba P=cbabac

      a b c a b b a
    0 0 0 0 0 0 0 0
  c 1 1 1 0 1 1 1 1
  b 2 2 1 1 1 1 1 2
  a 3 2 2 2 1 2 2 1
  b 4 3 2 3 2 1 2 2
  a 5 4 3 3 3 2 2 2
  c 6 5 4 3 4 3 3 3

S=cabba
T: abc–abba
P:   cbabac

32 T=abcabba P=cbabac. The edit distance table for S=cabba and P=cbabac gives EDIT(S, P)=3:

      c a b b a
    0 1 2 3 4 5
  c 1 0 1 2 3 4
  b 2 1 1 1 2 3
  a 3 2 1 2 2 2
  b 4 3 2 1 2 3
  a 5 4 3 2 2 2
  c 6 5 4 3 3 3

S: c–abba
P: cbabac

33 T=abcabba P=cbabac. Same SE table as on slide 31.
S: cabba–
P: cbabac
EDIT(S, P)=3

34 T=abcabba P=cbabac. Same SE table as on slide 31.
S: c–ab––
P: cbabac
EDIT(S, P)=3

35 T=abcabba P=cbabac. Same SE table as on slide 31.
S: ––ab–c
P: cbabac
EDIT(S, P)=3
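The substrings shown on these slides can be recovered by tracing back from every minimal entry of the last row until row 0 is reached. The Python sketch below does this for the running example; the function name `matches`, the returned dictionary layout, and the tie-breaking order (diagonal, then left, then up) are assumptions made for illustration, and each end position yields only one of possibly several paths.

```python
def matches(text, pattern):
    """Map each best end position in text to one substring found by traceback."""
    rows, cols = len(pattern) + 1, len(text) + 1
    table = [[0] * cols for _ in range(rows)]   # SE[0, j] = 0
    for i in range(1, rows):
        table[i][0] = i                         # SE[i, 0] = i
        for j in range(1, cols):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + cost)

    best = min(table[-1])
    found = {}
    for end in range(cols):
        if table[-1][end] != best:
            continue
        i, j = len(pattern), end
        while i > 0:                            # stop as soon as row 0 is reached
            if j > 0 and table[i][j] == table[i - 1][j - 1] + (pattern[i - 1] != text[j - 1]):
                i, j = i - 1, j - 1             # match or substitute
            elif j > 0 and table[i][j] == table[i][j - 1] + 1:
                j -= 1                          # text symbol against a gap
            else:
                i -= 1                          # pattern symbol against a gap
        found[end] = text[j:end]                # the substring starts where the path hits row 0
    return best, found


best, found = matches("abcabba", "cbabac")
print(best)    # 3
print(found)   # {3: 'abc', 5: 'cab', 6: 'cabb', 7: 'cabba'}
```

The substrings abc, cab and cabba correspond to the alignments on slides 35, 34 and 31.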

36 References
For edit distance computation: [NW70] Needleman, S. B., and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, Vol. 48, 1970, pp. 443-453.
For string matching with errors: [S80] Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.

37 Thank you
