Published by Nathan Gallegos. Modified over 3 years ago.

1
String Matching with Errors
The Theory and Computation of Evolutionary Distances: Pattern Recognition, Sellers, P. H., Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.
Speaker: C. C. Lin
Adviser: R. C. T. Lee

2
We will present a problem related to the notion of edit distance. First, let us introduce edit distance.

3
In edit distance, there are three types of differences between two strings X and Y, each with cost 1:
Insertion: a symbol of X is missing in Y at the corresponding position. Example: X: G C A, Y: G A
Substitution: the symbols at corresponding positions are distinct. Example: X: A C C, Y: T C C
Deletion: a symbol of Y is missing in X at the corresponding position. Example: X: A T, Y: A G T

4
Given two strings X and Y, the edit distance between X and Y is the minimum number of insertions, deletions and substitutions needed to transform Y into X.

5
String X: ATGAATCTTACCGCCTCG
String Y: ATGAGGCTCTGGCCCCTG
Transformation (from string Y to string X):
String X: A T G A A – – T C T T A C C G C C T C G
String Y: A T G A G G C T C T G G C C – C C T – G
EDIT(X, Y) = 7 (2 insertions, 2 deletions and 3 substitutions).

6
Next, we introduce a dynamic programming method to compute the edit distance between two strings X and Y.

7
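Dynamic Programming for Edit Distance:

Let X = x_1 ... x_n index the columns and Y = y_1 ... y_m index the rows, with EDIT[0, j] = j and EDIT[i, 0] = i.

\[
\mathrm{EDIT}[i, j] = \min
\begin{cases}
\mathrm{EDIT}[i-1, j] + 1 & \text{(Delete)} \\
\mathrm{EDIT}[i, j-1] + 1 & \text{(Insert)} \\
\mathrm{EDIT}[i-1, j-1] + \delta(x_j, y_i) & \text{(Substitute)}
\end{cases}
\]

where \(\delta(x_j, y_i) = 0\) if \(x_j = y_i\) and 1 otherwise.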
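
The dynamic programming method for edit distance can be sketched in Python as follows. This is a minimal sketch assuming unit cost for each of the three operations; the function name `edit_distance_table` is an illustrative choice and does not come from the slides. Rows are indexed by Y and columns by X, so the top row holds EDIT[0, j] = j.

```python
def edit_distance_table(x, y):
    """Return the (len(y)+1) x (len(x)+1) table EDIT for strings x and y."""
    n, m = len(x), len(y)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        edit[0][j] = j                      # top row: EDIT[0, j] = j
    for i in range(1, m + 1):
        edit[i][0] = i                      # first column: EDIT[i, 0] = i
        for j in range(1, n + 1):
            cost = 0 if x[j - 1] == y[i - 1] else 1
            edit[i][j] = min(
                edit[i - 1][j] + 1,         # delete
                edit[i][j - 1] + 1,         # insert
                edit[i - 1][j - 1] + cost,  # match (cost 0) or substitute (cost 1)
            )
    return edit


table = edit_distance_table("abcabba", "cbabac")
print(table[-1][-1])  # prints 4, the bottom-right cell of the table
```

The bottom-right cell is the edit distance; the whole table is returned so that it can be compared with the tables on the following slides.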

8
Given X = abcabba, Y = cbabac

       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1
 b  2
 a  3
 b  4
 a  5
 c  6

9
Given X = abcabba, Y = cbabac

       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1
 b  2
 a  3
 b  4
 a  5
 c  6

10
Given X = abcabba, Y = cbabac

       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1  2
 b  2
 a  3
 b  4
 a  5
 c  6

11
Given X = abcabba, Y = cbabac

       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1  2  2
 b  2
 a  3
 b  4
 a  5
 c  6

12
Given X = abcabba, Y = cbabac

       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1  2  2  3
 b  2
 a  3
 b  4
 a  5
 c  6

13
Given X = abcabba, Y = cbabac

       a  b  c  a  b  b  a
    0  1  2  3  4  5  6  7
 c  1  1  2  2  3  4  5  6
 b  2  2  1  2  3  3  4  5
 a  3  2  2  2  2  3  4  4
 b  4  3  2  3  3  2  3  4
 a  5  4  3  3  3  3  3  3
 c  6  5  4  3  4  4  4  4

14
Given X = abcabba, Y = cbabac, EDIT(X, Y) = 4 (DP table as on slide 13).
Traceback step: Substitute
X: a
Y: c

15
Given X = abcabba, Y = cbabac, EDIT(X, Y) = 4 (DP table as on slide 13).
Traceback step: Substitute
X: b a
Y: a c

16
Given X = abcabba, Y = cbabac, EDIT(X, Y) = 4 (DP table as on slide 13).
Traceback step: Match
X: b b a
Y: b a c

17
Given X = abcabba, Y = cbabac, EDIT(X, Y) = 4 (DP table as on slide 13).
Traceback step: Match
X: a b b a
Y: a b a c

18
Given X = abcabba, Y = cbabac, EDIT(X, Y) = 4 (DP table as on slide 13).
Traceback step: Insert
X: c a b b a
Y: – a b a c

19
Given X = abcabba, Y = cbabac, EDIT(X, Y) = 4 (DP table as on slide 13).
Traceback step: Match
X: b c a b b a
Y: b – a b a c

20
Given X = abcabba, Y = cbabac, EDIT(X, Y) = 4 (DP table as on slide 13).
Traceback step: Substitute
X: a b c a b b a
Y: c b – a b a c

21
Given X = abcabba, Y = cbabac, EDIT(X, Y) = 4 (DP table as on slide 13).
An optimal alignment and its operations, column by column:
X: a b c a b b a –
Y: c b – a b – a c
Substitute, Match, Insert, Match, Match, Insert, Match, Delete

22
Given X = abcabba, Y = cbabac, EDIT(X, Y) = 4 (DP table as on slide 13).
Another optimal alignment:
X: a b c a b b a –
Y: c b – a – b a c
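
The traceback through the table can be sketched in Python as follows. This is a sketch under stated assumptions: the function name `align` is illustrative, and ties are broken in the order diagonal, then Insert, then Delete, which is one choice among several. A different tie-breaking order may return a different alignment of the same cost.

```python
def align(x, y):
    """Return (cost, aligned_x, aligned_y, operations) for one optimal alignment."""
    n, m = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        d[i][0] = i
        for j in range(1, n + 1):
            cost = 0 if x[j - 1] == y[i - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk back from the bottom-right cell to the top-left cell.
    i, j = m, n
    top, bottom, ops = [], [], []
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and x[j - 1] == y[i - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            ops.append("Match" if cost == 0 else "Substitute")
            top.append(x[j - 1])
            bottom.append(y[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append("Insert")            # symbol of X aligned with a gap in Y
            top.append(x[j - 1])
            bottom.append("-")
            j -= 1
        else:
            ops.append("Delete")            # symbol of Y aligned with a gap in X
            top.append("-")
            bottom.append(y[i - 1])
            i -= 1
    return d[m][n], "".join(reversed(top)), "".join(reversed(bottom)), ops[::-1]


cost, ax, ay, ops = align("abcabba", "cbabac")
print(cost)
print(ax)
print(ay)
```

With this tie-breaking order the sketch recovers the alignment built step by step on slides 14 to 20.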

23
The above algorithm computes the edit distance in O(mn) time and O(mn) space, where n and m are the lengths of the text and the pattern, respectively.
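
If only the distance is needed and not the alignment, the space can be reduced by keeping two rows of the table at a time. This is a standard variation and is not taken from the slides; the function name `edit_distance_two_rows` is illustrative. The running time stays O(mn), while the working space drops to the length of one row.

```python
def edit_distance_two_rows(x, y):
    """Edit distance using only the previous and current table rows."""
    prev = list(range(len(x) + 1))          # row 0: EDIT[0, j] = j
    for i, yc in enumerate(y, start=1):
        curr = [i]                          # EDIT[i, 0] = i
        for j, xc in enumerate(x, start=1):
            cost = 0 if xc == yc else 1
            curr.append(min(prev[j] + 1,         # delete
                            curr[j - 1] + 1,     # insert
                            prev[j - 1] + cost)) # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance_two_rows("abcabba", "cbabac"))  # prints 4
```

The traceback shown on the earlier slides needs the full table, so this variant trades the alignment for the smaller memory footprint.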

24
Next, we introduce the string matching with errors problem.

25
The definition of the problem: Given a pattern P of length m and a text T of length n, find a substring S of T such that EDIT(S, P) is minimal.

Given: T = abcabba, P = cbabac
Find: S = cabba, EDIT(S, P) = 3
P: c b a b a c
S: c – a b b a

For comparison, another substring of T: K = bcabb, EDIT(K, P) = 4
P: – c b a b a c
K: b c – a b – b
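
The definition can be checked directly by brute force: compute EDIT(S, P) for every substring S of T and keep the minimum. The sketch below does exactly that and is only meant to illustrate the problem statement, not the method of the paper; the function names are illustrative, and the approach is far slower than the dynamic programming method presented next.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between strings a and b."""
    prev = list(range(len(a) + 1))
    for i, bc in enumerate(b, start=1):
        curr = [i]
        for j, ac in enumerate(a, start=1):
            cost = 0 if ac == bc else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def best_substrings_brute_force(text, pattern):
    """Return (minimal distance, sorted list of substrings that achieve it)."""
    candidates = {text[s:e]
                  for s in range(len(text) + 1)
                  for e in range(s, len(text) + 1)}
    scored = {s: edit_distance(s, pattern) for s in candidates}
    best = min(scored.values())
    return best, sorted(s for s, dist in scored.items() if dist == best)


best, substrings = best_substrings_brute_force("abcabba", "cbabac")
print(best, substrings)
```

For the example above the minimum is 3 and S = cabba is among the substrings that achieve it, while K = bcabb has distance 4.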

26
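Dynamic Programming for the String Matching with Errors Problem:

Let T = t_1 ... t_n index the columns and P = p_1 ... p_m index the rows, with SE[0, j] = 0 and SE[i, 0] = i.

\[
\mathrm{SE}[i, j] = \min
\begin{cases}
\mathrm{SE}[i-1, j] + 1 \\
\mathrm{SE}[i, j-1] + 1 \\
\mathrm{SE}[i-1, j-1] + \delta(t_j, p_i)
\end{cases}
\]

where \(\delta(t_j, p_i) = 0\) if \(t_j = p_i\) and 1 otherwise.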

27
The two problems use the same dynamic programming recurrence and differ only in the initialization of the top row: EDIT[0, j] = j for the edit distance problem, and SE[0, j] = 0 for the string matching with errors problem.

28
In the edit distance problem, we have EDIT[0, j] = j. In the string matching with errors problem, we set SE[0, j] = 0.
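
The effect of the changed initialization can be sketched in Python as follows. This is a minimal sketch with an illustrative function name, `se_last_row`; it assumes unit costs and returns only the last row of the SE table, since that row holds the best distance for each possible end position of a match in T.

```python
def se_last_row(text, pattern):
    """Last row of the SE table: best distance of pattern to a substring ending at j."""
    prev = [0] * (len(text) + 1)            # SE[0, j] = 0: a match may start anywhere
    for i, pc in enumerate(pattern, start=1):
        curr = [i]                          # SE[i, 0] = i
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev


row = se_last_row("abcabba", "cbabac")
print(row)       # prints [6, 5, 4, 3, 4, 3, 3, 3]
print(min(row))  # prints 3
```

Setting the top row to zero means that skipping a prefix of T costs nothing, which is what lets the match start at any position.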

29
T = abcabba, P = cbabac

       a  b  c  a  b  b  a
    0  0  0  0  0  0  0  0
 c  1  1  1  0  1  1  1  1
 b  2  2  1  1  1  1  1  2
 a  3  2  2  2  1  2  2  1
 b  4  3  2  3  2  1  2  2
 a  5  4  3  3  3  2  2  2
 c  6  5  4  3  4  3  3  3

The traceback path starts at the bottom row and ends at the top row, where SE[0, j] = 0. This shows that there exists a substring S of T such that EDIT(S, P) = 3.

30
We find the lowest value in the last row and trace back from that cell. Because the lowest value may occur in several cells, the output may consist of several substrings.
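
This step can be sketched in Python as follows. The sketch assumes unit costs and traces back once from every cell of the last row that holds the minimum, preferring the diagonal move when several moves are possible; that preference and the function name `approx_matches` are illustrative choices. It therefore reports one substring per end position rather than every optimal substring.

```python
def approx_matches(text, pattern):
    """Return (best distance, [(start, end, substring), ...]) for text and pattern."""
    n, m = len(text), len(pattern)
    se = [[0] * (n + 1) for _ in range(m + 1)]   # top row stays 0
    for i in range(1, m + 1):
        se[i][0] = i
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            se[i][j] = min(se[i - 1][j] + 1, se[i][j - 1] + 1, se[i - 1][j - 1] + cost)

    best = min(se[m])
    matches = []
    for end in range(n + 1):
        if se[m][end] != best:
            continue
        i, j = m, end
        while i > 0:                             # stop on reaching the top row
            cost = 0 if j > 0 and pattern[i - 1] == text[j - 1] else 1
            if j > 0 and se[i][j] == se[i - 1][j - 1] + cost:
                i, j = i - 1, j - 1              # match or substitute
            elif se[i][j] == se[i - 1][j] + 1:
                i -= 1                           # pattern symbol with no text symbol
            else:
                j -= 1                           # text symbol with no pattern symbol
        matches.append((j, end, text[j:end]))    # text[j:end] is the substring S
    return best, matches


best, matches = approx_matches("abcabba", "cbabac")
print(best)
for start, end, s in matches:
    print(start, end, s)
```

The column where the traceback reaches the top row gives the start of S, and the column of the starting cell in the last row gives its end.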

31
T = abcabba, P = cbabac, S = cabba

       a  b  c  a  b  b  a
    0  0  0  0  0  0  0  0
 c  1  1  1  0  1  1  1  1
 b  2  2  1  1  1  1  1  2
 a  3  2  2  2  1  2  2  1
 b  4  3  2  3  2  1  2  2
 a  5  4  3  3  3  2  2  2
 c  6  5  4  3  4  3  3  3

T: a b c – a b b a
P:     c b a b a c

32
T = abcabba, P = cbabac, EDIT(S, P) = 3

Edit distance table for S = cabba and P = cbabac:

       c  a  b  b  a
    0  1  2  3  4  5
 c  1  0  1  2  3  4
 b  2  1  1  1  2  3
 a  3  2  1  2  2  2
 b  4  3  2  1  2  3
 a  5  4  3  2  2  2
 c  6  5  4  3  3  3

S: c – a b b a
P: c b a b a c

33
T = abcabba, P = cbabac, EDIT(S, P) = 3 (SE table as on slide 31).
S: c a b b a –
P: c b a b a c

34
T = abcabba, P = cbabac, EDIT(S, P) = 3 (SE table as on slide 31).
S: c – a b – –
P: c b a b a c

35
T = abcabba, P = cbabac, EDIT(S, P) = 3 (SE table as on slide 31).
S: – – a b – c
P: c b a b a c

36
References
For edit distance computation:
[NW70] Needleman, S. B., and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, Vol. 48, 1970, pp. 443-453.
For string matching with errors:
[S80] Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.

37
Thank you
