Published by Nathan Gallegos. Modified over 2 years ago.

1
String Matching with Errors
The Theory and Computation of Evolutionary Distances: Pattern Recognition, Sellers, P. H., Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.
Speaker: C. C. Lin
Adviser: R. C. T. Lee

2
In the following, we will present a problem related to the notion of edit distance. First, let us introduce edit distance.

3
In edit distance, there are three types of differences between two strings X and Y, each with cost 1:
Insertion: a symbol of Y is missing in X at the corresponding position. Example: X: A T, Y: A G T
Substitution: the symbols at corresponding positions are distinct. Example: X: A C C, Y: T C C
Deletion: a symbol of X is missing in Y at the corresponding position. Example: X: G C A, Y: G A

4
Given two strings X and Y, the edit distance between X and Y is the minimum number of insertions, deletions and substitutions needed to transform Y into X.

5
String X: ATGAATCTTACCGCCTCG
String Y: ATGAGGCTCTGGCCCCTG
Transformation (from string Y to string X):
String X: A T G A A - - T C T T A C C G C C T C G
String Y: A T G A G G C T C T G G C C - C C T - G
EDIT(X, Y)=7 (2 insertions, 2 deletions and 3 substitutions).
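The count on this slide can be checked mechanically. The following is a minimal Python sketch that is not part of the original slides; the function name `count_differences` and the use of "-" as the gap symbol are assumptions made for illustration. It walks the two aligned rows column by column and tallies gaps and mismatches. It only scores the alignment it is given and does not prove that 7 is the minimum.

```python
def count_differences(x_row, y_row, gap="-"):
    """Tally the differences between two aligned rows of equal length.

    Returns (gaps_in_x, gaps_in_y, substitutions).
    """
    if len(x_row) != len(y_row):
        raise ValueError("aligned rows must have the same length")
    gaps_in_x = gaps_in_y = substitutions = 0
    for a, b in zip(x_row, y_row):
        if a == gap and b == gap:
            raise ValueError("a column cannot hold two gaps")
        if a == gap:
            gaps_in_x += 1          # symbol of Y with no partner in X
        elif b == gap:
            gaps_in_y += 1          # symbol of X with no partner in Y
        elif a != b:
            substitutions += 1      # two distinct symbols in one column
    return gaps_in_x, gaps_in_y, substitutions


# The alignment from the slide, written without spaces.
x_row = "ATGAA--TCTTACCGCCTCG"
y_row = "ATGAGGCTCTGGCC-CCT-G"
print(count_differences(x_row, y_row))  # (2, 2, 3), i.e. 7 differences
```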

6
Next, we will introduce a dynamic programming method to compute the edit distance between two strings X and Y.

7
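Dynamic Programming for Edit Distance:
[The recurrence on the original slide was an image and is missing from this transcript; its three cases are labeled Delete, Insert and Substitute.]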
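The recurrence itself is not legible in this transcript, so the following Python sketch uses the standard unit-cost recurrence, on the assumption that this is what the slide showed. The function name `edit_distance` and the table name `edit` are illustrative. The indexing follows the boundary condition EDIT[0, j]=j used later in the slides, with j running over the symbols of X.

```python
def edit_distance(x, y):
    """Unit-cost edit distance between strings x and y."""
    n, m = len(x), len(y)
    # edit[i][j]: distance between the first i symbols of y
    # and the first j symbols of x.
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        edit[0][j] = j              # boundary: EDIT[0, j] = j
    for i in range(m + 1):
        edit[i][0] = i              # boundary: EDIT[i, 0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mismatch = 0 if y[i - 1] == x[j - 1] else 1
            edit[i][j] = min(
                edit[i][j - 1] + 1,             # skip a symbol of x
                edit[i - 1][j] + 1,             # skip a symbol of y
                edit[i - 1][j - 1] + mismatch,  # match or substitute
            )
    return edit[m][n]


print(edit_distance("abcabba", "cbabac"))  # 4
```

Filling the table row by row guarantees that the three neighbouring cells are already known when a cell is computed.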

8
Given X=abcabba, Y=cbabac.
[Figure, slides 8 to 13: the dynamic programming table for X=abcabba and Y=cbabac, filled in step by step.]

14
Given X=abcabba, Y=cbabac. EDIT(X, Y)=4.
[Figure, slides 14 to 20: the traceback through the table. The alignment is built from right to left, one column per slide.]

Slide  X (aligned)  Y (aligned)  Operation
14     a            c            Substitute
15     ba           ac           Substitute
16     bba          bac          Match
17     abba         abac         Match
18     cabba        -abac        Insert
19     bcabba       b-abac       Match
20     abcabba      cb-abac      Substitute

21
Given X=abcabba, Y=cbabac. An alignment with EDIT(X, Y)=4:
X: abcabba-
Y: cb-ab-ac
Operations, left to right: Substitute, Match, Insert, Match, Match, Insert, Match, Delete.

22
Given X=abcabba, Y=cbabac. An alignment with EDIT(X, Y)=4:
X: abcabba-
Y: cb-a-bac
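The traceback shown on the preceding slides can be sketched in Python as follows. This is an illustration under assumptions, not the presenter's code: the function name `align`, the "-" gap symbol and the tie-breaking order (diagonal first) are all choices made here. As the slides show, several alignments reach the same distance, so a different tie-breaking order may return a different but equally good alignment.

```python
def align(x, y, gap="-"):
    """Return (distance, aligned_x, aligned_y) for one optimal alignment."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mismatch = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + mismatch,
                          d[i - 1][j] + 1,
                          d[i][j - 1] + 1)

    # Walk back from the last cell to the first, one column at a time.
    i, j = m, n
    x_row, y_row = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = 0 if x[i - 1] == y[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + mismatch:
            x_row.append(x[i - 1])      # match or substitution
            y_row.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            x_row.append(x[i - 1])      # symbol of x against a gap
            y_row.append(gap)
            i -= 1
        else:
            x_row.append(gap)           # symbol of y against a gap
            y_row.append(y[j - 1])
            j -= 1
    return d[m][n], "".join(reversed(x_row)), "".join(reversed(y_row))


distance, x_row, y_row = align("abcabba", "cbabac")
print(distance)
print(x_row)
print(y_row)
```

Every column in which the two rows differ costs 1, so the number of differing columns equals the returned distance.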

23
The above algorithm computes the edit distance in O(mn) time and O(mn) space, where m and n are the lengths of the two strings.

24
In the following, we will introduce our main topic, the string matching with errors problem.

25
The definition of the problem: Given a pattern P of length m and a text T of length n, find a substring S of T such that EDIT(S, P) is minimal.

Given: T=abcabba, P=cbabac
Find: S=cabba, EDIT(S, P)=3
P: cbabac
S: c-abba

For comparison, take another substring of T.
Given: T=abcabba, P=cbabac
T's substring K=bcabb, EDIT(K, P)=4
P: -cbabac
K: bc-ab-b
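The definition can be applied literally by trying every substring of T. The following brute-force Python sketch is not from the slides and is far slower than the dynamic programming method they go on to describe; it is only meant to make the definition concrete. The function names `edit_distance` and `best_substrings_bruteforce` are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, keeping only two rows of the table."""
    previous = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        current = [i]
        for j in range(1, len(b) + 1):
            mismatch = 0 if a[i - 1] == b[j - 1] else 1
            current.append(min(previous[j - 1] + mismatch,
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[len(b)]


def best_substrings_bruteforce(text, pattern):
    """Return (minimal distance, substrings of text that attain it)."""
    best, winners = None, []
    for start in range(len(text) + 1):
        for end in range(start, len(text) + 1):
            s = text[start:end]
            dist = edit_distance(s, pattern)
            if best is None or dist < best:
                best, winners = dist, [s]
            elif dist == best and s not in winners:
                winners.append(s)
    return best, winners


best, winners = best_substrings_bruteforce("abcabba", "cbabac")
print(best)                                # 3
print("cabba" in winners)                  # True
print(edit_distance("bcabb", "cbabac"))    # 4
```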

26
Dynamic Programming for the String Matching with Errors Problem:
[The recurrence on the original slide was an image and is missing from this transcript.]

27
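The only difference between EDIT[i, j] and SE[i, j] is the boundary condition: EDIT[0, j]=j for the edit distance problem, whereas SE[0, j]=0 for the string matching with errors problem.
The dynamic programming approach for the edit distance problem:
[The recurrence on the original slide was an image and is missing from this transcript.]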

28
In the edit distance problem, we have EDIT[0, j]=j. In the string matching with errors problem, we set SE[0, j]=0.
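The effect of this one-line change can be seen in a short Python sketch. It assumes the standard unit-cost recurrence for the inner cells, since the slide's own formula is not legible here; the function name `se_table` is illustrative. Row 0 is all zeros, so a match may begin at any position of T for free, while column 0 still costs i because the first i symbols of P have nothing to align with.

```python
def se_table(text, pattern):
    """Build the SE table: rows follow the pattern, columns follow the text."""
    n, m = len(text), len(pattern)
    se = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 stays 0: SE[0, j] = 0
    for i in range(1, m + 1):
        se[i][0] = i                             # SE[i, 0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mismatch = 0 if pattern[i - 1] == text[j - 1] else 1
            se[i][j] = min(se[i - 1][j - 1] + mismatch,
                           se[i - 1][j] + 1,
                           se[i][j - 1] + 1)
    return se


table = se_table("abcabba", "cbabac")
print(table[-1])        # [6, 5, 4, 3, 4, 3, 3, 3]
print(min(table[-1]))   # 3
```

The last row holds, for each end position in T, the smallest distance between P and any substring of T ending there.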

29
T=abcabba, P=cbabac.
[Figure: the SE table for T and P with a traceback path marked.]
Since this path starts in the bottom row and ends in the top row, where SE(0, j)=0, there exists a substring S of T such that EDIT(S, P)=3.

30
We find the lowest value in the last row and trace back from that cell. Several cells may hold this lowest value, so our output may consist of several strings.
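This step can be sketched in Python as follows. The sketch assumes the standard unit-cost recurrence with SE[0, j]=0; the function name `best_matches` and the tie-breaking order in the traceback (diagonal first) are choices made here, not taken from the slides. It reports one substring per minimal cell of the last row, so other substrings with the same distance may exist.

```python
def best_matches(text, pattern):
    """Return (minimal distance, [(start, end, substring), ...])."""
    n, m = len(text), len(pattern)
    se = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        se[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mismatch = 0 if pattern[i - 1] == text[j - 1] else 1
            se[i][j] = min(se[i - 1][j - 1] + mismatch,
                           se[i - 1][j] + 1,
                           se[i][j - 1] + 1)

    best = min(se[m])
    matches = []
    for end in range(n + 1):
        if se[m][end] != best:
            continue
        # Trace back from the bottom row until the top row is reached.
        i, j = m, end
        while i > 0:
            if j > 0:
                mismatch = 0 if pattern[i - 1] == text[j - 1] else 1
            if j > 0 and se[i][j] == se[i - 1][j - 1] + mismatch:
                i, j = i - 1, j - 1     # match or substitution
            elif se[i][j] == se[i - 1][j] + 1:
                i -= 1                  # pattern symbol against a gap
            else:
                j -= 1                  # text symbol against a gap
        matches.append((j, end, text[j:end]))
    return best, matches


best, matches = best_matches("abcabba", "cbabac")
print(best)
for start, end, substring in matches:
    print(start, end, substring)
```

For T=abcabba and P=cbabac the minimal distance is 3, and the reported substrings include cabba, cab and abc, which are the ones aligned on the slides.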

31
T=abcabba, P=cbabac.
[Figures, slides 31 to 35: tracebacks through the SE table. Each alignment below gives EDIT(S, P)=3.]

Slides 31 and 32: S=cabba
T: abc-abba
P:   cbabac
S: c-abba
P: cbabac

Slide 33: S=cabba
S: cabba-
P: cbabac

Slide 34: S=cab
S: c-ab--
P: cbabac

Slide 35: S=abc
S: --ab-c
P: cbabac

36
References
For edit distance computation:
[NW70] Needleman, S. B., and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, Vol. 48, 1970.
For string matching with errors:
[S80] Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.

37
Thank you
