# 1 Fast text searching: allowing errors Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91 Advisor: Prof. R. C. T. Lee Reporter:


1 Fast text searching: allowing errors Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91 Advisor: Prof. R. C. T. Lee Reporter: Z. H. Pan

2 Given a text $T(1,n)$, a pattern $P(1,m)$ and an error bound $k$, our approximate string matching problem is defined as follows: Find all locations $i$ of $T$ such that there exists a suffix $A$ of $T(1,i)$ with $d(A,P) \le k$, where $d(x,y)$ is the edit distance between $x$ and $y$.

3 Example: T=deaabeg, P=aabac and k=2. For i=5. T(1, 5)= deaab. We note that there exists a suffix A=aab of T(1, 5) such that d(A,P)=d(aab,aabac)=2.

4 Example: T=deaabeg, P=aab and k=2. Consider i=5. T(1,5)=deaab. We have A=aab of T(1,5) and d(A,P)=d(aab,aab)=0. Thus we have found a substring aab in T such that d(aab,P)=0. Consider i=6. T(1,6)=deaabe. We have A=aabe of T(1,6) and d(A,P)=d(aabe,aab)=1. Again, we have found a substring aabe in T such that d(aabe,P)=1.
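The distances quoted in the two examples above can be checked with a small routine. This is a minimal sketch, assuming $d$ is the usual unit-cost edit distance (insertion, deletion and substitution each cost 1); the function name `edit_distance` is hypothetical and does not come from the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance between x and y (assumed definition of d)."""
    # prev[j] holds d(x[:i-1], y[:j]); initially d("", y[:j]) = j.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # d(x[:i], "") = i
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j - 1] + cost,  # match or substitution
                            prev[j] + 1,         # drop cx from x
                            curr[j - 1] + 1))    # add cy to x
        prev = curr
    return prev[-1]


print(edit_distance("aab", "aabac"))  # 2, as on slide 3
print(edit_distance("aab", "aab"))    # 0, as on slide 4 (i=5)
print(edit_distance("aabe", "aab"))   # 1, as on slide 4 (i=6)
```

Only two rows of the distance matrix are kept at a time, which is enough when the distance itself is all that is needed.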

5 Our approach is based upon the following observation: Let $S$ be a substring of $T$, and write $S = S_1 S_2$ and $P = P_1 P_2$. If $S_2$ is a suffix of $S$ and $P_2$ is a suffix of $P$ such that $d(S_2, P_2) = 0$, and $d(S_1, P_1) \le k$, we have $d(S, P) \le k$.

6 Example: A=addcd, B=abcd and k=2. We may decompose A and B as follows: A=add+cd, B=ab+cd. Since d(cd,cd)=0 and d(add,ab)=2, we have d(A,B) ≤ 2; here in fact d(A,B)=2.
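The observation can be justified in one line, using the standard fact that edit distance is subadditive under concatenation (editing the two parts independently is one valid way of editing the whole string):

$$
d(S, P) = d(S_1 S_2,\; P_1 P_2) \;\le\; d(S_1, P_1) + d(S_2, P_2) = d(S_1, P_1) + 0 \;\le\; k .
$$

For the example above this reads $d(\text{addcd}, \text{abcd}) \le d(\text{add}, \text{ab}) + d(\text{cd}, \text{cd}) = 2 + 0 = 2$.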

7 A Recursive Operation for the Dynamic Programming Approach. Consider $T(1,i)$ and $P(1,j)$. Case 1: $T(i)=P(j)$. Let $B$ denote the prefix $P(1,j-1)$ of $P$. We consider whether there is a suffix $A$ of $T(1,i-1)$ such that $d(A,B) \le k$. (Figure: $A$ ends at position $i-1$ of $T$; $B$ spans positions $1$ to $j-1$ of $P$.)

8 Case 2: $T(i) \ne P(j)$. We consider three cases: 2.1 Let $B$ denote $P(1,j)$. There is a suffix $A$ of $T(1,i-1)$ such that $d(A,B) \le k-1$. This corresponds to an insertion. (Figure: $A$ ends at position $i-1$ of $T$; $B$ spans positions $1$ to $j$ of $P$.)

9 Case 2: $T(i) \ne P(j)$. We consider three cases: 2.2 Let $B$ denote $P(1,j-1)$. There is a suffix $A$ of $T(1,i)$ such that $d(A,B) \le k-1$. This corresponds to a deletion. (Figure: $A$ ends at position $i$ of $T$; $B$ spans positions $1$ to $j-1$ of $P$.)

10 Case 2: $T(i) \ne P(j)$. We consider three cases: 2.3 Let $B$ denote $P(1,j-1)$. There is a suffix $A$ of $T(1,i-1)$ such that $d(A,B) \le k-1$. This corresponds to a substitution. (Figure: $A$ ends at position $i-1$ of $T$; $B$ spans positions $1$ to $j-1$ of $P$.)

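11 To solve our approximate string matching problem, we start with a table, called $R_k[n,m]$. Let $S=T(1,i)$, where $1 \le i \le n$ and $1 \le j \le m$. $R_k(i,j)=1$ if there exists a suffix $A$ of $S$ such that $d(A, P(1,j)) \le k$; $R_k(i,j)=0$ otherwise.

Example: T=aabaacaabacab, P=aabac and k=1. The table $R_1[13,5]$ (columns: text position $i$ with $t_i$; rows: pattern position $j$ with $p_j$):

| j (p_j) \ i (t_i) | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (a) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 (a) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 (b) | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 (a) | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 5 (c) | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |

Consider i=9, j=4. S=T(1,9)=aabaacaab and P(1,4)=aaba. With A=aab we have d(A,P(1,4))=d(aab,aaba)=1, so $R_1(9,4)=1$.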

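12 Example: T=aabaacaabacab, P=aabac and k=1. The table $R_1[13,5]$ (columns: text position $i$ with $t_i$; rows: pattern position $j$ with $p_j$):

| j (p_j) \ i (t_i) | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (a) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 (a) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 (b) | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 (a) | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 5 (c) | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |

Consider i=13 and j=5. S=T(1,13)=aabaacaabacab and P(1,5)=aabac. There does not exist any suffix A of S such that $d(A,P(1,5)) \le 1$, so $R_1(13,5)=0$.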
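The table entries can be cross-checked numerically. The sketch below is not the method of the slides: it assumes the classical dynamic program for approximate matching, in which the first column of the distance matrix is zero so that a match may start at any text position, and then thresholds the result at $k$. The names `suffix_distance_table` and `r_table` are hypothetical.

```python
def suffix_distance_table(text: str, pattern: str):
    """D[i][j] = min over suffixes A of text[:i] of d(A, pattern[:j])."""
    n, m = len(text), len(pattern)
    D = [[0] * (m + 1) for _ in range(n + 1)]  # D[i][0] = 0: empty prefix
    for j in range(1, m + 1):
        D[0][j] = j  # empty text: all j pattern characters are unmatched
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if text[i - 1] == pattern[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,  # match or substitution
                          D[i - 1][j] + 1,         # extra text character
                          D[i][j - 1] + 1)         # missing pattern character
    return D


def r_table(text: str, pattern: str, k: int):
    """R[i][j] = 1 iff some suffix of text[:i] is within distance k of pattern[:j]."""
    D = suffix_distance_table(text, pattern)
    return [[1 if d <= k else 0 for d in row] for row in D]


R1 = r_table("aabaacaabacab", "aabac", 1)
print(R1[9][4], R1[13][5])  # 1 0, the two entries discussed on slides 11 and 12
```

Indices are 1-based as on the slides; row 0 and column 0 hold the boundary values for the empty text and the empty pattern prefix.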

13 Question: How can we find $R_k(i,j)$? Answer: Dynamic programming. There are three types of operations in edit distance: (1) insertion, (2) deletion, (3) substitution. We consider them separately and combine the results later.

14 Let $R^I_k(i,j)$, $R^D_k(i,j)$ and $R^S_k(i,j)$ denote the $R_k(i,j)$ related to insertion, deletion and substitution, respectively. Let $R^I_k[i,j]$, $R^D_k[i,j]$ and $R^S_k[i,j]$ denote the corresponding tables $R_k[i,j]$ related to insertion, deletion and substitution, respectively.

15 Consider $R^I_k(i,j)$ first. $R^I_k(i,j)=1$ if $t_i \ne p_j$ and $R_{k-1}(i-1,j)=1$, or $t_i = p_j$ and $R_k(i-1,j-1)=1$; $R^I_k(i,j)=0$ otherwise. (Figure: insertion, relating position $i-1$ of $T$ to position $j$ of $P$.)

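16 $R^I_k(i,j)=1$ if $t_i \ne p_j$ and $R_{k-1}(i-1,j)=1$, or $t_i = p_j$ and $R_k(i-1,j-1)=1$; $R^I_k(i,j)=0$ otherwise. Example: Text = aabaacaabacab. Pattern = aabac. k=1.

$R_0[13,5]$:

| j (p_j) \ i (t_i) | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (a) | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 (a) | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 (b) | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 (a) | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 (c) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

$R^I_1[13,5]$:

| j (p_j) \ i (t_i) | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (a) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 (a) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 3 (b) | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 (a) | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 5 (c) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |

(1) When i=13 and j=3: $t_{13}=p_3=b$ and $R_1(12,2)=1$, so $R^I_1(13,3)=1$.
(2) When i=6 and j=4: $t_6=c \ne p_4=a$ and $R_0(5,4)=0$, so $R^I_1(6,4)=0$.
(3) When i=11 and j=4: $t_{11}=c \ne p_4=a$ and $R_0(10,4)=1$, so $R^I_1(11,4)=1$.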

17 Consider $R^D_k(i,j)$. $R^D_k(i,j)=1$ if $t_i \ne p_j$ and $R_{k-1}(i,j-1)=1$, or $t_i = p_j$ and $R_k(i-1,j-1)=1$; $R^D_k(i,j)=0$ otherwise. (Figure: deletion, relating position $i$ of $T$ to position $j-1$ of $P$.)

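18 $R^D_k(i,j)=1$ if $t_i \ne p_j$ and $R_{k-1}(i,j-1)=1$, or $t_i = p_j$ and $R_k(i-1,j-1)=1$; $R^D_k(i,j)=0$ otherwise. Example: Text = aabaacaabacab. Pattern = aabac. k=1.

$R_0[13,5]$:

| j (p_j) \ i (t_i) | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (a) | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 (a) | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 (b) | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 (a) | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 (c) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

$R^D_1[13,5]$:

| j (p_j) \ i (t_i) | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (a) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 (a) | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 (b) | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 (a) | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 5 (c) | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |

(1) When i=13 and j=3: $t_{13}=p_3=b$ and $R_1(12,2)=1$, so $R^D_1(13,3)=1$.
(2) When i=6 and j=4: $t_6=c \ne p_4=a$ and $R_0(6,3)=0$, so $R^D_1(6,4)=0$.
(3) When i=3 and j=4: $t_3=b \ne p_4=a$ and $R_0(3,3)=1$, so $R^D_1(3,4)=1$.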

19 Consider $R^S_k(i,j)$. $R^S_k(i,j)=1$ if $t_i \ne p_j$ and $R_{k-1}(i-1,j-1)=1$, or $t_i = p_j$ and $R_k(i-1,j-1)=1$; $R^S_k(i,j)=0$ otherwise. (Figure: substitution, relating position $i-1$ of $T$ to position $j-1$ of $P$.)

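20 $R^S_k(i,j)=1$ if $t_i \ne p_j$ and $R_{k-1}(i-1,j-1)=1$, or $t_i = p_j$ and $R_k(i-1,j-1)=1$; $R^S_k(i,j)=0$ otherwise. Example: Text = aabaacaabacab. Pattern = aabac. k=1.

$R_0[13,5]$:

| j (p_j) \ i (t_i) | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (a) | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 (a) | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 (b) | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 (a) | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 (c) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

$R^S_1[13,5]$:

| j (p_j) \ i (t_i) | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (a) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 (a) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 (b) | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 (a) | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 (c) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

(1) When i=13 and j=3: $t_{13}=p_3=b$ and $R_1(12,2)=1$, so $R^S_1(13,3)=1$.
(2) When i=6 and j=4: $t_6=c \ne p_4=a$ and $R_0(5,3)=0$, so $R^S_1(6,4)=0$.
(3) When i=5 and j=5: $t_5=a \ne p_5=c$ and $R_0(4,4)=1$, so $R^S_1(5,5)=1$.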

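21 After every $R^I_k(i,j)$, $R^D_k(i,j)$ and $R^S_k(i,j)$ has been found, we immediately determine $R_k(i,j)$ by $R_k(i,j) = R^I_k(i,j) \lor R^D_k(i,j) \lor R^S_k(i,j)$. Example: Text = aabaacaabacab. Pattern = aabac. k=1.

$R_1[13,5]$:

| j (p_j) \ i (t_i) | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (a) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 (a) | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 (b) | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 (a) | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 5 (c) | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |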
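The combination step can be sketched in code. This is one possible reading of the three recurrences, not the presenters' implementation. The function names and the boundary values are assumptions: $R_k(i,0)=1$ for every $i$, $R_k(0,j)=1$ exactly when $j \le k$, and every error term is false for $k=0$. Only the combined table $R_k$ is checked here; the per-operation intermediate values depend on the boundary convention and are not claimed to reproduce the per-operation tables of the slides entry for entry.

```python
def r_tables(text: str, pattern: str, k_max: int):
    """Return [R_0, ..., R_kmax]; each R[i][j] uses the 1-based indices of the slides."""
    n, m = len(text), len(pattern)
    tables = []
    for k in range(k_max + 1):
        prev = tables[k - 1] if k > 0 else None  # R_{k-1}; absent for k = 0
        R = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            R[i][0] = 1                   # assumed boundary: empty pattern prefix
        for j in range(1, m + 1):
            R[0][j] = 1 if j <= k else 0  # assumed boundary: empty text
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                same = text[i - 1] == pattern[j - 1]
                keep = same and R[i - 1][j - 1] == 1   # t_i = p_j term
                err = (not same) and prev is not None  # t_i != p_j terms
                r_ins = keep or (err and prev[i - 1][j] == 1)
                r_del = keep or (err and prev[i][j - 1] == 1)
                r_sub = keep or (err and prev[i - 1][j - 1] == 1)
                R[i][j] = 1 if (r_ins or r_del or r_sub) else 0
        tables.append(R)
    return tables


def match_ends(text: str, pattern: str, k: int):
    """Text positions i (1-based) with R_k(i, m) = 1."""
    R = r_tables(text, pattern, k)[k]
    return [i for i in range(1, len(text) + 1) if R[i][len(pattern)] == 1]


print(match_ends("aabaacaabacab", "aabac", 1))  # [4, 5, 6, 10, 11, 12]
```

The printed positions are the columns of $R_1[13,5]$ whose last row ($j=5$) is 1, which are the answers to the matching problem stated on slide 2 for this example.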

22 Thank you!
