1 Average-Optimal Multiple Approximate String Matching. Kimmo Fredriksson and Gonzalo Navarro. ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, Pages 1-47. Professor R. C. T. Lee. Speaker: K. W. Liu

2 The Problem. The approximate string matching problem: given a text T[1...n] and a pattern P[1...m] over a finite alphabet of size σ, find the approximate occurrences of P in T, allowing at most k differences (insertions, deletions, substitutions).
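
To make the problem statement concrete, here is a minimal Python sketch of the classical dynamic-programming search that reports where approximate occurrences end. The function name `approx_end_positions` and the choice to report 1-based end positions are assumptions made for illustration; this is a baseline, not the filtering algorithm of the talk.

```python
def approx_end_positions(text, pattern, k):
    """Return the 1-based positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any substring
    # of the text ending at the current position. col[0] stays 0, which
    # lets an occurrence start anywhere in the text.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value of col[i - 1] in the previous column
        for i in range(1, m + 1):
            cur = min(
                prev_diag + (pattern[i - 1] != ch),  # match or substitution
                col[i] + 1,       # extra text character
                col[i - 1] + 1,   # pattern character left unmatched
            )
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_end_positions("xxabcxx", "abc", 1))  # [4, 5, 6]
```

This takes O(mn) time; the filtering approach in the following slides aims to skip most of the text instead of touching every character.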

3 For a window of size m-k, if there exists a substring S1 in this window such that its edit distance to every substring of P is greater than k, we move P. Our algorithm scans the window from the right, as shown in Fig. 3. [Fig. 3: text T with a window of size m-k containing S1, aligned with pattern P]

4 For a window of size m-k, if there exists a suffix S1 such that its edit distance to every substring of P is greater than k, we move P to S1. Our algorithm scans the window from the right, as shown in Fig. 3. [Fig. 3: text T with a window of size m-k whose suffix is S1, aligned with pattern P]

5 But how do we know that ED(S1, S2) > k? We use a very useful lemma.

6 Lemma. Consider strings Q and P. Let Q be divided into q1, q2, …, qn. [Figure: Q split into consecutive pieces q1, q2, …, qn] For each qi, let pi be the substring of P such that ED(qi, pi) is the smallest among all substrings of P.

7 Proof: Divide P into n pieces p'1, p'2, …, p'n. [Figure: Q split into q1, q2, …, qn, shown above P split into p'1, p'2, …, p'n]

8 To determine whether ED(S1, S2) > k, we may use the lemma. We divide the window into small pieces t1, t2, …, ta. For each ti, we find the substring pi in P for which ED(pi, ti) is the smallest. [Fig. 7: window W of T split into pieces t1, t2, …, with their closest substrings p1, p2, … in P]
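
The way the lemma is used here can be sketched in Python. The sketch assumes the lemma's conclusion is the lower bound from Fredriksson and Navarro's paper (the formula itself does not survive on the slide): the sum of the per-piece minimum distances never exceeds the edit distance between the whole window and any substring of P. All function names are hypothetical, and the brute-force enumeration of substrings is chosen for clarity, not speed.

```python
def edit_distance(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def min_distance_to_substrings(piece, pattern):
    """Smallest edit distance between `piece` and any non-empty substring
    of `pattern`, found by brute force."""
    subs = {pattern[i:j]
            for i in range(len(pattern))
            for j in range(i + 1, len(pattern) + 1)}
    return min(edit_distance(piece, s) for s in subs)


def lemma_lower_bound(window, pattern, piece_len):
    """Sum of the per-piece minimum distances (assumed lemma bound)."""
    pieces = [window[i:i + piece_len] for i in range(0, len(window), piece_len)]
    return sum(min_distance_to_substrings(p, pattern) for p in pieces)


# Pieces ct, ag, gg, aa contribute 1 + 1 + 2 + 0.
print(lemma_lower_bound("ctagggaa", "ttaatatat", 2))  # 4
```

If this sum already exceeds k, the window cannot be part of an occurrence, so the exact distance never has to be computed.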

9

10 In general, to find such a pi, we may use dynamic programming [Sellers 1980]. But we may use a special kind of small piece. It is customary to call a small piece of size L an L-gram. Let us use 2-grams.
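
A minimal sketch of the dynamic-programming step, assuming the usual Sellers-style formulation in which the match may start and end anywhere in P. The function name is hypothetical.

```python
def min_distance_to_substrings(piece, pattern):
    """Smallest edit distance between `piece` and any substring of `pattern`."""
    # row[j] = smallest edit distance between the part of `piece` read so far
    # and any substring of `pattern` ending at position j. The all-zero first
    # row lets the substring start anywhere in the pattern.
    row = [0] * (len(pattern) + 1)
    for i, ch in enumerate(piece, start=1):
        new = [i]  # only the empty substring ends at position 0
        for j, pc in enumerate(pattern, start=1):
            new.append(min(row[j - 1] + (ch != pc), row[j] + 1, new[j - 1] + 1))
        row = new
    # Taking the minimum over the last row lets the substring end anywhere.
    return min(row)


print(min_distance_to_substrings("ga", "ttaatatat"))  # 1
```

The cost is proportional to the piece length times m, which is why short pieces with precomputed answers are attractive.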

11 Note that for two strings P and Q of length 2, the edit distance between them is equal to the Hamming distance between them. Thus, we may use 2-grams in our algorithm.
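
The claim about strings of length 2 can be checked exhaustively. The following is a small Python sketch over the alphabet {a, c, g, t}, which is an assumption taken from the later examples; the function names are hypothetical.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Recursive edit distance with memoization (fine for short strings)."""
    if not a or not b:
        return len(a) + len(b)
    return min(
        edit_distance(a[1:], b[1:]) + (a[0] != b[0]),  # match or substitution
        edit_distance(a[1:], b) + 1,                   # deletion
        edit_distance(a, b[1:]) + 1,                   # insertion
    )


def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def two_gram_distances_agree(alphabet="acgt"):
    grams = ["".join(p) for p in product(alphabet, repeat=2)]
    return all(edit_distance(x, y) == hamming(x, y) for x in grams for y in grams)


print(two_gram_distances_agree())  # True
```

The equality is special to very short strings: for length 3, "abc" and "bca" have Hamming distance 3 but edit distance 2.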

12 Our algorithm. Make a table D that stores, for each possible 2-gram over the alphabet, the smallest edit distance between that 2-gram and all substrings of the pattern P. This is done in the preprocessing stage.

13 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1.
[Figure: T with the first window of size m-k marked]
Smallest edit distance between aa and all substrings of P = 0.
Smallest edit distance between gg and all substrings of P = 2 > k.

14 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1.
[Figure: T with the shifted window of size m-k marked]
Smallest edit distance between tt and all substrings of P = 0.
Smallest edit distance between aa and all substrings of P = 0.
Smallest edit distance between at and all substrings of P = 0.
Smallest edit distance between ga and all substrings of P = 1 = k.
The sum of these distances is 1, which does not exceed k, so this window cannot be skipped.
[Figure: T with labels m+2k, m-k, i, and i+1]

15 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1.
[Figure: T with the window of size m-k and a region of size m+k starting at position i+1]
We find the edit distance between gaataattta and P.

16 Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1.
[Figure: T shown twice, with the window of size m-k marked]
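
The worked example can be replayed with a short Python sketch of the filtering scan. It is a simplified reading of the slides, not the authors' implementation: the function names, the 0-based start positions it returns, and the verification rule (report position i when some prefix of the m+k characters starting there is within distance k of P, then shift by one) are all assumptions.

```python
from itertools import product


def min_distance_to_substrings(piece, pattern):
    """Smallest edit distance between `piece` and any substring of `pattern`."""
    row = [0] * (len(pattern) + 1)
    for i, ch in enumerate(piece, start=1):
        new = [i]
        for j, pc in enumerate(pattern, start=1):
            new.append(min(row[j - 1] + (ch != pc), row[j] + 1, new[j - 1] + 1))
        row = new
    return min(row)


def some_prefix_within_k(area, pattern, k):
    """True if some prefix of `area` is within edit distance k of `pattern`."""
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for i, ch in enumerate(area, start=1):
        cur = [i]
        for j, pc in enumerate(pattern, start=1):
            cur.append(min(prev[j - 1] + (ch != pc), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
        best = min(best, prev[-1])
    return best <= k


def filter_search(text, pattern, k, l=2):
    """Return 0-based start positions of approximate occurrences (assumes k < m)."""
    m, n = len(pattern), len(text)
    w = m - k  # window size
    alphabet = sorted(set(text) | set(pattern))
    D = {"".join(g): min_distance_to_substrings("".join(g), pattern)
         for g in product(alphabet, repeat=l)}
    matches, i = [], 0
    while i + w <= n:
        total, pos, shift, verify = 0, i + w, 1, True
        while pos - l >= i:          # read l-grams from the right end
            total += D[text[pos - l:pos]]
            if total > k:            # no occurrence can start at i .. pos-l
                shift, verify = pos - l + 1 - i, False
                break
            pos -= l
        if verify and some_prefix_within_k(text[i:i + m + k], pattern, k):
            matches.append(i)
        i += shift
    return matches


print(filter_search("ctagggaataatttacaatt", "ttaatatat", 1))  # []
```

On the example text the first window is abandoned after reading aa and gg, and the next window starts at the sixth character, as on the slides. The second window sums to 1, so the region gaataattta is verified, and no occurrence with k = 1 is found.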

17 In the preprocessing, we build a table D that records, for each possible l-gram (a string of length l over the alphabet), the smallest edit distance between that l-gram and all substrings of P.

18 D table: example (step by step). For example, P = aacaccgaa. All entries start at 2:
     aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp    2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
Next: P = aacaccgaa vs aa.

19 P = aacaccgaa vs ac. Table after processing aa and ac:
     aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp    0  0  2  2  2  2  2  2  2  2  2  2  2  2  2  2

20 P = aacaccgaa vs ag, at, ca, and cc in turn. Table after processing these:
     aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp    0  0  1  1  0  0  2  2  2  2  2  2  2  2  2  2
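
The step-by-step table can be reproduced with a few lines of Python. This is a sketch under assumptions: the alphabet {a, c, g, t} is inferred from the 2-grams listed on the slides, and the function names are hypothetical.

```python
from itertools import product


def min_distance_to_substrings(gram, pattern):
    """Smallest edit distance between `gram` and any substring of `pattern`."""
    row = [0] * (len(pattern) + 1)  # a match may start anywhere in the pattern
    for i, ch in enumerate(gram, start=1):
        new = [i]
        for j, pc in enumerate(pattern, start=1):
            new.append(min(row[j - 1] + (ch != pc), row[j] + 1, new[j - 1] + 1))
        row = new
    return min(row)  # a match may end anywhere in the pattern


def build_d_table(pattern, alphabet="acgt", l=2):
    """Map every l-gram over `alphabet` to its smallest distance to `pattern`."""
    grams = ["".join(g) for g in product(alphabet, repeat=l)]
    return {g: min_distance_to_substrings(g, pattern) for g in grams}


D = build_d_table("aacaccgaa")
print(" ".join(D))                                 # 2-grams in slide order
print(" ".join(f"{v:2d}" for v in D.values()))     # their D values
```

The first six entries come out as 0 0 1 1 0 0, matching the slide. Only tt keeps the value 2, because t does not occur in P.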

21 Time complexity The average complexity of the algorithm is for

22 The end

