Presentation is loading. Please wait.

Presentation is loading. Please wait.

Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.

Similar presentations


Presentation on theme: "Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University."— Presentation transcript:

1 Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

2 String Matching with k Mismatches Landau – Vishkin 1986 Galil – Giancarlo 1986 Abrahamson 1987 Amir - Lewenstein - Porat 2000

3 Exact String Matching Input: T = t 1... t n P = p 1 … p m Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A …

4 Input: T = t 1... t n P = p 1 … p m Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A … 3 Exact String Matching

5 Input: T = t 1... t n P = p 1 … p m Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A … 3 7 Exact String Matching

6 Input: T = t 1... t n P = p 1 … p m Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A … 3 7 11 Exact String Matching

7 Input: T = t 1... t n P = p 1 … p m Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A … Answer: {3,7,11,..} Exact String Matching

8 Problem: Matching not exact in applications of: Computational Biology Musicology Text Editing Meteorology etc. Need other definitions of string matching!

9 Approximate String Matching Idea: Find all text locations where distance from pattern is sufficiently small. distance metric:HAMMING DISTANCE Let S = s 1 s 2 … s m R = r 1 r 2 … r m Ham(S,R) = The number of locations j where s j r j Example: S = ABCABC R = ABBAAC Ham(S,R) = 2

10 String Matching with Mismatches Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Example: P = A B B A A C T = A B C A A B C A C …

11 String Matching with Mismatches Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Example: P = A B B A A C T = A B C A A B C A C … 2 Ham(P,T 1 ) = 2

12 String Matching with Mismatches Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Example: P = A B B A A C T = A B C A A B C A C … 2, 4 Ham(P,T 2 ) = 4

13 String Matching with Mismatches Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Example: P = A B B A A C T = A B C A A B C A C … 2, 4, 6 Ham(P,T 3 ) = 6

14 String Matching with Mismatches Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Example: P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2 Ham(P,T 4 ) = 2

15 String Matching with Mismatches Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Example: P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, …

16 Input: T = t 1... t n, P = p 1 … p m String Matching with k Mismatches Output: Every i in T s.t. Ham(P, t i t i+1 … t i+m-1 ) k Example: k = 2 P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, …

17 Input: T = t 1... t n, P = p 1 … p m String Matching with k Mismatches Output: Every i in T s.t. Ham(P, t i t i+1 … t i+m-1 ) k Example: k = 2 P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, …

18 Input: T = t 1... t n, P = p 1 … p m String Matching with k Mismatches Output: Every i in T s.t. Ham(P, t i t i+1 … t i+m-1 ) k Example: k = 2 P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, … Y,N,N,Y, …

19 Naïve Algorithm (for counting mismatches or k-mismatches problem) Running Time: O(nm) n = |T|, m = |P| - Goto each location of text and compute hamming distance of P and T i

20 The Kangaroo Method (for k-mismatches) Landau – Vishkin 1986 Galil – Giancarlo 1986

21 Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }

22 Trie (Cont) Assume no string is a prefix of another a b c e e f d b f e g Each string corresponds to a leaf.

23 Compressed Trie Compress unary nodes, label edges by strings a b c e e f d b f e g a bbf c eef d e g 

24 Suffix tree Suffix tree of string s: a compressed trie of all suffixes of s Prefix-free: add a special character, say $, at the end of s

25 Suffix tree (Example) Let s = abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$ { $ b$ ab$ bab$ abab$ } a b a b $ a b $ b $ $ $

26 Suffix Tree properties - Succint in space - O(n). - Can be built in O(n) time. McCreight, Weiner, Ukkonen, Farach-Colton b 1 2 a b a b $ a b $ 3 $ 4 $ 5 $

27 Exact string matching 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ Given a pattern P = ab we traverse the tree according to the pattern. s=abab$

28 Exact string matching 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ Leaves correspond to locations of appearance! s=abab$ 1 3

29 Exact string matching 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ Prepare Tree: O(n) time Find matches: O(m + occ) time occ = # of matches s=abab$ 1 3

30 Lowest common ancestors A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

31 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes s = abbaab$ 1 3 a b a a b a b $ b 5 $ 2 b 4 b $ a 6 $ 7 $ b $ a a a b $

32 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes 1 3 a b a a b a b $ b 5 $ 2 b 4 b $ a 6 $ 7 $ b $ a a a b $ s = abbaab$ aab$

33 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes 1 3 a b a a b a b $ b 5 $ 2 b 4 b $ a 6 $ 7 $ b $ a a a b $ s = abbaab$ aab$ abbaab$

34 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes 1 3 a b a a b a b $ b 5 $ 2 b 4 b $ a 6 $ 7 $ b $ a a a b $ s = abbaab$ aab$ abbaab$

35 LCA/LCP properties a 1 3 b a a b a b $ b 5 $ 2 b 4 b $ a 6 $ 7 $ b $ a a a b $ Preprocesssing time : O(n) Query Time: O(1) Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993

36 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

37 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

38 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

39 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

40 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

41 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

42 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

43 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

44 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

45 The Kangaroo Method (for k-mismatches) Preprocess: Build suffix tree of both P and T - O(n+m) time LCA preprocessing - O(n+m) time Check P at given text location Kangroo jump till next mismatch - O(k) time Overall time: O(nk)

46 a b a c c a c b a c a b a c c P = a b b b c c c a a a a b a c b... T = Boolean Convolutions (FFT) Method

47 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask a b b b c c c a a a a b a c b... T = Boolean Convolutions (FFT) Method

48 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = Boolean Convolutions (FFT) Method

49 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = Boolean Convolutions (FFT) Method

50 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = a b b b c c c a a a a b a c b... not-a mask Boolean Convolutions (FFT) Method

51 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = a b b b c c c a a a a b a c b... not-a mask 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a Boolean Convolutions (FFT) Method

52 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a Boolean Convolutions (FFT) Method

53 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a Multiply P a and T not a to count mismatches (use FFT) PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a... Boolean Convolutions (FFT) Method

54 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a Multiply P a and T not a to count mismatches (use FFT) PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a... Boolean Convolutions (FFT) Method

55

56 Running Time: One boolean convolution - O(n log m) time # of matches of all symbols - O(n| | log m) time Boolean Convolutions (FFT) Method

57 Counting Method Input:Text: T = t 1 …t n Pattern: P = p 1 …p m Max # of allowed mismatches: k Assumption: Each pattern element is distinct a b c d e f g h b g d e f h d c c a b g h h... Count matches (instead of mismatches) P = T = counter increment

58 O(n log m) Algorithm Frequent Symbol: a symbol that appears at least times in P. Case 1 : At least frequent symbols. - Consider first frequent symbols. - For each of them construct a mask for first appearances. We distinguish between two cases: Case 2 : Less than frequent symbols. Case 1 : At least frequent symbols.

59 Example of Masked Counting k = 4, = 4 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask c-mask a b b b c c c a a a a b a c b... T = use a-mask

60 Example of Masked Counting k = 4, = 4 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask c-mask a b b b c c c a a a a b a c b... T = a b a c c a c b a c a b a c c d 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... 1 counter

61 Example of Masked Counting k = 4, = 4 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask c-mask a b b b c c c a a a a b a c b... T = a b a c c a c b a c a b a c c d 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0... 1 counter

62 Counting Stage: Run through text and count occurrences of all marks. Time: O(n ). For location i of T, if counter i < k then no match at location i. Why? The total # of elements in all masks is 2 = 2k. Important Observations: 1) Sum of all counters 2 n 2) Every counter whose value is less than k already has more than k errors.

63 How many locations remain? Sum of all counters: 2n Value of potential matches > k The Kangaroo Method. How do we check these locations? Use Kangaroo Method Time: O(k) per location Overall Time: O( ) = O( ) # of potential matches:

64 Case 2:X frequent symbols, x < a) Count all matches of frequent symbols - one boolean convolution per symbol. b) For non-frequent symbols, build full masks. Time: O(x n log m) = O( n log m) Symbol non-frequent appears < 2 in P mask size < 2 Count time: O(n )

65 So, Case 2 is O(n log m) Overall Algo. Time: O(n log m) c) Add results of a) & b) and get total number of matches at every text location. Time: a) O(n log m) b) O(n ) c) O(n)

66 Additional Points 1. O(n log k) For there is a linear time algorithm - O( ) 2. O( n ) Better tradeoff: Define frequent symbol >

67 O( ) time algorithm Outline : 1. Find 2k special substrings of pattern. 2. Construct forest data structure combining info of special pattern substrings and text. 3. Use local counting arguments and quick queries to forest data structure to prune candidates. 4. Use kangaroo method to check leftover potential candidates.

68 k-Mismatches and Matrix Multiplication “Or-And” matrix multiplication: AxB = C, c ij = a ik b kj Pattern all-mismatch problem: Find all text locations where the pattern mismatches at every character. Indyk: If there is an algorithm faster than O(n ) for the Pattern all-mismatch problem then there is a new method for solving “Or-And” matrix multiplication faster than O(n 3 )

69 OPEN PROBLEMS Hamming Distance in time: O(n log m) Edit Distance? Other metrics?


Download ppt "Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University."

Similar presentations


Ads by Google