Presentation is loading. Please wait.

Presentation is loading. Please wait.

String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)

Similar presentations


Presentation on theme: "String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)"— Presentation transcript:

1 String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)

2 String Matching with Mismatches Landau – Vishkin 1986 Galil – Giancarlo 1986 Abrahamson 1987 Amir - Lewenstein - Porat 2000

3 Approximate String Matching problem: Find all text locations where distance from pattern is sufficiently small. distance metric:HAMMING DISTANCE Let S = s 1 s 2 … s m R = r 1 r 2 … r m Ham(S,R) = The number of locations j where s j r j Example: S = ABCABC R = ABBAAC Ham(S,R) = 2

4 Example: P = A B B A A C T = A B C A A B C A C … Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Problem 1: Counting mismatches

5 Example: P = A B B A A C T = A B C A A B C A C … 2 Ham(P,T 1 ) = 2 Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Counting mismatches

6 Example: P = A B B A A C T = A B C A A B C A C … 2, 4 Ham(P,T 2 ) = 4 Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Counting mismatches

7 Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Example: P = A B B A A C T = A B C A A B C A C … 2, 4, 6 Ham(P,T 3 ) = 6 Counting mismatches

8 Example: P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2 Ham(P,T 4 ) = 2 Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Counting mismatches

9 Example: P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, … Input: T = t 1... t n P = p 1 … p m Output: For each i in T Ham(P, t i t i+1 … t i+m-1 ) Counting mismatches

10 Example: k = 2 P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, … Problem 2: k-mismatches Input: T = t 1... t n P = p 1 … p m Output: Every i where Ham(P, t i t i+1 … t i+m-1 ) ≤ k

11 Example: k = 2 P = A B B A A C T = A B C A A B C A C … 2, 4, 6, 2, … 1, 0, 0, 1, Problem 2: k-mismatches Input: T = t 1... t n P = p 1 … p m Output: Every i where Ham(P, t i t i+1 … t i+m-1 ) ≤ kh

12 Naïve Algorithm (for counting mismatches or k-mismatches problem) Running Time: O(nm) n = |T|, m = |P| - Goto each location of text and compute hamming distance of P and T i

13 The Kangaroo Method (for k-mismatches) Landau – Vishkin 1986 Galil – Giancarlo 1986

14 The Kangaroo Method (for k-mismatches) -Create suffix tree (+ lca) for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

15 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

16 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

17 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

18 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

19 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

20 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T -Check P at each location i of T by kangrooing Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

21 The Kangaroo Method (for k-mismatches) - Create suffix tree for: s = P#T - Do up to k LCP queries for every text location Example: P = A B A B A A B A C A B T = A B B A C A B A B A B C A B B C A B C A … i

22 The Kangaroo Method (for k-mismatches) Preprocess: Build suffix tree of both P and T - O(n+m) time LCA preprocessing - O(n+m) time Check P at given text location Kangroo jump till next mismatch - O(k) time Overall time: O(nk)

23 How do we do counting in less than O(nm) ?

24 Lets start with binary strings 0 1 0 1 1 0 1 1 P = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1... T = If we pad the pattern and the text with zeros this is like a convolution of two vectors of length m+n We can do this using FFT in O(nlog(n)) time

25 a b a c c a c b a c a b a c c P = a b b b c c c a a a a b a c b... T = And if the strings are not binary ?

26 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask a b b b c c c a a a a b a c b... T =

27 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T =

28 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T =

29 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = a b b b c c c a a a a b a c b... not-a mask

30 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = a b b b c c c a a a a b a c b... not-a mask 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a

31 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a

32 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a Multiply P a and T (not a) to count mismatches using “a” PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a...

33 a b a c c a c b a c a b a c c P = PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 a b b b c c c a a a a b a c b... T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a Multiply P a and T not a to count mismatches using a PaPa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1T not a...

34 Boolean Convolutions (FFT) Method

35 Running Time: One boolean convolution - O(n log m) time # of matches of all symbols - O(n| | log m) time Boolean Convolutions (FFT) Method

36 How do we do counting in less than O(nm) ? Lets count matches rather than mismatches

37 a b c d e b b d b g d e f h d c c a b g h h... P = T = counter increment For each character you have a list of offsets where it occurs in the pattern, When you see the char in the text, you increment the appropriate counters.

38 a b c d e b b d b g d e f h d c c a b g h h... P = T = counter increment For each character you have a list of offsets where it occurs in the pattern, When you see the char in the text, you increment the appropriate counters.

39 a b c d e b b d b g d e f h d c c a b g h h... P = T = counter increment This is fast if all characters are “rare”

40 Partition the characters into rare and frequent Rare: occurs ≤ c times in the pattern For rare characters run this scheme with the counters Takes O(nc) time

41 Frequest chars You have at most m/c of them Do a convolution for each Total cost O(m/c n log(n)).

42 Fix c Complexity:

43 Frequent Symbol: a symbol that appears at least times in P. Back to the k-mismatch problem Want to beat the O(nk) kangaroo bound

44 Few (≤√k) frequent symbols Do the counters scheme for non-frequent Convolve for each frequent O(n log n) O(n )

45 (≥√k) frequent symbols Intuition: There cannot be too many places where we match

46 (≥√k) frequent symbols - Consider frequent symbols. - For each of them consider the first appearances. Do the counters scheme just for these symbols and occurrences

47 k = 4, = 4 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask c-mask a b b b c c c a a a a b a c b... T = use a-mask

48 Example of Masked Counting k = 4, = 4 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask c-mask a b b b c c c a a a a b a c b... T = a b a c c a c b a c a b a c c d 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... 1 counter

49 Example of Masked Counting k = 4, = 4 a b a c c a c b a c a b a c c P = a b a c c a c b a c a b a c c a-mask c-mask a b b b c c c a a a a b a c b... T = a b a c c a c b a c a b a c c d 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0... 1 counter

50 Counting Stage: Run through text and count occurrences of all marks. Time: O(n ). For location i of T, if counter i < k then no match at location i. Why? The total # of elements in all masks is 2 = 2k. Important Observations: 1) Sum of all counters 2 n 2) Every counter whose value is less than k already has more than k errors.

51 How many locations remain? Sum of all counters: 2n Value of potential matches > k The Kangaroo Method. How do we check these locations? Use Kangaroo Method Time: O(k) per location Overall Time: O( ) = O( ) # of potential matches:

52 Additional Points Can reduce to O( n )


Download ppt "String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)"

Similar presentations


Ads by Google