Dynamic Text and Static Pattern Matching. Amihood Amir, Gad M. Landau, Moshe Lewenstein, Dina Sokol. Bar-Ilan University.


1 Dynamic Text and Static Pattern Matching Amihood Amir Gad M. Landau Moshe Lewenstein Dina Sokol Bar-Ilan University

2 Classical Pattern Matching
Input:
- Pattern P = p_1 p_2 … p_m
- Text T = t_1 t_2 t_3 … t_n
over alphabet Σ, where m is the pattern size and n is the text size.
Output: the locations in T where P appears.

3 Pattern Matching (e.g.)
Input: P = agca, Σ = {a,g,c,t}
T = aaagcattagctagcagcat

4 Pattern Matching (e.g.)
Input: P = agca, Σ = {a,g,c,t}
T = aaagcattagctagcagcat (positions numbered 1, 2, 3, …)
Output: 3, 13, 16, …
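
The example above can be reproduced with a direct scan. This is only a minimal sketch of the problem statement, not an efficient matcher; the function name `find_occurrences` is made up for illustration, and positions are 1-based as on the slide.

```python
def find_occurrences(pattern: str, text: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of pattern in text."""
    m = len(pattern)
    # Compare the pattern against every window of length m (O(n*m) time).
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(find_occurrences("agca", "aaagcattagctagcagcat"))  # [3, 13, 16]
```

For the 20 characters of T shown on the slide, the scan reports exactly the three positions listed in the output.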

5 “Dynamic” Pattern Matching
- A. Static Text and Dynamic Pattern.
- B. Dynamic Text and Dynamic Pattern.
- C. Dynamic Text and Static Pattern.

6 “Dynamic” Pattern Matching
- A. Static Text and Dynamic Pattern, a.k.a. the indexing problem.
Solution: preprocess the text and answer pattern queries.
Preprocessing data structure: suffix trees [Wei73, McC75, Ukk95, Far97].
Time: O(n) preprocessing, O(m) query time.

7 “Dynamic” Pattern Matching
- A. Static Text and Dynamic Pattern.
  Time: O(n) preprocessing, O(m) query time.
- B. Dynamic Text and Dynamic Pattern, a.k.a. the dynamic indexing problem.
  Solution: sophisticated data structures [SV96, ABR00].
  Time: query O(m + log^2 n), change O(log^2 n).

8 “Dynamic” Pattern Matching
- A. Static Text and Dynamic Pattern.
  Time: O(n) preprocessing, O(m) query time.
- B. Dynamic Text and Dynamic Pattern.
  Time: query O(m + log^2 n), change O(log^2 n).
- C. Dynamic Text and Static Pattern?

9 Dynamic Text and Static Pattern Matching
- The pattern does not change.
- The text changes over time.
- Goal: report new occurrences of the pattern without performing a new search.

10 Motivation
1. Intrusion detection systems
2. Info alerts
3. Two-dimensional run-length compressed matching problem [ALS03]
[Figure: a run-length compressed row (a 14, a 4, b 2, c 3, d 5, c 8, a 6), labelled FAX]

11 Problem Definition
- Input: T and P over Σ = {1, …, m}.
- Output:
  1. At start: all occurrences of P in T.
  2. After a change operation:
     a. report all new occurrences of P in T;
     b. discard all old occurrences of P in T.
Change operation: change one character in the text, e.g. location 5 from a to b.

12 Example
- Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}
T = g a g a g c t a g c g a g c a t

13 Example
- Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}
T = g a g a g c t a g a g a g c a t (the character at position 10 has been changed from c to a)

14 Example
- Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}
T = g a g a g c t a g a g a g c a t (change at position 10; new occurrence starting at position 8)
- Output: {8}
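
As a baseline for the example above, one can simply re-scan the window of at most m start positions around the replaced character. This sketch is not the algorithm of the talk (which avoids searching for P); it only illustrates the required input and output. The names `window_matches` and `replace_char` are hypothetical, and positions are 1-based.

```python
def window_matches(text: list[str], pattern: str, pos: int) -> set[int]:
    """1-based starts of occurrences of pattern that cover position pos."""
    m, n = len(pattern), len(text)
    lo = max(1, pos - m + 1)   # earliest start that still covers pos
    hi = min(pos, n - m + 1)   # latest start that covers pos and fits in the text
    return {s for s in range(lo, hi + 1)
            if "".join(text[s - 1:s - 1 + m]) == pattern}


def replace_char(text: list[str], pattern: str, pos: int, ch: str):
    """Replace text[pos] by ch; return (new occurrences, discarded occurrences)."""
    old = window_matches(text, pattern, pos)
    text[pos - 1] = ch
    new = window_matches(text, pattern, pos)
    return sorted(new - old), sorted(old - new)


text = list("gagagctagcgagcat")
gained, lost = replace_char(text, "agagagc", 10, "a")
print(gained, lost)  # [8] []
```

This baseline costs O(m^2) character comparisons per replacement, which is the cost the talk sets out to avoid.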

15 Results
After O(n log log m + …) preprocessing time, O(log log m) time per replacement.

16 “Dynamic” Pattern Matching
- A. Static Text and Dynamic Pattern.
  Time: O(n) preprocessing, O(m) query time.
- B. Dynamic Text and Dynamic Pattern.
  Time: query O(m + log^2 n), change O(log^2 n).
- C. Dynamic Text and Static Pattern.
  Time: change and announce O(log log m).

17 Static Stage
- To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt ‘77].
- All pattern occurrences in a text of length 2m can be stored in O(1) space.
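
For reference, a compact version of the KMP search used in the static stage might look like the following. It is a textbook sketch rather than the authors' code; it assumes a non-empty pattern and reports 1-based start positions. The function names are illustrative.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern: str, text: str) -> list[int]:
    """Return the 1-based starts of all occurrences of a non-empty pattern."""
    fail = failure_function(pattern)
    matches, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]          # fall back to the next shorter border
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 2)  # convert the 0-based end index to a 1-based start
            k = fail[k - 1]            # keep going to find overlapping matches
    return matches


print(kmp_search("agca", "aaagcattagctagcagcat"))  # [3, 13, 16]
```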

18 Succinct Output
Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.)
[Figure: text T marked at positions 1, m, 2m, 3m, 4m, and pattern P]

19 Succinct Output (cont.)
- P is periodic: a string P is periodic if it matches itself before position |P|/2, e.g. P = abcabcabca matches a shifted copy of itself, abcabcabca.
  Store the output as a ‘chain’ of pattern occurrences.
- P is non-periodic: by definition, no more than two occurrences.
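
The periodicity test and the ‘chain’ representation can be sketched as follows. The encoding of a chain as a triple (first start, step, count) is an assumption made for illustration; the slides do not specify the exact representation. Periodicity is taken here to mean that the smallest period is at most |P|/2.

```python
def smallest_period(s: str) -> int:
    """Smallest p with s[i] == s[i + p] for all valid i, via the border array."""
    border = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = border[k - 1]
        if s[i] == s[k]:
            k += 1
        border[i] = k
    return len(s) - border[-1]


def is_periodic(s: str) -> bool:
    """True if s matches itself at a shift of at most half its length."""
    return 2 * smallest_period(s) <= len(s)


def as_chain(starts: list[int]):
    """Encode evenly spaced, sorted match starts as (first, step, count)."""
    if not starts:
        return None
    step = starts[1] - starts[0] if len(starts) > 1 else 0
    if any(b - a != step for a, b in zip(starts, starts[1:])):
        raise ValueError("starts are not evenly spaced")
    return (starts[0], step, len(starts))


p = "abcabcabca"                 # period 3, so periodic
t = "abc" * 6 + "a"              # a text of length 2m - 1 = 19
starts = [i + 1 for i in range(len(t) - len(p) + 1) if t[i:i + len(p)] == p]
print(smallest_period(p), is_periodic(p), as_chain(starts))  # 3 True (1, 3, 4)
```

Whatever the number of occurrences in the block, the triple occupies constant space.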

20 On-line Algorithm
Following each replacement:
- Delete old matches that are no longer pattern occurrences.
- Find new matches.

21 Delete Old Matches
Deleting is trivial since we store the matches in constant space:
- P is periodic: truncate the chain of pattern occurrences.
- P is non-periodic: discard all matches that are within distance m of the replacement, i.e. those that start at most m-1 positions before it.

22 Find New Matches
- Challenge: how can we locate occurrences of P, following each replacement, without actually searching for P?

23 Main Idea - Text Covers
We ‘cover’ the text with substrings of the pattern, i.e. store the text in terms of P.
Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)
Text = g a g a g c t a g c g a g c a t
Cover: g a g a g c [2,7], t, a g c [5,7], g a g c [4,7], a [1,1], t

24 Text Cover (cont.)
The text cover must satisfy two properties:
- Substring property: each element of the cover is a substring of P, or a character not included in P.
- Maximality property: no two adjacent elements can be concatenated to form a substring of P.

25 Text Cover (cont.)
Initially, in the static stage, we construct a text cover for T. We ensure that the cover satisfies both the substring and the maximality properties.
How does a replacement in the text affect the text cover?
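
One simple way to obtain a cover with both properties is a greedy left-to-right scan that always takes the longest piece that is still a substring of P. The sketch below assumes this greedy strategy and uses naive substring tests; the slides do not say how the authors build the cover, and an efficient version would presumably use a suffix tree of P. Each piece is returned as (string, interval), where the interval is a 1-based occurrence in P, or None for a character that does not appear in P.

```python
def text_cover(text: str, pattern: str):
    """Greedy cover of text by maximal substrings of pattern."""
    cover = []
    i = 0
    while i < len(text):
        if text[i] not in pattern:
            cover.append((text[i], None))   # character not included in P
            i += 1
            continue
        j = i + 1
        # Extend the piece while it remains a substring of P.
        while j < len(text) and text[i:j + 1] in pattern:
            j += 1
        start = pattern.find(text[i:j])     # leftmost occurrence in P (0-based)
        cover.append((text[i:j], (start + 1, start + j - i)))
        i = j
    return cover


for piece in text_cover("gagagctagcgagcat", "agagagc"):
    print(piece)
```

Because each piece is the longest possible one from its starting position, appending the next piece can never yield a substring of P, so the maximality property holds by construction.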

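26 Text Cover Following Replacement
Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)
Text = g a g a g c t a g c g a g c a t, with position 10 changed from c to a
Cover before: g a g a g c, t, a g c, g a g c, a, t = (2,7) - (5,7) (4,7) (1,1) -
Split at the change: (2,7) - (5,6) (1,1) (4,7) (1,1) -
Merge adjacent pieces: (5,6) + (1,1) = (1,3), then (1,3) + (4,7) = (1,7)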

27 Updating the Text Cover At most 5 pieces can violate the maximality property.

28 Substring Concatenation Query
- Query: given two substrings of P, P[i,j] and P[k,l], is their concatenation also a substring of P?
- Query time: O(log log m).
- Preprocessing time: (also uses - [BG00])
Hence, in O(log log m) time we can update the cover so that it satisfies both properties.
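
To make the query concrete, here is a naive O(m)-time version. It only illustrates what the query returns and is not the O(log log m) data structure referred to on the slide; `concat_query` is a hypothetical name, and all intervals are 1-based and inclusive.

```python
def concat_query(pattern: str, i: int, j: int, k: int, l: int):
    """Return an interval of P equal to P[i..j] + P[k..l], or None if there is none."""
    joined = pattern[i - 1:j] + pattern[k - 1:l]
    start = pattern.find(joined)          # naive substring search in P
    if start < 0:
        return None
    return (start + 1, start + len(joined))


P = "agagagc"
print(concat_query(P, 5, 6, 1, 1))  # (1, 3): 'ag' + 'a' = 'aga'
print(concat_query(P, 1, 3, 4, 7))  # (1, 7): 'aga' + 'gagc' = 'agagagc'
print(concat_query(P, 2, 7, 5, 7))  # None
```

The first two calls are the merges needed after the replacement at position 10 in the running example.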

29 Find New Matches
- Given: a text cover which satisfies both the substring and maximality properties.
- Find: all new locations of the pattern in the text.

30 Key Observations
- A new match must begin within distance m of the change, i.e. at most m-1 positions before it.
- A new match can include at most one entire piece of the cover.
- It can span at most three pieces of the cover.

31 Furthermore
A new match can begin in one of at most three pieces of the cover:
- the piece with the change
- the previous piece
- the one previous to that

32 Simplified Problem
- The search starts within a piece X of the cover.
- Simple O(m) time algorithm:
  - Check each location in X for a pattern start.
  - Use suffix trees and LCA queries to compare substrings in constant time.
[Figure: pattern P, text T, cover piece X]

33 Improved Algorithm
- Really, we only have to check each suffix of X that is a pattern prefix, e.g. X = a g a g a.
- The KMP automaton can give the necessary information.
However, the time is still O(m)!

34 Improved Algorithm
- We can group the prefixes of P by their periods.
- Each group of prefixes can be checked in constant time!
- There are at most O(log m) groups.

35 Groups (e.g.)
Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)
X = a g a g a
There are three suffixes of X that are also pattern prefixes: { agaga, aga } and { a }.
Prefixes with the same period fall into a single group.
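
The grouping in this example can be reproduced with a short brute-force sketch. It assumes that a group is keyed by the smallest period of the prefix, which is one reading of the slide; the function names are illustrative and the code makes no attempt at the constant-time checks discussed in the talk.

```python
from collections import defaultdict


def smallest_period(s: str) -> int:
    """Smallest p with s[i] == s[i + p] for all valid i, via the border array."""
    border = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = border[k - 1]
        if s[i] == s[k]:
            k += 1
        border[i] = k
    return len(s) - border[-1]


def border_groups(x: str, pattern: str) -> dict[int, list[int]]:
    """Map period -> lengths of the suffixes of x that are prefixes of pattern."""
    groups = defaultdict(list)
    for k in range(min(len(x), len(pattern)), 0, -1):
        if x.endswith(pattern[:k]):
            groups[smallest_period(pattern[:k])].append(k)
    return dict(groups)


print(border_groups("agaga", "agagagc"))  # {2: [5, 3], 1: [1]}
```

The prefixes of lengths 5 and 3 (agaga, aga) share period 2, and the prefix of length 1 (a) forms its own group, matching the two groups on the slide.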

36 Checking a Group in Constant Time
Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)
X = a g a g a
Idea: match the period ‘ag’ as far as possible. As soon as (ag)* doesn’t match, check for a ‘c’.
[Figure: the pattern a g a g a g c aligned against the text that follows X]

37 Groups
- A string of length m cannot have more than O(log m) border groups.
- Hence, the time of the algorithm is O(log m).
[Intuition: each new group has a new period, which has to be at least double the size of the old period, e.g. aagaagaa.]

38 Even Better...
- We check only a constant number of groups.
- Choosing these O(1) groups takes O(log log m) time.
- Hence, our algorithm takes O(log log m) time per replacement.

39 Open Problems
- Allowing insertions and deletions to the text.
- Searching for a set of multiple static patterns.

