Published by Alan Shropshire. Modified about 1 year ago.

1
Dictionary Matching with One Gap
Amihood Amir, Avivit Levy, Ely Porat and B. Riva Shalom
CPM 2014

2
CPM Moscow

3
MIND THE GAP!

4
Outline
The DMG (Dictionary Matching with one Gap) Problem
Motivation
Previous Work
Bidirectional Suffix Trees Solution
Lookup Table Addition
Open Problems

5
The DMG Problem
A gapped pattern is a pattern P of the form: P_1 {α_1, β_1} P_2 {α_2, β_2} … P_{k-1} {α_{k-1}, β_{k-1}} P_k
Each P_j is a string over the alphabet Σ, and {α_j, β_j} is a sequence of at least α_j and at most β_j don't-care characters.
Example: aba{3,6}cbb
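To make the definition concrete, here is a minimal Python sketch that translates one gapped pattern into a regular expression and lists where it ends in a text. The function names `gapped_regex` and `end_locations` are hypothetical, and the regex translation is only an illustration of the definition, not the algorithm of the talk.

```python
import re

def gapped_regex(segments, gaps):
    """Translate segments P_1..P_k and gaps [(alpha_j, beta_j), ...] into a regex."""
    assert len(gaps) == len(segments) - 1
    parts = [re.escape(segments[0])]
    for (alpha, beta), segment in zip(gaps, segments[1:]):
        parts.append(".{%d,%d}" % (alpha, beta))  # between alpha and beta don't cares
        parts.append(re.escape(segment))
    return "".join(parts)

def end_locations(segments, gaps, text):
    """Return the 1-based locations of text at which an occurrence ends."""
    # Anchor the pattern at the end of each prefix text[:e]; the regex engine
    # backtracks over all start positions and gap lengths.
    anchored = re.compile("(?:" + gapped_regex(segments, gaps) + r")\Z", re.DOTALL)
    return [e for e in range(1, len(text) + 1) if anchored.search(text[:e])]

print(gapped_regex(["aba", "cbb"], [(3, 6)]))
print(end_locations(["aba", "cbb"], [(3, 6)], "abaxxxxcbb"))
```

Checking every prefix separately is quadratic in the text length, which is acceptable for illustrating the definition but not for large inputs.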

6
The DMG Problem
The DMG problem is:
Preprocess: a dictionary D of d gapped patterns P_1, …, P_d over the alphabet Σ.
Query: a text T of length n over the alphabet Σ.
Output: all locations in T where a gapped dictionary pattern ends.
We focus on DMG with a single gap.

7
Example
Dictionary: P_1 = aba{3,6}cbb, P_2 = ab{3,6}bbac, P_3 = aa{3,6}ac
Query text: a b a a b a c b b a c
Each pattern P_i consists of a first segment P_{i,1} (before the gap) and a second segment P_{i,2} (after the gap).
First = ∪_{1≤i≤d} {P_{i,1}}
Second = ∪_{1≤i≤d} {P_{i,2}}
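The example can be checked with a brute-force baseline. The sketch below is an assumption-laden illustration (the name `naive_dmg` and the tuple layout of the dictionary are hypothetical): for every occurrence of a second segment it tries every allowed gap size and tests whether the first segment sits just before the gap.

```python
def naive_dmg(dictionary, text):
    """dictionary: {i: (first, alpha, beta, second)}.
    Returns sorted (end_location, i) pairs with 1-based end locations."""
    hits = set()
    for i, (first, alpha, beta, second) in dictionary.items():
        start = text.find(second)              # 0-based start of the second segment
        while start != -1:
            for gap in range(alpha, beta + 1):
                begin = start - gap - len(first)
                if begin >= 0 and text.startswith(first, begin):
                    hits.add((start + len(second), i))
                    break                      # one witness per end location is enough
            start = text.find(second, start + 1)
    return sorted(hits)

example = {
    1: ("aba", 3, 6, "cbb"),
    2: ("ab", 3, 6, "bbac"),
    3: ("aa", 3, 6, "ac"),
}
print(naive_dmg(example, "abaabacbbac"))
```

On the slide's query text this reports P_1 ending at location 9 and both P_2 and P_3 ending at location 11.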

8
Motivation
Computational biology.
Renewed interest due to cyber security: network intrusion detection systems perform protocol analysis, content searching and content matching to detect harmful software.
Malware may appear in several packets!

9
Previous Work
The gapped pattern matching problem has been studied for a few decades, e.g. [Myers, JACM 1992], [Navarro & Raffinot, Algorithmica 2004], [Bille & Thorup, ICALP 2009], [Bille & Thorup, SODA 2010], [Morgante et al., JCB 2005], [Rahman et al., COCOON 2006], [Bille et al., TCS 2012].
The DMG problem has not been studied enough! [Kucherov & Rusinowitch, TCS 1997] and [Zhang et al., IPL 2010] assume no bounds on the length of the gap.

10
Bi-directional Suffix Trees Algorithm
Gapped pattern: ab{3,6}bbac
Query: a b a a b a c b b a c

11
Bi-directional Suffix Trees Algorithm
Idea: view as in [Amir et al., JAL 2000].
Gapped patterns: P_1 = aba{3,6}abac, P_2 = aba{3,6}bba, P_3 = ab{3,6}baa
Query: a b a a b a c b b a c
Use a suffix tree T_S of Second for the part after the gap.
Use a suffix tree T_F^R of First^R (the reversed first segments) for the part before the gap.

12
Bi-directional Suffix Trees Algorithm
For each text location l:
Insert t_l t_{l+1} … t_n into T_S (reaching node h) to find the labels on the path to h. This finds every P_{i,2} starting at location l.
For f = l - β - 1 to l - α - 1: insert t_f t_{f-1} … t_1 into T_F^R (reaching node g) to find the labels on the path to g. This finds every P_{i,1} ending at location f.
Output the intersection (for end locations).
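The loop above can be sketched in Python. As a simplification, the sketch swaps the suffix trees T_S and T_F^R for plain tries over Second and over the reversed strings of First, and it intersects the two label sets with ordinary Python sets rather than range queries or a lookup table. It also assumes that all patterns share the same gap bounds α and β. All function names are hypothetical.

```python
def build_trie(labelled_words):
    """Build a trie from (pattern id, word) pairs; each node records the ids ending there."""
    trie = [{"next": {}, "ids": []}]           # node 0 is the root
    for idx, word in labelled_words:
        node = 0
        for ch in word:
            child = trie[node]["next"].get(ch)
            if child is None:
                trie.append({"next": {}, "ids": []})
                child = len(trie) - 1
                trie[node]["next"][ch] = child
            node = child
        trie[node]["ids"].append(idx)
    return trie

def ids_on_path(trie, chars):
    """Walk down the trie along chars and collect the ids met on the path."""
    found, node = set(), 0
    for ch in chars:
        node = trie[node]["next"].get(ch)
        if node is None:
            break
        found.update(trie[node]["ids"])
    return found

def dmg_bidirectional(dictionary, alpha, beta, text):
    """dictionary: {i: (first, second)}. Returns sorted (end_location, i), 1-based ends."""
    trie_first_rev = build_trie((i, p1[::-1]) for i, (p1, _) in dictionary.items())
    trie_second = build_trie((i, p2) for i, (_, p2) in dictionary.items())
    out = set()
    for l in range(len(text)):                 # 0-based start of a second segment
        second_hits = ids_on_path(trie_second, text[l:])
        if not second_hits:
            continue
        for f in range(l - beta - 1, l - alpha):   # 0-based end of a first segment
            if f < 0:
                continue
            first_hits = ids_on_path(trie_first_rev, reversed(text[:f + 1]))
            for i in first_hits & second_hits:
                out.add((l + len(dictionary[i][1]), i))
    return sorted(out)

example = {1: ("aba", "cbb"), 2: ("ab", "bbac"), 3: ("aa", "ac")}
print(dmg_bidirectional(example, 3, 6, "abaabacbbac"))
```

The slicing in `text[l:]` and `text[:f + 1]` copies the text on every step, so this sketch does not reproduce the running times quoted in the talk; it only mirrors the control flow.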

13
Bi-directional Suffix Trees Algorithm: Intersection
Patterns: {(1,4), (2,9), (3,7), …, (6,5), …}
[Figure: the trees T_S and T_F^R with the nodes h and g, and the ranges [1,9] and [2,7].]

14
Bi-directional Suffix Trees Algorithm (continued)
Intersection via range queries.
[Figure: a grid with the points (1,4), (3,7), (6,5), (8,8), (2,9) and the query ranges [2,7] and [1,9].]
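The intersection step treats each pattern as a point on a grid and asks which points fall inside a rectangle. The Python sketch below illustrates that idea with a brute-force scan, which stands in for the range-reporting structure of [Chan et al., SoCG 2011]. The transcript does not say which range belongs to which axis, so the sketch assumes [1,9] bounds the first coordinate and [2,7] the second; `range_report` is a hypothetical name.

```python
def range_report(points, x_range, y_range):
    """Report the points (x, y) with x in x_range and y in y_range (both inclusive)."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [(x, y) for (x, y) in points
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi]

# Points listed on the slide; the axis assignment of the ranges is an assumption.
points = [(1, 4), (3, 7), (6, 5), (8, 8), (2, 9)]
print(range_report(points, (1, 9), (2, 7)))
```

Under this assumption the query reports (1,4), (3,7) and (6,5), while (8,8) and (2,9) fall outside the second range.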

15
Time & Space
Preprocessing time:
Suffix tree and reverse suffix tree of the dictionary segments: O(|D|).
Preprocessing the grid for range queries: O(d log d) [Chan et al., SoCG 2011].
Preprocessing space:
Suffix tree and reverse suffix tree of the dictionary segments: O(|D|).
Space for the grid: O(d log d) [Chan et al., SoCG 2011].

16
Time & Space
Query time:
For each end text location, we try every gap size: a factor of (β - α).
The number of range queries is the number of vertical paths in a given path: O(log² min{d, log |D|}).
A range query costs O(log log d + occ) [Chan et al., SoCG 2011].
Total: O(n(β - α) log log d log² min{d, log |D|} + occ).

17
Lookup Table Algorithm
Idea: instead of using range queries in a grid to compute the intersection, we use a precomputed lookup table.
This enables intersection in O(occ) time.
Total query time becomes O(n(β - α) + occ).

18
Lookup Table Algorithm
Inter[g,h] = all i such that P_{i,1}^R appears on the path from the root of T_F^R to node g and P_{i,2} appears on the path from the root of T_S to node h.
P_1 = (1,4), P_2 = (2,9), P_3 = (3,7), P_4 = (3,2), …, P_6 = (6,5), P_7 = (9,6)
Inter[3,5] = {4}

19
Lookup Table Algorithm
Inter[g,h] = all i such that P_{i,1}^R appears on the path from the root of T_F^R to node g and P_{i,2} appears on the path from the root of T_S to node h.
P_1 = (1,4), P_2 = (2,9), P_3 = (3,7), P_4 = (3,2), …, P_6 = (6,5), P_7 = (9,6)
Inter[3,5] = {4}
Inter[3,7] = {3,4}

20
Lookup Table Algorithm
Inter[g,h] = all i such that P_{i,1}^R appears on the path from the root of T_F^R to node g and P_{i,2} appears on the path from the root of T_S to node h.
P_1 = (1,4), P_2 = (2,9), P_3 = (3,7), P_4 = (3,2), …, P_6 = (6,5), P_7 = (9,6)
Inter[3,5] = {4}
Inter[3,7] = {3,4}
Inter[6,7] = {3,4,6}

21
Lookup Table Algorithm
Inter[g,h] = all i such that P_{i,1}^R appears on the path from the root of T_F^R to node g and P_{i,2} appears on the path from the root of T_S to node h.
P_1 = (1,4), P_2 = (2,9), P_3 = (3,7), P_4 = (3,2), …, P_6 = (6,5), P_7 = (9,6)
Inter[3,5] = {4}
Inter[3,7] = {3,4}
Inter[6,7] = {3,4,6}
Inter[9,7] = {3,4,6}

22
Lookup Table Algorithm
P_1 = (1,4), P_2 = (2,9), P_3 = (3,7), P_4 = (3,2), …, P_6 = (6,5), P_7 = (9,6)
Inter[3,5] = {4}
Inter[3,7] = {3,4}
Inter[6,7] = {3,4,6}
[Figure: the lookup table, with rows and columns indexed by tree nodes.]

23
Lookup Table Algorithm
Preprocessing:
Time: the table can be computed by dynamic programming in O(d²·ovr + |D|) time, where ovr is the number of subpatterns that include another subpattern as a prefix or suffix.
Space: O(d² + |D|).
Query time: O(n(β - α) + occ).
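One way to picture the dynamic programming is the following Python sketch. It is an illustration under stated assumptions, not the construction from the talk: both trees are given as hypothetical parent arrays in which a parent always has a smaller index than its children, each pattern is placed at one node per tree, and every table cell stores a full Python set, so the sketch makes no attempt to meet the bounds above. The recurrence used is Inter[g,h] = Inter[parent(g),h] ∪ {i placed at g whose second node lies on the path to h}.

```python
def build_inter(parent_f, parent_s, placement):
    """parent_f / parent_s: parent arrays of T_F^R / T_S (root is node 0, parent None).
    placement: {pattern id: (node in T_F^R, node in T_S)}.
    Returns inter with inter[g][h] = ids on both root-to-g and root-to-h paths."""
    nf, ns = len(parent_f), len(parent_s)
    at_f = [set() for _ in range(nf)]
    at_s = [set() for _ in range(ns)]
    for i, (g, h) in placement.items():
        at_f[g].add(i)
        at_s[h].add(i)
    # path_s[h]: ids whose second segment ends somewhere on the path root..h
    path_s = []
    for h in range(ns):
        above = path_s[parent_s[h]] if parent_s[h] is not None else set()
        path_s.append(above | at_s[h])
    inter = [[set() for _ in range(ns)] for _ in range(nf)]
    for g in range(nf):
        for h in range(ns):
            above = inter[parent_f[g]][h] if parent_f[g] is not None else set()
            inter[g][h] = above | (at_f[g] & path_s[h])
    return inter

# Hypothetical toy trees: T_F^R is the chain 0-1-2; in T_S, nodes 1 and 2 hang
# off the root and node 3 hangs off node 2.
inter = build_inter([None, 0, 1], [None, 0, 0, 2],
                    {1: (1, 1), 2: (2, 3), 3: (1, 2)})
print(inter[2][3])
```

At query time a pair of nodes (g, h) reached in the two trees indexes the table directly, which is what removes the range-query factor from the query bound.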

24
Our Results
Bi-directional suffix trees & range queries:
Preprocessing time: O(d log d + |D|). Space: O(d log d + |D|). Query time: O(n(β - α) log log d log²(min{d, log |D|}) + occ).
Bi-directional suffix trees & lookup table:
Preprocessing time: O(d²·ovr + |D|). Space: O(d² + |D|). Query time: O(n(β - α) + occ).

25
Open Problems
Generalizing to k gaps
Reducing the dependency on the size β - α
Scalability to different gap bounds in the dictionary
Online algorithm

26
Thank You!
