# Dictionary Matching with One Gap. Amihood Amir, Avivit Levy, Ely Porat and B. Riva Shalom. CPM 2014.

## Presentation transcript


CPM 2014 - Moscow

Mind the gap!

### Outline

- The DMG (Dictionary Matching with one Gap) problem
- Motivation
- Previous work
- Bi-directional suffix trees solution
- Lookup table addition
- Open problems

### The DMG problem

A gapped pattern is a pattern $P$ of the form

$$P_1\{\alpha_1,\beta_1\}P_2\{\alpha_2,\beta_2\}\ldots P_{k-1}\{\alpha_{k-1},\beta_{k-1}\}P_k$$

Each $P_j$ is over the alphabet $\Sigma$, and $\{\alpha_j,\beta_j\}$ is a sequence of at least $\alpha_j$ and at most $\beta_j$ don't cares, written @.

Example: aba{3,6}cbb stands for

- aba@@@cbb
- aba@@@@cbb
- aba@@@@@cbb
- aba@@@@@@cbb
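The expansion in the example can be reproduced with a few lines of Python. This is a minimal sketch for a single gap; the function name `expand_gap` and its parameters are hypothetical names chosen for illustration, not notation from the talk.

```python
def expand_gap(first: str, alpha: int, beta: int, second: str,
               dont_care: str = "@") -> list[str]:
    """List every expansion of first{alpha,beta}second, one per gap length."""
    return [first + dont_care * gap + second for gap in range(alpha, beta + 1)]


expansions = expand_gap("aba", 3, 6, "cbb")
for s in expansions:
    print(s)
```

A gap {α, β} therefore yields β − α + 1 expansions, which is why the gap range shows up as a factor in the query time later in the talk.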

### The DMG problem

The DMG problem is:

- Preprocess: a dictionary $D$ of $d$ gapped patterns $P_1,\ldots,P_d$ over the alphabet $\Sigma$.
- Query: a text $T$ of length $n$ over the alphabet $\Sigma$.
- Output: all locations in $T$ where a dictionary gapped pattern ends.

We focus on DMG with a single gap.

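### Example

Dictionary:

- $P_1$ = aba{3,6}cbb
- $P_2$ = ab{3,6}bbac
- $P_3$ = aa{3,6}ac

Query text:

| Location | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Text | a | b | a | a | b | a | c | b | b | a | c |

The segment of $P_i$ before the gap is labeled $P_{i,1}$ and the segment after the gap is labeled $P_{i,2}$:

$$\text{First} = \bigcup_{1\le i\le d}\{P_{i,1}\}, \qquad \text{Second} = \bigcup_{1\le i\le d}\{P_{i,2}\}$$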
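To see what output the example query should produce, here is a brute-force check. It is a naive baseline, not the algorithm presented in the talk, and `naive_dmg` is a hypothetical helper name. Locations are 1-based, as on the slide.

```python
def naive_dmg(dictionary, text):
    """Return the set of (pattern number, end location) pairs, both 1-based.

    dictionary: list of (first, alpha, beta, second) tuples.
    """
    hits = set()
    n = len(text)
    for i, (first, alpha, beta, second) in enumerate(dictionary, start=1):
        for start in range(n - len(first) + 1):
            if not text.startswith(first, start):
                continue
            gap_begin = start + len(first)      # 0-based index right after P_{i,1}
            for gap in range(alpha, beta + 1):  # try every allowed gap length
                s = gap_begin + gap             # 0-based start of P_{i,2}
                if text.startswith(second, s):
                    hits.add((i, s + len(second)))  # 1-based end location
    return hits


dictionary = [("aba", 3, 6, "cbb"), ("ab", 3, 6, "bbac"), ("aa", 3, 6, "ac")]
result = naive_dmg(dictionary, "abaabacbbac")
print(sorted(result))  # [(1, 9), (2, 11), (3, 11)]
```

On the example text, $P_1$ ends at location 9 (gap of 3), while $P_2$ and $P_3$ both end at location 11 (gaps of 5).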

### Motivation

- Computational biology.
- Renewed interest due to cyber security: network intrusion detection systems perform protocol analysis, content searching and content matching to detect harmful software.
- Malware may appear in several packets!

### Previous work

The gapped pattern matching problem has been studied for a few decades, e.g. [Myers, JACM 1992], [Navarro & Raffinot, Algorithmica 2004], [Bille & Thorup, ICALP 2009], [Bille & Thorup, SODA 2010], [Morgante et al., JCB 2005], [Rahman et al., COCOON 2006], [Bille et al., TCS 2012].

The DMG problem has not been studied enough! [Kucherov & Rusinowitch, TCS 1997], [Zhang et al., IPL 2010]: no bounds on the length of the gap.

### Bi-directional suffix trees algorithm

Gapped pattern: ab{3,6}bbac

Query: a b a a b a c b b a c

### Bi-directional suffix trees algorithm

Idea: view as [Amir et al., JAL 2000].

Gapped patterns:

- $P_1$ = aba{3,6}abac
- $P_2$ = aba{3,6}bba
- $P_3$ = ab{3,6}baa

Query: a b a a b a c b b a c

- Use a suffix tree $T_S$ of Second.
- Use a suffix tree $T_F^R$ of First$^R$, the reversed first segments.

### Bi-directional suffix trees algorithm

For each text location $l$:

1. Insert $t_l t_{l+1}\ldots t_n$ into $T_S$, reaching node $h$, to find the labels on the path to $h$. This finds every $P_{i,2}$ starting at location $l$.
2. For $f = l-\beta-1$ to $l-\alpha-1$: insert $t_f t_{f-1}\ldots t_1$ into $T_F^R$, reaching node $g$, to find the labels on the path to $g$. This finds every $P_{i,1}$ ending at location $f$.
3. Output the intersection of the two label sets (for end locations).
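The loop structure above can be sketched in Python. To stay short, this sketch swaps the suffix trees for plain tries over the Second segments and the reversed First segments, and computes the intersection with Python sets, so it does not achieve the bounds claimed in the talk. It also assumes all patterns share the same gap bounds {α, β}. The names `build_trie`, `labels_on_path` and `dmg_one_gap` are hypothetical.

```python
def build_trie(words):
    """Trie as nested dicts; the key None stores ids of words ending at a node."""
    root = {}
    for idx, word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(idx)
    return root


def labels_on_path(trie, chars):
    """Walk the trie along chars, yielding every id marked on the path."""
    node = trie
    yield from node.get(None, [])
    for ch in chars:
        node = node.get(ch)
        if node is None:
            return
        yield from node.get(None, [])


def dmg_one_gap(dictionary, alpha, beta, text):
    """dictionary: list of (first, second). Returns {(pattern no., end location)}."""
    t_s = build_trie((i, sec) for i, (_, sec) in enumerate(dictionary, 1))
    t_fr = build_trie((i, fst[::-1]) for i, (fst, _) in enumerate(dictionary, 1))
    out = set()
    for l in range(1, len(text) + 1):            # 1-based start of a P_{i,2}
        seconds = set(labels_on_path(t_s, text[l - 1:]))
        if not seconds:
            continue
        # f = l-beta-1 .. l-alpha-1, clipped to valid text locations
        for f in range(max(1, l - beta - 1), l - alpha):
            firsts = set(labels_on_path(t_fr, reversed(text[:f])))
            for i in firsts & seconds:           # both halves found: report the end
                out.add((i, l + len(dictionary[i - 1][1]) - 1))
    return out


dictionary = [("aba", "cbb"), ("ab", "bbac"), ("aa", "ac")]
result = dmg_one_gap(dictionary, 3, 6, "abaabacbbac")
print(sorted(result))  # [(1, 9), (2, 11), (3, 11)]
```

Walking the text backwards from location $f$ in the reversed trie is what makes the approach bi-directional: a single backward walk finds every first segment ending at $f$.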

### Bi-directional suffix trees algorithm: intersection

Patterns: {(1,4), (2,9), (3,7), …, (6,5), …}

[Figure: tree $T_F^R$ with nodes labeled 3, 6, 9, 1 and query node $g$; tree $T_S$ with nodes labeled 5, 7, 2 and query node $h$. Range: [1,9]. Range: [2,7].]

### Bi-directional suffix trees algorithm (continued)

Intersection via range queries.

[Figure: grid of pattern points (1,4), (3,7), (6,5), (8,8), (2,9), queried with Range: [2,7] and Range: [1,9].]
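The intersection step can be read as an orthogonal range reporting query: each pattern is a point, and the two ranges bound a rectangle. The sketch below uses a linear scan in place of the range-reporting structure cited in the talk, and it assumes that [1,9] applies to the first coordinate and [2,7] to the second. The transcript does not say which range belongs to which axis, so that pairing is a guess.

```python
def points_in_rectangle(points, x_range, y_range):
    """Report the points inside the closed rectangle x_range x y_range."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [p for p in points if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi]


points = [(1, 4), (3, 7), (6, 5), (8, 8), (2, 9)]
inside = points_in_rectangle(points, (1, 9), (2, 7))
print(inside)  # [(1, 4), (3, 7), (6, 5)]
```

Under this assumption, (8,8) and (2,9) fall outside the rectangle because their second coordinate exceeds 7.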

### Time & space

Preprocessing time:

- Dictionary segments suffix tree and reverse suffix tree: $O(|D|)$.
- Preprocessing the grid for range queries: $O(d \log d)$ [Chan et al., SoCG 2011].

Preprocessing space:

- Dictionary segments suffix tree and reverse suffix tree: $O(|D|)$.
- Space for the grid: $O(d \log^{\varepsilon} d)$ [Chan et al., SoCG 2011].

### Time & space

Query time:

- For each end text location, we try every gap size: a factor of $\beta-\alpha$.
- The number of range queries is the number of vertical paths in a given path: $O(\log^2 \min\{d, \log |D|\})$.
- A range query costs $O(\log\log d + \mathrm{occ})$ [Chan et al., SoCG 2011].

Total: $O(n(\beta-\alpha)\log\log d\,\log^2 \min\{d, \log |D|\} + \mathrm{occ})$.

### Lookup table algorithm

Idea: instead of using range queries in a grid to compute the intersection, we use a pre-computed lookup table. This enables intersection in $O(\mathrm{occ})$ time.

Total query time becomes $O(n(\beta-\alpha) + \mathrm{occ})$.

### Lookup table algorithm

Inter[g,h] = all $i$ such that $P_{i,1}^R$ appears on the path from the root of $T_F^R$ to node $g$ and $P_{i,2}$ appears on the path from the root of $T_S$ to node $h$.

[Figure: tree $T_F^R$ with nodes labeled 3, 6, 9, 1 and query node $g$; tree $T_S$ with nodes labeled 5, 7, 2 and query node $h$.]

$P_1$ = (1,4), $P_2$ = (2,9), $P_3$ = (3,7), $P_4$ = (3,2), …, $P_6$ = (6,5), $P_7$ = (9,6)

- Inter[3,5] = {4}
- Inter[3,7] = {3,4}
- Inter[6,7] = {3,4,6}
- Inter[9,7] = {3,4,6}
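The listed Inter entries can be checked directly against the definition. The sketch below makes two assumptions that are not stated in the transcript. First, the tree shapes (3 above 6 above 9 in $T_F^R$, and 2 above 5 above 7 in $T_S$) are inferred from the Inter values themselves, with every other node hanging off an unnumbered root. Second, $P_5$ is elided on the slide, so it is left out. The names `F_PARENT`, `S_PARENT`, `path_to` and `inter` are hypothetical.

```python
# Parent pointers; None means the parent is the (unnumbered) root. ASSUMED shapes.
F_PARENT = {3: None, 6: 3, 9: 6, 1: None, 2: None}            # T_F^R
S_PARENT = {2: None, 5: 2, 7: 5, 4: None, 9: None, 6: None}   # T_S

# Pattern number -> (node of P_{i,1}^R in T_F^R, node of P_{i,2} in T_S).
PATTERNS = {1: (1, 4), 2: (2, 9), 3: (3, 7), 4: (3, 2), 6: (6, 5), 7: (9, 6)}


def path_to(node, parent):
    """Numbered nodes on the path from the root down to `node`, inclusive."""
    path = set()
    while node is not None:
        path.add(node)
        node = parent[node]
    return path


def inter(g, h):
    """All i whose first node lies on the path to g and second node on the path to h."""
    on_f, on_s = path_to(g, F_PARENT), path_to(h, S_PARENT)
    return {i for i, (a, b) in PATTERNS.items() if a in on_f and b in on_s}


print(inter(3, 5), inter(3, 7), inter(6, 7), inter(9, 7))
```

With these assumed shapes the four entries come out as {4}, {3,4}, {3,4,6} and {3,4,6}. $P_7$ = (9,6) is excluded from Inter[9,7] because its second node, 6, is not on the path to node 7 in $T_S$.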

### Lookup table algorithm

$P_1$ = (1,4), $P_2$ = (2,9), $P_3$ = (3,7), $P_4$ = (3,2), …, $P_6$ = (6,5), $P_7$ = (9,6)

- Inter[3,5] = {4}
- Inter[3,7] = {3,4}
- Inter[6,7] = {3,4,6}

[Figure: the lookup table, and the trees $T_F^R$ (nodes 3, 6, 9, 1) and $T_S$ (nodes 5, 7, 2).]

### Lookup table algorithm

Preprocessing:

- Time: the table can be computed using DP in time $O(d^2\,\mathrm{ovr} + |D|)$, where ovr is the number of subpatterns that include another subpattern as a prefix or suffix.
- Space: $O(d^2 + |D|)$.

Query time: $O(n(\beta-\alpha) + \mathrm{occ})$.
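One way a dynamic program could fill the table is to build each entry from the entries of the parent nodes: Inter[g,h] is the union of Inter[parent(g),h], Inter[g,parent(h)] and the patterns sitting exactly at (g,h). The transcript does not spell out the authors' recurrence, so this is an assumed formulation, and storing full sets per entry does not reproduce the stated bounds. The tree shapes are the same assumption as before, inferred from the Inter values on the slides, with $P_5$ omitted because it is elided there.

```python
from functools import lru_cache

# Parent pointers; None means the parent is the (unnumbered) root. ASSUMED shapes.
F_PARENT = {3: None, 6: 3, 9: 6, 1: None, 2: None}            # T_F^R
S_PARENT = {2: None, 5: 2, 7: 5, 4: None, 9: None, 6: None}   # T_S
PATTERNS = {1: (1, 4), 2: (2, 9), 3: (3, 7), 4: (3, 2), 6: (6, 5), 7: (9, 6)}

# Patterns located exactly at a node pair (g, h).
OWN = {}
for i, pair in PATTERNS.items():
    OWN.setdefault(pair, set()).add(i)


@lru_cache(maxsize=None)
def inter(g, h):
    """Inter[g,h] via the parent recurrence; empty once either path is exhausted."""
    if g is None or h is None:
        return frozenset()
    return (frozenset(OWN.get((g, h), ()))
            | inter(F_PARENT[g], h)
            | inter(g, S_PARENT[h]))


table = {(g, h): inter(g, h) for g in F_PARENT for h in S_PARENT}
print(sorted(table[(9, 7)]))  # [3, 4, 6]
```

The recurrence is correct because a pattern counted in Inter[g,h] either sits exactly at (g,h), or has its first node strictly above $g$, or has its second node strictly above $h$.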

### Our results

| | Bi-directional suffix trees & range queries | Bi-directional suffix trees & lookup table |
| --- | --- | --- |
| Preprocessing time | $O(d \log d + \lvert D\rvert)$ | $O(d^2\,\mathrm{ovr} + \lvert D\rvert)$ |
| Space | $O(d \log^{\varepsilon} d + \lvert D\rvert)$ | $O(d^2 + \lvert D\rvert)$ |
| Query time | $O(n(\beta-\alpha)\log\log d\,\log^2(\min\{d, \log \lvert D\rvert\}) + \mathrm{occ})$ | $O(n(\beta-\alpha) + \mathrm{occ})$ |

### Open problems

- Generalizing to $k$ gaps
- Reducing the dependency on the size $\beta-\alpha$
- Scalability to different gap bounds in the dictionary
- Online algorithm

Thank you!
