Dictionary Matching with One Gap. Amihood Amir, Avivit Levy, Ely Porat and B. Riva Shalom. CPM 2014.

Presentation transcript:


CPM Moscow

Mind the gap!

Outline
- The DMG (Dictionary Matching with one Gap) problem
- Motivation
- Previous work
- Bidirectional suffix trees solution
- Lookup table addition
- Open problems

The DMG Problem
A gapped pattern is a pattern P of the form P_1 {α_1, β_1} P_2 {α_2, β_2} … P_{k-1} {α_{k-1}, β_{k-1}} P_k.
- Each P_j is a string over the alphabet Σ.
- {α_j, β_j} is a sequence of at least α_j and at most β_j don't cares.
Example: aba{3,6}cbb
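One way to make the definition concrete is to translate a gapped pattern into a regular expression, where each gap {α_j, β_j} becomes a bounded run of wildcards. The sketch below is illustrative only; the helper name `gapped_to_regex` and its argument layout are assumptions, not part of the talk.

```python
import re

def gapped_to_regex(segments, gaps):
    """Build a regex for P_1 {a_1,b_1} P_2 ... P_k (hypothetical helper).

    segments: list of k literal strings.
    gaps: list of k-1 (alpha, beta) pairs, one per gap.
    """
    if len(gaps) != len(segments) - 1:
        raise ValueError("need exactly one gap between consecutive segments")
    parts = [re.escape(segments[0])]
    for (alpha, beta), seg in zip(gaps, segments[1:]):
        parts.append(".{%d,%d}" % (alpha, beta))  # alpha..beta don't cares
        parts.append(re.escape(seg))
    # DOTALL so that a don't care also matches a newline character.
    return re.compile("".join(parts), re.DOTALL)

pattern = gapped_to_regex(["aba", "cbb"], [(3, 6)])
print(bool(pattern.fullmatch("abaabacbb")))  # gap "aba" has length 3
print(bool(pattern.fullmatch("abaabcbb")))   # gap "ab" has length 2, below alpha
```

This only illustrates what a single gapped pattern matches; it says nothing about matching a whole dictionary efficiently, which is the subject of the talk.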

The DMG Problem
- Preprocess: a dictionary D of d gapped patterns P_1, …, P_d over the alphabet Σ.
- Query: a text T of length n over the alphabet Σ.
- Output: all locations in T where a dictionary gapped pattern ends.
We focus on DMG with a single gap.

Example
Dictionary:
  P_1 = aba{3,6}cbb
  P_2 = ab{3,6}bbac
  P_3 = aa{3,6}ac
Query text: a b a a b a c b b a c
Each P_i consists of a first segment P_{i,1} and a second segment P_{i,2}, one on each side of the gap.
First = ∪_{1≤i≤d} {P_{i,1}}
Second = ∪_{1≤i≤d} {P_{i,2}}
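As a baseline for checking the example, the problem statement can be solved by brute force: try every occurrence of a first segment, every allowed gap length, and test for the second segment. This is a naive reference sketch, not the algorithm of the talk; the function name and the (first, alpha, beta, second) tuple layout are assumptions.

```python
def brute_force_dmg(dictionary, text):
    """Report (end_position, pattern_index) pairs, both 1-based.

    dictionary: list of (first, alpha, beta, second) tuples (assumed layout).
    """
    hits = set()
    for i, (first, alpha, beta, second) in enumerate(dictionary, 1):
        for start in range(len(text)):
            if not text.startswith(first, start):
                continue
            for gap in range(alpha, beta + 1):
                second_start = start + len(first) + gap  # 0-based
                if text.startswith(second, second_start):
                    # 0-based last index + 1 gives the 1-based end position.
                    hits.add((second_start + len(second), i))
    return sorted(hits)

example = [("aba", 3, 6, "cbb"), ("ab", 3, 6, "bbac"), ("aa", 3, 6, "ac")]
print(brute_force_dmg(example, "abaabacbbac"))
```

On the example text, P_1 ends at location 9 and both P_2 and P_3 end at location 11.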

Motivation
- Computational biology.
- Renewed interest due to cyber security.
- Network intrusion detection systems perform protocol analysis, content searching and content matching to detect harmful software.
- Malware may appear in several packets!

Previous Work
- The gapped pattern matching problem has been studied for a few decades, e.g. [Myers, JACM 1992], [Navarro & Raffinot, Algorithmica 2004], [Bille & Thorup, ICALP 2009], [Bille & Thorup, SODA 2010], [Morgante et al., JCB 2005], [Rahman et al., COCOON 2006], [Bille et al., TCS 2012].
- The DMG problem has not been studied enough! [Kucherov & Rusinowitch, TCS 1997] and [Zhang et al., IPL 2010] give no bounds on the length of the gap.

Bi-directional Suffix Trees Algorithm
Gapped pattern: ab{3,6}bbac
Query: a b a a b a c b b a c

Bi-directional Suffix Trees Algorithm
Idea: view as [Amir et al., JAL 2000].
Gapped patterns:
  P_1 = aba{3,6}abac
  P_2 = aba{3,6}bba
  P_3 = ab{3,6}baa
Query: a b a a b a c b b a c
- Use the suffix tree T_S of Second.
- Use the suffix tree T_F^R of First^R (the reversed first segments).

Bi-directional Suffix Trees Algorithm
For each text location l:
  1. Insert t_l t_{l+1} … t_n into T_S, reaching node h, to find the labels on the path to h. This finds every P_{i,2} starting at location l.
  2. For f = l − β − 1 to l − α − 1: insert t_f t_{f-1} … t_1 into T_F^R, reaching node g, to find the labels on the path to g. This finds every P_{i,1} ending at location f.
  3. Output the intersection (for end locations).
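The loop structure above can be sketched with plain tries standing in for the suffix trees T_S and T_F^R, and with a set intersection standing in for the range-query step. This is a simplification for illustration, not the data structure of the talk; it also assumes that every pattern shares the same gap bounds alpha and beta, and all names are hypothetical.

```python
def build_trie(words):
    """Nested-dict trie; the key None holds the ids of words ending at a node."""
    root = {}
    for idx, word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault(None, set()).add(idx)
    return root

def ids_on_path(trie, chars):
    """Walk the trie along chars, collecting ids of all words that prefix chars."""
    found, node = set(), trie
    for ch in chars:
        node = node.get(ch)
        if node is None:
            break
        found |= node.get(None, set())
    return found

def dmg_one_gap(dictionary, text, alpha, beta):
    """dictionary: list of (first, second) segments; positions are 1-based."""
    second_trie = build_trie((i, s) for i, (_, s) in enumerate(dictionary, 1))
    first_rev_trie = build_trie((i, f[::-1]) for i, (f, _) in enumerate(dictionary, 1))
    out = set()
    for l in range(1, len(text) + 1):
        # Second segments starting at location l (slicing copies; fine for a sketch).
        seconds = ids_on_path(second_trie, text[l - 1:])
        if not seconds:
            continue
        for f in range(l - beta - 1, l - alpha):  # f = l-beta-1 .. l-alpha-1
            if f < 1:
                continue
            # First segments ending at location f, read right to left.
            firsts = ids_on_path(first_rev_trie, text[f - 1::-1])
            for i in firsts & seconds:
                out.add((l + len(dictionary[i - 1][1]) - 1, i))
    return sorted(out)

print(dmg_one_gap([("aba", "cbb"), ("ab", "bbac"), ("aa", "ac")], "abaabacbbac", 3, 6))
```

The explicit set intersection here costs time proportional to the sizes of both sets, which is exactly the cost that the range-query and lookup-table variants aim to avoid.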

Bi-directional Suffix Trees Algorithm: Intersection
Patterns: {(1,4), (2,9), (3,7), …, (6,5), …}
[Figure: the trees T_S and T_F^R with nodes h and g, and the ranges [1,9] and [2,7].]

Bi-directional Suffix Trees Algorithm (continued)
Intersection via range queries.
[Figure: a grid with the points (1,4), (3,7), (6,5), (8,8), (2,9) and the query ranges [2,7] and [1,9].]
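The intersection step amounts to reporting the grid points that fall inside a rectangle. The following is a deliberately naive sketch (sort by the first coordinate, binary search, then filter on the second), not the structure of [Chan et al., SoCG 2011]. Which of the two ranges from the figure applies to which axis is not recoverable from the transcript, so the pairing used below is an assumption.

```python
from bisect import bisect_left, bisect_right

def build_index(points):
    """Sort points by x and keep the x-coordinates for binary search."""
    pts = sorted(points)
    xs = [p[0] for p in pts]
    return pts, xs

def range_report(index, x_range, y_range):
    """Return all points with x in x_range and y in y_range (both inclusive)."""
    pts, xs = index
    lo = bisect_left(xs, x_range[0])
    hi = bisect_right(xs, x_range[1])
    return [p for p in pts[lo:hi] if y_range[0] <= p[1] <= y_range[1]]

index = build_index([(1, 4), (3, 7), (6, 5), (8, 8), (2, 9)])
# Assumed pairing: [1,9] on the first coordinate, [2,7] on the second.
print(range_report(index, (1, 9), (2, 7)))
```

With the opposite pairing of ranges to axes, the reported points would be (2,9), (3,7) and (6,5) instead.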

Time & Space
Preprocessing time:
- Dictionary segments suffix tree and reverse suffix tree: O(|D|).
- Preprocessing the grid for range queries: O(d log d) [Chan et al., SoCG 2011].
Preprocessing space:
- Dictionary segments suffix tree and reverse suffix tree: O(|D|).
- Space for the grid: O(d log^ε d) [Chan et al., SoCG 2011].

Time & Space
Query time:
- For each end text location, we try every gap size: a factor of (β − α).
- The number of range queries is the number of vertical paths in a given path: O(log^2 min{d, log |D|}).
- A range query costs O(log log d + occ) [Chan et al., SoCG 2011].
Total: O(n(β − α) log log d log^2 min{d, log |D|} + occ).
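The total can be read as a product of the per-step factors listed on this slide, with the output size added once. The grouping below is a reading of the slide rather than a derivation taken from the paper; α and β denote the lower and upper gap bounds.

```latex
\[
\underbrace{n}_{\text{text locations}}
\cdot \underbrace{(\beta-\alpha)}_{\text{gap sizes tried}}
\cdot \underbrace{O\!\left(\log^{2}\min\{d,\log|D|\}\right)}_{\text{range queries per gap size}}
\cdot \underbrace{O(\log\log d)}_{\text{cost per range query}}
\;+\; occ
\;=\; O\!\left(n(\beta-\alpha)\,\log\log d\,\log^{2}\min\{d,\log|D|\} + occ\right)
\]
```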

Lookup Table Algorithm
Idea: instead of using range queries in a grid to compute the intersection, we use a pre-computed lookup table.
- Enables intersection in O(occ) time.
- Total query time becomes O(n(β − α) + occ).

Lookup Table Algorithm
Inter[g,h] = all i such that P_{i,1}^R appears on the path from the root of T_F^R to node g and P_{i,2} appears on the path from the root of T_S to node h.
P_1 = (1,4), P_2 = (2,9), P_3 = (3,7), P_4 = (3,2), …, P_6 = (6,5), P_7 = (9,6)
Inter[3,5] = {4}
Inter[3,7] = {3,4}
Inter[6,7] = {3,4,6}
Inter[9,7] = {3,4,6}

Lookup Table Algorithm
P_1 = (1,4), P_2 = (2,9), P_3 = (3,7), P_4 = (3,2), …, P_6 = (6,5), P_7 = (9,6)
Inter[3,5] = {4}
Inter[3,7] = {3,4}
Inter[6,7] = {3,4,6}
[Figure: the lookup table Inter.]
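The definition of Inter[g,h] suggests a simple recurrence: a pattern belongs to Inter[g,h] if it is labelled exactly (g,h), or it already belongs to Inter[parent(g),h] or to Inter[g,parent(h)]. The sketch below fills the table that way. The slides only say the table is computed "using DP", so this particular recurrence is an assumption, and the tree shapes are hypothetical ones chosen to be consistent with the example values (the elided P_5 is left out).

```python
from functools import lru_cache

# Hypothetical tree shapes (child -> parent); node 0 is the root of each tree.
PARENT_F = {0: None, 1: 0, 2: 0, 3: 0, 6: 3, 9: 6}        # stands in for T_F^R
PARENT_S = {0: None, 2: 0, 5: 2, 7: 5, 4: 0, 9: 0, 6: 0}  # stands in for T_S
# Pattern index -> (node of P_{i,1}^R in T_F^R, node of P_{i,2} in T_S).
PATTERNS = {1: (1, 4), 2: (2, 9), 3: (3, 7), 4: (3, 2), 6: (6, 5), 7: (9, 6)}

def build_inter(parent_f, parent_s, patterns):
    """Return a dict mapping (g, h) to the frozenset of pattern indices Inter[g,h]."""
    exact = {}
    for i, (g, h) in patterns.items():
        exact.setdefault((g, h), set()).add(i)

    @lru_cache(maxsize=None)
    def inter(g, h):
        if g is None or h is None:  # walked above a root: nothing on the path
            return frozenset()
        here = frozenset(exact.get((g, h), ()))
        return here | inter(parent_f[g], h) | inter(g, parent_s[h])

    return {(g, h): inter(g, h) for g in parent_f for h in parent_s}

table = build_inter(PARENT_F, PARENT_S, PATTERNS)
for key in [(3, 5), (3, 7), (6, 7), (9, 7)]:
    print(key, sorted(table[key]))
```

The table has one entry per pair of nodes, which is where the quadratic space of this variant comes from; at query time, reporting the intersection is a single lookup followed by output.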

Lookup Table Algorithm
Preprocessing:
- Time: the table can be computed using DP in time O(d^2 ovr + |D|), where ovr is the number of subpatterns that include another subpattern as a prefix or suffix.
- Space: O(d^2 + |D|).
Query time: O(n(β − α) + occ).

Our Results
Bi-directional suffix trees & range queries:
- Preprocessing time: O(d log d + |D|).
- Space: O(d log^ε d + |D|).
- Query time: O(n(β − α) log log d log^2 min{d, log |D|} + occ).
Bi-directional suffix trees & lookup table:
- Preprocessing time: O(d^2 ovr + |D|).
- Space: O(d^2 + |D|).
- Query time: O(n(β − α) + occ).

Open Problems
- Generalizing to k gaps
- Reducing the dependency on the gap size
- Scalability to different gap bounds in the dictionary
- Online algorithm

Thank you!