Multiple Pattern Matching Algorithms on Collage System T. Kida, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa Department of Informatics, Kyushu University


Multiple Pattern Matching Algorithms on Collage System T. Kida, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa Department of Informatics, Kyushu University

Contents
– Compressed pattern matching
– Collage system
– Extension of Output function for multiple patterns
– Multi-pattern version of BM type algorithm
– Parallel complexity
– Conclusion

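Compressed Pattern Matching
[Figure: two routes from a compressed text to the matching result. In the first, the compressed text is decompressed into the original text, which is fed to a pattern matching machine. In the second, a compressed pattern matching machine reads the compressed text directly.]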

Works on This Study
Compression method       Compressed pattern matching algorithms
Run-length               Eilam-Tzoreff & Vishkin (1988)
Run-length (two dim)     Amir et al. (1992, 1997); Amir & Benson (1992)
LZ77 family              Farach & Thorup (1995); Gąsieniec et al. (1996); Klein & Shapira (2000)
LZ78 family              Amir et al. (1996); Kida et al. (1998, 1999); Navarro & Tarhio (2000); Kärkkäinen et al. (2000)
LZ family                Navarro et al. (1999)
Straight-line programs   Karpinski et al. (1997); Miyazaki et al. (1997); Hirao et al. (2000)
Huffman                  Fukamachi et al. (1998); Klein & Shapira (2001); Miyazaki et al. (1998)
Finite state encoding    Takeda (1997)
Word based encoding      Moura et al. (1998)
Pattern substitution     Manber (1994); Shibata et al. (1998)
Antidictionary based     Shibata et al. (1999)

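Works on This Study
– Previous: a separate algorithm for each compression method (an algorithm for the word-based method, an algorithm for LZ78, an algorithm for LZ77).
– With a collage system: word-based, LZ78, and LZ77 compressed texts are all represented by a collage system, and one algorithm for texts represented by a collage system handles them.
A Unifying framework for compressed pattern matching. T. Kida et al. (1999), SPIRE 1999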

Collage System
D:
  X_1 = a;
  X_2 = b;
  X_3 = X_1 · X_2;   (Concatenation)   represents ab
  X_4 = X_2 · X_1;   (Concatenation)   represents ba
  X_5 = (X_3)^3;     (Repetition)      represents ababab
  X_6 = [3] X_5;     (Truncation)      represents bab
S: X_3 X_6 X_4 X_5 X_2 X_3 X_1 X_5 X_4 X_2
Text represented by ⟨D, S⟩: abbabbaabababbabaabababbab

Notations and Definitions
A collage system is a pair ⟨D, S⟩.
D: a set of assignments of tokens
– X_1 = expr_1; X_2 = expr_2; ···; X_n = expr_n; where each expr_k is any of the form
    a                for a ∈ Σ ∪ {ε}                (primitive assignment)
    X_i · X_j        for i, j < k                   (concatenation)
    (X_i)^j          for i < k and an integer j     (j times repetition)
    [j] X_i          for i < k and an integer j     (prefix truncation)
    X_i [j]          for i < k and an integer j     (suffix truncation)
– ||D|| = n: the number of tokens defined in D
– X.u: the string represented by a token X
S: a sequence of tokens defined in D
– X_{i_1} X_{i_2} ··· X_{i_l} (each X_{i_k} is a token defined in D)
– |S| = l: the number of tokens in S
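The definitions above can be made concrete with a small sketch. The Python below is an illustration only, not the authors' implementation: the tuple encoding of expressions and the name `expand` are hypothetical, and it assumes that [j]X removes the first j characters of X.u and that X[j] removes the last j characters (the example [3]X_5 = bab from the preceding slide is consistent with this reading). It expands every token eagerly, which can take space exponential in ||D||; compressed pattern matching exists precisely to avoid this.

```python
def expand(D, S):
    """Return (text, u): u[k] is the string X_k.u, text is the expansion of S."""
    u = {}
    for k in sorted(D):                      # expr_k only refers to tokens with index < k
        op = D[k]
        kind = op[0]
        if kind == "prim":                   # X_k = a  (a character, or "" for epsilon)
            u[k] = op[1]
        elif kind == "cat":                  # X_k = X_i . X_j
            u[k] = u[op[1]] + u[op[2]]
        elif kind == "rep":                  # X_k = (X_i)^j
            u[k] = u[op[1]] * op[2]
        elif kind == "pre":                  # X_k = [j] X_i : drop the first j characters (assumed)
            u[k] = u[op[2]][op[1]:]
        elif kind == "suf":                  # X_k = X_i [j] : drop the last j characters (assumed)
            u[k] = u[op[1]][:max(0, len(u[op[1]]) - op[2])]
        else:
            raise ValueError(f"unknown expression {op!r}")
    return "".join(u[i] for i in S), u


# The example collage system from the slides.
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),
    4: ("cat", 2, 1),
    5: ("rep", 3, 3),
    6: ("pre", 3, 5),   # [3] X_5
}
S = [3, 6, 4, 5, 2, 3, 1, 5, 4, 2]

text, u = expand(D, S)
print(text)   # abbabbaabababbabaabababbab
```

Here ||D|| = 6 and |S| = 10, while the represented text has 26 characters.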

Height of D
D:
  X_1 = a;
  X_2 = b;
  X_3 = X_1 · X_2;
  X_4 = X_2 · X_1;
  X_5 = (X_3)^3;
  X_6 = [3] X_5;
  X_7 = X_6 · X_4;
[Figure: derivation tree of X_7. X_7 has children X_6 and X_4; X_6 has child X_5; X_5 has child X_3; X_3 has children X_1 and X_2; X_4 has children X_2 and X_1.]
height(X_7) = 4
height(D) = max{height(X) | X ∈ F(D)}
F(D) is the set of tokens defined in D.
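A short sketch of how height(D) could be computed bottom-up. The tuple encoding and the name `heights` are hypothetical, and the convention that a primitive token has height 0 is an assumption chosen because it reproduces the value height(X_7) = 4 given on the slide.

```python
def heights(D):
    """Map each token index k to height(X_k), taking primitives as height 0 (assumed)."""
    h = {}
    for k in sorted(D):                      # operands always have smaller indices
        op = D[k]
        kind = op[0]
        if kind == "prim":
            h[k] = 0
        elif kind == "cat":                  # X_k = X_i . X_j
            h[k] = 1 + max(h[op[1]], h[op[2]])
        elif kind == "rep":                  # X_k = (X_i)^j
            h[k] = 1 + h[op[1]]
        elif kind == "pre":                  # X_k = [j] X_i
            h[k] = 1 + h[op[2]]
        elif kind == "suf":                  # X_k = X_i [j]
            h[k] = 1 + h[op[1]]
        else:
            raise ValueError(f"unknown expression {op!r}")
    return h


D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),
    4: ("cat", 2, 1),
    5: ("rep", 3, 3),
    6: ("pre", 3, 5),
    7: ("cat", 6, 4),
}
h = heights(D)
height_D = max(h.values())                   # height(D) = max over all tokens in F(D)
print(h[7], height_D)   # 4 4
```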

Example of Collage System (LZSS [gzip])
Σ = {a_1, ..., a_q}
D:
  X_1 = a_1; X_2 = a_2; ···; X_q = a_q;
  X_{q+1} = (([i_1] X_{l(1)} X_{l(1)+1} ··· X_{r(1)})^{m_1}) [j_1] b_1;
  X_{q+2} = (([i_2] X_{l(2)} X_{l(2)+1} ··· X_{r(2)})^{m_2}) [j_2] b_2;
  ···
  X_{q+n} = (([i_n] X_{l(n)} X_{l(n)+1} ··· X_{r(n)})^{m_n}) [j_n] b_n;
  where b_k ∈ Σ and 0 ≤ i_k, j_k, m_k
S: X_{q+1} X_{q+2} ··· X_{q+n}

Pattern Matching on Collage System
[Figure: the KMP automaton for π = ababb, with its goto and failure functions, is run over S = X_{i_1} X_{i_2} X_{i_3} X_{i_4}, one token per step; the state reached after each token is shown. The original text is abababba.]
Jump(4, X_{i_4}) = 1
Output(4, X_{i_4}) = {3}

Pattern Matching on Collage System
– Truncation (LZ77, LZSS): O((||D|| + |S|) · height(D) + |π|^2 + r) time
– No truncation (Sequitur, LZ78, LZW, BPE): O(||D|| + |S| + |π|^2 + r) time
– O(||D|| + |π|^2) space
r is the number of pattern occurrences

Extension of Output function for multiple patterns

Basic Idea
Simulate the move of the Aho-Corasick pattern matching machine.
[Figure: AC machine for Π = {aba, ababb, abca, bb}, showing the goto function, the failure function, and the outputs {aba}, {ababb, bb}, {abca}, {bb}.]
Jump(q, X) = δ_AC(q, X.u)
Output(q, X) = {⟨|v|, π⟩ | π ∈ o(δ_AC(q, v)), v is a prefix of X.u}
(δ_AC is the transition function of the AC machine)
(o is the output function of the AC machine)
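The two definitions can be read directly as code. The sketch below is a naive reference, not the algorithm of the talk: it builds a textbook Aho-Corasick machine and then evaluates Jump and Output by walking over the already expanded string X.u, whereas the point of the talk is to obtain the same values without expanding tokens. The names `build_ac`, `delta`, `jump`, and `output` are hypothetical, and states are numbered in trie insertion order.

```python
from collections import deque


def build_ac(patterns):
    """Return (goto, fail, out) for a standard Aho-Corasick machine; state 0 is the root."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    queue = deque(goto[0].values())          # depth-1 states keep fail = 0
    while queue:                             # breadth-first, so shallower states are done first
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]           # inherit the outputs of the failure state
            queue.append(t)
    return goto, fail, out


def delta(ac, q, c):
    """One step of the AC transition function delta_AC."""
    goto, fail, _ = ac
    while q and c not in goto[q]:
        q = fail[q]
    return goto[q].get(c, 0)


def jump(ac, q, xu):
    """Jump(q, X) = delta_AC(q, X.u), evaluated naively on the expanded string xu."""
    for c in xu:
        q = delta(ac, q, c)
    return q


def output(ac, q, xu):
    """Output(q, X): pairs (|v|, pattern) over the non-empty prefixes v of xu."""
    found = set()
    for n, c in enumerate(xu, 1):
        q = delta(ac, q, c)
        found |= {(n, p) for p in ac[2][q]}
    return found


ac = build_ac(["aba", "ababb", "abca", "bb"])
print(sorted(output(ac, 0, "ababb")))   # [(3, 'aba'), (5, 'ababb'), (5, 'bb')]
```

With a single pattern the machine degenerates to the KMP automaton, so the same functions can be used to check small single-pattern examples by hand.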

Enumeration of Output(q, X)
Enumerate Output(q, X) ⇒ Enumerate Occ(Π, X.u)
– Single pattern case: [Figure: Y.u followed by Z.u, annotated "Period?"]
– Multiple pattern case: enumerate for each case of X, e.g. enumerate Occ*(Π, Y.u · Z.u) for X = Y · Z

Enumeration of Occ*(Π, x · y)
O(m^2) time and space preprocessing
Π = {abcabc, cabb, abca}
[Figure: a table indexed by pairs (p_x, p_y), where p_x ranges over the prefixes of the patterns in Π and p_y over the suffixes of the patterns in Π; each entry holds a pattern such as π_1, π_2, π_3, or nil.]
m is the total length of the patterns in Π
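For orientation, here is a brute-force sketch of the set that this table is meant to produce quickly. It assumes, since the slide does not spell it out, that Occ*(Π, x · y) denotes the occurrences of patterns in x · y that straddle the boundary between x and y. The name `occ_star` and the (end position, pattern) result format are hypothetical, and the code does no preprocessing at all, unlike the O(m^2) table of the slide.

```python
def occ_star(patterns, x, y):
    """Occurrences in x + y that start inside x and end inside y (assumed meaning of Occ*).

    Returns a set of (end_position, pattern) pairs with 1-based end positions in x + y.
    """
    text = x + y
    hits = set()
    for p in patterns:
        # A start s straddles the boundary iff s < len(x) < s + len(p).
        first = max(0, len(x) - len(p) + 1)
        for s in range(first, len(x)):
            if text.startswith(p, s):
                hits.add((s + len(p), p))
    return hits


patterns = ["abcabc", "cabb", "abca"]
print(sorted(occ_star(patterns, "abc", "abca")))   # [(4, 'abca'), (6, 'abcabc')]
```

In the example, the occurrence of abca that lies entirely inside y is not reported, because it does not cross the boundary.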

Enumeration of Occ(Π, (Y.u)^k)
Reduce to the single pattern case
– If Y.u · Y.u is a substring of a pattern in Π, add a list of the patterns that occur in X.u while covering (Y.u)^2.
– The number of substrings that are squares is O(m) ⇒ O(m^2) space
[Figure: generalized suffix trie GST_Π for X = Y^k. The node for (Y.u)^2 stores the list {π_1, π_3, π_6}: (Y.u)^2 is a substring of π_1, and |Y.u| is a period of π_1 (same for π_3, π_6).]
m is the total length of the patterns in Π

Our Results
Theorem: The multiple pattern matching problem for a text represented by a collage system ⟨D, S⟩ can be solved in O((||D|| + |S|) · height(D) + m^2 + r) time, using O(||D|| + m^2) space. If D contains no truncation operation, it can be solved in O(||D|| + |S| + m^2 + r) time.
m is the total length of the patterns in Π
r is the number of pattern occurrences

Multi-pattern version of BM type algorithm

Boyer-Moore type algorithm
A Boyer-Moore type algorithm for compressed pattern matching, CPM 2000
– Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa
– O((||D|| + |S|) · height(D) + |π| · |S| + |π|^2 + r) time
– O(||D|| + |π|^2) space
– If no truncation, O(||D|| + |S| + |π| · |S| + |π|^2 + r) time
Theorem: The BM type algorithm for multiple pattern searching on collage system runs in
– O((||D|| + |S|) · height(D) + m · |S| + m^2 + r) time
– O(||D|| + m^2) space
– If no truncation, O(||D|| + |S| + m · |S| + m^2 + r) time
r is the number of pattern occurrences
m is the total length of the patterns in Π

Boyer-Moore Type Algorithm
[Figure: the token sequence S = ··· X_{i_1} X_{i_2} X_{i_3} X_{i_4} X_{i_5} X_{i_6} X_{i_7} ··· over the original text CTTAATTAAGCCTGCTAAGCAT, with the pattern occurrences marked and a shift by Δ.]
1. Enumerate Occ(Π, S[i].u).
2. Enumerate Occ*(Π, q · S[i].u).
3. Calculate the maximal safe shift Δ:
   – Calculate Shift(lpps(S[i+1].u), S[i]).
   – Calculate the smallest k s.t. Shift(lpps(S[i+1].u), S[i]) ≤ Σ_{j=0}^{k} |S[i+j].u| − |lpps(S[i].u)|.
4. i := i + Δ.
Annotations on the slide: "same way as AC type"; "O(m)".

Calculate Shift(lpps(S[i+1].u), S[i])
rightmost_occ_π(w) = min{ l > 0 | π[m − l − |w| : m − l] = w, or π[1 : m − l] is a suffix of w }
rightmost_occ_Π(w) = min_{π ∈ Π} {rightmost_occ_π(w)}
[Figure: the pattern π aligned against the text at distance l, in the two cases: w occurs inside π, or a prefix of π matches a suffix of w.]

Calculate Shift(lpps(S[i+1].u), S[i])
Shift(lpps(S[i+1].u), X) = rightmost_occ_Π(X.u · lpps(S[i+1].u))
O(||D|| · height(D) + m^2) time and O(||D|| + m^2) space
[Figure: token S[i] with Shift Δ = 3.]
Shift(lpps(S[i+1].u), S[i]) ≤ Σ_{j=0}^{k} |S[i+j].u| − |lpps(S[i].u)|

Experimental Result
Environment: AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F
Data: Medline (English text), 60.3 Mbyte
[Plot: CPU time (seconds) against pattern length, for four methods:
– Search for uncompressed texts with KMP method.
– Search for uncompressed texts with Agrep.
– Search for texts compressed by BPE with AC type algorithm.
– Search for texts compressed by BPE with BM type algorithm.]
* Agrep is a search tool developed by Wu and Manber.
* BPE: Byte Pair Encoding
* A single pattern was given as input.

Parallel complexity of compressed pattern matching

Problem to consider
Instance: A regular collage system ⟨D, S⟩ (one that contains no truncation and no repetition) and a set Π = {π_1, π_2, …, π_s} of patterns.
Question: Is there any pattern π_j ∈ Π that occurs in the text T represented by ⟨D, S⟩?
This problem is in LogCFL, and LogCFL ⊆ NC^2, so it can be efficiently parallelized.
* LogCFL is the class of problems logspace-reducible to a context-free language.

Idea of the Proof
Using the lemma of I. Sudborough
– LogCFL = AuxPDA(log n, n^{O(1)}), i.e. using a worktape of log n space, in n^{O(1)} time
* AuxPDA is an auxiliary pushdown automaton: a nondeterministic Turing machine with a pushdown store. The space of the pushdown store is not bounded.
Show such an AuxPDA M
– M accepts an input string if and only if there is some pattern that occurs in the text represented by ⟨D, S⟩.

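AuxPDA M
[Figure: the machine M with its input tape ¢ π_1 # π_2 # ··· # π_s & X_{i_1} X_{i_2} ··· X_{i_n} $ and its pushdown store. For a pattern π_j and a token X_{i_k}, M decides Occ(π_j, X_{i_k}.u) by comparing characters: X_{i_k}.u[l] = π_j[t]?]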

Conclusion
Collage system is a formal system
– Texts compressed by various compression methods can be expressed by a collage system.
Two types of algorithms for multiple pattern matching on collage system
– AC type: O((||D|| + |S|) · height(D) + m^2 + r) time and O(||D|| + m^2) space; O(||D|| + |S| + m^2 + r) time if D contains no truncation
– BM type: O() time and O() space
Compressed pattern matching can be efficiently parallelized in principle.
– For regular collage systems
– Not yet for general collage systems

BPE (Byte Pair Encoding)
Text (length 18):               ABABCDEBDEFABDEABC
After G → AB:                   GGCDEBDEFGDEGC
After H → DE:                   GGCHBHFGHGC
After I → GC:                   GIHBHFGHI
Substituted text (length 9):    GIHBHFGHI
Dictionary: G → AB, H → DE, I → GC

Collage System of BPE text
Text: ABABCDEBDEFABDEABC
Dictionary: G → AB, H → DE, I → GC
D:
  X_1 = A; X_2 = B; X_3 = C; X_4 = D; X_5 = E; X_6 = F;
  X_7 = X_1 · X_2;
  X_8 = X_4 · X_5;
  X_9 = X_7 · X_3;
S: X_7 X_9 X_8 X_2 X_8 X_6 X_7 X_8 X_9
||D|| = 9
|S| = 9
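The translation from a BPE dictionary to a collage system can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function name `bpe_to_collage`, the tuple encoding of D, and the choice to number the primitive tokens in sorted alphabet order are assumptions. It also assumes that the substitution rules are given in the order they were created and that the new symbols do not occur in the original text.

```python
def bpe_to_collage(text, rules):
    """Build (D, S) from a text and an ordered list of BPE rules (new_symbol, pair)."""
    alphabet = sorted(set(text))
    token = {a: i + 1 for i, a in enumerate(alphabet)}    # X_1 .. X_q for the alphabet
    D = {token[a]: ("prim", a) for a in alphabet}
    for symbol, pair in rules:
        text = text.replace(pair, symbol)                 # apply the substitution to the text
        token[symbol] = len(D) + 1                        # next free token index
        D[token[symbol]] = ("cat", token[pair[0]], token[pair[1]])
    S = [token[c] for c in text]                          # one token per remaining symbol
    return D, S


D, S = bpe_to_collage("ABABCDEBDEFABDEABC", [("G", "AB"), ("H", "DE"), ("I", "GC")])
print(S)                # [7, 9, 8, 2, 8, 6, 7, 8, 9]
print(len(D), len(S))   # 9 9
```

Only concatenation is needed, which is why BPE falls into the "no truncation" case of the time bounds.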

Hierarchy of Collage Systems
– concatenation, repetition, truncation (X = YZ, X = Y^k, X = Y[k] or X = [k]Y): LZSS, LZ77
– concatenation only (X = YZ): Re-Pair, BPE, Sequitur
– concatenation only (X = YZ with |Y| = 1 or |Z| = 1): LZW, LZ78
– Run-length