Multiple Pattern Matching Algorithms on Collage System T. Kida, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa Department of Informatics, Kyushu.


1 Multiple Pattern Matching Algorithms on Collage System T. Kida, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa Department of Informatics, Kyushu University

2 Contents
– Compressed pattern matching
– Collage system
– Extension of Output function for multiple patterns
– Multi-pattern version of BM type algorithm
– Parallel complexity
– Conclusion

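3 Compressed Pattern Matching
[Figure: two routes from a compressed text to the matches. Either decompress the compressed text into the original text and run a pattern matching machine on it, or run a compressed pattern matching machine directly on the compressed text.]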

4 Works on This Study
Compression method | Compressed pattern matching algorithms
Run-length | Eilam-Tzoreff & Vishkin (1988)
Run-length (two dim) | Amir et al. (1992, 1997); Amir & Benson (1992)
LZ77 family | Farach & Thorup (1995); Gąsieniec et al. (1996); Klein & Shapira (2000)
LZ78 family | Amir et al. (1996); Kida et al. (1998, 1999); Navarro & Tarhio (2000); Kärkkäinen et al. (2000)
LZ family | Navarro et al. (1999)
Straight-line programs | Karpinski et al. (1997); Miyazaki et al. (1997); Hirao et al. (2000)
Huffman | Fukamachi et al. (1998); Klein & Shapira (2001); Miyazaki et al. (1998)
Finite state encoding | Takeda (1997)
Word based encoding | Moura et al. (1998)
Pattern substitution | Manber (1994); Shibata et al. (1998)
Antidictionary based | Shibata et al. (1999)

5 Works on This Study
Previous: a separate algorithm for each compression method (one for the word-based method, one for LZ78, one for LZ77).
Collage system: a single algorithm for texts represented by a collage system, which covers the word-based method, LZ78, and LZ77.
Reference: T. Kida et al. (1999), "A unifying framework for compressed pattern matching", SPIRE 1999.

6 Collage System

7 Collage System
D: X_1 = a; X_2 = b;
   X_3 = X_1・X_2; (Concatenation)
   X_4 = X_2・X_1; (Concatenation)
   X_5 = (X_3)^3; (Repetition)
   X_6 = [3]X_5; (Truncation)
S: X_3 X_6 X_4 X_5 X_2 X_3 X_1 X_5 X_4 X_2
Token strings: X_3.u = ab, X_4.u = ba, X_5.u = ababab, X_6.u = bab
Original text: abbabbaabababbabaabababbab
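
To make the example concrete, the following Python sketch expands this collage system back into the original text. It is only an illustration: the tuple encoding of D, the operation tags, and the function name `expand` are assumptions made here, not notation from the slides. The token sequence S is read off the slide as X3 X6 X4 X5 X2 X3 X1 X5 X4 X2, and truncation is taken to remove j characters from the front (prefix) or the back (suffix).

```python
def expand(D):
    """Return a dict mapping each token name to the string it represents (X.u)."""
    u = {}
    for name, expr in D:  # each assignment may only refer to earlier tokens
        op = expr[0]
        if op == "prim":        # primitive assignment: a single symbol
            u[name] = expr[1]
        elif op == "cat":       # concatenation X_i . X_j
            u[name] = u[expr[1]] + u[expr[2]]
        elif op == "rep":       # j times repetition (X_i)^j
            u[name] = u[expr[1]] * expr[2]
        elif op == "ptrunc":    # prefix truncation [j]X_i: drop the first j symbols
            u[name] = u[expr[1]][expr[2]:]
        elif op == "strunc":    # suffix truncation X_i[j]: drop the last j symbols
            s = u[expr[1]]
            u[name] = s[:len(s) - expr[2]]
        else:
            raise ValueError(f"unknown operation: {op}")
    return u


# Dictionary D and sequence S as read from the slide (hypothetical encoding).
D = [
    ("X1", ("prim", "a")),
    ("X2", ("prim", "b")),
    ("X3", ("cat", "X1", "X2")),
    ("X4", ("cat", "X2", "X1")),
    ("X5", ("rep", "X3", 3)),
    ("X6", ("ptrunc", "X5", 3)),
]
S = ["X3", "X6", "X4", "X5", "X2", "X3", "X1", "X5", "X4", "X2"]

u = expand(D)
text = "".join(u[x] for x in S)
print(text)  # abbabbaabababbabaabababbab
```

Full expansion like this is exactly what compressed pattern matching tries to avoid; the sketch is only meant to pin down what each operation means.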

8 Notations and Definitions
A collage system is a pair 〈D, S〉.
D: a set of assignments of tokens
– X_1 = expr_1; X_2 = expr_2; ・・・; X_n = expr_n; where each expr_k is any of the form
    a, for a ∈ Σ ∪ {ε} (primitive assignment)
    X_i・X_j, for i, j < k (concatenation)
    (X_i)^j, for i < k and an integer j (j times repetition)
    [j]X_i, for i < k and an integer j (prefix truncation)
    X_i[j], for i < k and an integer j (suffix truncation)
– ||D|| = n: the number of tokens defined in D
– X.u: the string represented by a token X
S: a sequence of tokens defined in D
– X_{i_1} X_{i_2} ・・・ X_{i_l} (each X_{i_j} is a token defined in D)
– |S| = l: the number of tokens in S

9 Height of D
D: X_1 = a; X_2 = b; X_3 = X_1・X_2; X_4 = X_2・X_1; X_5 = (X_3)^3; X_6 = [3]X_5; X_7 = X_6・X_4;
[Figure: derivation tree of X_7. X_7 has children X_6 and X_4; X_6 sits above X_5, which sits above X_3 with leaves X_1 and X_2; X_4 has leaves X_2 and X_1.]
height(X_7) = 4
height(D) = max{height(X) | X ∈ F(D)}, where F(D) is the set of tokens defined in D.
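
The height computation can be sketched in a few lines of Python. The convention used here is an assumption: primitive tokens get height 0 and every operation adds one level above its tallest operand, which reproduces height(X_7) = 4 from the slide. The tuple encoding and the name `token_heights` are hypothetical.

```python
def token_heights(D):
    """Return a dict mapping each token name to its height in D."""
    h = {}
    for name, expr in D:  # assignments refer only to earlier tokens
        op = expr[0]
        if op == "prim":
            h[name] = 0  # assumed convention: leaves have height 0
        elif op == "cat":
            h[name] = 1 + max(h[expr[1]], h[expr[2]])
        else:
            # "rep", "ptrunc", "strunc": one token operand plus an integer
            h[name] = 1 + h[expr[1]]
    return h


D = [
    ("X1", ("prim", "a")),
    ("X2", ("prim", "b")),
    ("X3", ("cat", "X1", "X2")),
    ("X4", ("cat", "X2", "X1")),
    ("X5", ("rep", "X3", 3)),
    ("X6", ("ptrunc", "X5", 3)),
    ("X7", ("cat", "X6", "X4")),
]

h = token_heights(D)
height_D = max(h.values())
print(h["X7"], height_D)  # 4 4
```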

10 Example of Collage System (LZSS [gzip])
D: X_1 = a_1; X_2 = a_2; ・・・; X_q = a_q;
   X_{q+1} = (([i_1] X_{l(1)} X_{l(1)+1} ・・・ X_{r(1)})^{m_1})[j_1] b_1;
   X_{q+2} = (([i_2] X_{l(2)} X_{l(2)+1} ・・・ X_{r(2)})^{m_2})[j_2] b_2;
   ・・・
   X_{q+n} = (([i_n] X_{l(n)} X_{l(n)+1} ・・・ X_{r(n)})^{m_n})[j_n] b_n;
S: X_{q+1} X_{q+2} ・・・ X_{q+n}
where Σ = {a_1, ..., a_q}, b_j ∈ Σ, and 0 ≤ i_k, j_k, m_k.

11 Pattern Matching on Collage System
[Figure: KMP automaton for π = ababb with states 0 to 5, showing the goto function and the failure function.]
Original text: abababba
S: X_{i_1} X_{i_2} X_{i_3} X_{i_4}
State sequence over the original text: 0 1 2 3 4 3 4 5 1
Jump(4, X_{i_4}) = 1, Output(4, X_{i_4}) = {3}
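
The two values on this slide can be checked with a small Python sketch of the KMP automaton for the pattern ababb. The helper names (`kmp_failure`, `delta`, `jump`, `output`) are hypothetical, and the sketch assumes that the last token represents the string abba, the final four characters of the text, which is consistent with the state sequence shown. Here Jump and Output are computed by scanning the token string character by character; the point of the actual algorithm is to obtain them without that scan.

```python
def kmp_failure(p):
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def delta(p, fail, q, ch):
    """Transition function of the KMP automaton (goto plus failure links)."""
    while True:
        if q < len(p) and p[q] == ch:
            return q + 1
        if q == 0:
            return 0
        q = fail[q]


def jump(p, fail, q, u):
    """State reached from q after reading the whole string u."""
    for ch in u:
        q = delta(p, fail, q, ch)
    return q


def output(p, fail, q, u):
    """Lengths |v| of prefixes v of u after which the final state is reached."""
    hits = set()
    for i, ch in enumerate(u, start=1):
        q = delta(p, fail, q, ch)
        if q == len(p):
            hits.add(i)
    return hits


pattern = "ababb"
fail = kmp_failure(pattern)

states = [0]
for ch in "abababba":
    states.append(delta(pattern, fail, states[-1], ch))
print(states)                               # [0, 1, 2, 3, 4, 3, 4, 5, 1]
print(jump(pattern, fail, 4, "abba"))       # 1
print(output(pattern, fail, 4, "abba"))     # {3}
```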

12 Pattern Matching on Collage System
With truncation (LZ77, LZSS): O((||D|| + |S|)・height(D) + |π|^2 + r) time
No truncation (LZ78, LZW, Sequitur, BPE): O(||D|| + |S| + |π|^2 + r) time
Space: O(||D|| + |π|^2)
r is the number of pattern occurrences.

13 Extension of Output function for multiple patterns

14 Basic Idea
Simulate the move of the Aho-Corasick (AC) pattern matching machine.
[Figure: AC machine for Π = {aba, ababb, abca, bb} with states 0 to 9, showing the goto function, the failure function, and the outputs {aba}, {ababb, bb}, {abca}, {bb}.]
Jump(q, X) = δ_AC(q, X.u)
Output(q, X) = {〈|v|, π〉 | π ∈ o(δ_AC(q, v)), v is a prefix of X.u}
(δ_AC is the transition function of the AC machine; o is the output function of the AC machine)
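
The definitions of Jump and Output can be illustrated with a textbook Aho-Corasick machine in Python. This is a sketch under assumptions: the function names are hypothetical, the state numbers it produces need not match the numbering in the figure, and Jump and Output are evaluated here by reading X.u character by character, whereas the algorithm in the talk computes them from the collage system without expanding the token.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure, and output tables of an Aho-Corasick machine."""
    goto, out = [{}], [set()]
    for p in patterns:
        q = 0
        for ch in p:
            if ch not in goto[q]:
                goto.append({})
                out.append(set())
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
        out[q].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        q = queue.popleft()
        for ch, nxt in goto[q].items():
            f = fail[q]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit outputs along the failure link
            queue.append(nxt)
    return goto, fail, out


def step(ac, q, ch):
    """One move of the transition function delta_AC."""
    goto, fail, _ = ac
    while q and ch not in goto[q]:
        q = fail[q]
    return goto[q].get(ch, 0)


def jump(ac, q, u):
    """Jump(q, X) = delta_AC(q, X.u), with u standing for X.u."""
    for ch in u:
        q = step(ac, q, ch)
    return q


def output(ac, q, u):
    """Output(q, X): pairs (|v|, pattern) over the prefixes v of u."""
    result = set()
    for i, ch in enumerate(u, start=1):
        q = step(ac, q, ch)
        result |= {(i, p) for p in ac[2][q]}
    return result


ac = build_ac(["aba", "ababb", "abca", "bb"])
print(sorted(output(ac, 0, "ababb")))  # [(3, 'aba'), (5, 'ababb'), (5, 'bb')]
```

Because Jump composes over concatenation, processing a token sequence amounts to chaining `jump` calls, one per token, while collecting the corresponding `output` sets.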

15 Enumeration of Output(q, X)
Enumerate Output(q, X) → Enumerate Occ(Π, X.u)
Enumerate for each case of X, e.g. enumerate Occ*(Π, Y.u・Z.u) for X = Y・Z.
[Figure: occurrences across the boundary between Y.u and Z.u, contrasting the single pattern case (which uses the period) with the multiple pattern case.]

16 Enumeration of Occ*(Π, x・y)
O(m^2) time and space preprocessing, where m is the total length of the patterns in Π.
Example: Π = {abcabc, cabb, abca}
[Figure: lookup table over pairs (p_x, p_y) drawn from the prefixes and suffixes of the patterns in Π; each entry lists patterns such as π_1, π_2, π_3, or nil.]
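
As a point of reference, the set Occ* can be computed by brute force. The Python sketch below assumes that Occ*(Π, x・y) denotes the occurrences of patterns in the concatenation that cover the boundary between x and y, and it reports each one as a pair (end position, pattern). It scans the concatenated string directly and is not the O(m^2) table-based method of the slide; the name `occ_star` is hypothetical.

```python
def occ_star(patterns, x, y):
    """Occurrences in x + y that start inside x and end inside y."""
    xy = x + y
    result = set()
    for p in patterns:
        start = xy.find(p)
        while start != -1:
            end = start + len(p)
            if start < len(x) < end:  # the occurrence covers the boundary
                result.add((end, p))
            start = xy.find(p, start + 1)
    return result


patterns = ["abcabc", "cabb", "abca"]
print(sorted(occ_star(patterns, "abc", "abca")))  # [(4, 'abca'), (6, 'abcabc')]
```

In this example the second occurrence of abca starts exactly at the boundary, so it lies entirely inside y and is not reported.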

17 Enumeration of Occ(Π, (Y.u)^k)
Reduce to the single pattern case, for X = Y^k:
– If Y.u・Y.u is a substring of a pattern in Π, add a list of the patterns that occur in X.u covering (Y.u)^2.
– The number of substrings that are squares is O(m), so the lists take O(m^2) space.
[Figure: generalized suffix trie GST_Π in which the node for (Y.u)^2 carries the list {π_1, π_3, π_6}: (Y.u)^2 is a substring of π_1, and |Y.u| is a period of π_1 (same for π_3, π_6).]
m is the total length of the patterns in Π.

18 Our Results
Theorem: The multiple pattern matching problem for a text represented by a collage system 〈D, S〉 can be solved in O((||D|| + |S|)・height(D) + m^2 + r) time, using O(||D|| + m^2) space. If D contains no truncation operation, it can be solved in O(||D|| + |S| + m^2 + r) time.
m is the total length of the patterns in Π; r is the number of pattern occurrences.

19 Multi-pattern version of BM type algorithm

20 Boyer-Moore type algorithm
A Boyer-Moore type algorithm for compressed pattern matching, CPM2000
– Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa
– O((||D|| + |S|)・height(D) + |π|・|S| + |π|^2 + r) time
– O(||D|| + |π|^2) space
– If no truncation, O(||D|| + |S| + |π|・|S| + |π|^2 + r) time
Theorem: The BM type algorithm for multiple pattern searching on collage system runs in
– O((||D|| + |S|)・height(D) + m・|S| + m^2 + r) time
– O(||D|| + m^2) space
– If no truncation, O(||D|| + |S| + m・|S| + m^2 + r) time
r is the number of pattern occurrences; m is the total length of the patterns in Π.

21 Boyer-Moore Type Algorithm
[Figure: token sequence S = ・・ X_{i_1} X_{i_2} X_{i_3} X_{i_4} X_{i_5} X_{i_6} X_{i_7} ・・ over the original text CTTAATTAAGCCTGCTAAGCAT, marking pattern occurrences and a shift by Δ.]
1. Enumerate Occ(Π, S[i].u).
2. Enumerate Occ*(Π, q・S[i].u). (Steps 1 and 2: same way as the AC type.)
3. Calculate the maximal safe shift Δ, in O(m):
   – Calculate Shift(lpps(S[i+1].u), S[i]).
   – Calculate the smallest k s.t. Shift(lpps(S[i+1].u), S[i]) ≤ Σ_{j=0}^{k} |S[i+j].u| − |lpps(S[i].u)|.
4. i := i + Δ

22 Calculate Shift(lpps(S[i+1].u), S[i])
rightmost_occ_π(w) = min{ l > 0 | π[m − l − |w| : m − l] = w, or π[1 : m − l] is a suffix of w }
rightmost_occ_Π(w) = min_{π ∈ Π} {rightmost_occ_π(w)}
[Figure: pattern π aligned against the text at distance l, either containing w or matching a suffix of w.]

23 Calculate Shift(lpps(S[i+1].u), S[i])
Shift(lpps(S[i+1].u), X) = rightmost_occ_Π(X.u・lpps(S[i+1].u))
O(||D||・height(D) + m^2) time and O(||D|| + m^2) space
[Figure: token S[i] with a shift of Δ = 3 tokens.]
Shift(lpps(S[i+1].u), S[i]) ≤ Σ_{j=0}^{k} |S[i+j].u| − |lpps(S[i].u)|

24 Experimental Result
Environment: AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F
Text: Medline (English text), 60.3 Mbyte. A single pattern was given as input.
[Figure: CPU time (seconds, 0.0 to 0.8) against pattern length (5 to 30) for four methods: search of uncompressed text with the KMP method; search of uncompressed text with Agrep; search of text compressed by BPE with the AC type algorithm; search of text compressed by BPE with the BM type algorithm.]
* Agrep is a search tool developed by Wu and Manber.
* BPE: Byte Pair Encoding

25 Parallel complexity of compressed pattern matching

26 Problem to consider
Instance: A regular collage system 〈D, S〉 (one that contains no truncation and no repetition) and a set Π = {π_1, π_2, ..., π_s} of patterns.
Question: Is there any pattern π_j ∈ Π that occurs in the text T represented by 〈D, S〉?
The problem is in LogCFL, so it can be efficiently parallelized: LogCFL ⊆ NC^2.
* LogCFL is the class of problems logspace-reducible to a context-free language.

27 Idea of the Proof
Using the lemma of I. Sudborough
– LogCFL = AuxPDA(log n, n^{O(1)}): a nondeterministic Turing machine with a pushdown store, using a log n space worktape in n^{O(1)} time. The space of the pushdown store is not bounded.
* AuxPDA is an auxiliary pushdown automaton.
Show such an AuxPDA M: M accepts an input string if and only if there is some pattern that occurs in the text represented by 〈D, S〉.

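28 AuxPDA M
[Figure: the machine M with its input tape ¢ π_1 # π_2 # ... # π_s & X_{i_1} X_{i_2} ... X_{i_n} $, a worktape, and a pushdown store. M tests whether Occ(π_j, X_{i_k}.u) = ∅ by checking X_{i_k}.u[l] = π_j[t] for a position t in the pattern.]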

29 Conclusion
Collage system is a formal system
– Texts compressed by various compression methods can be expressed by collage systems.
Two types of algorithm for multiple pattern matching on collage system
– AC type: O((||D|| + |S|)・height(D) + m^2 + r) time and O(||D|| + m^2) space
– BM type: O() time and O() space
Compressed pattern matching can be efficiently parallelized in principle.
– For regular collage systems
– Not yet for general collage systems

30 BPE (Byte Pair Encoding)
Text (18 symbols): ABABCDEBDEFABDEABC
After G → AB: GGCDEBDEFGDEGC
After H → DE: GGCHBHFGHGC
After I → GC: GIHBHFGHI
Substituted text (9 symbols): GIHBHFGHI
Dictionary: G → AB, H → DE, I → GC
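
The substitution steps on this slide can be reproduced with a short Python sketch of byte pair encoding. It is a simplification under stated assumptions: fresh symbols are supplied by the caller (here G, H, I, as on the slide) rather than taken from unused byte values, pairs are counted over all adjacent positions, and replacement stops once no pair occurs at least twice. The name `bpe_compress` is hypothetical.

```python
from collections import Counter


def bpe_compress(text, fresh_symbols):
    """Repeatedly replace the most frequent adjacent pair by a fresh symbol."""
    rules = []
    fresh = iter(fresh_symbols)
    while len(text) >= 2:
        pair, count = Counter(zip(text, text[1:])).most_common(1)[0]
        if count < 2:
            break  # no pair is worth replacing any more
        symbol = next(fresh, None)
        if symbol is None:
            break  # ran out of fresh symbols
        pair_str = "".join(pair)
        rules.append((symbol, pair_str))
        # str.replace substitutes non-overlapping occurrences left to right
        text = text.replace(pair_str, symbol)
    return text, rules


compressed, rules = bpe_compress("ABABCDEBDEFABDEABC", "GHI")
print(compressed)  # GIHBHFGHI
print(rules)       # [('G', 'AB'), ('H', 'DE'), ('I', 'GC')]
```

Each rule is a concatenation of two existing symbols, which is why a BPE-compressed text maps onto a collage system that uses concatenation only.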

31 Collage System of BPE text
Text: ABABCDEBDEFABDEABC
Dictionary: G → AB, H → DE, I → GC
D: X_1 = A; X_2 = B; X_3 = C; X_4 = D; X_5 = E; X_6 = F; X_7 = X_1・X_2; X_8 = X_4・X_5; X_9 = X_7・X_3;
S: X_7 X_9 X_8 X_2 X_8 X_6 X_7 X_8 X_9
||D|| = 9, |S| = 9

32 Hierarchy of Collage Systems
– concatenation, repetition, truncation (X = YZ, X = Y^k, X = Y[k] or X = [k]Y): LZSS, LZ77
– concatenation only (X = YZ): Re-Pair, BPE, Sequitur
– concatenation only, with |Y| = 1 or |Z| = 1 (X = YZ): LZW, LZ78
– Run-length

