
A Unifying Framework for Compressed Pattern Matching Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Yusuke Shibata, Setsuo Arikawa Department of Informatics,


2 A Unifying Framework for Compressed Pattern Matching Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Yusuke Shibata, Setsuo Arikawa Department of Informatics, Kyushu University, Japan

3 Contents
- Pattern matching and compressed pattern matching
- Previous results
- Collage system
- Proposed algorithm
- Conclusion

4 Pattern Matching Problem
We introduce a general framework that captures the essence of compressed pattern matching for various dictionary-based compression methods. The goal is to find all occurrences of a pattern in a text without decompressing it, which is one of the most active topics in string matching. Our framework covers compression methods such as the Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary-based method. Technically, our pattern matching algorithm substantially extends the one for LZW-compressed text presented by Amir, Benson and Farach.
[Figure labels: text, pattern, compress]

5 Compressed Pattern Matching
[Figure labels: Original Text, Compressed Text, decompress, Pattern Matching Machine, New Machine!]

6 Previous Results (1)
year | researcher | compression
1988 | Eilam-Tzoreff and Vishkin | run-length
1992 | Amir, Landau, and Vishkin | two-dimensional run-length
1992 | Amir and Benson | two-dimensional run-length
1994 | Amir, Benson, and Farach | two-dimensional run-length
1994 | Manber | original compression scheme
1995 | Farach and Thorup | LZ77
1996 | Amir, Benson and Farach | LZW
1996 | Gasieniec, et al. | LZ77
1997 | Karpinski, Rytter, and Shinohara | straight-line programs
1997 | Miyazaki, Shinohara, and Takeda | straight-line programs
1997 | Takeda | finite state encoding
1998 | Shibata | byte pair encoding
1998 | Fukamachi, Shinohara, and Takeda | Huffman encoding
1998 | Kida, et al. | LZW

7 Previous Results (2)
year | researcher | compression
1998 | de Moura, Navarro, Ziviani, and Baeza-Yates | Word based encoding
1999 | Shibata, Takeda, Shinohara, and Arikawa | Antidictionaries
1999 | Kida, Takeda, Shinohara, and Arikawa | LZW
1999 | Shibata, et al. | Byte pair encoding
1999 | Navarro and Raffinot | LZ family
1999 | Kida, et al. | Dictionary based methods (Collage system)
[Slide callouts: "faster than Agrep!" (twice), "Today's talk"]

8 Motivation
Previous:
- Compression A: PM Algorithm A
- Compression B: PM Algorithm B
- Compression C: PM Algorithm C
Ours: a general pattern matching algorithm on the unifying framework
- Compression A, Compression B, Compression C: Collage system

9 Collage System Definition and Several Examples

10 Dictionary Based Compression
[Figure labels: Original text, Dictionary structure, compressed text, "factorize into a series of phrases", "encoding"]
- How to choose the phrases.
- How to design the data structure of the dictionary.
- How to encode phrases.

11 Definition of Collage System
A collage system is a pair $\langle D, S \rangle$.
D: a sequence of assignments (dictionary structure): $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$
S: a sequence of variables defined in D (compressed text): $S = X_{i_1}, X_{i_2}, \ldots, X_{i_\ell}$ ($X_{i_k} \in D$)
$\|D\| = n$: number of assignments in D
$|S| = \ell$: number of variables in S

12 Definition of Collage System
D: a sequence of assignments (dictionary structure): $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$
where each $expr_k$ is one of:
- $a$, for $a \in \Sigma \cup \{\varepsilon\}$ (primitive assignment)
- $X_i \cdot X_j$, for $i, j < k$ (concatenation)
- ${}^{[j]}X_i$, for $i < k$ and integer $j$ (prefix truncation)
- $X_i^{[j]}$, for $i < k$ and integer $j$ (suffix truncation)
- $(X_i)^j$, for $i < k$ and integer $j$ ($j$ times repetition)

13 Example of Collage System
D: $X_1 = a$; $X_2 = b$; $X_3 = X_1 \cdot X_2$; $X_4 = X_2 \cdot X_1$; $X_5 = (X_3)^3$; $X_6 = {}^{[3]}X_5$; $X_7 = X_6 \cdot X_4$
S: $X_3, X_6, X_4, X_7$
Original text: abbabbababba
Strings: $X_3$ = ab, $X_4$ = ba, $X_5$ = ababab (3 times repetition), $X_6$ = bab (prefix truncation), $X_7$ = babba
[Figure: the tree $T(X_7)$] $height(X_7) = 4$, $height(D) = 4$
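The example above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the tuple encoding of assignments and the names `expand`, `heights`, and `decode` are hypothetical. It assumes that prefix truncation ${}^{[j]}X_i$ drops the first $j$ characters and suffix truncation drops the last $j$, and that a primitive assignment has height 0; both assumptions reproduce the strings and the height shown on the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding of one assignment X_k = expr_k:
#   ("prim", a)       X_k = a
#   ("cat", i, j)     X_k = X_i . X_j
#   ("rep", i, j)     X_k = (X_i)^j
#   ("ptrunc", i, j)  X_k = [j]X_i  (assumed: drop the length-j prefix)
#   ("strunc", i, j)  X_k = X_i[j]  (assumed: drop the length-j suffix)
Assignment = Tuple

def expand(D: Dict[int, Assignment]) -> Dict[int, str]:
    """Return the string derived by every variable (fully decompressed)."""
    text: Dict[int, str] = {}
    for k in sorted(D):  # each expr_k only refers to variables with smaller index
        op, *args = D[k]
        if op == "prim":
            text[k] = args[0]
        elif op == "cat":
            text[k] = text[args[0]] + text[args[1]]
        elif op == "rep":
            text[k] = text[args[0]] * args[1]
        elif op == "ptrunc":
            text[k] = text[args[0]][args[1]:]
        elif op == "strunc":
            s = text[args[0]]
            text[k] = s[:max(0, len(s) - args[1])]
        else:
            raise ValueError(f"unknown assignment kind: {op}")
    return text

def heights(D: Dict[int, Assignment]) -> Dict[int, int]:
    """Height of each variable, assuming primitive assignments have height 0."""
    h: Dict[int, int] = {}
    for k in sorted(D):
        op, *args = D[k]
        if op == "prim":
            h[k] = 0
        elif op == "cat":
            h[k] = 1 + max(h[args[0]], h[args[1]])
        else:
            h[k] = 1 + h[args[0]]
    return h

def decode(D: Dict[int, Assignment], S: List[int]) -> str:
    """Concatenate the strings of the variables listed in S."""
    text = expand(D)
    return "".join(text[i] for i in S)

# The collage system from the example slide.
EXAMPLE_D = {
    1: ("prim", "a"), 2: ("prim", "b"),
    3: ("cat", 1, 2), 4: ("cat", 2, 1),
    5: ("rep", 3, 3), 6: ("ptrunc", 5, 3),
    7: ("cat", 6, 4),
}
EXAMPLE_S = [3, 6, 4, 7]

if __name__ == "__main__":
    print(decode(EXAMPLE_D, EXAMPLE_S))
    print(heights(EXAMPLE_D)[7])
```

This sketch decompresses everything, which is exactly what compressed pattern matching avoids; it is only meant to make the definition concrete.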

14 Example of Collage System: Byte Pair Encoding (BPE)
Original text: a b a b c b a b c c a b c a c b
After replacing ab → D: D D c b D c c D c a c b
After replacing Dc → E: D E b E c E a c b
D: $X_1 = a$; $X_2 = b$; $X_3 = c$; $X_4 = X_1 \cdot X_2$; $X_5 = X_4 \cdot X_3$
S: $X_4, X_5, X_2, X_5, X_3, X_5, X_1, X_3, X_2$
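The two substitution steps of this example can be replayed with a short sketch. It takes the pairs ab → D and Dc → E as given from the slide rather than choosing them by frequency, and the helper name `replace_pair` and the symbol-to-variable table are hypothetical.

```python
from typing import List, Tuple

def replace_pair(tokens: List[str], pair: Tuple[str, str], new_symbol: str) -> List[str]:
    """Replace every non-overlapping occurrence of `pair`, scanning left to right."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

original = list("ababcbabccabcacb")
step1 = replace_pair(original, ("a", "b"), "D")  # ab -> D
step2 = replace_pair(step1, ("D", "c"), "E")     # Dc -> E

# Map each symbol to its variable index in D (X1 = a, ..., X5 = X4 . X3).
variable_of = {"a": 1, "b": 2, "c": 3, "D": 4, "E": 5}
S = [variable_of[t] for t in step2]

if __name__ == "__main__":
    print("".join(step1))
    print("".join(step2))
    print(S)
```

Each substitution rule becomes one concatenation assignment in D, so a BPE-compressed text needs no truncation or repetition.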

15 Example of Collage System (LZSS [gzip])
D: $X_1 = a_1;\ X_2 = a_2;\ \ldots;\ X_q = a_q;$
$X_{q+1} = \left( \left( {}^{[i_1]}X_{l(1)} \cdot X_{l(1)+1} \cdots X_{r(1)} \right)^{m_1} \right)^{[j_1]} b_1;$
$X_{q+2} = \left( \left( {}^{[i_2]}X_{l(2)} \cdot X_{l(2)+1} \cdots X_{r(2)} \right)^{m_2} \right)^{[j_2]} b_2;$
$\ldots$
$X_{q+n} = \left( \left( {}^{[i_n]}X_{l(n)} \cdot X_{l(n)+1} \cdots X_{r(n)} \right)^{m_n} \right)^{[j_n]} b_n;$
S: $X_{q+1}, X_{q+2}, \ldots, X_{q+n}$

16 What is ‘Collage’? This is college!

17 Collage is... an artistic composition technique.
1. Cut or tear up materials.
2. Paste the pieces over a surface.

18 Our Algorithm Pattern Matching Algorithm on a Collage System

19 Compressed pattern matching on a collage system
The problem of compressed pattern matching can be solved in $O((\|D\| + |S|) \cdot height(D) + m^2 + r)$ time using $O(\|D\| + m^2)$ space.
If D contains no truncation, it can be solved in $O(\|D\| + |S| + m^2 + r)$ time, that is, $O(\text{compressed text length} + m^2 + r)$.
- $m$: pattern length
- $r$: number of pattern occurrences
- $\|D\|$: number of assignments in D
- $|S|$: number of variables in S

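20 Basic Idea
Pattern π = ababb; original text: abababba
[Figure: the pattern matching automaton for π with states 0 to 5, showing the goto function and the failure function, run over the text as the variable sequence S: $X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}$. State sequence: 0 1 2 3 4 3 4 5 1.]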
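The state sequence on this slide can be reproduced with a small Knuth-Morris-Pratt sketch. The function names are hypothetical and the pattern is assumed to be non-empty; a state is the number of pattern characters currently matched.

```python
from typing import List

def failure_function(p: str) -> List[int]:
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = fail[k]
        if p[k] == p[q - 1]:
            k += 1
        fail[q] = k
    return fail

def step(p: str, fail: List[int], q: int, c: str) -> int:
    """One transition of the automaton from state q on character c."""
    if q == len(p):          # after a full match, fall back before reading c
        q = fail[q]
    while q > 0 and p[q] != c:
        q = fail[q]
    if p[q] == c:
        q += 1
    return q

def trace(p: str, text: str) -> List[int]:
    """States visited while scanning text, starting from state 0."""
    fail = failure_function(p)
    states = [0]
    for c in text:
        states.append(step(p, fail, states[-1], c))
    return states

if __name__ == "__main__":
    print(trace("ababb", "abababba"))  # [0, 1, 2, 3, 4, 3, 4, 5, 1]
```

State 5 is reached after the seventh character, which is where the occurrence of ababb ends. The idea of the slide is to make these transitions one variable of S at a time instead of one character at a time.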

21 Jump and Output
The function $Jump(j, u) = \delta_{KMP}(j, u)$. The domain is $Q \times D$. It simulates the sequence of state transitions for $u$. Reply in $O(1)$ time.
The set $Output(j, u) = \{\, 1 \le i \le |u| \mid P \text{ is a suffix of } P[1:j] \cdot u[1:i] \,\}$. This set contains the pattern occurrences. Reply in $O(\ell)$ time, where $\ell$ is the size of the set.
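Both definitions can be written down directly. The sketch below is a naive reading of the definitions, not the constant-time realization discussed on the following slides: it works on the plain string $u$ rather than on a variable of D, and it uses the standard fact that the state reached by the KMP automaton is the length of the longest prefix of P that is a suffix of what has been read. The function names are hypothetical.

```python
from typing import List

def jump(P: str, j: int, u: str) -> int:
    """Jump(j, u): state reached from state j after reading u, computed as the
    longest prefix of P that is a suffix of P[1:j] . u (naive, by definition)."""
    w = P[:j] + u
    for k in range(min(len(P), len(w)), -1, -1):
        if w.endswith(P[:k]):
            return k
    return 0

def output(P: str, j: int, u: str) -> List[int]:
    """Output(j, u): all i in 1..|u| such that P is a suffix of P[1:j] . u[1:i]."""
    return [i for i in range(1, len(u) + 1) if (P[:j] + u[:i]).endswith(P)]

if __name__ == "__main__":
    P = "ababb"
    print(jump(P, 1, "babba"))    # 1
    print(output(P, 1, "babba"))  # [4]
```

For example, from state 1 (one character a matched) the phrase babba completes an occurrence after its fourth character and leaves the automaton in state 1.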

22 Realization of Jump
For $Jump(q, X_k)$, if $X_k$ is...
- $a$ or $X_i \cdot X_j$: $O(1)$ time
- ${}^{[j]}X_i$ or $X_i^{[j]}$: $O(height(X_i))$ time
- $(X_i)^j$: $O(1)$ time
If the factor concatenation problem for a string of length $m$ can be solved in $O(1)$ time, Jump can be computed in $O(1)$ time.

23 Factor Concatenation Problem
Instance: two factors $x$ and $y$ of a string $P$, each represented as a node of the suffix trie of $P$.
Question: is the string $xy$ a factor of $P$? If ‘yes’, return its node number.
Example: P = COPACABANA, x = OPA, y = CABAN. Concatenating gives OPACABAN = P[2:9], so the answer is ‘yes’.
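As a baseline for the problem statement, the question can be answered by a plain substring search. This is a naive sketch under simplifying assumptions: factors are passed as strings instead of suffix-trie nodes, and the answer is a 1-based interval of P instead of a node number. The name `factor_concat` is hypothetical.

```python
from typing import Optional, Tuple

def factor_concat(P: str, x: str, y: str) -> Optional[Tuple[int, int]]:
    """If xy is a factor of P, return a 1-based inclusive interval (start, end)
    with P[start:end] = xy; otherwise return None."""
    pos = P.find(x + y)  # linear-time or worse search, not the O(1) query of the talk
    if pos < 0:
        return None
    return (pos + 1, pos + len(x) + len(y))

if __name__ == "__main__":
    print(factor_concat("COPACABANA", "OPA", "CABAN"))  # (2, 9)
    print(factor_concat("COPACABANA", "CABAN", "OPA"))  # None
```

The search cost here depends on the lengths of P, x, and y, which is why preprocessing is needed to answer each query in constant time.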

24 Solution to the problem
- Using a suffix trie, it can be solved in $O(m)$ time after preprocessing of $O(m^2)$ time and space.
- Using a two-dimensional lookup table, it can be solved in $O(1)$ time, but we need $O(m^4)$ time and space for preprocessing.
- It can be solved in $O(1)$ time after $O(m^2)$ space and time preprocessing.

25 Realization of Output
For $Output(q, X_k)$, if $X_k$ is...
- $a$ or $X_i \cdot X_j$: $O(1)$ time. It can be enumerated in $O(\ell)$ time from the Output of $X_i$ and $X_j$.
- ${}^{[j]}X_i$ or $X_i^{[j]}$: $O(\ell \cdot height(X_i))$ time
- $(X_i)^j$: $O(\ell)$ time
$\ell$: size of the set Output

26 Outline of Our Algorithm
Input. Pattern P and collage system $\langle D, S \rangle$ ($S = X_{i_1}, X_{i_2}, \ldots, X_{i_n}$)
Output. All occurrences of the pattern.
    /* preprocess for D and P */
    preprocess(D); preprocess(P);
    l := 0; q := 0;
    for j := 1 to n do begin
        for each d ∈ Output(q, X_{i_j}) do
            report ‘pattern occurs at position l + d’;
        q := Jump(q, X_{i_j});   /* state transition */
        l := l + |X_{i_j}|;      /* calculate the offset */
    end
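The loop above can be exercised with a minimal sketch in which Jump and Output are computed naively, by stepping the KMP automaton over the expanded string of each phrase. That is an assumption made for illustration only: it decompresses every phrase and so does not achieve the stated complexity. The names `scan` and `search` are hypothetical, the pattern is assumed non-empty, and reported positions are 1-based end positions of occurrences, as in `l + d`.

```python
from typing import List, Tuple

def failure_function(p: str) -> List[int]:
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = fail[k]
        if p[k] == p[q - 1]:
            k += 1
        fail[q] = k
    return fail

def scan(p: str, fail: List[int], q: int, u: str) -> Tuple[int, List[int]]:
    """Naive stand-in for (Jump(q, u), Output(q, u)): read u one character at a time."""
    hits: List[int] = []
    for d, c in enumerate(u, start=1):
        if q == len(p):
            q = fail[q]
        while q > 0 and p[q] != c:
            q = fail[q]
        if p[q] == c:
            q += 1
        if q == len(p):
            hits.append(d)
    return q, hits

def search(p: str, phrases: List[str]) -> List[int]:
    """Follow the outline: one Output query and one Jump per variable of S."""
    fail = failure_function(p)      # preprocess(P)
    l, q = 0, 0
    found: List[int] = []
    for u in phrases:               # u is the string of X_{i_j}
        q_next, hits = scan(p, fail, q, u)
        found.extend(l + d for d in hits)   # report position l + d
        q = q_next                  # state transition
        l += len(u)                 # calculate the offset
    return found

if __name__ == "__main__":
    # Strings of X3, X6, X4, X7 from the earlier example (text abbabbababba).
    print(search("ababb", ["ab", "bab", "ba", "babba"]))  # [11]
```

The occurrence of ababb ends at position 11 of abbabbababba and spans the last two phrases, which is why the state q has to be carried across phrase boundaries.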

27 Concluding Remarks Conclusion and Future Works

28 Our Results
Complexity of our algorithm: $O((\|D\| + |S|) \cdot height(D) + m^2 + r)$ time, $O(\|D\| + m^2)$ space
If D contains no truncation: $O(\|D\| + |S| + m^2 + r)$ time
1998 Kida, et al. (LZW): $O(n + m^2 + r)$ time, $O(n + m^2)$ space
- No truncation: LZ78, LZW, BPE, run-length, etc.
- Truncation: LZ77, LZSS, etc.

29 Conclusion
- We introduced a general framework for compressed pattern matching (collage system).
- We proposed a compressed pattern matching algorithm on a collage system and showed its complexity: $O((\|D\| + |S|) \cdot height(D) + m^2 + r)$ time and $O(\|D\| + m^2)$ space; if there is no truncation, $O(\|D\| + |S| + m^2 + r)$ time.

30 Future Works
- Can we reduce the complexity of the preprocessing from $O(m^2)$ to $O(m)$?
- Extend our algorithm to deal with multiple patterns.
- Develop an approximate pattern matching algorithm on a collage system.
- Develop a new compression method that is suitable for compressed pattern matching.

