Speeding up pattern matching by text compression

Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa

Department of Informatics, Kyushu University, Japan
Department of AI, Kyushu Institute of Technology, Japan

Contents
1. Pattern matching on compressed text.
2. A unifying framework for compressed pattern matching (collage system).
3. Byte pair encoding (BPE).
4. Pattern matching algorithm on BPE compressed text.
5. Experimental result.
6. Conclusion.

Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. By the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism.

Pattern Matching Problem: find the occurrences of a pattern in a text. Classical algorithms: Knuth-Morris-Pratt (1974), Boyer-Moore (1977), Aho-Corasick (1975), Shift-Or (1992).

Pattern Matching on Compressed Text
[Figure: two pipelines. Original text: file transfer from secondary disk storage to memory, then search. Compressed text: file transfer from secondary disk storage to memory, expand, then search.]
Expanding the compressed text before searching requires extra time and space.

Pattern Matching on Compressed Text
[Figure: the compressed text is transferred from secondary disk storage to memory and searched directly.]
GOAL 1: To perform a faster search in compressed texts in comparison with a regular decompression followed by an ordinary search.
GOAL 2: To perform a faster search in compressed texts in comparison with an ordinary search in the original texts. (Speeding up pattern matching by text compression.)

Previous Results (1)

Year  Researcher                        Compression
1988  Eilam-Tzoreff and Vishkin         run-length
1992  Amir, Landau, and Vishkin         two-dimensional run-length
1992  Amir and Benson                   two-dimensional run-length
1994  Amir, Benson, and Farach          two-dimensional run-length
1994  Manber                            original compression scheme
1995  Farach and Thorup                 LZ77
1996  Amir, Benson, and Farach          LZW
1996  Gasieniec, et al.                 LZ77
1997  Karpinski, Rytter, and Shinohara  straight-line programs
1997  Miyazaki, Shinohara, and Takeda   straight-line programs
1997  Takeda                            finite state encoding
1998  Shibata                           byte pair encoding
1998  Fukamachi, Shinohara, and Takeda  Huffman encoding
1998  Kida, et al.                      LZW

Previous Results (2)

Year  Researcher                                   Compression
1998  de Moura, Navarro, Ziviani, and Baeza-Yates  Word based encoding
1999  Shibata, Takeda, Shinohara, and Arikawa      Antidictionary based
1999  Kida, Takeda, Shinohara, and Arikawa         LZW
1999  Navarro and Raffinot                         LZ family
1999  Kida, et al.                                 Dictionary based methods (collage system): unifying framework
2000  Shibata, et al.                              Byte pair encoding: today's talk

A Unifying Framework for Compressed Pattern Matching
Previous: each compression method needs its own pattern matching (PM) algorithm.
    Compression A -> PM Algorithm A
    Compression B -> PM Algorithm B
    Compression C -> PM Algorithm C
Kida et al. [1999]: Compression A, Compression B, and Compression C are all expressed as a collage system, and a single pattern matching algorithm works on the unifying framework.

Collage System: Definition and Several Examples

Dictionary Based Compression
[Figure: the original text is factorized into a series of phrases, which are stored in a dictionary structure; encoding the phrases yields the compressed text.]
Design choices:
- How to choose the phrases.
- How to design the data structure of the dictionary.
- How to encode phrases.

Collage System
A collage system is a pair 〈D, S〉.
D: a sequence of assignments (dictionary structure)
    X_1 := expr_1; X_2 := expr_2; ...; X_n := expr_n
S: a sequence of variables defined in D (compressed text)
    S = X_{i_1}, X_{i_2}, ..., X_{i_l}   (each X_{i_k} is defined in D)
||D|| = n: number of assignments in D
|S| = l: number of variables in S

Collage System (continued)
D: a sequence of assignments (dictionary structure)
    X_1 := expr_1; X_2 := expr_2; ...; X_n := expr_n
where each expr_k is one of:
    a           for a ∈ Σ ∪ {ε}               (primitive assignment)
    X_i · X_j   for i, j < k                  (concatenation)
    (X_i)^j     for i < k and an integer j    (j times repetition)
    [j] X_i     for i < k and an integer j    (prefix truncation)
    X_i [j]     for i < k and an integer j    (suffix truncation)

Example of Collage System
D:  X_1 := a;
    X_2 := b;
    X_3 := X_1 · X_2;    (ab)
    X_4 := X_2 · X_1;    (ba)
    X_5 := (X_3)^3;      (ababab, 3 times repetition)
    X_6 := [3] X_5;      (bab, prefix truncation)
    X_7 := X_6 · X_4;    (babba)
S:  X_3, X_6, X_4, X_7
Represented text: abbabbababba
[Figure: derivation tree T(X_7). X_7 has children X_6 and X_4; X_6 is derived from X_5, X_5 from X_3, and X_3 from X_1 and X_2; X_4 from X_2 and X_1.]
height(X_7) = 4, height(D) = 4
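The example above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the tuple encoding of assignments and the function names `expand_dictionary` and `decode` are assumptions made for illustration, and truncation is assumed to drop the first (or last) j characters, which is consistent with [3] ababab = bab in the example.

```python
from typing import Dict, List, Tuple

# Hypothetical tuple encoding of the right-hand sides (an assumption of this sketch):
#   ("prim", a)       primitive assignment, a is one character or ""
#   ("cat", i, j)     X_i . X_j
#   ("rep", i, j)     (X_i)^j, j times repetition
#   ("ptrunc", i, j)  [j] X_i, assumed to drop the first j characters of X_i
#   ("strunc", i, j)  X_i [j], assumed to drop the last j characters of X_i
Expr = Tuple


def expand_dictionary(D: Dict[int, Expr]) -> Dict[int, str]:
    """Expand every variable of D into the string it represents."""
    text: Dict[int, str] = {}
    for k in sorted(D):  # expr_k only refers to variables with a smaller index
        kind, *args = D[k]
        if kind == "prim":
            text[k] = args[0]
        elif kind == "cat":
            text[k] = text[args[0]] + text[args[1]]
        elif kind == "rep":
            text[k] = text[args[0]] * args[1]
        elif kind == "ptrunc":
            text[k] = text[args[0]][args[1]:]
        elif kind == "strunc":
            s = text[args[0]]
            text[k] = s[: len(s) - args[1]]
        else:
            raise ValueError(f"unknown expression kind: {kind}")
    return text


def decode(D: Dict[int, Expr], S: List[int]) -> str:
    """Concatenate the expansions of the variables listed in S."""
    text = expand_dictionary(D)
    return "".join(text[i] for i in S)


# The collage system from the example slide.
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),
    4: ("cat", 2, 1),
    5: ("rep", 3, 3),
    6: ("ptrunc", 5, 3),
    7: ("cat", 6, 4),
}
S = [3, 6, 4, 7]
print(decode(D, S))
```

Expanding every variable like this takes time proportional to the decompressed text, so it only serves to check the definition; compressed pattern matching aims to avoid exactly this expansion.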

Pattern Matching Algorithm on a Collage System

Compressed pattern matching on a collage system
m: pattern length
r: number of pattern occurrences
||D||: number of assignments in D
|S|: number of variables in S
Theorem [Kida et al. 1999]: The problem of compressed pattern matching can be solved in O((||D|| + |S|) · height(D) + m^2 + r) time using O(||D|| + m^2) space. If D contains no truncation, it can be solved in O(||D|| + |S| + m^2 + r) time.

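Basic Idea
Pattern π = ababb
[Figure: KMP automaton for π with states 0 to 5, showing the goto function and the failure function.]
Original text:               a b a b a b b a
State after each character:  1 2 3 4 3 4 5 1   (starting from state 0)
Compressed text S: X_{i_1} X_{i_2} X_{i_3} X_{i_4}, whose phrases together spell abababba. The state transitions are simulated phrase by phrase rather than character by character.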
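The state sequence on this slide can be reproduced with a plain character-by-character KMP automaton. This is a minimal sketch of the standard KMP technique; the function names `failure_function`, `kmp_step`, and `state_sequence` are chosen here for illustration and do not come from the slides.

```python
from typing import List


def failure_function(p: str) -> List[int]:
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = fail[k]
        if p[k] == p[q - 1]:
            k += 1
        fail[q] = k
    return fail


def kmp_step(p: str, fail: List[int], q: int, c: str) -> int:
    """One transition of the KMP automaton from state q on character c."""
    if q == len(p):  # after a full match, fall back before reading c
        q = fail[q]
    while q > 0 and p[q] != c:
        q = fail[q]
    if p[q] == c:
        q += 1
    return q


def state_sequence(p: str, text: str) -> List[int]:
    """States visited while reading text, starting from state 0."""
    fail = failure_function(p)
    q, states = 0, []
    for c in text:
        q = kmp_step(p, fail, q, c)
        states.append(q)
    return states


print(state_sequence("ababb", "abababba"))  # [1, 2, 3, 4, 3, 4, 5, 1]
```

State 5 marks an occurrence of the pattern ending at that character.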

Jump and Output
The function Jump(j, u) = δ_KMP(j, u). Its domain is Q × D. It simulates the sequence of state transitions for u. Reply in O(1) time.
The set Output(j, u) = { 1 ≤ i ≤ |u| : P is a suffix of P[1:j] · u[1:i] }. This set contains the pattern occurrences. Reply in O(l) time, where l is the size of the set Output.
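To make the two definitions concrete, here is a brute-force sketch that evaluates them directly on expanded strings. It is only an illustration of what Jump and Output return, not the O(1) and O(l) realizations discussed in the talk. It assumes that state j of the KMP automaton stands for the prefix P[1:j], so that Jump(j, u) is the length of the longest prefix of P that is a suffix of P[1:j] · u; the names `jump` and `output` are hypothetical.

```python
from typing import List


def jump(P: str, j: int, u: str) -> int:
    """State reached from state j after reading u (brute force)."""
    w = P[:j] + u
    for k in range(min(len(P), len(w)), -1, -1):
        if w.endswith(P[:k]):  # k = 0 always matches the empty prefix
            return k
    return 0


def output(P: str, j: int, u: str) -> List[int]:
    """All i in 1..|u| such that P is a suffix of P[1:j] . u[1:i]."""
    return [i for i in range(1, len(u) + 1) if (P[:j] + u[:i]).endswith(P)]


P = "ababb"
# A hypothetical factorization of the text abababba into four phrases.
phrases = ["ab", "ab", "abb", "a"]
q = 0
for u in phrases:
    print(u, output(P, q, u), jump(P, q, u))
    q = jump(P, q, u)
```

With this factorization the states after the four phrases are 2, 4, 5, and 1, matching the character-level run of the automaton at the phrase boundaries.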

Realization of Jump and Output
For Jump(q, X_k):
- If X_k is a primitive assignment a: O(1) time.
- If X_k is X_i · X_j: if the factor concatenation problem for a string of length m can be solved in O(1) time, Jump can be computed in O(1) time.
For Output(q, X_k):
- If X_k is a primitive assignment a: O(1) time.
- If X_k is X_i · X_j: the set can be enumerated in O(l) time from the Output sets of X_i and X_j, where l is the size of the set Output.

Factor Concatenation Problem
Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.
Question: Is the string xy a factor of P? If 'yes', then return its node number.
Example: P = COPACABANA, x = OPA, y = CABAN. Concatenating them gives OPACABAN = P[2:9], so the answer is 'yes'.
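A naive way to see what the problem asks is to search for xy in P directly. The sketch below does that with plain substrings instead of suffix trie nodes, so it illustrates only the question and the example, not the constant-time solution; the function name `concat_factor` and the (start, end) return convention are assumptions of this sketch.

```python
from typing import Optional, Tuple


def concat_factor(P: str, x: str, y: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based (start, end) of the first occurrence of xy in P, or None."""
    pos = P.find(x + y)
    if pos < 0:
        return None
    return (pos + 1, pos + len(x) + len(y))


print(concat_factor("COPACABANA", "OPA", "CABAN"))  # (2, 9)
print(concat_factor("COPACABANA", "ANA", "CO"))     # None
```

This check costs time that grows with the length of P per query, which is why the talk looks for a preprocessed structure that answers in O(1).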

Solution to the problem
- Using a suffix trie, it can be solved in O(m) time after preprocessing of O(m^2) time and space.
- Using a two-dimensional lookup table, it can be solved in O(1) time, but we need O(m^4) time and space for preprocessing.
- Our result: it can be solved in O(1) time after O(m^2) time and space preprocessing.

Outline of Our Algorithm
Input.  Pattern P and collage system 〈D, S〉 (S := X_{i_1}, X_{i_2}, ..., X_{i_n})
Output. All occurrences of the pattern.

    /* preprocessing of D and P */
    preprocess(D); preprocess(P);
    l := 0; q := 0;
    for j := 1 to n do begin
        for each d ∈ Output(q, X_{i_j}) do
            report 'pattern occurs at position l + d';
        q := Jump(q, X_{i_j});    /* state transition */
        l := l + |X_{i_j}|;       /* calculation of the offset */
    end
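The control flow of the outline can be tried out with brute-force stand-ins for Jump and Output. This is a sketch under stated assumptions: phrases are given as already expanded strings, `jump` and `output` are naive helpers written for this example, and the preprocessing steps are omitted, so the code follows the loop structure but does not achieve the time bounds stated in the theorem.

```python
from typing import List


def jump(P: str, q: int, u: str) -> int:
    """Brute-force Jump: longest prefix of P that is a suffix of P[:q] + u."""
    w = P[:q] + u
    for k in range(min(len(P), len(w)), -1, -1):
        if w.endswith(P[:k]):
            return k
    return 0


def output(P: str, q: int, u: str) -> List[int]:
    """Brute-force Output: end offsets in u where an occurrence of P ends."""
    return [d for d in range(1, len(u) + 1) if (P[:q] + u[:d]).endswith(P)]


def search(P: str, phrases: List[str]) -> List[int]:
    """Report the 1-based end position of every occurrence of P."""
    occurrences = []
    l = 0  # length of the text covered by the phrases read so far
    q = 0  # current KMP state
    for u in phrases:
        for d in output(P, q, u):
            occurrences.append(l + d)
        q = jump(P, q, u)  # state transition
        l += len(u)        # calculation of the offset
    return occurrences


# Phrases of the earlier collage system example: X_3, X_6, X_4, X_7.
print(search("abba", ["ab", "bab", "ba", "babba"]))  # [4, 7, 12]
```

Note that occurrences spanning a phrase boundary are found because the state q carries the matched prefix from one phrase to the next.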

Compressed pattern matching on a collage system
- No truncation (LZ78, LZW, BPE, run-length, etc.): O(||D|| + |S| + m^2 + r) time.
- Truncation (LZ77, LZSS, etc.): O((||D|| + |S|) · height(D) + m^2 + r) time; not suitable for speeding up pattern matching.

Byte Pair Encoding: original encoding algorithm and modified algorithm

Byte Pair Encoding
Text: T = ABABCDEBDEFABDEABC
Used characters: A B C D E F
Pair table (pair -> code): AB -> G, DE -> H, GC -> I
Replacement steps:
    ABABCDEBDEFABDEABC
    GGCDEBDEFGDEGC       (AB -> G)
    GGCHBHFGHGC          (DE -> H)
    GIHBHFGHI            (GC -> I)
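The replacement steps in this example can be reproduced with a short greedy loop. The following is a minimal sketch of the basic byte pair encoding idea, assuming that the most frequent adjacent pair is replaced at each step, that unused symbols are supplied by the caller, and that the loop stops when no pair occurs at least twice; the function names and the stopping rule are assumptions of this sketch, not details taken from the slides.

```python
from collections import Counter
from typing import List, Tuple


def bpe_compress(text: str, fresh: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Repeatedly replace the most frequent adjacent pair by an unused symbol.

    fresh lists symbols that do not occur in text (supplied by the caller).
    Returns the compressed text and the pair table as (pair, code) entries.
    """
    rules: List[Tuple[str, str]] = []
    for code in fresh:
        pairs = Counter(zip(text, text[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # replacing a pair that occurs once saves nothing
            break
        text = text.replace(a + b, code)
        rules.append((a + b, code))
    return text, rules


def bpe_decompress(text: str, rules: List[Tuple[str, str]]) -> str:
    """Undo the replacements in reverse order."""
    for pair, code in reversed(rules):
        text = text.replace(code, pair)
    return text


compressed, table = bpe_compress("ABABCDEBDEFABDEABC", "GHI")
print(compressed)  # GIHBHFGHI
print(table)       # [('AB', 'G'), ('DE', 'H'), ('GC', 'I')]
```

Each pass recounts all pairs from scratch, which is the simple but slow approach; the talk's modified algorithm addresses the compression time separately.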

Byte Pair Encoding as a "collage system"
Text: T = ABABCDEBDEFABDEABC
Replacement steps:
    GGCDEBDEFGDEGC       (AB -> G)
    GGCHBHFGHGC          (DE -> H)
    GIHBHFGHI            (GC -> I)
D:  X_1 := A; X_2 := B; X_3 := C; X_4 := D; X_5 := E; X_6 := F;
    X_7 := X_1 · X_2;    (G = AB)
    X_8 := X_4 · X_5;    (H = DE)
    X_9 := X_7 · X_3;    (I = GC)
S:  X_7, X_9, X_8, X_2, X_8, X_6, X_7, X_8, X_9

Speeding up of compression
Time complexity of BPE: O(uN), where u is the number of character codes and N is the text length.
Using a doubly-linked list: O(u + N) time.

Speed-up of compression
We apply the BPE algorithm to the first block of the original text to obtain the dictionary D:
    X_1 := A; X_2 := C; X_3 := X_2 · X_1; ...; X_255 := X_247 · X_8; X_256 := X_125 · X_48
A pattern matching machine for multiple replacement [Arikawa et al. 1984] built from D is then used to produce the BPE compressed text from the original text.

Comparison of Compression Ratio and Time
Columns: BPE (original), BPE (modified), Compress, Gzip.
Rows: Brown corpus (6.8Mb), Medline (60.3Mb), Genbank (17.1Mb).
Compression ratio (%), values as extracted from the slide: 51.0 56.2 30.8 32.5 59.0 26.8 42.3 43.7 39.0 33.3 23.1
Compression time (sec), values as extracted from the slide: 196.9 1699.9 440.6 16.5 60.7 8.0 19.3 73.3 12.7 37.7 242.2 100.9
The compression ratios of BPE are worse than those of "Compress" and "Gzip".
The compression time is drastically reduced by our modification.

Compressed pattern matching on BPE compressed text
The problem of compressed pattern matching on BPE compressed text can be solved in O(||D|| + |S| + m^2 + r) time.
- ||D|| ≤ 256.
- The dictionary D is encoded separately from the sequence S.
- The size of D is small enough.
- The variables of S are encoded using a fixed length code.

Experimental result
[Figure: comparison of KMP, Agrep, and our algorithm on Medline data (compression ratio is 59%) and Genbank data (compression ratio is 32%).]
Ultra...
Medline data: a clinically-oriented subset of Medline.
Genbank data: a data set from GenBank.

Concluding Remarks: Conclusion and Future Works

Conclusion
- We introduced compressed pattern matching from practical viewpoints.
- We observed that the running time of our algorithm is reduced at roughly the same rate as the compression ratio, compared with the uncompressed case.
- We also observed that it is occasionally faster than Agrep.

Future Works
- Can we reduce the complexity of the preprocessing from O(m^2) to O(m)?
- To develop a sublinear algorithm on BPE compressed texts.
- To develop an approximate pattern matching algorithm on a collage system.
- To develop a new compression method which is suitable for compressed pattern matching.

More recent work

A Boyer-Moore type algorithm for compressed pattern matching [CPM2000]
We proposed a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed texts. Does text compression speed up such a sublinear time algorithm?

More recent work
[Figure: comparison of KMP, Agrep, our algorithm, and the most recent work on Medline data (compression ratio is 59%) and Genbank data (compression ratio is 32%).]
