Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda Kyushu University, Japan SPIRE Cartagena, Colombia.


1 Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda Kyushu University, Japan SPIRE 2012 @ Cartagena, Colombia

2 Outline
• Background
• LZ78 Factorization
• Straight Line Programs (SLP)
• Algorithms
• LZ78 factorization using suffix trees
• SLP to LZ78
• Improvements

3 Background
Compressed String Processing (CSP)
• Compress a string for storage, but don't decompress all of it when using it!
• Can be faster than processing the uncompressed text, by exploiting regularities identified by compression
• Regard compression as a generic preprocessing step!
[Figure: a BIG string and its compressed representation; the compressed representation is processed directly for pattern matching, edit distance, pattern mining, etc.]
This work: LZ78 factorization of grammar-compressed strings

4 LZ78 Factorization [Ziv & Lempel ’78]
The LZ78 factorization of a string S is a factorization S = f_1 f_2 ... f_m where f_i is the longest prefix of f_i ... f_m such that f_i = f_j c for some 0 ≤ j < i and some character c (let f_0 = ε).
Example: S = alabaralalabarda$
f_1 = (0, a), f_2 = (0, l), f_3 = (1, b), f_4 = (1, r), f_5 = (1, l), f_6 = (5, a), f_7 = (0, b), f_8 = (4, d), f_9 = (1, $)
Each pair (j, c) denotes f_i = f_j c, so the factors are a, l, ab, ar, al, ala, b, ard, a$.
[Figure: LZ78 trie of S, with nodes 0 to 9 and one edge per factor, labeled by its character c.]
O(N log σ) time, O(m) space
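The definition can be tried out with a minimal Python sketch. It is not code from the talk: the function name `lz78_factorize` and the dictionary-based trie are illustrative assumptions, and hashing gives expected linear time rather than the O(N log σ) bound quoted on the slide.

```python
def lz78_factorize(s):
    """Return the LZ78 factors of s as (parent_index, char) pairs."""
    trie = {}     # maps (parent node id, character) -> child node id
    factors = []
    node = 0      # node 0 is the root and stands for f_0 = empty string
    for ch in s:
        child = trie.get((node, ch))
        if child is not None:
            node = child                  # keep extending the current match
        else:
            factors.append((node, ch))    # new factor f_i = f_node + ch
            trie[(node, ch)] = len(factors)
            node = 0
    if node != 0:
        # Input ended inside an existing factor; emit it without a new character.
        factors.append((node, ""))
    return factors


if __name__ == "__main__":
    for parent, ch in lz78_factorize("alabaralalabarda$"):
        print(parent, ch)
```

Running it on the slide's example string yields nine pairs, one per factor; the trailing `$` guarantees that the last factor ends with a fresh character.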

5 Straight Line Programs
A straight line program (SLP) is a CFG in Chomsky normal form that derives a single string. It can efficiently model the outputs of many compression algorithms: REPAIR, SEQUITUR, LZ78, etc.
SLP, n = 7:
X_1 = a
X_2 = b
X_3 = X_1 X_2
X_4 = X_1 X_3
X_5 = X_4 X_3
X_6 = X_4 X_5
X_7 = X_6 X_5
[Figure: derivation tree of X_7, which derives S = aabaababaabab.]
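As a small illustration, the example SLP can be written down and expanded in Python. The dictionary encoding (a string for a terminal rule, a pair of indices for a binary rule) and the helper names `lengths` and `expand` are assumptions made for this sketch; it also assumes every rule refers only to lower-numbered variables, as in the example.

```python
# Rules of the example SLP: a terminal character or a pair of earlier variables.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(slp):
    """|X_i| for every variable, computed bottom-up without expanding anything."""
    size = {}
    for i in sorted(slp):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def expand(slp, i):
    """Fully decompress variable X_i (output of length |X_i|; for checking only)."""
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


if __name__ == "__main__":
    print(expand(SLP, 7), lengths(SLP)[7])
```

The length table takes O(n) time, whereas `expand` takes time proportional to the decompressed length, which is exactly the cost that compressed string processing tries to avoid.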

6 Problem: SLP to LZ78
Input: SLP
X_1 = a, X_2 = b, X_3 = X_1 X_2, X_4 = X_1 X_3, X_5 = X_4 X_3, X_6 = X_4 X_5, X_7 = X_6 X_5
Output: LZ78 factorization (trie)
[Figure: LZ78 trie of the derived string, with nodes 0 to 6 and edges labeled a or b.]
Why "re-compress" a compressed representation?
• Convert the representation: some CSP algorithms require a specific compression
• Re-compress an SLP modified by ad-hoc edits: dynamic compressed texts
• Compute the Normalized Compression Distance [Li et al. 2004]: clustering & classification without decompression, using C_LZ78(x), C_LZ78(y), C_LZ78(xy) computed from the SLPs of x and y
"Computer Scientist Make Sleeping Files Walk in their Sleep!"
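To make the NCD use case concrete, here is a hedged Python sketch of the usual formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). Taking C to be the number of LZ78 factors is an assumption of this sketch, and it works on plain strings; the point of the talk is that the same three counts could be obtained from the SLPs of x and y without decompressing them.

```python
def lz78_count(s):
    """Number of LZ78 factors of s, used here as a stand-in for compressed size."""
    trie, node, count = {}, 0, 0
    for ch in s:
        nxt = trie.get((node, ch))
        if nxt is None:
            count += 1
            trie[(node, ch)] = count
            node = 0
        else:
            node = nxt
    return count + (1 if node else 0)   # count a trailing partial factor


def ncd(x, y):
    """Normalized compression distance of two non-empty strings."""
    cx, cy, cxy = lz78_count(x), lz78_count(y), lz78_count(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    print(ncd("aabaababaabab", "aabaababaabab"))
    print(ncd("aabaababaabab", "cdcdcdcdcdcdc"))
```

Small values suggest that y adds little new structure on top of x; with such short inputs the numbers are only illustrative.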

7 Our Results
Algorithms to compute LZ78 from SLP

| Algorithm | Time | Space |
|---|---|---|
| Direct (uncompressed) | O(N log σ) | O(m) |
| Decompress + Direct | O(N log σ) | O(n + m) |
| SLP (partial decompressions) | O(n N^{1/2} + m log N) | O(n N^{1/2} + m) |
| SLP + Doubling | O(nL + m log N) | O(nL + m) |
| SLP + Redundancy Reduction | O(N_α + m log N) | O(N_α + m) |

N: length of uncompressed string S
σ: alphabet size
n: size of SLP representing S
m: # of LZ78 factors (O(N / log N) for constant σ)
L: length of longest LZ78 factor
N_α = N - α ≤ N, where α ≥ 0 is a quantity that represents the amount of redundancy in the string that is captured by the SLP

8 LZ78 Factorization using a Suffix Tree

9 Suffix Tree & LZ78
The LZ78 trie can be superimposed on the suffix tree.
[Figure: suffix tree of S = aabaababaabab (positions 1 to 13), shown next to the LZ78 trie of S (nodes 0 to 6) and with the trie overlaid on it.]

10 LZ78 Factorization on Suffix Tree
Build the LZ78 trie on top of the suffix tree ST. Nodes corresponding to the LZ78 trie are marked.
For each position i:
• The next factor is a prefix of S[i:N]. Find the node in ST corresponding to S[i:N].
• Find the longest prefix of S[i:N] in the LZ78 trie → O(1) time by dynamic nearest marked ancestor (nma) queries [Westbrook ’92]
• Make a new node of the LZ78 trie on ST → O(1) time by a level ancestor (la) query on ST [Berkman & Vishkin ’94]
• Compute the next position i ← i + |f_i|
LZ78 factorization takes O(m) time, given a suffix tree preprocessed for nma & la queries.
[Figure: suffix tree of S = aabaababaabab (positions 1 to 13) with the LZ78 trie nodes 0 to 6 marked.]

11 SLP to LZ78

12 Our algorithm: SLP to LZ78
Key Observation: for any string of length N, the length of any LZ78 factor f_i satisfies
|f_i| ≤ c_N = (2N + 1/4)^{1/2} - 1/2 = O(N^{1/2})
Main Idea: we only need a suffix tree that contains all distinct substrings of S with length at most c_N
→ Build a generalized suffix tree (GST) from a set of substrings of S that contains all distinct length-c_N substrings of S
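One way to see the bound (a standard argument, not spelled out on the slide): a factor of length k extends an earlier factor of length k - 1, so a longest factor of length L forces distinct factors of lengths 1, 2, ..., L, and therefore L(L + 1)/2 ≤ N. Solving for L gives c_N. The Python sketch below computes the integer part of c_N exactly and the longest factor of a given string; both function names are illustrative assumptions.

```python
from math import isqrt


def max_factor_length(n_chars):
    """Largest integer L with L(L+1)/2 <= n_chars, i.e. floor((2N + 1/4)^(1/2) - 1/2)."""
    return (isqrt(8 * n_chars + 1) - 1) // 2


def longest_lz78_factor(s):
    """Length of the longest LZ78 factor of s (plain-string check of the bound)."""
    trie, node, depth, best = {}, 0, 0, 0
    for ch in s:
        depth += 1
        nxt = trie.get((node, ch))
        if nxt is None:
            trie[(node, ch)] = len(trie) + 1   # new factor ends here
            best = max(best, depth)
            node, depth = 0, 0
        else:
            node = nxt
    return max(best, depth)                    # account for a trailing partial factor


if __name__ == "__main__":
    s = "aabaababaabab"
    print(len(s), max_factor_length(len(s)), longest_lz78_factor(s))
```

For the running example with N = 13 the bound evaluates to 4, and the longest factor of aabaababaabab also has length 4, so the bound is tight there.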

13 Important Concept: Stabbing
X_i stabs an interval [u:v] of S when it is the shortest variable that derives the interval (any interval is stabbed by a unique variable).
SLP: X_1 = a, X_2 = b, X_3 = X_1 X_2, X_4 = X_1 X_3, X_5 = X_4 X_3, X_6 = X_4 X_5, X_7 = X_6 X_5
e.g.: aaba at [9:12] is stabbed by X_5
[Figure: derivation tree of S = aabaababaabab (positions 1 to 13) with the X_5 node above [9:12] highlighted.]
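The stabbing variable can be found by walking down the derivation tree until the interval straddles the boundary between the two children. The Python sketch below does this for the example SLP; the encoding and the name `stabbing_variable` are assumptions of the sketch, and the walk costs time proportional to the height of the derivation tree.

```python
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}

SIZE = {}
for _i in sorted(SLP):                      # |X_i|, computed bottom-up
    _rule = SLP[_i]
    SIZE[_i] = 1 if isinstance(_rule, str) else SIZE[_rule[0]] + SIZE[_rule[1]]


def stabbing_variable(slp, size, root, u, v):
    """Return i such that X_i stabs the 1-based inclusive interval [u:v] of S."""
    i, offset = root, 0                     # X_i derives S[offset+1 : offset+size[i]]
    while not isinstance(slp[i], str):
        left, right = slp[i]
        mid = offset + size[left]           # last position covered by the left child
        if v <= mid:
            i = left                        # interval lies entirely in the left child
        elif u > mid:
            i, offset = right, mid          # interval lies entirely in the right child
        else:
            break                           # interval crosses the boundary: X_i stabs it
    return i


if __name__ == "__main__":
    print(stabbing_variable(SLP, SIZE, 7, 9, 12))
```

For the interval [9:12] the walk goes from X_7 to its right child X_5 and stops there, because positions 9 to 11 belong to X_4 and position 12 to X_3.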

14 Substrings stabbed by X_i
For X_i = X_l(i) X_r(i), all length-q substrings stabbed by X_i are contained in a string t_i(q) of length at most 2(q - 1): the last q - 1 characters of X_l(i) followed by the first q - 1 characters of X_r(i).
Any length-q substring of S is stabbed by some unique variable X_i, and therefore is a substring of some t_i(q).
{ t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n } will contain all distinct length-c_N substrings of S.
[Figure: X_i split into X_l(i) and X_r(i); t_i(q) spans q - 1 characters on each side of the boundary.]
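The strings t_i(q) can be computed bottom-up without decompressing S, by keeping only a prefix and a suffix of length at most q - 1 per variable. The Python sketch below assumes that t_i(q) is the last q - 1 characters of the left child followed by the first q - 1 characters of the right child, which matches the example values given in the talk; the SLP encoding and the name `t_strings` are illustrative, and q ≥ 2 is assumed.

```python
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def t_strings(slp, q):
    """Return {i: t_i(q)} for every binary rule X_i with |X_i| >= q (assumes q >= 2)."""
    k = q - 1
    size, pre, suf, t = {}, {}, {}, {}
    for i in sorted(slp):                   # rules only refer to smaller indices
        rule = slp[i]
        if isinstance(rule, str):
            size[i], pre[i], suf[i] = 1, rule, rule
            continue
        l, r = rule
        size[i] = size[l] + size[r]
        # First and last k characters of X_i, built from the children's pieces.
        pre[i] = pre[l] if size[l] >= k else (pre[l] + pre[r])[:k]
        suf[i] = suf[r] if size[r] >= k else (suf[l] + suf[r])[-k:]
        if size[i] >= q:
            t[i] = suf[l] + pre[r]
    return t


if __name__ == "__main__":
    print(t_strings(SLP, 4))
```

Each variable costs O(q) time, so the whole set is produced in O(nq) time; with q = c_N this is the O(n c_N) term of the basic algorithm.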

15 LZ78 Factorization from SLP
Algorithm:
1. Compute { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n }
2. Build the generalized suffix tree (GST) for the strings { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n }
3. Run the LZ78 factorization algorithm using the GST
Steps 1 and 2: O(n c_N) time/space

16 Example
• N = 13, c_N = 4, n = 7
• { t_5(4), t_6(4), t_7(4) } = { aabab, aabaab, babaab }
[Figure: derivation tree of S = aabaababaabab (positions 1 to 13).]

17 GST & LZ78 Factors
The LZ78 trie superimposed on the GST of { t_5(4), t_6(4), t_7(4) }.
[Figure: GST of t_5(4), t_6(4), t_7(4) = aabab, aabaab, babaab (positions 1 to 17), shown next to the LZ78 trie of S = aabaababaabab (nodes 0 to 6) and with the trie overlaid on it.]

18 LZ78 Factorization on GST
c_N = 4. For each position i:
• The next factor is a prefix of S[i:N]. Find the node in the GST corresponding to S[i:N] → O(log N) time with random access on the SLP [Bille et al. 2011]
• Find the longest prefix of S[i:N] in the LZ78 trie → O(1) time with dynamic nma queries
• Make a new node for the LZ78 trie on the GST
• Compute the next position i ← i + |f_i|
[Figure: derivation tree of S = aabaababaabab and the GST of t_5(4), t_6(4), t_7(4); LZ78 trie nodes 0 and 1 are marked.]
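Random access on an SLP can be illustrated with a simple descent over the rule lengths. This Python sketch is not the data structure of Bille et al.: it takes time proportional to the height of the derivation tree, which is O(log N) only when the grammar is balanced. The encoding and the name `char_at` are assumptions of the sketch.

```python
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}

SIZE = {}
for _i in sorted(SLP):                      # |X_i|, computed bottom-up
    _rule = SLP[_i]
    SIZE[_i] = 1 if isinstance(_rule, str) else SIZE[_rule[0]] + SIZE[_rule[1]]


def char_at(slp, size, root, pos):
    """Return S[pos] (1-based) by walking down the derivation tree from the root."""
    i = root
    while not isinstance(slp[i], str):
        left, right = slp[i]
        if pos <= size[left]:
            i = left                        # position falls in the left child
        else:
            pos -= size[left]               # skip the left child, go right
            i = right
    return slp[i]


if __name__ == "__main__":
    print("".join(char_at(SLP, SIZE, 7, p) for p in range(1, SIZE[7] + 1)))
```

The same descent also reveals which variables cover a given position, which is the information needed to locate the substring starting at i inside one of the t_i strings.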

19 LZ78 Factorization on GST (continued)
[Figure: animation frame; the same four steps are repeated and LZ78 trie node 2 is marked on the GST.]

20 LZ78 Factorization on GST (continued)
LZ78 factorization can be computed in O(m log N) time, given a GST preprocessed for nma & la queries and an SLP preprocessed for random access queries.
[Figure: animation frame; LZ78 trie node 3 is marked on the GST.]

21 LZ78 Factorization on GST (continued)
[Figure: animation frame; LZ78 trie node 4 is marked on the GST.]

22 LZ78 Factorization on GST (continued)
[Figure: animation frame; LZ78 trie node 5 is marked on the GST.]

23 LZ78 Factorization on GST (continued)
[Figure: animation frame; LZ78 trie node 6 is marked on the GST, completing the factorization.]

24 Summary of Basic Algorithm

| Algorithm | Time | Space |
|---|---|---|
| Direct (uncompressed) | O(N log σ) | O(m) |
| Decompress + Direct | O(N log σ) | O(n + m) |
| SLP | O(n c_N + m log N) | O(n c_N + m) |

c_N = O(N^{1/2})
Extreme cases:
• If the string is compressible, n = O(log N), m = O(N^{1/2}), so O(n c_N + m log N) = O(N^{1/2} log N) = o(N)
• If the string is not compressible, n, m = O(N) and O(n c_N + m log N) = O(N^{1.5})
Can we do better than just reverting to decompress & process?

25 (1) Improving the n c_N term to nL ≤ n c_N
Let L denote the length of the longest LZ78 factor of S.
• We built the GST for distinct substrings of length at most c_N, but actually we only need substrings of length at most L
• However, L is not known beforehand…
Doubling Technique:
• Assume L = 2 and run the algorithm.
• If the LZ78 trie expands beyond the GST, set L ← 2×L, rebuild the GST and the LZ78 trie, and continue
• Total time complexity for rebuilds: Σ_{i=1..log L} O(n 2^i + m) = O(nL + m log L)
O(n c_N + m log N) time, O(n c_N + m) space → O(nL + m log N) time, O(nL + m) space
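The control flow of the doubling technique can be mimicked on a plain string. In this Python sketch the GST is not built at all: it is represented only by the current length cap, and a "rebuild" just records the doubled cap. The function name `lz78_doubling` and this simplification are assumptions made for illustration.

```python
def lz78_doubling(s):
    """Return (factor strings, list of caps tried) for string s."""
    trie, node, start = {}, 0, 0
    caps = [2]                          # initial guess L = 2
    factors = []
    for pos, ch in enumerate(s):
        length = pos - start + 1        # length of the factor being extended
        if length > caps[-1]:
            # The factor outgrew the current structure: double the guess.
            # A real implementation would rebuild the GST for the new cap and
            # re-mark the LZ78 trie nodes on it; this sketch only records the cap.
            caps.append(2 * caps[-1])
        nxt = trie.get((node, ch))
        if nxt is None:
            trie[(node, ch)] = len(trie) + 1
            factors.append(s[start:pos + 1])
            node, start = 0, pos + 1
        else:
            node = nxt
    if start < len(s):
        factors.append(s[start:])       # trailing factor that repeats an earlier one
    return factors, caps


if __name__ == "__main__":
    print(lz78_doubling("aabaababaabab"))
```

The caps tried are 2, 4, 8, ... and the last one is below 2L, so their sum is below 4L. If each build costs time proportional to n times the cap, the builds add up to O(nL), in line with the sum on the slide.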

26 (2) Improving the n c_N term to N_α ≤ N
We can replace the GST with the suffix tree of a trie for q = c_N.
Lemma [Goto et al. CPM 2012]: Given an SLP for string S, the set of length-q substrings of S can be represented as paths in a reverse trie of size N_α = N - α(q) ≤ N, where
α(q) = Σ_{i : |X_i| ≥ q} (vOcc(X_i) - 1) (|t_i(q)| - (q - 1)) ≥ 0
and vOcc(X_i) is the number of times X_i occurs in the derivation tree. The trie can be computed in time linear in its size.
Lemma [Shibuya 2003]: The suffix tree of a reverse trie can be constructed in linear time.
N_α = O(n c_N)
O(n c_N + m log N) time, O(n c_N + m) space → O(N_α + m log N) time, O(N_α + m) space
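The quantity α(q) can be evaluated directly from the SLP. The Python sketch below computes vOcc top-down and then applies the formula of the lemma; the encoding and the name `n_alpha` are assumptions, and |t_i(q)| is taken to be min(|X_l(i)|, q - 1) + min(|X_r(i)|, q - 1).

```python
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def n_alpha(slp, root, q):
    """Return (N_alpha, alpha(q)) for the string derived by X_root."""
    order = sorted(slp)                 # rules only refer to smaller indices
    size = {}
    for i in order:
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    # vOcc(X_i): number of nodes labeled X_i in the derivation tree of the root.
    vocc = {i: 0 for i in order}
    vocc[root] = 1
    for i in reversed(order):
        rule = slp[i]
        if not isinstance(rule, str):
            vocc[rule[0]] += vocc[i]
            vocc[rule[1]] += vocc[i]
    alpha = 0
    for i in order:
        rule = slp[i]
        if isinstance(rule, str) or size[i] < q:
            continue
        l, r = rule
        t_len = min(size[l], q - 1) + min(size[r], q - 1)   # |t_i(q)|
        alpha += (vocc[i] - 1) * (t_len - (q - 1))
    return size[root] - alpha, alpha


if __name__ == "__main__":
    print(n_alpha(SLP, 7, 4))
```

For the example SLP and q = 4, only X_5 occurs more than once among the variables of length at least 4 (vOcc(X_5) = 2, |t_5(4)| = 5), so α(4) = 2 and N_α = 13 - 2 = 11.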

27 Example: Trie of size N_α for q = 4
We can aggregate all t_i(q) into a trie of size at most the text size.
• Σ |t_i(q)|: 17
• Text size: 13
• Trie size: 11
[Figure: derivation tree of S = aabaababaabab, with the strings aabab, aab and bab that are aggregated into the trie.]

28 Summary
• Showed an algorithm for SLP → LZ78 factorization
• At least as fast as naïve decompress & process
• Better when the string is compressible

| Algorithm | Time | Space |
|---|---|---|
| Direct (uncompressed) | O(N log σ) | O(m) |
| Decompress + Direct | O(N log σ) | O(n + m) |
| SLP (partial decompressions) | O(n N^{1/2} + m log N) | O(n N^{1/2} + m) |
| SLP + Doubling | O(nL + m log N) | O(nL + m) |
| SLP + Redundancy Reduction | O(N_α + m log N) | O(N_α + m) |

N: length of uncompressed string S
σ: alphabet size
n: size of SLP representing S
m: # of LZ78 factors (O(N / log N) for constant σ)
L: length of longest LZ78 factor
N_α = N - α(c_N) ≤ N

