Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda

LZD Factorization: Simple and Practical Online Grammar Compression with Variable-to-Fixed Encoding
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Kyushu University

This is joint work with Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda.

Overview
We propose a novel online grammar compression algorithm called LZ Double (LZD), which is based on LZ78. With a slight modification to LZ78, LZ Double achieves a better compression ratio. Moreover, compared to previous online grammar compression algorithms, the compression ratios of LZ Double are better, and its compression speed is competitive.

LZ78 Factorization [Ziv and Lempel 1978]
Definition
Let f0 = ε. The LZ78 factorization of a string T is f0 f1 … fm such that, for each fj starting at position i = 1 + |f0 f1 … fj-1|, fj = fk c (0 ≤ k < j ≤ m), where
fk ∈ {f0, f1, …, fj-1} is the longest previous factor (LPF) that occurs at i, and
c is the following character T[i + |fk|].

T = abaabababaaaaab・・
f1 = (f0, a) = a
f2 = (f0, b) = b
f3 = (f1, a) = aa
f4 = (f2, a) = ba
f5 = (f4, b) = bab
f6 = (f3, a) = aaa
f7 = (f3, b) = aab

I will explain the definition of LZ78. Let f0 be the empty string. The LZ78 factorization of a string T is a factorization f0 to fm such that each factor fj, starting at position i, is a pair of a factor fk and a character c, where fk is the longest previous factor that occurs at i and c is the following character. We will write LPF for the longest previous factor in the rest of the talk. In this example, a occurs for the first time, so f0 is the LPF and f1 is the pair of f0 and a. b also occurs for the first time, so f2 is the pair of f0 and b. Next, f1 is the LPF, so f3 is the pair of f1 and the following character a. In this way LZ78 factorizes a string, and the result can be represented by a sequence of pairs, each consisting of an LPF and a following character.

Theorem
For a string T of length N over an alphabet of size σ, the LZ78 factorization can be computed online in O(N log σ) time and O(m) space.

[Figure: the LZ78 trie for T = abaabababaaaaab・・, with one node for each factor f0 to f7.]

Each LPF can be computed by traversing the trie of previous factors. Since the total length of the factors is N, LZ78 can be computed online in O(N log σ) time and O(m) space, where σ is the alphabet size and m is the number of factors.
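The trie traversal described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `lz78_factorize` and the nested-dictionary trie are assumptions made here, and hash-based dictionaries give expected constant-time child lookups instead of the O(log σ) per step assumed in the theorem.

```python
def lz78_factorize(text):
    """Return LZ78 factors as (k, c) pairs, meaning f_j = f_k followed by character c."""
    root = {}            # trie node: maps a character to (child node, factor index)
    pairs = []
    node, k = root, 0    # current node and the index of the factor it spells (f0 = empty)
    for ch in text:
        if ch in node:
            # Still inside a previous factor: keep walking down the trie.
            node, k = node[ch]
        else:
            # The LPF ends here: emit (k, ch) and add the new factor as a child.
            pairs.append((k, ch))
            node[ch] = ({}, len(pairs))
            node, k = root, 0
    if node is not root:
        # Input ended inside a previous factor; emit it with no following character.
        pairs.append((k, ""))
    return pairs


print(lz78_factorize("abaabababaaaaab"))
# [(0, 'a'), (0, 'b'), (1, 'a'), (2, 'a'), (4, 'b'), (3, 'a'), (3, 'b')]
```

The printed pairs are the factors f1 to f7 of the example string from the LZ78 slide.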

LZ78: fj = fk c
□ Good: simple, and easy to implement
□ Bad: low compression ratio, because a new factor can be at most 1 character longer than the longest previous factor

The good point of LZ78 is that the definition is very simple and it is easy to implement. We just compute the LPF and create a new factor from the LPF and the following character. But the compression ratio is not good. This is because a new factor can be at most 1 character longer than the maximum length of the previous factors. We modify this.

Idea of LZ Double
In the LZD factorization, the new factor is the concatenation of the two longest previous factors.
LZ78: fj = fk c
LZ Double: fj = fl(j) fr(j)
□ Good: still simple, easy to implement, AND better compression ratio, because a new factor can be twice as long as the longest previous factor

LZ Double solves the problem by modifying the definition of a factor. A factor of LZ Double is represented as a pair of an LPF and the LPF that follows it, so the length of the new factor can be twice the maximum length of the previous factors.
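The difference in growth can be made explicit with a short worked bound. The notation M_{j-1} for the maximum length among f0, …, fj-1 (with M_0 = 0) is introduced here only for this illustration, and the LZD line follows the slide's convention that the left part may be a single character while the right part is a previous factor or f0.

```latex
\begin{align*}
\text{LZ78:}\quad & |f_j| \le 1 + M_{j-1}
  && \Longrightarrow\quad |f_j| \le j, \\
\text{LZD:}\quad  & |f_j| \le \max(1, M_{j-1}) + M_{j-1}
  && \Longrightarrow\quad |f_j| \le 2^{\,j-1}.
\end{align*}
```

On a unary text T = a^N both bounds are met until the text runs out: LZ78 produces factors of lengths 1, 2, 3, …, while LZD produces factors of lengths 1, 2, 4, ….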

Formal Definition of LZ Double Factorization
Let f0 = ε. The LZD factorization of T is f0 f1 … fm such that, for i = 1 + |f0 f1 … fj-1|, fj = fl(j) fr(j), where
fl(j) ∈ {f1, …, fj-1} ∪ Σ is the LPF that occurs at i, and
fr(j) ∈ {f0, f1, …, fj-1} is the LPF that occurs at i + |fl(j)|.

T = abaabaaabaaababaabb・・・
f1 = (a, f0) = a
f2 = (b, f0) = b
f3 = (f1, f1) = aa
f4 = (f2, f3) = baa
f5 = (f1, f4) = abaa
f6 = (f1, f2) = ab

This is the formal definition. In this example, a and b occur for the first time, so f1 is the pair (a, f0) and f2 is (b, f0), where f0 is the empty string. Next, the LPF starting at position 3 is f1, and the LPF starting at position 4 is also f1, so f3 is the pair of f1 and f1. Next, the LPF starting at position 5 is f2 and the LPF starting at position 6 is f3, so f4 is the pair of f2 and f3. In this way, LZD factorizes a string, and the result can be represented by a sequence of pairs of LPFs.

LZ Double Factorization
T = abaabaaabaaababaabb・・・
f1 = (a, f0) = a
f2 = (b, f0) = b
f3 = (f1, f1) = aa
f4 = (f2, f3) = baa
f5 = (f1, f4) = abaa
f6 = (f1, f2) = ab
f7 = (f5, f2) = abaab

In this way, LZD factorizes a string, and the result can be represented by a sequence of pairs of LPFs.
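To see that the sequence of pairs really determines the text, the following sketch expands the pairs from the slide back into factor strings. The function name `lzd_expand` and the pair encoding (a left part that is either a character or a factor index, and a right part that is a factor index with 0 standing for f0) are conventions chosen here for illustration.

```python
def lzd_expand(pairs):
    """Expand LZD pairs into factor strings; the returned list starts with f0 = ''."""
    factors = [""]
    for left, right in pairs:
        # The left part is either a single character or the index of a previous factor.
        left_str = left if isinstance(left, str) else factors[left]
        # The right part is always the index of a previous factor (0 means empty).
        factors.append(left_str + factors[right])
    return factors


# Pairs f1..f7 as listed on the slide.
PAIRS = [("a", 0), ("b", 0), (1, 1), (2, 3), (1, 4), (1, 2), (5, 2)]
factors = lzd_expand(PAIRS)
print(factors[1:])       # ['a', 'b', 'aa', 'baa', 'abaa', 'ab', 'abaab']
print("".join(factors))  # abaabaaabaaababaab
```

The concatenation reproduces the first 18 characters of the example text, which is the part covered by f1 to f7.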

Naive LZD Factorization Algorithm
The naive algorithm:
Stores all previous factors in a Patricia tree, and marks their corresponding nodes
Traverses the tree with the suffix T[i..N]; the deepest marked node on this path is the LPF
Inserts the new factor, and marks the corresponding node

[Figure: the Patricia tree of f1 to f5 for T = abaabaaabaaababaabb・・・, where f1 = (a, f0), f2 = (b, f0), f3 = (f1, f1), f4 = (f2, f3), f5 = (f1, f4).]

The naive algorithm stores all previous factors in a Patricia tree, not a trie, and marks the nodes that correspond to factors. To compute the LPF starting at i, it traverses the tree with the suffix starting at i; the deepest marked node on the path is the LPF.

[Figure: the same Patricia tree after inserting f6 = (f1, f2).]

After computing the two LPFs, the algorithm inserts the new factor, here f6 = (f1, f2), as a node and marks it.
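The three steps of the naive algorithm can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it uses a plain (uncompacted) trie instead of a Patricia tree, the names `Node`, `longest_marked_prefix`, and `lzd_factorize` are made up here, and, following the example on the slides, a position where no previous factor occurs is emitted as the pair (character, f0).

```python
class Node:
    """Trie node; `factor` is the index of the factor spelled by this node, if marked."""
    def __init__(self):
        self.children = {}
        self.factor = None


def longest_marked_prefix(root, text, start):
    """Walk down from the root along text[start:]; return (factor index, length)
    of the deepest marked node on the path, or (0, 0) if there is none."""
    node, best, best_len, depth = root, 0, 0, 0
    for pos in range(start, len(text)):
        node = node.children.get(text[pos])
        if node is None:
            break
        depth += 1
        if node.factor is not None:
            best, best_len = node.factor, depth
    return best, best_len


def lzd_factorize(text):
    """Return LZD pairs (left, right); left is a factor index or a first-time character."""
    root, pairs, i = Node(), [], 0
    while i < len(text):
        left, left_len = longest_marked_prefix(root, text, i)
        if left_len == 0:
            # Assumption taken from the slide example: a character with no
            # matching previous factor becomes the factor (c, f0).
            left, left_len, right, right_len = text[i], 1, 0, 0
        else:
            right, right_len = longest_marked_prefix(root, text, i + left_len)
        pairs.append((left, right))
        # Insert the new factor and mark its node (keep the older mark if present).
        node = root
        for ch in text[i:i + left_len + right_len]:
            node = node.children.setdefault(ch, Node())
        if node.factor is None:
            node.factor = len(pairs)
        i += left_len + right_len
    return pairs


print(lzd_factorize("abaabaaabaaababaab"))
# [('a', 0), ('b', 0), (1, 1), (2, 3), (1, 4), (1, 2), (5, 2)]
```

On the first 18 characters of the example text the sketch reproduces the pairs f1 to f7 shown on the slides. A Patricia tree would store the same marked nodes while compacting unary paths.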

Theorem
For a string T of length N over an alphabet of size σ, let m be the number of factors and M be the maximum length of the factors. The LZD factorization can be computed in O(m (M + min(m, M) log σ)) time and O(m) working space.

Let M be the maximum length of the factors, and m be the number of factors. To compute an LPF, the algorithm traverses the tree with at most M characters and encounters at most min(m, M) branching nodes. Therefore, in total, LZD can be computed in this time and O(m) working space.

Ideas of O(N log σ) time Algorithm
It superimposes the Patricia tree for the previous factors onto the suffix tree.

[Figure: a suffix tree in which the nodes corresponding to factors f0 to f6 are marked.]

The O(N log σ) time algorithm follows a process similar to the naive algorithm: to compute an LPF, it finds the deepest node u and computes its nearest marked ancestor nma(u). We will explain that these two steps can be computed efficiently. First, for each position i, the deepest node u reached by traversing with the suffix starting at i can be computed in amortized O(log σ) time. Second, its nearest marked ancestor nma(u) can be computed in amortized constant time.

For each position i, the deepest node that represents a prefix of T[i..N] can be computed in amortized O(log σ) time using Ukkonen’s algorithm.

For each node u of a growing tree, the nearest marked ancestor of u can be computed in amortized constant time, and an unmarked node can be marked in amortized constant time [Amir+, 1995].

[Figure: a node u and its nearest marked ancestor nma(u) in the tree for T = abaabaaabaaababaabb・・・ with factors f1 to f5.]

O(N log σ) LZD Algorithm by Ukkonen’s Suffix Tree
Theorem
For a string T of length N over an alphabet of size σ, the LZD factorization can be computed in O(N log σ) time and O(N) space.

In this way, each LPF can be computed in amortized O(log σ) time by finding the deepest node reached with the suffix starting at i and then its nearest marked ancestor. In total, LZD can be computed in O(N log σ) time and O(N) space by using suffix trees.

LZD factorization with Variable-to-Fixed-Encoding
LZDVF is a restricted version of LZD that remembers at most 2^L previous factors at each step, where L is the bit-length of the codewords representing factors. Therefore, each factor is the concatenation of the two longest previous factors in a dictionary containing at most 2^L previous factors, and each new factor is encoded in 2L bits (two L-bit codewords).

We propose two variants of LZDVF:
LZDVF Prefix: based on a least recently used (LRU) strategy
LZDVF Count: based on a least frequently used (LFU) strategy

Each variant deletes factors when the dictionary is full, in order to use the limited space efficiently. The two differ in how they select the factors to delete.

LZDVF Prefix
When a new factor fi is added, LZDVF Prefix considers all factors that are non-empty prefixes of fi to be used at this step.

T = abaabaaabaaababaabb・・・
f1 = (a, f0), f2 = (b, f0), f3 = (f1, f1), f4 = (f2, f3), f5 = (f1, f4), f6 = (f1, f2), f7 = (f5, f2)

For example, when f7 = (f5, f2) is added, f7, f5, f6, and f1 are all non-empty prefixes of f7, so we consider these factors to be used, in this order.

LZDVF Prefix
Manages factors in a doubly linked list in recently-used order
Deleting and using a factor can be done in constant time

most recently used: f1 f6 f5 f2 f4 f3 :least recently used

Theorem
Assume 2^L < N for simplicity. For a string T of length N over an alphabet of size σ, LZDVF Prefix can be computed in O(N + 2^L (M + min(M, 2^L) log σ)) time and O(2^L) working space.

We manage all factors in a doubly linked list ordered by recency of use, which supports deleting and using a factor in constant time. Assuming 2^L is less than the length of the input string, LZDVF Prefix can be computed in this time and O(2^L) working space.
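The recency order shown on the slide can be reproduced with a small sketch. It is illustrative only: `prefix_lru_order` is a hypothetical helper, Python's `collections.OrderedDict` stands in for the doubly linked list, eviction is left out, and prefixes are touched from longest to shortest, as the order f7, f5, f6, f1 in the example suggests.

```python
from collections import OrderedDict


def prefix_lru_order(factor_strings):
    """Replay the Prefix usage rule for factors f1, f2, ... given as strings and
    return the factor indices ordered from most to least recently used."""
    lru = OrderedDict()   # keys are factor indices, least recently used first
    index_of = {}         # factor string -> factor index
    for j, s in enumerate(factor_strings, start=1):
        index_of[s] = j
        lru[j] = None
        # Touch every factor that is a non-empty prefix of the new factor,
        # longest first, so the shortest prefix ends up most recently used.
        for length in range(len(s), 0, -1):
            k = index_of.get(s[:length])
            if k is not None:
                lru.move_to_end(k)
    return list(reversed(lru))


print(prefix_lru_order(["a", "b", "aa", "baa", "abaa", "ab"]))
# [1, 6, 5, 2, 4, 3]
```

After f1 to f6 have been added, the order is f1, f6, f5, f2, f4, f3, which matches the list on the slide. In this representation the least recently used factor would be removed with `lru.popitem(last=False)`.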

LZDVF Count
When a new factor fi is added, LZDVF Count considers each factor to be used, at this step, as many times as it occurs in the derivation tree of fi.

T = abaabaaabaaababaabb・・
f1 = (a, f0), f2 = (b, f0), f3 = (f1, f1), f4 = (f2, f3), f5 = (f1, f4), f6 = (f1, f2), f7 = (f5, f2)

The derivation tree of f7: f7 has children f5 and f2; f5 has children f1 and f4; f4 has children f2 and f3; f3 has children f1 and f1.

Next, I will explain LZDVF Count. When a new factor fi is added, LZDVF Count considers all factors that occur in the derivation tree of fi to be used. For example, in the derivation tree of f7 = (f5, f2), f5, f4, and f3 each occur once, f2 occurs twice, and f1 occurs three times. We consider each of these factors to be used as many times as it occurs.

LZDVF Count
Manages the frequency of each factor fi by Count(fi)
When a new factor fi is added, Count(fj) is increased by vOcc(fi, fj) for every fj such that vOcc(fi, fj) > 0, where vOcc(fi, fj) is the number of occurrences of fj in the derivation tree of fi
If the dictionary is full, Count(fi) is decreased by one for all factors, and the factors fi such that Count(fi) = 0 are deleted

Theorem
Assume 2^L < N for simplicity. For a string T of length N over an alphabet of size σ, LZDVF Prefix can be computed in O(N + 2^L (M + min(M, 2^L) log σ)) time and O(2^L) working space.

This is a formal description. We use Count(fi) as a counter for the number of occurrences of fi in a local period. If the space is full, we subtract one from Count for all factors, and delete all factors such that Count(fi) = 0. Notice that when fi is added, Count for fi-1 must be 1, so we only have to subtract one from Count.
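The quantity vOcc can be illustrated by walking the derivation tree of the slide's example. This is a rough sketch under stated assumptions: the names `RULES` and `v_occ` are made up here, the root fi itself is not counted (matching the counts quoted for f7), and the explicit tree walk is meant only to show the definition, since a derivation tree can be much larger than the dictionary.

```python
from collections import Counter

# Pairs from the slide: factor index -> (left, right); 0 stands for f0,
# and a string stands for a single character of the alphabet.
RULES = {1: ("a", 0), 2: ("b", 0), 3: (1, 1), 4: (2, 3),
         5: (1, 4), 6: (1, 2), 7: (5, 2)}


def v_occ(rules, i):
    """Count how often each factor occurs below the root in the derivation tree of f_i."""
    def factor_children(j):
        # Only previous factors count; characters and the empty f0 are skipped.
        return [c for c in rules[j] if isinstance(c, int) and c != 0]

    counts = Counter()
    stack = factor_children(i)
    while stack:
        j = stack.pop()
        counts[j] += 1
        stack.extend(factor_children(j))
    return counts


occ = v_occ(RULES, 7)
print(sorted(occ.items()))
# [(1, 3), (2, 2), (3, 1), (4, 1), (5, 1)]
```

For f7 the sketch reports f1 three times, f2 twice, and f3, f4, f5 once each, as in the example. Adding these values to Count(fj) corresponds to the update step described on the slide.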

Computational Experiments
We compared our algorithms LZD (Naive), LZDVF Prefix, and LZDVF Count with the previous online grammar compression algorithms LZ78, FOLCA [Maruyama+, 2013], and OLCA [Maruyama, 2011].
The outputs of LZD and LZ78 are first transformed to Straight Line Programs, and then encoded with the same encoding as OLCA.
Data: Pizza Chili corpus (ENGLISH, DNA, PROTEINS, DBLP, SOURCES)

We compared our algorithms, namely the naive version of LZD, LZDVF Prefix, and LZDVF Count, with the previous online grammar compression algorithms LZ78, FOLCA, and OLCA; the latter two are state-of-the-art online grammar compression algorithms. We explained how to encode LZDVF, but not LZD and LZ78. In the experiments, LZD and LZ78 are first transformed to Straight Line Programs and encoded in the same way as OLCA. We experiment on non-highly-repetitive data from the Pizza Chili corpus, and on highly repetitive large data of size 10GB from Wikipedia.

Compression Speed and Ratio for pizza chili corpus
Compression speed of LZD/LZDVFs is slow
Compression ratio of LZD/LZDVF Count is better than the others

[Figure: compression ratio (x axis) versus compression speed, measured as the number of characters processed per second (y axis), for LZD, LZDVF Prefix (L=10), LZDVF Prefix (L=16), LZDVF Count (L=10), LZDVF Count (L=16), LZ78, OLCA, and FOLCA.]

This figure plots the compression ratio and the compression speed for several data sets in the corpus. The x axis indicates compression ratio, and the y axis indicates compression speed, which is the number of characters each algorithm processed per second. In this experiment, the compression ratio of LZD is in most cases better than that of FOLCA, OLCA, and LZ78. LZDVFs compress well, and run faster than the others.

Computational Experiments
Data:
Non-highly-repetitive texts: Pizza Chili corpus
Highly repetitive large texts: 10GB of English Wikipedia edit history

Computational Experiments for Highly-repetitive Large Texts
We compared LZDVF Prefix and Count with Lossy FOLCA, Freq FOLCA [Maruyama and Tabei, 2014], and ADS [Sekine+, 2014], which are previous online grammar compression algorithms for highly repetitive large texts.
For each algorithm except Lossy FOLCA, the number of rules stored locally is varied over 2^12, 2^14, and 2^16.
For Lossy FOLCA, the block size used to split the input is varied over 100MB, 500MB, and 1000MB; for ADS, it is fixed to 100MB.

For highly repetitive large texts, we compared LZDVFs with grammar compression algorithms that use only constant space: Lossy FOLCA and Freq FOLCA, which are based on FOLCA, and ADS, which is based on RE-PAIR. Some of these algorithms require the dictionary size and the block size to be set. We set the dictionary size to 2^12, 2^14, and 2^16, and vary the block size for Lossy FOLCA.

Compression Speed for Highly-repetitive Large Texts
The compression speed of LZDVF Prefix is almost 8 times faster than that of Freq FOLCA, Lossy FOLCA, and ADS.

[Figure: compression speed of LZDVF Prefix (2^16 rules), LZDVF Count (2^16 rules), ADS (2^16 rules, 100MB block size), Freq FOLCA (2^14, 2^16 rules; 500MB, 1000MB block size), and Lossy FOLCA.]

Decompression Speed for Highly-repetitive Large Texts
The decompression speed of LZDVF Prefix is almost 2 times faster than that of ADS, and 3-5 times faster than that of Freq FOLCA and Lossy FOLCA.

[Figure: decompression speed of LZDVF Prefix (2^16 rules), LZDVF Count (2^16 rules), ADS (2^16 rules, 100MB block size), Freq FOLCA (2^14, 2^16 rules; 500MB, 1000MB block size), and Lossy FOLCA.]

Summary
We proposed LZ78-like online grammar compression algorithms:
LZ Double
LZ Double with Variable-to-Fixed Encoding

Future Work
Find a worst-case instance for the naive algorithm for LZD
Develop an algorithm to compute LZD using less space
Analyze the approximation ratio of LZD to the smallest grammar

This is the summary. We proposed LZ78-like online grammar compression algorithms. In our experiments, our new algorithms run faster than previous online grammar compression algorithms. For highly repetitive large texts, LZDVF Prefix greatly improved the compression and decompression speeds. That's all. Thank you.
