Download presentation

Presentation is loading. Please wait.

Published byLisbeth Trivett Modified over 2 years ago

1
Fast Compressed Tries through Path Decompositions Roberto Grossi Giuseppe Ottaviano* Università di Pisa * Part of the work done while at Microsoft Research Cambridge

2
t three trial triangle trie triple triply iree εε l εεεgle hr e p a l n Node label Branching character Compacted tries ye

3
Applications String dictionaries – With prefix lookup, predecessor, … – Exploit prefix compression Monotone perfect hash functions – “Hollow” or “Blind” tries [ALENEX 09] – Binary tree (no need store branching chars) – No need to store node labels, just lengths (skips)

4
Height vs. performance Tries can be deep – no guarantee on height Bad with pointer-based trees – ~1 cache miss per child operation Worse with succinct tree encodings – Need to access several directories – Many cache misses per child operation – Large constants hidden in the O(1)

5
Path decomposition t iree εε l εεεgle hr e p a l n ye he p l Query: triple Recurse here with suffix le

6
Centroid path decomposition Decompose along the heavy paths – choose the edge that has most descendants Height of the decomposed tree: O(log n) – Usually lower Average height Web QueriesURLsSynthetic Compacted trie11.018.1504.4 Centroid trie5.26.22.8 Hollow trie50.867.31005.3 Centroid hollow trie8.09.22.8

7
Succinct encoding [PODS 08] presents a succinct data structure for centroid path-decomposed tries Not practical: need complex operations on succinct trees We introduce a simpler and practical encoding This encoding enables also simple compression of the labels

8
Succinct encoding he p l L : t1ri2a1ngle BP: ( ((( ) B : h epl (spaces added for clarity) Node label written literally, interleaved with number of other branching characters at that point in array L Corresponding branching characters in array B Tree encoded with DFUDS in bitvector BP – Variant of Range Min-Max tree [ALENEX 10] to support Find{Close,Open}, more space-efficient (Range Min tree)

9
Compression of L...$...index.html$....html$....html$...index.html$...$...35$...5$...5$...35$ … 3 index … 5.html … Dictionary Dictionary codewords shared among labels Codewords do not cross label boundaries ( $ ) Use vbyte to compress the codeword ids

10
Compression of L Node labels ( t1ri2a1ngle, l1e, …): – each label is suffix of a string in the set – interleaved with few “special characters” 1, 2, 3,… Compressible if strings are compressible Dictionary and parsing computed with modified Re-Pair – Domain-specific compression can be used instead Decompression overhead negligible

11
Experimental results (time) Experiments show gains in time comparable to the gains in height Confirm that bottleneck is traversal operations Web QueriesURLsSynthetic Trie3.57.0119.8 Centroid trie2.44.35.1 Hollow trie [ALENEX 09]16.622.4462.7 Hollow trie7.213.9137.1 Centroid hollow trie2.84.411.1 (microseconds, lower is better) Code available at https://github.com/ot/path_decomposed_tries

12
Experimental results (space) For strings with many common prefixes, even non-compressed trie is space-efficient Labels compression considerably increases space-efficiency Decompression time overhead: ~10% Web QueriesURLsSynthetic Hu-Tucker Front Coding40.9%24.4%19.1% Centroid trie55.6%22.4%17.9% Centroid trie + compression31.5%13.6%0.4% (compression ratio, lower is better) Code available at https://github.com/ot/path_decomposed_tries

13
Thanks for your attention! Questions?

14
References [ALENEX 10] D. Arroyuelo, R. Cánovas, G. Navarro, and K. Sadakane. Succinct trees in practice. In ALENEX, pages 84–97, 2010. [ALENEX 09] D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. Monotone minimal perfect hashing: searching a sorted table with O(1) accesses. In SODA, pages 785– 794, 2009. [PODS 08] P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter. On searching compressed string collections cache-obliviously. In PODS, pages 181–190, 2008.

Similar presentations

OK

Paolo Ferragina, Università di Pisa Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa.

Paolo Ferragina, Università di Pisa Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google

Ppt on swine flu Ppt on sports and games free download Ppt on shielded metal arc welding Ppt on amplitude shift keying Ppt on subconscious mind power Ppt on pop art Ppt on ms-excel functions Ppt on sports day dallas Ppt on condition monitoring training Ppt on different occupations in a hospital