Compressed Prefix Sums O’Neil Delpratt Naila Rahman Rajeev Raman.


1 Compressed Prefix Sums. O’Neil Delpratt, Naila Rahman, Rajeev Raman

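2 Prefix Sums problem. Given: a static sequence $x = (x_1, \ldots, x_n)$ of positive integers such that $\sum_{i=1}^{n} x_i = m$. Support: $\mathrm{sum}(i) = \sum_{j=1}^{i} x_j$. Problem: minimise the space for storing $x$ according to some compressibility criteria, while supporting sum rapidly. Trivial solution: explicitly store the sum values; this requires $n \lceil \log_2 m \rceil$ bits and supports sum in O(1) time. A data structure for this problem is a prefix sums data structure (PSDS).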
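The trivial solution on this slide can be sketched as follows. This is a minimal illustration only: the function names and the input sequence are hypothetical, and the sketch stores Python integers rather than packed fixed-width fields.

```python
from itertools import accumulate

def build_explicit_psds(xs):
    """Precompute every prefix sum so that sum(i) is a single array lookup."""
    if any(x <= 0 for x in xs):
        raise ValueError("the problem assumes positive integers")
    return list(accumulate(xs))

def prefix_sum(sums, i):
    """Return x_1 + ... + x_i for 1 <= i <= n (1-based indexing)."""
    return sums[i - 1]

xs = [3, 2, 1, 4]                      # hypothetical input: n = 4, m = 10
sums = build_explicit_psds(xs)
print([prefix_sum(sums, i) for i in range(1, len(xs) + 1)])  # [3, 5, 6, 10]
```

Every query is one lookup, which is the O(1) time mentioned above; the cost is that each of the n stored values needs enough bits to hold a number as large as m.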

3 Motivation: Inverted List. An inverted list stores the locations of a keyword in the main text as a strictly increasing sequence of positive integers. Example (Bible document): Term = Moses: 650, 687, 696.

4 Inverted List. Store the differences between consecutive locations: a significant space saving and a standard technique [Managing Gigabytes, Witten, Moffat, Bell]. We can store the differences in a PSDS, so that $\mathrm{sum}(i)$ => $i$-th location of the keyword. Direct access to individual locations helps to answer conjunction queries.
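A small sketch of the difference technique, using the Moses locations from the previous slide. The function names are hypothetical, and the linear-time `location` stands in for whatever PSDS would actually answer the query.

```python
def to_gaps(locations):
    """Turn strictly increasing positive locations into positive differences."""
    gaps, previous = [], 0
    for loc in locations:
        if loc <= previous:
            raise ValueError("locations must be positive and strictly increasing")
        gaps.append(loc - previous)
        previous = loc
    return gaps

def location(gaps, i):
    """i-th location (1-based) is the sum of the first i gaps.
    A real PSDS would answer this without the linear scan used here."""
    return sum(gaps[:i])

moses = [650, 687, 696]
gaps = to_gaps(moses)
print(gaps)               # [650, 37, 9]
print(location(gaps, 3))  # 696
```

The gaps are much smaller than the locations themselves, which is where the space saving comes from once they are written with a variable-length code.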

5 String Collection. A collection of non-empty strings $s_1, \ldots, s_n$. Store $x_i = |s_i|$ in a PSDS. Concatenate the strings and store the concatenated string either in an array or in a compressed self-index, e.g. FM-index or CSA. To get the $i$-th string: Offset = $\mathrm{sum}(i-1)$, Length = $x_i$. Example: text string "selmanjava3d programming2000", offsets 0, 6, 24.
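The offset and length computation can be illustrated with a short sketch. It assumes the three strings of the slide's example, and it keeps the prefix sums in a plain list where a compressed PSDS would go; the class name is hypothetical.

```python
from itertools import accumulate

class StringCollection:
    """Concatenated strings plus prefix sums of their lengths (illustrative only)."""

    def __init__(self, strings):
        if any(len(s) == 0 for s in strings):
            raise ValueError("strings must be non-empty")
        self.text = "".join(strings)
        # sums[i] = total length of the first i strings, with sums[0] = 0
        self.sums = [0] + list(accumulate(len(s) for s in strings))

    def get(self, i):
        """Return the i-th string (1-based) from its offset and length."""
        offset = self.sums[i - 1]
        length = self.sums[i] - offset
        return self.text[offset:offset + length]

coll = StringCollection(["selman", "java3d programming", "2000"])
print(coll.sums[:-1])  # offsets: [0, 6, 24]
print(coll.get(2))     # java3d programming
```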

6 URLs. Web search engines keep large databases of URLs. URLs are strings, about 60 characters long on average, and compress fairly well. An explicit pointer for each URL requires 64 bits. Example URLs: www.cs.le.ac.uk, www.cs.le.ac.uk/people/ond1, www.le.ac.uk/library, www.star.ac.uk

7 XML Documents. [Figure: an XML document for a book with title, author and year elements, drawn as a tree of numbered nodes under #doc; its text nodes are "Java3d programming", "selman", "2000" and the whitespace strings "[cr][sp]".] Text nodes are 10-12 characters long on average and compress to 3-4 bytes on average. A naive representation adds a 32-bit pointer overhead for each string.

8 Related Work
[CJM]: Clark, thesis; Jacobson, FOCS 89; Clark, Munro, SODA 96
[Geary et al.]: Geary, Raman, Rahman, CPM 04, TCS 06
[Kim et al.]: Kim, Chae Na, Kim, Park, WEA 05
[Gupta et al. (a)]: Gupta, Hon, Shah, Vitter, DCC 06
[Gupta et al. (b)]: Gupta, Hon, Shah, Vitter, WEA 06
[GV]: Grossi, Vitter, STOC 00, SICOMP 05
[MG]: Witten, Moffat, Bell, Managing Gigabytes

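9 Bitvector Representation. Write each $x_i$ in unary, e.g. 4 is "0001", and concatenate the codes into a bit-vector B with |B| = m bits, e.g. B: 0 0 1 0 1 1 0 0 0 1. Then $\mathrm{sum}(i) = \mathrm{select}_1(B, i)$, the position of the $i$-th 1 bit in B [CJM, Kim et al.]. Select space usage: $m + o(m)$ bits. Time: O(1).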
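A minimal sketch of the unary bit-vector idea. The input 3, 2, 1, 4 is an assumption, chosen because its unary codes concatenate to the bit-vector shown on the slide; `select1` is a plain linear scan, not the constant-time structures of [CJM] or [Kim et al.].

```python
def unary_bitvector(xs):
    """Write each x as (x - 1) zeros followed by a one, e.g. 4 -> '0001'."""
    return "".join("0" * (x - 1) + "1" for x in xs)

def select1(bits, i):
    """1-based position of the i-th 1 bit (linear scan for clarity)."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError("fewer than i one-bits")

xs = [3, 2, 1, 4]                 # assumed input matching the slide's B
B = unary_bitvector(xs)
print(B)                          # 0010110001
print([select1(B, i) for i in range(1, 5)])  # [3, 5, 6, 10]
```

The positions returned by select are exactly the prefix sums, because each unary code occupies x_i bits and ends in its only 1.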

10 Succinct Bound. Given that the number of 1s in B is n, there are $L = \binom{m}{n}$ different bit sequences. Lower bound to store all L sequences: $\lceil \log_2 L \rceil$ bits. This space usage depends only on the average value $m/n$. Could we do better?
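The counting argument can be checked numerically. This sketch assumes the count of m-bit sequences with n ones is the binomial coefficient, as the slide's wording suggests, and compares it with the explicit representation for a hypothetical m = 10, n = 4.

```python
from math import comb

def succinct_bound_bits(m, n):
    """Bits needed to distinguish all m-bit sequences containing exactly n ones."""
    count = comb(m, n)
    # (count - 1).bit_length() equals ceil(log2(count)) for count >= 1
    return (count - 1).bit_length()

def explicit_bits(m, n):
    """Bits used when each of n prefix sums is stored in ceil(log2 m) bits."""
    return n * (m - 1).bit_length()

print(succinct_bound_bits(10, 4))  # 8, since C(10, 4) = 210 <= 2**8
print(explicit_bits(10, 4))        # 16
```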

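11 Data-aware encoding. Exploit a skewed distribution of values by using self-delimiting encodings of the values $x_i$. Gamma code: concatenate $\lfloor \log_2 x_i \rfloor$ in unary and $x_i$ in binary, using $2\lfloor \log_2 x_i \rfloor + 1$ bits. The $x_i$ add up to $m$, so the average value is $m/n$; the total is then at most $n(2\log_2(m/n) + 1)$ bits.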
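A sketch of one common layout of the gamma code, which matches the "unary then binary" description: the unary part is written as a run of zeros, and the binary part supplies the terminating 1. The exact bit layout used by the authors is not recoverable from the transcript, so treat this convention as an assumption.

```python
def gamma_encode(x):
    """floor(log2 x) zeros, then x in binary: 2*floor(log2 x) + 1 bits in total."""
    if x < 1:
        raise ValueError("gamma codes are defined for positive integers")
    binary = format(x, "b")
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits, start=0):
    """Decode one value beginning at index start; return (value, next index)."""
    zeros = 0
    while bits[start + zeros] == "0":
        zeros += 1
    end = start + 2 * zeros + 1
    return int(bits[start + zeros:end], 2), end

def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into the list of values."""
    values, pos = [], 0
    while pos < len(bits):
        value, pos = gamma_decode(bits, pos)
        values.append(value)
    return values

encoded = "".join(gamma_encode(x) for x in [3, 2, 1, 4])
print(encoded)                    # 011010100100
print(gamma_decode_all(encoded))  # [3, 2, 1, 4]
```

Because each code is self-delimiting, the values can be stored back to back with no pointers, but reaching the i-th value requires decoding everything before it unless extra index structure is added.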

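12 Data-aware encoding: Golomb(b,x). With $q = \lfloor (x-1)/b \rfloor$ and $r = x - qb - 1$, concatenate $q+1$ in unary and $r$ in binary using $\lfloor \log_2 b \rfloor$ or $\lceil \log_2 b \rceil$ bits. Example with b = 3: Golomb(3,9) has q = 2, so q+1 in unary is 001, and r = 2 => 001 11. Best encoding for inverted lists if $b \approx 0.69\, m/n$ [MG]. The measure of [Gupta et al. (b)] is not achievable.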
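The worked example Golomb(3,9) can be reproduced with a short sketch. It assumes the convention in which q is the quotient of x - 1 by b and the remainder is written in truncated binary, which is consistent with the slide's q = 2, r = 2 and output 001 11; the function names are hypothetical.

```python
def unary(k):
    """k >= 1 written as (k - 1) zeros followed by a one, e.g. 3 -> '001'."""
    return "0" * (k - 1) + "1"

def truncated_binary(r, b):
    """Encode 0 <= r < b using floor(log2 b) or ceil(log2 b) bits."""
    if b == 1:
        return ""
    k = (b - 1).bit_length()        # ceil(log2 b)
    short = (1 << k) - b            # how many remainders get the shorter code
    if r < short:
        return format(r, "0{}b".format(k - 1))
    return format(r + short, "0{}b".format(k))

def golomb_encode(b, x):
    """Golomb code of a positive integer x with parameter b."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive")
    q, r = divmod(x - 1, b)
    return unary(q + 1) + truncated_binary(r, b)

print(golomb_encode(3, 9))  # 00111
```

With b = 3 the remainders 0, 1, 2 are written as 0, 10, 11, so small remainders cost one bit and the others two.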

13 Contributions of paper. a) GOLOMB vs. SUCCINCT. b) New select DS. c) Data-aware PSDS, with space (in bits) and time bounds. d) Implementation and experimental evaluation.

14 Succinct vs Golomb. If ... [GV, Elias]

15 Succinct PSDS: simple to do, e.g. [GV, Elias]. Given $x$, compute the prefix sums and write each in binary, e.g. 5 = 00101. Lower half V: the lower-order bits of each prefix sum, here 2 bits each: 11 01 10 10 11 01 00 10. Upper half B: the upper-order bits, e.g. 5 = 00101 => 1, which here are 0,1,1,2,2,4,5,6; their multiplicities over the buckets 0 1 2 3 4 5 6 7 are stored in unary: 1011011001010100. Example: get(B,4) = 10, since the 4th upper value is 2 (010) and the 4th lower value is 10, giving 01010. Space usage: the bits of V plus the bits of B. Time: O(1) using select on B [CJM].
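The two halves on this slide can be rebuilt with a small sketch. The input differences 3, 2, 1, 4, 1, 6, 3, 6 are an assumption, derived by combining the slide's upper values with its 2-bit lower values; the function names are hypothetical and `select1` is a linear scan rather than a constant-time structure.

```python
from itertools import accumulate

def build_succinct_psds(xs, low_bits, num_buckets):
    """Split each prefix sum into low bits (array V) and a bucket number (bit-vector B)."""
    prefix = list(accumulate(xs))
    mask = (1 << low_bits) - 1
    lower = [y & mask for y in prefix]
    counts = [0] * num_buckets
    for y in prefix:
        counts[y >> low_bits] += 1
    # A bucket holding c prefix sums is written as c ones followed by a zero.
    upper_bits = "".join("1" * c + "0" for c in counts)
    return lower, upper_bits

def select1(bits, i):
    """0-based index of the i-th 1 bit (linear scan for clarity)."""
    seen = 0
    for pos, bit in enumerate(bits):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError("fewer than i one-bits")

def prefix_sum(lower, upper_bits, low_bits, i):
    """Recombine the halves: zeros before the i-th 1 give the bucket number."""
    bucket = select1(upper_bits, i) - (i - 1)
    return (bucket << low_bits) | lower[i - 1]

xs = [3, 2, 1, 4, 1, 6, 3, 6]          # assumed input, see lead-in
lower, upper_bits = build_succinct_psds(xs, low_bits=2, num_buckets=8)
print(" ".join(format(v, "02b") for v in lower))  # 11 01 10 10 11 01 00 10
print(upper_bits)                                 # 1011011001010100
print(prefix_sum(lower, upper_bits, 2, 4))        # 10
```

The upper bit-vector has one 1 per value and one 0 per bucket, so it stays short even when the prefix sums themselves are large.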

16 New select DS. $\mathrm{select}_1(B, i)$ = position of the $i$-th 1 bit in a bitstring B of length N. Extracted string & contracted string [Kim et al.]. Remove zero blocks [Geary et al.]. Fast select: every remaining block has at least a single 1 bit. [Figure: bitstring A with its blocks of zeros crossed out, a bit-vector P: 0 0 1 0 0 1 0 0 0, and the resulting string A'.]
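The idea of removing zero blocks can be illustrated with a rough sketch. It assumes fixed-size blocks and a marker list recording which blocks survive; the block size, names and linear scans are all hypothetical, and this is not the authors' data structure.

```python
def contract(bits, block):
    """Drop all-zero blocks; marker[j] is 1 when original block j was kept."""
    blocks = [bits[j:j + block] for j in range(0, len(bits), block)]
    marker = [int("1" in b) for b in blocks]
    kept = [b for b in blocks if "1" in b]
    return marker, kept

def select1(marker, kept, block, i):
    """1-based position in the original string of the i-th 1 bit."""
    remaining = i
    for k, chunk in enumerate(kept):
        ones = chunk.count("1")
        if remaining > ones:
            remaining -= ones
            continue
        # The k-th kept block sits at the index of the (k+1)-th 1 in marker.
        original = [j for j, bit in enumerate(marker) if bit][k]
        offset = [p for p, ch in enumerate(chunk) if ch == "1"][remaining - 1]
        return original * block + offset + 1
    raise IndexError("fewer than i one-bits")

marker, kept = contract("001000000001", block=4)
print(marker, kept)                 # [1, 0, 1] ['0010', '0001']
print(select1(marker, kept, 4, 2))  # 12
```

After contraction every block that select has to inspect contains at least one 1, so long runs of zeros no longer have to be skipped one block at a time.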

17 New select DS. Assume a bitstring (BS) of N bits. Results: select and rank in O(1) time, space N+o(N) bits; supports select1 and select0; partitioned bit-vector [Delpratt, Raman, Rahman, WEA 06]. In practice: joint fastest with CJM.

18 New select DS. Space usage:

            Typical                                Worst-case
            NewDS        CJM          KIM          NewDS    CJM     KIM
Input BS    (1- )N       N            N            N        N       N
Select      (1- )0.94N   (1+ )0.52N   (1+ )0.63N   0.94N    2.77N   1.17N
rank        0.03N        0.5N         0.25N        0.02N    0.5N    0.25N
sum         ~2N                                    ~1.94N   4.27N   2.42N

Reliable space bound. Speed evaluation on orders.xml (time per operation): NewDS = 0.101, CJM = 0.105, KIM = 0.178.

19 Data aware tree PSDS. Results: space usage (in bits) and time bounds, compared with the space and time achieved by [Gupta et al. (b)].

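20 Data aware tree PSDS. Each internal node holds the sum of its two children. Delete the larger child of each node, since it can be recomputed as the parent minus the remaining child, and indicate which nodes were removed: n-1 extra bits. [Figure: tree with root 59, children 36 and 23, and leaves 6, 17, 15, 21, shown before and after deletion.]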
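A sketch of the deletion idea on the slide's numbers. It assumes a complete binary tree with a power-of-two number of leaves, and a leaf order 15, 21, 6, 17 chosen so that the sums 36, 23 and 59 match the figure; names and layout are hypothetical, and the kept values are plain integers rather than compressed codes.

```python
def build_pruned_tree(xs):
    """Return (root, kept, left_removed) for a sum tree over the leaves xs."""
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of leaves")
    tree = [0] * n + list(xs)               # heap layout, leaves at n .. 2n-1
    for k in range(n - 1, 0, -1):
        tree[k] = tree[2 * k] + tree[2 * k + 1]
    kept = [0] * n                          # smaller child of each internal node
    left_removed = [False] * n              # one indicator bit per internal node
    for k in range(1, n):
        left, right = tree[2 * k], tree[2 * k + 1]
        left_removed[k] = left > right
        kept[k] = min(left, right)
    return tree[1], kept, left_removed

def prefix_sum(root, kept, left_removed, i):
    """Sum of the first i leaves, recomputing deleted children on the way down."""
    n = len(kept)
    node, value, size, acc = 1, root, n, 0
    while node < n:
        small = kept[node]
        large = value - small               # the deleted child
        left, right = (large, small) if left_removed[node] else (small, large)
        size //= 2
        if i <= size:
            node, value = 2 * node, left
        else:
            acc, i = acc + left, i - size
            node, value = 2 * node + 1, right
    return acc + value

root, kept, left_removed = build_pruned_tree([15, 21, 6, 17])
print(root, kept[1:], left_removed[1:])   # 59 [23, 15, 6] [True, False, False]
print([prefix_sum(root, kept, left_removed, i) for i in range(1, 5)])  # [15, 36, 42, 59]
```

Only the root and the smaller child of every node are stored, plus the n-1 indicator bits, so large values are never written out twice.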

21 Implementation and Experimental Evaluation. Implemented: Succinct, Explicit-γ and Succinct-γ PSDS, and a Gamma tree PSDS. Removing right child nodes vs. the largest node makes a negligible difference; the tree is slow. Data: lengths of text node strings. Compressibility measures (per node):

File         #Text nodes   Gap    ?      GOLOMB   Succ.
orders.xml   150K          2.56   4.99   4.87     4.71
Xpath.xml    1.7M          3.26   6.41   4.42     4.37

The Succinct measure is close to the GOLOMB measure.

22 Experimental Evaluation: Results. Comparative space usage for data structures. Setup: Linux machine, 8 million random operation calls, 10 repeated runs. Time (μsec per operation):

File         Succinct   Explicit-γ   Succinct-γ
orders.xml   0.101      0.235        0.306
Xpath.xml    0.306      0.453        0.564

Succinct PSDS performed best.

23 Conclusions and future work. Compression of prefix sums is important. Contributions: space-efficient data-aware PSDS; new select DS. The Succinct PSDS was more appropriate in our application. Future improvements: make Succinct-γ more competitive (a single γ-decode is 20x faster than a single select); improvements to the data-aware tree PSDS.

24 Thank you!

