
1 The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space
Roberto Grossi, Giuseppe Ottaviano* (Università di Pisa)
* Part of the work done while at Microsoft Research Cambridge

2 Disclaimer THIS HAS NOTHING TO DO WITH WAVELETS!

3 Indexed String Sequences
Example: (foo, bar, foobar, foo, bar, bar, foo), at positions 0 to 6
Queries (positions and occurrence indices are 0-based)
– Access(i): access the i-th element. Access(2) = foobar
– Rank(s, pos): count occurrences of s before pos. Rank(bar, 5) = 2
– Select(s, i): find the i-th occurrence of s. Select(foo, 2) = 6
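
As a reference point for the three queries above, here is a minimal, deliberately naive sketch over a plain Python list. It is not the compressed structure the talk describes; the function names are illustrative, and the 0-based convention is inferred from the slide's examples.

```python
# Naive O(n) reference implementations of Access/Rank/Select.
# Positions and occurrence indices are 0-based, as in the slide's examples.

S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]

def access(seq, i):
    """Return the i-th element."""
    return seq[i]

def rank(seq, s, pos):
    """Count occurrences of s strictly before position pos."""
    return sum(1 for x in seq[:pos] if x == s)

def select(seq, s, i):
    """Return the position of the i-th occurrence of s."""
    seen = 0
    for pos, x in enumerate(seq):
        if x == s:
            if seen == i:
                return pos
            seen += 1
    raise IndexError("s occurs fewer than i + 1 times")

print(access(S, 2), rank(S, "bar", 5), select(S, "foo", 2))
```

Any compressed implementation should return the same answers as these functions, which makes them useful as a test oracle.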

4 Prefix operations
Example: (foo, bar, foobar, foo, bar, bar, foo), at positions 0 to 6
Queries
– RankPrefix(p, pos): count strings prefixed by p before pos. RankPrefix(foo, 5) = 3
– SelectPrefix(p, i): find the i-th string prefixed by p. SelectPrefix(foo, 2) = 3
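
The prefix variants differ from Rank and Select only in the matching condition. A naive sketch, again over a plain list and with illustrative function names:

```python
# Naive O(n) reference implementations of RankPrefix/SelectPrefix (0-based).

S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]

def rank_prefix(seq, p, pos):
    """Count strings with prefix p strictly before position pos."""
    return sum(1 for x in seq[:pos] if x.startswith(p))

def select_prefix(seq, p, i):
    """Return the position of the i-th string with prefix p."""
    seen = 0
    for pos, x in enumerate(seq):
        if x.startswith(p):
            if seen == i:
                return pos
            seen += 1
    raise IndexError("fewer than i + 1 strings have prefix p")

print(rank_prefix(S, "foo", 5), select_prefix(S, "foo", 2))
```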

5 Example: storing relations
Write the columns as string sequences
– Store them separately
– Reduce relational operations to sequence queries
Example table, rows 0 to 6
– User: Leonard, Penny, Sheldon, Penny, Leonard, Sheldon
– Likes URL: battle.net/wow/, tmz.com, battle.net/wow/, thecheesecakefactory.com, wikipedia.org/Star_Trek, wikipedia.org/String_theory, marvel.com
Example queries
– What does Sheldon like?
– Who likes pages from domain wikipedia.org?
Other operations: range counting, …
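
A sketch of how the two example questions reduce to sequence queries on separately stored columns. The row pairing of the original table is not recoverable from this transcript, so the four rows below are hypothetical; the helper names are also illustrative, and the queries are implemented naively.

```python
# Two columns stored as separate sequences; row i is (USERS[i], URLS[i]).
# The rows are HYPOTHETICAL: only the names and URLs come from the slide.
USERS = ["Sheldon", "Penny", "Leonard", "Sheldon"]
URLS = ["wikipedia.org/String_theory", "tmz.com",
        "wikipedia.org/Star_Trek", "marvel.com"]

def _rank(seq, matches, pos):
    return sum(1 for x in seq[:pos] if matches(x))

def _select(seq, matches, i):
    seen = 0
    for pos, x in enumerate(seq):
        if matches(x):
            if seen == i:
                return pos
            seen += 1
    raise IndexError("fewer than i + 1 matches")

def likes_of(user):
    """'What does <user> like?': Select on the user column, Access on the URL column."""
    is_user = lambda x: x == user
    count = _rank(USERS, is_user, len(USERS))
    return [URLS[_select(USERS, is_user, i)] for i in range(count)]

def who_likes_domain(prefix):
    """'Who likes pages from <domain>?': SelectPrefix on URLs, Access on users."""
    has_prefix = lambda x: x.startswith(prefix)
    count = _rank(URLS, has_prefix, len(URLS))
    return [USERS[_select(URLS, has_prefix, i)] for i in range(count)]

print(likes_of("Sheldon"))
print(who_likes_domain("wikipedia.org/"))
```

The trailing slash in the prefix keeps the match on the domain boundary.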

6 Dynamic sequences
We want to support the following operations:
– Insert(s, pos): insert the string s immediately before position pos
– Append(s): append the string s at the end of the sequence (special case of Insert)
– Delete(pos): delete the string at position pos
If a data structure supports only Append, we call it append-only; otherwise dynamic (or fully dynamic)
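
The semantics of the three update operations can be pinned down with a list-backed sketch. The class name is made up for illustration, and the list is only a stand-in for the compressed structure.

```python
class NaiveDynamicSequence:
    """List-backed stand-in showing the update semantics only."""

    def __init__(self):
        self.items = []

    def insert(self, s, pos):
        """Insert s immediately before position pos."""
        if not 0 <= pos <= len(self.items):
            raise IndexError("pos out of range")
        self.items.insert(pos, s)

    def append(self, s):
        """Append is the special case Insert(s, n)."""
        self.insert(s, len(self.items))

    def delete(self, pos):
        """Delete the string at position pos."""
        del self.items[pos]

seq = NaiveDynamicSequence()
for s in ["foo", "bar", "foo"]:
    seq.append(s)
seq.insert("foobar", 2)   # foo, bar, foobar, foo
seq.delete(0)             # bar, foobar, foo
print(seq.items)
```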

7 Requirements
Store the sequence in as little space as possible
– Close to the information-theoretic lower bound
But still be able to support all the described operations (query and update) efficiently
– Aim for worst-case polylogarithmic time per operation

8 Some notation
Example: (foo, bar, foobar, foo, bar, bar, foo)
Sequence S, |S| = n
– In the example, n = 7
String set S_set is the unordered set of distinct strings appearing in S
– In the example, S_set = {foo, bar, foobar}, |S_set| = 3
– Also called the alphabet
Sequence symbols can also be integers, characters, …
– As long as they are binarized to strings
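
One possible binarization, shown as a sketch. The slides do not specify an encoding: the UTF-8 bytes plus a NUL terminator used here are an assumption, chosen so that no encoded string is a prefix of another (foo is a prefix of foobar before encoding), and they presuppose that the input strings contain no NUL character.

```python
S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]

n = len(S)        # |S|
S_set = set(S)    # distinct strings, the "alphabet"

def binarize(s):
    """Encode s as a bit string: UTF-8 bytes followed by a NUL terminator.

    The terminator is an assumed convention (not from the slides) that makes
    the encoded set prefix-free, provided no input contains a NUL character.
    """
    data = s.encode("utf-8") + b"\x00"
    return "".join(format(byte, "08b") for byte in data)

encoded = {s: binarize(s) for s in S_set}
print(n, len(S_set), encoded["foo"])
```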

9 Wavelet Trees
Introduced in 2003 to represent Compressed Suffix Arrays
Support Access/Rank/Select on sequences over a finite alphabet (of integers)
– Reduce these to operations on bitvectors by recursively partitioning the alphabet
String sequences can be reduced to integer sequences

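10 Wavelet Trees
S = (a, b, r, a, c, a, d, a, b, r, a), S_set = {a, b, c, d, r}
Figure: wavelet tree of abracadabra
– Root: abracadabra, bitvector 00101010010, splits {a, b} (bit 0) from {c, d, r} (bit 1)
– Left child: abaaaba, bitvector 0100010, splits a (bit 0) from b (bit 1)
– Right child: rcdr, bitvector 1011, splits c (bit 0) from {d, r} (bit 1)
– Right child of the right child: rdr, bitvector 101, splits d (bit 0) from r (bit 1)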
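
A small sketch of a pointer-based wavelet tree that reproduces the bitvectors in the abracadabra example. Bitvectors are plain Python lists with linear-time rank, so it illustrates the recursion only, not the compressed representation or the stated time bounds. Splitting the sorted alphabet at `len // 2` is an assumption that happens to match the partition shown on the slide.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits = None                      # None marks a leaf
        if len(self.alphabet) <= 1:
            return
        mid = len(self.alphabet) // 2         # assumed split rule
        self.left_syms = set(self.alphabet[:mid])
        self.bits = [0 if c in self.left_syms else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in self.left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.left_syms],
                                 self.alphabet[mid:])

    def access(self, i):
        """Return the i-th symbol (0-based)."""
        if self.bits is None:
            return self.alphabet[0]
        b = self.bits[i]
        j = self.bits[:i].count(b)            # position inside the child
        return (self.right if b else self.left).access(j)

    def rank(self, c, pos):
        """Count occurrences of c strictly before position pos."""
        if self.bits is None:
            return pos if self.alphabet and c == self.alphabet[0] else 0
        b = 0 if c in self.left_syms else 1
        j = self.bits[:pos].count(b)
        return (self.right if b else self.left).rank(c, j)

def bitstring(node):
    return "".join(map(str, node.bits))

text = "abracadabra"
wt = WaveletTree(text)
print(bitstring(wt), bitstring(wt.left), bitstring(wt.right),
      bitstring(wt.right.right))
```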

11 Wavelet Trees
Space equal to the entropy of the sequence
– Plus negligible terms
Supports Access/Rank/Select in O(log |S_set|)
Later extended to support Insert/Delete…
– … but the tree structure is fixed a priori
– The string set S_set cannot be changed!
– Unrealistic restriction in many database applications

12 The Wavelet Trie
The Wavelet Trie is a Wavelet Tree on sequences of binary strings (S_set ⊂ {0, 1}*)
Supports Access/Rank(Prefix)/Select(Prefix)
Fully dynamic…
– … or append-only (with better bounds)
The string set need not be known in advance

13 Wavelet Trie: Construction
Sequence of binary strings: (010111, 0100110, 0100001, 0101010, 0100110, 010111, 010110)
– Common prefix α: 010
– Branching bit β (the bit following α in each string): 1001011
– Remaining suffixes of the strings with branching bit 0: 110, 001, 110
– Remaining suffixes of the strings with branching bit 1: 11, 010, 11, 10

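14 Wavelet Trie: Construction
Sequence: (010111, 0100110, 0100001, 0101010, 0100110, 010111, 010110)
Resulting trie (internal nodes store α and β, leaves store α only)
– Root: α: 010, β: 1001011
  – 0-child: α: ε, β: 101
    – 0-child (leaf): α: 01
    – 1-child (leaf): α: 10
  – 1-child: α: ε, β: 1011
    – 0-child (leaf): α: 10
    – 1-child: α: ε, β: 110
      – 0-child (leaf): α: ε
      – 1-child (leaf): α: ε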
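
The recursive construction can be sketched in a few lines. This is a static, uncompressed illustration: β is kept as a Python string rather than a succinct bitvector, the `Node`, `build`, and `dump` names are illustrative, and the input set is assumed to be prefix-free (no string is a proper prefix of another), which the example satisfies.

```python
import os

class Node:
    def __init__(self, alpha, beta=None, kids=None):
        self.alpha = alpha   # common prefix of the strings reaching this node
        self.beta = beta     # branching bits, one per string (None in a leaf)
        self.kids = kids     # (0-child, 1-child), or None in a leaf

def build(strings):
    """Build the trie for a sequence of binary strings (assumed prefix-free)."""
    alpha = os.path.commonprefix(strings)
    if all(s == alpha for s in strings):
        return Node(alpha)                       # leaf: all strings identical
    k = len(alpha)
    beta = "".join(s[k] for s in strings)        # bit right after alpha
    kids = tuple(build([s[k + 1:] for s in strings if s[k] == b])
                 for b in "01")
    return Node(alpha, beta, kids)

def dump(node, label="root", depth=0):
    """Return an indented listing of the nodes in preorder."""
    beta = "" if node.beta is None else f", beta={node.beta}"
    lines = [f"{'  ' * depth}{label}: alpha={node.alpha or 'ε'}{beta}"]
    for bit, kid in enumerate(node.kids or ()):
        lines += dump(kid, f"{bit}-child", depth + 1)
    return lines

SEQ = ["010111", "0100110", "0100001", "0101010",
       "0100110", "010111", "010110"]
root = build(SEQ)
print("\n".join(dump(root)))
```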

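15 Wavelet Trie: Access
Same sequence and trie as on slide 14, positions 0 to 6
Access(5) = 010111, obtained by concatenating the α and branching bits met along the path from the root
– Root (α: 010, β: 1001011): emit 010, then branching bit 1
– 1-child (α: ε, β: 1011): emit branching bit 1
– Its 1-child (α: ε, β: 110): emit branching bit 1, reaching the leaf α: ε
Rank is similar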

16 Wavelet Trie: Select
Same sequence and trie as on slide 14, positions 0 to 6
Select(0100110, 1) = 4
– The string 0100110 follows the path root (α: 010, β: 1001011), 0-child (α: ε, β: 101), leaf α: 10
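
A sketch of Access, Rank, and Select on the example trie, under the same assumptions as the construction: β is a plain Python string, bit-level rank and select are linear scans, and the string set is prefix-free. The function names are illustrative. Access walks down translating the position with a rank on each β; Select finds the leaf first and translates the index back up with a select on each β.

```python
import os

class Node:
    def __init__(self, alpha, beta=None, kids=None):
        self.alpha, self.beta, self.kids = alpha, beta, kids

def build(strings):
    alpha = os.path.commonprefix(strings)
    if all(s == alpha for s in strings):
        return Node(alpha)
    k = len(alpha)
    beta = "".join(s[k] for s in strings)
    kids = tuple(build([s[k + 1:] for s in strings if s[k] == b]) for b in "01")
    return Node(alpha, beta, kids)

def access(node, i):
    """Rebuild the i-th string from the alphas and branching bits on its path."""
    out = node.alpha
    while node.beta is not None:
        b = node.beta[i]
        i = node.beta[:i].count(b)        # position inside the chosen child
        node = node.kids[int(b)]
        out += b + node.alpha
    return out

def rank(node, s, pos):
    """Count occurrences of s strictly before position pos."""
    while s.startswith(node.alpha):
        s = s[len(node.alpha):]
        if node.beta is None:
            return pos if s == "" else 0
        if s == "":
            return 0
        b, s = s[0], s[1:]
        pos = node.beta[:pos].count(b)
        node = node.kids[int(b)]
    return 0                              # s left the trie: it never occurs

def select_bit(bits, b, j):
    """Position of the j-th (0-based) occurrence of bit b in bits."""
    pos = -1
    for _ in range(j + 1):
        pos = bits.index(b, pos + 1)      # ValueError if there are too few
    return pos

def select(node, s, i):
    """Position of the i-th (0-based) occurrence of s."""
    if not s.startswith(node.alpha):
        raise KeyError(s)
    rest = s[len(node.alpha):]
    if node.beta is None:
        if rest:
            raise KeyError(s)
        return i
    if not rest:
        raise KeyError(s)
    j = select(node.kids[int(rest[0])], rest[1:], i)
    return select_bit(node.beta, rest[0], j)

SEQ = ["010111", "0100110", "0100001", "0101010",
       "0100110", "010111", "010110"]
root = build(SEQ)
print(access(root, 5), rank(root, "0100110", 5), select(root, "0100110", 1))
```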

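17 Wavelet Trie: Append
Append(010010) to the trie of slide 14
– The new string matches the root prefix 010, then follows branching bits 0 and 1 to the leaf α: 10, where it diverges
– That leaf is split into an internal node α: ε, β: 110 (11 for the two existing strings, 0 for the new one)
– Its children are the leaves α: ε (0-child, the new string) and α: 0 (1-child, the existing strings)
Insert/Delete are similar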
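
Append can be sketched as a single root-to-leaf walk that extends each β on the way and splits a node where the new string diverges from its α. The sketch below builds the example trie purely by appending, which also illustrates that the string set need not be known in advance. It is an uncompressed illustration under stated assumptions: β is a Python string, each node stores a count `n` of the strings passing through it (a simplification introduced here so that a split can initialize β), names are illustrative, and the set is assumed prefix-free.

```python
import os

class Node:
    def __init__(self, alpha, n):
        self.alpha, self.n = alpha, n     # n: strings passing through the node
        self.beta, self.kids = None, None

def split(node, k, s):
    """Split node where s diverges from node.alpha at offset k (in place)."""
    old_bit, new_bit = node.alpha[k], s[k]
    old = Node(node.alpha[k + 1:], node.n)       # keeps the existing subtree
    old.beta, old.kids = node.beta, node.kids
    new = Node(s[k + 1:], 1)                     # leaf for the new string
    node.alpha = node.alpha[:k]
    node.beta = old_bit * node.n + new_bit       # initialized to 0^n or 1^n
    node.kids = (old, new) if old_bit == "0" else (new, old)
    node.n += 1

def append(root, s):
    """Append binary string s; returns the (possibly new) root."""
    if root is None:
        return Node(s, 1)
    node = root
    while True:
        k = len(os.path.commonprefix([node.alpha, s]))
        if k < len(node.alpha):
            if k == len(s):
                raise ValueError("string set would not be prefix-free")
            split(node, k, s)
            return root
        s = s[k:]
        if (node.beta is None) != (s == ""):
            raise ValueError("string set would not be prefix-free")
        node.n += 1
        if node.beta is None:
            return root                          # existing string, leaf reached
        b, s = s[0], s[1:]
        node.beta += b
        node = node.kids[int(b)]

SEQ = ["010111", "0100110", "0100001", "0101010",
       "0100110", "010111", "010110"]
root = None
for s in SEQ:
    root = append(root, s)
beta_before = root.beta
root = append(root, "010010")
touched = root.kids[0].kids[1]                   # the leaf that was split
print(beta_before, root.beta, touched.beta)
```

A split initializes β to a run of identical bits followed by the new bit, which is why the slides on time complexity ask for dynamic bitvectors that support initialization to 0^n or 1^n.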

18 Space analysis
Information-theoretic lower bound
– LB(S) = LT(S_set) + nH_0(S)
– LT is the information-theoretic lower bound for storing a set of strings
Static WT: LB(S) + o(ĥn)
Append-only WT: LB(S) + PT(S_set) + o(ĥn)
– PT(S_set): space taken by the Patricia Trie
Fully dynamic WT: LB(S) + PT(S_set) + O(nH_0(S))
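
To make the nH_0(S) term concrete, the sketch below computes it for the running example, taking H_0 to be the zero-order empirical entropy of the sequence (the standard meaning of the symbol; the slides do not spell it out). The function name is illustrative.

```python
from collections import Counter
from math import log2

def n_h0(seq):
    """n * H_0(S): sum over distinct strings of count * log2(n / count), in bits."""
    n = len(seq)
    return sum(c * log2(n / c) for c in Counter(seq).values())

S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]
print(round(n_h0(S), 2))   # about 10.14 bits for the 7-element example
```

This term pays only for the order of the elements; the cost of the strings themselves is accounted for separately by LT(S_set).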

19 Operations time complexity
Need new dynamic bitvectors that support initialization (create a bitvector 0^n or 1^n)
Static and Append-only Wavelet Trie
– All supported operations in O(|s| + h_s)
– h_s is the number of nodes traversed by string s
Fully dynamic Wavelet Trie
– All supported operations in O(|s| + h_s log n)
– Deletion may take O(|ŝ| + h_s log n), where ŝ is the longest string in the trie

20 Thanks for your attention! Questions?

