Published by Michael Parvin. Modified over 4 years ago.

1
Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts
Sunho Lee and Kunsoo Park
Seoul National Univ.

2
Contents
Introduction
– Rank/select problem
– Relations to compressed full-text indices
Dynamic rank-select structure
Extensions of the structure
– For a large alphabet text
– For a run-length encoded text

3
Rank-select problem
For a given text T over an alphabet of size σ, our structures support:
– rank_T(c, i): the number of occurrences of character c up to position i in T
– select_T(c, k): the position of the k-th occurrence of c in T
E.g. T = acabbc
– rank_T(‘a’, 5) = 2
– select_T(‘a’, 2) = 3
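The two queries can be illustrated with a naive linear-time sketch. This is not the structure proposed in the slides; the function names `rank` and `select` and the convention of returning 0 for a missing occurrence are assumptions made for this example. Positions are 1-indexed, as in the slide.

```python
def rank(text, c, i):
    """Number of occurrences of c in text[1..i] (1-indexed, inclusive)."""
    return text[:i].count(c)


def select(text, c, k):
    """1-indexed position of the k-th occurrence of c (k >= 1), or 0 if absent."""
    pos = -1
    for _ in range(k):
        pos = text.find(c, pos + 1)
        if pos == -1:
            return 0  # assumption: 0 signals "fewer than k occurrences"
    return pos + 1


T = "acabbc"
print(rank(T, "a", 5), select(T, "a", 2))
```

Both calls scan the text, so they take O(n) time; the point of the structures in these slides is to answer the same queries in O(log n) time on a compressed representation.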

4
Rank-select problem
Our structures support additional update operations:
– insert_T(c, i): inserts character c between T[i] and T[i+1]
– delete_T(i): deletes T[i] from T
E.g. T = acabbc → aababc
– rank_T(‘a’, 5) = 2 → rank_T(‘a’, 5) = 3
– select_T(‘a’, 2) = 3 → select_T(‘a’, 2) = 2
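A minimal sketch of the two update operations on a plain Python string, using the slide's 1-indexed positions. The slide does not say which updates turn acabbc into aababc; the sequence below (delete T[2], then insert ‘a’ after position 3) is one possibility chosen for illustration.

```python
def insert(text, c, i):
    """Insert c between text[i] and text[i+1] (1-indexed); i = 0 prepends."""
    return text[:i] + c + text[i:]


def delete(text, i):
    """Delete text[i] (1-indexed)."""
    return text[:i - 1] + text[i:]


T = "acabbc"
T = delete(T, 2)       # "aabbc"
T = insert(T, "a", 3)  # "aababc"
print(T, T[:5].count("a"))
```

On a plain string each update copies the text, which costs O(n); the dynamic structure is meant to avoid exactly this.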

5
Why the rank-select problem?
In compressed full-text indices
– Rank-select structures are built on the Burrows-Wheeler Transform (BWT)
– Rank: backward search (Ferragina & Manzini)
– Select: Psi-function in CSA (Grossi & Vitter)
Dynamic BWT
– Index for a collection of texts (Chan, Hon & Lam)
– Add or remove a text from the collection

6
Example of select on BWT
T = mississippi$

 i  Psi  SA  Suffix
 1    6  12  $
 2    1  11  i$
 3    8   8  ippi$
 4   11   5  issippi$
 5   12   2  ississippi$
 6    5   1  mississippi$
 7    2  10  pi$
 8    7   9  ppi$
 9    3   7  sippi$
10    4   4  sissippi$
11    9   6  ssippi$
12   10   3  ssissippi$

Psi function
– Order of the suffix at the next position
– E.g. Psi[4] = 11, the order of ‘ssippi$’

7
Example of select on BWT
T = mississippi$

 i  BWT  Psi  SA  Suffix
 1   i     6  12  $
 2   p     1  11  i$
 3   s     8   8  ippi$
 4   s    11   5  issippi$
 5   m    12   2  ississippi$
 6   $     5   1  mississippi$
 7   p     2  10  pi$
 8   i     7   9  ppi$
 9   s     3   7  sippi$
10   s     4   4  sissippi$
11   i     9   6  ssippi$
12   i    10   3  ssissippi$

Psi function
– Order of the suffix at the next position
– E.g. Psi[4] = 11, the order of ‘ssippi$’
Duality between Psi-function and BWT (Hon, Sadakane & Sung)
– BWT[i] = T[SA[i] – 1]
– Psi[i] = select_BWT(C[i], i – F[C[i]])
  C[i]: T[SA[i]], the first character of the i-th suffix
  F[c]: the number of characters x in T with x < c
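The duality can be checked on the example text with a short sketch. The suffix array is built by naive sorting, which is only suitable for small inputs, and the helper names `suffix_array` and `select` are assumptions of this example rather than part of the slides. It relies on ‘$’ sorting before the letters, which holds for ASCII.

```python
def suffix_array(t):
    """1-indexed start positions of the suffixes of t in lexicographic order."""
    return sorted(range(1, len(t) + 1), key=lambda s: t[s - 1:])


def select(text, c, k):
    """1-indexed position of the k-th occurrence of c (assumed to exist)."""
    pos = -1
    for _ in range(k):
        pos = text.find(c, pos + 1)
    return pos + 1


T = "mississippi$"
SA = suffix_array(T)

# BWT[i] = T[SA[i] - 1]; for SA[i] = 1 the index wraps around to the final '$'.
BWT = "".join(T[s - 2] for s in SA)

# F[c]: number of characters in T that are smaller than c.
F = {c: sum(1 for x in T if x < c) for c in set(T)}

# Psi[i] = select_BWT(C[i], i - F[C[i]]), with C[i] = T[SA[i]].
PSI = []
for i, s in enumerate(SA, start=1):
    c = T[s - 1]
    PSI.append(select(BWT, c, i - F[c]))

print(BWT)
print(PSI)
```

The resulting list agrees with the Psi column of the table, including Psi[4] = 11.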

8
Our results
Dynamic rank-select on texts over a small alphabet (σ < log n)
– Improves the binary-alphabet version by Makinen & Navarro
– O(log n) time and n log σ + o(n log σ) bits
Dynamic rank-select for a large alphabet (σ < n)
– Uses wavelet trees to extend our small-alphabet structure
– O(log n log σ / loglog n) time and n log σ + o(n log σ) bits
Application to RLE texts

9
Static rank-select

10
Dynamic rank-select

11
Dynamic rank-select preliminary
We assume the RAM model with:
– Word size w = Θ(log n) bits
– +, -, *, / and bitwise operations in O(1) time
We process a word-size text of Θ(log n / log σ) characters in O(1) time

12
Dynamic rank-select preliminary
Partition of text
– Blocks of sizes from ½ log n words to 2 log n words
– Bit vector representation, I
  Gives block number b and offset r for position i
  Employs binary rank-select by Makinen & Navarro: O(log n) time & O(n) bits
E.g.
– T = babc abab abca
– I = 1000 1000 1000
– b = rank_I(‘1’, 10) = 3
– r = 10 - select_I(‘1’, 3) + 1 = 2
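The mapping from a text position to a block number and offset can be sketched as follows. The bit vector I is held as a Python string and scanned naively here; the slides use a dynamic binary rank-select structure instead, and the function name `locate` is an assumption of this example.

```python
def locate(I, i):
    """Map 1-indexed text position i to (block number b, offset r)."""
    b = I[:i].count("1")  # b = rank_I('1', i)
    start = 0
    seen = 0
    for pos, bit in enumerate(I, start=1):  # start = select_I('1', b)
        if bit == "1":
            seen += 1
            if seen == b:
                start = pos
                break
    r = i - start + 1
    return b, r


I = "1000" "1000" "1000"  # T = babc abab abca
print(locate(I, 10))
```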

13
Dynamic rank-select preliminary
Over-block/in-block operation
– rank_T(c, i) = rank-over_T(c, b) + rank_Tb(c, r)
  rank-over_T(c, b): the number of c’s before the b-th block
  rank_Tb(c, r): the number of c’s up to position r in T_b
– E.g. T = babc abab abca, I = 1000 1000 1000:
  rank_T(‘a’, 10) = rank-over_T(‘a’, 3) + rank_T3(‘a’, 2)
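The decomposition of rank into an over-block part and an in-block part can be sketched with the blocks held as a list of strings. Both parts are computed by naive counting here, which only illustrates how the two values combine; the function names are assumptions of this example.

```python
def rank_over(blocks, c, b):
    """Number of c's in the blocks before the b-th block (1-indexed)."""
    return sum(blk.count(c) for blk in blocks[:b - 1])


def rank_in_block(block, c, r):
    """Number of c's up to offset r (1-indexed) inside one block."""
    return block[:r].count(c)


def rank(blocks, c, b, r):
    """rank_T(c, i) for the position i that lies at offset r of block b."""
    return rank_over(blocks, c, b) + rank_in_block(blocks[b - 1], c, r)


blocks = ["babc", "abab", "abca"]
# Position 10 is offset 2 of block 3.
print(rank_over(blocks, "a", 3), rank_in_block(blocks[2], "a", 2), rank(blocks, "a", 3, 2))
```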

14
Dynamic rank-select preliminary
Over-block/in-block operation
– select_T(c, k):
  select-over_T(c, k): the number of the block containing the k-th c
  select_Tb(c, k’): the offset of the k’-th c in T_b
– Update operation
  In-block update: changes the text itself
  Over-block update: changes the statistics of the text
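Select splits the same way: first find the block that holds the k-th c, then find the k’-th c inside that block, where k’ is k minus the number of c’s in earlier blocks. The sketch below uses naive scans and a hypothetical return convention (block number, offset, text position), or None when there are fewer than k occurrences.

```python
def select(blocks, c, k):
    """Return (block number, offset in block, text position) of the k-th c."""
    before = 0  # occurrences of c in earlier blocks
    offset = 0  # total length of earlier blocks
    for b, blk in enumerate(blocks, start=1):
        cnt = blk.count(c)
        if before + cnt >= k:      # select-over: block b holds the k-th c
            k_in = k - before      # k' within the block
            pos = -1
            for _ in range(k_in):  # in-block select by scanning
                pos = blk.find(c, pos + 1)
            return b, pos + 1, offset + pos + 1
        before += cnt
        offset += len(blk)
    return None


blocks = ["babc", "abab", "abca"]
print(select(blocks, "a", 4))
```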

15
Over-block structures
Sorted character-block pair
– Character-block pair (T[i], b): T[i] in the b-th block
E.g. T = babc abab abca
(b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)

16
Over-block structures
Sorted character-block pair
– Character-block pair (T[i], b): T[i] in the b-th block
– Sorted pairs: partially non-decreasing (Hon, Sadakane & Sung)
E.g. T = babc abab abca
Pairs in text order:
(b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)
Sorted pairs:
(a,1)(a,2)(a,2)(a,3)(a,3)
(b,1)(b,1)(b,2)(b,2)(b,3)
(c,1)(c,3)

17
Over-block structures
Differential encoding of sorted pairs
– A bit vector B of O(n) bits
– For each distinct pair:
  1s: the difference from the previous block number, in unary
  0s: one per occurrence of the pair
E.g.
– T = ... babc abab bbbb abcc ...
– ... (c,5)(c,8)(c,8) ... is encoded as ... 11111011100 ...

18
Over-block structures
Differential encoding of sorted pairs
– A bit vector B of O(n) bits
– For each distinct pair:
  1s: the difference from the previous block number, in unary
  0s: one per occurrence of the pair
E.g.
– T = babc abab abca
– B = 10100100 10010010 10110 (groups for ‘a’, ‘b’ and ‘c’; the middle one is the ‘b’ group)
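The encoding can be reproduced with a short sketch. Keeping one bit string per character in a dictionary is a simplification made for this example; the slides concatenate the groups into a single bit vector B. The function name `encode_groups` is hypothetical.

```python
def encode_groups(blocks):
    """Differentially encode the sorted (character, block) pairs, per character."""
    groups = {}
    for c in sorted(set("".join(blocks))):
        bits = []
        prev = 0  # block number of the previous distinct pair for c
        for b, blk in enumerate(blocks, start=1):
            cnt = blk.count(c)
            if cnt:
                # unary block-number difference, then one 0 per occurrence
                bits.append("1" * (b - prev) + "0" * cnt)
                prev = b
        groups[c] = "".join(bits)
    return groups


print(encode_groups(["babc", "abab", "abca"]))
```

Each group contains one 0 per occurrence of its character, and the number of 1s equals the number of the last block in which the character occurs.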

19
Over-block rank-select
rank-over_T(c, b):
– Find the position of the b-th ‘1’ in the group of c
– Count the ‘0’s representing c up to that position
E.g.
– T = babc abab abca
– B = 10100100 10010010 10110
– rank-over_T(‘b’, 3): count the ‘0’s up to the 3rd ‘1’ in the ‘b’ group, which gives 4
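A minimal sketch of the over-block rank on one group of B, assuming the group of c has already been isolated as a string of ‘0’ and ‘1’ characters. It scans the group linearly; the slides answer the same question with rank and select on the bit vector. Returning the total number of 0s when the group has fewer than b 1s is a convention chosen for this example.

```python
def rank_over(group, b):
    """Number of '0's before the b-th '1' in the group of one character."""
    ones = 0
    zeros = 0
    for bit in group:
        if bit == "1":
            ones += 1
            if ones == b:
                return zeros
        else:
            zeros += 1
    # Fewer than b ones: every occurrence lies in an earlier block.
    return zeros


print(rank_over("10010010", 3))  # 'b' group of T = babc abab abca
```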

20
Over-block updates
If the number of blocks is fixed
– Insert or delete 0s at the b-th block in I and B
– Rank-select remains correct
E.g.
– T = babc abab abca → babc aabaaabb abca
– I = 1000 1000 1000 → 1000 10000000 1000
– B = 10100100 10010010 10110 → 10100000100 100100010 10110
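For an insertion into block b, the update amounts to placing one ‘0’ directly after the b-th ‘1’, both in I and in the group of the inserted character. The sketch below works on plain strings with a linear scan; the helper name and the handling of a group with fewer than b 1s (padding with 1s) are assumptions of this example.

```python
def insert_zero_after_kth_one(bits, k):
    """Return bits with one '0' inserted right after the k-th '1'."""
    ones = 0
    for idx, bit in enumerate(bits):
        if bit == "1":
            ones += 1
            if ones == k:
                return bits[:idx + 1] + "0" + bits[idx + 1:]
    # Assumption: the character has not occurred up to block k yet,
    # so extend the unary block counter before adding the occurrence.
    return bits + "1" * (k - ones) + "0"


# Block 2 of T = babc abab abca grows to aabaaabb: three a's and one b are inserted.
I = "100010001000"
a_group, b_group = "10100100", "10010010"
for _ in range(3):
    a_group = insert_zero_after_kth_one(a_group, 2)
b_group = insert_zero_after_kth_one(b_group, 2)
for _ in range(4):
    I = insert_zero_after_kth_one(I, 2)
print(I, a_group, b_group)
```

A deletion removes one ‘0’ at the same place, so the number of 1s, and with it the block numbering, stays unchanged.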

21
Over-block updates
If the number of blocks is changing
– Split or merge the b-th block in I and B
– Call O(σ) queries on B, amortized (σ < log n)
E.g.
– T = babc aabaaabb abca → babc aaba aabb abca
– I = 1000 10000000 1000 → 1000 1000 1000 1000
– B = 10100000100 100100010 10110 → 101000100100 1001010010 101110

22
In-block structures
We use the same hierarchy as Makinen & Navarro: word, sub-block and block
Rank/select on word-size texts w
– Convert w to a bit vector representing the occurrences of c
– E.g. w = abaacbab, c = b, mask = bbbbbbbb (log σ bits per character)
  w XOR mask = x0xxx0x0 (log σ bits per character) → 01000101 (binary)
– O(1) time rank-select by tables of o(n) bits
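The XOR-with-mask step can be sketched with Python integers standing in for machine words. The character codes, the `alphabet` parameter and the final per-field loop are assumptions of this example; the slides replace that loop by table lookups to reach O(1) time.

```python
def occurrence_bits(word, c, alphabet="abc"):
    """Bit string with a '1' wherever word holds the character c."""
    width = max(1, (len(alphabet) - 1).bit_length())  # bits per character
    code = {ch: k for k, ch in enumerate(alphabet)}
    packed = 0  # the word w, packed into fixed-width fields
    mask = 0    # c repeated once per field
    for ch in word:
        packed = (packed << width) | code[ch]
        mask = (mask << width) | code[c]
    x = packed ^ mask  # a field is all zeros exactly where w holds c
    field = (1 << width) - 1
    out = []
    for j in range(len(word) - 1, -1, -1):  # leftmost character first
        out.append("1" if ((x >> (j * width)) & field) == 0 else "0")
    return "".join(out)


bits = occurrence_bits("abaacbab", "b")
print(bits, bits.count("1"))
```

Once the occurrences are in a binary vector, rank and select on the word reduce to counting and locating 1s.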

23
In-block structures
Linked list over sub-blocks
– A block contains ½ log n to 2 log n words
– A sub-block contains √log n words
– One extra sub-block is a buffer for updates
Red-black tree over blocks
– Leaf node: pointer to block, list of sub-blocks
– Internal node: the number of blocks in its subtree

24
In-block rank-select
rank_Tb(c, r) in O(log n) time
– Traverse the tree to find the b-th block
– Scan the b-th block of Θ(log n) words

25
In-block updates
Update words in the list in O(log n) time
Process carry characters using the extra space in a block

26
In-block updates
Split or merge a block whose size falls out of range
Update tree nodes from leaf to root

27
In-block updates
Split or merge a block whose size falls out of range
Update tree nodes from leaf to root

28
Extension of our structure
Dynamic rank-select on plain texts over a large alphabet, σ < n
– Use k-ary wavelet trees
– O(log n log σ / loglog n) time & n log σ + O(n log σ / loglog n) bits
Application to run-length encoded texts
– Start from RLFM (Makinen & Navarro)
– Support dynamic BWT

29
Application to RLE
Run-Length Encoding (RLE) of T
– Characters of runs: text T’
– Lengths of runs: bit vector L
– E.g. T = aaabbaacccc: T’ = abac, L = 10010101000
RLE of BWT (Makinen & Navarro)
– Run-Length based FM-index
– The number of runs in BWT(T) ≤ min(n, nH_k) + σ^k
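The run-length encoding into T’ and L can be sketched with `itertools.groupby`. Each run contributes its character to T’ and a ‘1’ followed by (run length − 1) ‘0’s to L; the function name `rle` is an assumption of this example.

```python
from itertools import groupby


def rle(text):
    """Return (T', L): the run characters and the run-length bit vector."""
    runs = [(ch, len(list(g))) for ch, g in groupby(text)]
    t_prime = "".join(ch for ch, _ in runs)
    L = "".join("1" + "0" * (n - 1) for _, n in runs)
    return t_prime, L


print(rle("aaabbaacccc"))
```

L has one bit per character of T and one ‘1’ per run, so rank_L(‘1’, i) gives the index of the run that covers position i.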

30
Application to RLE
Assume rank/select on L and T’
– Total size of structure: O(n + n’ log σ)
– Operation time: O(log n + log n log σ / loglog n)
Some additional vectors
– Sorted length vector: L’
– Frequency table F’: counts characters in T’
– E.g. T = bb aa bbbb cc aaa, runs sorted by character: aa aaa bb bbbb cc
  L = 10 10 1000 10 100, L’ = 10 100 10 1000 10
  T’ = babca, F’ = 001 001 01
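The additional vectors of the example can be derived as below. The reading of F’ as one ‘0’ per run of a character followed by a terminating ‘1’ is inferred from the example values, not stated in the slides, and the function name `rle_vectors` is hypothetical.

```python
from itertools import groupby


def rle_vectors(text):
    """Return (T', L, L', F') for a run-length encoded text."""
    runs = [(ch, len(list(g))) for ch, g in groupby(text)]
    t_prime = "".join(ch for ch, _ in runs)
    L = "".join("1" + "0" * (n - 1) for _, n in runs)
    # L': run lengths reordered by run character; sorted() is stable,
    # so runs of the same character keep their text order.
    by_char = sorted(runs, key=lambda run: run[0])
    L_sorted = "".join("1" + "0" * (n - 1) for _, n in by_char)
    # F' (assumed encoding): per character, one '0' per run, then a '1'.
    F = "".join("0" * t_prime.count(c) + "1" for c in sorted(set(t_prime)))
    return t_prime, L, L_sorted, F


print(rle_vectors("bbaabbbbccaaa"))
```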

31
Conclusion
Rank-select structures are an essential ingredient of compressed full-text indices
We propose a dynamic rank-select structure for a small alphabet and its large-alphabet version
We can apply our structures to indices that use the BWT, such as RLFM and an index for a collection of texts
