# Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts

Sunho Lee and Kunsoo Park, Seoul National Univ.


Contents

- Introduction
  - Rank/select problem
  - Relations to compressed full-text indices
- Dynamic rank-select structure
- Extensions of the structure
  - For a large-alphabet text
  - For a run-length encoded text

Rank-select problem

For a given text T over an alphabet of size σ, our structures support:

- rank_T(c, i): gives the number of occurrences of character c up to position i in T
- select_T(c, k): gives the position of the k-th c

E.g. T = acabbc

- rank_T(‘a’, 5) = 2
- select_T(‘a’, 2) = 3
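The two queries can be pinned down with a naive reference implementation. This is only a sketch of the definitions, assuming 1-indexed positions as in the slide; the function names `rank` and `select` are illustrative, and each call scans the text in O(n) time rather than using any succinct structure.

```python
def rank(text: str, c: str, i: int) -> int:
    """Count occurrences of c in text[1..i] (1-indexed, inclusive)."""
    return text[:i].count(c)


def select(text: str, c: str, k: int) -> int:
    """Return the 1-indexed position of the k-th c, or -1 if there are fewer than k."""
    pos = -1
    for _ in range(k):
        pos = text.find(c, pos + 1)
        if pos == -1:
            return -1
    return pos + 1


T = "acabbc"
print(rank(T, "a", 5))    # 2
print(select(T, "a", 2))  # 3
```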

Rank-select problem

Our structures support additional update operations:

- insert_T(c, i): inserts character c between T[i] and T[i+1]
- delete_T(i): deletes T[i] from T

E.g. T = acabbc → aababc

- rank_T(‘a’, 5) = 2 → rank_T(‘a’, 5) = 3
- select_T(‘a’, 2) = 3 → select_T(‘a’, 2) = 2
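A list-backed reference version makes the update semantics concrete. The class name `NaiveDynamicText` is hypothetical, and the particular delete-then-insert sequence below is an assumption: the slide shows only the text before and after, and this is one sequence that produces that change. Every operation here costs O(n) time, so this only checks the answers, not the bounds.

```python
class NaiveDynamicText:
    """Reference implementation: every operation costs O(n) time."""

    def __init__(self, text: str) -> None:
        self.chars = list(text)

    def insert(self, c: str, i: int) -> None:
        """Insert c between T[i] and T[i+1] (1-indexed); i = 0 prepends."""
        self.chars.insert(i, c)

    def delete(self, i: int) -> None:
        """Delete T[i] (1-indexed)."""
        del self.chars[i - 1]

    def rank(self, c: str, i: int) -> int:
        return self.chars[:i].count(c)

    def select(self, c: str, k: int) -> int:
        seen = 0
        for pos, ch in enumerate(self.chars, start=1):
            if ch == c:
                seen += 1
                if seen == k:
                    return pos
        return -1

    def __str__(self) -> str:
        return "".join(self.chars)


t = NaiveDynamicText("acabbc")
t.delete(2)        # acabbc -> aabbc
t.insert("a", 3)   # aabbc  -> aababc
print(t, t.rank("a", 5), t.select("a", 2))  # aababc 3 2
```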

Why rank-select problem?

- In compressed full-text indices
  - Rank-select structures are built on the Burrows-Wheeler Transform (BWT)
  - Rank: backward search (Ferragina & Manzini)
  - Select: Psi-function in CSA (Grossi & Vitter)
- Dynamic BWT
  - Index for a collection of texts (Chan, Hon & Lam)
  - Add or remove a text from the collection

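Example of select on BWT

T = mississippi\$

| i | Psi | SA | Suffix |
|---|---|---|---|
| 1 | 6 | 12 | \$ |
| 2 | 1 | 11 | i\$ |
| 3 | 8 | 8 | ippi\$ |
| 4 | 11 | 5 | issippi\$ |
| 5 | 12 | 2 | ississippi\$ |
| 6 | 5 | 1 | mississippi\$ |
| 7 | 2 | 10 | pi\$ |
| 8 | 7 | 9 | ppi\$ |
| 9 | 3 | 7 | sippi\$ |
| 10 | 4 | 4 | sissippi\$ |
| 11 | 9 | 6 | ssippi\$ |
| 12 | 10 | 3 | ssissippi\$ |

Psi function

- Order of the suffix at the next position
- E.g. Psi[4] = 11, the order of ‘ssippi\$’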

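Example of select on BWT

T = mississippi\$

| i | BWT | Psi | SA | Suffix |
|---|---|---|---|---|
| 1 | i | 6 | 12 | \$ |
| 2 | p | 1 | 11 | i\$ |
| 3 | s | 8 | 8 | ippi\$ |
| 4 | s | 11 | 5 | issippi\$ |
| 5 | m | 12 | 2 | ississippi\$ |
| 6 | \$ | 5 | 1 | mississippi\$ |
| 7 | p | 2 | 10 | pi\$ |
| 8 | i | 7 | 9 | ppi\$ |
| 9 | s | 3 | 7 | sippi\$ |
| 10 | s | 4 | 4 | sissippi\$ |
| 11 | i | 9 | 6 | ssippi\$ |
| 12 | i | 10 | 3 | ssissippi\$ |

Psi function

- Order of the suffix at the next position
- E.g. Psi[4] = 11, the order of ‘ssippi\$’

Duality between Psi-function and BWT (Hon, Sadakane & Sung)

- BWT[i] = T[SA[i] - 1]
- Psi[i] = select_BWT(C[i], i - F[C[i]])
  - C[i] = T[SA[i]], the first character of the i-th suffix
  - F[c]: the number of characters x in T with x < c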
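The table and the duality can be checked by brute force. The sketch below assumes 1-indexed positions, a cyclic wrap for the suffix starting at position 1, and a plain sort for the suffix array; the helper `select` and the variable names are illustrative, not part of any library.

```python
T = "mississippi$"
n = len(T)

# Suffix array of 1-indexed start positions ('$' sorts before letters in ASCII).
SA = sorted(range(1, n + 1), key=lambda s: T[s - 1:])
order = {start: i for i, start in enumerate(SA, start=1)}  # inverse of SA

# The lists below are stored 0-indexed but hold the 1-indexed values of the table.
Psi = [order[s % n + 1] for s in SA]   # order of the suffix at the next position
BWT = "".join(T[s - 2] for s in SA)    # T[SA[i] - 1]; s = 1 wraps around to '$'
C = [T[s - 1] for s in SA]             # first character of each suffix
F = {c: sum(1 for x in T if x < c) for c in set(T)}


def select(text: str, c: str, k: int) -> int:
    """1-indexed position of the k-th c in text, or -1 if there is none."""
    pos = -1
    for _ in range(k):
        pos = text.find(c, pos + 1)
        if pos == -1:
            return -1
    return pos + 1


# Check Psi[i] = select_BWT(C[i], i - F[C[i]]) for every i.
for i in range(1, n + 1):
    c = C[i - 1]
    assert Psi[i - 1] == select(BWT, c, i - F[c])

print(BWT)  # ipssm$pissii
print(Psi)  # [6, 1, 8, 11, 12, 5, 2, 7, 3, 4, 9, 10]
```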

Our results

- Dynamic rank-select on texts over a small alphabet (σ < log n)
  - Improves the binary-alphabet version by Makinen & Navarro
  - O(log n) time and n log σ + o(n log σ) bits
- Dynamic rank-select for a large alphabet (σ < n)
  - Uses wavelet trees to extend our small-alphabet structure
  - O(log n log σ / log log n) time and n log σ + o(n log σ) bits
- Application to RLE texts

Static rank-select

Dynamic rank-select

Dynamic rank-select preliminary

- We assume the RAM model with:
  - Word size w = Θ(log n) bits
  - +, -, *, / and bitwise operations in O(1) time
- We process a word-size text of Θ(log n / log σ) characters in O(1) time

Dynamic rank-select preliminary

- Partition of text
  - Blocks of sizes from ½ log n words to 2 log n words
  - Bit vector representation, I
    - Gives block number b and offset r for position i
    - Employs binary rank-select by Makinen & Navarro: O(log n) time and O(n) bits
- E.g.
  - T = babc abab abca
  - I = 1000 1000 1000
  - b = rank_I(‘1’, 10) = 3
  - r = 10 - select_I(‘1’, 3) + 1 = 2
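The mapping from a text position to a (block, offset) pair can be sketched directly from the two formulas above. Here I is held as a Python string of '0'/'1' characters and scanned naively, which is an assumption made for readability; the slides use the dynamic binary rank-select structure of Makinen & Navarro for this step. The names `rank1`, `select1` and `locate` are illustrative.

```python
def rank1(bits: str, i: int) -> int:
    """Number of '1's in bits[1..i] (1-indexed, inclusive)."""
    return bits[:i].count("1")


def select1(bits: str, k: int) -> int:
    """1-indexed position of the k-th '1', or -1 if there is none."""
    pos = -1
    for _ in range(k):
        pos = bits.find("1", pos + 1)
        if pos == -1:
            return -1
    return pos + 1


def locate(I: str, i: int) -> tuple[int, int]:
    """Return (block number b, offset r) of text position i."""
    b = rank1(I, i)
    r = i - select1(I, b) + 1
    return b, r


I = "1000" "1000" "1000"   # one '1' marks the first character of each block
print(locate(I, 10))       # (3, 2)
```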

Dynamic rank-select preliminary

Over-block/in-block operation

- rank_T(c, i) is the sum of:
  - rank-over_T(c, b): the number of c’s before the b-th block
  - rank_{T_b}(c, r): the number of c’s up to position r in the b-th block T_b
- E.g. T = babc abab abca, I = 1000 1000 1000:
  rank_T(‘a’, 10) = rank-over_T(‘a’, 3) + rank_{T_3}(‘a’, 2)
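The decomposition can be illustrated on a text stored as a list of blocks. This is a sketch under simplifying assumptions: the over-block count is recomputed by summing over earlier blocks instead of being read from a dedicated structure, and the name `rank_blocked` is hypothetical.

```python
def rank_blocked(blocks: list[str], c: str, i: int) -> int:
    """rank_T(c, i) computed as rank-over (earlier blocks) + in-block rank."""
    remaining = i
    for b, block in enumerate(blocks, start=1):
        if remaining <= len(block):
            rank_over = sum(blk.count(c) for blk in blocks[:b - 1])
            rank_in = block[:remaining].count(c)
            return rank_over + rank_in
        remaining -= len(block)
    raise IndexError("position i is past the end of the text")


blocks = ["babc", "abab", "abca"]
print(rank_blocked(blocks, "a", 10))  # 4, i.e. 3 before block 3 plus 1 inside it
```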

Dynamic rank-select preliminary

Over-block/in-block operation

- select_T(c, k)
  - select-over_T(c, k): the number of the block containing the k-th c
  - select_{T_b}(c, k’): the offset of the k’-th c in T_b
- Update operation
  - In-block update: changes the text itself
  - Over-block update: changes the statistics of the text

Over-block structures

Sorted character-block pair

- Character-block pair (T[i], b): T[i] is in the b-th block
- E.g. T = babc abab abca gives the pairs
  (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)

Over-block structures

Sorted character-block pair

- Character-block pair (T[i], b): T[i] is in the b-th block
- Sorted pairs: partially non-decreasing (Hon, Sadakane & Sung)
- E.g. T = babc abab abca
  - Pairs: (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)
  - Sorted: (a,1)(a,2)(a,2)(a,3)(a,3) (b,1)(b,1)(b,2)(b,2)(b,3) (c,1)(c,3)

Over-block structures

Differential encoding of sorted pairs

- A bit vector B of O(n) bits
- For each distinct pair:
  - 1s: the difference of block numbers (in unary)
  - 0s: the number of identical pairs (in unary)
- E.g.
  - T = ... babc abab bbbb abcc …
  - … (c,5)(c,8)(c,8) … → … 11111011100 …

Over-block structures

Differential encoding of sorted pairs

- A bit vector B of O(n) bits
- For each distinct pair:
  - 1s: the difference of block numbers (in unary)
  - 0s: the number of identical pairs (in unary)
- E.g.
  - T = babc abab abca
  - B = 10100100 10010010 10110 (one group per character; the middle group is the ‘b’ group)
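Read against the examples, the encoding amounts to writing, for each character and each block in turn, one '1' followed by one '0' per occurrence of that character in the block. The sketch below assumes that reading and returns one string per character group; the function name `encode_over_block` is illustrative.

```python
def encode_over_block(blocks: list[str], alphabet: str) -> list[str]:
    """Build the groups of B: per character, '1' per block then '0' per occurrence."""
    groups = []
    for c in alphabet:
        bits = ["1" + "0" * block.count(c) for block in blocks]
        groups.append("".join(bits))
    return groups


print(encode_over_block(["babc", "abab", "abca"], "abc"))
# ['10100100', '10010010', '10110']
```

Each group therefore contains exactly one '1' per block, which is what lets a block number be turned into a position inside the group.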

Over-block rank-select

rank-over_T(c, b):

- Find the position of the b-th ‘1’ in the group of c
- Count the ‘0’s representing c up to that position

E.g.

- T = babc abab abca
- B = 10100100 10010010 10110
- rank-over_T(‘b’, 3): count the ‘0’s up to the 3rd ‘1’ in the ‘b’ group
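The two steps translate into a single scan over one character's group. This sketch takes the group as a plain string and walks it bit by bit, which is an assumption for clarity; the slides answer the same question with rank-select queries on B. The name `rank_over` is illustrative.

```python
def rank_over(group: str, b: int) -> int:
    """Count the '0's that precede the b-th '1' of a character's group in B."""
    ones = 0
    zeros = 0
    for bit in group:
        if bit == "1":
            ones += 1
            if ones == b:
                return zeros
        else:
            zeros += 1
    raise ValueError("the group has fewer than b blocks")


b_group = "10010010"          # the 'b' group for T = babc abab abca
print(rank_over(b_group, 3))  # 4: four b's occur before the 3rd block
```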

Over-block updates

If the number of blocks is fixed:

- Insert or delete 0s at the b-th block in I and B
- Rank-select remains correct

E.g.

- T = babc abab abca → babc aabaaabb abca
- I = 1000 1000 1000 → 1000 10000000 1000
- B = 10100100 10010010 10110 → 10100000100 100100010 10110
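When no block is split or merged, an insertion into block b only adds one '0' to I and one '0' to the inserted character's group in B. The sketch below assumes both vectors are plain strings and that the new '0' goes right after the b-th '1'; the helper names `insert_zero` and `insert_char` are hypothetical.

```python
def insert_zero(bits: str, k: int) -> str:
    """Insert a '0' right after the k-th '1' of bits."""
    ones = 0
    for pos, bit in enumerate(bits):
        if bit == "1":
            ones += 1
            if ones == k:
                return bits[:pos + 1] + "0" + bits[pos + 1:]
    raise ValueError("bits has fewer than k ones")


def insert_char(I: str, groups: dict[str, str], c: str, b: int):
    """Record the insertion of character c into block b in I and in B."""
    groups = dict(groups)                  # leave the caller's dict untouched
    groups[c] = insert_zero(groups[c], b)
    return insert_zero(I, b), groups


I = "100010001000"
B = {"a": "10100100", "b": "10010010", "c": "10110"}
for ch in "aaab":                          # block 2 grows from abab to aabaaabb
    I, B = insert_char(I, B, ch, 2)
print(I)  # 1000100000001000
print(B)  # {'a': '10100000100', 'b': '100100010', 'c': '10110'}
```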

Over-block updates

If the number of blocks changes:

- Split or merge the b-th block in I and B
- Call O(σ) queries on B → amortized (σ < log n)

E.g.

- T = babc aabaaabb abca → babc aaba aabb abca
- I = 1000 10000000 1000 → 1000 1000 1000 1000
- B = 10100000100 100100010 10110 → 101000100100 10010100010 101110

In-block structures

- We use the same hierarchy as Makinen & Navarro: word, sub-block and block
- Rank/select on a word-size text w
  - Convert w to a bit vector representing the occurrences of c
  - E.g. w = abaacbab, mask = bbbbbbbb (log σ bits per character)
    - w XOR mask = x0xxx0x0 (log σ bits per character, x nonzero) → 01000101 (binary)
  - O(1) time rank-select by tables of o(n) bits
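The masking step can be imitated with ordinary integers. This is a sketch under several assumptions that are not in the slides: characters are packed with 2 bits each using the hypothetical code a=0, b=1, c=2, and the conversion of zero fields to single bits is done with a loop, whereas the slides use precomputed tables to get O(1) time.

```python
BITS = 2                           # assumed width: ceil(log2(3)) bits per character
CODE = {"a": 0, "b": 1, "c": 2}    # assumed character codes


def pack(word: str) -> int:
    """Pack a short text into one integer, first character most significant."""
    x = 0
    for ch in word:
        x = (x << BITS) | CODE[ch]
    return x


def occurrence_bits(word: str, c: str) -> str:
    """Bit vector with a '1' wherever word holds c, found via XOR with a mask of c's."""
    n = len(word)
    x = pack(word) ^ pack(c * n)   # a field becomes 0 exactly where word has c
    field_mask = (1 << BITS) - 1
    bits = []
    for j in range(n):
        field = (x >> (BITS * (n - 1 - j))) & field_mask
        bits.append("1" if field == 0 else "0")
    return "".join(bits)


def rank_in_word(word: str, c: str, r: int) -> int:
    """Number of c's among the first r characters of word."""
    return occurrence_bits(word, c)[:r].count("1")


print(occurrence_bits("abaacbab", "b"))  # 01000101
print(rank_in_word("abaacbab", "b", 6))  # 2
```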

In-block structures

- Linked list over sub-blocks
  - A block contains ½ log n to 2 log n words
  - A sub-block contains √log n words
  - One extra sub-block is a buffer for updates
- Red-black tree over blocks
  - Leaf node: pointer to block, list of sub-blocks
  - Internal node: the number of blocks in its subtree

In-block rank-select

rank_{T_b}(c, r) in O(log n) time:

- Traverse the tree to find the b-th block
- Scan the b-th block of Θ(log n) words

[Figure: tree over blocks; only the labels “abbabc” and “2 2 3 5” survive in the transcript.]

In-block updates

- Update words in the list in O(log n) time
- Process carry characters using the extra space in a block

[Figure: block update; only the labels “abbcbc c” and “2 2 3 5” survive in the transcript.]

In-block updates

- Split or merge a block whose size falls out of range
- Update tree nodes from leaf to root

[Figure: block split, before and after; only the labels “abbcbcacbaba”, “bc”, “2 2 3 5” and “abbcac ba”, “bc”, “2 2 2 4 6” survive in the transcript.]

Extension of our structure

- Dynamic rank-select on plain texts over a large alphabet, σ < n
  - Use k-ary wavelet trees
  - O(log n log σ / log log n) time and n log σ + O(n log σ / log log n) bits
- Application to run-length encoded texts
  - Start from RLFM (Makinen & Navarro)
  - Support dynamic BWT
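To show how a wavelet tree reduces rank over a large alphabet to rank over smaller ones, here is a minimal static sketch. Note the substitutions: it is a binary wavelet tree rather than the k-ary one named in the slide, it stores plain Python lists instead of the dynamic small-alphabet structure, and the class name `WaveletNode` is hypothetical. It assumes the queried character belongs to the alphabet.

```python
class WaveletNode:
    """Binary wavelet tree node over a sorted alphabet (static, uncompressed)."""

    def __init__(self, text, alphabet):
        self.alphabet = list(alphabet)
        self.bits = None                       # leaves store no bit vector
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            self.left_syms = set(self.alphabet[:mid])
            # 0 sends a character to the left child, 1 to the right child.
            self.bits = [0 if ch in self.left_syms else 1 for ch in text]
            self.left = WaveletNode(
                [ch for ch in text if ch in self.left_syms], self.alphabet[:mid])
            self.right = WaveletNode(
                [ch for ch in text if ch not in self.left_syms], self.alphabet[mid:])

    def rank(self, c, i):
        """Number of c's among the first i characters (c must be in the alphabet)."""
        if self.bits is None:
            return i
        bit = 0 if c in self.left_syms else 1
        count = self.bits[:i].count(bit)       # a binary rank at this level
        child = self.left if bit == 0 else self.right
        return child.rank(c, count)


text = "mississippi"
root = WaveletNode(text, sorted(set(text)))
print(root.rank("s", 7))   # 4
print(root.rank("i", 11))  # 4
```

Each level issues one rank on a two-symbol sequence; a k-ary tree would issue one rank on a k-symbol sequence per level and so have fewer levels.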

Application to RLE

- Run-Length Encoding (RLE) of T
  - Characters of runs: text T’
  - Lengths of runs: bit vector L
  - E.g. T = aaabbaacccc → T’ = abac, L = 10010101000
- RLE of BWT (Makinen & Navarro)
  - Run-Length based FM-index
  - The number of runs in BWT(T) ≤ min(n, nH_k) + σ^k
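The split of T into T’ and L can be sketched with the standard library. The function name `rle` is illustrative; each run of length k contributes its character to T’ and a '1' followed by k - 1 '0's to L, matching the example.

```python
from itertools import groupby


def rle(text: str) -> tuple[str, str]:
    """Return (T', L): one character per run, and run lengths in unary."""
    heads = []
    lengths = []
    for ch, run in groupby(text):
        k = len(list(run))
        heads.append(ch)
        lengths.append("1" + "0" * (k - 1))
    return "".join(heads), "".join(lengths)


print(rle("aaabbaacccc"))  # ('abac', '10010101000')
```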

Application to RLE

- Assume rank/select on L and T’
  - Total size of structure: O(n + n’ log σ)
  - Operation time: O(log n + log n log σ / log log n)
- Some additional vectors
  - Sorted length vector: L’
  - Frequency table F’: counts characters in T’
- E.g.
  - T = bb aa bbbb cc aaa, with runs sorted by character: aa aaa bb bbbb cc
  - L = 10 10 1000 10 100 → L’ = 10 100 10 1000 10
  - T’ = babca, F’ = 001 001 01
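The four vectors of the example can be rebuilt as follows. Two points are inferred from the example rather than stated in the slide, so treat them as assumptions: L’ lists the unary run lengths after a stable sort of the runs by character, and F’ writes, for each character in alphabet order, one '0' per run followed by a closing '1'. The function name `build_rle_vectors` is hypothetical.

```python
from itertools import groupby


def build_rle_vectors(text: str) -> tuple[str, str, str, str]:
    """Return (T', L, L', F') for a text, following the slide's example."""
    runs = [(ch, len(list(g))) for ch, g in groupby(text)]

    def unary(k: int) -> str:
        return "1" + "0" * (k - 1)

    t_prime = "".join(ch for ch, _ in runs)
    L = "".join(unary(k) for _, k in runs)
    # sorted() is stable, so runs of the same character keep their text order.
    L_sorted = "".join(unary(k) for _, k in sorted(runs, key=lambda run: run[0]))
    F = "".join("0" * t_prime.count(c) + "1" for c in sorted(set(t_prime)))
    return t_prime, L, L_sorted, F


print(build_rle_vectors("bbaabbbbccaaa"))
# ('babca', '1010100010100', '1010010100010', '00100101')
```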

Conclusion

- Rank-select structures are an essential ingredient of compressed full-text indices
- We propose a dynamic rank-select structure for a small alphabet and its large-alphabet version
- We can apply our structures to indices that use the BWT, such as RLFM and indices for text collections
