Dynamic Rank-Select Structures with Applications to Run-Length Encoded Texts Sunho Lee and Kunsoo Park Seoul National Univ.

Contents
Introduction
  – Rank/select problem
  – Relations to compressed full-text indices
Dynamic rank-select structure
Extensions of the structure
  – For a large alphabet text
  – For a run-length encoded text

Rank-select problem
For a given text T over an alphabet of size σ, our structures support:
  – rank_T(c, i): gives the number of occurrences of character c up to position i in T
  – select_T(c, k): gives the position of the k-th c in T
E.g. T = acabbc
  – rank_T(‘a’, 5) = 2
  – select_T(‘a’, 2) = 3

Rank-select problem
Our structures support additional update operations:
  – insert_T(c, i): inserts character c between T[i] and T[i+1]
  – delete_T(i): deletes T[i] from T
E.g. T = acabbc → aababc
  – rank_T(‘a’, 5) = 2 → rank_T(‘a’, 5) = 3
  – select_T(‘a’, 2) = 3 → select_T(‘a’, 2) = 2
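The four operations on the two slides above can be pinned down with a naive reference sketch. This is only a linear-time specification of the semantics (1-based positions, as on the slides), not the O(log n) structure the talk proposes; the function names are chosen here for illustration.

```python
def rank(text, c, i):
    """Number of occurrences of c in text[1..i] (1-based, inclusive)."""
    return text[:i].count(c)

def select(text, c, k):
    """1-based position of the k-th occurrence of c, or None if there is none."""
    seen = 0
    for pos, ch in enumerate(text, start=1):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return None

def insert(text, c, i):
    """Insert c between text[i] and text[i+1] (1-based); i = 0 prepends."""
    return text[:i] + c + text[i:]

def delete(text, i):
    """Delete text[i] (1-based)."""
    return text[:i - 1] + text[i:]

t = "acabbc"
print(rank(t, "a", 5), select(t, "a", 2))      # values from the first slide
# One possible update sequence leading to the second text on the slide
# (the slide does not say which updates were applied; this is an assumption).
t2 = insert(delete(t, 2), "a", 3)
print(t2, rank(t2, "a", 5), select(t2, "a", 2))
```

Any dynamic structure for the problem should agree with these functions on every input, which makes them useful as a test oracle.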

Why the rank-select problem?
In compressed full-text indices
  – Rank-select structures are built on the Burrows-Wheeler Transform (BWT)
  – Rank: backward search (Ferragina & Manzini)
  – Select: Psi-function in CSA (Grossi & Vitter)
Dynamic BWT
  – Index for a collection of texts (Chan, Hon & Lam)
  – Add or remove a text from the collection

Example of select on BWT
T = mississippi$
   i  Psi  SA  Suffix
   1    6  12  $
   2    1  11  i$
   3    8   8  ippi$
   4   11   5  issippi$
   5   12   2  ississippi$
   6    5   1  mississippi$
   7    2  10  pi$
   8    7   9  ppi$
   9    3   7  sippi$
  10    4   4  sissippi$
  11    9   6  ssippi$
  12   10   3  ssissippi$
Psi function
  – Order of the suffix at the next position
  – E.g. Psi[4] = 11, the order of ‘ssippi$’

Example of select on BWT
T = mississippi$
   i  BWT  Psi  SA  Suffix
   1  i      6  12  $
   2  p      1  11  i$
   3  s      8   8  ippi$
   4  s     11   5  issippi$
   5  m     12   2  ississippi$
   6  $      5   1  mississippi$
   7  p      2  10  pi$
   8  i      7   9  ppi$
   9  s      3   7  sippi$
  10  s      4   4  sissippi$
  11  i      9   6  ssippi$
  12  i     10   3  ssissippi$
Psi function
  – Order of the suffix at the next position
  – E.g. Psi[4] = 11, the order of ‘ssippi$’
Duality between Psi-function and BWT (Hon, Sadakane & Sung)
  – BWT[i] = T[SA[i] – 1]
  – Psi[i] = select_BWT(C[i], i – F[C[i]])
    C[i]: T[SA[i]], the first character of the i-th suffix
    F[c]: the number of characters x in T with x < c
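The duality on this slide can be checked on the running example with a short sketch. The suffix array is built by naive sorting, which is only reasonable for a toy input, and all function names are illustrative rather than taken from the talk. Positions are 1-based and the text is treated as cyclic, so the suffix after position n is position 1.

```python
def suffix_array(t):
    """1-based suffix array by naive sorting (fine for tiny examples only)."""
    return sorted(range(1, len(t) + 1), key=lambda p: t[p - 1:])

def bwt_and_psi(t):
    n = len(t)
    sa = suffix_array(t)
    order = {p: i for i, p in enumerate(sa, start=1)}   # inverse suffix array
    # BWT[i] = T[SA[i] - 1], wrapping to the last character when SA[i] = 1
    bwt = "".join(t[p - 2] if p > 1 else t[-1] for p in sa)
    # Psi[i] = order of the suffix that starts one position later (cyclic)
    psi = [order[p % n + 1] for p in sa]
    return sa, bwt, psi

def select(s, c, k):
    """1-based position of the k-th c in s (raises ValueError if absent)."""
    pos = -1
    for _ in range(k):
        pos = s.index(c, pos + 1)
    return pos + 1

def psi_from_bwt(t, sa, bwt):
    """Psi[i] = select_BWT(C[i], i - F[C[i]]) as stated on the slide."""
    out = []
    for i, p in enumerate(sa, start=1):
        c = t[p - 1]                         # C[i] = T[SA[i]]
        f = sum(1 for x in t if x < c)       # F[c] = number of characters smaller than c
        out.append(select(bwt, c, i - f))
    return out

sa, bwt, psi = bwt_and_psi("mississippi$")
print(bwt)
print(psi == psi_from_bwt("mississippi$", sa, bwt))
```

The sketch relies on ‘$’ sorting before the letters, which holds for ASCII comparison in Python.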

Our results
Dynamic rank-select on texts over a small alphabet (σ < log n)
  – Improves the binary-alphabet version by Mäkinen & Navarro
  – O(log n) time and n log σ + o(n log σ) bits
Dynamic rank-select for a large alphabet (σ < n)
  – Uses wavelet trees to extend our small-alphabet structure
  – O(log n log σ / log log n) time and n log σ + o(n log σ) bits
Application to RLE texts

Static rank-select

Dynamic rank-select

Dynamic rank-select preliminary
We assume the RAM model with:
  – Word size w = Θ(log n) bits
  – +, -, *, / and bitwise operations in O(1) time
We process a word-size text of Θ(log n / log σ) characters in O(1) time

Dynamic rank-select preliminary
Partition of text
  – Blocks of sizes from ½ log n words to 2 log n words
  – Bit vector representation, I
    Gives block number b and offset r for position i
    Employs binary rank-select by Mäkinen & Navarro: O(log n) time & O(n) bits
E.g.
  – T = babc abab abca, I = 1000 1000 1000 (a 1 marks the first position of each block)
  – b = rank_I(‘1’, 10) = 3
  – r = 10 - select_I(‘1’, 3) + 1 = 2

Dynamic rank-select preliminary
Over-block/in-block operation
  – rank_T(c, i) = rank-over_T(c, b) + rank_Tb(c, r)
    rank-over_T(c, b): the number of c’s before the b-th block
    rank_Tb(c, r): the number of c’s up to position r in T_b
  – E.g. T = babc abab abca: rank_T(‘a’, 10) = rank-over_T(‘a’, 3) + rank_T3(‘a’, 2) = 3 + 1 = 4
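The decomposition of rank into an over-block part and an in-block part can be sketched as follows. The bit vector I is assumed here to hold a 1 at the first position of every block, which is what the values b = 3 and r = 2 in the example imply; the counting is done naively on strings, whereas the talk uses dynamic structures for both parts.

```python
def rank1(bits, i):
    """Number of 1s in bits[1..i] (1-based, inclusive)."""
    return bits[:i].count("1")

def select1(bits, k):
    """1-based position of the k-th 1 in bits."""
    pos = -1
    for _ in range(k):
        pos = bits.index("1", pos + 1)
    return pos + 1

def locate(I, i):
    """Block number b and in-block offset r of text position i."""
    b = rank1(I, i)
    r = i - select1(I, b) + 1
    return b, r

def rank_decomposed(T, I, c, i):
    """rank_T(c, i) = rank-over_T(c, b) + rank_Tb(c, r)."""
    b, r = locate(I, i)
    start = select1(I, b)                          # 1-based start of block b
    rank_over = T[:start - 1].count(c)             # c's before block b
    rank_in = T[start - 1:start - 1 + r].count(c)  # c's in the first r characters of block b
    return rank_over + rank_in

T = "babcabababca"
I = "100010001000"   # assumed marking: 1 at the start of each block
print(locate(I, 10), rank_decomposed(T, I, "a", 10))
```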

Dynamic rank-select preliminary
Over-block/in-block operation
  – select_T(c, k):
    select-over_T(c, k): the number of the block containing the k-th c
    select_Tb(c, k’): the offset of the k’-th c in T_b
  – Update operation
    In-block update: changes the text itself
    Over-block update: changes the statistics of the text

Over-block structures
Sorted character-block pair
  – Character-block pair (T[i], b): T[i] is in the b-th block
E.g. T = babc abab abca
  (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)

Over-block structures
Sorted character-block pair
  – Character-block pair (T[i], b): T[i] is in the b-th block
  – Sorted pairs: partially non-decreasing (Hon, Sadakane & Sung)
E.g. T = babc abab abca
  (b,1)(a,1)(b,1)(c,1) (a,2)(b,2)(a,2)(b,2) (a,3)(b,3)(c,3)(a,3)
  → (a,1)(a,2)(a,2)(a,3)(a,3) (b,1)(b,1)(b,2)(b,2)(b,3) (c,1)(c,3)
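A small sketch of the sorted character-block pairs, and of how over-block rank and select can be read off them. It works on the sorted list directly with binary search and does not model the bit-vector encoding the talk builds on top of the pairs; the helper names are illustrative.

```python
from bisect import bisect_left

def char_block_pairs(blocks):
    """One (character, block number) pair per text position, blocks numbered from 1."""
    return [(ch, b) for b, block in enumerate(blocks, start=1) for ch in block]

def rank_over(sorted_pairs, c, b):
    """Number of c's in blocks strictly before block b."""
    first = bisect_left(sorted_pairs, (c, 0))     # start of the group of c
    return bisect_left(sorted_pairs, (c, b)) - first

def select_over(sorted_pairs, c, k):
    """Number of the block containing the k-th c (assumes the k-th c exists)."""
    first = bisect_left(sorted_pairs, (c, 0))
    return sorted_pairs[first + k - 1][1]

pairs = char_block_pairs(["babc", "abab", "abca"])
sorted_pairs = sorted(pairs)                      # by character, then by block
print(sorted_pairs)
print(rank_over(sorted_pairs, "a", 3), select_over(sorted_pairs, "a", 4))
```

Within each character group the block numbers are non-decreasing, which is the property the differential encoding on the following slides relies on.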

Over-block structures
Differential encoding of sorted pairs
  – A bit vector B of O(n) bits
  – For each distinct pair:
    1s: the difference in block number
    0s: the number of identical pairs
E.g.
  – T = ... babc abab bbbb abcc …
  – … (c,5)(c,8)(c,8) … → … (bits of B not preserved in this transcript) …

Over-block structures
Differential encoding of sorted pairs
  – A bit vector B of O(n) bits
  – For each distinct pair:
    1s: the difference in block number
    0s: the number of identical pairs
E.g.
  – T = babc abab abca
  – B = (bits not preserved in this transcript; the slide highlights the ‘b’ group)

Over-block rank-select
rank-over_T(c, b):
  – Find the position of the b-th ‘1’ in the group of c
  – Count the ‘0’s representing c up to that position
E.g.
  – T = babc abab abca
  – B = (bits not preserved in this transcript)
  – rank-over_T(‘b’, 3): count the ‘0’s up to the 3rd ‘1’ in the ‘b’ group

Over-block updates
If the number of blocks is fixed
  – Insert or delete 0s at the b-th block in I and B
  – Rank-select remains correct
E.g.
  – T = babc abab abca → babc aabaaabb abca
  – I and B before and after the update: (bits not preserved in this transcript)

Over-block updates
If the number of blocks is changing
  – Split or merge the b-th block in I and B
  – Call O(σ) queries on B, amortized (σ < log n)
E.g.
  – T = babc aabaaabb abca → babc aaba aabb abca
  – I and B before and after the update: (bits not preserved in this transcript)

In-block structures
We use the same hierarchy as Mäkinen & Navarro: word, sub-block and block
Rank/select on a word-size text w
  – Convert w to a bit vector representing the occurrences of c
  – E.g. w = abaacbab, mask = bbbbbbbb (log σ)
    w XOR mask = x0xxx0x0 (log σ) → (2), where 0 marks a position holding ‘b’ and x is any nonzero value
  – O(1)-time rank-select using lookup tables of o(n) bits
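The XOR-with-mask step can be illustrated on packed characters. In this sketch a Python integer stands in for a machine word, each character occupies a fixed-width field (2 bits for the alphabet {a, b, c}), and the zero fields are found with a loop; the talk instead answers rank and select on the word in O(1) time with precomputed tables. The packing layout and function names are assumptions made for the example.

```python
def pack(word, alphabet, bits):
    """Pack characters into an int, first character in the lowest-order field."""
    x = 0
    for j, ch in enumerate(word):
        x |= alphabet.index(ch) << (j * bits)
    return x

def occurrence_bits(word, c, alphabet="abc", bits=2):
    """Bit list with a 1 exactly where word holds c, via XOR against a mask of c's."""
    n = len(word)
    field = (1 << bits) - 1
    packed = pack(word, alphabet, bits)
    mask = pack(c * n, alphabet, bits)      # c repeated in every field
    diff = packed ^ mask                    # a field is all zero iff the character equals c
    return [1 if (diff >> (j * bits)) & field == 0 else 0 for j in range(n)]

def word_rank(word, c, r, alphabet="abc", bits=2):
    """Number of c's among the first r characters of the word."""
    return sum(occurrence_bits(word, c, alphabet, bits)[:r])

print(occurrence_bits("abaacbab", "b"))
print(word_rank("abaacbab", "b", 6))
```

The printed bit list has 1s at positions 2, 6 and 8, the same positions where the slide's XOR result shows 0.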

In-block structures
Linked list over sub-blocks
  – A block contains ½ log n to 2 log n words
  – A sub-block contains √log n words
  – One extra sub-block is a buffer for updates
Red-black tree over blocks
  – Leaf node: pointer to block, list of sub-blocks
  – Internal node: the number of blocks in its subtree

In-block rank-select
rank_Tb(c, r) in O(log n) time
  – Traverse the tree to find the b-th block
  – Scan the b-th block of Θ(log n) words

In-block updates
Update words in the list in O(log n) time
Process carry characters using the extra space in a block

In-block updates
Split or merge a block whose size falls out of range
Update tree nodes from leaf to root

Extension of our structure
Dynamic rank-select on plain texts over a large alphabet, σ < n
  – Use k-ary wavelet trees
  – O(log n log σ / log log n) time & n log σ + O(n log σ / log log n) bits
Application to run-length encoded texts
  – Start from RLFM (Mäkinen & Navarro)
  – Support dynamic BWT
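To show how a wavelet tree reduces rank over a large alphabet to rank over small-alphabet sequences, here is a minimal static sketch. It is a simplification: the tree is binary rather than k-ary, it supports no updates, and each node counts bits by scanning a list instead of using a rank structure. The class and its layout are illustrative, not the structure from the talk.

```python
class WaveletTree:
    """Static binary wavelet tree supporting rank(c, i) with 1-based prefix length i."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None                 # leaf: every symbol here is the same
            return
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        # 0 sends a symbol to the left child, 1 to the right child
        self.bits = [0 if ch in left_syms else 1 for ch in text]
        self.left = WaveletTree([ch for ch in text if ch in left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([ch for ch in text if ch not in left_syms],
                                 self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        if c not in self.alphabet:
            return 0
        if self.bits is None:
            return i                         # leaf holding only c
        mid = len(self.alphabet) // 2
        bit = 0 if c in self.alphabet[:mid] else 1
        count = self.bits[:i].count(bit)     # binary rank at this node
        child = self.left if bit == 0 else self.right
        return child.rank(c, count)

wt = WaveletTree("mississippi$")
print(wt.rank("s", 7), wt.rank("i", 11))
```

Each query touches one node per level, so replacing the per-node scan with a small-alphabet rank structure gives a cost of one such rank per level, which is the shape of the bound on the slide.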

Application to RLE
Run-Length Encoding (RLE) of T
  – Characters of runs: text T’
  – Lengths of runs: bit vector L
  – E.g. T = aaabbaacccc → T’ = abac, L = (bits not preserved in this transcript)
RLE of BWT (Mäkinen & Navarro)
  – Run-Length based FM-index
  – The number of runs in BWT(T) ≤ min(n, nH_k) + σ^k
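The split of T into run characters and run lengths can be sketched in a few lines. The bit-vector form of L shown here, with a 1 at the first position of each run followed by 0s, is an assumption: the transcript does not preserve the slide's bits, and other encodings of the lengths are possible.

```python
from itertools import groupby

def rle(text):
    """Return (T', run lengths) for the maximal runs of equal characters in text."""
    runs = [(ch, len(list(group))) for ch, group in groupby(text)]
    t_prime = "".join(ch for ch, _ in runs)
    lengths = [n for _, n in runs]
    return t_prime, lengths

def run_start_bits(lengths):
    """Assumed encoding of L: a 1 at the first position of each run, 0 elsewhere."""
    return "".join("1" + "0" * (n - 1) for n in lengths)

t_prime, lengths = rle("aaabbaacccc")
print(t_prime, lengths, run_start_bits(lengths))
```

Under this encoding L has the same length as T, and rank on L maps a text position to its run number.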

Application to RLE
Assume rank/select on L and T’
  – Total size of structure: O(n + n’ log σ)
  – Operation time: O(log n + log n log σ / log log n)
Some additional vectors
  – Sorted length vector: L’
  – Frequency table F’: counts characters in T’
E.g.
  – T = bb aa bbbb cc aaa, T’ = babca
  – Runs sorted by character: aa aaa bb bbbb cc
  – L, L’ and F’: (values not preserved in this transcript)
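One way to see why the sorted run lengths help: rank on T can be assembled from the run containing position i, the number of earlier runs of c, and the total length of those runs read from the character-sorted order. The sketch below does this with plain lists and linear scans on the slide's example text; it illustrates the idea only and is not the bit-vector procedure of the talk, whose details the transcript does not preserve.

```python
from bisect import bisect_right
from itertools import accumulate, groupby

def rle_rank(text, c, i):
    """rank_T(c, i) computed from the runs of text (1-based, inclusive)."""
    if i <= 0 or not text:
        return 0
    runs = [(ch, len(list(group))) for ch, group in groupby(text)]
    ends = list(accumulate(n for _, n in runs))
    starts = [1] + [e + 1 for e in ends[:-1]]      # 1-based start of each run
    j = bisect_right(starts, i)                    # run containing position i
    k = sum(1 for ch, _ in runs[:j - 1] if ch == c)   # runs of c before run j
    # Stable sort by character groups the runs of c together, in text order
    sorted_runs = sorted(runs, key=lambda run: run[0])
    c_lengths = [n for ch, n in sorted_runs if ch == c]
    total = sum(c_lengths[:k])                     # characters in the first k runs of c
    if runs[j - 1][0] == c:
        total += i - starts[j - 1] + 1             # partial run containing position i
    return total

T = "bbaabbbbccaaa"
print(rle_rank(T, "a", 12), rle_rank(T, "b", 6))
```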

Conclusion
Rank-select structures are an essential ingredient of compressed full-text indices
We propose a dynamic rank-select structure for a small alphabet and its large-alphabet version
Our structures can be applied to indices that use the BWT, such as RLFM and indices for text collections