A Categorization Theorem on Suffix Arrays with Applications to Space Efficient Text Indexes
Meng He, J. Ian Munro, and S. Srinivasa Rao
University of Waterloo


The Problem
Initial problem
- Text searching: finding occurrences of a pattern string in a large (static) document
Solution
- Text indexing: trading space for time
New problem
- Succinct text indexes: reducing the space cost

Pattern Searching
Given a text string T of length n and a pattern string P of length m, we look for the occurrences of P in T.
Three types of queries
- Existential queries: Does P occur in T?
- Cardinality queries: How many times does P occur in T?
- Listing queries: Where does P occur in T?
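The three query types can be illustrated with a naive scan that uses no index at all. This is only a baseline sketch, not part of the talk: the function names `occurrences`, `exists`, `count` and `locate` are hypothetical, and positions are reported 1-based to match the T[1..n] notation used in the slides.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Return every 1-based position where pattern occurs in text (naive O(nm) scan)."""
    n, m = len(text), len(pattern)
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


def exists(text: str, pattern: str) -> bool:
    """Existential query: does the pattern occur at all?"""
    return bool(occurrences(text, pattern))


def count(text: str, pattern: str) -> int:
    """Cardinality query: how many times does the pattern occur?"""
    return len(occurrences(text, pattern))


def locate(text: str, pattern: str) -> list[int]:
    """Listing query: where does the pattern occur?"""
    return occurrences(text, pattern)


if __name__ == "__main__":
    t = "abbaaba#"
    print(exists(t, "ba"), count(t, "ba"), locate(t, "ba"))
```

A text index exists to avoid exactly this scan: it pays a preprocessing and space cost so that each query depends on m (and the number of occurrences) rather than on n.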

Text Indexing
Inverted files
- Word index
- Need to store the text as well as the index
Suffix trees
- Efficient full-text index
- 4n lg n to 6n lg n bits!
Suffix arrays
- n lg n bits in basic form, but
- 3n lg n bits (with LCP data)
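To give a feel for these bounds, here is a worked instance. The text length n = 2^20 is an arbitrary choice for illustration, and the last line assumes a binary alphabet stored at one bit per character.

```latex
% Worked example, assuming n = 2^{20}, so \lg n = 20.
\begin{align*}
\text{suffix array (basic)}      &: \; n \lg n = 20 \cdot 2^{20}\ \text{bits} = 2.5\ \text{MiB} \\
\text{suffix array (with LCP)}   &: \; 3 n \lg n = 60 \cdot 2^{20}\ \text{bits} = 7.5\ \text{MiB} \\
\text{suffix tree}               &: \; 4 n \lg n \text{ to } 6 n \lg n = 10\ \text{to}\ 15\ \text{MiB} \\
\text{binary text itself}        &: \; n = 2^{20}\ \text{bits} = 128\ \text{KiB}
\end{align*}
```

Under these assumptions even the basic suffix array is 20 times larger than the text it indexes.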

Applications
Text databases
- Electronic encyclopedias, dictionaries, books, etc.
Web search engines
- Google, AltaVista, etc.
Bioinformatics
- Gene databases
More…

Related Work
Compressed suffix arrays
- Grossi & Vitter 2000
- Sadakane 2000
- Grossi, Gupta & Vitter 2003
FM-index
- Ferragina & Manzini 2000 & 2001

Assumptions & Notation
- Alphabet: Σ = {a, b}
- Text: T[1..n], with T[n] = #, where a < # < b
- Pattern: P[1..m]

Permutations and Suffix Arrays
An observation
- Permutations: n!
- Suffix arrays: 2^(n-1)
- Not all permutations are suffix arrays
An example
- A suffix array: 4, 7, 5, 1, 8, 3, 6, 2 (text: abbaaba#)
- A permutation: 4, 7, 1, 5, 8, 2, 3, 6 (not a suffix array of any binary text)
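The example suffix array can be reproduced by sorting the suffixes directly under the talk's ordering a < # < b. This is a minimal sketch for illustration only: the names `ORDER` and `suffix_array` are hypothetical, and sorting whole suffixes is far slower than a real construction algorithm.

```python
# Character order assumed throughout the talk: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}


def suffix_array(text: str) -> list[int]:
    """Return the 1-based suffix array of text by sorting its suffixes directly.

    Illustration only: this takes O(n^2 log n) time in the worst case.
    """
    def key(start: int) -> list[int]:
        # Map each character of the suffix T[start..n] to its rank in ORDER.
        return [ORDER[c] for c in text[start - 1:]]

    return sorted(range(1, len(text) + 1), key=key)


if __name__ == "__main__":
    print(suffix_array("abbaaba#"))  # [4, 7, 5, 1, 8, 3, 6, 2]
```

Because # occurs only at the end of the text, no suffix is a prefix of another, so the order is always strict.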

Two Features of Suffix Arrays
[Figure: panels labelled "Suffix Array" and "Another Permutation", illustrating the two features, ascending-to-max and non-nesting]

A Categorization Theorem
A permutation is a suffix array iff it is:
- Ascending-to-max
- Non-nesting
An immediate application:
- Checking whether a permutation is a suffix array in O(n) time using n + O(1) additional words in memory.
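For small inputs, the question "is this permutation a suffix array?" can also be answered without the theorem. The sketch below is not the authors' O(n) algorithm and does not test the ascending-to-max or non-nesting properties. It relies on one observation: with a < # < b and T[n] = #, every suffix ranked before suffix n must start with a and every suffix ranked after it must start with b, so the permutation fixes the only candidate text. The check rebuilds that text and compares suffix arrays. All function names are hypothetical.

```python
# Character order assumed throughout the talk: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}


def suffix_array(text: str) -> list[int]:
    """1-based suffix array by direct sorting (illustration only, not linear time)."""
    return sorted(range(1, len(text) + 1),
                  key=lambda start: [ORDER[c] for c in text[start - 1:]])


def text_from_permutation(perm: list[int]) -> str:
    """Build the only binary text (ending in #) whose suffix array could be perm."""
    n = len(perm)
    rank_of_last = perm.index(n)  # rank of the suffix "#"
    text = [""] * n
    for rank, pos in enumerate(perm):
        if rank < rank_of_last:
            text[pos - 1] = "a"   # sorts before "#", so it starts with a
        elif rank == rank_of_last:
            text[pos - 1] = "#"
        else:
            text[pos - 1] = "b"   # sorts after "#", so it starts with b
    return "".join(text)


def is_suffix_array(perm: list[int]) -> bool:
    """Brute-force check: rebuild the candidate text and compare suffix arrays."""
    n = len(perm)
    if n == 0 or sorted(perm) != list(range(1, n + 1)):
        return False
    return suffix_array(text_from_permutation(perm)) == list(perm)


if __name__ == "__main__":
    print(is_suffix_array([4, 7, 5, 1, 8, 3, 6, 2]))
    print(is_suffix_array([4, 7, 1, 5, 8, 2, 3, 6]))
```

Both example permutations from the earlier slide map to the same candidate text, abbaaba#, which is why only one of them can be a suffix array.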

Application: Space Efficient Suffix Array
Text: abaaabbaaabaabb#
[Figure: the arrays labelled SA, B_a and B_b for this text; their values did not survive extraction]

Basic Searching Algorithm: Answering Cardinality Queries
Basic idea: backward search
- Start from the end of the pattern P
- For i = m, m-1, …, 1, compute the interval [s, e] of SA whose corresponding suffixes are prefixed with P[i..m]
[Figure: SA intervals for the example pattern P = aba]
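The backward search loop can be sketched as follows. This is a plain, unoptimized illustration and not the talk's succinct index: it stores the full suffix array and counts characters by scanning a list, where a space-efficient index would answer the same counts from compact auxiliary structures. The names `backward_search`, `count_occurrences`, `first` and `last` are hypothetical.

```python
# Character order assumed throughout the talk: a < # < b.
ORDER = {"a": 0, "#": 1, "b": 2}


def suffix_array(text: str) -> list[int]:
    """1-based suffix array by direct sorting (illustration only)."""
    return sorted(range(1, len(text) + 1),
                  key=lambda start: [ORDER[c] for c in text[start - 1:]])


def backward_search(text: str, pattern: str) -> tuple[int, int]:
    """Return the 1-based inclusive SA interval [s, e] of suffixes prefixed by pattern.

    The interval is empty (s > e) when the pattern does not occur.
    """
    sa = suffix_array(text)
    # last[i] is the character preceding the suffix of rank i + 1;
    # for the suffix starting at position 1 this wraps around to the final '#'.
    last = [text[p - 2] for p in sa]
    # first[c] counts the suffixes whose first character is smaller than c.
    first = {c: sum(ORDER[x] < ORDER[c] for x in text) for c in ORDER}

    s, e = 1, len(text)
    for c in reversed(pattern):           # i = m, m-1, ..., 1
        s = first[c] + last[:s - 1].count(c) + 1
        e = first[c] + last[:e].count(c)
        if s > e:                         # P[i..m] does not occur
            break
    return s, e


def count_occurrences(text: str, pattern: str) -> int:
    """Cardinality query: the size of the final interval."""
    s, e = backward_search(text, pattern)
    return e - s + 1


if __name__ == "__main__":
    print(count_occurrences("abaaabbaaabaabb#", "aba"))
```

Each step narrows the interval for P[i+1..m] to the interval for P[i..m] using only character counts, which is why the pattern is processed from its last character to its first.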

More Algorithms and Tradeoffs
- Answering listing queries
- Speeding up the reporting of occurrences of long patterns
- Self-indexing
- Time-space tradeoff: multi-level structure

Putting It All Together
Three index structures:

Index     Space (bits)   Pattern searching
Index 1   n + o(n)       O(m) (existential and cardinality queries only)
Index 2   2n + o(n)      O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg n) otherwise
Index 3   O(n)           O(m + occ) if m = Ω(lg^(1+ε) n); O(m + occ lg^λ n) otherwise

Conclusion
Summary
- A theorem that characterizes a permutation as the suffix array of a binary string
- An efficient algorithm checking whether a permutation is a suffix array
- Three space efficient text indexing methods

Conclusions (Continued)
Related subsequent work
- Generalization to larger alphabets
Open problem
- An O(n)-bit text index supporting searching in O(m + occ) time.

Thank You.