A New Compressed Suffix Tree Supporting Fast Search and its Construction Algorithm Using Optimal Working Space
Dong Kyue Kim (1) and Heejin Park (2)
(1) School of Electrical and Computer Engineering, Pusan National Univ.
(2) College of Information and Communications, Hanyang Univ.

Contents
- Preliminaries
- Previous results
- Our contribution
- Conclusion

Suffix Tree
The suffix tree (ST) of a text T is a compacted trie for all the suffixes of T. We assume that # is the lexicographically smallest special symbol.
[Figure: the suffix tree for accagat#.]

Suffix Array
The suffix array (SA) of a text T consists of two arrays:
- a pos array
- an lcp array

Suffix Array: pos
The pos array of T stores the starting positions of the lexicographically sorted suffixes of T. For T = accagat#:

i  pos[i]  suffix
1  8       #
2  1       accagat#
3  4       agat#
4  6       at#
5  3       cagat#
6  2       ccagat#
7  5       gat#
8  7       t#

Suffix Array: lcp
The lcp array of T stores the length of the longest common prefix of each pair of adjacent suffixes in the pos array. For example, lcp[3] stores 1, which is the length of the longest common prefix of accagat# and agat#. For T = accagat#:

i  pos[i]  lcp[i]  suffix
1  8       -       #
2  1       0       accagat#
3  4       1       agat#
4  6       1       at#
5  3       0       cagat#
6  2       1       ccagat#
7  5       0       gat#
8  7       0       t#
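The pos and lcp values for accagat# can be reproduced with a short sketch. It uses a naive comparison sort purely for illustration (not a linear-time construction), relies on '#' sorting before the letters in ASCII, and the function name build_pos_lcp is a hypothetical helper rather than anything from the slides.

```python
def build_pos_lcp(text):
    """Return 1-indexed pos and lcp arrays (index 0 is an unused placeholder)."""
    n = len(text)
    # Sort the 1-based suffix start positions by the suffix starting there.
    # Naive sorting is used for clarity only.
    order = sorted(range(1, n + 1), key=lambda p: text[p - 1:])
    pos = [None] + order

    def common_prefix_len(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k

    # lcp[i] compares the suffixes at ranks i-1 and i, so lcp[1] is undefined.
    lcp = [None, None]
    for i in range(2, n + 1):
        lcp.append(common_prefix_len(text[pos[i - 1] - 1:], text[pos[i] - 1:]))
    return pos, lcp


pos, lcp = build_pos_lcp("accagat#")
for i in range(1, len(pos)):
    print(i, pos[i], lcp[i], "accagat#"[pos[i] - 1:])
```

Both arrays are kept 1-indexed so that the printed rows line up with the indices used on the slides.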

Storing Suffix Trees in Arrays
If a suffix tree is used as a static data structure, it can be implemented using arrays instead of nodes and pointers, in much the same way that a complete binary tree is stored in an array.
Array-based data structures storing suffix trees:
- Enhanced suffix arrays (ESA)
- Linearized suffix trees (LST)

Enhanced Suffix Array
The enhanced suffix array was developed by Abouelhoda et al. [SPIRE ’02, WABI ’02, JDA ’04].
- It consists of a pos array, an lcp array, and a child table.
- The child table is an array implementation of the suffix tree topology in which node branching is implemented by a linked list.
- Pattern search takes O(m|Σ|) time, where m is the pattern length and |Σ| is the alphabet size.

Linearized Suffix Tree
The linearized suffix tree is an improvement on the ESA developed by Kim et al. [SPIRE ’04].
- It consists of a pos array, an lcp array, and a new child table.
- The new child table is an array implementation of the suffix tree topology in which node branching is implemented by a complete binary tree.
- Pattern search takes O(m log|Σ|) time, where m is the pattern length and |Σ| is the alphabet size.

Compressed Full-text Indices
Compressed full-text indices
- occupy O(n log|Σ|)-bit space, whereas
- all the full-text indices introduced so far (ST, SA, ESA, LST) occupy O(n)-word space, i.e., O(n log n) bits.
Compressed suffix array (CSA)
- A succinct representation of a pos array.
Compressed suffix tree (CST)
- A succinct representation of a pos array, an lcp array, and a suffix tree topology.

Previous Results
- Munro et al. [1998], Sadakane [2002]: a succinct representation of a suffix tree topology.
- Grossi and Vitter [2000]: a succinct representation of a pos array.
- Sadakane [2002]: a succinct representation of an lcp array.
These data structures require O(n log|Σ|)-bit space. However, when they were introduced, constructing them required more than O(n log|Σ|) bits of working space.

Previous Results
Hon et al. [2002][2003] developed O(n log|Σ|)-bit working space algorithms for constructing CSTs and CSAs that run in O(n log^ε n) time.
- Their construction algorithm for CSTs can construct CSTs supporting O(m log^ε n |Σ|)-time pattern search.
- However, it cannot construct CSTs supporting O(m log^ε n log|Σ|)-time pattern search.

Our Contribution
- We first present a new CST supporting O(m log^ε n log|Σ|)-time pattern search.
- We then present an algorithm for constructing the new CST that runs in optimal O(n log|Σ|)-bit working space and O(n log^ε n) time.

New Compressed Suffix Tree
Our new compressed suffix tree is a succinct representation of the linearized suffix tree (LST). It consists of:
- a succinct representation of a pos array,
- a succinct representation of an lcp array, and
- a succinct representation of a child table, which stores a suffix tree topology.

New Compressed Suffix Tree
The succinct representations of the pos array and the lcp array are the same as before:
- a succinct representation of a pos array (Grossi & Vitter)
- a succinct representation of an lcp array (Sadakane)
The succinct representation of the child table, which stores a suffix tree topology, is new.

Previous Compressed Suffix Tree Topology
The previous succinct representation of a suffix tree topology is a parentheses representation.
- Every node is represented by a pair of parentheses.
- The pair of parentheses of a node encloses the parentheses of its children.
( () (() () ()) (() ()) () ())

Previous Compressed Suffix Tree Topology
( () (() () ()) (() ()) () ())
- In this representation, the parent-child relationship is stored implicitly.
- To find a child of a node, a range-minima query is required.
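A small sketch may help show what "implicit" means here. It encodes a tree shape (a root with a leaf, a three-leaf node, a two-leaf node, and two more leaves, matching the parentheses string above) and then locates the root's children by scanning. The linear scan is only a stand-in for the range-minima based navigation mentioned on the slide, and the nested-list input format and function names are assumptions made for this illustration.

```python
def to_parentheses(node):
    """Encode a tree given as nested lists; a leaf is an empty list."""
    return "(" + "".join(to_parentheses(child) for child in node) + ")"


def child_offsets(bits, node_start):
    """Return the positions where the children of the node opening at
    node_start begin. Children are not stored explicitly, so they are
    found by tracking nesting depth."""
    offsets = []
    depth = 0
    i = node_start + 1
    while True:
        if bits[i] == "(":
            if depth == 0:
                offsets.append(i)  # a direct child starts here
            depth += 1
        else:
            if depth == 0:
                break              # closing parenthesis of the node itself
            depth -= 1
        i += 1
    return offsets


leaf = []
shape = [leaf, [leaf, leaf, leaf], [leaf, leaf], leaf, leaf]
encoded = to_parentheses(shape)
print(encoded)
print(child_offsets(encoded, 0))
```

The encoding spends two bits per node, which is why it is attractive as a succinct topology, but every child lookup has to be derived from the bit string.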

New Compressed Tree Topology
Our succinct representation differs from the previous one in that we store the parent-child relationship explicitly rather than implicitly, so no range-minima query is required.

Child Table
We first describe a child table and then its succinct representation, the compressed child table.
A child table stores an lcp-interval tree, which is a modification of a suffix tree.
- We first show how to transform a suffix tree into an lcp-interval tree.
- We then show how to store an lcp-interval tree in a child table.

Child Table (suffix tree → lcp-interval tree → child table)
[Figure: the suffix tree for accagat#, and the suffix tree for accagat# whose node branching is a complete binary tree.]

Child Table (suffix tree → lcp-interval tree → child table)
Each node in the suffix tree is replaced by the interval in the pos array that stores the suffixes in the subtree rooted at the node.
[Figure: the suffix tree for accagat# and the corresponding lcp-interval tree, with internal intervals [1..8], [1..6], [7..8], [1..4], [5..6], [2..4], [2..3] and leaves [1] to [8].]

Child Table (suffix tree → lcp-interval tree → child table)
Each interval [i..j] only has to store the first index of its right child, denoted by child(i,j), to compute its two children.
- Interval [1..8] only has to store 7 to compute its two children [1..6] and [7..8].
- Interval [1..6] stores 5 to compute its two children [1..4] and [5..6].
[Figure: the lcp-interval tree and the child table.]

Child Table (suffix tree → lcp-interval tree → child table)
Where is child(i,j) stored? We store child(i,j) in cldtab[i] or cldtab[j].
- If [i..j] is a right child, child(i,j) is stored in cldtab[i].
- If [i..j] is a left child, child(i,j) is stored in cldtab[j].
- Interval [7..8] is a right child, so child(7,8) = 8 is stored in cldtab[7].
- Interval [1..6] is a left child, so child(1,6) = 5 is stored in cldtab[6].
[Figure: the lcp-interval tree and the child table.]
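The storage rule can be traced on the example with a short sketch. The TREE dictionary below is transcribed by hand from the intervals listed on the slides; the function names are hypothetical, and because the slides do not say where the root's value is kept, the sketch returns it separately (an assumption).

```python
# Binary lcp-interval tree for T = accagat#, transcribed from the slides.
# Each internal interval (i, j) maps to its (left child, right child);
# singleton intervals such as (7, 7) are leaves and have no entry.
TREE = {
    (1, 8): ((1, 6), (7, 8)),
    (1, 6): ((1, 4), (5, 6)),
    (7, 8): ((7, 7), (8, 8)),
    (1, 4): ((1, 1), (2, 4)),
    (5, 6): ((5, 5), (6, 6)),
    (2, 4): ((2, 3), (4, 4)),
    (2, 3): ((2, 2), (3, 3)),
}


def build_cldtab(tree, root):
    """Return (child value of the root, cldtab as a dict slot -> value)."""
    cldtab = {}

    def visit(interval, is_right_child):
        if interval not in tree:
            return  # leaf: nothing to store
        i, j = interval
        left, right = tree[interval]
        # Right children store at their left end i, left children at j.
        slot = i if is_right_child else j
        cldtab[slot] = right[0]  # first index of the right child
        visit(left, False)
        visit(right, True)

    left, right = tree[root]
    visit(left, False)
    visit(right, True)
    return right[0], cldtab


def children(i, j, child):
    """Recover the two children of [i..j] from child(i, j)."""
    return (i, child - 1), (child, j)


root_child, cldtab = build_cldtab(TREE, (1, 8))
print(root_child, dict(sorted(cldtab.items())))
print(children(1, 8, root_child))
```

In this example no two intervals compete for the same slot, which is the property that lets a single array of length n hold the whole topology.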

Compressed Child Table (child table → difference child table → compressed child table)
The difference child table consists of:
- a diff array
- a sign array
[Figure: the lcp-interval tree with its child table and its difference child table (diff and sign arrays).]

Compressed Child Table (child table → difference child table → compressed child table)
In the diff array, instead of storing child(i,j), we store min{j - child(i,j), child(i,j) - i}.
- For the interval [1..4], whose child(1,4) = 2, we compute 4 - 2 = 2 and 2 - 1 = 1, and the minimum, 1, is stored in diff[4].

Compressed Child Table (child table → difference child table → compressed child table)
Whether a diff entry stores j - child(i,j) or child(i,j) - i is indicated by the corresponding sign entry: it stores 0 if j - child(i,j) is stored in the diff entry and 1 if child(i,j) - i is stored.
- Since diff[4] stores child(1,4) - 1, sign[4] stores 1.
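The diff and sign definitions can be checked with a small sketch. The entries below list each interval of the running example with its child value and storage slot; only [1..4], [1..6], and [7..8] are spelled out on the slides, the rest follow from the storage rule. Breaking ties toward sign 0 is an assumption, since the slides do not cover that case, and the function names are hypothetical.

```python
# (i, j, child(i, j), slot) for the non-root intervals of the example tree.
ENTRIES = [
    (1, 6, 5, 6),
    (7, 8, 8, 7),
    (1, 4, 2, 4),
    (5, 6, 6, 5),
    (2, 4, 4, 2),
    (2, 3, 3, 3),
]


def encode(entries):
    """Build the diff and sign arrays as dicts keyed by slot."""
    diff, sign = {}, {}
    for i, j, child, slot in entries:
        from_right = j - child
        from_left = child - i
        if from_right <= from_left:  # tie goes to sign 0 (assumption)
            diff[slot], sign[slot] = from_right, 0
        else:
            diff[slot], sign[slot] = from_left, 1
    return diff, sign


def decode(i, j, slot, diff, sign):
    """Recover child(i, j) from the stored difference and its sign bit."""
    if sign[slot] == 0:
        return j - diff[slot]
    return i + diff[slot]


diff, sign = encode(ENTRIES)
print(diff[4], sign[4])
```

Storing the smaller of the two distances is what keeps the entries short enough for the bit-level encoding described on the following slides.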

Compressed Child Table (child table → difference child table → compressed child table)
The compressed child table consists of:
- a compressed diff array
- the sign array

Compressed Child Table (child table → difference child table → compressed child table)
The compressed child table consists of:
- a compressed diff array, made up of
  - the C array: a concatenated bit string of the integers in the diff array,
  - the D array: a bit string of the same length as the C array in which every bit is 0 except at the starting bit of each integer in the diff array, and
  - data structures for rank and select on the D array, used to find the i-th leftmost 1 in the D array;
- the sign array.
[Figure: the diff array with its C array and D array.]
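As a rough illustration of how the C and D arrays work together, the sketch below packs a list of small integers and reads one back using select on D. Writing each integer in plain binary (with 0 written as a single "0" bit) is an assumption, the sample values are hypothetical, and select is a linear scan here rather than the o(n)-bit constant-time structure the slide refers to.

```python
def compress(diff_values):
    """Return (C, D) as strings of '0'/'1' characters."""
    c_parts, d_parts = [], []
    for value in diff_values:
        code = format(value, "b")  # plain binary; 0 becomes "0"
        c_parts.append(code)
        # D marks the first bit of every integer with a 1.
        d_parts.append("1" + "0" * (len(code) - 1))
    return "".join(c_parts), "".join(d_parts)


def select1(bits, k):
    """Position of the k-th leftmost 1 (k >= 1), or len(bits) if absent."""
    seen = 0
    for position, bit in enumerate(bits):
        if bit == "1":
            seen += 1
            if seen == k:
                return position
    return len(bits)


def get(c, d, k):
    """Read the k-th integer (1-based) back out of C using D."""
    start = select1(d, k)
    end = select1(d, k + 1)  # start of the next integer, or end of C
    return int(c[start:end], 2)


values = [3, 0, 1, 5, 2]  # hypothetical diff entries
c, d = compress(values)
print(c)
print(d)
print([get(c, d, k) for k in range(1, len(values) + 1)])
```

Because D has the same length as C, the pair costs twice the total bit length of the integers, in line with the 2n + 2n split given for the C and D arrays.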

Compressed Child Table
Space consumption of a compressed child table: the compressed child table requires 5n + o(n) bits.
- C array: 2n bits
- D array: 2n bits
- Data structures for rank and select: o(n) bits
- sign array: n bits

Construction Algorithm
We construct the compressed child table directly from the lcp array without building a suffix tree or an lcp-interval tree as an intermediate data structure.
- The child table can be constructed directly from the lcp array in O(n) time, due to Kim et al. [SPIRE ’04].
- They first construct the extended lcp array and then compute the child table.
We modify their construction algorithm so that it constructs the compressed child table directly from the compressed lcp array.

Construction Algorithm
The construction algorithm consists of two procedures, EXTLCP and CHILD.
- Procedure EXTLCP constructs the compressed extended lcp array from the compressed lcp array.
- Procedure CHILD constructs the compressed child table, i.e., the C, D, and sign arrays, from the compressed extended lcp array.
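The EXTLCP and CHILD procedures themselves are not reproduced in this transcript. As background on how tree structure can be read off an lcp array alone, here is a sketch of the well-known stack-based enumeration of lcp-intervals. It is not the authors' algorithm: it works on an uncompressed lcp array and reports only the original multi-way intervals, not the auxiliary intervals introduced by binary branching. The function name is hypothetical.

```python
def lcp_intervals(lcp):
    """Enumerate lcp-intervals as (lcp value, left, right) triples.

    lcp is 1-indexed: lcp[k] for k = 2..n is the longest common prefix of
    the suffixes at ranks k-1 and k; lcp[0] and lcp[1] are placeholders.
    """
    n = len(lcp) - 1
    stack = [(0, 1)]  # (lcp value, left boundary); the root is always open
    found = []
    for k in range(2, n + 1):
        left = k - 1
        # Close every open interval whose lcp value exceeds lcp[k].
        while lcp[k] < stack[-1][0]:
            value, left = stack.pop()
            found.append((value, left, k - 1))
        # Open a new interval if lcp[k] exceeds the current top.
        if lcp[k] > stack[-1][0]:
            stack.append((lcp[k], left))
    while stack:
        value, left = stack.pop()
        found.append((value, left, n))
    return found


lcp = [None, None, 0, 1, 1, 0, 1, 0, 0]  # lcp array of accagat#
print(lcp_intervals(lcp))
```

For accagat# this reports the intervals [2..4], [5..6], and [1..8], which are the internal nodes of the original suffix tree.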

Construction Algorithm
[Figure: pseudo-code for EXTLCP.]

Construction Algorithm
To obtain O(n log|Σ|)-bit working space, the size of the temporary data structures must be O(n log|Σ|) bits.

Construction Algorithm
The arrays ranking and numchild are of size O(n log|Σ|) bits, because a node may have up to |Σ| children and so each entry of these arrays consumes log|Σ| bits.

Construction Algorithm
The size of the stack is O(n log|Σ|) bits because it can be encoded by δ-code.
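Assuming that δ-code here refers to the Elias delta code, the following sketch shows how positive integers can be written as self-delimiting bit strings and read back in order, which is what allows a sequence of values to be packed without fixed-width words. How the authors actually lay out the stack is not shown in the transcript, so the packing below is only an illustration with hypothetical function names.

```python
def elias_delta_encode(x):
    """Elias delta code of a positive integer, as a string of bits."""
    if x < 1:
        raise ValueError("Elias delta is defined for integers >= 1")
    n_bits = x.bit_length()         # length of x in binary
    len_bits = n_bits.bit_length()  # length of n_bits in binary
    # Unary prefix, then n_bits in binary, then x without its leading 1.
    return "0" * (len_bits - 1) + format(n_bits, "b") + format(x, "b")[1:]


def elias_delta_decode(bits, start=0):
    """Decode one integer beginning at start; return (value, next position)."""
    zeros = 0
    while bits[start + zeros] == "0":
        zeros += 1
    start += zeros
    n_bits = int(bits[start:start + zeros + 1], 2)
    start += zeros + 1
    value = int("1" + bits[start:start + n_bits - 1], 2)
    return value, start + n_bits - 1


values = [1, 2, 10, 17]
packed = "".join(elias_delta_encode(v) for v in values)
decoded, position = [], 0
while position < len(packed):
    value, position = elias_delta_decode(packed, position)
    decoded.append(value)
print(packed)
print(decoded)
```

Small values get short codes, so a sequence dominated by small numbers takes far fewer bits than storing one machine word per entry.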

Construction Algorithm
[Figure: pseudo-code for CHILD.]
We also developed some techniques to reduce the working space.

Conclusion
- We presented a new compressed suffix tree supporting O(m log^ε n log|Σ|)-time pattern search, whose compressed child table consumes 5n + o(n) bits.
- We also presented a construction algorithm for our compressed suffix tree that runs in O(n log|Σ|)-bit working space and O(n log^ε n) time.

Compressed Child Table
Space consumption of a compressed child table: the compressed child table requires 5n + o(n) bits.
- C array: 2n bits, with S(n) = max over 1 ≤ k ≤ n/2 of { S(k) + S(n-k) + log(k+1) }
- D array: 2n bits
- Data structures for rank and select: o(n) bits
- sign array: n bits
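The recurrence can be evaluated numerically to see how it relates to the 2n-bit figure for the C array. The sketch assumes that S(n) is the worst-case number of C-array bits for an interval covering n suffixes, that S(1) = 0, and that log(k+1) means the base-2 logarithm rounded up; none of these are stated on the slide. The function name is hypothetical.

```python
def worst_case_bits(limit):
    """Return a list s with s[n] = S(n) for 0 <= n <= limit."""
    s = [0, 0]  # S(0) is unused; S(1) = 0 is an assumption
    for n in range(2, limit + 1):
        # For k >= 1, k.bit_length() equals ceil(log2(k + 1)).
        s.append(max(s[k] + s[n - k] + k.bit_length()
                     for k in range(1, n // 2 + 1)))
    return s


s = worst_case_bits(64)
for n in (2, 4, 8, 16, 32, 64):
    print(n, s[n], 2 * n)
```

Under these assumptions the computed values stay below 2n over the range checked, which agrees with the bound quoted for the C array.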