Linear-time construction of CSA using o(n log n)-bit working space for large alphabets Joong Chae Na School of Computer Sci. & Eng. Seoul National University,


Linear-time construction of CSA using o(n log n)-bit working space for large alphabets Joong Chae Na School of Computer Sci. & Eng. Seoul National University, Korea

Overview
- Background
  - Suffix arrays (SA)
  - Compressed suffix arrays (CSA)
- Problem definition
- Previous works
- Our contributions
- Description of our algorithm
- Conclusions

Background (1)
Given a string T of length n over an alphabet Σ:
Suffix array (SA) of T [Manber & Myers ’93]
- Lexicographically sorted list of the suffixes of T
- Space: O(n log n) bits

T = babaabba$

i  SA  suffix
1  9   $
2  8   a$
3  4   aabba$
4  2   abaabba$
5  5   abba$
6  7   ba$
7  3   baabba$
8  1   babaabba$
9  6   bba$
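To make the table concrete, here is a minimal sketch that rebuilds it by sorting the suffixes directly. The function name `suffix_array` is illustrative only, and the naive sort is not one of the linear-time constructions discussed in these slides; it just reproduces the definition with 1-based positions.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by naively sorting all suffixes.

    Assumes the terminator "$" sorts before every other character,
    which holds for ASCII letters.
    """
    n = len(text)
    return sorted(range(1, n + 1), key=lambda k: text[k - 1:])


T = "babaabba$"
SA = suffix_array(T)
print(SA)  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```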

Background (2)
Compressed suffix array (CSA) [Grossi & Vitter ’00]
- Compressed version of SA
- Space requirement of O(n log|Σ|) bits
FM-index [Ferragina & Manzini 2000]

T = babaabba$

i  SA  Ψ_T  suffix
1  9   8    $
2  8   1    a$
3  4   5    aabba$
4  2   7    abaabba$
5  5   9    abba$
6  7   2    ba$
7  3   3    baabba$
8  1   4    babaabba$
9  6   6    bba$

Problem definition
Constructing SA, CSA, and FM-index using
- o(n log n) time and
- o(n log n)-bit working space
Working space
- Temporary space required for executing an algorithm
- Does not include the space for the input and output

Related works
Constructing SA and CSA
※ O(n log n)-bit working space
- Manber & Myers [1993]: O(n log n) time
- Kim et al. [2003]: O(n) time
- Kärkkäinen & Sanders [2003]: O(n) time
- Ko & Aluru [2003]: O(n) time
※ O(n log|Σ|)-bit working space
- Lam et al. [COCOON 2002]: O(|Σ| n log n) time
- Hon et al. [ISAAC 2003]: O(n log n) time
None of these algorithms satisfies both the time and the space requirements of our problem.

Previous results
Hon et al. [FOCS 2003]
- An algorithm using O(n log log|Σ|) time and O(n log|Σ|)-bit working space
- The first algorithm using o(n log n) time and o(n log n)-bit working space
- Follows ½-recursion (the odd-even scheme)

Our contributions
Another algorithm using o(n log n) time and o(n log n)-bit working space
- O(n) time and O(n log|Σ| · (log_|Σ| n)^α)-bit working space, where α = log_3 2 ≈ 0.63
- The first alphabet-independent linear-time algorithm for constructing SA, CSA, and FM-index using o(n log n)-bit working space
- Follows ⅔-recursion (the skew scheme)

Hon et al. vs. our results

              Hon et al.        Our results
Time          O(n log log|Σ|)   O(n)
Space (bits)  O(n log|Σ|)       O(n log|Σ| · (log_|Σ| n)^α)
Scheme        ½-recursion       ⅔-recursion
(merging)     complex           simple
(encoding)*   implicit (both algorithms)

* The encoding step is the most complex and time-consuming step in ⅔-recursion. However, neither algorithm needs the encoding step.

Description of our algorithm

Overview
- Preliminaries
- Basic definitions and notations
- Main technique
- Outline of our algorithm

Preliminaries: Ψ function
Let T[k..n] be the lexicographically i-th smallest suffix of T. Then
- SA[i] = k
- Ψ_T[i] = the position in SA where T[k+1..n] is stored

T = babaabba$

i  SA  Ψ_T  suffix
1  9   8    $
2  8   1    a$
3  4   5    aabba$
4  2   7    abaabba$
5  5   9    abba$
6  7   2    ba$
7  3   3    baabba$
8  1   4    babaabba$
9  6   6    bba$
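The definition can be checked against the table with a short sketch. It assumes, as the first table row suggests (Ψ_T[1] = 8 is the position of T[1..n]), that the suffix following T[n..n] wraps around to T[1..n]. The helper name `psi_from_sa` is hypothetical.

```python
def psi_from_sa(sa):
    """Compute the 1-based Psi array from a 1-based suffix array.

    Psi[i] is the position in SA of the suffix starting one character
    after the suffix at SA[i]; the successor of position n is taken to be
    position 1 (an assumption read off the example table).
    """
    n = len(sa)
    rank = [0] * (n + 1)              # rank[k] = i such that SA[i] = k
    for i, k in enumerate(sa, start=1):
        rank[k] = i
    return [rank[k % n + 1] for k in sa]


SA = [9, 8, 4, 2, 5, 7, 3, 1, 6]      # suffix array of "babaabba$"
print(psi_from_sa(SA))                # [8, 1, 5, 7, 9, 2, 3, 4, 6]
```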

Preliminaries: Lemmas
Hon et al. [FOCS 2003]
- Text, Ψ → SA, CSA: O(n) time, O(n log|Σ|)-bit working space
- Text, Ψ → C array (BWT) → FM-index: O(n) time, O(n log|Σ|)-bit working space
Note: the goal is therefore Text → Ψ.
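As a rough illustration of why Ψ determines SA, the sketch below walks Ψ once from the rank of T[1..n]. It is not the space-efficient procedure of the lemma, since it writes SA out as a plain array of n integers. It also assumes a unique smallest terminator (so the suffix "$" has rank 1) and the wrap-around convention Ψ[1] = rank of T[1..n]. The name `sa_from_psi` is made up for this example.

```python
def sa_from_psi(psi):
    """Recover the 1-based suffix array from a 1-based Psi array.

    Assumes rank 1 belongs to the terminator suffix T[n..n] and that
    psi[0] therefore holds the rank of T[1..n] (wrap-around convention).
    """
    n = len(psi)
    sa = [0] * n
    i = psi[0]                 # rank of the suffix starting at position 1
    for k in range(1, n + 1):
        sa[i - 1] = k          # the suffix starting at k has rank i
        i = psi[i - 1]         # move to the rank of the suffix starting at k + 1
    return sa


print(sa_from_psi([8, 1, 5, 7, 9, 2, 3, 4, 6]))  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```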

Basic definitions and notations (1)
Residue-1 suffixes of T
- T[3i-2..n] for 1 ≤ i ≤ n/3
- T[1..n], T[4..n], T[7..n], …
Residue-2 suffixes of T
- T[3i-1..n] for 1 ≤ i ≤ n/3
- T[2..n], T[5..n], T[8..n], …
Residue-3 suffixes of T
- T[3i..n] for 1 ≤ i ≤ n/3
- T[3..n], T[6..n], T[9..n], …

Example: T[1..n] = babaabba$
- Residue-1: babaabba$, aabba$, ba$
- Residue-2: abaabba$, abba$, a$
- Residue-3: baabba$, bba$, $

Basic definitions and notations (2)
T = babaabba$ (alphabet Σ)
T12 = bab aab ba$ aba abb a$b (each triple is one character over Σ³)
- T12[1..⅔n] = T[1..n] T[2..n] T[1]
- length: ⅔n, alphabet: Σ³
- SA12: suffix array of T12
T3 = baa bba $ba
- T3[1..⅓n] = T[3..n] T[1] T[2]
- length: ⅓n, alphabet: Σ³
- SA3: suffix array of T3
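The two derived strings can be built mechanically from the formulas on this slide. The sketch below assumes n is a multiple of 3, as in the running example; `triples` and `build_t12_t3` are illustrative names, not taken from the talk.

```python
def triples(s):
    """Split s into consecutive 3-character blocks."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]


def build_t12_t3(text):
    """Build T12 = T[1..n] T[2..n] T[1] and T3 = T[3..n] T[1] T[2] as lists of triples.

    Assumes len(text) is a multiple of 3 (true for the running example).
    """
    assert len(text) % 3 == 0
    t12 = triples(text + text[1:] + text[0])
    t3 = triples(text[2:] + text[0] + text[1])
    return t12, t3


t12, t3 = build_t12_t3("babaabba$")
print(t12)  # ['bab', 'aab', 'ba$', 'aba', 'abb', 'a$b']
print(t3)   # ['baa', 'bba', '$ba']
```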

Main technique: Ψ’ function
- Ψ’ is just like Ψ, but it is defined on SA12 and SA3.
- Ψ’ points to the position in SA12 or SA3 where T[k+1..n] (the suffix following the current suffix T[k..n]) is stored.
※ Note that Ψ’ is not the Ψ function of T12 and T3.
- The Ψ’ function consists of Ψ’_T12 and Ψ’_T3.

Ψ’ function (residue-1)
▶ Ψ’_T12 (residue-1 suffixes of T)
- Let T[3k-2..n] be the suffix stored in SA12[i].
- Then Ψ’_T12[i] is the position in SA12 where the next suffix T[3k-1..n] is stored.
Ψ’_T12 (residue-2 suffixes of T)
- Let T[3k-1..n] be the suffix stored in SA12[i].
- Then Ψ’_T12[i] is the position in SA3 where the next suffix T[3k..n] is stored.
Ψ’_T3 (residue-3 suffixes of T)
- Let T[3k..n] be the suffix stored in SA3[i].
- Then Ψ’_T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.

Ψ’ function (residue-1): example
T   = babaabba$
T12 = bab aab ba$ aba abb a$b
T3  = baa bba $ba

i  SA12  Ψ’_T12  suffix
1  6     1       a$b
2  2     4       aab ba$
3  4     2       aba abb a$b
4  5     3       abb a$b
5  3     1       ba$
6  1     3       bab aab ba$

i  SA3  Ψ’_T3  suffix
1  3    6      $ba
2  1    2      baa bba $ba
3  2    5      bba $ba

Ψ’ function (residue-2)
Ψ’_T12 (residue-1 suffixes)
- Let T[3k-2..n] be the suffix stored in SA12[i].
- Then Ψ’_T12[i] is the position in SA12 where the next suffix T[3k-1..n] is stored.
▶ Ψ’_T12 (residue-2 suffixes)
- Let T[3k-1..n] be the suffix stored in SA12[i].
- Then Ψ’_T12[i] is the position in SA3 where the next suffix T[3k..n] is stored.
Ψ’_T3 (residue-3 suffixes)
- Let T[3k..n] be the suffix stored in SA3[i].
- Then Ψ’_T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.

Ψ’ function (residue-2): example
T   = babaabba$
T12 = bab aab ba$ aba abb a$b
T3  = baa bba $ba

i  SA12  Ψ’_T12  suffix
1  6     1       a$b
2  2     4       aab ba$
3  4     2       aba abb a$b
4  5     3       abb a$b
5  3     1       ba$
6  1     3       bab aab ba$

i  SA3  Ψ’_T3  suffix
1  3    6      $ba
2  1    2      baa bba $ba
3  2    5      bba $ba

Ψ’ function (residue-3)
Ψ’_T12 (residue-1 suffixes)
- Let T[3k-2..n] be the suffix stored in SA12[i].
- Then Ψ’_T12[i] is the position in SA12 where the next suffix T[3k-1..n] is stored.
Ψ’_T12 (residue-2 suffixes)
- Let T[3k-1..n] be the suffix stored in SA12[i].
- Then Ψ’_T12[i] is the position in SA3 where the next suffix T[3k..n] is stored.
▶ Ψ’_T3 (residue-3 suffixes)
- Let T[3k..n] be the suffix stored in SA3[i].
- Then Ψ’_T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.

Ψ’ function (residue-3): example
T   = babaabba$
T12 = bab aab ba$ aba abb a$b
T3  = baa bba $ba

i  SA12  Ψ’_T12  suffix
1  6     1       a$b
2  2     4       aab ba$
3  4     2       aba abb a$b
4  5     3       abb a$b
5  3     1       ba$
6  1     3       bab aab ba$

i  SA3  Ψ’_T3  suffix
1  3    6      $ba
2  1    2      baa bba $ba
3  2    5      bba $ba
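The three cases of Ψ’ can be reproduced with a small brute-force sketch. It sorts the residue-1/2 suffixes and the residue-3 suffixes of T directly instead of going through T12 and T3, which gives the same order in this example. It also assumes n is a multiple of 3 and that the successor of T[n..n] wraps to T[1..n]. The function name `psi_prime` is hypothetical, and this is not the linear-time procedure from the talk.

```python
def psi_prime(text):
    """Return (Psi'_T12, Psi'_T3) as 1-based rank arrays, by brute force.

    Assumes len(text) is a multiple of 3 and that text ends with a unique
    smallest terminator; the successor of position n wraps to position 1.
    """
    n = len(text)
    assert n % 3 == 0

    def suffix(k):
        return text[k - 1:]

    # Text start positions of the suffixes, in sorted order.
    pos12 = sorted((k for k in range(1, n + 1) if k % 3 != 0), key=suffix)
    pos3 = sorted((k for k in range(1, n + 1) if k % 3 == 0), key=suffix)
    rank12 = {k: i for i, k in enumerate(pos12, start=1)}
    rank3 = {k: i for i, k in enumerate(pos3, start=1)}

    def next_rank(k):
        k1 = k % n + 1                      # start of the next suffix
        # Residue-3 successors live in SA3, all others in SA12.
        return rank3[k1] if k1 % 3 == 0 else rank12[k1]

    return [next_rank(k) for k in pos12], [next_rank(k) for k in pos3]


p12, p3 = psi_prime("babaabba$")
print(p12)  # [1, 4, 2, 3, 1, 3]
print(p3)   # [6, 2, 5]
```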

Framework: outline
How to construct the Ψ function of T
- Bottom-up approach over steps 0, 1, …, h, where h = log_3 log_|Σ| n
- Base case: use any linear-time construction algorithm
[Figure: recursion chain T (step 0, Ψ_T), T12 (step 1, Ψ_T12), …, down to step h; columns: step i, length, alphabet]
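The choice h = log_3 log_|Σ| n can be sanity-checked against the space bound quoted earlier. The derivation below is only a plausibility check, not the author's proof. It assumes (the transcript does not spell this out) that the string at step i has length (2/3)^i n and that each of its characters is a 3^i-tuple over Σ, and it writes α = log_3 2.

```latex
% Assumption (not stated in the transcript): step i works on a string of
% length (2/3)^i n whose characters are 3^i-tuples over \Sigma.
\begin{align*}
h = \log_3 \log_{|\Sigma|} n
  &\;\Longrightarrow\; 3^h = \log_{|\Sigma|} n, \qquad
  2^h = (\log_{|\Sigma|} n)^{\log_3 2} = (\log_{|\Sigma|} n)^{\alpha},\\
\text{bits per character at step } h
  &= 3^h \log|\Sigma| = \log_{|\Sigma|} n \cdot \log|\Sigma| = \log n,\\
\text{length at step } h
  &= \left(\tfrac{2}{3}\right)^{h} n = n\,(\log_{|\Sigma|} n)^{\alpha-1},\\
\text{length} \times \text{bits per character}
  &= n\,(\log_{|\Sigma|} n)^{\alpha-1} \cdot \log|\Sigma| \cdot \log_{|\Sigma|} n
   = n \log|\Sigma| \cdot (\log_{|\Sigma|} n)^{\alpha}.
\end{align*}
```

Under these assumptions the string at the deepest step occupies O(n log|Σ| · (log_|Σ| n)^α) bits, which matches the working-space bound stated in the contributions slide.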

Step i: outline
[Figure: S is split into S12 and S3. Ψ_S12 (from step i+1) yields Ψ’_S12 and Ψ’_S3; merging Ψ’_S12 and Ψ’_S3 yields Ψ_S.]

Merging step

i  SA12  Ψ’_T12  suffix
1  6     1       a$b
2  2     4       aab ba$
3  4     2       aba abb a$b
4  5     3       abb a$b
5  3     1       ba$
6  1     3       bab aab ba$

i  SA3  Ψ’_T3  suffix
1  3    6      $ba
2  1    2      baa bba $ba
3  2    5      bba $ba

Merged result:

i  SA  Ψ_T  suffix
1  9   8    $
2  8   1    a$
3  4   5    aabba$
4  2   7    abaabba$
5  5   9    abba$
6  7   2    ba$
7  3   3    baabba$
8  1   4    babaabba$
9  6   6    bba$

* Compare the entries of SA12 with the entries of SA3 in order.
  - Two suffixes are compared by following the Ψ’ function at most twice.
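The comparison rule can be sketched as follows. This is an illustrative reconstruction of the standard skew-style merge, not the talk's space-efficient implementation. It keeps explicit position arrays and first characters for clarity, builds the two sorted lists by brute force, and assumes n is a multiple of 3 with a unique smallest terminator "$". All names (`merge_with_psi_prime`, `less12`, and so on) are hypothetical.

```python
def merge_with_psi_prime(text):
    """Merge the sorted residue-1/2 and residue-3 suffixes using Psi' lookups.

    Returns the 1-based suffix array of text. Assumes len(text) % 3 == 0
    and that text ends with a unique smallest terminator "$".
    """
    n = len(text)
    assert n % 3 == 0 and text.endswith("$")

    def suffix(k):
        return text[k - 1:]

    def succ(k):
        return k % n + 1                     # next start position, wrapping at n

    # Brute-force stand-ins for SA12 and SA3 (text start positions, sorted).
    pos12 = sorted((k for k in range(1, n + 1) if k % 3 != 0), key=suffix)
    pos3 = sorted((k for k in range(1, n + 1) if k % 3 == 0), key=suffix)
    rank12 = {k: i for i, k in enumerate(pos12)}   # 0-based ranks
    rank3 = {k: i for i, k in enumerate(pos3)}
    psi12 = [rank3[succ(k)] if succ(k) % 3 == 0 else rank12[succ(k)] for k in pos12]
    psi3 = [rank12[succ(k)] for k in pos3]
    ch12 = [text[k - 1] for k in pos12]            # first characters
    ch3 = [text[k - 1] for k in pos3]

    def less12(i, j):
        """True if the suffix at pos12[i] is smaller than the suffix at pos3[j]."""
        if ch12[i] != ch3[j]:
            return ch12[i] < ch3[j]
        i1, j1 = psi12[i], psi3[j]           # one Psi' step; j1 is a rank in SA12
        if pos12[i] % 3 == 1:                # residue-1: i1 is also a rank in SA12
            return i1 < j1
        # residue-2: i1 is a rank in SA3, so compare one more character
        if ch3[i1] != ch12[j1]:
            return ch3[i1] < ch12[j1]
        return psi3[i1] < psi12[j1]          # second Psi' step; both ranks in SA12

    merged, i, j = [], 0, 0
    while i < len(pos12) and j < len(pos3):
        if less12(i, j):
            merged.append(pos12[i])
            i += 1
        else:
            merged.append(pos3[j])
            j += 1
    return merged + pos12[i:] + pos3[j:]


print(merge_with_psi_prime("babaabba$"))  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

A residue-1 suffix needs one Ψ’ step because its successor and the successor of a residue-3 suffix both live in SA12. A residue-2 suffix needs two steps because its first successor lives in SA3.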

Conclusions & future works
- We presented an alphabet-independent linear-time algorithm to construct SA, CSA, and FM-index using o(n log n)-bit working space.
Future works
- To construct SA, CSA, and FM-index optimally, i.e., using O(n) time and O(n log|Σ|)-bit working space