Space Efficient Linear Time Construction of Suffix Arrays

Slides:



Advertisements
Similar presentations
Copyright © 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch. 1 Chapter 2.
Advertisements

Longest Common Subsequence
1 Suffix Arrays: A new method for on-line string searches Udi Manber Gene Myers May 1989 Presented by: Oren Weimann.
Linear-time construction of CSA using o(n log n)-bit working space for large alphabets Joong Chae Na School of Computer Sci. & Eng. Seoul National University,
Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir.
Theory of Algorithms: Divide and Conquer
CS223 Advanced Data Structures and Algorithms 1 Divide and Conquer Neil Tang 4/15/2010.
Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.
8/29/06CS 6463: AT Computational Geometry1 CS 6463: AT Computational Geometry Spring 2006 Convex Hulls Carola Wenk.
A New Compressed Suffix Tree Supporting Fast Search and its Construction Algorithm Using Optimal Working Space Dong Kyue Kim 1 andHeejin Park 2 1 School.
Two implementation issues Alphabet size Generalizing to multiple strings.
1 Suffix tree and suffix array techniques for pattern analysis in strings Esko Ukkonen Univ Helsinki Erice School 30 Oct 2005 Modified Alon Itai 2006.
Compressed Compact Suffix Arrays Veli Mäkinen University of Helsinki Gonzalo Navarro University of Chile compact compress.
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Hierarchy-conscious Data Structures for String Analysis Carlo Fantozzi PhD Student (XVI ciclo) Bioinformatics Course - June 25, 2002.
Fast and Practical Algorithms for Computing Runs Gang Chen – McMaster, Ontario, CAN Simon J. Puglisi – RMIT, Melbourne, AUS Bill Smyth – McMaster, Ontario,
1 A simple construction of two- dimensional suffix trees in linear time * Division of Electronics and Computer Engineering Hanyang University, Korea Dong.
Divide-and-Conquer1 7 2  9 4   2  2 79  4   72  29  94  4.
Divide-and-Conquer1 7 2  9 4   2  2 79  4   72  29  94  4.
Merge sort, Insertion sort
Obtaining Provably Good Performance from Suffix Trees in Secondary Storage Pang Ko & Srinivas Aluru Department of Electrical and Computer Engineering Iowa.
Merge sort, Insertion sort. Sorting I / Slide 2 Sorting * Selection sort or bubble sort 1. Find the minimum value in the list 2. Swap it with the value.
Construction of Aho Corasick automaton in Linear time for Integer Alphabets Shiri Dori & Gad M. Landau University of Haifa.
Design and Analysis of Algorithms - Chapter 41 Divide and Conquer The most well known algorithm design strategy: 1. Divide instance of problem into two.
Divide and Conquer The most well known algorithm design strategy: 1. Divide instance of problem into two or more smaller instances 2. Solve smaller instances.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Design and Analysis of Algorithms - Chapter 41 Divide and Conquer The most well known algorithm design strategy: 1. Divide instance of problem into two.
Insertion sort, Merge sort COMP171 Fall Sorting I / Slide 2 Insertion sort 1) Initially p = 1 2) Let the first p elements be sorted. 3) Insert the.
Searching – Linear and Binary Searches. Comparing Algorithms Should we use Program 1 or Program 2? Is Program 1 “fast”? “Fast enough”? P1P2.
Divide-and-Conquer 7 2  9 4   2   4   7
Nyhoff, ADTs, Data Structures and Problem Solving with C++, Second Edition, © 2005 Pearson Education, Inc. All rights reserved ADT Implementation:
File Organization and Processing Week 13 Divide and Conquer.
Algorithms  Al-Khwarizmi, arab mathematician, 8 th century  Wrote a book: al-kitab… from which the word Algebra comes  Oldest algorithm: Euclidian algorithm.
Growth of Functions. Asymptotic Analysis for Algorithms T(n) = the maximum number of steps taken by an algorithm for any input of size n (worst-case runtime)
Parallel Suffix Array Construction by Accelerated Sampling Matthew Felice Pace University of Warwick Joint work with Alexander Tiskin University of Warwick.
Theory of Algorithms: Brute Force. Outline Examples Brute-Force String Matching Closest-Pair Convex-Hull Exhaustive Search brute-force strengths and weaknesses.
A fast algorithm for the generalized k- keyword proximity problem given keyword offsets Sung-Ryul Kim, Inbok Lee, Kunsoo Park Information Processing Letters,
Merge sort, Insertion sort. Sorting I / Slide 2 Sorting * Selection sort (iterative, recursive?) * Bubble sort.
CS 361 – Chapters 8-9 Sorting algorithms –Selection, insertion, bubble, “swap” –Merge, quick, stooge –Counting, bucket, radix How to select the n-th largest/smallest.
Divide and Conquer Applications Sanghyun Park Fall 2002 CSE, POSTECH.
Cache-efficient string sorting for Burrows-Wheeler Transform Advait D. Karande Sriram Saroop.
Space-time Tradeoffs for Longest-Common-Prefix Array Construction Simon J. Puglisi and Andrew Turpin
Quicksort CSE 2320 – Algorithms and Data Structures Vassilis Athitsos University of Texas at Arlington 1.
Everything is String. Closed Factorization Golnaz Badkobeh 1, Hideo Bannai 2, Keisuke Goto 2, Tomohiro I 2, Costas S. Iliopoulos 3, Shunsuke Inenaga 2,
1. Searching The basic characteristics of any searching algorithm is that searching should be efficient, it should have less number of computations involved.
1 Recursive algorithms Recursive solution: solve a smaller version of the problem and combine the smaller solutions. Example: to find the largest element.
Lecture # 6 1 Advance Analysis of Algorithms. Divide-and-Conquer Divide the problem into a number of subproblems Similar sub-problems of smaller size.
ETRI Linear-Time Search in Suffix Arrays July 14, 2003 Jeong Seop Sim, Dong Kyue Kim Heejin Park, Kunsoo Park.
IT 60101: Lecture #181 Foundation of Computing Systems Lecture 18 Sorting Algorithms III.
Randomized Quicksort (8.4.2/7.4.2) Randomized Quicksort –i = Random(p, r) –swap A[p]  A[i] –partition A(p, r) Average analysis = Expected runtime –solving.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Linear Time Suffix Array Construction Using D-Critical Substrings
Advanced Sorting 7 2  9 4   2   4   7
COMP9319 Web Data Compression and Search
COMP9319 Web Data Compression and Search
Divide-and-Conquer 6/30/2018 9:16 AM
String Data Structures and Algorithms: Suffix Trees and Suffix Arrays
Divide-and-Conquer 7 2  9 4   2   4   7
String Data Structures and Algorithms
String Data Structures and Algorithms
Recursion.
Suffix Arrays and Suffix Trees
Divide-and-Conquer 7 2  9 4   2   4   7
Divide and Conquer Neil Tang 4/24/2008
Divide-and-Conquer 7 2  9 4   2   4   7
Algorithm Course Algorithms Lecture 3 Sorting Algorithm-1
Presentation transcript:

Space Efficient Linear Time Construction of Suffix Arrays Pang Ko and Srinivas Aluru Dept. Electrical and Computer Engineering Laurence H. Baker Center for Bioinformatics and Biological Statistics Iowa State University Ames, Iowa 50011, USA {kopang, aluru}@iastate.edu

Suffix Array Sorted order of suffixes of a string T. Represented by the starting position of the suffix. Text M I S S I S S I P P I $ Index 1 2 3 4 5 6 7 8 9 10 11 12 Suffix Array 12 11 8 5 2 1 10 9 7 4 6 3

Brief History Introduced by Manber and Myers in 1989. Takes O(n log n) time, and 8n bytes. Many other non-linear time algorithms. Authors Time Space (bytes) Manber & Myers n log n 8n Sadakane 9n String-sorting n2 log n 5n Radix-sorting n2

Our Result Among the first linear-time direct suffix array construction algorithms. Solves an important open problem. For constant size alphabet, only uses 8n bytes. Easily implementable. Can also be used as a space efficient suffix tree construction algorithm.

Notation String T = t1…tn. Over the alphabet Σ = {1…n}. tn = ‘$’, ‘$’ is a unique character. Ti = ti…tn, denotes the i-th suffix of T. For strings  and ,  <  denotes  is lexicographically smaller than .

Overview Divide all suffixes of T into two types. Type S suffixes = {Ti | Ti < Ti+1} Type L suffixes = {Tj | Tj > Tj+1} The last suffix is both type S and L. Sort all suffixes of one of the types. Obtain lexicographical order of all suffixes from the sorted ones.

Identify Suffix Types Type L S L L S L L S L L L L/S Text M I S S I S $ The type of each suffix in T can be determined in one scan of the string. M > I T1 > T2 T1 is type L I < S T2 < T3 T2 is type S S = S, so check next character S > I T3 > T4 > T5 T3 and T4 are type L

Type S positions/ characters Notation Type S S S S Text M I S S I S S I P P I $ Type S suffixes Type S positions/ characters Type S substrings

Sorting Type S Suffixes Sort all type S substrings. Replace each type S substrings by its bucket number. New string is the sequence of bucket numbers. Sorting all type S suffixes = Sorting all suffixes of the new string.

Sorting Type S Substrings Substitute the substrings with the bucket numbers to obtain a new string. Apply sorting recursively to the new string. Text 3 M 3 I 2 S S 1 I S S I P P I $ Bucket Sort $ I I I According to the 1st character P S S According to the 2nd character Sort each substring until the next type S character Bucket Sort takes potentially O(n2) time P S S I I I Break into 2 buckets $ Break into 2 more buckets

Solution Observation: Each character participates in the bucket sort at most twice. Type L characters only participate in the bucket sort once. Solution: Sort all the characters once. Construct m lists according the distance to the closest type S character to the left

Illustration Type S S S S Index Text M I S S I S S I P P I $ Distance 1 2 3 4 5 6 7 8 9 10 11 12 Text M I S S I S S I P P I $ Distance 1 2 3 1 2 3 1 2 3 4 Sorted Order of characters 12 2 5 8 11 1 3 4 6 7 9 10 Sort the type S substrings using the lists The Lists

Construct Suffix Array for all Suffixes The first suffix in the suffix array is a type S suffix. For 1 ≤ i ≤ n, if TSA[i]-1 is type L, move it to the current front of its bucket $ I M P S 12 11 8 5 2 1 10 9 7 4 6 3 Sorted order of type S suffixes

Run-Time Analysis Identify types of suffixes -- O(n) time. Bucket sort type S (or L) substrings -- O(n) time. Construct suffix array from sorted type S (or L) suffixes -- O(n) time. T(n) = T(n/2) + O(n) = O(n)

Space Requirement For each element in an array use 2 integers to mark the beginning and the end of their bucket. Use Boolean array to mark the boundary of each bucket. This takes 12n bytes for |Σ|~n. At most 3n bits is needed for the Boolean arrays.

Constant size Alphabet No need to create the m lists. Because bucket sort takes O(n) time. Reducing the space needed to 4n bytes for the 1st recursion step. Maximum space used at anytime is 8n bytes (and maximum of 3 Boolean arrays).

Suffix Trees Most suffix tree construction algorithms take O(n|Σ|) time. (Ukkonen, Weiner, McCreight) Farach’s algorithm takes O(n) time for large integer alphabet. Hard to implement. Uses a lot of space. Our algorithm can be used to build the suffix array then convert it into the suffix tree. (By using linear time LCP computation algorithm by Kasai et. al.)

Conclusion Among the first suffix array construction algorithm takes O(n) time. The algorithm can be easily implemented in 8n bytes (plus a few Boolean arrays). Equal or less space than most non-linear time algorithm. Can be used as a space efficient suffix tree construction algorithm.