Space Efficient Linear Time Construction of Suffix Arrays

1 Space Efficient Linear Time Construction of Suffix Arrays
Pang Ko and Srinivas Aluru Dept. Electrical and Computer Engineering Laurence H. Baker Center for Bioinformatics and Biological Statistics Iowa State University Ames, Iowa 50011, USA {kopang,

2 Suffix Array
The sorted order of the suffixes of a string T, with each suffix represented by its starting position.

Index          1  2  3  4  5  6  7  8  9 10 11 12
Text           M  I  S  S  I  S  S  I  P  P  I  $
Suffix Array  12 11  8  5  2  1 10  9  7  4  6  3
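
As a check on the example, the suffix array can be reproduced by sorting the suffixes directly. This is a minimal sketch of the definition only, not the linear-time algorithm presented in the talk. The function name `naive_suffix_array` is made up for illustration, and the sketch relies on '$' comparing lower than the letters, as it does in ASCII.

```python
def naive_suffix_array(text):
    """Return the 1-based starting positions of the suffixes in sorted order.

    Sorts the suffixes by direct comparison, which costs far more than O(n);
    it only illustrates what a suffix array contains.
    """
    positions = range(1, len(text) + 1)
    return sorted(positions, key=lambda i: text[i - 1:])


print(naive_suffix_array("MISSISSIPPI$"))
# [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```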

3 Brief History
Suffix arrays were introduced by Manber and Myers in 1989. Their construction algorithm takes O(n log n) time and 8n bytes. Many other non-linear time algorithms exist.

Authors          Time        Space (bytes)
Manber & Myers   n log n     8n
Sadakane                     9n
String-sorting   n^2 log n   5n
Radix-sorting    n^2

4 Our Result
Among the first linear-time direct suffix array construction algorithms. Solves an important open problem. For a constant-size alphabet, uses only 8n bytes. Easily implementable. Can also be used as a space-efficient suffix tree construction algorithm.

5 Notation
String T = t_1 … t_n over the alphabet Σ = {1 … n}. t_n = '$', where '$' is a unique character. T_i = t_i … t_n denotes the i-th suffix of T. For strings α and β, α < β denotes that α is lexicographically smaller than β.

6 Overview
Divide all suffixes of T into two types:
Type S suffixes = {T_i | T_i < T_{i+1}}
Type L suffixes = {T_j | T_j > T_{j+1}}
The last suffix is both type S and type L. Sort all suffixes of one of the types, then obtain the lexicographic order of all suffixes from the sorted ones.

7 Identify Suffix Types
Type    L   S   L   L   S   L   L   S   L   L   L L/S
Text    M   I   S   S   I   S   S   I   P   P   I   $

The type of each suffix in T can be determined in one scan of the string:
M > I, so T_1 > T_2 and T_1 is type L.
I < S, so T_2 < T_3 and T_2 is type S.
S = S, so check the next character: S > I, so T_3 > T_4 > T_5, and T_3 and T_4 are both type L.
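
The one-scan classification can be sketched as follows. The slide walks the string left to right and looks ahead on equal characters; this sketch scans right to left instead, which produces the same labels without lookahead. The function name `classify_suffixes` and the choice to record the last suffix as 'S' (the slide marks it as both types) are assumptions made for illustration.

```python
def classify_suffixes(text):
    """Label each suffix 'S' or 'L' with a single right-to-left scan.

    The last suffix (the sentinel) is both types on the slide; this sketch
    records it as 'S', which is an assumption made for convenience.
    """
    n = len(text)
    types = ["S"] * n
    for i in range(n - 2, -1, -1):
        if text[i] > text[i + 1]:
            types[i] = "L"
        elif text[i] == text[i + 1]:
            # Equal characters: the suffix has the same type as the next one.
            types[i] = types[i + 1]
        # Otherwise text[i] < text[i + 1] and the suffix stays type S.
    return "".join(types)


print(classify_suffixes("MISSISSIPPI$"))  # LSLLSLLSLLLS
```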

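8 Notation
Type     S        S        S           S
Text  M  I  S  S  I  S  S  I  P  P  I  $

Type S suffixes: the suffixes marked S above.
Type S positions/characters: the starting positions of the type S suffixes (here 2, 5, 8, and 12) and the characters at those positions.
Type S substrings: the substrings running from one type S position through the next.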

9 Sorting Type S Suffixes
Sort all type S substrings. Replace each type S substring by its bucket number. The new string is the sequence of bucket numbers. Sorting all type S suffixes is equivalent to sorting all suffixes of the new string.

10 Sorting Type S Substrings
Bucket sort the type S substrings one character at a time, sorting each substring until the next type S character:
According to the 1st character: $ | I I I (break into 2 buckets)
According to the 2nd character: the I bucket splits into P | S S (break into 2 more buckets)
Bucket sort done this way potentially takes O(n^2) time.

Substitute the substrings with their bucket numbers to obtain a new string, then apply the sorting recursively to the new string.

Text          M  I  S  S  I  S  S  I  P  P  I  $
New string       3        3        2           1
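
The substitution step can be sketched as below. Two things here are assumptions rather than the presenters' method: Python's built-in sort stands in for the bucket sort, and when one substring is a prefix of another the sketch breaks the tie by comparing (character, type) pairs, a rule the slides do not spell out. The names `classify_suffixes` and `reduce_by_type_s_substrings` are hypothetical.

```python
def classify_suffixes(text):
    """Label each suffix 'S' or 'L'; the final sentinel is recorded as 'S'."""
    types = ["S"] * len(text)
    for i in range(len(text) - 2, -1, -1):
        if text[i] > text[i + 1] or (text[i] == text[i + 1] and types[i + 1] == "L"):
            types[i] = "L"
    return types


def reduce_by_type_s_substrings(text):
    """Replace every type S substring by its bucket number (1-based)."""
    types = classify_suffixes(text)
    s_positions = [i for i, t in enumerate(types) if t == "S"]  # 0-based

    def substring_key(k):
        # A type S substring runs from one type S position through the next;
        # the sentinel at the end forms a substring of its own.
        start = s_positions[k]
        end = s_positions[k + 1] if k + 1 < len(s_positions) else start
        # Assumption: pair each character with its type so that, on equal
        # characters, a type L position sorts before a type S position.
        return tuple((text[i], types[i] == "S") for i in range(start, end + 1))

    keys = [substring_key(k) for k in range(len(s_positions))]
    # Built-in sort used here in place of the bucket sort from the slides.
    bucket = {key: rank for rank, key in enumerate(sorted(set(keys)), start=1)}
    return [bucket[key] for key in keys]


print(reduce_by_type_s_substrings("MISSISSIPPI$"))  # [3, 3, 2, 1]
```

Equal substrings share a bucket number, which is why the recursive call on the new string is still needed to order their suffixes.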

11 Solution
Observation: Each character participates in the bucket sort at most twice. Type L characters participate in the bucket sort only once.
Solution: Sort all the characters once. Construct m lists according to the distance to the closest type S character to the left.

12 Illustration
Index       1  2  3  4  5  6  7  8  9 10 11 12
Type           S        S        S           S
Text        M  I  S  S  I  S  S  I  P  P  I  $
Distance          1  2  3  1  2  3  1  2  3  4

Sorted order of characters (by position): $: 12 | I: 2 5 8 11 | M: 1 | P: 9 10 | S: 3 4 6 7
The lists are built from the distances and this sorted order. Sort the type S substrings using the lists.
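
A minimal sketch of how the distances and the lists in this illustration could be computed. The function name `s_distance_lists`, the use of distance 0 for positions with no type S character to their left, and the stable built-in sort standing in for a bucket sort of the characters are all assumptions made for illustration.

```python
def s_distance_lists(text, types):
    """Group positions by distance to the closest type S position on the left.

    `types` holds one 'S'/'L' label per position. Positions are reported
    1-based to match the slide; distance 0 means there is no type S
    position to the left.
    """
    n = len(text)
    distance = [0] * n
    last_s = None  # index of the most recent type S position seen so far
    for i in range(n):
        if last_s is not None:
            distance[i] = i - last_s
        if types[i] == "S":
            last_s = i

    # Sort all characters once (a stable sort stands in for bucket sort).
    order = sorted(range(n), key=lambda i: text[i])

    # Scan in sorted order so every list comes out sorted by character.
    lists = {}
    for i in order:
        if distance[i] > 0:
            lists.setdefault(distance[i], []).append(i + 1)
    return distance, lists


distance, lists = s_distance_lists("MISSISSIPPI$", "LSLLSLLSLLLS")
print(distance)  # [0, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4]
for d in sorted(lists):
    print(d, lists[d])
# 1 [9, 3, 6]
# 2 [10, 4, 7]
# 3 [5, 8, 11]
# 4 [12]
```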

13 Construct Suffix Array for all Suffixes
Start from the sorted order of the type S suffixes, here 12, 8, 5, 2. The first suffix in the suffix array is a type S suffix. For 1 ≤ i ≤ n, if T_{SA[i]-1} is type L, move it to the current front of its bucket.

Bucket:        $  | I         | M | P    | S
Suffix array:  12 | 11 8 5 2  | 1 | 10 9 | 7 4 6 3
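
The scan described on this slide can be sketched as follows. The sketch assumes that the sorted type S suffixes are first placed at the end of their buckets, which is not stated on the slide itself; the function name `induce_from_sorted_s` and its parameters are hypothetical.

```python
def induce_from_sorted_s(text, types, sorted_s):
    """Build the full suffix array from the sorted type S suffixes.

    `types` holds an 'S'/'L' label per position and `sorted_s` the 1-based
    starting positions of the type S suffixes in sorted order.
    """
    n = len(text)
    # Bucket boundaries: one bucket per distinct first character.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    front, back, total = {}, {}, 0
    for ch in sorted(counts):
        front[ch] = total
        total += counts[ch]
        back[ch] = total - 1

    sa = [None] * n
    # Assumption: type S suffixes occupy the end of their buckets.
    for pos in reversed(sorted_s):
        ch = text[pos - 1]
        sa[back[ch]] = pos
        back[ch] -= 1

    # Left-to-right scan: move each type L predecessor to its bucket front.
    for i in range(n):
        pos = sa[i]
        if pos is not None and pos > 1 and types[pos - 2] == "L":
            ch = text[pos - 2]
            sa[front[ch]] = pos - 1
            front[ch] += 1
    return sa


print(induce_from_sorted_s("MISSISSIPPI$", "LSLLSLLSLLLS", [12, 8, 5, 2]))
# [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```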

14 Run-Time Analysis
Identify types of suffixes: O(n) time.
Bucket sort type S (or L) substrings: O(n) time.
Construct suffix array from sorted type S (or L) suffixes: O(n) time.
Sorting whichever type has fewer suffixes leaves a recursive problem of size at most n/2, so
T(n) = T(n/2) + O(n) = O(n)
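
The recurrence can be unrolled as follows, writing the O(n) term as at most cn for some constant c and treating T(1) as a constant. Both c and the base case are introduced here only for the derivation.

```latex
\begin{align*}
T(n) &\le T(n/2) + cn \\
     &\le T(n/4) + c\,\frac{n}{2} + cn \\
     &\le \dots \\
     &\le T(1) + cn \sum_{k=0}^{\infty} 2^{-k}
      = T(1) + 2cn = O(n).
\end{align*}
```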

15 Space Requirement
For each element in an array, use 2 integers to mark the beginning and the end of its bucket. Use a Boolean array to mark the boundary of each bucket. This takes 12n bytes for |Σ| ~ n. At most 3n bits are needed for the Boolean arrays.

16 Constant size Alphabet
There is no need to create the m lists, because bucket sort takes O(n) time. This reduces the space needed to 4n bytes for the 1st recursion step. The maximum space used at any time is 8n bytes (plus a maximum of 3 Boolean arrays).

17 Suffix Trees
Most suffix tree construction algorithms (Ukkonen, Weiner, McCreight) take O(n|Σ|) time. Farach's algorithm takes O(n) time for large integer alphabets, but it is hard to implement and uses a lot of space. Our algorithm can be used to build the suffix array and then convert it into the suffix tree, using the linear-time LCP computation algorithm by Kasai et al.
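
For reference, the linear-time LCP computation of Kasai et al. is commonly written along the following lines. This is a sketch under stated assumptions, not code from the talk: it uses 0-based positions, the name `kasai_lcp` is made up, and the suffix array in the usage lines is built by naive sorting only to keep the example self-contained.

```python
def kasai_lcp(text, sa):
    """Return lcp, where lcp[k] is the length of the longest common prefix
    of the suffixes at sa[k - 1] and sa[k] (lcp[0] is 0).

    `sa` is a 0-based suffix array of `text`.
    """
    n = len(text)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k

    lcp = [0] * n
    h = 0  # length of the match carried over from the previous suffix
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one match
    return lcp


text = "MISSISSIPPI$"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive, for the demo
print(kasai_lcp(text, sa))  # [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```

The suffix array together with this LCP array is the input that suffix-array-to-suffix-tree conversions typically work from.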

18 Conclusion
Among the first suffix array construction algorithms that take O(n) time. The algorithm can be easily implemented in 8n bytes (plus a few Boolean arrays). Uses equal or less space than most non-linear time algorithms. Can be used as a space-efficient suffix tree construction algorithm.

