Linear-time construction of CSA using o(n log n)-bit working space for large alphabets
Joong Chae Na, School of Computer Sci. & Eng., Seoul National University, Korea

1 Linear-time construction of CSA using o(n log n)-bit working space for large alphabets Joong Chae Na School of Computer Sci. & Eng. Seoul National University, Korea

2 Overview
Background
  - Suffix arrays (SA)
  - Compressed suffix arrays (CSA)
Problem definition
Previous works
Our contributions
Description of our algorithm
Conclusions

3 Background (1)
Given a string T of length n over an alphabet Σ:
Suffix array (SA) of T [Manber & Myers ’93]
  - Lexicographically sorted list of the suffixes of T
  - Space: O(n log n) bits

T: b a b a a b b a $

  i  SA  suffix of T
  1  9   $
  2  8   a$
  3  4   aabba$
  4  2   abaabba$
  5  5   abba$
  6  7   ba$
  7  3   baabba$
  8  1   babaabba$
  9  6   bba$
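The table above can be reproduced with a few lines of Python. This is only a naive illustration (it sorts whole suffixes, so it is neither linear-time nor space-efficient); the function name `suffix_array` is an assumption made for this sketch, and `$` is taken to be smaller than every letter, which holds for ASCII.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by naively sorting its suffixes."""
    # Illustration only: comparing whole suffixes costs far more than O(n).
    return sorted(range(1, len(text) + 1), key=lambda k: text[k - 1:])


T = "babaabba$"
SA = suffix_array(T)
for i, k in enumerate(SA, start=1):
    print(i, k, T[k - 1:])   # columns: i, SA[i], suffix T[SA[i]..n]
```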

4 Background (2)
Compressed suffix array (CSA) [Grossi & Vitter ’00]
  - Compressed version of SA
  - Space requirement: O(n log|Σ|) bits
FM-index [Ferragina & Manzini 2000]

T: b a b a a b b a $

  i  SA  Ψ_T  suffix of T
  1  9   8    $
  2  8   1    a$
  3  4   5    aabba$
  4  2   7    abaabba$
  5  5   9    abba$
  6  7   2    ba$
  7  3   3    baabba$
  8  1   4    babaabba$
  9  6   6    bba$

5 Problem definition
Constructing SA, CSA and FM-index using
  - o(n log n) time and
  - o(n log n)-bit working space
Working space
  - Temporary space required for executing an algorithm
  - Not including the space for the input and output

6 Related works
Constructing SA and CSA
※ O(n log n)-bit working space
  - Manber & Myers [1993]: O(n log n) time
  - Kim et al. [2003]: O(n) time
  - Kärkkäinen & Sanders [2003]: O(n) time
  - Ko & Aluru [2003]: O(n) time
※ O(n log|Σ|)-bit working space
  - Lam et al. [COCOON 2002]: O(|Σ| n log n) time
  - Hon et al. [ISAAC 2003]: O(n log n) time
None of these algorithms satisfies both the time and the space requirement of our problem.

7 Previous results
Hon et al. [FOCS 2003]
  - An algorithm using O(n log log |Σ|) time and O(n log|Σ|)-bit working space
  - The first algorithm using o(n log n) time and o(n log n)-bit working space
  - Follows ½-recursion (the odd-even scheme)

8 Our contributions
Another algorithm using o(n log n) time and o(n log n)-bit working space
  - O(n) time and O(n log|Σ| · (log_|Σ| n)^α)-bit working space
  - α = log_3 2 ≈ 0.63
  - The first alphabet-independent linear-time algorithm for constructing SA, CSA, and FM-index using o(n log n)-bit working space
  - Follows ⅔-recursion (the skew scheme)

9 Hon et al. vs. our results

               Hon et al.         Our results
  Time         O(n log log |Σ|)   O(n)
  Space (bit)  O(n log|Σ|)        O(n log|Σ| · (log_|Σ| n)^α)
  Scheme       ½-recursion        ⅔-recursion
  (merging)    complex            simple

  (encoding)*: implicit

* The encoding step is the most complex and time-consuming step in ⅔-recursion. However, neither algorithm needs the encoding step.

10 Description of our algorithm

11 Overview
Preliminaries
Basic definitions and notations
Main technique
Outline of our algorithm

12 Preliminaries - Ψ function
Let T[k..n] be the lexicographically i-th smallest suffix of T, i.e., SA[i] = k.
Then Ψ_T[i] is the position in SA where T[k+1..n] is stored.

T: b a b a a b b a $

  i  SA  Ψ_T  suffix of T
  1  9   8    $
  2  8   1    a$
  3  4   5    aabba$
  4  2   7    abaabba$
  5  5   9    abba$
  6  7   2    ba$
  7  3   3    baabba$
  8  1   4    babaabba$
  9  6   6    bba$
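A minimal sketch of the definition above, computing Ψ from an already built suffix array. The helper names are assumptions for illustration, and the suffix array is obtained by naive sorting. For the last suffix T[n..n], which has no successor, the sketch follows the table and lets Ψ point to the position of T[1..n].

```python
def suffix_array(text):
    """Naive 1-based suffix array (illustration only)."""
    return sorted(range(1, len(text) + 1), key=lambda k: text[k - 1:])


def psi_from_sa(sa):
    """Psi[i] = position in SA of the suffix that starts one character later."""
    n = len(sa)
    rank = [0] * (n + 1)                 # rank[k] = i such that SA[i] = k
    for i, k in enumerate(sa, start=1):
        rank[k] = i
    # T[n..n] has no successor; as in the table, it wraps around to T[1..n].
    return [rank[k + 1] if k < n else rank[1] for k in sa]


T = "babaabba$"
SA = suffix_array(T)
PSI = psi_from_sa(SA)
print(PSI)
```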

13 Preliminaries - Lemmas
Lemmas from Hon et al. [FOCS 2003]:
Text, Ψ → SA, CSA
  - O(n) time, O(n log|Σ|)-bit working space
Text, Ψ → C array (BWT) → FM-index
  - O(n) time, O(n log|Σ|)-bit working space
Note: the goal is therefore Text → Ψ
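To illustrate why Ψ is enough to recover SA, here is a hedged sketch that walks through the text order by following Ψ. It is not the space-efficient procedure of the lemma (it writes a plain suffix array of n integers); it assumes the text ends with a unique smallest character, so that SA[1] = n and Ψ[1] is the position of T[1..n]. The name `sa_from_psi` is hypothetical.

```python
def sa_from_psi(psi):
    """Recover the 1-based suffix array from a 1-based Psi stored in a list."""
    n = len(psi)
    sa = [0] * n
    # Assumption: SA[1] = n, so Psi[1] is the position of the whole text T[1..n].
    pos = psi[0]
    for k in range(1, n + 1):
        sa[pos - 1] = k          # the suffix T[k..n] sits at position pos
        pos = psi[pos - 1]       # move on to the position of T[k+1..n]
    return sa


PSI = [8, 1, 5, 7, 9, 2, 3, 4, 6]   # Psi of T = babaabba$ from the table
RECOVERED = sa_from_psi(PSI)
print(RECOVERED)
```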

14 Basic def. and not. (1)
Residue-1 suffixes of T
  - T[3i-2..n] for 1 ≤ i ≤ n/3
  - T[1..n], T[4..n], T[7..n], …
Residue-2 suffixes of T
  - T[3i-1..n] for 1 ≤ i ≤ n/3
  - T[2..n], T[5..n], T[8..n], …
Residue-3 suffixes of T
  - T[3i..n] for 1 ≤ i ≤ n/3
  - T[3..n], T[6..n], T[9..n], …

T[1..n] = babaabba$
  Residue-1: babaabba$, aabba$, ba$
  Residue-2: abaabba$, abba$, a$
  Residue-3: baabba$, bba$, $

15 Basic def. and not. (2)
T12[1..⅔n] = T[1..n] T[2..n] T[1]
  - length: ⅔n
  - alphabet: Σ^3
  - SA12: suffix array of T12
T3[1..⅓n] = T[3..n] T[1] T[2]
  - length: ⅓n
  - alphabet: Σ^3
  - SA3: suffix array of T3

T   = babaabba$                  (alphabet Σ)
T12 = bab aab ba$ aba abb a$b    (each 3-character block is one symbol)
T3  = baa bba $ba
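The two derived strings can be built directly from their definitions. The sketch below assumes that n is a multiple of 3, as in the running example (padding for other lengths is not covered by the slides and is left out); the function names are made up for illustration.

```python
def triples(s):
    """Cut s into consecutive 3-character blocks; each block is one symbol of Σ^3."""
    return [s[j:j + 3] for j in range(0, len(s), 3)]


def build_t12_t3(text):
    # Assumption: len(text) is a multiple of 3, so no padding is needed.
    assert len(text) % 3 == 0
    t12 = triples(text + text[1:] + text[:1])   # T[1..n] T[2..n] T[1]
    t3 = triples(text[2:] + text[:2])           # T[3..n] T[1] T[2]
    return t12, t3


T12, T3 = build_t12_t3("babaabba$")
print(T12)
print(T3)
```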

16 Main technique - Ψ’ function
Ψ’ is just like Ψ, but Ψ’ is defined on SA12 and SA3.
Ψ’ points to the position in SA12 or SA3 where T[k+1..n] (the suffix following the current suffix T[k..n]) is stored.
※ Note that Ψ’ is not the Ψ function of T12 and T3.
The Ψ’ function consists of Ψ’_T12 and Ψ’_T3.

17 Ψ’ function (residue-1)
Ψ’_T12 (residue-1 suffixes of T)
  - Let T[3k-2..n] be a suffix stored in SA12[i].
  - Then Ψ’_T12[i] is the position in SA12 where the next suffix T[3k-1..n] is stored.

18 Ψ’ function (example)
T   = babaabba$
T12 = bab aab ba$ aba abb a$b
T3  = baa bba $ba

  i  SA12  Ψ’_T12  suffix
  1  6     1       a$b
  2  2     4       aab ba$
  3  4     2       aba abb a$b
  4  5     3       abb a$b
  5  3     1       ba$
  6  1     3       bab aab ba$

  i  SA3  Ψ’_T3  suffix
  1  3    6      $ba
  2  1    2      baa bba $ba
  3  2    5      bba $ba
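The Ψ’ tables for the running example can be checked with a short sketch. Naive sorting of the suffixes of T stands in here for the recursive construction, n is assumed to be a multiple of 3, and the successor of T[n..n] is taken to wrap around to T[1..n], as the tables suggest. All names are hypothetical.

```python
def psi_prime(text):
    """Return (SA12, Psi'_T12, SA3, Psi'_T3) as 1-based values in Python lists."""
    n = len(text)
    assert n % 3 == 0                    # assumption: no padding needed
    m = n // 3

    def suffix(p):
        return text[p - 1:]              # T[p..n] for a 1-based start p

    # Start positions in T, listed in T12 / T3 index order.
    starts12 = [3 * j - 2 for j in range(1, m + 1)] + [3 * j - 1 for j in range(1, m + 1)]
    starts3 = [3 * j for j in range(1, m + 1)]
    order12 = sorted(starts12, key=suffix)   # residue-1 and residue-2 suffixes, sorted
    order3 = sorted(starts3, key=suffix)     # residue-3 suffixes, sorted
    sa12 = [starts12.index(p) + 1 for p in order12]
    sa3 = [starts3.index(p) + 1 for p in order3]
    pos12 = {p: i for i, p in enumerate(order12, start=1)}
    pos3 = {p: i for i, p in enumerate(order3, start=1)}
    pos12[n + 1] = pos12[1]              # successor of T[n..n] wraps to T[1..n]
    # Residue-1 points into SA12, residue-2 into SA3, residue-3 into SA12.
    psi12 = [pos12[p + 1] if p % 3 == 1 else pos3[p + 1] for p in order12]
    psi3 = [pos12[p + 1] for p in order3]
    return sa12, psi12, sa3, psi3


SA12, PSI12, SA3, PSI3 = psi_prime("babaabba$")
print(SA12, PSI12)
print(SA3, PSI3)
```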

19 Ψ’ function (residue-2)
Ψ’_T12 (residue-2 suffixes of T)
  - Let T[3k-1..n] be a suffix stored in SA12[i].
  - Then Ψ’_T12[i] is the position in SA3 where the next suffix T[3k..n] is stored.

21 Ψ’ function (residue-3)
Ψ’_T3 (residue-3 suffixes of T)
  - Let T[3k..n] be a suffix stored in SA3[i].
  - Then Ψ’_T3[i] is the position in SA12 where the next suffix T[3k+1..n] is stored.

23 Framework - outline
How to construct the Ψ function of T
  - Bottom-up approach
  - step 0: T → Ψ_T
  - step 1: T12 → Ψ_T12
  - …
  - step h: use any linear-time construction algorithm, where h = log_3 log_|Σ| n
[Figure: recursion levels step 0 … step h, annotated with the string length and alphabet at step i]
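One way to see where the space bound on slide 8 may come from is to look at the string handled at step h. The derivation below is a sketch under assumptions not stated on the slides: each step shrinks the length by a factor of ⅔ and cubes the alphabet (as for T12 on slide 15), and the bottom level stores its string, or a plain suffix array of it, explicitly. Here x = log_|Σ| n, α = log_3 2, and n_h is the assumed length at step h.

```latex
\begin{align*}
3^h &= \log_{|\Sigma|} n = x, \qquad 2^h = x^{\log_3 2} = x^{\alpha},\\
n_h &= \left(\tfrac{2}{3}\right)^{h} n = n\,x^{\alpha-1},\\
\text{bits per symbol at step } h &= 3^h \log|\Sigma| = x\,\log|\Sigma| = \log n,\\
n_h \log n &= n\,x^{\alpha-1}\cdot x\,\log|\Sigma|
            = n\,\log|\Sigma|\cdot\left(\log_{|\Sigma|} n\right)^{\alpha}.
\end{align*}
```

Under these assumptions the bottom-level data fits in O(n log|Σ| · (log_|Σ| n)^α) bits, which matches the stated working space.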

24 Step i - outline
[Figure: flow of step i]
  - Input: S, split into S12 and S3
  - Ψ_S12 (from step i+1) → Ψ’_S12, Ψ’_S3
  - Ψ’_S12, Ψ’_S3 → merge → Ψ_S

25 Merging step

  i  SA12  Ψ’_T12  suffix
  1  6     1       a$b
  2  2     4       aab ba$
  3  4     2       aba abb a$b
  4  5     3       abb a$b
  5  3     1       ba$
  6  1     3       bab aab ba$

  i  SA3  Ψ’_T3  suffix
  1  3    6      $ba
  2  1    2      baa bba $ba
  3  2    5      bba $ba

Merged result:

  i  SA  Ψ_T  suffix of T
  1  9   8    $
  2  8   1    a$
  3  4   5    aabba$
  4  2   7    abaabba$
  5  5   9    abba$
  6  7   2    ba$
  7  3   3    baabba$
  8  1   4    babaabba$
  9  6   6    bba$

* Compare entries of SA12 with entries of SA3 in order; two suffixes are compared by following the Ψ’ function at most twice.
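The comparison rule can be sketched as follows. This is an illustrative reading of the merging step, not the author's implementation: it reads characters directly from T, outputs a plain suffix array rather than Ψ_T so the result is easy to check, and assumes n is a multiple of 3 with a unique smallest end marker. A residue-1 suffix is settled against a residue-3 suffix after one Ψ’ step (both successors lie in SA12), a residue-2 suffix after two.

```python
def merge(text, sa12, psi12, sa3, psi3):
    """Merge SA12 and SA3 into the 1-based suffix array of text (sketch)."""
    m = len(text) // 3

    def start12(j):                 # T12 index -> start position in T
        return 3 * j - 2 if j <= m else 3 * (j - m) - 1

    def less(i, j):
        """True if the suffix at SA12[i] is smaller than the suffix at SA3[j]."""
        p, q = start12(sa12[i - 1]), 3 * sa3[j - 1]
        if text[p - 1] != text[q - 1]:
            return text[p - 1] < text[q - 1]
        i2, j2 = psi12[i - 1], psi3[j - 1]      # first use of Psi'
        if p % 3 == 1:
            return i2 < j2                      # both successors are in SA12
        # Residue-2: T[p+1..n] is in SA3 while T[q+1..n] is in SA12.
        if text[p] != text[q]:                  # compare T[p+1] with T[q+1]
            return text[p] < text[q]
        return psi3[i2 - 1] < psi12[j2 - 1]     # second use: both in SA12

    sa, i, j = [], 1, 1
    while i <= len(sa12) and j <= len(sa3):
        if less(i, j):
            sa.append(start12(sa12[i - 1]))
            i += 1
        else:
            sa.append(3 * sa3[j - 1])
            j += 1
    sa += [start12(k) for k in sa12[i - 1:]] + [3 * k for k in sa3[j - 1:]]
    return sa


MERGED = merge("babaabba$", [6, 2, 4, 5, 3, 1], [1, 4, 2, 3, 1, 3], [3, 1, 2], [6, 2, 5])
print(MERGED)
```

Given the merged order, Ψ_T follows from the definition on slide 12.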

26 Conclusions & future works
We presented an alphabet-independent linear-time algorithm to construct SA, CSA, and FM-index using o(n log n)-bit working space.
Future works
  - Construct SA, CSA, and FM-index optimally, i.e., using O(n) time and O(n log|Σ|)-bit working space

