# Linear-time construction of CSA using o(n log n)-bit working space for large alphabets

## Presentation transcript

Joong Chae Na, School of Computer Sci. & Eng., Seoul National University, Korea

### Overview

- Background
  - Suffix arrays (SA)
  - Compressed suffix arrays (CSA)
- Problem definition
- Previous works
- Our contributions
- Description of our algorithm
- Conclusions

### Background (1)

Given a string T of length n over an alphabet Σ, the suffix array (SA) of T [Manber & Myers ’93] is the lexicographically sorted list of the suffixes of T. It takes O(n log n) bits.

Example with T = b a b a a b b a \$:

| i | SA[i] | Suffix T[SA[i]..n] |
|---|---|---|
| 1 | 9 | \$ |
| 2 | 8 | a\$ |
| 3 | 4 | aabba\$ |
| 4 | 2 | abaabba\$ |
| 5 | 5 | abba\$ |
| 6 | 7 | ba\$ |
| 7 | 3 | baabba\$ |
| 8 | 1 | babaabba\$ |
| 9 | 6 | bba\$ |
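The table above can be reproduced with a few lines of Python. This is only a minimal sketch that sorts the suffixes directly (far from linear time, and not the construction discussed in this talk); the function name `suffix_array` is a hypothetical helper introduced for illustration.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by sorting its suffixes directly.

    Illustration only: this naive sort is not a linear-time construction.
    """
    n = len(text)
    # '$' sorts before the letters in ASCII, matching the slide's convention
    # that the end marker is the smallest character.
    return sorted(range(1, n + 1), key=lambda k: text[k - 1:])


T = "babaabba$"
print(suffix_array(T))  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```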

### Background (2)

- Compressed suffix array (CSA) [Grossi & Vitter ’00]: a compressed version of SA with a space requirement of O(n log|Σ|) bits
- FM-index [Ferragina & Manzini 2000]

Example with T = b a b a a b b a \$:

| i | SA[i] | $\Psi_T[i]$ | Suffix T[SA[i]..n] |
|---|---|---|---|
| 1 | 9 | 8 | \$ |
| 2 | 8 | 1 | a\$ |
| 3 | 4 | 5 | aabba\$ |
| 4 | 2 | 7 | abaabba\$ |
| 5 | 5 | 9 | abba\$ |
| 6 | 7 | 2 | ba\$ |
| 7 | 3 | 3 | baabba\$ |
| 8 | 1 | 4 | babaabba\$ |
| 9 | 6 | 6 | bba\$ |

### Problem definition

- Construct SA, CSA, and FM-index using
  - o(n log n) time and
  - o(n log n)-bit working space
- Working space
  - Temporary space required for executing an algorithm
  - Does not include the space for the input and output

### Related works

Constructing SA and CSA:

- With O(n log n)-bit working space
  - Manber & Myers [1993]: O(n log n) time
  - Kim et al. [2003]: O(n) time
  - Kärkkäinen & Sanders [2003]: O(n) time
  - Ko & Aluru [2003]: O(n) time
- With O(n log|Σ|)-bit working space
  - Lam et al. [COCOON 2002]: O(|Σ| n log n) time
  - Hon et al. [ISAAC 2003]: O(n log n) time

None of these algorithms satisfies both the time and the space requirement of our problem.

### Previous results

Hon et al. [FOCS 2003]:

- An algorithm using O(n loglog|Σ|) time and O(n log|Σ|)-bit working space
- The first algorithm using o(n log n) time and o(n log n)-bit working space
- Follows ½-recursion (the odd-even scheme)

### Our contributions

- Another algorithm using o(n log n) time and o(n log n)-bit working space
  - O(n) time and $O(n \log|\Sigma| \cdot \log_{|\Sigma|}^{\alpha} n)$-bit working space
  - $\alpha = \log_3 2 \approx 0.63$
- The first alphabet-independent linear-time algorithm for constructing SA, CSA, and FM-index using o(n log n)-bit working space
- Follows ⅔-recursion (the skew scheme)

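### Hon et al. vs. our results

| | Hon et al. | Our results |
|---|---|---|
| Time | $O(n \log\log\lvert\Sigma\rvert)$ | $O(n)$ |
| Space (bits) | $O(n \log\lvert\Sigma\rvert)$ | $O(n \log\lvert\Sigma\rvert \cdot \log_{\lvert\Sigma\rvert}^{\alpha} n)$ |
| Scheme | ½-recursion | ⅔-recursion |
| (merging) | complex | simple |
| (encoding)* | implicit | implicit |

*The encoding step is the most complex and time-consuming step in ⅔-recursion. However, neither algorithm needs the encoding step.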

### Description of our algorithm

### Overview

- Preliminaries
- Basic definitions and notations
- Main technique
- Outline of our algorithm

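### Preliminaries: Ψ function

Let T[k..n] be lexicographically the i-th smallest suffix of T. Then:

- SA[i] = k
- $\Psi_T[i]$ is the position in SA where T[k+1..n] is stored

Example with T = b a b a a b b a \$ (positions 1 to 9):

| i | SA[i] | $\Psi_T[i]$ | Suffix T[SA[i]..n] |
|---|---|---|---|
| 1 | 9 | 8 | \$ |
| 2 | 8 | 1 | a\$ |
| 3 | 4 | 5 | aabba\$ |
| 4 | 2 | 7 | abaabba\$ |
| 5 | 5 | 9 | abba\$ |
| 6 | 7 | 2 | ba\$ |
| 7 | 3 | 3 | baabba\$ |
| 8 | 1 | 4 | babaabba\$ |
| 9 | 6 | 6 | bba\$ |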
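The definition of Ψ can be checked against the example with a short sketch. It assumes the suffix array is already available as a plain list, and it assumes the wrap-around convention that the example table appears to use (the suffix "after" T[n..n] is taken to be T[1..n], which gives Ψ[1] = 8). The name `psi_from_sa` is a hypothetical helper.

```python
def psi_from_sa(sa):
    """Compute the 1-based Psi array from a 1-based suffix array given as a list."""
    n = len(sa)
    rank = [0] * (n + 1)              # rank[k] = index i with SA[i] = k
    for i, k in enumerate(sa, start=1):
        rank[k] = i
    # Psi[i] is the index of the suffix starting one position later.
    # Assumption: position n wraps around to position 1.
    return [rank[k % n + 1] for k in sa]


SA = [9, 8, 4, 2, 5, 7, 3, 1, 6]      # suffix array of "babaabba$"
print(psi_from_sa(SA))                # [8, 1, 5, 7, 9, 2, 3, 4, 6]
```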

### Preliminaries: Lemmas

From Hon et al. [FOCS 2003]:

- Text, Ψ → SA, CSA
  - O(n) time, O(n log|Σ|)-bit working space
- Text, Ψ → C array (BWT) → FM-index
  - O(n) time, O(n log|Σ|)-bit working space

Note: the goal is therefore Text → Ψ.
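To see why Ψ is enough to recover SA, the following sketch walks Ψ once from the entry of the end marker. It illustrates only the traversal order, not the lemma's space bound, since it writes SA into an ordinary list. It assumes the unique end marker is the smallest character (so T[n..n] sits at index 1) and that Ψ[1] wraps around to the index of T[1..n], as in the example table. `sa_from_psi` is a hypothetical name.

```python
def sa_from_psi(psi):
    """Rebuild the 1-based suffix array from a 1-based Psi array (given as a list).

    Assumptions: index 1 holds the suffix T[n..n], and psi[0] points to the
    index of T[1..n] (wrap-around).
    """
    n = len(psi)
    sa = [0] * n
    i = psi[0]                 # index in SA of the suffix T[1..n]
    for k in range(1, n + 1):
        sa[i - 1] = k          # the suffix starting at text position k lives at index i
        i = psi[i - 1]         # move to the suffix starting at position k + 1
    return sa


PSI = [8, 1, 5, 7, 9, 2, 3, 4, 6]   # Psi of "babaabba$"
print(sa_from_psi(PSI))             # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```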

### Basic definitions and notations (1)

| Class | Definition (1 ≤ i ≤ n/3) | Suffixes | Example (T = babaabba\$) |
|---|---|---|---|
| Residue-1 suffixes of T | T[3i-2..n] | T[1..n], T[4..n], T[7..n], … | babaabba\$, aabba\$, ba\$ |
| Residue-2 suffixes of T | T[3i-1..n] | T[2..n], T[5..n], T[8..n], … | abaabba\$, abba\$, a\$ |
| Residue-3 suffixes of T | T[3i..n] | T[3..n], T[6..n], T[9..n], … | baabba\$, bba\$, \$ |

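### Basic definitions and notations (2)

| String | Definition | Length | Alphabet | Suffix array |
|---|---|---|---|---|
| $T^{12}$ | $T^{12}[1..\tfrac{2}{3}n]$ = T[1..n] T[2..n] T[1] | ⅔ n | $\Sigma^3$ | $SA^{12}$ |
| $T^{3}$ | $T^{3}[1..\tfrac{1}{3}n]$ = T[3..n] T[1] T[2] | ⅓ n | $\Sigma^3$ | $SA^{3}$ |

Each character of $T^{12}$ and $T^{3}$ is a triple of characters of T (alphabet Σ). Example with T = babaabba\$:

| String | Triples | Positions of T covered by each triple |
|---|---|---|
| $T^{12}$ | bab aab ba\$ aba abb a\$b | 1 2 3, 4 5 6, 7 8 9, 2 3 4, 5 6 7, 8 9 1 |
| $T^{3}$ | baa bba \$ba | 3 4 5, 6 7 8, 9 1 2 |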
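The two derived strings can be built directly from the definitions on this slide. The sketch below assumes, as in the example, that n is a multiple of 3; `split_triples` and `build_t12_t3` are hypothetical helper names.

```python
def split_triples(s):
    """Split s into consecutive blocks of three characters."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]


def build_t12_t3(text):
    """Return (T12, T3) as lists of triples. Assumes len(text) is a multiple of 3."""
    # T12 = T[1..n] T[2..n] T[1], read three characters at a time.
    t12 = split_triples(text + text[1:] + text[:1])
    # T3 = T[3..n] T[1] T[2], read three characters at a time.
    t3 = split_triples(text[2:] + text[:2])
    return t12, t3


t12, t3 = build_t12_t3("babaabba$")
print(t12)  # ['bab', 'aab', 'ba$', 'aba', 'abb', 'a$b']
print(t3)   # ['baa', 'bba', '$ba']
```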

### Main technique: Ψ’ function

- Ψ’ is just like Ψ, but it is defined on $SA^{12}$ and $SA^{3}$.
- Ψ’ points to the position in $SA^{12}$ or $SA^{3}$ where T[k+1..n], the suffix following the current suffix T[k..n], is stored.
- Note that Ψ’ is not the Ψ function of $T^{12}$ and $T^{3}$.
- The Ψ’ function consists of $\Psi'_{T^{12}}$ and $\Psi'_{T^{3}}$.

### Ψ’ function (residue-1)

$\Psi'_{T^{12}}$ for residue-1 suffixes of T:

- Let T[3k-2..n] be the suffix stored in $SA^{12}[i]$.
- Then $\Psi'_{T^{12}}[i]$ is the position in $SA^{12}$ where the next suffix T[3k-1..n] is stored.

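### Ψ’ function (residue-1): example

T = babaabba\$, $T^{12}$ = bab aab ba\$ aba abb a\$b, $T^{3}$ = baa bba \$ba.

| i | $SA^{12}[i]$ | $\Psi'_{T^{12}}[i]$ | Suffix (in triples) |
|---|---|---|---|
| 1 | 6 | 1 | a\$b |
| 2 | 2 | 4 | aab ba\$ |
| 3 | 4 | 2 | aba abb a\$b |
| 4 | 5 | 3 | abb a\$b |
| 5 | 3 | 1 | ba\$ |
| 6 | 1 | 3 | bab aab ba\$ |

| i | $SA^{3}[i]$ | $\Psi'_{T^{3}}[i]$ | Suffix (in triples) |
|---|---|---|---|
| 1 | 3 | 6 | \$ba |
| 2 | 1 | 2 | baa bba \$ba |
| 3 | 2 | 5 | bba \$ba |

The residue-1 suffixes are the entries i = 2, 5, 6 of $SA^{12}$ (T[4..n], T[7..n], T[1..n]). Their Ψ’ values 4, 1, 3 are positions in $SA^{12}$, where T[5..n], T[8..n], T[2..n] are stored.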

### Ψ’ function (residue-2)

$\Psi'_{T^{12}}$ for residue-2 suffixes of T:

- Let T[3k-1..n] be the suffix stored in $SA^{12}[i]$.
- Then $\Psi'_{T^{12}}[i]$ is the position in $SA^{3}$ where the next suffix T[3k..n] is stored.

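### Ψ’ function (residue-2): example

In the same example (T = babaabba\$), the residue-2 suffixes are the entries i = 1, 3, 4 of $SA^{12}$. Their Ψ’ values are positions in $SA^{3}$:

| i | $SA^{12}[i]$ | Suffix of T | $\Psi'_{T^{12}}[i]$ | Next suffix, stored in $SA^{3}$ |
|---|---|---|---|---|
| 1 | 6 | T[8..n] = a\$ | 1 | T[9..n] = \$ |
| 3 | 4 | T[2..n] = abaabba\$ | 2 | T[3..n] = baabba\$ |
| 4 | 5 | T[5..n] = abba\$ | 3 | T[6..n] = bba\$ |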

### Ψ’ function (residue-3)

$\Psi'_{T^{3}}$ for residue-3 suffixes of T:

- Let T[3k..n] be the suffix stored in $SA^{3}[i]$.
- Then $\Psi'_{T^{3}}[i]$ is the position in $SA^{12}$ where the next suffix T[3k+1..n] is stored.

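### Ψ’ function (residue-3): example

In the same example (T = babaabba\$), every entry of $SA^{3}$ is a residue-3 suffix. Its Ψ’ value is a position in $SA^{12}$:

| i | $SA^{3}[i]$ | Suffix of T | $\Psi'_{T^{3}}[i]$ | Next suffix, stored in $SA^{12}$ |
|---|---|---|---|---|
| 1 | 3 | T[9..n] = \$ | 6 | T[1..n] = babaabba\$ (wrapping around) |
| 2 | 1 | T[3..n] = baabba\$ | 2 | T[4..n] = aabba\$ |
| 3 | 2 | T[6..n] = bba\$ | 5 | T[7..n] = ba\$ |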
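The three cases of Ψ’ can be put together in one sketch that reproduces the example values. It is illustrative only: it sorts the suffixes of the triple strings naively instead of using the recursive construction, it assumes n is a multiple of 3, and the names `triples` and `psi_prime` are hypothetical.

```python
def triples(s):
    """Split s into consecutive 3-character blocks."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]


def psi_prime(text):
    """Return (sa12, psi12, sa3, psi3) with 1-based values; len(text) % 3 == 0 assumed."""
    m = len(text) // 3
    t12 = triples(text + text[1:] + text[:1])    # 2m triples
    t3 = triples(text[2:] + text[:2])            # m triples
    # Naive suffix sorting over the triple alphabet (illustration only).
    sa12 = sorted(range(1, 2 * m + 1), key=lambda p: t12[p - 1:])
    sa3 = sorted(range(1, m + 1), key=lambda q: t3[q - 1:])
    rank12 = {p: i for i, p in enumerate(sa12, start=1)}
    rank3 = {q: i for i, q in enumerate(sa3, start=1)}
    psi12 = []
    for p in sa12:
        if p <= m:                        # residue-1 suffix T[3p-2..n]
            psi12.append(rank12[p + m])   # next suffix is residue-2, stored in SA12
        else:                             # residue-2 suffix T[3k-1..n] with k = p - m
            psi12.append(rank3[p - m])    # next suffix is residue-3, stored in SA3
    # Residue-3 suffix T[3q..n]: the next suffix is residue-1, at T12 position q + 1
    # (assumed to wrap around to position 1 after the last triple).
    psi3 = [rank12[q % m + 1] for q in sa3]
    return sa12, psi12, sa3, psi3


sa12, psi12, sa3, psi3 = psi_prime("babaabba$")
print(sa12, psi12)  # [6, 2, 4, 5, 3, 1] [1, 4, 2, 3, 1, 3]
print(sa3, psi3)    # [3, 1, 2] [6, 2, 5]
```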

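### Framework: outline

- How to construct the Ψ function of T: a bottom-up approach
- The recursion runs over step 0, step 1, …, step h, where $h = \log_3 \log_{|\Sigma|} n$
- At step h, use any linear-time construction algorithm

[Figure: recursion from step 0 (T, $\Psi_T$) through step 1 ($T^{12}$, $\Psi_{T^{12}}$) down to step h, with a table giving the length and alphabet of the string at step i]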
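The choice of h can be related to the working-space bound stated under "Our contributions". The following is a back-of-the-envelope sketch, not the author's proof. It assumes that each step shrinks the string to ⅔ of its length and cubes the alphabet (as in the definition of $T^{12}$), and that the dominant cost is storing the step-h string.

$$
\text{length at step } i:\; n\left(\tfrac{2}{3}\right)^{i}, \qquad
\text{bits per symbol at step } i:\; 3^{i}\log|\Sigma|, \qquad
\text{total bits}:\; n \cdot 2^{i} \log|\Sigma|
$$

$$
3^{h} = \log_{|\Sigma|} n
\;\Longrightarrow\;
2^{h} = \left(3^{h}\right)^{\log_3 2} = \log_{|\Sigma|}^{\alpha} n
\;\Longrightarrow\;
n \cdot 2^{h} \log|\Sigma| = n \log|\Sigma| \cdot \log_{|\Sigma|}^{\alpha} n,
\qquad \alpha = \log_3 2
$$

At step h each symbol takes $3^{h}\log|\Sigma| = \log n$ bits, so a conventional linear-time suffix sorter applied there stays within the same bound under these assumptions.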

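### Step i: outline

[Figure: data flow of step i on an input string S]

- S → $S^{12}$, $S^{3}$
- $\Psi_{S^{12}}$ (from step i+1) → $\Psi'_{S^{12}}$, $\Psi'_{S^{3}}$
- $\Psi'_{S^{12}}$, $\Psi'_{S^{3}}$ → merge → $\Psi_S$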

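### Merging step

Inputs: $SA^{12}$ with $\Psi'_{T^{12}}$ and $SA^{3}$ with $\Psi'_{T^{3}}$ (the same tables as in the Ψ’ function example). Output: SA and $\Psi_T$ of T = babaabba\$:

| i | SA[i] | $\Psi_T[i]$ | Suffix |
|---|---|---|---|
| 1 | 9 | 8 | \$ |
| 2 | 8 | 1 | a\$ |
| 3 | 4 | 5 | aabba\$ |
| 4 | 2 | 7 | abaabba\$ |
| 5 | 5 | 9 | abba\$ |
| 6 | 7 | 2 | ba\$ |
| 7 | 3 | 3 | baabba\$ |
| 8 | 1 | 4 | babaabba\$ |
| 9 | 6 | 6 | bba\$ |

- Compare the entries of $SA^{12}$ with the entries of $SA^{3}$ in order.
- Two suffixes are compared by following the Ψ’ function at most twice.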
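The comparison rule can be sketched as a standard skew-style merge. Several simplifications are assumptions made here for readability and are not taken from the slides: the text is read directly for the one or two leading characters, the starting text position of every entry is passed in explicitly (`pos12`, `pos3`), and the function returns the merged order of text positions rather than $\Psi_T$ itself. The name `merge_by_psi_prime` is hypothetical.

```python
def merge_by_psi_prime(text, pos12, psi12, pos3, psi3):
    """Merge SA12 and SA3 into the suffix order of text (1-based text positions).

    pos12[a-1] / pos3[b-1]: text position of the suffix at SA12[a] / SA3[b].
    psi12, psi3: the Psi' arrays (1-based values) as defined on the slides.
    Assumes text ends with a unique smallest end marker.
    """
    n = len(text)

    def ch(k):                        # character at 1-based position k, with wrap-around
        return text[(k - 1) % n]

    def key12(a):
        i = pos12[a - 1]
        if i % 3 == 1:                # residue-1: one character, then one Psi' step
            return (ch(i), psi12[a - 1])
        b = psi12[a - 1]              # residue-2: first step lands in SA3 ...
        return (ch(i), ch(i + 1), psi3[b - 1])   # ... second step lands in SA12

    def key3(b, against_residue1):
        j = pos3[b - 1]
        a = psi3[b - 1]               # first step lands in SA12 (a residue-1 suffix)
        if against_residue1:
            return (ch(j), a)
        return (ch(j), ch(j + 1), psi12[a - 1])  # second step stays in SA12

    merged, a, b = [], 1, 1
    while a <= len(pos12) and b <= len(pos3):
        residue1 = pos12[a - 1] % 3 == 1
        if key12(a) < key3(b, residue1):
            merged.append(pos12[a - 1])
            a += 1
        else:
            merged.append(pos3[b - 1])
            b += 1
    return merged + pos12[a - 1:] + pos3[b - 1:]


order = merge_by_psi_prime(
    "babaabba$",
    [8, 4, 2, 5, 7, 1], [1, 4, 2, 3, 1, 3],   # SA12 entries as text positions, Psi'_T12
    [9, 3, 6], [6, 2, 5],                     # SA3 entries as text positions, Psi'_T3
)
print(order)  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

In both key shapes the final component is a position in $SA^{12}$, so the two keys are always comparable, and no key needs more than two Ψ’ steps.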

### Conclusions & future works

- We presented an alphabet-independent linear-time algorithm to construct SA, CSA, and FM-index using o(n log n)-bit working space.
- Future works
  - To construct SA, CSA, and FM-index optimally, i.e., using O(n) time and O(n log|Σ|)-bit working space

