Simple Linear Work Suffix Array Construction J. Kärkkäinen, P. Sanders Proc. 30th International Conference on Automata, Languages and Programming 2003.


1 Simple Linear Work Suffix Array Construction J. Kärkkäinen, P. Sanders Proc. 30th International Conference on Automata, Languages and Programming 2003

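2 Work: When analyzing a parallel algorithm, two complexity measures are commonly used: time complexity and work complexity. 1) Time t(n): the number of steps that must be executed. 2) Work w(n): t(n) * (the number of processors used). The main contribution of this paper is that its method is also optimal in the External Memory and Cache Oblivious models, and in the BSP and EREW-PRAM models it matches the work complexity of existing algorithms while achieving better time complexity. The rest of this presentation, however, analyzes only the time complexity in the RAM model.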

3 Today's Work: Suffix Array, Depth Array, Suffix Tree

4 Model of Alphabet. Constant alphabet: the size of the alphabet is constant. Integer alphabet: characters are integers in [1 … n], where n is the number of input characters.

5 Topic 1: Suffix Array. A suffix array SA of s is the result of sorting the suffixes of s lexicographically.
Ex: s = [ a b a ] with positions 0 1 2, so s_0 = a b a, s_1 = b a, s_2 = a
=> SA = [ s_2 s_0 s_1 ] = [ 2 0 1 ] in implementation.
Some conventions: We call the suffix starting at index i the ith suffix.
Suffixes whose index is not divisible by 3 = { ith suffix | i ≠ 0 mod 3 }
Suffixes whose index is divisible by 3 = { ith suffix | i = 0 mod 3 }
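
The definition above can be checked directly with a naive sort. This is only a minimal sketch for illustration (the function name `naive_suffix_array` is made up here), and it is far slower than the O(n) construction discussed in the following slides:

```python
def naive_suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order.

    Sorting whole suffixes costs O(n^2 log n) in the worst case; it only
    serves to illustrate the definition.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


print(naive_suffix_array("aba"))  # [2, 0, 1]
```

The result lists suffix indices rather than the suffixes themselves, which is the "in implementation" form used on the slide.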

6 Suffix Array Problem Input: a string s with length n Output: a suffix array SA of s Time: O(n)

7 GetSA Algorithm Outline. Step 1: SA_{≠0} = sort the suffixes starting at positions i ≠ 0 mod 3. Step 2: SA_{=0} = sort the suffixes starting at positions i = 0 mod 3. Step 3: SA = merge SA_{=0} and SA_{≠0}.

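8 Choosing representatives. Step 1: SA_{≠0} = sort the suffixes starting at positions i ≠ 0 mod 3.
s = m i s s i s s i p p i $ $ (positions 0 to 10, padded with $ at positions 11 and 12)
Radix sort the triples s[i..i+2] for the positions i ≠ 0 mod 3, taken in the order 1 4 7 10 2 5 8, and replace each triple by its rank:
position: 1 4 7 10 2 5 8
triple: iss iss ipp i$$ ssi ssi ppi
rank: 3 3 2 1 5 5 4
Let 代 = [ 3 3 2 1 5 5 4 ] (代 is short for 代表, "representative").
=> getSA(代) = SA_代 = [ 10 7 4 1 8 5 2 ], written as positions of s, in time T(2n/3).
Claim: SA_{≠0} = SA_代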
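
The construction of the representative string can be sketched as follows. The function name `representatives` is hypothetical, Python's built-in `sorted` stands in for the radix sort on the slide, and the pad character `$` is assumed to be smaller than every character of s. A full implementation also needs an extra dummy triple when len(s) mod 3 = 1, which this sketch leaves out:

```python
def representatives(s, pad="$"):
    """Return (positions, ranks) for the suffixes starting at i != 0 mod 3.

    positions lists every i = 1 mod 3 first, then every i = 2 mod 3.
    ranks[k] is the rank (starting at 1) of the triple s[i..i+2] at
    positions[k] among all distinct triples.
    """
    padded = s + pad * 2
    positions = [i for i in range(len(s)) if i % 3 == 1]
    positions += [i for i in range(len(s)) if i % 3 == 2]
    triples = [padded[i:i + 3] for i in positions]
    # sorted() is used here in place of a linear-time radix sort
    rank_of = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    return positions, [rank_of[t] for t in triples]


positions, ranks = representatives("mississippi")
print(positions)  # [1, 4, 7, 10, 2, 5, 8]
print(ranks)      # [3, 3, 2, 1, 5, 5, 4]
```

Equal triples receive equal ranks, which is why the recursive call on the rank string is still needed to break ties.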

9 Why SA_代 = SA_{≠0}?
s = m i s s i s s i p p i (positions 0 to 10), 代 = [ 3 3 2 1 5 5 4 ] (entries belong to positions 1 4 7 10 2 5 8 of s). Write 代_i for the suffix of 代 that starts at the entry for position i.
代_1 = 3 3 2 1 5 5 4, s_1 = i s s i s s i p p i
代_4 = 3 2 1 5 5 4, s_4 = i s s i p p i
代_7 = 2 1 5 5 4, s_7 = i p p i
代_10 = 1 5 5 4, s_10 = i
代_2 = 5 5 4, s_2 = s s i s s i p p i
代_5 = 5 4, s_5 = s s i p p i
代_8 = 4, s_8 = p p i
SA_代 = SA_{≠0} = [ 10 7 4 1 8 5 2 ]. It suffices to show that 代_i < 代_j <=> s_i < s_j.

10 Case 1: i = j mod 3.
s = m i s s i s s i p p i $ $ (positions 0 to 12), 代 = [ 4 4 3 2 6 6 5 ] (entries belong to positions 1 4 7 10 2 5 8; these are the ranks of slide 8 shifted up by one, which does not change any comparison).
Ex: 代_4 = [ 4 3 2 6 6 5 ] (positions 4 7 10 2 5 8), s_4 = [ i s s i p p i $ $ ] (positions 4 to 12)
代_1 = [ 4 4 3 2 6 6 5 ] (positions 1 4 7 10 2 5 8), s_1 = [ i s s i s s i p p i $ $ ] (positions 1 to 12)
代_i < 代_j <=> s_i < s_j: here 代_4 < 代_1 <=> s_4 < s_1.

11 Case 2: i ≠ j mod 3 (with i, j ≠ 0 mod 3).
s = m i s s i s s i p p i $ $ (positions 0 to 12), 代 = [ 4 4 3 2 6 6 5 ] (entries belong to positions 1 4 7 10 2 5 8).
Ex: 代_4 = [ 4 3 2 6 6 5 ] (positions 4 7 10 2 5 8), s_4 = [ i s s i p p i $ $ ] (positions 4 to 12)
代_5 = [ 6 5 ] (positions 5 8), s_5 = [ s s i p p i ] (positions 5 to 10)
代_i < 代_j <=> s_i < s_j: here 代_4 < 代_5 <=> s_4 < s_5.
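
The claim can be spot-checked on the running example. The sketch below is an illustration, not the paper's code: the name `check_claim` is hypothetical, naive sorting replaces both the radix sort and the recursive call, and the dummy triple that a full implementation adds when len(s) mod 3 = 1 is omitted:

```python
def check_claim(s):
    """Return (order via the representative string, order via direct sorting)
    for the suffixes of s that start at positions i != 0 mod 3."""
    padded = s + "$$"  # assumes "$" is smaller than every character of s
    positions = [i for i in range(len(s)) if i % 3 == 1]
    positions += [i for i in range(len(s)) if i % 3 == 2]
    triples = [padded[i:i + 3] for i in positions]
    rank_of = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    reduced = [rank_of[t] for t in triples]

    # Suffix array of the representative string (naive stand-in for getSA)
    sa_reduced = sorted(range(len(reduced)), key=lambda k: reduced[k:])
    via_reduced = [positions[k] for k in sa_reduced]

    # Direct lexicographic sort of the same suffixes of s
    direct = sorted(positions, key=lambda i: s[i:])
    return via_reduced, direct


via_reduced, direct = check_claim("mississippi")
print(via_reduced)  # [10, 7, 4, 1, 8, 5, 2]
print(direct)       # [10, 7, 4, 1, 8, 5, 2]
```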

12 Step 2: SA_{=0} = sort the suffixes starting at positions i = 0 mod 3. ∵ Step 1 determined the rank of s_j among { s_k | k ≠ 0 mod 3 } for all j ≠ 0 mod 3. ∴ Let rank_{≠0}(s_j) = the rank of s_j among { s_k | k ≠ 0 mod 3 }, for all j ≠ 0 mod 3. Since i + 1 = 1 mod 3 whenever i = 0 mod 3, SA_{=0} = radix sort { ( s[i], rank_{≠0}(s_{i+1}) ) | i = 0 mod 3 }.
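
Step 2 can be sketched as below, taking SA_{≠0} from Step 1 as a list of positions. The name `sort_mod0` is hypothetical, `sorted` replaces the radix sort, and a suffix that starts past the end of s is assumed to have rank 0:

```python
def sort_mod0(s, sa_non0):
    """Sort the suffixes starting at i = 0 mod 3 by the pair
    (s[i], rank of suffix i+1 among the suffixes at positions != 0 mod 3)."""
    rank = {pos: r for r, pos in enumerate(sa_non0, start=1)}
    mod0 = [i for i in range(len(s)) if i % 3 == 0]
    # rank.get(i + 1, 0): the empty suffix past the end ranks below all others
    return sorted(mod0, key=lambda i: (s[i], rank.get(i + 1, 0)))


print(sort_mod0("mississippi", [10, 7, 4, 1, 8, 5, 2]))  # [0, 9, 6, 3]
```

Because each key is a (character, integer) pair, two passes of a stable counting sort would achieve the same result in linear time.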

13 Step 3: SA = merge SA_{=0} and SA_{≠0}. SA_{=0} = [ s_0 s_9 s_6 s_3 ], SA_{≠0} = [ s_11 s_10 s_7 s_4 s_1 s_8 s_5 s_2 ] (s_11 is the padding suffix), SA = merge SA_{=0} and SA_{≠0} = [ s_11 s_10 s_7 s_4 s_1 s_0 s_9 s_8 s_6 s_3 s_5 s_2 ]. The merge takes O(n) time if we can determine the relative order of any s_i ∈ SA_{=0} and s_j ∈ SA_{≠0} in constant time.

14 Compare s_i and s_j where i = 0 mod 3 and j ≠ 0 mod 3:
Case 1: j = 1 mod 3. ∵ i + 1 = 1 mod 3 and j + 1 = 2 mod 3 ∴ compare ( s[i], rank_{≠0}(s_{i+1}) ) with ( s[j], rank_{≠0}(s_{j+1}) ) in constant time.
Case 2: j = 2 mod 3. ∵ i + 2 = 2 mod 3 and j + 2 = 1 mod 3 ∴ compare ( s[i], s[i+1], rank_{≠0}(s_{i+2}) ) with ( s[j], s[j+1], rank_{≠0}(s_{j+2}) ) in constant time.
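
The two comparison cases lead to a standard two-way merge. The following is a sketch under stated assumptions: the names `merge_sa`, `ch`, `rk` and `mod0_first` are hypothetical, positions past the end of s behave like an empty character that sorts first, and suffixes past the end have rank 0. The padding suffix s_11 from the slide is not included:

```python
def merge_sa(s, sa_mod0, sa_non0):
    """Merge the two sorted position lists into the suffix array of s."""
    n = len(s)
    rank = {pos: r for r, pos in enumerate(sa_non0, start=1)}

    def ch(k):
        return s[k] if k < n else ""  # "" sorts before every character

    def rk(k):
        return rank.get(k, 0)         # suffixes past the end rank lowest

    def mod0_first(i, j):
        """True if suffix i (i = 0 mod 3) is smaller than suffix j (j != 0 mod 3)."""
        if j % 3 == 1:
            return (ch(i), rk(i + 1)) < (ch(j), rk(j + 1))
        return (ch(i), ch(i + 1), rk(i + 2)) < (ch(j), ch(j + 1), rk(j + 2))

    merged, a, b = [], 0, 0
    while a < len(sa_mod0) and b < len(sa_non0):
        if mod0_first(sa_mod0[a], sa_non0[b]):
            merged.append(sa_mod0[a])
            a += 1
        else:
            merged.append(sa_non0[b])
            b += 1
    return merged + sa_mod0[a:] + sa_non0[b:]


print(merge_sa("mississippi", [0, 9, 6, 3], [10, 7, 4, 1, 8, 5, 2]))
# [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```

Each comparison looks at no more than two characters and one rank, so the whole merge runs in O(n) time.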

15 Time complexity analysis Step1: O(n) + T(2n/3) Step2: O(n) Step3: O(n) T(n) = O(n) + T(2n/3) = O(n)
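
Unrolling the recurrence makes the O(n) bound explicit. Here c is an assumed constant that bounds the linear work done outside the recursive call:

```latex
\begin{align*}
T(n) &\le cn + T\!\left(\tfrac{2n}{3}\right) \\
     &\le cn + c\,\tfrac{2}{3}\,n + c\left(\tfrac{2}{3}\right)^{2} n + \cdots \\
     &\le cn \sum_{k=0}^{\infty} \left(\tfrac{2}{3}\right)^{k} = 3cn = O(n).
\end{align*}
```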

16 Topic 2: Depth array. Definition of depth array: DA[i] = the length of the longest common prefix of s_j and s_k, where s_j = SA[i-1] and s_k = SA[i] are adjacent entries of SA, for i = 1 … n - 1.
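
As a reference point, the definition can be evaluated directly. This minimal sketch uses hypothetical names (`lcp_length`, `naive_depth_array`), sets DA[0] = 0 by assumption since SA[0] has no predecessor, and takes O(n^2) time in the worst case:

```python
def lcp_length(a, b):
    """Length of the longest common prefix of the strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def naive_depth_array(s, sa):
    """DA[i] = LCP length of the suffixes at SA[i-1] and SA[i]; DA[0] = 0."""
    da = [0] * len(sa)
    for i in range(1, len(sa)):
        da[i] = lcp_length(s[sa[i - 1]:], s[sa[i]:])
    return da


print(naive_depth_array("aba", [2, 0, 1]))  # [0, 1, 0]
```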

17 Depth array problem Input: a string s and its suffix array SA. Output: a depth array DA of s. Time: O(|s|) = O(n)

18 Notation: for a suffix s_i, let rank(i) be its position in SA, let s_{i'} = SA[rank(i) - 1] be the suffix immediately preceding s_i in SA, and let d_i = DA[rank(i)] be the length of the longest common prefix of s_{i'} and s_i. Lemma 1: d_i ≥ d_{i-1} - 1

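19 Lemma 1: d_i ≥ d_{i-1} - 1. [Figure: SA and DA with the entries rank(i - 1) and rank(i) marked. s_{i-1} and its predecessor s_{(i-1)'} share their first d_{i-1} characters; removing the first character from each gives s_i and s_{(i-1)'+1}, which share d_{i-1} - 1 characters; s_i and its predecessor s_{i'} share d_i characters.]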

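20 Pf of Lemma 1 (d_i ≥ d_{i-1} - 1): If d_{i-1} ≤ 1 the claim is trivial. Otherwise s_{(i-1)'} < s_{i-1} and the two suffixes share their first d_{i-1} characters. Dropping the first character of each gives s_{(i-1)'+1} < s_i, and these two share d_{i-1} - 1 characters.

21 Pf (continued): Suppose d_i < d_{i-1} - 1. Then s_{i'} shares fewer characters with s_i than s_{(i-1)'+1} does, and both are smaller than s_i, so s_{i'} < s_{(i-1)'+1} < s_i. This contradicts (→←) the fact that s_{i'} is the immediate predecessor of s_i in SA. Hence d_i ≥ d_{i-1} - 1.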

22 How to compute d_i when d_{i-1} is given? By Lemma 1, d_i ≥ d_{i-1} - 1, so the first d_{i-1} - 1 characters of s_i and s_{i'} are already known to match, and it suffices to compare s_i and s_{i'} from the d_{i-1}-th character onward.

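23 Algorithm GetDepth. Input: a string s and its suffix array SA.
1. d_0 = computed by naively comparing s_0 and s_{0'};
2. For i := 1 to n-1 do
3. d_i = computed by comparing s_i and s_{i'} from the (d_{i-1})-th character;
4. End for
Time complexity analysis: iteration i costs (d_i - d_{i-1} + 1) + 1 = d_i - d_{i-1} + 2 character comparisons. Total = Σ_{i=1}^{n-1} (d_i - d_{i-1} + 2) = d_{n-1} - d_0 + 2(n - 1) = O(n), because the sum telescopes and no d_i exceeds n.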
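
The loop above can be sketched in Python as follows. The name `get_depth` is hypothetical, and the handling of the suffix with rank 0 (it has no predecessor, so its entry is set to 0 and the running depth is reset) is an assumption, since the slide does not spell it out:

```python
def get_depth(s, sa):
    """Compute the depth array DA from s and its suffix array sa in O(n)."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    da = [0] * n
    d = 0  # plays the role of d_{i-1} from the previous iteration
    for i in range(n):  # visit suffixes in text order s_0, s_1, ...
        if rank[i] == 0:
            d = 0  # no predecessor in SA; restart from depth 0
            continue
        j = sa[rank[i] - 1]  # start of s_{i'}, the predecessor of s_i in SA
        d = max(d - 1, 0)    # Lemma 1: the first d_{i-1} - 1 characters match
        while i + d < n and j + d < n and s[i + d] == s[j + d]:
            d += 1
        da[rank[i]] = d
    return da


print(get_depth("mississippi", [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]))
# [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```

The variable d decreases by at most one per iteration and never exceeds n, so the inner while loop runs O(n) times in total, matching the telescoping argument.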

24 Topic 3: Suffix Tree Problem Input: a string s with length n. Output: a suffix tree ST of s. Time: O(|s|) = O(n)

25 GetST Algorithm Outline. Algorithm GetST(s): 1. SA = suffix array of s; 2. DA = depth array of s; 3. For i := 0 to n-1: ST_i = add the SA[i]-th suffix into ST_{i-1}; 4. End for 5. Return ST_{n-1};

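26 How to add the SA[i]-th suffix into ST_{i-1}? Observation: the SA[i-1]-th suffix is the rightmost path RP of ST_{i-1}, so the longest common prefix of RP and the SA[i]-th suffix has length DA[i]. Walk up RP from its leaf to string depth DA[i], split the edge there if no node exists at that depth, and attach a new leaf whose edge is labelled [ (SA[i] + DA[i]), - ], that is, s from position SA[i] + DA[i] to the end.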

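27 Each node is visited at most once. [Figure: the rightmost path with the new leaf edge [ (SA[i] + DA[i]), - ] attached at string depth DA[i].] Nodes on the part of the old rightmost path below the attachment point will not be visited again.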

28 Time Complexity Analysis. Because each node is visited at most once and there are at most 2n nodes in the tree, the time complexity is O(n).
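
The insertion procedure and its analysis can be sketched with an explicit stack that holds the rightmost path. This is an illustrative sketch, not the presenter's code: the names `Node`, `build_suffix_tree` and `leaf_order` are hypothetical, and s is assumed to end with a unique sentinel such as `$` so that no suffix is a prefix of another:

```python
class Node:
    def __init__(self, start, depth, is_leaf=False):
        self.start = start      # a suffix whose first `depth` characters spell this node
        self.depth = depth      # string depth: length of the path label from the root
        self.is_leaf = is_leaf
        self.children = []      # in lexicographic order, rightmost child last


def build_suffix_tree(s, sa, da):
    """Build the suffix tree of s from its suffix array and depth array."""
    n = len(s)
    root = Node(0, 0)
    path = [root]  # the rightmost path, root first
    for i, start in enumerate(sa):
        lcp = da[i] if i > 0 else 0
        last = None
        # Walk up the rightmost path; popped nodes are never visited again.
        while path[-1].depth > lcp:
            last = path.pop()
        if path[-1].depth < lcp:
            # No node at string depth lcp: split the edge leading to `last`.
            mid = Node(start, lcp)
            path[-1].children[-1] = mid
            mid.children.append(last)
            path.append(mid)
        # New leaf; its edge label is s[start + lcp:], i.e. [(SA[i] + DA[i]), -].
        leaf = Node(start, n - start, is_leaf=True)
        path[-1].children.append(leaf)
        path.append(leaf)
    return root


def leaf_order(node):
    """Suffix start positions of the leaves, left to right."""
    if node.is_leaf:
        return [node.start]
    return [p for child in node.children for p in leaf_order(child)]
```

The edge from a parent p to a child c carries the label s[c.start + p.depth : c.start + c.depth]. Every node is pushed onto the stack once and popped at most once, which is the "visited at most once" argument from the slides.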

29 Conclusions. Advantages: alphabet restrictions; hard-disk I/O; easy to show. Disadvantages: not incremental.

