1 Suffix tree and suffix array techniques for pattern analysis in strings Esko Ukkonen Univ Helsinki Erice School 30 Oct 2005 Modified Alon Itai 2006.


1 Suffix tree and suffix array techniques for pattern analysis in strings. Esko Ukkonen, Univ Helsinki. Erice School, 30 Oct 2005. Modified by Alon Itai, 2006.

2 Pattern finding & synthesis problems
T = t_1 t_2 … t_n and P = p_1 p_2 … p_m are strings of symbols over a finite alphabet.
Indexing problem: preprocess T (build an index structure) so that the occurrences of different patterns P can be found fast. The text is static; the pattern P is arbitrary.
Pattern synthesis problem: learn from T new patterns that occur surprisingly often.
What is a pattern? An exact substring, an approximate substring, a substring with generalized symbols, with gaps, …

3 Outline
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs

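4 Suffix array: example
suffix array = lexicographic order of the suffixes
Text: hattivatti

Suffixes by start position:
 1  hattivatti
 2  attivatti
 3  ttivatti
 4  tivatti
 5  ivatti
 6  vatti
 7  atti
 8  tti
 9  ti
10  i
11  ε

Suffixes in lexicographic order (start position, suffix):
11  ε
 7  atti
 2  attivatti
 1  hattivatti
10  i
 5  ivatti
 9  ti
 4  tivatti
 8  tti
 3  ttivatti
 6  vatti

Suffix array: 11 7 2 1 10 5 9 4 8 3 6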
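
The example can be reproduced with a few lines of Python. This is only a naive sketch for checking the table: it sorts the suffixes directly (far from linear time), and the function name `naive_suffix_array` is made up for illustration. Positions are 1-based and the empty suffix ε is included, as on the slide.

```python
def naive_suffix_array(text):
    """Return the 1-based start positions of all suffixes of text,
    including the empty suffix at position len(text) + 1,
    in lexicographic order. Naive: sorts the suffix strings directly."""
    n = len(text)
    # Position i (1-based) corresponds to the Python slice text[i - 1:].
    return sorted(range(1, n + 2), key=lambda i: text[i - 1:])


if __name__ == "__main__":
    print(naive_suffix_array("hattivatti"))
```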

5 Suffix array
suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
space requirement: 5|T| (why 5?)
practitioners like suffix arrays (simplicity, space efficiency)
theoreticians like suffix trees (explicit structure)

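6 Pattern search from suffix array
Text: hattivatti
Sorted suffixes: ε, atti, attivatti, hattivatti, i, ivatti, ti, tivatti, tti, ttivatti, vatti
Suffix array: 11 7 2 1 10 5 9 4 8 3 6
Pattern: att, located by binary search over the sorted suffixes. It is a prefix of atti and attivatti, so it occurs at positions 7 and 2.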
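
A minimal sketch of this binary search in Python. It makes some assumptions that are not on the slide: positions are 0-based, the suffix array holds no entry for the empty suffix, and the names `find_occurrences` and `sa` are illustrative. Each comparison looks at up to m characters, so the search takes O(m log n) time plus the time to report the matches.

```python
def find_occurrences(text, sa, pattern):
    """Return the sorted 0-based start positions of pattern in text,
    using two binary searches over the suffix array sa."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        return text[sa[k]:sa[k] + m]

    # Lower bound: first suffix whose m-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "hattivatti"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    print(find_occurrences(text, sa, "att"))
```

With 0-based positions, att is reported at 1 and 6, which are positions 2 and 7 in the slide's 1-based numbering.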

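7 The search time is O(m log n), where m = length of the search string and n = length of the text (and size of the suffix array). With LCP (longest common prefix) information, the time is O(m + log n).
[Figure: binary search over the suffix array, showing the pattern pat against the lower bound l, the upper bound u, and the midpoint m of the current interval.]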

8 Recent suffix array constructions
Manber & Myers (1990): O(|T| log |T|)
linear time via suffix tree
January / June 2003: direct linear-time construction of the suffix array
- Kim, Sim, Park, Park (CPM03)
- Kärkkäinen & Sanders (ICALP03)
- Ko & Aluru (CPM03)

9 Kärkkäinen-Sanders algorithm
1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
2. Construct the suffix array of the remaining suffixes using the result of the first step.
3. Merge the two suffix arrays into one.

10 Notation
string T = T[0,n) = t_0 t_1 … t_{n-1}
suffix S_i = T[i,n) = t_i t_{i+1} … t_{n-1}
for C ⊆ [0,n]: S_C = {S_i | i ∈ C}
the suffix array SA[0,n] of T is a permutation of [0,n] satisfying S_{SA[0]} < S_{SA[1]} < … < S_{SA[n]} (so S_{SA[0]} = T[SA[0],n) is the smallest suffix)

11 Running example
T[0,n) = y a b b a d a b b a d o, followed by the padding 0 0 (n = 12)
SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)

 i   SA[i]   S_{SA[i]}
 0   12      0 0
 1    1      a b b a d a b b a d o 0 0
 2    6      a b b a d o 0 0
 3    4      a d a b b a d o 0 0
 4    9      a d o 0 0
 5    3      b a d a b b a d o 0 0
 6    8      b a d o 0 0
 7    2      b b a d a b b a d o 0 0
 8    7      b b a d o 0 0
 9    5      d a b b a d o 0 0
10   10      d o 0 0
11   11      o 0 0
12    0      y a b b a d a b b a d o 0 0

12 Step 0: Construct a sample
for k = 0,1,2: B_k = {i ∈ [0,n] | i mod 3 = k}
C = B_1 ∪ B_2: the sample positions
S_C: the sample suffixes
Example: B_1 = {1,4,7,10}, B_2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}

13 Step 1: Sort sample suffixes
For k = 1,2, construct
R_k = [t_k t_{k+1} t_{k+2}] [t_{k+3} t_{k+4} t_{k+5}] … [t_{max B_k} t_{max B_k + 1} t_{max B_k + 2}]
R = R_1 ∘ R_2 (concatenation of R_1 and R_2)
Suffixes of R correspond to S_C: the suffix [t_i t_{i+1} t_{i+2}] … corresponds to S_i.
The correspondence is order preserving: let R_{i'} ⇔ S_i and R_{j'} ⇔ S_j. Then R_{i'} < R_{j'} iff S_i < S_j.

14 Sort the suffixes of R
Radix sort the characters of R (the triples) and rename them with their ranks to obtain R´.
Example:
R = R_1 ∘ R_2 = [abb][ada][bba][do0] [bba][dab][bad][o00]
Ranks of the distinct triples: [abb] = 1, [ada] = 2, [bad] = 3, [bba] = 4, [dab] = 5, [do0] = 6, [o00] = 7
R´ = (1,2,4,6,4,5,3,7)
If all characters are different, their order directly gives the order of the suffixes. Otherwise, sort the suffixes of R´ using Kärkkäinen-Sanders recursively.
Note: |R´| = 2n/3.
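
The renaming step for the running example can be sketched as follows. This is an illustration only: it uses Python's comparison sort where the slide calls for a radix sort, pads with the character "\0" in place of the slide's 0, and the function name `rename_sample_triples` is an assumption.

```python
def rename_sample_triples(text):
    """Build R = R1 . R2 as a list of character triples and rename each
    triple by its rank among the distinct triples.

    Returns (positions, r_prime): positions[j] is the text position whose
    triple became r_prime[j]."""
    n = len(text)
    padded = text + "\0\0\0"  # stands in for the padding zeros on the slide
    # Sample positions C = B1 followed by B2, with B_k = {i in [0,n] : i mod 3 = k}.
    positions = list(range(1, n + 1, 3)) + list(range(2, n + 1, 3))
    triples = [padded[i:i + 3] for i in positions]
    # Comparison sort of the distinct triples; a radix sort would make this linear.
    rank_of = {tr: r for r, tr in enumerate(sorted(set(triples)), start=1)}
    return positions, [rank_of[tr] for tr in triples]


if __name__ == "__main__":
    positions, r_prime = rename_sample_triples("yabbadabbado")
    print(positions)
    print(r_prime)
```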

15 Step 1 (cont.)
Once the sample suffixes are sorted, assign a rank to each: rank(S_i) = the rank of S_i in S_C; rank(S_{n+1}) = rank(S_{n+2}) = 0.
Example: R´ = (1,2,4,6,4,5,3,7). Suffixes of R´ in sorted order:
0: ε
1: 12464537
2: 2464537
3: 37
4: 4537
5: 464537
6: 537
7: 64537
8: 7
SA_{R´} = (8,0,1,6,4,2,5,3,7) (the suffix array of R´)
SA_{R´}^{-1} = (1 2 5 7 4 6 3 8)

i:          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
rank(S_i):  –  1  4  –  2  6  –  5  3  –  7  8  –  0  0
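
A sketch of how the rank table follows from R´, checked on the running example. Two assumptions: the suffixes of R´ are sorted naively here, standing in for the recursive call of the real algorithm, and the function name `sample_ranks` is hypothetical. Undefined ranks (the "–" entries) are represented by `None`.

```python
def sample_ranks(r_prime, positions, n):
    """Return a list rank[0 .. n+2] where rank[i] is the rank of the sample
    suffix S_i within S_C, None for non-sample positions, and
    rank[n+1] = rank[n+2] = 0 as on the slide."""
    # Naive stand-in for the recursive suffix sort of R'.
    order = sorted(range(len(r_prime)), key=lambda j: r_prime[j:])
    rank = [None] * (n + 3)
    for r, j in enumerate(order, start=1):
        rank[positions[j]] = r  # position j of R' corresponds to text position positions[j]
    rank[n + 1] = rank[n + 2] = 0
    return rank


if __name__ == "__main__":
    r_prime = [1, 2, 4, 6, 4, 5, 3, 7]
    positions = [1, 4, 7, 10, 2, 5, 8, 11]
    print(sample_ranks(r_prime, positions, 12))
```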

16 Step 2: Sort nonsample suffixes
For non-sample suffixes S_i, S_j ∈ S_{B_0} (note that rank(S_{i+1}) is always defined for i ∈ B_0):
S_i ≤ S_j ⇔ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
Radix sort the pairs (t_i, rank(S_{i+1})).
Example: S_12 < S_6 < S_9 < S_3 < S_0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)

17 (More detail is needed here.)
Example: S_12 < S_6 < S_9 < S_3 < S_0 because
S_0 = yabbadabbado = y S_1 → (y,1)
S_3 = badabbado = b S_4 → (b,2)
S_6 = abbado = a S_7 → (a,5)
S_9 = ado = a S_10 → (a,7)
S_12 = 0 = 0 ε → (0,0)
and (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
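
The same example as a short Python sketch. Assumptions: the rank table of the sample suffixes is given as a list indexed by text position (with `None` for undefined entries), "\0" plays the role of the padding 0, a comparison sort replaces the radix sort, and the name `sort_nonsample` is illustrative.

```python
def sort_nonsample(text, rank):
    """Sort the non-sample suffixes S_i (i mod 3 == 0, i in [0, n]) by the
    pair (t_i, rank(S_{i+1})), as described for Step 2."""
    n = len(text)
    padded = text + "\0"  # t_n is the padding character
    b0 = range(0, n + 1, 3)
    # Comparison sort on the pairs; a two-pass radix sort would be linear.
    return sorted(b0, key=lambda i: (padded[i], rank[i + 1]))


if __name__ == "__main__":
    # Rank table of the running example; None marks non-sample positions.
    rank = [None, 1, 4, None, 2, 6, None, 5, 3, None, 7, 8, None, 0, 0]
    print(sort_nonsample("yabbadabbado", rank))
```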

18 Step 3: Merge
Merge the two sorted sets of suffixes using standard comparison-based merging. To compare S_i ∈ S_C with S_j ∈ S_{B_0}, distinguish two cases:
i ∈ B_1: S_i ≤ S_j ⇔ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
i ∈ B_2: S_i ≤ S_j ⇔ (t_i, t_{i+1}, rank(S_{i+2})) ≤ (t_j, t_{j+1}, rank(S_{j+2}))
Note that the ranks are defined in all cases!
Examples: S_1 < S_6 as (a,4) < (a,5) (1 ∈ B_1), and S_3 < S_8 as (b,a,6) < (b,a,7) (8 ∈ B_2).
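
A sketch of the merge on the running example, assuming the two sorted lists and the rank table from the previous steps are already available. The function name `merge_suffixes` and the use of "\0" for the padding character are illustrative choices, not part of the slides.

```python
def merge_suffixes(text, rank, sample_sorted, nonsample_sorted):
    """Merge the sorted sample suffixes (i mod 3 != 0) with the sorted
    non-sample suffixes (i mod 3 == 0) into one suffix array."""
    t = text + "\0\0\0"  # padding so that t[i + 1] is always defined

    def sample_comes_first(i, j):
        # i is a sample position, j a non-sample position.
        if i % 3 == 1:
            return (t[i], rank[i + 1]) <= (t[j], rank[j + 1])
        return (t[i], t[i + 1], rank[i + 2]) <= (t[j], t[j + 1], rank[j + 2])

    merged = []
    a = b = 0
    while a < len(sample_sorted) and b < len(nonsample_sorted):
        i, j = sample_sorted[a], nonsample_sorted[b]
        if sample_comes_first(i, j):
            merged.append(i)
            a += 1
        else:
            merged.append(j)
            b += 1
    # One of the two lists is exhausted; append the rest of the other.
    merged.extend(sample_sorted[a:])
    merged.extend(nonsample_sorted[b:])
    return merged


if __name__ == "__main__":
    rank = [None, 1, 4, None, 2, 6, None, 5, 3, None, 7, 8, None, 0, 0]
    sample_sorted = [1, 4, 8, 2, 7, 5, 10, 11]  # sample suffixes by rank 1..8
    nonsample_sorted = [12, 6, 9, 3, 0]         # result of Step 2
    print(merge_suffixes("yabbadabbado", rank, sample_sorted, nonsample_sorted))
```

The result agrees with the suffix array SA = (12,1,6,4,9,3,8,2,7,5,10,11,0) of the running example.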

19 Running time O(n)
Excluding the recursive call, everything can be done in linear time.
The recursion is on a string of length 2n/3.
Thus the time is given by the recurrence T(n) = T(2n/3) + O(n), hence T(n) = O(n).
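
The last step can be spelled out by unrolling the recurrence. Here c is an assumed constant bounding the non-recursive work per character; the sum is a geometric series with ratio 2/3.

```latex
T(n) \le T\!\left(\tfrac{2n}{3}\right) + cn
     \le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     = \frac{cn}{1 - \tfrac{2}{3}}
     = 3cn
     = O(n).
```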

20 Implementation: about 50 lines of C++ code, available e.g. via Juha Kärkkäinen’s home page

21 LCP table
Longest Common Prefix of successive elements of the suffix array: LCP[i] = length of the longest common prefix of the suffixes S_{SA[i]} and S_{SA[i+1]}
Algorithm: enter the suffixes in a trie and find the lca of each pair of successive suffixes. Complexity = O(n^2)

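22 Kasai et al., CPM 2001
Key observation: let LCP[q] = h > 1 with SA[q] = i and SA[q+1] = k, i.e.,
S_{SA[q]} = t_i t_{i+1} … t_{i+h-1} t_{i+h} …
S_{SA[q+1]} = t_k t_{k+1} … t_{k+h-1} t_{k+h} … = t_i t_{i+1} … t_{i+h-1} t_{k+h} … (t_{k+h} ≠ t_{i+h})
Then t_{i+1} … t_{i+h-1} = t_{k+1} … t_{k+h-1}.
Define p by SA[p] = i+1, so that S_{SA[p]} = t_{i+1} … t_{i+h-1} …
Since S_{i+1} < S_{k+1} and the two share a prefix of length h-1, every suffix between them in the suffix array shares that prefix too; therefore S_{SA[p+1]} = t_{i+1} … t_{i+h-1} …, i.e., LCP[p] ≥ h-1.
When computing LCP[p] we can skip the first h-1 characters and start the comparisons at offset h-1.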

23 The algorithm
    /* ISA is the inverse suffix array SA^{-1}; SA has entries 0 .. n-1 */
    for (i = 0; i < n; i++)              /* compute SA^{-1} */
        ISA[SA[i]] = i;
    h = 0;
    for (p = 0; p < n; p++) {
        if (ISA[p] < n - 1) {            /* S_p has a successor in SA */
            r = SA[ISA[p] + 1];
            while (p + h < n && r + h < n && T[r + h] == T[p + h])
                h++;                     /* innermost statement */
            LCP[ISA[p]] = h;
            if (h > 0) h--;
        }
    }
Complexity: since h is decreased at most n times and h ≤ n, h can be increased at most 2n times; i.e., the innermost statement is executed ≤ 2n times. Total time = O(n).
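
A runnable Python sketch of the same idea, checked on the running example. It assumes a 0-based suffix array without an entry for the empty suffix and uses the convention LCP[i] = lcp(S_{SA[i]}, S_{SA[i+1]}) from the LCP table slide, so the result has n-1 entries; the name `kasai_lcp` is illustrative.

```python
def kasai_lcp(text, sa):
    """Compute lcp[i] = length of the longest common prefix of the suffixes
    starting at sa[i] and sa[i + 1], in O(n) time (Kasai et al.)."""
    n = len(text)
    isa = [0] * n                  # inverse suffix array
    for i, p in enumerate(sa):
        isa[p] = i

    lcp = [0] * max(n - 1, 0)
    h = 0
    for p in range(n):             # visit suffixes in text order
        if isa[p] < n - 1:         # S_p has a successor in the suffix array
            r = sa[isa[p] + 1]
            # The first h characters are already known to match.
            while p + h < n and r + h < n and text[p + h] == text[r + h]:
                h += 1
            lcp[isa[p]] = h
            if h > 0:
                h -= 1             # S_{p+1} shares at least h - 1 characters
        else:
            h = 0                  # S_p is the largest suffix
    return lcp


if __name__ == "__main__":
    text = "yabbadabbado"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    print(sa)
    print(kasai_lcp(text, sa))
```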

24 Suffix tree vs suffix array
suffix tree ⇔ suffix array + LCP table
[Figure: first step of building the tree from the suffix array: insert S_{SA[0]}.]

25 [Figure: step i. The tree already holds S_{SA[0]} … S_{SA[i-1]}; S_{SA[i]} is inserted next to S_{SA[i-1]}.]
Complexity: the final trie has 2n vertices. Each edge is traversed ≤ twice. Time = O(n).
Which edge to split?

