
1 Suffix Arrays. "A new method for on-line string searches" by U. Manber and G. Myers (1993); "Simple linear work suffix array construction" by J. Kärkkäinen and P. Sanders (2002). Seminar in advanced topics in data structures. Presented by Kfir Amitai

2 Contents
- Introduction
- Searching a suffix array
- Building in O(n log n) (1993)
  - Sorting
  - LCP information building
  - Some observations about linear time
- Building in O(n) (2002)
- Results

3 Introduction. Until now we have looked at suffix trees. Their main problem is the large constant factor in their linear space complexity. Suffix arrays are a much simpler data structure. A suffix array lets us find all occurrences of a string of length P in a string of length N in O(P + log N) time, using a kind of binary search.


5 Introduction: What is a suffix array? A suffix array is a sorted array of the suffixes of a string S, represented by an array of pointers to the suffixes of S. Example: S = "nahariya".
S_0 = nahariya
S_1 = ahariya
S_2 = hariya
S_3 = ariya
S_4 = riya
S_5 = iya
S_6 = ya
S_7 = a

6 The sorted suffixes are represented by an array of integers, POS.
All suffixes: S_0 = nahariya, S_1 = ahariya, S_2 = hariya, S_3 = ariya, S_4 = riya, S_5 = iya, S_6 = ya, S_7 = a
Sorted suffixes: S_7 = a, S_1 = ahariya, S_3 = ariya, S_2 = hariya, S_5 = iya, S_0 = nahariya, S_4 = riya, S_6 = ya
POS = 7 1 3 2 5 0 4 6
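The POS array of this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name suffix_array_naive is made up for illustration, and sorting whole suffixes directly is far slower than the construction algorithms discussed later in the talk.

```python
def suffix_array_naive(s):
    """Return POS: the start positions of the suffixes of s, in sorted order.

    Naive approach: compare whole suffixes, so this is for illustration only.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


pos = suffix_array_naive("nahariya")
print(pos)  # [7, 1, 3, 2, 5, 0, 4, 6]
```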

7 Some definitions and observations
- POS[k] = i ⇔ S_i is the k-th smallest suffix in the set {S_0, S_1, S_2, …, S_{N-1}}.
- For strings u and v and a prefix length p, we write "<_p" for lexicographic order on the first p characters: v <_p u ⇔ v_0 v_1 … v_{p-1} < u_0 u_1 … u_{p-1}.
- Note that for every choice of p < N: A_POS[0] ≤_p A_POS[1] ≤_p A_POS[2] ≤_p … ≤_p A_POS[N-1]. (From this slide on, the text is written A and its suffix starting at position i is written A_i.)
- Let |W| = P. Note that W is a substring of A ⇔ there is an i such that W =_P A_POS[i].


9 The binary search. We define a search interval:
L_W = min {k | W ≤_P A_POS[k] or k = N}
R_W = max {k | W ≥_P A_POS[k] or k = -1}
W matches a_i a_{i+1} … a_{i+P-1} ⇔ i = POS[k] for some k ∈ [L_W, R_W].
If L_W > R_W, then W is not a substring of A. Otherwise there are (R_W - L_W + 1) matches: A_POS[L_W], …, A_POS[R_W].
[Figure: the POS array with L_W and R_W marked; W >_P A_POS[k] for k < L_W and W <_P A_POS[k] for k > R_W.]

10 Search example. A = assassin, P = |W|.
L_W = min {k | W ≤_P A_POS[k] or k = N}
R_W = max {k | W ≥_P A_POS[k] or k = -1}
If found ⇒ (R_W - L_W + 1) matches.
[Figure: the POS array of A with four example searches, labelled Search 1 to Search 4.]

11 Naïve search: O(P log N). We run an ordinary binary search over the POS array. There are log N iterations, each costing O(P).
Initialize:
- L = 0
- R = N-1
Step:
- Set M = (L+R)/2
- Set the new L, R bounds according to a comparison of W with A_POS[M].
Stop when we have reached L_W = min {k | W ≤_P A_POS[k] or k = N} and R_W = max {k | W ≥_P A_POS[k] or k = -1}.
[Figure: POS entries L, M, R pointing at the suffixes abcde..., abcdf..., abd...; W = "abcx".]
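The naïve search above can be sketched in Python as two ordinary binary searches, one for L_W and one for R_W. The function name search_interval is an assumption made for this example; each comparison looks at up to P characters, so this is the O(P log N) version, not the lcp-based one.

```python
def search_interval(text, pos, w):
    """Return (L_W, R_W) for pattern w; w occurs in text iff L_W <= R_W."""
    n, p = len(text), len(w)

    # L_W: smallest k with W <=_P A_POS[k], or N if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + p] < w:
            lo = mid + 1
        else:
            hi = mid
    l_w = lo

    # R_W: largest k with W >=_P A_POS[k], or -1 if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + p] <= w:
            lo = mid + 1
        else:
            hi = mid
    r_w = lo - 1
    return l_w, r_w


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive POS
l_w, r_w = search_interval(text, pos, "ss")
print(l_w, r_w, sorted(pos[l_w:r_w + 1]))  # 6 7 [1, 4]
```

The matches are the text positions POS[L_W], …, POS[R_W]; for "ss" these are positions 1 and 4 of "assassin".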

12 Stop to think… What can we do better?

13 Let's do it better… What we have not used is the fact that we are searching among suffixes of the same string. Let's assume we have information on the lcp (longest common prefix) of pairs of suffixes. In each iteration we define:
- l = lcp(A_POS[L], W)
- r = lcp(W, A_POS[R])
- Llcp[M] = lcp(A_POS[L], A_POS[M])
- Rlcp[M] = lcp(A_POS[M], A_POS[R])
An important point: we need no more than 2N lcp values, because each search midpoint M has a well-defined L and R.

14 Search in O(P + log N) using lcps. Let's look for W = "nahx". If l ≥ r we compare l with Llcp[M], and if l < r we compare r with Rlcp[M]. I will show the case l ≥ r; the other case is symmetric.
Case 1: l < Llcp[M]
[Figure: A_POS[L] = nahde..., A_POS[M] = nahdf…, A_POS[R] = nazf…; l = 3, r = 2, Llcp[M] = 4.]
Reminder: l = lcp(A_POS[L], W), r = lcp(W, A_POS[R]), Llcp[M] = lcp(A_POS[L], A_POS[M]).

15 Search in O(P + log N) using lcps. Case 1: l < Llcp[M] (W = "nahx")
- We know that W > A_POS[L], so W > A_POS[M], because A_POS[M] shares a longer prefix with A_POS[L] than W does.
- So we move L to M.
- l is unchanged (again, because their lcp is bigger).
- We did this with no string comparison, only integer comparisons.
[Figure: A_POS[L] = nahde..., A_POS[M] = nahdf…, A_POS[R] = naz…; l = 3, r = 2, Llcp[M] = 4.]
Reminder: l = lcp(A_POS[L], W), r = lcp(W, A_POS[R]), Llcp[M] = lcp(A_POS[L], A_POS[M]).

16 Search in O(P + log N) using lcps. Case 2: l > Llcp[M] (W = "nahx")
- W and A_POS[L] have more in common (a bigger lcp) than A_POS[M] and A_POS[L] do.
- Therefore, because we know that A_POS[L] < A_POS[M], we get W < A_POS[M].
- So we move R to M.
- Now we assign r ← Llcp[M].
- Again, no string comparison operations.
[Figure: suffixes nahde..., nai..., nak..., naxf…, naz… with the markers L, M, R; l = 3, r = 2, Llcp[M] = 2.]
Reminder: l = lcp(A_POS[L], W), r = lcp(W, A_POS[R]), Llcp[M] = lcp(A_POS[L], A_POS[M]).

17 Search in O(P + log N) using lcps. Case 3: l = Llcp[M] (W = "nahx")
- This is the only case in which we have to compare strings: the lcp information alone does not tell us whether to go left or right.
- What we do know is that the first l characters of W and A_POS[M] are identical.
- We compare character l+1, character l+2, and so on, until we find the first j such that W_{l+j} ≠ (A_POS[M])_{l+j}.
- Character l+j determines whether we go left or right. Either way, we know the new value of l or r.
[Figure: suffixes nahde..., nai..., nak..., nahp…, naz… with the markers L, M, R; l = 3, r = 2, Llcp[M] = 3.]
Reminder: l = lcp(A_POS[L], W), r = lcp(W, A_POS[R]), Llcp[M] = lcp(A_POS[L], A_POS[M]).

18 Search in O(P + log N) using lcps: time complexity. Counting single-character comparisons in an amortized manner, the number done in one step is at most (max(l, r) after the step) - (max(l, r) before the step) + 1. Since max(l, r) never decreases and never exceeds P, all the steps together add no more than P comparisons beyond one per step. Together with the log N steps, we get O(P + log N).
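The three cases and their mirror images for l < r can be put together in a short Python sketch that computes L_W. The names match_len, build_lcp_tables and find_lw are made up for this example, and the Llcp/Rlcp tables are filled here by direct character comparison of neighbouring suffixes, which is a simplification and not the construction described later in the talk.

```python
def match_len(text, i, w, j):
    """Number of equal characters of text[i:] and w[j:], compared one by one."""
    k = 0
    while i + k < len(text) and j + k < len(w) and text[i + k] == w[j + k]:
        k += 1
    return k


def build_lcp_tables(text, pos):
    """Fill Llcp[M] and Rlcp[M] for every midpoint M the binary search can visit."""
    n = len(pos)
    llcp, rlcp = [0] * n, [0] * n

    def fill(lo, hi):
        # Returns lcp(A_POS[lo], A_POS[hi]).
        if hi - lo == 1:
            return match_len(text, pos[lo], text, pos[hi])
        mid = (lo + hi) // 2
        llcp[mid] = fill(lo, mid)
        rlcp[mid] = fill(mid, hi)
        return min(llcp[mid], rlcp[mid])

    if n > 1:
        fill(0, n - 1)
    return llcp, rlcp


def find_lw(text, pos, llcp, rlcp, w):
    """Return L_W = min {k | W <=_P A_POS[k] or k = N}."""
    n, p = len(pos), len(w)
    if n == 0:
        return 0

    def goes_left(k, m):
        # True iff W <=_P A_POS[k], given that they agree on exactly m characters.
        i = pos[k] + m
        return m == p or (i < len(text) and w[m] < text[i])

    l = match_len(text, pos[0], w, 0)
    r = match_len(text, pos[n - 1], w, 0)
    if goes_left(0, l):
        return 0
    if not goes_left(n - 1, r):
        return n
    lo, hi = 0, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        known, bound = (llcp[mid], l) if l >= r else (rlcp[mid], r)
        if known > bound:      # case 1: integers only
            m = bound
        elif known < bound:    # case 2: integers only
            m = known
        else:                  # case 3: compare characters from position bound on
            m = bound + match_len(text, pos[mid] + bound, w, bound)
        if goes_left(mid, m):
            hi, r = mid, m
        else:
            lo, l = mid, m
    return hi


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive POS
llcp, rlcp = build_lcp_tables(text, pos)
print(find_lw(text, pos, llcp, rlcp, "ss"))  # 6
```

R_W can be found with a mirrored routine; only the case 3 branch ever touches characters of W.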

19 Search in O(P + log N) using lcps: space complexity. The implementation uses three N-sized arrays of integers: POS, Llcp and Rlcp. (The example did not show Rlcp in use; it is used in the same way in the cases where r > l.) Now we move on to see how to prepare these three arrays while sorting.



22 Sorting the suffixes. We will see a variation of radix sort. We sort in O(log N) stages, numbered 1, 2, 4, 8, …; we call stage number H = 2^i the H-stage. In stage H the suffixes are sorted into buckets, called H-buckets, according to their first H characters (the next stage is 2H). If A_i and A_j are in the same H-bucket, we sort them by their next H symbols in the 2H-stage.

23 The general idea. If A_i and A_j are in the same H-bucket, we sort them by their next H symbols in the 2H-stage. But their next H symbols are the first H symbols of A_{i+H} and A_{j+H}, which were already sorted in stage H.
[Figure: H = 2; A_i = abef… and A_j = abcd… in the first bucket (ab…), the bb… suffixes in the second bucket, A_{j+H} = cd… in the third bucket and A_{i+H} = ef… in the fourth bucket.]

24 The sorting algorithm. We go over the semi-sorted suffix array:
- The first stage is just a bucket sort on the first character.
- Assume the suffixes are now ordered in ≤_H order.
- For each A_i, in the current POS order: move A_{i-H} to the next available place in its H-bucket.
- The suffixes are now sorted in ≤_2H order.
- Go over the array again and decide which suffixes open a new 2H-bucket, using lcp knowledge (described later).
- In this way POS gets more and more sorted, until every suffix is in a bucket of its own.
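The doubling idea behind these stages can be tried out with a simplified Python sketch. It is not the bucket-moving procedure of the slide: here every stage re-sorts with a comparison sort on the pair (bucket of A_i, bucket of A_{i+H}), which costs O(N log N) per stage instead of O(N). The name suffix_array_doubling and the rank array are assumptions made for this example.

```python
def suffix_array_doubling(text):
    """Build POS in stages H = 1, 2, 4, ...; after stage H the suffixes are in <=_2H order."""
    n = len(text)
    pos = list(range(n))
    rank = [ord(c) for c in text]  # stage-1 bucket of each suffix: its first character
    h = 1
    while True:
        # Sort key of A_i: (bucket of A_i, bucket of A_{i+H}); -1 when A_{i+H} is empty.
        def key(i):
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)
        # Renumber the buckets: a suffix opens a new 2H-bucket when its key changes.
        new_rank = [0] * n
        for k in range(1, n):
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        rank = new_rank
        if n == 0 or rank[pos[-1]] == n - 1:  # every suffix is alone in its bucket
            return pos
        h *= 2


print(suffix_array_doubling("assassin"))  # [0, 3, 6, 7, 2, 5, 1, 4]
```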

25 An example: A = "assassin" (positions 0 to 7), H = 1. POS order after the bucket sort on the first character (suffixes separated by |): assin | assassin | in | n | sin | ssin | sassin | ssassin. A_i sets A_{i-1}: A_3 sets A_2.

26 An example, H = 1. POS order: assin | assassin | in | n | sassin | sin | ssin | ssassin. A_0: A_i sets A_{i-1} is not possible because i = 0.

27 An example, H = 1. POS order: assin | assassin | in | n | sassin | sin | ssin | ssassin. A_i sets A_{i-1}: A_6 sets A_5.

28 An example, H = 1. POS order: assin | assassin | in | n | sassin | sin | ssin | ssassin. A_i sets A_{i-1}: A_7 sets A_6, which is already the first in its bucket.

29 An example, H = 1. POS order: assin | assassin | in | n | sassin | sin | ssin | ssassin. A_i sets A_{i-1}: A_2 sets A_1.

30 An example, H = 1. POS order: assin | assassin | in | n | sassin | sin | ssassin | ssin. A_i sets A_{i-1}: A_5 sets A_4.

31 An example, H = 1. POS order: assin | assassin | in | n | sassin | sin | ssassin | ssin. A_i sets A_{i-1}: A_1 sets A_0.

32 An example, H = 1. POS order: assassin | assin | in | n | sassin | sin | ssassin | ssin. A_i sets A_{i-1}: A_4 sets A_3.

33 An example, H = 1. POS order: assassin | assin | in | n | sassin | sin | ssassin | ssin. Go over the array to get the new 2-buckets: lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so "sin" opens a new 2-bucket.

34 An example, H = 2. POS order (unchanged during this stage): assassin | assin | in | n | sassin | sin | ssassin | ssin. A_0: A_i sets A_{i-2} is not possible because i = 0.

35 An example, H = 2. A_i sets A_{i-2}: A_3 sets A_1.

36 An example, H = 2. A_i sets A_{i-2}: A_6 sets A_4.

37 An example, H = 2. A_i sets A_{i-2}: A_7 sets A_5.

38 An example, H = 2. A_i sets A_{i-2}: A_2 sets A_0, but A_0 is already the first in its bucket.

39 An example, H = 2. A_i sets A_{i-2}: A_5 sets A_3.

40 An example, H = 2. A_1: A_i sets A_{i-2} is not possible because i = 1 < 2.

41 An example, H = 2. A_i sets A_{i-2}: A_4 sets A_2.

42 An example, H = 2. Go over the array to get the new 4-buckets: lcp(assassin, assin) = 2 + lcp(sassin, sin) = 2 + 1 = 3, so "assin" opens a new 4-bucket. lcp(ssassin, ssin) = 2 + lcp(assin, in) = 2 + 0 = 2, so "ssin" opens a new 4-bucket.

43 An example, H = 4. POS order: assassin | assin | in | n | sassin | sin | ssassin | ssin. We are done.

44 Complexity analysis. The first stage (bucket sort) is O(N). There are log N stages, each O(N):
- one traversal for the sorting,
- one traversal to determine the new buckets.
Total time complexity is O(N log N).
Space complexity:
- We hold 3 integer arrays: POS; PRM, which is the inverse of POS (PRM[POS[i]] = i); and another array that tells us which suffix was moved last in every bucket.
- We hold 2 Boolean arrays that mark where each bucket begins in this stage and in the previous stage.
- All together: O(N).
We still have to show how we obtained the lcp information.

45 Take a break


47 lcp information: general idea. We used the lcp information to decide where to split buckets for the next iteration. That is why we are only interested in pairs of suffixes A_p, A_q that are in the same H-bucket but will not be in the same 2H-bucket. We would also like to compute it concurrently, while constructing the array.

48 lcp information: general idea. Let's see what we know about such A_p and A_q:
- H ≤ lcp(A_p, A_q) < 2H
- lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})
- lcp(A_{p+H}, A_{q+H}) < H, which is why A_{p+H} and A_{q+H} are in different H-buckets.
Throughout the algorithm we keep track of the lcp value between neighbours in adjacent buckets. What about suffixes that are not in adjacent buckets?

49 lcp information: general idea. Let's notice something: if A_POS[i] < A_POS[j] (that is, i < j), then:
- lcp(A_POS[i], A_POS[j]) = min over i ≤ k < j of lcp(A_POS[k], A_POS[k+1])
- That is, their lcp is the minimum over all the adjacent pairs between them.
[Figure: H = 2, POS order assassin | assin | in | n | sassin | sin | ssassin | ssin, with A_p, A_q, A_{p+H} and A_{q+H} marked and the adjacent lcp values 1, 1, 2; lcp(A_{p+H}, A_{q+H}) = min{1, 1, 2} = 1.]
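The minimum rule can be checked on "assassin" with a small Python sketch. The helper names lcp, adjacent_lcps and lcp_between are made up for this example, and the adjacent values are computed by direct comparison rather than during the sort.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def adjacent_lcps(text, pos):
    """adj[k] = lcp(A_POS[k], A_POS[k+1]) for every adjacent pair in POS order."""
    return [lcp(text[pos[k]:], text[pos[k + 1]:]) for k in range(len(pos) - 1)]


def lcp_between(adj, i, j):
    """lcp(A_POS[i], A_POS[j]) for i < j, as the minimum over the adjacent pairs between them."""
    return min(adj[i:j])


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive POS
adj = adjacent_lcps(text, pos)
print(adj)                      # [3, 0, 0, 0, 1, 1, 2]
print(lcp_between(adj, 4, 7))   # lcp(sassin, ssin) = min{1, 1, 2} = 1
```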

50 lcp information: general idea. So, let's conclude:
- We don't need to hold the lcp of every pair; we can obtain it from the minimum over all the adjacent pairs between the two suffixes.
- We hold an array Hgt[N-1] for that purpose.
- We use interval trees.
- Interval trees are balanced trees that can hold this information for us. Their space complexity is O(N).
- The leaves hold the lcp of adjacent pairs, and each internal node holds the minimum of its children.
- We can then obtain the lcp of any pair in O(log N).
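A tree of this kind can be sketched in Python as an array-based balanced tree with a minimum in every internal node. The class name MinTree and its methods update and query are assumptions for this example, not names from the paper; the leaf values below are the adjacent lcps of "assassin".

```python
class MinTree:
    """Balanced tree over a list of values; each internal node holds the min of its children."""

    def __init__(self, values):
        self.size = 1
        while self.size < len(values):
            self.size *= 2
        # node[1] is the root; the leaves start at index self.size.
        self.node = [float("inf")] * (2 * self.size)
        self.node[self.size:self.size + len(values)] = values
        for k in range(self.size - 1, 0, -1):
            self.node[k] = min(self.node[2 * k], self.node[2 * k + 1])

    def update(self, i, value):
        """Set leaf i to value and repair the O(log N) nodes above it."""
        k = self.size + i
        self.node[k] = value
        k //= 2
        while k >= 1:
            self.node[k] = min(self.node[2 * k], self.node[2 * k + 1])
            k //= 2

    def query(self, lo, hi):
        """Minimum of leaves lo, lo+1, ..., hi-1 in O(log N)."""
        result = float("inf")
        lo += self.size
        hi += self.size
        while lo < hi:
            if lo & 1:
                result = min(result, self.node[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = min(result, self.node[hi])
            lo //= 2
            hi //= 2
        return result


tree = MinTree([9] * 7)                 # N = 8, every leaf starts at N + 1 = 9
for i, h in enumerate([3, 0, 0, 0, 1, 1, 2]):
    tree.update(i, h)                   # adjacent lcps of "assassin" in POS order
print(tree.query(4, 7))                 # 1
```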

51 lcp information: data structures. During the construction stage we build an array called Hgt[N-1]:
- Hgt[i] = lcp(A_POS[i-1], A_POS[i]), initialized so that Hgt[i] = N+1 for every i.
- In stage H = 1: Hgt[i] = 0 for every A_POS[i] that is first in its bucket.
- When moving from stage H to stage 2H, we update every Hgt[i] for which A_POS[i] is the first in a newly created 2H-bucket.

52 lcp information: Hgt[] example
[Figure: the suffixes assin, assassin, in, n, sin, ssin, sassin, ssassin with their Hgt[] entries. At H = 1 the entries are three 0s and four 9s (9 = N+1, not yet known); at H = 2 two of the 9s have become 1.]
lcp(ssin, sin) = 1 + lcp(sin, in) = 1 + min{lcp(in, n), lcp(sin, n)} = 1 + 0 = 1
lcp(sin, sassin) = 1 + lcp(in, assin) = 1 + min{lcp(assin, assassin), lcp(assassin, in)} = 1 + 0 = 1

53 lcp information: Hgt[] example
lcp(assassin, assin) = 2 + lcp(sin, sassin) = 2 + 1 = 3
lcp(ssin, ssassin) = 2 + lcp(in, assin) = 2 + 0 = 2
[Figure: the same suffixes at H = 4; Hgt[] now holds 0, 0, 0, 1, 1 from the earlier stages plus the new values 3 and 2.]

54 lcp information: the interval tree. The Hgt[] cells become the leaves of T.
[Figure: the interval tree T at successive stages, with leaves (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7) holding the Hgt[] values.]

55 Complexity analysis. This is an amortized analysis. Each time a suffix opens a new bucket, we change Hgt[i] for its leaf. That change requires O(log N) changes in the interval tree. There are O(N) leaves opening a new bucket. Total time complexity: O(N log N) for all operations together.


57 Equally likely strings. When we sorted POS, we said that we stop when all suffixes are in their own buckets. So the prefix length we actually have to sort by is governed not by N but by r, the length of the longest repeated substring. If we assume all strings are equally likely, the longest repeated substring is expected to have length 2 log_|Σ| N. That means we can limit the number of stages to log(2 log_|Σ| N), which is expected to be O(log log N). The total complexity of the sort is therefore expected to be O(N log log N).
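The quantity r can be read off a sorted suffix array as the largest lcp between neighbours. A minimal Python sketch follows; the function name longest_repeat_length is an assumption for this example, and it builds POS naively.

```python
def longest_repeat_length(text):
    """Length r of the longest substring that occurs at least twice in text."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive POS
    best = 0
    for k in range(1, len(pos)):
        a, b = text[pos[k - 1]:], text[pos[k]:]
        h = 0
        while h < len(a) and h < len(b) and a[h] == b[h]:
            h += 1
        best = max(best, h)  # the longest repeat shows up between two neighbours
    return best


print(longest_repeat_length("assassin"))  # 3, the repeated substring "ass"
```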


59 Skew algorithm
Step 1: SA_≠0 = sort the suffixes starting at positions i ≠ 0 mod 3 (recursively).
Step 2: SA_=0 = sort the suffixes starting at positions i = 0 mod 3.
Step 3: SA = merge SA_=0 and SA_≠0.
(The following slides write SA_12 for SA_≠0 and SA_0 for SA_=0.)
Example: S = mississippi, positions 0 to 10.

60 Skew algorithm step 1: sort the suffixes starting at positions i ≠ 0 mod 3 by their first 3 symbols.
s = m i s s i s s i p p i (positions 0 to 10, padded at positions 11 and 12)
Radix sort of the 3-symbol blocks gives the names 3 3 2 1 for the positions i mod 3 = 1 and 5 5 4 for the positions i mod 3 = 2:
s_12 = 3 3 2 1 5 5 4
Recursive call on s_12:
SA_12 = 3 2 1 0 6 5 4
That means sorting "3321", "321", "21", "1", "554", "54", "4".

61 Skew algorithm step 2: sort the suffixes starting at positions i = 0 mod 3. This step takes time linear in the length of the string we are dealing with (which may be the string of a recursive call). The suffixes S_i with i mod 3 = 0 are sorted by sorting the pairs (s[i], S_{i+1}). We want to compare (s[j], S_{j+1}) and (s[i], S_{i+1}) with i mod 3 = j mod 3 = 0:
- The relative order of S_{j+1} and S_{i+1} can be determined from their positions in SA_12 in constant time, if we prepare SA_12_inv, the inverse array of SA_12 (as we did in the O(N log N) algorithm).

62 Skew algorithm step 3: merging SA_0 and SA_12 to get SA. To compare a suffix S_j with j mod 3 = 0 with a suffix S_i with i mod 3 ≠ 0, we distinguish two cases:
- If i mod 3 = 1, we write S_i as (s[i], S_{i+1}) and S_j as (s[j], S_{j+1}). Since (i+1) mod 3 = 2 and (j+1) mod 3 = 1, both S_{i+1} and S_{j+1} appear in SA_12, so their relative order is given by SA_12_inv.
- If i mod 3 = 2, we compare the triples (s[i], s[i+1], S_{i+2}) and (s[j], s[j+1], S_{j+2}), using the lexicographic names of S_{i+2} and S_{j+2} in SA_12_inv.
Example with s = mississippi (positions 0 to 10): S_3 = ('s', 'issippi') and S_6 = ('s', 'ippi'); SA_12_inv('issippi') > SA_12_inv('ippi') ⇒ S_3 > S_6.

63 Skew algorithm step 3: merging SA_0 and SA_12 to get SA.
s = m i s s i s s i p p i (positions 0 to 10, padded at positions 11 and 12)
SA_12 = 3 2 1 0 6 5 4, which corresponds to the text positions 10 7 4 1 8 5 2.
From step 2 we get SA_0 = 0 9 6 3.
Merging them gives SA = 10 7 4 1 0 9 8 6 3 5 2.
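Steps 2 and 3 can be tried on "mississippi" with the Python sketch below. It is deliberately not the linear-time algorithm: step 1 is replaced by a direct sort of the suffixes at positions i ≠ 0 mod 3, with no radix sort and no recursion, and step 2 uses a comparison sort. The names skew_merge_demo, rank12 and ch are made up for this example.

```python
def skew_merge_demo(s):
    """Steps 2 and 3 of the skew algorithm, with step 1 replaced by a naive sort."""
    n = len(s)
    # Step 1 (simplified): sort the suffixes at positions i mod 3 != 0 directly.
    sa12 = sorted((i for i in range(n) if i % 3 != 0), key=lambda i: s[i:])
    # rank12[i] plays the role of the inverse of SA_12; 0 means "past the end of s".
    rank12 = [0] * (n + 3)
    for r, i in enumerate(sa12):
        rank12[i] = r + 1

    def ch(i):
        return s[i] if i < n else ""  # padding: sorts before every real character

    # Step 2: sort the suffixes at positions i mod 3 == 0 by the pair (s[i], S_{i+1}).
    sa0 = sorted((i for i in range(n) if i % 3 == 0),
                 key=lambda i: (ch(i), rank12[i + 1]))

    # Step 3: merge, comparing S_i (i mod 3 != 0) with S_j (j mod 3 == 0).
    def smaller(i, j):
        if i % 3 == 1:
            return (ch(i), rank12[i + 1]) < (ch(j), rank12[j + 1])
        return (ch(i), ch(i + 1), rank12[i + 2]) < (ch(j), ch(j + 1), rank12[j + 2])

    sa, a, b = [], 0, 0
    while a < len(sa12) and b < len(sa0):
        if smaller(sa12[a], sa0[b]):
            sa.append(sa12[a])
            a += 1
        else:
            sa.append(sa0[b])
            b += 1
    return sa + sa12[a:] + sa0[b:]


print(skew_merge_demo("mississippi"))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```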

64 Time complexity analysis
Step 1: O(n) + T(2n/3)
Step 2: O(n)
Step 3: O(n)
Solving the recurrence: T(n) = O(n) + T(2n/3) = O(n), because the work per level shrinks geometrically: n + (2/3)n + (2/3)^2 n + … ≤ 3n.


66 Some results from the first paper Empirical results for texts of length 100,000.

67 THE END
FIBA Europe League, group B
   Team                       GP  W/L  F/A      Pts
1. Strauss Iscar Nahariya     4   4/0  360/297  8
2. JDA Dijon                  4   3/1  347/311  7
3. Besiktas Istanbul          4   3/1  339/320  7
4. Ionikos Amaliada           4   3/1  344/336  7
5. Azovmash Mariupol          4   1/3  329/344  5
6. RBC Verviers-Pepinster     4   1/3  284/307  5
7. BS|Energy Braunschweig     4   1/3  261/317  5
8. Ural Great Perm            4   0/4  269/301  4

