1 Suffix Arrays: A new method for on-line string searches Udi Manber Gene Myers May 1989 Presented by: Oren Weimann.


1 1 Suffix Arrays: A new method for on-line string searches Udi Manber Gene Myers May 1989 Presented by: Oren Weimann

2 2 Introduction - Problem definition: “Is W a substring of A?” $|A| = N$ and $|W| = P$. $A = a_0 a_1 \dots a_{N-1}$. $A_i$ = suffix beginning at index $i$ = $a_i a_{i+1} \dots a_{N-1}$. Example: A = abccbbadgfbbcahgjf, W = badgfbb (W occurs in A starting at index 5).

3 3 Introduction – what is a suffix array? Example: A = assassin (indices 01234567). Pos = [0, 3, 6, 7, 2, 5, 1, 4], so Pos[2] = 6 ($A_6$ = in).

4 4 Introduction – what is a suffix array? A lexicographically sorted array, Pos[N], of all the suffixes of A: Pos[k] = i $\iff$ $A_i$ is the kth smallest suffix (counting from k = 0) in the set $\{A_0, A_1, A_2, \dots, A_{N-1}\}$.
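A minimal Python sketch of this definition, assuming a naive construction that simply sorts the suffix start indices by comparing the suffixes directly. The name `naive_suffix_array` is hypothetical, and this is not the O(N log N) method the slides describe later; it only illustrates what Pos contains.

```python
def naive_suffix_array(text):
    """Return Pos: suffix start indices in lexicographic order of the suffixes.

    Naive sketch: each comparison key is a full suffix slice, so this is
    much slower than the construction described in the slides.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


pos = naive_suffix_array("assassin")
print(pos)  # [0, 3, 6, 7, 2, 5, 1, 4]
for k, i in enumerate(pos):
    # k is the rank of the suffix, i is where it starts in the text
    print(k, i, "assassin"[i:])
```

Here `pos[2] == 6`, matching the example Pos[2] = 6 for the suffix "in".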

5 5 Introduction – what is a suffix tree? A trie that contains all suffixes of A. Example: A = assassin (indices 01234567). [Figure: suffix trie of assassin, with each leaf labeled by the starting index of its suffix.]

6 6 The Article Overview 1. A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). 2. How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known) 3. An Algorithm for computing the lcp information in O(NlogN). 4. Algorithms for Expected-time improvement.

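7 7 The Search algorithm - Definitions. For any string $u$, $u^p = u_1 u_2 u_3 \dots u_p$ (or $u$ if $|u| \le p$). Let $\le$ denote lexicographical order. We say $u \le_p v \iff u^p \le v^p$ (and similarly for $<_p$, $=_p$, $>_p$, $\ge_p$). Note that for any choice of $p$: $A_{Pos[0]} \le_p A_{Pos[1]} \le_p \dots \le_p A_{Pos[N-1]}$. Note that W is a substring of A $\iff$ there is an $i$ such that $W =_P A_i$.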

8 8 The Search algorithm – how does the array help us know if W is a substring of A? We define a search interval: $L_W = \min\{k \mid W \le_P A_{Pos[k]} \text{ or } k = N\}$ and $R_W = \max\{k \mid W \ge_P A_{Pos[k]} \text{ or } k = -1\}$. W matches $a_i a_{i+1} \dots a_{i+P-1}$ $\iff$ $i = Pos[k]$ for some $k \in [L_W, R_W]$.

9 9 Example: A = assassin (indices 01234567). [Figure: the Pos array with three example search intervals, labeled Option 1, Option 2 and Option 3.]

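10 10 Why finding $L_W$, $R_W$ is the same as finding the matches: if $L_W > R_W$ then W is not a substring of A. Otherwise there are $(R_W - L_W + 1)$ matches: $A_{Pos[L_W]}, \dots, A_{Pos[R_W]}$. [Figure: the Pos array, with $W > A_{Pos[k]}$ to the left of $L_W$ and $W < A_{Pos[k]}$ to the right of $R_W$.]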

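11 11 The Search algorithm – the easy way, O(P log N). A binary search over Pos takes O(log N) iterations. Each iteration sets new L, R bounds (initially L=0, R=N-1) according to a comparison of W with $A_{Pos[M]}$, where M=(L+R)/2; each comparison costs O(P). In the end the bounds converge on $L_W$ (and, with the symmetric search, on $R_W$). [Figure: suffixes abcde..., abcdf..., abd... at positions L, M, R of Pos, with W = “abcx”.]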
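The easy O(P log N) search can be sketched in Python as two binary searches, one for $L_W$ and one for $R_W$. This is a sketch under assumptions: the function name `search_interval` is hypothetical, it uses half-open bisection bounds rather than the slide's L, M, R bookkeeping, and it compares P-symbol prefixes by slicing.

```python
def search_interval(text, pos, w):
    """Return (L_W, R_W) for pattern w, given the suffix array pos of text.

    Each step compares w with the first len(w) symbols of one suffix,
    so the whole search costs O(P log N) symbol comparisons.
    """
    n, p = len(text), len(w)

    # L_W: smallest k with w <=_P A_Pos[k], or n if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + p] < w:
            lo = mid + 1
        else:
            hi = mid
    lw = lo

    # R_W: largest k with w >=_P A_Pos[k], or -1 if there is none.
    lo, hi = lw, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + p] <= w:
            lo = mid + 1
        else:
            hi = mid
    rw = lo - 1
    return lw, rw


pos = [0, 3, 6, 7, 2, 5, 1, 4]  # suffix array of "assassin"
lw, rw = search_interval("assassin", pos, "ss")
print(lw, rw, [pos[k] for k in range(lw, rw + 1)])  # 6 7 [1, 4]
```

When the returned `lw` is greater than `rw`, the pattern does not occur in the text.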

12 12 The Search algorithm using lcp values in O(P+logN) – Definitions: speedup using precomputed lcp values (for now we assume lcp is known). In each iteration we define: – $l = lcp(A_{Pos[L]}, W)$ – $r = lcp(W, A_{Pos[R]})$ – $Llcp[M] = lcp(A_{Pos[L]}, A_{Pos[M]})$ – $Rlcp[M] = lcp(A_{Pos[M]}, A_{Pos[R]})$

13 13 The Search algorithm using lcp values in O(P+logN). Example: W = “abcx”, with $A_{Pos[L]}$ = abcde..., $A_{Pos[M]}$ = abcdf..., $A_{Pos[R]}$ = abd... Then l = 3, Llcp[M] = 4, Rlcp[M] = 2, r = 2. Note that Llcp[M] is well defined because every midpoint M arises from exactly one pair of bounds $L_M$ and $R_M$.

14 14 So how do we use l, r, Llcp[M]? Example: W = abcx, l = 3, r = 2. Case 1: Llcp[M] > l (Llcp[M]=4 and l=3). $W > A_{Pos[L]}$ $\Rightarrow$ $W > A_{Pos[M]}$ $\Rightarrow$ go right $\Rightarrow$ l is unchanged (= 3). [Figure: L = abcde..., M = abcdf…, R = abd…, Llcp[M]=4.]

15 15 Example: W = abcx (cont.), l = 3, r = 2. Case 2: Llcp[M] < l (Llcp[M]=2 and l=3). $A_{Pos[L]} < A_{Pos[M]}$ $\Rightarrow$ $W < A_{Pos[M]}$ $\Rightarrow$ go left $\Rightarrow$ r = Llcp[M] = 2. [Figure: L = abcde..., M = abdf…, R = abd…, Llcp[M]=2.]

16 16 Example: W = abcx (cont.), l = 3, r = 2. Case 3: Llcp[M] = l (Llcp[M]=3 and l=3). Compare $w_l$ with $a_{Pos[M]+l}$, then the following symbols, until $w_{l+j} \ne a_{Pos[M]+l+j}$ $\Rightarrow$ go right or left according to the order of $w_{l+j}$ and $a_{Pos[M]+l+j}$ $\Rightarrow$ the new l or r = (l+j) $\Rightarrow$ number of comparisons = j+1. [Figure: L = abcde..., M = abcp…, R = abd…, Llcp[M]=3.]

17 17 The Search algorithm using lcp values - complexity. In each iteration there are at most j+1 comparisons, where in total $\sum j \le P$ $\Rightarrow$ total comparisons $\le$ (P + #iterations) $\Rightarrow$ O(P+logN) running time. Requires only three arrays of size N.
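The three cases can be put together in a short Python sketch that finds $L_W$. It rests on several assumptions that are not in the slides: the name `find_lw` is hypothetical, the text is non-empty, and, to stay short, `Llcp[M]` and `Rlcp[M]` are recomputed by direct comparison instead of being read from precomputed tables, so this sketch does not reach the O(P+logN) bound by itself. It only shows how l, r and the two lcp values steer the search.

```python
def lcp(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:]."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def find_lw(text, pos, w):
    """Return L_W = min{k : w <=_P A_Pos[k]}, or N if there is no such k."""
    n, p = len(text), len(w)

    def w_le_suffix(k, m):
        # Given m = lcp(w, A_Pos[k]), decide whether w <=_P A_Pos[k].
        if m == p:
            return True            # w is a prefix of the suffix
        if pos[k] + m == n:
            return False           # the suffix is a proper prefix of w
        return w[m] < text[pos[k] + m]

    l = lcp(text, pos[0], w, 0)
    r = lcp(text, pos[n - 1], w, 0)
    if w_le_suffix(0, l):
        return 0
    if not w_le_suffix(n - 1, r):
        return n

    L, R = 0, n - 1
    while R - L > 1:
        M = (L + R) // 2
        if l >= r:
            llcp = lcp(text, pos[L], text, pos[M])  # table lookup Llcp[M] in the real algorithm
            # Cases 1 and 3 extend the match from l; case 2 stops at Llcp[M].
            m = l + lcp(text, pos[M] + l, w, l) if llcp >= l else llcp
        else:
            rlcp = lcp(text, pos[M], text, pos[R])  # table lookup Rlcp[M] in the real algorithm
            m = r + lcp(text, pos[M] + r, w, r) if rlcp >= r else rlcp
        if w_le_suffix(M, m):
            R, r = M, m            # go left
        else:
            L, l = M, m            # go right
    return R


pos = [0, 3, 6, 7, 2, 5, 1, 4]  # suffix array of "assassin"
print(find_lw("assassin", pos, "ss"))  # 6
```

The symmetric search for $R_W$ follows the same pattern with the roles of the two bounds exchanged.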

18 18 The Article Overview 1. A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). 2. How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known) 3. An Algorithm for computing the lcp information in O(NlogN). 4. Algorithms for Expected-time improvement.

19 19 Construction of suffix array in O(NlogN). We sort the suffixes with a special kind of radix sort that has O(logN) stages (numbered 1, 2, 4, 8, 16, …). In stage H the suffixes are sorted into buckets, called H-buckets, according to their first H characters. (The next stage is 2H; thus, in stage H the suffixes are sorted by $\le_H$.)

20 20 Construction of suffix array – The general idea. If $A_i$ and $A_j$ are in the same H-bucket, we sort them by their next H symbols. But their next H symbols are the first H symbols of $A_{i+H}$ and $A_{j+H}$, which are already sorted in stage H. [Figure, H=2: first bucket abef… ($A_i$), abcd… ($A_j$), ab…; second bucket bb..., bb…; third bucket cd… ($A_{j+H}$); fourth bucket ef… ($A_{i+H}$).]

21 21 Construction of suffix array – The general idea (cont.). Let $A_i$ be in the first H-bucket after stage H. $A_i$ starts with the smallest H-symbol string, so $A_{i-H}$ should be first in its H-bucket. [Figure, H=2: suffixes abef…, abcd…, ab…, bb..., bb…, cdef…, cdab…, ef…, with $A_i$ and $A_{i-H}$ marked.]
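The doubling idea from these two slides (order within an H-bucket is decided by the H-bucket of the suffix H positions later) can be sketched in Python. This is a simplified variant, not the bucket-moving procedure of the slides: it re-sorts by the pair (bucket of $A_i$, bucket of $A_{i+H}$) with a comparison sort, which costs O(N log² N) overall instead of O(N log N). The name `suffix_array_doubling` is hypothetical.

```python
def suffix_array_doubling(text):
    """Build the suffix array by prefix doubling (simplified sketch)."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]   # stage 1: bucket = first character
    pos = list(range(n))
    h = 1
    while True:
        def key(i):
            # Bucket of A_i, then bucket of A_{i+h}; -1 if A_{i+h} is empty.
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)           # now sorted by the first 2h symbols

        # Assign 2h-bucket numbers: a new bucket opens when the key changes.
        new_rank = [0] * n
        for k in range(1, n):
            opens_new = key(pos[k]) != key(pos[k - 1])
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (1 if opens_new else 0)
        rank = new_rank

        if rank[pos[-1]] == n - 1:  # every suffix is alone in its bucket
            return pos
        h *= 2


print(suffix_array_doubling("assassin"))  # [0, 3, 6, 7, 2, 5, 1, 4]
```

The loop stops as soon as all buckets have size one, which for "assassin" happens after the stages h = 1 and h = 2.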

22 22 Construction of suffix array – The algorithm. Go over the suffix array: for each $A_i$, move $A_{i-H}$ to the next available place in its H-bucket. The suffixes are now sorted according to $\le_{2H}$ order. Go over the array again and decide which suffixes open a new 2H-bucket, using lcp knowledge (described later).

23 23 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=1. Array: assin, assassin, in, n, sin, ssin, sassin, ssassin. $A_i$ sets $A_{i-1}$: $A_3$ sets $A_2$ (sassin).

24 24 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=1. Array: assin, assassin, in, n, sassin, ssin, sin, ssassin. $A_i$ sets $A_{i-1}$: $A_i = A_0$, so there is no $A_{i-1}$ to set.

25 25 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=1. Array: assin, assassin, in, n, sassin, ssin, sin, ssassin. $A_i$ sets $A_{i-1}$: $A_6$ sets $A_5$ (sin).

26 26 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=1. Array: assin, assassin, in, n, sassin, sin, ssin, ssassin. $A_i$ sets $A_{i-1}$: $A_7$ sets $A_6$ (in).

27 27 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=1. Array: assin, assassin, in, n, sassin, sin, ssin, ssassin. $A_i$ sets $A_{i-1}$: $A_2$ sets $A_1$ (ssassin).

28 28 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=1. Array: assin, assassin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-1}$: $A_5$ sets $A_4$ (ssin).

29 29 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=1. Array: assin, assassin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-1}$: $A_1$ sets $A_0$ (assassin).

30 30 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=1. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-1}$: $A_4$ sets $A_3$ (assin).

31 31 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=1. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. Go over the array to get the new 2-buckets: lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so “sin” opens a new 2-bucket.

32 32 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=2. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-2}$: $A_i = A_0$, so there is no $A_{i-2}$ to set.

33 33 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=2. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-2}$: $A_3$ sets $A_1$ (ssassin).

34 34 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=2. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-2}$: $A_6$ sets $A_4$ (ssin).

35 35 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=2. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-2}$: $A_7$ sets $A_5$ (sin).

36 36 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=2. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-2}$: $A_2$ sets $A_0$ (assassin).

37 37 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=2. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-2}$: $A_5$ sets $A_3$ (assin).

38 38 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=2. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-2}$: $A_i = A_1$, so there is no $A_{i-2}$ to set.

39 39 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=2. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. $A_i$ sets $A_{i-2}$: $A_4$ sets $A_2$ (sassin).

40 40 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=2. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. Go over the array to get the new 4-buckets.

41 41 Construction of suffix array – The algorithm. Example: A = assassin (indices 01234567), H=4. Array: assassin, assin, in, n, sassin, sin, ssassin, ssin. That’s it, we are sorted!

42 42 Construction of suffix array – Complexity summary. Sorting by first char: O(N). O(logN) stages of O(N) operations each = O(NlogN). Total time: O(NlogN). Space: 2 integer arrays of size N.

43 43 The Article Overview 1. A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). 2. How to construct Pos[ ] in O(NlogN) time and O(N) space. 3. An Algorithm for computing the lcp information in O(NlogN). 4. Algorithms for Expected-time improvement.

44 44 How to find Longest Common Prefixes – the general idea. We don’t care what the lcp is between suffixes in the same H-bucket. For $A_p$, $A_q$ in the same H-bucket but different 2H-buckets: – $H \le lcp(A_p, A_q) < 2H$ – $lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})$ – $lcp(A_{p+H}, A_{q+H}) < H$ $\Rightarrow$ that is why $A_{p+H}$, $A_{q+H}$ are in different H-buckets, but which ones?

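45 45 How to find Longest Common Prefixes – the general idea. If $A_{p+H}$ and $A_{q+H}$ are in adjacent H-buckets, then their lcp is already known: it was recorded when the boundary between those two buckets was created. If not, then for $i < j$: $lcp(A_{Pos[i]}, A_{Pos[j]}) = \min_{k \in [i, j-1]} \{lcp(A_{Pos[k]}, A_{Pos[k+1]})\}$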

46 46 How to find Longest Common Prefixes – the general idea. Example (H=2): in the array assassin, assin, in, n, sassin, sin, ssassin, ssin, the adjacent lcp values between $A_{q+H}$ and $A_{p+H}$ are 1, 1, 2, so $lcp(A_{p+H}, A_{q+H}) = \min\{1, 1, 2\} = 1$. Notice that if two neighbors are in the same H-bucket we can consider their lcp to be H, since $lcp(A_{p+H}, A_{q+H}) < H$.
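The range-minimum identity can be checked with a few lines of Python. This is a sketch under assumptions: the names `lcp_len`, `hgt_array` and `lcp_from_hgt` are hypothetical, the array of adjacent lcp values is computed here by direct comparison rather than during construction, and the first entry (which has no left neighbor) is set to 0.

```python
def lcp_len(text, i, j):
    """Length of the longest common prefix of the suffixes text[i:] and text[j:]."""
    n = 0
    while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
        n += 1
    return n


def hgt_array(text, pos):
    """hgt[k] = lcp(A_Pos[k-1], A_Pos[k]) for k >= 1; hgt[0] is set to 0."""
    hgt = [0] * len(pos)
    for k in range(1, len(pos)):
        hgt[k] = lcp_len(text, pos[k - 1], pos[k])
    return hgt


def lcp_from_hgt(hgt, i, j):
    """lcp of the suffixes at sorted positions i < j, as a minimum of adjacent lcps."""
    return min(hgt[i + 1:j + 1])


text = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]       # suffix array of "assassin"
hgt = hgt_array(text, pos)
print(hgt)                            # [0, 3, 0, 0, 0, 1, 1, 2]
print(lcp_from_hgt(hgt, 4, 7))        # min{1, 1, 2} = 1, i.e. lcp(sassin, ssin)
```

Taking the minimum over a slice is linear in the range length; the interval tree in the following slides is what makes such queries logarithmic.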

47 47 How to find lcp – algorithm and data structures – Hgt[]. During the construction stage we build an array called Hgt[N], where $Hgt[i] = lcp(A_{Pos[i-1]}, A_{Pos[i]})$, initialized so that Hgt[i] = N+1 for every i. In stage H=1: Hgt[i] = 0 for every $A_{Pos[i]}$ that is first in its bucket. In stage 2H: we update Hgt[i] for every $A_{Pos[i]}$ that is the first in a newly created 2H-bucket.

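48 48 How to find lcp – Hgt[] example. [Figure: the Hgt[] values shown under the suffix array for H=1 and H=2; entries are 0 where a bucket opened in stage 1, 1 where a 2-bucket opened in stage 2, and 9 (= N+1) where not yet set.] lcp(ssin, sin) = 1 + lcp(sin, in) = 1 + min{lcp(in, n), lcp(sin, n)} = 1 + 0 = 1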

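49 49 How to find lcp – Hgt[] example (cont.), H=4. [Figure: the Hgt[] values after the last stage; the two entries that were still unset become 3 and 2.] lcp(assassin, assin) = 2 + lcp(sin, sassin) = 2 + 1 = 3. lcp(ssin, ssassin) = 2 + lcp(in, assin) = 2 + 0 = 2. In sorted order (assassin, assin, in, n, sassin, sin, ssassin, ssin) the final adjacent lcp values are 3, 0, 0, 0, 1, 1, 2.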

50 50 How to find lcp – data structures. We need a data structure that can give $lcp(A_{Pos[i]}, A_{Pos[j]})$ for any i and j (not just i and i+1, which Hgt[] supplies). Hgt[] will become the leaves of a balanced binary tree called the interval tree.

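51 51 How to find lcp – example of interval tree. [Figure: interval tree whose leaves are the adjacent pairs (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7), each holding its Hgt[] value, with internal nodes holding the minimum over their interval.]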

52 52 How to find lcp – Complexity. Each time a leaf opens a new bucket we change Hgt[i] for that leaf. That change requires O(logN) changes in the interval tree. There are O(N) leaves opening a new bucket. In total we get O(NlogN) to get all lcp values.
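A balanced tree with the adjacent lcp values at its leaves and minima at its internal nodes can be sketched in Python as an array-based segment tree. This is an assumption about the layout, not necessarily the interval tree of the article, and the class name `MinTree` is hypothetical. A leaf update touches O(logN) nodes, which is the cost counted above.

```python
class MinTree:
    """Array-based min segment tree: leaves hold values, parents hold minima."""

    def __init__(self, values):
        self.n = len(values)
        # tree[n:] are the leaves; tree[1:n] are internal nodes; tree[0] is unused.
        self.tree = [0] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, index, value):
        """Set one leaf and repair its O(log n) ancestors."""
        i = index + self.n
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def query(self, lo, hi):
        """Minimum of the leaves with index in the half-open range [lo, hi)."""
        result = float("inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                result = min(result, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = min(result, self.tree[hi])
            lo //= 2
            hi //= 2
        return result


# Adjacent lcp values for the sorted suffixes of "assassin" (index 0 unused, set to 0).
tree = MinTree([0, 3, 0, 0, 0, 1, 1, 2])
print(tree.query(5, 8))  # lcp of sorted suffixes 4 and 7: min{1, 1, 2} = 1
tree.update(7, 0)
print(tree.query(5, 8))  # 0 after the leaf update
```

With this layout, the lcp of the suffixes at sorted positions i < j is `query(i + 1, j + 1)`.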

53 53 The Article Overview 1. A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). 2. How to construct Pos[ ] in O(NlogN) time and O(N) space. 3. An Algorithm for computing the lcp information in O(NlogN). 4. Algorithms for Expected-time improvement.

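54 54 Expected-case time improvement of the construction of Pos[]. Assumption: all N-symbol strings are equally likely. Under this assumption, the expected length of the longest repeated substring is $O(\log_{|\Sigma|} N)$. This immediately implies that construction of Pos[] is reduced to O(N loglog N). Why? Once H exceeds the length of the longest repeated substring, every suffix is alone in its bucket, and since H doubles in each stage that takes only O(loglog N) stages. Next is a way to reduce it to O(N).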

55 55 Expected-case time improvement of the construction of Pos[]. Let $T = \lfloor \log_{|\Sigma|} N \rfloor$. We encode each possible string of length T as an integer with the isomorphism $Int_T(u)$. Map each $A_p$ to $Int_T(A_p) \in [0, |\Sigma|^T - 1]$: – $Int_T(A_p) = a_p |\Sigma|^{T-1} + a_{p+1} |\Sigma|^{T-2} + \dots + a_{p+T-1} |\Sigma|^0$

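56 56 Example of the mapping. $Int_T(A_p) = a_p |\Sigma|^{T-1} + \lfloor Int_T(A_{p+1}) / |\Sigma| \rfloor$. Here $|\Sigma| = 4$ with a=0, i=1, n=2, s=3; N=8; $T = \lfloor \log_4 8 \rfloor = 1$. For A = assassin: a: 0*4^0 + 0 = 0; s: 3*4^0 + 0 = 3; s: 3*4^0 + 0 = 3; a: 0*4^0 + 0 = 0; s: 3*4^0 + 0 = 3; s: 3*4^0 + 0 = 3; i: 1*4^0 + 0 = 1; n: 2*4^0 + 0 = 2.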
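A small Python sketch of computing all the codes in one right-to-left pass, using the fact that dropping the last digit of the next suffix's code and putting the current symbol in front gives the current code. Assumptions that are not in the slides: the name `int_t_values` is hypothetical, symbols are numbered by their sorted order, positions past the end of the text count as 0, the alphabet has at least two symbols, and T is at least 1.

```python
def int_t_values(text, alphabet, t=None):
    """Return [Int_T(A_0), ..., Int_T(A_{N-1})] in O(N) arithmetic steps."""
    sigma = len(alphabet)              # assumed to be at least 2
    code = {c: k for k, c in enumerate(sorted(alphabet))}
    n = len(text)
    if t is None:
        # T = floor(log_sigma(N)), computed with integers; kept at least 1.
        t = 0
        while sigma ** (t + 1) <= n:
            t += 1
        t = max(t, 1)
    top = sigma ** (t - 1)             # weight of the leading symbol

    values = [0] * (n + 1)             # values[n] = 0 stands for the empty suffix
    for p in range(n - 1, -1, -1):
        # Leading symbol in front, then the first T-1 symbols of A_{p+1}.
        values[p] = code[text[p]] * top + values[p + 1] // sigma
    return values[:n]


print(int_t_values("assassin", "ains"))       # [0, 3, 3, 0, 3, 3, 1, 2]
print(int_t_values("assassin", "ains", t=2))  # [3, 15, 12, 3, 15, 13, 6, 8]
```

Sorting the suffixes by these integers places them directly into T-buckets, which is what lets the radix sort start at a later stage.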

57 57 Expected-case time improvement of the construction of Pos[]. By the definition of $Int_T(A_p)$, it takes O(N) to compute the $Int_T(A_p)$ values of all suffixes. So now, instead of starting with H=1, we start with H=T. But since the expected length of the longest repeated substring is $O(\log_{|\Sigma|} N)$, we will have O(1) stages of the radix sort. Thus, the total expected time for constructing Pos[] is O(N).

58 58 So is a suffix array better than a suffix tree?

| | Suffix array | Suffix tree |
|---|---|---|
| Construction time | O(NlogN); for small $|\Sigma|$, O(N) (needs additional space) | O(N) |
| Search time complexity | O(P+logN), good for large alphabets | O(P log$|\Sigma|$) |
| Space complexity | requires 2N integers (this is the main advantage) | O(N) |
| Dependent on $|\Sigma|$? | No | Yes |
