Presentation is loading. Please wait.

Presentation is loading. Please wait.

Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }

Similar presentations


Presentation on theme: "Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }"— Presentation transcript:

1 Suffix trees

2 Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }

3 Trie (Cont) Assume no string is a prefix of another a b c e e f d b f e g 1) Each edge is labeled by a letter, 2) No two edges outgoing from the same node are labeled the same. 3) Each string corresponds to a leaf.

4 Compressed Trie Compress unary nodes, label edges by strings a b c e e f d b f e g a bbf c eef d e g 

5 Suffix tree Given a string s a suffix tree of s is a compressed trie of all suffixes of s To make these suffixes prefix-free we add a special character, say $, at the end of s

6 Suffix tree (Example) Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$ { $ b$ ab$ bab$ abab$ } a b a b $ a b $ b $ $ $ Note that a suffix tree has O(n) nodes n = |s| (why?)

7 Trivial algorithm to build a Suffix tree Put the largest suffix in Put the suffix bab$ in a b a b $ a b a b $ a b $ b

8 Put the suffix ab$ in a b a b $ a b $ b a b a b $ a b $ b $

9 Put the suffix b$ in a b a b $ a b $ b $ a b a b $ a b $ b $ $

10 Put the suffix $ in a b a b $ a b $ b $ $ a b a b $ a b $ b $ $ $

11 We will also label each leaf with the starting point of the corres. suffix. a b a b $ a b $ b $ $ $ 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $

12 Analysis Takes O(n 2 ) time to build. It is possible to construct it in O(n) time But, how come? does it take O(n) space ?

13 $ Linear space ? Consider the string aaaaaabbbbbb a c bbbbbb$ ab a a a b b b b b$ bbbbbb$ $ $ $ $ $ abbbbbb$

14 To use only O(n) space encode the edge-labels a c bbbbbb$ ab a a a b b b b b$ bbbbbb$ (7,13) $ $ $ $ $ Consider the string aaaaaabbbbbb $ abbbbbb$

15 To use only O(n) space encode the edge-labels (1,1) c (7,13) Consider the string aaaaaabbbbbb (6,13) (1,1) (13,13) (7,7) (13,13) (12,13) (13,13) (7,7)

16 What can we do with it ? Exact string matching: Given a text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide if it occurs in T. We may also want to find all occurrences of P in T

17 Exact string matching In preprocessing we just build a suffix tree in O(n) time 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ Given a pattern P = ab we traverse the tree according to the pattern.

18 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(m+k) time

19 Generalized suffix tree Given a set of strings S a generalized suffix tree of S is a compressed trie of all suffixes of s  S To associate each suffix with a unique string in S add a different special char to each s

20 Generalized suffix tree (Example) Let s 1 =abab and s 2 =aab here is a generalized suffix tree for s 1 and s 2 { $ # b$ b# ab$ ab# bab$ aab# abab$ } 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 #

21 So what can we do with it ? Matching a pattern against a database of strings

22 Longest common substring (of two strings) Every node with a leaf descendant from string s 1 and a leaf descendant from string s 2 represents a common substring. A maximal common substring corresponds to such node. 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 # Find such node with largest “string depth”

23 Lowest common ancetors A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

24 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 #

25 Lowest common ancestors

26 Write an Euler tour of the tree 3 4 5 1 6 7 2 4171356533123 1 2 3 4 6 9 10 85 7 1112 0 LCA(1,5) = 3 Shallowest node

27 3 4 5 1 6 7 2 4171356533123 1 2 3 4 6 9 10 85 7 1112 0 2121012100110 minimum Node id Level

28 Range minimum 3 4 5 1 6 7 2 4171356533123 1 2 3 4 6 9 10 85 7 1112 0 2121012100110 minimum Preprocess an array, such that given i,j you can find the minimum in [i,j] fast Reduction takes linear time

29 Trivial algorithms for RMQ O(n) space, O(n) query time O(n 2 ) space, O(1) query time

30 Less trivial algorithms to RMQ Try to use O(nlog(n)) space to do a query in O(1) time.. Simpler: O(n + sqrt(n)) to do a query in sqrt(n)

31 Lowest common ancetors A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

32 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 #

33 Finding maximal palindromes A palindrome: caabaac, cbaabc Want to find all maximal palindromes in a string s Let s = cbaaba The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position n-i+2 of s r

34 Maximal palindromes algorithm Prepare a generalized suffix tree for s = cbaaba$ and s r = abaabc# For every i, find the LCA of suffix i of s and suffix n-i+2 of s r

35 3 a abab a baaba$baaba$ b 3 $ 7 $ b 7 # c 1 6 abab c#c# 5 2 2 a$a$ c#c# a 5 6 $ 4 4 1 c#c# a$a$ $ abc#abc# c#c# Let s = cbaaba$ then s r = abaabc#

36 Analysis O(n) time to identify all palindromes

37 Suffix trees for compression

38 aabcbbabbsb (2,5) bbabbabcbbabbabbsb… S[1..n] = Compression aabcbbabbsbabcbbbbabbabcbbabbabbsb…

39 aabcbbabbsb (2,5) bbabbabcbbabbabbsb… S[1..n] = Let prior i =S[i..i+L i -1] = S[s i..s i +L i -1] be this prefix Compression Let L i be the length of the longest prefix of S[i..n] that is a substring of S[1..i-1] Let s i the previous position of this prefix aabcbbabbsbabcbbbbabbabcbbabbabbsb…

40 aabcbbabbsb (2,5) bbabbabcbbabbabbsb… S[1..n]= Compression prior 12 (s 12, L 12 ) i=12 aabcbbabbsbabcbbbbabbabcbbabbabbsb…

41 For (i=1 ; i<=n ;;) {Compute (s i,L i ) if L i >1 {output(s i,L i ); i=i+L i }; else {output(S[i]); i=i+1}} aabcbb (2,2) bs (6,4) ?( s i,L i ) איך לחשב את “ Ziv-Lempel ” compression aabcbbabbsbabbbbbabbabcbbabbabbsb…

42 Implementation using suffix tree Before compression: Build a suffix tree T for S. For each node v, compute c v : –the smallest leaf index in v ’ s subtree. –the starting position of the leftmost copy of the substring that labels the path from the root to v. O(n) time.

43 leaf i root S[i...n] v string-len(v) + c v ≥ i computing (s i,L i ): i cvcv string-len(v) Let v be the “first” node such that: set s i ← c v set L i ← i-c v

44 v2v2 v1v1 Example S = abababab 1 2 3 4 5 6 7 8 1 1 1 12 2 2 a a a a a a a b b b b b b b b $ $ $ $ $ $ $ $ 18765432 i=1 L i =0  a i=2 L i =0  b i=3 L i =2 s i =1  (1,2) i=5 L i =4 s i =1  (1,4)

45 Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The suffix array gives the indices of the suffixes in sorted order 2031

46 How do we build it ? Build a suffix tree Traverse the tree in in-order, lexicographically picking edges outgoing from each node and fill the suffix array. O(n) time Can also build it directly

47 Example Let S = mississippi i ippi issippi ississippi mississippi pi 7 4 1 0 9 8 6 3 10 5 2 ppi sippi sisippi ssippi ssissippi L R Let P = issa M

48 How do we search for a pattern ? If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array Takes O(mlogn) time Can also do it in O(m+log(n)) with an additional array

49 Computing the suffix array Can do it in linear time without constructing the suffix tree ?

50 Divide into triples $ yab ba da b bad o abb ada bba do$

51 Divide into triples $ yab ba da b bad o abb ada bba do$ $ yab ba da b bad o bba dab bad o$$

52 Radix sort triplets $ yab ba da b bad o abb ada bba do$ $ yab ba da b bad o bba dab bad o$$ abb ada bba do$ bad dab o$$

53 Change the alphabet $ yab ba da b bad o abb ada bba do$ $ yab ba da b bad o bba dab bad o$$ abb ada bba do$ bad dab o$$ 1 2 3 4 5 6 7 1 2 4 6 4 5 3 7

54 Sort the suffixes $ yab ba da b bad o abb ada bba do$ $ yab ba da b bad o bba dab bad o$$ 1 2 4 6 4 5 3 7 1 2 4 6 2 4 6 4 6 6 4 5 3 7 5 3 7 3 7 7

55 Recursively (problem size decreases to 2/3 the original) abb ada bba do$ bba dab bad o$$ 124645 37 01234 56 7 0 - 12364537 1 - 2464537 2 - 464537 3 - 64537 4 - 4537 5 - 537 6 - 37 7 - 7 0 - 12364537 1 – 2464537 6 – 37 4 - 4537 2 - 464537 5 – 537 3 - 64537 7 - 7

56 Go back to original problem abb ada bba do$ bba dab bad o$$ 124645 37 0 - 12364537 1 - 2464537 2 - 464537 3 - 64537 4 - 4537 5 - 537 6 - 37 7 - 7 0 - 12364537 1 – 2464537 6 – 37 4 - 4537 2 - 464537 5 – 537 3 - 64537 7 - 7 $ yab ba da b bad o 1234 56 789101112 0 0/11/42/73/104/2 5/56/8 7/11

57 Go back to original problem abb ada bba do$ bba dab bad o$$ 124645 37 0 - 12364537 1 - 2464537 2 - 464537 3 - 64537 4 - 4537 5 - 537 6 - 37 7 - 7 0 - 12364537 1 – 2464537 6 – 37 4 - 4537 2 - 464537 5 – 537 3 - 64537 7 - 7 $ yab ba da b bad o 1234 56 789101112 0 0/11/42/73/104/2 5/56/8 7/11 142653 78

58 Sort the remaining third $ yab ba da b bad o 142653 78 (b, 2)(a, 5) (a, 7) (y, 1) (b, 2) (a, 5) (a, 7) (y, 1) 3 6 9 0 1234 56 789101112 0 

59 Merge $ yab ba da b bad o 142653 78 (b, 2)(a, 5) (a, 7) (y, 1) (b, 2) (a, 5) (a, 7) (y, 1) 3 6 9 0 1234 56 789101112 0 148275 1011 

60 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 3 6 9 0 148275 1011

61 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 3 6 9 0 48275 1011 6

62 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 39 0 48275 1011 64

63 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 39 0 8275 1011 649

64 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 3 0 8275 1011 6493

65 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 0 8275 1011 64938

66 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 0 275 1011 649382

67 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 0 75 1011 6493827

68 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 0 5 1011 64938275

69 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 0 1011 64938275

70 Merge $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 649382751011 0

71 summary $ yab ba da b bad o 142653 78 1234 56 789101112 1 0 6493827510110 When comparing to a suffix with index 1 (mod 3) we compare the char and break ties by the ranks of the following suffixes When comparing to a suffix with index 2 (mod 3) we compare the char, the next char if there is a tie, and finally the ranks of the following suffixes


Download ppt "Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }"

Similar presentations


Ads by Google