Download presentation
Presentation is loading. Please wait.
Published byMoriah Batchelor Modified over 9 years ago
1
Suffix Trees
2
2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST
3
3 Introduction
4
4 Substrings String is any sequence of characters. Substring of string S is a string composed of characters i through j, i j of S. S = cater;ate is a substring. car is not a substring. Empty string is a substring of S.
5
5 Subsequences Subsequence of string S is a string composed of characters i 1 < i 2 < … < i k of S. S = cater; ate is a subsequence. car is a subsequence. Empty string is a subsequence of S.
6
6 String/Pattern Matching - I You are given a source string S. Suppose we have to answer queries of the form: is the string p i a substring of S? Knuth-Morris-Pratt (KMP) string matching. O(|S| + | p i |) time per query. O(n|S| + i | p i |) time for n queries. Suffix tree solution. O(|S| + i | p i |) time for n queries.
7
7 String/Pattern Matching - II KMP preprocesses the query string p i, whereas the suffix tree method preprocesses the source string (text) S. The suffix tree for the text is built in O(m) time during a pre-processing stage; thereafter, whenever a string of length O(n) is input, the algorithm searches it in O(n) time using that suffix tree.
8
String Matching: Prefixes & Suffixes Substrings of S beginning at the first position of S are called prefixes of S, and substrings that end at its last position are called suffixes of S. S=AACTAG Prefixes: AACTAG,AACTA,AACT,AAC,AA,A Suffixes: AACTAG,ACTAG,CTAG,TAG,AG,G p i is a substring of S iff p i is a prefix of some suffix of S.
9
9 Suffix Trees
10
10 Definition: Suffix Tree (ST) T for S of length m 1. A rooted tree with m leaves numbered from 1 to m. 2. Each internal node, excluding the root, of T has at least 2 children. 3. Each edge of T is labeled with a nonempty substring of S. 4. No two edges out of a node can have edge-labels starting with the same character. 5. For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S, namely S[ i, m ], that starts at position i.
11
11 Example: Suffix Tree for S=xabxac
12
12 Existence of a suffix tree S If one suffix S j of S matches a prefix of another suffix S i of S, then the path for S j would not end at a leaf. S = xabxa S 1 = xabxa and S 4 = xa How to avoid this problem? Assume that the last character of S appears nowhere else in S. Add a new character $ not in the alphabet to the end of S.
13
13 Example: Suffix Tree for S=xabxac$
14
14 Building STs in linear time: Ukkonen’s algorithm
15
15 Building STs in linear time Weiner’s algorithm [FOCS, 1973] ”The algorithm of 1973” called by Knuth First algorithm of linear time, but much space McGreight’s algorithm [JACM, 1976] Linear time and quadratic space More readable Ukkonen’s algorithm [Algorithmica, 1995] Linear time algorithm and less space This is what we will focus on
16
16 Implicit Suffix Trees Ukkonen’s algorithm constructs a sequence of implicit STs, the last of which is converted to a true ST of the given string. An implicit suffix tree for string S is a tree obtained from the suffix tree for S$ by removing $ from all edge labels removing any edges that now have no label removing any node that does not still have at least two children An implicit suffix tree for prefix S[1, i ] of S is similarly defined based on the suffix tree for S[1, i ]$. I i will denote the implicit suffix tree for S[1, i ]. Each suffix is in the tree, but may not end at a leaf.
17
17 Example: Construction of the Implicit ST Implicit tree for xabxa from tree for xabxa$ {xabxa$, abxa$, bxa$, xa$, a$, $} 1 b b b x x x x a a a a a $ $ $ $ $$ 2 3 6 5 4
18
18 Construction of the Implicit ST: Remove $ Remove $ {xabxa$, abxa$, bxa$, xa$, a$, $} 1 b b b x x x x a a a a a $ $ $ $ $$ 2 3 6 5 4
19
19 Construction of the Implicit ST: After the Removal of $ Remove $ {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3 6 5 4
20
20 Construction of the Implicit ST: Remove unlabeled edges Remove unlabeled edges {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3 6 5 4
21
21 Construction of the Implicit ST: After the Removal of Unlabeled Edges Remove unlabeled edges {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3
22
22 Construction of the Implicit ST: Remove interior nodes Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3
23
23 Construction of the Implicit ST: Final implicit tree Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a} 1 b b b x x xx a a a a a 2 3
24
24 Ukkonen’s Algorithm (UA) I i is the implicit suffix tree of the string S[1, i ] Construct I 1 /* Construct I i+1 from I i */ for i = 1 to m-1 do /* phase i+1 */ for j = 1 to i+1 do /* extension j */ Find the end of the path P from the root whose label is S[ j, i ] in I i and extend P with S[ i +1] by suffix extension rules; Convert I m into a suffix tree S
25
25 Example S = xabxacd$ i+1 = 1 xx i+1 = 2 extend x to xa aa i+1 = 3 extend xa to xab extend a to ab bb …
26
26 Extension Rules Goal: extend each S[ j,i ] into S[ j,i+1 ] Rule 1: S[ j,i ] ends at a leaf Add character S( i+1 ) to the end of the label on that leaf edge Rule 2: S[ j,i ] doesn’t end at a leaf, and the following character is not S( i+1 ) Split a new leaf edge for character S( i+1 ) May need to create an internal node if S[ j,i ] ends in the middle of an edge Rule 3: S[ j,i+1 ] is already in the tree No update
27
27 Example: Extension Rules Implicit tree for axabxb from tree for axabx 1 b b b x x x x a a a 2 3x b x 4 b b b b Rule 1: at a leaf node b Rule 3: already in treeRule 2: add a leaf edge (and an interior node) b 5
28
28 UA for axabxc (1) S[1,3]=axa ES(j,i)S(i+1) 1axa 2 xa 3 a
29
29 UA for axabxc (2)
30
30 UA for axabxc (3)
31
31 UA for axabxc (4)
32
32 Observations Once S[ j,i ] is located in the tree, extension rules take only constant time Naively we could find the end of any suffix S[ j,i ] in O(S[j,i]) time by walking from the root of the current tree. By that approach, I m could be created in O(m 3 ) time. Making Ukkonen’s algorithm O(m) Suffix links Skip and count trick Edge-label compression A stopper Once a leaf, always a leaf
33
33 Suffix Links Consider the two strings a and xa Suppose some internal node v of the tree is labeled with xa and another node s(v) in the tree is labeled with a The edge (v,s(v)) is called a suffix link Do all internal nodes (the root is not considered an internal node) have suffix links?
34
34 Example: suffix links S = ACACACAC$ AC AC$ AC $ $ $ $ $ $ $ C AC$
35
35 Suffix Link Lemma If a new internal node v with path-label xa is added to the current tree in extension j of some phase i+1, then the path labeled a already ends at an internal node of the tree or the internal node labeled a will be created in the extension of j+1 in the same phase i+1 string a is empty and s(v) is the root
36
36 Proof of Suffix Link Lemma A new internal node is created only by the extension rule 2 This means that there are two distinct suffixes of S[ 1,i+1 ] that start with xa xa S( i+1 ) and xacb where c is not S( i+1 ) This means that there are two distinct suffixes of S[ 1,i+1 ] that start with a a S( i+1 ) and acb where c is not S( i+1 ) Thus, if a is not empty, a will label an internal node once extension j+1 is processed which is the extension of a
37
37 Corollary of Suffix Link Lemma Every internal node of an implicit suffix tree has a suffix link from it.
38
38 How to use suffix links - 1 S[1,i] must end at a leaf since it is the longest string in the implicit tree I i Keep a pointer to this leaf in all cases and extend according to rule 1 Locating S[j+1,i] from S[j,i] which is at node w If w is an internal node, set v to w Otherwise, set v = parent(w) If v is the root, you must traverse from the root to find S[j+1,i] If not, go to s(v) and begin search for the remaining portion of S[j,i] from there
39
39 How to use suffix links - 2
40
40 Skip and Count Trick – (1) Problem: Moving down from s(v), directly implemented, takes time proportional to the number of characters compared Solution: To make running time proportional to the number of nodes in the path searched, instead of the number of characters
41
41 Skip and Count Trick – (2) After 4 nodes down-skips, the end of S[j, i] is found.
42
42 Skip and Count Trick – (3) Node-depth of v, denoted (ND(v)), is the number of nodes on the path from the root to the node v Lemma: For any suffix link (v, s(v)) traversed in Ukkonen’s algorithm, at that moment, ND(v) ND(s(v))+1
43
43 Skip and Count Trick – (4) At the moment of traversing (v,s(v)): ND(v) ND(s(v))+1
44
44 Skip and Count Trick – (5) The current node-depth of the algorithm is the node depth of the node most recently visited by the algorithm Lemma: Using the skip and count trick, any phase of Ukkonen’s algorithm takes O(m) time. Up-walk: decreases the current node-depth by 1 Suffix link traversal: same as up-walk Totally, the current node-depth is decreased by 2m. No node has depth >m. The total possible increment to the current node-depth is 3m.
45
45 Edge Label Representation Potential Problem Size of edge labels may require (m 2 ) space Thus, the time for the algorithm is at least as large as the size of its output Example S = abcdefghijklmnopqrstuvwxyz Total length is j<m+1 j = m(m+1)/2 Similar problem can happen when the length of the string is arbitrarily larger than the alphabet size Solution Label edges with pair of indices indicating beginning and end of positions of the substring in S
46
46 Modified Extension Rules Rule 2: new leaf edge (phase i+1 ) create edge ( i+1, i+1 ) split edge (p, q) => (p, w) and (w + 1, q) Rule 1: leaf edge extension label had to be ( p,i ) before extension given rule 2 above and an induction argument: (p, q) => (p, q + 1) Rule 3 Do nothing
47
47 Full edge label representation String S = xabxa$ 1 b b b x x x x a a a a a $ $ $ $ $$ 2 3 6 5 4
48
48 Edge-label Compression String S = xabxa$ 1 (1,2) [or (4,5)?] (3,6) (2,2) (6,6) 2 3 6 5 4 (3,6) (6,6) (3,6)
49
49 A Stopper In any phase, if suffix extension rule 3 applies in extension j, it will also apply in all extensions k, where k>j, until the end of the phase. The extensions in phase i+1 that are done after the first execution of rule 3 are said to be done implicitly. This is in contrast to any extension j where the end of S[j, i] is explicitly found. An extension of that kind is called and explicit extension. Hence, we can end any phase i+1 when the first extension rule 3 applies.
50
50 Once a leaf, always a leaf – (1) If at some point in UA a leaf is created and labeled j (for the suffix starting at position j of S), then that leaf will remain a leaf in all successive trees created during the algorithm. In any phase i, there is an initial sequence of consecutive extensions (starting with extension 1) in which only rule 1 or 2 applies, where let j i be the last extension in this sequence. Note that j i j i+1.
51
51 Once a leaf, always a leaf – (2) Implicit extensions: for extensions 1 to j i, write [·, e] on the leaf edge, where e is a symbol denoting the current end and is set to i + 1 once at the beginning. In later phases, we will not need to explicitly extend this leaf but rather can implicitly extend it by incrementing e once in its global location. Explicit extensions: from extension j i+1 till first rule 3 extension is found (or until extension i+1 is done)
52
52 Once a leaf, always a leaf – (3): Single phase algorithm j* is the last explicit extension computed in phase i+1 Phase i+1 Increment e to i+1 (implicitly extending all existing leaves) Explicitly compute successive extensions starting at j i+1 and continuing until reaching the first extension j* where rule 3 applies or no more extensions needed Set j i+1 to j* -1, to prepare to the next phase Observation Phase i and i+1 share at most 1 explicit extension
53
53 Example: S=axaxbb$ - (1) e = 1, a J 1 = 1 1 1 (1,e) I1I1 e = 2, ax S[1,2] : skip S[2,2] : rule 2, create(2, e) j 2 = 2 1 1 (1,e) I2I2 2 (2,e) e = 3, axa S[1,3].. S[2,3] : skip S[3,3] : rule 3 j 3 = 2 1 1 (1,e) I3I3 2 (2,e)
54
54 Example: S=axaxbb$ - (2) e = 4, axax S[1,4].. S[2,4] : skip S[3,4] : rule 3 S[4,4] : auto skip j 4 = 2 1 1 (1,e) I4I4 2 (2,e) e = 5, axaxb S[1,5].. S[2,5] : skip S[3,5] : rule 2, split (1,e) → (1, 2) and (3,e), create (5,e) S[4,5] : rule 2, split (2,e) → (2,2) and (3,e), create (5,e) S[5,5] : rule 2, create (5,e) j 5 = 5 1 (1,2) I5I5 2 (2,2) 13 (3,e) (5,e) 4 (3,e) (5,e) 5
55
55 Example: S=axaxbb$ - (3) e = 6, axaxbb S[1,6].. S[5,6] : skip S[6,6] : rule 3 j 6 = 5 e = 7, axaxbb$ S[1,7].. S[5,7] : skip S[6,7] : rule 2, split (5,e) → (5,5) and (6,e), create (6,e) S[7,7] : rule 2, create (7,e) j 7 = 7 1 (1,2) I6I6 2 (2,2) 13 (3,e) (5,e) 4 (3,e) (5,e) 5 1 (1,2) I7I7 2 (2,2) 13 (3,e) (5,e) 4 (3,e) (5,e) (5,5) 5 6 (6,e) (7,e) 7
56
56 Complexity of UA Since all the implicit extensions in any phase is constant, their total cost is O(m). Totally, only 2m explicit extensions are executed. The max number of down-walking skips is O(m). Time-complexity of Ukkonen’s algorithm: O(m)
57
57 Finishing up Convert final implicit suffix tree to a true suffix tree Add $ using just another phase of execution Now all suffixes will be leaves Replace e in every leaf edge with m Just requires a traversal of tree which is O(m) time
58
58 Implementation Issues – (1) When the size of the alphabet grows: For large trees suffix links allow an algorithm to move quickly from one part of the tree to a distant part of the tree. This is great for worst-case time bounds, but its horrible if the tree isn't entirely in memory. Thus, implementing ST to reduce practical space use can be a serious concern. The main design issues are how to represent and search the branches out of the nodes of the tree. A practical design must balance between constraints of space and need for speed
59
59 Implementation Issues – (2) There are four basic choices to represent branches An array of size (| |) at each non-leaf node v A linked list at node v of characters that appear at the beginning of the edge-labels out of v. If its kept in sorted order it reduces the average time to search for a given character In the worst case it, adds time | | to every node operation. If the number of children of v is large, then little space is saved over the array while noticeably degrading performance A balanced tree implements the list at node v Additions and searches take O(logk) time and O(k) space, where k is the number of children of v. This alternative makes sense only when k is fairly large. A hashing scheme. The challenge is to find a scheme balancing space with speed. For large trees and alphabets hashing is very attractive at least for some of the nodes
60
60 Implementation Issues – (3) When m and are large enough, the best design is probably a mixture of the above choices. Nodes near the root of the tree tend to have the most children, so arrays are sensible choice at those nodes. For nodes in the middle of a suffix tree, hashing or balanced trees may be the best choice. Sometimes the alphabet size is explicitly presented in the time and space bounds Construction time is O( m log| |),using ( m | |) space. m – is the size of the string.
61
61 Applications of Suffix Trees
62
62 Applications of Suffix Trees Exact string matching Substring problem for a database Longest common substring Suffix arrays
63
63 Exact String Matching – (1) Exact matching problem: given a pattern P of length n and a text T of length m, find all occurrences of P in T in O(n+m) time. Overview of the ST approach: Built a ST for text T in O(m) time Match the characters of P along the unique path in ST until either P is exhausted or no more matches are possible In the latter case, P doesn’t appear anywhere in T In the former case, every leaf in the subtree below the point of the last match is numbered with a starting location of P in T, and every starting location of P in T numbers such a leaf ST approach spends O(m) preprocessing time and then O(n+k) search time, where k is the number of occurrences of P in T
64
64 Exact String Matching – (2) When search terminates at an element node, P appears exactly once in the source string T. When the search for P terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of P. 6 x 1 4 2 5 3 a b x a c c a b x a c c c b x a c Notes: P = xa; T = xabxac Pattern P matches a path down to the point shown by an arrow, and as required, the leaves below that point are numbered 1 and 4.
65
65 Substring Problem for a Database Input: a database D (a set of strings) and a string S Output: find all the strings in D containing S as a substring Usage: identity of the person Exact string matching methods: cannot work Suffix tree: D is stored in O(m) space, where m is the total length of all the strings in D. Suffix tree is built in O(m) time Each lookup of S (the length of S is n) costs O(n) time
66
66 Longest common substring –(1) Input: two strings S 1 and S 2 Output: find the longest substring S common to S 1 and S 2 Example: S 1 =common-substring S 2 =common-subsequence Then, S=common-subs
67
67 Longest common substring – (2) Build a suffix tree for S 1 and S 2 Each leaf represents either a suffix from one of S 1 and S 2, or a suffix from both S 1 and S 2 Mark each internal node v with a 1 (2) if there is a leaf in the subtree of v representing a suffix from S 1 (S 2 ) The path-label of any internal node marked both 1 and 2 is a substring common to both S 1 and S 2, and the longest such string is the longest common substring.
68
68 Longest common substring – (3) S 1 $ = xabxac$, S 2 $ = abx$, S = abx
69
69 Suffix Arrays – (1) A suffix array of an m-character string S, is an array of integers in the range 1 to m, specifying the lexicographic order of the m suffixes of S. Example: S = xabxac
70
70 Suffix Arrays – (2) A suffix array of S can be obtained from the suffix tree T of S by performing a lexical depth-first search of T
71
71 Suffix Arrays: Exact string matching Given two strings T and P, where |T | = m and |P| = n, find all the occurrences of P in T ? Using the binary search on the suffix array of T, all the occurrences of P in T can be found in O(n logm) time. Example: let T = xabxac and P = ac
72
72 References Dan Gusfield: Algorithms on Strings, Trees, and Sequences. University of California, Davis.Cambridge University Press,1997 Ela Hunt et al.: A database index to large biological sequences. Slides at VLDB, 2001 R.C.T. Lee and Chin Lung Lu.: Suffix Trees. Slides at CS 5313 Algorithms for Molecular Biology
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.