# Suffix Trees



2 Outline
- Introduction
- Suffix Trees (ST)
- Building STs in linear time: Ukkonen's algorithm
- Applications of ST

3 Introduction

4 Substrings
- A string is any sequence of characters.
- A substring of string S is the string composed of characters i through j of S, for some i ≤ j.
- S = cater: ate is a substring; car is not a substring.
- The empty string is a substring of S.

5 Subsequences
- A subsequence of string S is a string composed of characters i_1 < i_2 < … < i_k of S.
- S = cater: ate is a subsequence; car is also a subsequence.
- The empty string is a subsequence of S.
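The difference between the two definitions can be checked with a minimal Python sketch. The function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(p: str, s: str) -> bool:
    """True if p occupies consecutive positions i..j of s (the empty string counts)."""
    return p in s


def is_subsequence(p: str, s: str) -> bool:
    """True if the characters of p appear in s in order, not necessarily adjacent."""
    remaining = iter(s)
    # Each membership test consumes the iterator up to the matched character,
    # so later characters of p can only match later positions of s.
    return all(ch in remaining for ch in p)


print(is_substring("ate", "cater"), is_substring("car", "cater"))      # True False
print(is_subsequence("ate", "cater"), is_subsequence("car", "cater"))  # True True
```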

6 String/Pattern Matching - I
- You are given a source string S and have to answer queries of the form: is the string p_i a substring of S?
- Knuth-Morris-Pratt (KMP) string matching:
  - O(|S| + |p_i|) time per query.
  - O(n|S| + Σ_i |p_i|) time for n queries.
- Suffix tree solution:
  - O(|S| + Σ_i |p_i|) time for n queries.

7 String/Pattern Matching - II
- KMP preprocesses the query string p_i, whereas the suffix tree method preprocesses the source string (text) S.
- The suffix tree for a text of length m is built in O(m) time during a preprocessing stage; thereafter, whenever a string of length n is input, the algorithm searches for it in O(n) time using that suffix tree.

String Matching: Prefixes & Suffixes
- Substrings of S beginning at the first position of S are called prefixes of S, and substrings that end at its last position are called suffixes of S.
- S = AACTAG
  - Prefixes: AACTAG, AACTA, AACT, AAC, AA, A
  - Suffixes: AACTAG, ACTAG, CTAG, TAG, AG, G
- p_i is a substring of S iff p_i is a prefix of some suffix of S.
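The last statement is the property that suffix trees exploit. As a brute-force illustration, with no tree involved yet, it can be sketched as follows; the helper names are hypothetical.

```python
def prefixes(s: str) -> list[str]:
    """Nonempty prefixes of s, longest first."""
    return [s[:k] for k in range(len(s), 0, -1)]


def suffixes(s: str) -> list[str]:
    """Nonempty suffixes of s, longest first."""
    return [s[k:] for k in range(len(s))]


def is_substring_via_suffixes(p: str, s: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in suffixes(s))


print(prefixes("AACTAG"))   # ['AACTAG', 'AACTA', 'AACT', 'AAC', 'AA', 'A']
print(suffixes("AACTAG"))   # ['AACTAG', 'ACTAG', 'CTAG', 'TAG', 'AG', 'G']
print(is_substring_via_suffixes("CTA", "AACTAG"))   # True
```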

9 Suffix Trees

10 Definition: Suffix Tree (ST)
A suffix tree T for a string S of length m is a rooted tree such that:
1. T has exactly m leaves, numbered from 1 to m.
2. Each internal node of T, excluding the root, has at least 2 children.
3. Each edge of T is labeled with a nonempty substring of S.
4. No two edges out of a node have edge-labels starting with the same character.
5. For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i, namely S[i, m].

11 Example: Suffix Tree for S=xabxac

12 Existence of a suffix tree for S
- If one suffix S_j of S matches a prefix of another suffix S_i of S, then the path for S_j would not end at a leaf, so no tree satisfying the definition exists.
  - Example: S = xabxa, S_1 = xabxa and S_4 = xa.
- How to avoid this problem?
  - Assume that the last character of S appears nowhere else in S.
  - Add a new character \$, not in the alphabet, to the end of S.
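To see the problem and the fix on the slide's example, the following sketch lists the suffixes that are proper prefixes of other suffixes, which are exactly the ones that cannot end at a leaf. The name `hidden_suffixes` is made up for this illustration.

```python
def hidden_suffixes(s: str) -> list[str]:
    """Suffixes of s that are a proper prefix of another suffix of s."""
    sufs = [s[k:] for k in range(len(s))]
    # All suffixes have different lengths, so b != a excludes only a itself.
    return [a for a in sufs if any(b != a and b.startswith(a) for b in sufs)]


print(hidden_suffixes("xabxa"))    # ['xa', 'a']
print(hidden_suffixes("xabxa$"))   # []  (the terminator makes every suffix end at a leaf)
```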

13 Example: Suffix Tree for S=xabxac\$

14 Building STs in linear time: Ukkonen’s algorithm

15 Building STs in linear time
- Weiner's algorithm [FOCS, 1973]
  - Called "the algorithm of 1973" by Knuth
  - First linear-time algorithm, but uses much space
- McCreight's algorithm [JACM, 1976]
  - Linear time, and more space-efficient than Weiner's algorithm
  - More readable
- Ukkonen's algorithm [Algorithmica, 1995]
  - Linear time and less space
  - This is what we will focus on

16 Implicit Suffix Trees
- Ukkonen's algorithm constructs a sequence of implicit STs, the last of which is converted to a true ST of the given string.
- An implicit suffix tree for string S is the tree obtained from the suffix tree for S\$ by
  - removing \$ from all edge labels,
  - removing any edges that now have no label,
  - removing any node that does not still have at least two children.
- An implicit suffix tree for the prefix S[1, i] of S is defined similarly, based on the suffix tree for S[1, i]\$.
- I_i denotes the implicit suffix tree for S[1, i].
- Each suffix is in the tree, but may not end at a leaf.

17 Example: Construction of the Implicit ST
Implicit tree for xabxa, derived from the tree for xabxa\$, whose suffixes are {xabxa\$, abxa\$, bxa\$, xa\$, a\$, \$}.
[Figure: suffix tree for xabxa\$ with leaves 1 to 6]

18 Construction of the Implicit ST: Remove \$
Remove \$ from the edge labels of the tree for {xabxa\$, abxa\$, bxa\$, xa\$, a\$, \$}.
[Figure: suffix tree for xabxa\$ with leaves 1 to 6, \$ characters marked for removal]

19 Construction of the Implicit ST: After the Removal of \$
The suffixes are now {xabxa, abxa, bxa, xa, a}.
[Figure: the tree with \$ removed from all edge labels; leaves 1 to 6]

20 Construction of the Implicit ST: Remove unlabeled edges
Remove the edges that now have no label. Suffixes: {xabxa, abxa, bxa, xa, a}.
[Figure: the tree with unlabeled edges marked for removal; leaves 1 to 6]

21 Construction of the Implicit ST: After the Removal of Unlabeled Edges
Suffixes: {xabxa, abxa, bxa, xa, a}.
[Figure: the tree after removing unlabeled edges; leaves 1, 2 and 3 remain]

22 Construction of the Implicit ST: Remove interior nodes
Remove internal nodes with only one child. Suffixes: {xabxa, abxa, bxa, xa, a}.
[Figure: the tree with one-child internal nodes marked for removal; leaves 1, 2 and 3]

23 Construction of the Implicit ST: Final implicit tree
Suffixes: {xabxa, abxa, bxa, xa, a}.
[Figure: final implicit suffix tree for xabxa; leaves 1, 2 and 3]

24 Ukkonen's Algorithm (UA)
I_i is the implicit suffix tree of the string S[1, i].
- Construct I_1.
- For i = 1 to m-1 (phase i+1, which constructs I_{i+1} from I_i):
  - For j = 1 to i+1 (extension j): find the end of the path P from the root whose label is S[j, i] in I_i, and extend P with S(i+1) by the suffix extension rules.
- Convert I_m into a suffix tree for S.

25 Example
S = xabxacd\$
- i+1 = 1: add x
- i+1 = 2: extend x to xa; add a
- i+1 = 3: extend xa to xab; extend a to ab; add b
- …

26 Extension Rules
Goal: extend each S[j, i] into S[j, i+1].
- Rule 1: S[j, i] ends at a leaf.
  - Add character S(i+1) to the end of the label on that leaf edge.
- Rule 2: S[j, i] doesn't end at a leaf, and no path continuing from the end of S[j, i] starts with S(i+1).
  - Add a new leaf edge labeled with character S(i+1).
  - May need to create an internal node if S[j, i] ends in the middle of an edge.
- Rule 3: S[j, i+1] is already in the tree.
  - No update.
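Which rule fires in each extension can be predicted without building any tree, using plain substring tests. This is a brute-force model and not how Ukkonen's algorithm itself decides: S[j, i] ends at a leaf of I_i exactly when it occurs only once in S[1, i], and S[j, i+1] is already in the tree exactly when it is a substring of S[1, i]. The function names are illustrative.

```python
def count_occurrences(p: str, s: str) -> int:
    """Number of (possibly overlapping) occurrences of nonempty p in s."""
    count, pos = 0, s.find(p)
    while pos != -1:
        count += 1
        pos = s.find(p, pos + 1)
    return count


def extension_rules(s: str, i: int) -> list[int]:
    """Rule applied by extensions j = 1..i+1 of phase i+1 (1-based i, 1 <= i < len(s))."""
    prefix, new_char = s[:i], s[i]      # S[1,i] and S(i+1)
    rules = []
    for j in range(1, i + 2):
        beta = prefix[j - 1:]           # S[j,i]; empty when j = i+1
        if beta and count_occurrences(beta, prefix) == 1:
            rules.append(1)             # beta ends at a leaf: extend the leaf edge
        elif beta + new_char in prefix:
            rules.append(3)             # S[j,i+1] already in the tree: no update
        else:
            rules.append(2)             # new leaf edge, possibly a new internal node
    return rules


# Going from the tree for axabx to the tree for axabxb:
print(extension_rules("axabxb", 5))     # [1, 1, 1, 1, 2, 3]
```

Note that once rule 3 appears it is never followed by rule 1 or 2 within the same phase.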

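27 Example: Extension Rules
Implicit tree for axabxb built from the tree for axabx by appending b:
- Rule 1: at a leaf node
- Rule 2: add a leaf edge (and an interior node)
- Rule 3: already in tree
[Figure: implicit suffix tree for axabx with leaves 1 to 4, and the b extensions added by each rule, including new leaf 5]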

28 UA for axabxc (1)
S[1,3] = axa

| E | S(j,i) | S(i+1) |
|---|--------|--------|
| 1 | ax | a |
| 2 | x | a |
| 3 | (empty) | a |

29 UA for axabxc (2)

30 UA for axabxc (3)

31 UA for axabxc (4)

32 Observations
- Once the end of S[j, i] is located in the tree, the extension rules take only constant time.
- Naively, we could find the end of any suffix S[j, i] in O(|S[j, i]|) time by walking from the root of the current tree. By that approach, I_m could be created in O(m³) time.
- Making Ukkonen's algorithm O(m):
  - Suffix links
  - Skip and count trick
  - Edge-label compression
  - A stopper
  - Once a leaf, always a leaf
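The cubic bound for the naive approach can be checked by summing over all phases and extensions, assuming extension j of phase i+1 costs time proportional to |S[j, i]| = i - j + 1 for the walk from the root:

$$\sum_{i=1}^{m-1}\;\sum_{j=1}^{i+1} (i-j+1) \;=\; \sum_{i=1}^{m-1} \frac{i(i+1)}{2} \;=\; \frac{(m-1)\,m\,(m+1)}{6} \;=\; O(m^3)$$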

33 Suffix Links
- Let x be a single character and α a (possibly empty) string, and consider the two strings α and xα.
- Suppose some internal node v of the tree has path-label xα and another node s(v) in the tree has path-label α.
- The pointer (v, s(v)) is called a suffix link.
- Do all internal nodes (the root is not considered an internal node) have suffix links?

34 Example: suffix links
S = ACACACAC\$
[Figure: suffix tree for ACACACAC\$ with its suffix links drawn]

35 Suffix Link Lemma
If a new internal node v with path-label xα is added to the current tree in extension j of some phase i+1, then one of the following holds:
- the path labeled α already ends at an internal node of the tree, or
- an internal node at the end of string α will be created in extension j+1 of the same phase i+1, or
- string α is empty and s(v) is the root.

36 Proof of Suffix Link Lemma
- A new internal node is created only by extension rule 2.
- This means that there are two distinct suffixes of S[1, i+1] that start with xα:
  - xα S(i+1) and xα c β, where c is not S(i+1).
- Dropping the first character x, there are two distinct suffixes of S[1, i+1] that start with α:
  - α S(i+1) and α c β, where c is not S(i+1).
- Thus, if α is not empty, α will label an internal node once extension j+1, which is the extension of α, is processed.

37 Corollary of Suffix Link Lemma Every internal node of an implicit suffix tree has a suffix link from it.

38 How to use suffix links - 1
- S[1, i] must end at a leaf, since it is the longest string in the implicit tree I_i.
  - Keep a pointer to this leaf at all times, and extend it according to rule 1.
- Locating S[j+1, i] from S[j, i], which ends at node w:
  - If w is an internal node, set v to w.
  - Otherwise, set v = parent(w).
  - If v is the root, you must traverse from the root to find S[j+1, i].
  - If not, go to s(v) and from there search for the remaining portion of S[j, i], that is, the part that lies below v (empty when v = w).

39 How to use suffix links - 2

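40 Skip and Count Trick – (1)
- Problem: moving down from s(v), directly implemented, takes time proportional to the number of characters compared.
- Solution: make the running time proportional to the number of nodes on the path searched, instead of the number of characters. Because the path is known to exist, only the first character of each edge has to be examined, and an edge no longer than the remaining length is skipped in one step.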

41 Skip and Count Trick – (2) After 4 nodes down-skips, the end of S[j, i] is found.

42 Skip and Count Trick – (3)
- The node-depth of v, denoted ND(v), is the number of nodes on the path from the root to the node v.
- Lemma: For any suffix link (v, s(v)) traversed in Ukkonen's algorithm, at that moment, ND(v) ≤ ND(s(v)) + 1.

43 Skip and Count Trick – (4) At the moment of traversing (v, s(v)): ND(v) ≤ ND(s(v)) + 1

44 Skip and Count Trick – (5)
- The current node-depth of the algorithm is the node-depth of the node most recently visited by the algorithm.
- Lemma: Using the skip and count trick, any phase of Ukkonen's algorithm takes O(m) time.
  - Up-walk: decreases the current node-depth by ≤ 1.
  - Suffix link traversal: same as the up-walk.
  - In total, the current node-depth is decreased by ≤ 2m.
  - No node has depth > m.
  - The total possible increment to the current node-depth is ≤ 3m.

45 Edge Label Representation
Potential problem:
- The total size of the edge labels may require Θ(m²) space.
- Thus, the time for the algorithm is at least as large as the size of its output.
- Example: S = abcdefghijklmnopqrstuvwxyz. Every suffix starts with a different character, so each suffix labels its own edge out of the root, and the total length is Σ_{j=1..m} j = m(m+1)/2.
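The quadratic growth in this example is easy to confirm numerically. The sketch below assumes, as in the example, that all characters of S are distinct; `total_label_length` is a name made up for this illustration.

```python
import string


def total_label_length(s: str) -> int:
    """Total characters on all edges when every suffix is its own edge out of the root."""
    if len(set(s)) != len(s):
        raise ValueError("this shortcut only holds when all characters are distinct")
    return sum(len(s) - start for start in range(len(s)))


s = string.ascii_lowercase          # abcdefghijklmnopqrstuvwxyz, m = 26
m = len(s)
print(total_label_length(s), m * (m + 1) // 2)   # 351 351
```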

46 Modified Extension Rules
- Rule 2: new leaf edge (phase i+1)
  - create edge (i+1, i+1)
  - split edge (p, q) => (p, w) and (w + 1, q)
- Rule 1: leaf edge extension
  - the label had to be (p, i) before the extension, given rule 2 above and an induction argument
  - (p, q) => (p, q + 1)
- Rule 3: do nothing
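With index pairs, both the split in rule 2 and the extension in rule 1 become constant-time arithmetic on two integers. A small sketch with 1-based inclusive pairs, as on the slides, follows; the helper names `label`, `split_edge` and `extend_leaf` are hypothetical.

```python
def label(s: str, p: int, q: int) -> str:
    """Substring denoted by the edge label (p, q), 1-based and inclusive."""
    return s[p - 1:q]


def split_edge(p: int, q: int, w: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Rule 2 split: edge (p, q) becomes (p, w) above the new node and (w + 1, q) below it."""
    if not p <= w < q:
        raise ValueError("the split point must fall strictly inside the edge")
    return (p, w), (w + 1, q)


def extend_leaf(p: int, q: int) -> tuple[int, int]:
    """Rule 1: appending one character to a leaf edge only moves its end index."""
    return (p, q + 1)


s = "xabxa$"
upper, lower = split_edge(1, 6, 2)
print(upper, label(s, *upper))   # (1, 2) xa
print(lower, label(s, *lower))   # (3, 6) bxa$
print(extend_leaf(2, 5))         # (2, 6)
```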

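47 Full edge label representation
String S = xabxa\$
[Figure: suffix tree for xabxa\$ with every edge labeled by its full substring; leaves 1 to 6]

48 Edge-label Compression
String S = xabxa\$
[Figure: the same tree with each edge label written as an index pair: (1,2) [or (4,5)?] for xa, (2,2) for a, (3,6) for bxa\$, (6,6) for \$; leaves 1 to 6]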

49 A Stopper
- In any phase, if suffix extension rule 3 applies in extension j, it will also apply in all extensions k > j until the end of the phase.
- The extensions in phase i+1 that are done after the first execution of rule 3 are said to be done implicitly. This is in contrast to any extension j where the end of S[j, i] is explicitly found; an extension of that kind is called an explicit extension.
- Hence, we can end any phase i+1 as soon as extension rule 3 first applies.

50 Once a leaf, always a leaf – (1)
- If at some point in UA a leaf is created and labeled j (for the suffix starting at position j of S), then that leaf will remain a leaf in all successive trees created during the algorithm.
- In any phase i, there is an initial sequence of consecutive extensions (starting with extension 1) in which only rule 1 or 2 applies. Let j_i be the last extension in this sequence.
- Note that j_i ≤ j_{i+1}.

51 Once a leaf, always a leaf – (2)
- Implicit extensions: for extensions 1 to j_i, write [·, e] on the leaf edge, where e is a symbol denoting the current end; e is set to i+1 once, at the beginning of phase i+1. In later phases, we do not need to extend this leaf explicitly; it is extended implicitly by incrementing e once in its global location.
- Explicit extensions: from extension j_i + 1 until the first rule 3 extension is found (or until extension i+1 is done).

52 Once a leaf, always a leaf – (3): Single phase algorithm
- j* is the last explicit extension computed in phase i+1.
- Phase i+1:
  - Increment e to i+1 (implicitly extending all existing leaves).
  - Explicitly compute successive extensions, starting at j_i + 1 and continuing until reaching the first extension j* where rule 3 applies, or until no more extensions are needed.
  - Set j_{i+1} to j* - 1, to prepare for the next phase.
- Observation: phases i and i+1 share at most 1 explicit extension.
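The global end e can be modelled as one mutable object that every leaf edge references, so a single assignment extends all leaves at once. This is a minimal sketch of that idea only, with no tree around it; the classes `GlobalEnd` and `LeafEdge` are invented for the illustration.

```python
class GlobalEnd:
    """The shared symbol e: one mutable counter referenced by every leaf edge."""

    def __init__(self) -> None:
        self.value = 0


class LeafEdge:
    def __init__(self, start: int, end: GlobalEnd) -> None:
        self.start = start   # 1-based start position, fixed when the leaf is created
        self.end = end       # reference to the shared end, never a copy

    def label(self, s: str) -> str:
        return s[self.start - 1:self.end.value]


s = "axaxbb$"
e = GlobalEnd()
e.value = 1
leaf1 = LeafEdge(1, e)   # phase 1 creates leaf 1 with edge (1, e)
e.value = 2
leaf2 = LeafEdge(2, e)   # phase 2 creates leaf 2 with edge (2, e)
e.value = 3              # phase 3: one increment, no per-leaf work
e.value = 4              # phase 4: likewise
print(leaf1.label(s), leaf2.label(s))   # axax xax
```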

53 Example: S=axaxbb\$ - (1)
- e = 1, a
  - j_1 = 1
  - I_1: leaf 1 on edge (1,e)
- e = 2, ax
  - S[1,2]: skip
  - S[2,2]: rule 2, create (2,e)
  - j_2 = 2
  - I_2: leaf 1 on edge (1,e), leaf 2 on edge (2,e)
- e = 3, axa
  - S[1,3] .. S[2,3]: skip
  - S[3,3]: rule 3
  - j_3 = 2
  - I_3: same edges as I_2

54 Example: S=axaxbb\$ - (2)
- e = 4, axax
  - S[1,4] .. S[2,4]: skip
  - S[3,4]: rule 3
  - S[4,4]: auto skip
  - j_4 = 2
  - I_4: same edges as I_2
- e = 5, axaxb
  - S[1,5] .. S[2,5]: skip
  - S[3,5]: rule 2, split (1,e) → (1,2) and (3,e), create (5,e)
  - S[4,5]: rule 2, split (2,e) → (2,2) and (3,e), create (5,e)
  - S[5,5]: rule 2, create (5,e)
  - j_5 = 5
  - I_5: edge (1,2) to an internal node with leaf 1 on (3,e) and leaf 3 on (5,e); edge (2,2) to an internal node with leaf 2 on (3,e) and leaf 4 on (5,e); leaf 5 on edge (5,e)

55 Example: S=axaxbb\$ - (3)
- e = 6, axaxbb
  - S[1,6] .. S[5,6]: skip
  - S[6,6]: rule 3
  - j_6 = 5
  - I_6: same edges as I_5
- e = 7, axaxbb\$
  - S[1,7] .. S[5,7]: skip
  - S[6,7]: rule 2, split (5,e) → (5,5) and (6,e), create (7,e)
  - S[7,7]: rule 2, create (7,e)
  - j_7 = 7
  - I_7: as I_5, except that edge (5,5) now leads to an internal node with leaf 5 on (6,e) and leaf 6 on (7,e), and leaf 7 hangs off the root on edge (7,e)

56 Complexity of UA
- Since the cost of all the implicit extensions in any phase is constant, their total cost over all phases is O(m).
- In total, at most 2m explicit extensions are executed.
- The maximum number of down-walking skips is O(m).
- Time complexity of Ukkonen's algorithm: O(m)

57 Finishing up
- Convert the final implicit suffix tree to a true suffix tree.
- Add \$ using just one more phase of execution.
  - Now all suffixes will end at leaves.
- Replace e on every leaf edge with m.
  - This requires just a traversal of the tree, which takes O(m) time.
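Putting the tricks together, a compact Python sketch of the whole construction might look as follows. It is not the slides' own code: it uses the widely used "active point" bookkeeping (`active_node`, `active_edge`, `active_len`, `remainder`) in place of explicit j_i counters, 0-based half-open index pairs, and `end = None` on a leaf as a stand-in for the global end e. All names are assumptions made for this illustration, and the input is assumed to end with a unique terminator.

```python
class Node:
    def __init__(self, start, end):
        self.start = start      # incoming edge label is text[start:end]
        self.end = end          # None on a leaf: stands for the global end e
        self.children = {}      # first character of edge label -> child Node
        self.link = None        # suffix link (None is treated as "root")


def build_suffix_tree(text):
    """Ukkonen-style construction; text should end with a unique terminator."""
    root = Node(-1, -1)
    active_node, active_edge, active_len = root, 0, 0
    remainder = 0                          # suffixes still to be inserted explicitly
    for i, ch in enumerate(text):          # phase i+1; leaves grow implicitly
        remainder += 1
        last_new = None                    # internal node awaiting its suffix link
        while remainder > 0:
            if active_len == 0:
                active_edge = i
            first = text[active_edge]
            child = active_node.children.get(first)
            if child is None:              # rule 2: new leaf under an existing node
                active_node.children[first] = Node(i, None)
                if last_new is not None:
                    last_new.link = active_node
                    last_new = None
            else:
                end = child.end if child.end is not None else i + 1
                edge_len = end - child.start
                if active_len >= edge_len:         # skip and count: hop over the edge
                    active_edge += edge_len
                    active_len -= edge_len
                    active_node = child
                    continue
                if text[child.start + active_len] == ch:   # rule 3: stop the phase
                    active_len += 1
                    if last_new is not None:
                        last_new.link = active_node
                    break
                # rule 2 with a split: new internal node in the middle of the edge
                split = Node(child.start, child.start + active_len)
                active_node.children[first] = split
                split.children[ch] = Node(i, None)
                child.start += active_len
                split.children[text[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if active_node is root and active_len > 0:
                active_len -= 1
                active_edge = i - remainder + 1
            elif active_node is not root:          # follow the suffix link
                active_node = active_node.link or root
    return root


def contains(root, text, pattern):
    """True if pattern labels a path from the root, i.e. is a substring of text."""
    node, k = root, 0
    while k < len(pattern):
        child = node.children.get(pattern[k])
        if child is None:
            return False
        end = child.end if child.end is not None else len(text)
        edge = text[child.start:end]
        chunk = pattern[k:k + len(edge)]
        if not edge.startswith(chunk):
            return False
        k += len(chunk)
        node = child
    return True


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(c) for c in node.children.values())


tree = build_suffix_tree("xabxac$")
print(count_leaves(tree), contains(tree, "xabxac$", "bxa"), contains(tree, "xabxac$", "xb"))
```

Reading a leaf's `None` end as `len(text)` in `contains` plays the role of replacing e with m on every leaf edge.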

58 Implementation Issues – (1)
When the size of the alphabet grows:
- For large trees, suffix links allow an algorithm to move quickly from one part of the tree to a distant part of the tree. This is great for worst-case time bounds, but it is horrible for performance if the tree isn't entirely in memory.
- Thus, implementing an ST to reduce practical space use can be a serious concern.
- The main design issues are how to represent and search the branches out of the nodes of the tree.
- A practical design must balance the constraints of space against the need for speed.

59 Implementation Issues – (2)
There are four basic choices to represent branches:
- An array of size Θ(|Σ|) at each non-leaf node v.
- A linked list at node v of the characters that appear at the beginning of the edge-labels out of v.
  - If it is kept in sorted order, the average time to search for a given character is reduced.
  - In the worst case, it adds time |Σ| to every node operation.
  - If the number of children of v is large, then little space is saved over the array, while performance is noticeably degraded.
- A balanced tree implementing the list at node v.
  - Additions and searches take O(log k) time and O(k) space, where k is the number of children of v.
  - This alternative makes sense only when k is fairly large.
- A hashing scheme.
  - The challenge is to find a scheme balancing space with speed.
  - For large trees and alphabets, hashing is very attractive, at least for some of the nodes.

60 Implementation Issues – (3)
- When m and |Σ| are large enough, the best design is probably a mixture of the above choices.
  - Nodes near the root of the tree tend to have the most children, so arrays are a sensible choice at those nodes.
  - For nodes in the middle of a suffix tree, hashing or balanced trees may be the best choice.
- Sometimes the alphabet size is explicitly presented in the time and space bounds.
  - Construction time is O(m log|Σ|) using O(m) space (balanced trees at the nodes), or O(m) using Θ(m|Σ|) space (arrays at the nodes), where m is the size of the string.

61 Applications of Suffix Trees

62 Applications of Suffix Trees
- Exact string matching
- Substring problem for a database
- Longest common substring
- Suffix arrays

63 Exact String Matching – (1)
- Exact matching problem: given a pattern P of length n and a text T of length m, find all occurrences of P in T in O(n+m) time.
- Overview of the ST approach:
  - Build an ST for text T in O(m) time.
  - Match the characters of P along the unique path in the ST until either P is exhausted or no more matches are possible.
  - In the latter case, P doesn't appear anywhere in T.
  - In the former case, every leaf in the subtree below the point of the last match is numbered with a starting location of P in T, and every starting location of P in T numbers such a leaf.
- The ST approach spends O(m) preprocessing time and then O(n+k) search time, where k is the number of occurrences of P in T.

64 Exact String Matching – (2)
- When the search terminates at an element node (a leaf), P appears exactly once in the source string T.
- When the search for P terminates at a branch node (an internal node), each element node in the subtree rooted at this branch node gives a different occurrence of P.
- Example: P = xa, T = xabxac. Pattern P matches a path down to the point shown by an arrow, and as required, the leaves below that point are numbered 1 and 4.
[Figure: suffix tree for xabxac with leaves 1 to 6; an arrow marks the point where P = xa ends]
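The matching procedure can be tried out on a simplified structure. The sketch below uses an uncompressed suffix trie built naively in O(m²) time and space, not a linear-time suffix tree, so it only illustrates the search step: walk down along P, then report every leaf number below the point reached. It assumes the text does not contain `$`; the function names and the `"leaf"` key are choices made for this example.

```python
def build_suffix_trie(text: str) -> dict:
    """Naive suffix trie of text + '$'; each node is a dict from character to child."""
    text += "$"
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1          # 1-based starting position of this suffix
    return root


def find_occurrences(trie: dict, pattern: str) -> list[int]:
    """All 1-based starting positions of pattern in the indexed text."""
    node = trie
    for ch in pattern:                    # match P along the unique path
        if ch not in node:
            return []                     # no more matches possible: P is absent
        node = node[ch]
    found, stack = [], [node]
    while stack:                          # collect the leaves below the match point
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))   # [1, 4]
print(find_occurrences(trie, "xb"))   # []
```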

65 Substring Problem for a Database
- Input: a database D (a set of strings) and a string S.
- Output: all the strings in D containing S as a substring.
- Usage: establishing the identity of a person.
- Exact string matching methods that scan the whole database for each query are not practical here.
- Suffix tree:
  - D is stored in O(m) space, where m is the total length of all the strings in D.
  - The suffix tree is built in O(m) time.
  - Each lookup of S (of length n) costs O(n) time to decide whether S occurs, plus time proportional to the number of occurrences to report them.

66 Longest common substring – (1)
- Input: two strings S_1 and S_2.
- Output: the longest substring S common to S_1 and S_2.
- Example:
  - S_1 = common-substring
  - S_2 = common-subsequence
  - Then, S = common-subs

67 Longest common substring – (2)
- Build a single (generalized) suffix tree for S_1 and S_2.
- Each leaf represents either a suffix from one of S_1 and S_2, or a suffix from both S_1 and S_2.
- Mark each internal node v with a 1 (2) if there is a leaf in the subtree of v representing a suffix from S_1 (S_2).
- The path-label of any internal node marked both 1 and 2 is a substring common to both S_1 and S_2, and the longest such string is the longest common substring.

68 Longest common substring – (3) S_1\$ = xabxac\$, S_2\$ = abx\$, S = abx
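The marking idea can be sketched on an uncompressed trie of all suffixes of both strings. This is a quadratic-size stand-in for the generalized suffix tree, chosen only to keep the example short: each trie node records which of the two strings contain its path-label, and the deepest node carrying both marks gives the answer. The function name and the node layout are assumptions; when several longest common substrings exist, this sketch returns just one of them.

```python
def longest_common_substring(s1: str, s2: str) -> str:
    root = {"kids": {}, "marks": set()}
    for mark, s in ((1, s1), (2, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "marks": set()})
                node["marks"].add(mark)   # this path-label occurs in string `mark`
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["kids"].items():
            if child["marks"] == {1, 2}:  # path-label is common to both strings
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


print(longest_common_substring("xabxac", "abx"))                         # abx
print(longest_common_substring("common-substring", "common-subsequence"))  # common-subs
```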

69 Suffix Arrays – (1) A suffix array of an m-character string S is an array of integers in the range 1 to m, specifying the lexicographic order of the m suffixes of S. Example: S = xabxac

70 Suffix Arrays – (2) A suffix array of S can be obtained from the suffix tree T of S by performing a lexical depth-first search of T
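For a small string, the same array can be obtained by sorting the suffixes directly. This naive route takes O(m² log m) time in the worst case and is shown only to make the definition concrete; it is not the tree traversal described above. The function name is illustrative.

```python
def suffix_array(s: str) -> list[int]:
    """1-based starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


s = "xabxac"
for position in suffix_array(s):
    print(position, s[position - 1:])
# 2 abxac
# 5 ac
# 3 bxac
# 6 c
# 1 xabxac
# 4 xac
```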

71 Suffix Arrays: Exact string matching
- Given two strings T and P, where |T| = m and |P| = n, find all the occurrences of P in T.
- Using binary search on the suffix array of T, all the occurrences of P in T can be found in O(n log m) time.
- Example: let T = xabxac and P = ac
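A sketch of the binary search follows. Each comparison looks only at the first n characters of a suffix, so it costs O(n), and the two searches together make O(log m) comparisons. The suffix array is built here by naive sorting for brevity, and the function names are hypothetical.

```python
def suffix_array(s: str) -> list[int]:
    """1-based starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find_all(text: str, pattern: str, sa: list[int]) -> list[int]:
    """1-based starting positions of pattern in text, using the suffix array sa."""
    n = len(pattern)

    def head(k: int) -> str:
        # First n characters of the k-th suffix in sorted order.
        return text[sa[k] - 1:sa[k] - 1 + n]

    lo, hi = 0, len(sa)
    while lo < hi:                  # leftmost suffix whose head is >= pattern
        mid = (lo + hi) // 2
        if head(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                  # leftmost suffix whose head is > pattern
        mid = (lo + hi) // 2
        if head(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])     # the suffixes in between all start with pattern


T = "xabxac"
sa = suffix_array(T)
print(find_all(T, "ac", sa))   # [5]
print(find_all(T, "xa", sa))   # [1, 4]
```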

72 References
- Dan Gusfield (University of California, Davis): Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
- Ela Hunt et al.: A database index to large biological sequences. Slides at VLDB, 2001.
- R.C.T. Lee and Chin Lung Lu: Suffix Trees. Slides at CS 5313 Algorithms for Molecular Biology.
