Presentation is loading. Please wait.

Presentation is loading. Please wait.

Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Similar presentations


Presentation on theme: "Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen."— Presentation transcript:

1 Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen developed a simpler to understand variant in 1995 –This is what we will focus on

2 Implicit Suffix Trees An implicit suffix tree for string S is a tree obtained from the suffix tree for S$ by –removing $ from all edge labels –removing any edges that now have no label –removing any node that does not still have at least two children Some suffixes may no longer be leaves An implicit suffix tree for prefix S[1..i] of S is similarly defined based on the suffix tree for S[1..i]$ I i will denote the implicit suffix tree for S[1..i]

3 Example Implicit tree for xabxa from tree for xabxa$ {xabxa$, abxa$, bxa$, xa$, a$, $} 1 b b b x x x x a a a a a $ $ $ $ $$ 2 3 6 5 4

4 Remove $ {xabxa$, abxa$, bxa$, xa$, a$, $} 1 b b b x x x x a a a a a $ $ $ $ $$ 2 3 6 5 4

5 Remove $ after Remove $ {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3 6 5 4

6 Remove unlabeled edges {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3 6 5 4

7 Remove unlabeled edges {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3

8 Remove interior nodes Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3

9 Final implicit tree Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a} 1 b b b x x xx a a a a a 2 3

10 Basic Structure of Algorithm Initialization –I 1 has one edge labeled S(1) For i = 1 to m-1 (build I i+1 ) –For j = 1 to i+1 Find location of string S[j..i] in tree “Extend” to incorporate character S(i+1) Expand final implicit tree to make full suffix tree

11 Order of operations visualization S = xabxacdefghixabcab$ i+1 = 1 –x–x i+1 = 2 –extend x to xa –a–a i+1 = 3 –extend xa to xab –extend a to ab –b–b …

12 Extension Rules Case 1: S[j..i] ends at a leaf –Add character S(i+1) to end of label on leaf edge Case 2: Not a leaf, but no path from end of S[j..i] location continues with S(i+1) –Split a new leaf edge for character S(i+1) –May need to create an internal node if S[j..i] ends in the middle of an edge Case 3: S[j..i+1] is already in the tree –No update

13 Visualization Implicit tree for axabxb from tree for axabx 1 b b b x x x x a a a 2 3x b x 4 b b b b Rule 1: at a leaf node b Rule 3: already in treeRule 2: add a leaf edge (and an interior node) b 5

14 Observations Once S[j..i] is located in the tree, extending to accommodate S(i+1) is constant time Making Ukkonen’s algorithm O(m 2 ) –Finding the S[j..i] locations in the suffix trees quickly when explicit computation is needed

15 Edge Label Representation Potential Problem –Size of edge labels may be  (m 2 ) –Example S = abcdefghijklmnopqrstuvwxyz –Total length is  j<m+1 j = m(m+1)/2 –Similar problem can happen when the length of the string is arbitrarily larger than the alphabet size Solution –Label edges with pair of indices indicating beginning and end of positions of the substring in S

16 Full edge label illustration String S = xabxa$ 1 b b b x x x x a a a a a $ $ $ $ $$ 2 3 6 5 4

17 Compact edge label illustration String S = xabxa$ 1 (1,2) [or (4,5)?] (3,6) (2,2) (6,6) 2 3 6 5 4 (3,6) (6,6) (3,6)

18 Modified Extension Rules Rule 2: new leaf edge –label new leaf edge (i+1, i+1) Rule 1: leaf edge extension –label had to be (p,i) before extension given rule 2 above and an induction argument –now will be (p, i+1) Rule 3: still nothing needs to be done

19 Suffix Links Consider the two strings  and x  Suppose some internal node v of the tree is labeled with x  and another node s(v) in the tree is labeled with  Then the edge (v,s(v)) is a suffix link Do all internal nodes (the root is not considered an internal node) have suffix links?

20 Suffix Link Lemma If a new internal node v with path-label x  is added to the current tree in extension j of some phase i+1, then –the path labeled  already exists at an internal node of the tree or –the internal node labeled  will be created in the extension of j+1 or –string  is empty and s(v) is the root

21 Proof of Suffix Link Lemma A new internal node is created only by extension rule 2 This means there are two distinct suffixes of S[1..i+1] that start with x  –x  S(i+1) and x  c  where c is not S(i+1) This means there are two distinct suffixes of S[1..i+1] that start with  –  S(i+1) and  c  where c is not S(i+1) Thus, if  is not empty,  will label an internal node once extension j+1 is processed which is the extension of 

22 Using suffix links to speed up location of S[j..i] S[1..i] must end at a leaf since it is the longest string in implicit tree I i –Keep a pointer to this leaf in all cases and extend according to rule 1 Locating S[j+1..i] from S[j..i] which is at node w –If w is an internal node, set v to w –Otherwise, set v = parent(w) –If v is the root, you must traverse from root to find S[j+1..i] –If not, go to s(v) and begin search for remaining portion of S[j..i] from there Remaining portion is the label of the edge we traversed up (see figure 6.5 on page 100)

23 Skip/count Trick Problem: Moving down from s(v) naively takes time proportional to the number of characters compared Solution –At each node, only compare the first character in an edge label to the next character to be checked –Then, use the number of characters on that edge to update search in constant time –Running time is now proportional to the number of nodes in the path searched rather than the number of characters See Figure 6.6 on page 102

24 O(m 2 ) argument node-depth of v: number of nodes on path from root to node v Lemma: For any suffix link (v, s(v)) traversed in Ukkonen’s algorithm, at that moment, nd(v) <= nd(s(v))+1 –If x  is an ancestral internal node of v where  is not empty, then it has a suffix link to a node with path-label  See Figure 6.7 on page 103

25 O(m 2 ) argument Lemma: Any phase takes O(m) time with skip/count trick Proof –Decrements to node depth at most 2m i+1 <= m extensions per phase Walking up decreases node depth at most 1 Suffix link traversal decreases node depth at most 1 –At most 3m downward edge traversal Max node depth is m None are negative Each downward traversal increases depth by at least 1

26 Observation Making Ukkonen’s algorithm O(m) –Implicit computations of many extensions –Need to take argument for a single phase and extend to multiple phases

27 Rule 3 Suppose suffix extension rule 3 applies to S[j..i+1]. –This means S[j..i+1] already appears in the implicit suffix tree as a prefix of a larger suffix (and is thus a substring of S[1..i]) Then it applies to S[k..i+1] for k > j. –Clearly, S[k..i+1] must also be a substring of S[1..i] and thus must be in the tree Thus, stop a phase once the first application of rule 3 occurs –All future rule 3 extensions are done implicitly, not explicitly

28 Implicit expansion of leaf nodes Once a leaf, always a leaf –It will always be extended using rule 1 Implicit expansion of leaf nodes –When a leaf edge is created in phase i+1, instead of labeling it with (p, i+1), label it with (p,e) –e is a global index that is set to i+1 once in each phase –In later phases, we will not need to explicitly extend this leaf but rather can implicitly extend it by incrementing e once in its global location –How can we easily identify leaf nodes to avoid explicitly expanding them in later phases?

29 Avoiding leaf nodes For phase i, let last(i) denote the last extension of phase i that is not by rule 3 Observation: –All the suffixes S[j..i] for 1 <= j <= last(i) end at a leaf node –All the extensions for 1 <= j <= last(i) in phase i+1 and higher are rule 1 extensions by previous slide –Therefore, last(i+1) >= last(i) Trick –In phase i+1, only explicitly compute extensions for last(i)+1 up till first rule 3 extension is found

30 Single phase algorithm Phase i+1 –Increment e to i+1 (implicitly extending all existing leaves) –Explicitly compute successive extensions starting at last(i)+1 and continuing until a rule 3 extension or no more extensions needed Exact location is known given next step in previous phase –Set last(i+1) appropriately (last rule 1 or 2 extension) Observation –Phase i and i+1 share at most 1 explicit extension –In all phases, there will be at most 2m extensions

31 Visualization of explicit extensions S = xabxacdefghixabcab$ Phase 1: extend x (rule 1) Phase 2: extend a (rule 2) Phase 3: extend b (rule 2) Phase 4: extend x (rule 3) Phase 5: extend x (rule 3) Phase 6: extend x (rule 2), extend a (rule 2) extend c (rule 2) Phase 7: extend d (rule 2) …

32 O(m) argument Lemma: All phases take O(m) time Proof –Similar to previous node-depth argument –Key new observation From one phase to the next, we either start at root or we work with same node we ended with last time so the node depth is identical With 2m total extensions at most, we get O(m) upward and downward traversals of links

33 Finishing up Convert final implicit suffix tree to a true suffix tree Add $ using just another phase of execution –Now all suffixes will be leaves Replace e in every leaf edge with m –Just requires a traversal of tree which is O(m) time


Download ppt "Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen."

Similar presentations


Ads by Google