Published by Ebony Deaner. Modified about 1 year ago.

2
Suffix Trees: Construction and Applications
João Carreira, 2008

3
Outline
Why Suffix Trees?
Definition
Ukkonen's Algorithm (construction)
Applications

9
Why Suffix Trees?
Asymptotically fast.
The basis of state-of-the-art data structures.
You don't need a PhD to use them.
Challenging.
Expose interesting algorithmic ideas.

14
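Definition
A suffix tree for an m-character string S is a rooted tree with:
m leaves, numbered 1 to m
edge-label vs node-label: the edge-label is the substring written on an edge; the node-label (path-label) is the concatenation of the edge-labels on the path from the root to that node
each internal node other than the root has at least two children
the label of leaf j is S[ j..m ]
no two edges out of the same node can have edge-labels beginning with the same character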

15
Definition: Example
String: xabxac
Length (m): 6 characters
Number of leaves: 6
Label of leaf 5: S[ 5..6 ] = ac
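The definition and the xabxac example can be checked with a deliberately naive builder. This is a minimal sketch, not Ukkonen's algorithm: it inserts one suffix at a time in quadratic time and assumes that no suffix is a prefix of another (true for xabxac). The names `Node`, `build_naive` and `leaf_labels` are made up for illustration.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character -> (edge_label, child Node)
        self.leaf = None     # 1-based suffix start for leaves, else None


def build_naive(s):
    """Insert the suffixes of s one at a time (quadratic stand-in builder)."""
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            if not rest:
                raise ValueError("a suffix is a prefix of another suffix")
            if rest[0] not in node.children:
                leaf = Node()
                leaf.leaf = j + 1
                node.children[rest[0]] = (rest, leaf)   # new leaf edge
                break
            label, child = node.children[rest[0]]
            k = 0                                        # common prefix length
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):                           # split the edge
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaf_labels(node, prefix=""):
    """Map each leaf number to the concatenated edge-labels on its path."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    out = {}
    for label, child in node.children.values():
        out.update(leaf_labels(child, prefix + label))
    return out


labels = leaf_labels(build_naive("xabxac"))
print(len(labels))   # 6
print(labels[5])     # ac
```

Every root edge starts with a different character because the children are keyed by first character, which enforces the last condition of the definition by construction.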

16
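Implicit vs Explicit
What if we have “axabx”?
The suffix x is a prefix of the suffix xabx, so its path does not end at a leaf and the tree is only implicit. Appending a terminal character $ that occurs nowhere else in the string makes every suffix end at its own leaf.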
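Whether a string yields only an implicit tree can be tested directly on its suffixes. A small sketch; `hidden_suffixes` is a hypothetical helper name, and the quadratic scan is for illustration only.

```python
def hidden_suffixes(s):
    """Return the suffixes of s that are proper prefixes of another suffix.

    Such suffixes do not end at a leaf, so the tree for s is only implicit.
    """
    suffixes = [s[j:] for j in range(len(s))]
    return [a for a in suffixes
            if any(b != a and b.startswith(a) for b in suffixes)]


print(hidden_suffixes("axabx"))    # ['x']
print(hidden_suffixes("axabx$"))   # []
```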

18
Ukkonen's Algorithm: suffix tree construction
Text: S[ 1..m ]
m phases
phase i + 1 is divided into i + 1 extensions
In extension j of phase i + 1:
find the end of the path from the root labeled with substring S[ j..i ]
extend the substring by adding the character S(i + 1) to its end

19
Extension Rules (β = S[ j..i ])
Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.

20
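Extension Rules
Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created at the end of β; if β ends inside an edge, a new internal node is created there first.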

21
Extension Rules Rule 3: Some path from the end of β starts with S(i + 1), so we do nothing.
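The phases, extensions and three rules can be sketched as a direct, unoptimized program. This is an illustrative sketch under assumptions: edge labels are stored as plain strings, indices are 0-based, the very first extension on the empty tree is treated as Rule 2, and the names `Node`, `locate`, `extend`, `build_implicit` and `leaf_paths` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character -> [edge_label, child Node]


def locate(root, beta):
    """Walk beta from the root. Returns (node, edge, off).

    off > 0: beta ends off characters into `edge`, which leaves `node`.
    off == 0: beta ends exactly at `node`, reached via `edge` (None for root).
    """
    node, via, pos = root, None, 0
    while pos < len(beta):
        edge = node.children[beta[pos]]
        remaining = len(beta) - pos
        if remaining < len(edge[0]):
            return node, edge, remaining
        pos += len(edge[0])
        node, via = edge[1], edge
    return node, via, 0


def extend(root, beta, c):
    """Make sure beta + c is in the tree; return the rule that applied."""
    node, edge, off = locate(root, beta)
    if off == 0:
        if edge is not None and not node.children:
            edge[0] += c                      # Rule 1: grow the leaf edge
            return 1
        if c in node.children:
            return 3                          # Rule 3: already present
        node.children[c] = [c, Node()]        # Rule 2: new leaf edge
        return 2
    label, child = edge
    if label[off] == c:
        return 3                              # Rule 3 inside an edge
    mid = Node()                              # Rule 2: split, then add a leaf
    mid.children[label[off]] = [label[off:], child]
    mid.children[c] = [c, Node()]
    edge[0], edge[1] = label[:off], mid
    return 2


def build_implicit(s):
    root, rules = Node(), []
    for i in range(len(s)):                   # phase i + 1 adds s[i]
        for j in range(i + 1):                # extension j + 1 handles s[j:i]
            rules.append(extend(root, s[j:i], s[i]))
    return root, rules


def leaf_paths(node, prefix=""):
    if not node.children:
        return [prefix]
    return [p for label, child in node.children.values()
            for p in leaf_paths(child, prefix + label)]


root, rules = build_implicit("axabx")
print(sorted(leaf_paths(root)))   # ['abx', 'axabx', 'bx', 'xabx']
```

For axabx the suffix x has no leaf of its own, which is the implicit-tree situation from the earlier slide.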

27
Ukkonen's Algorithm: suffix tree construction
Complexity:
m phases
phase j -> j extensions
find the end of the path of substring β: O(|β|) = O(m)
each extension, once the end of β is found: O(1)
Total: O(m^3)
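The cubic bound follows from summing the per-extension cost over all phases. A short worked version, writing phase i + 1 as having i + 1 extensions and charging O(m) to each:

```latex
\sum_{i=0}^{m-1} \; \sum_{j=1}^{i+1} O(m)
  \;=\; O(m) \cdot \sum_{i=0}^{m-1} (i+1)
  \;=\; O(m) \cdot \frac{m(m+1)}{2}
  \;=\; O(m^{3})
```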

28
“First make it run, then make it run fast.” Brian Kernighan

29
Suffix Links
Definition: For an internal node v with path-label xα (x a single character, α a possibly empty string), if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

32
Suffix Links
Lemma: If a new internal node v with path-label xα is added to the current tree in extension j of some phase, then either the path labeled α already ends at an internal node, or an internal node at the end of string α will be created in the next extension of the same phase.
A new internal node is created only when Rule 2 applies, so:
S[ j..i ] = xα continues with some character c ≠ S(i + 1)
S[ j + 1..i ] = α continues with c as well, so extension j + 1 either finds an internal node at the end of α or creates one there.

36
Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link from it or is the root. Let λ denote the string between v and the end of S[ j - 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Otherwise traverse the suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[ j..i ] S(i + 1) is in the tree.
4. If a new internal node w was created in extension j - 1 (by rule 2), then string α must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).

37
Node Depth
The node-depth of v is at most one greater than the node-depth of s(v).
[Figure: nodes with path-labels xα, xß, xλ and their suffix-link targets with path-labels α, ß, λ. Annotations: "equal node-depth: 3", "Node depth: 4", "Node depth: 3".]

38

Skip/count Trick
γ: the string to be followed down from s(v), which is known to be in the tree
“Directly implemented” traversal: O(|γ|), one character comparison at a time
Skip/count: “jump” from node to node, looking only at the first character of each edge and skipping by the edge length
K = number of nodes on the path
Time to traverse the path: O(K)
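The jump from node to node can be sketched on a tiny hand-built tree. This is an illustrative sketch: the tree below is a hypothetical fragment (the branch of xabxac that starts with xa), edge labels are plain strings, and `skip_count` is a made-up name. It assumes, as the trick requires, that γ is known to be in the tree, so only the first character of each edge is inspected.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character -> (edge_label, child Node)


def skip_count(node, gamma):
    """Follow gamma below node without comparing every character.

    Returns (last node reached, part of gamma inside the next edge, jumps).
    """
    pos, hops = 0, 0
    while pos < len(gamma):
        label, child = node.children[gamma[pos]]   # first character only
        if len(gamma) - pos < len(label):
            break                                  # gamma ends inside this edge
        pos += len(label)                          # skip the whole edge
        node = child
        hops += 1
    return node, gamma[pos:], hops


# Hypothetical fragment: root -xa-> a_node, a_node -bxac-> leaf, a_node -c-> leaf
root, a_node = Node(), Node()
root.children["x"] = ("xa", a_node)
a_node.children["b"] = ("bxac", Node())
a_node.children["c"] = ("c", Node())

node, leftover, hops = skip_count(root, "xabx")
print(node is a_node, leftover, hops)   # True bx 1
```

The work done is proportional to the number of edges touched, not to the length of γ.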

47
Ukkonen's Algorithm
Using the skip/count trick, any phase of Ukkonen's algorithm takes O(m) time.
Proof:
There are i + 1 ≤ m extensions in phase i + 1.
In a single extension, the algorithm walks up at most one edge, traverses one suffix link, walks down some number of nodes, applies the extension rules and may add a suffix link.
The up-walk decreases the current node-depth by at most one.
Each suffix link traversal decreases the node-depth by at most another one.
Each down-walk moves to a node of greater depth.
Over the entire phase the node-depth is decremented at most 2m times.
No node can have depth greater than m, so the total increment to the current node-depth (down-walks) is bounded by 3m over the entire phase.

49
Ukkonen's Algorithm
m phases
1 phase: O(m)
Total: O(m^2)

50
“First make it run fast, then make it run faster.” João Carreira

52
Edge-Label Compression
A string with m characters has m suffixes.
If edge labels are stored as explicit characters, O(m^2) space is needed.
To achieve O(m) space, each edge-label is stored as a pair of indices (p, q) denoting the substring S[ p..q ].
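A small sketch of the index-pair representation, using the 1-based inclusive convention of the slides. The names `edge_label` and `explicit_label_chars` are hypothetical, and the character count assumes a string whose characters are all distinct, so that every suffix hangs off the root as its own edge.

```python
S = "xabxac"


def edge_label(p, q, s=S):
    """Recover the label S[p..q] (1-based, inclusive) from two integers."""
    return s[p - 1:q]


def explicit_label_chars(s):
    """Characters stored if every suffix of s is written out on its own edge.

    Assumes all characters of s are distinct (worst case for explicit labels).
    """
    return sum(len(s) - j for j in range(len(s)))


print(edge_label(3, 6))                  # bxac
print(explicit_label_chars("abcdef"))    # 21
print(2 * len("abcdef"))                 # 12
```

The last two lines contrast 21 stored characters with 12 stored integers for a six-character string. The gap grows quadratically with the length.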

53
Two more tricks...

55
Rule 3 is a show stopper
If rule 3 applies in extension j, it will also apply in all further extensions until the end of the phase.
Why? When rule 3 applies, the path labeled S[ j..i ] must continue with character S(i + 1), and so the path labeled S[ j + 1..i ] does also, and rule 3 again applies in extensions j + 1 ... i + 1.

56
Rule 3 is a show stopper
End any phase i + 1 the first time rule 3 applies. The remaining extensions are said to be done implicitly.

58
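Once a leaf always a leaf
Leaf created => always a leaf in all successive trees.
No mechanism for extending a leaf edge beyond its current leaf.
Once there is a leaf labeled j, extension rule 1 will always apply to extension j in any successive phase.
Leaf edge label: (p, e), where e is a single global index for the current end of the string. Incrementing e once per phase performs all rule 1 extensions at once.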
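The shared end index can be sketched with one mutable object that every leaf edge points to. An illustrative sketch only: `End` and `Edge` are hypothetical names, and indices are 1-based and inclusive as on the slides.

```python
class End:
    """One shared, mutable end index e for all leaf edges."""
    def __init__(self):
        self.value = 0


class Edge:
    def __init__(self, p, q):
        self.p = p   # 1-based start index
        self.q = q   # int for internal edges, shared End for leaf edges

    def label(self, s):
        q = self.q.value if isinstance(self.q, End) else self.q
        return s[self.p - 1:q]


s = "xabxac"
e = End()
leaf_edges = [Edge(1, e), Edge(2, e), Edge(3, e)]

e.value = 3
print([edge.label(s) for edge in leaf_edges])   # ['xab', 'ab', 'b']

e.value = 4   # one assignment extends every leaf edge (rule 1)
print([edge.label(s) for edge in leaf_edges])   # ['xabx', 'abx', 'bx']
```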

59
Single Phase Algorithm In each phase i:

60
Single Phase Algorithm During construction:

61
Implicit to Explicit One last phase to add character $: O(m)
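Putting the pieces together, here is a compact sketch of a linear-time builder. It is written in the widely used "active point" formulation rather than the exact bookkeeping of the slides, and that is an assumption of this example: `remainder` counts the extensions still pending, the `break` on rule 3 ends the phase early, leaf edges use an open end (`None`), and descending past whole edges is the skip/count trick. The names are hypothetical, indices are 0-based with exclusive ends, and the text is expected to end with a unique terminator such as $.

```python
class Node:
    def __init__(self, start, end):
        self.start = start    # edge into this node is text[start:end]
        self.end = end        # None = leaf edge, ends at the global end
        self.children = {}    # first character -> Node
        self.link = None      # suffix link (internal nodes)


def build_suffix_tree(text):
    root = Node(0, 0)
    node, edge, length = root, 0, 0       # active point
    remainder = 0                         # suffixes still to insert
    for i, ch in enumerate(text):         # phase i + 1
        remainder += 1
        last_new = None                   # internal node awaiting its link
        while remainder > 0:
            if length == 0:
                edge = i
            first = text[edge]
            child = node.children.get(first)
            if child is None:
                node.children[first] = Node(i, None)         # rule 2: new leaf
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                end = i + 1 if child.end is None else child.end
                edge_len = end - child.start
                if length >= edge_len:                        # skip/count
                    edge += edge_len
                    length -= edge_len
                    node = child
                    continue
                if text[child.start + length] == ch:          # rule 3: stop
                    if last_new is not None and node is not root:
                        last_new.link = node
                    length += 1
                    break
                split = Node(child.start, child.start + length)  # rule 2: split
                node.children[first] = split
                split.children[ch] = Node(i, None)
                child.start += length
                split.children[text[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link or root                      # follow suffix link
    return root


def leaf_paths(text, node, prefix=""):
    """Concatenated edge labels from the root to every leaf."""
    if not node.children:
        return [prefix]
    out = []
    for child in node.children.values():
        end = len(text) if child.end is None else child.end
        out += leaf_paths(text, child, prefix + text[child.start:end])
    return out


text = "xabxac$"
print(sorted(leaf_paths(text, build_suffix_tree(text))) ==
      sorted(text[j:] for j in range(len(text))))   # True
```

Appending $ before building plays the role of the last phase on the slide: every suffix then ends at its own leaf.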

62
Suffix Trees are a Swiss Knife

64
Applications
Exact String Matching:
[Figure: three occurrences of string aw.]
Preprocessing: O(m) for a text of length m
Search: O(n + k) for a pattern of length n with k occurrences
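The search step can be sketched as follows: match the pattern from the root, then report every leaf below the point where it ends. Two assumptions are made here. The builder is a quadratic stand-in for a linear-time construction, and the text awyawxawxz is an assumed example, since the figure's text is not part of this transcript. `Node`, `build_naive` and `find_all` are hypothetical names.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}   # first character -> (edge_label, child Node)
        self.leaf = leaf     # 1-based start position for leaves


def build_naive(text):
    """Quadratic stand-in builder; text must end with a unique terminator."""
    root = Node()
    for j in range(len(text)):
        node, rest = root, text[j:]
        while True:
            if rest[0] not in node.children:
                node.children[rest[0]] = (rest, Node(j + 1))
                break
            label, child = node.children[rest[0]]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):                       # split at the mismatch
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def find_all(root, pattern):
    """Return the 1-based start positions of pattern in the indexed text."""
    node, pos = root, 0
    while pos < len(pattern):                        # O(n) matching walk
        entry = node.children.get(pattern[pos])
        if entry is None:
            return []
        label, child = entry
        span = min(len(label), len(pattern) - pos)
        if label[:span] != pattern[pos:pos + span]:
            return []
        pos += span
        node = child
    hits, stack = [], [node]                         # O(k) leaf collection
    while stack:
        current = stack.pop()
        if current.leaf is not None:
            hits.append(current.leaf)
        stack.extend(child for _, child in current.children.values())
    return sorted(hits)                              # sorted for readability only


tree = build_naive("awyawxawxz$")
print(find_all(tree, "aw"))    # [1, 4, 7]
```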

65
Applications
And much more...
Longest common substring: O(n)
Longest repeated substring: O(n)
Longest palindrome: O(n)
Most frequently occurring substrings of a minimum length: O(n)
Shortest substrings occurring only once: O(n)
Lempel-Ziv decomposition: O(n)
...
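As one illustration from the list, the longest repeated substring is the path-label of the deepest branching node. The sketch below swaps in an uncompressed suffix trie of nested dictionaries, which takes O(n^2) time and space, so it shows the idea and not the O(n) bound. The function name and the default terminator are assumptions, and the terminator must not occur in the input.

```python
def longest_repeated_substring(s, terminator="$"):
    """Deepest branching node of a suffix trie (quadratic stand-in)."""
    text = s + terminator
    root = {}
    for j in range(len(text)):            # insert every suffix
        node = root
        for ch in text[j:]:
            node = node.setdefault(ch, {})
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        # Two or more children: the path occurs at least twice in the text.
        if len(node) >= 2 and len(path) > len(best):
            best = path
        for ch, child in node.items():
            stack.append((child, path + ch))
    return best


print(longest_repeated_substring("xabxac"))        # xa
print(longest_repeated_substring("mississippi"))   # issi
```

When several repeated substrings share the maximum length, which one is returned depends on traversal order.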

66
“Biology easily has 500 years of exciting problems to work on.” Donald Knuth

67
Questions?
web.ist.utl.pt/joao.carreira
