# Linear Time Construction of Suffix Tree

Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU


High-level view of Ukkonen's algorithm: Ukkonen's algorithm is divided into m phases, where m is the length of the string S. In phase i+1, tree i+1 is constructed from tree i. Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].

Example for a string beginning aba (the figure shows the tree after each phase):

- Phase 1: S[1…1], extensions {a}
- Phase 2: S[1…2], extensions {ab, b}
- Phase 3: S[1…3], extensions {aba, ba, a}
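The phase and extension structure can be sketched in Python as follows. This only enumerates which suffix each extension handles; it does not build a tree. The function name `phases_and_extensions` is made up for illustration, and Python's 0-based slices stand in for the 1-based S[j…i+1].

```python
def phases_and_extensions(s):
    """Return, for each phase, the list of suffixes handled by its extensions."""
    phases = []
    for end in range(1, len(s) + 1):         # phase number = length of the prefix
        prefix = s[:end]                     # S[1...end]
        # extension j handles the suffix of the prefix that starts at position j
        phases.append([prefix[j:] for j in range(end)])
    return phases


for number, extensions in enumerate(phases_and_extensions("aba"), start=1):
    print(number, extensions)
```

Summing the extension counts over all phases gives m(m+1)/2, which is why each extension has to be cheap on average for the whole construction to be linear.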

Phases for the string MISSISSIPI (positions 1234567890): 1 : M, 2 : MI, 3 : MIS, 4 : MISS, 5 : MISSI, 6 : MISSIS, 7 : MISSISS, 8 : MISSISSI, 9 : MISSISSIP, 10 : MISSISSIPI.

[Figure: the suffix tree for MISSISSIPI with leaves numbered 1 to 9.]

Corollary 6.1.1: In Ukkonen's algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.

How do suffix links help?

What is achieved so far? Not so much. The worst-case running time is still O(m²) for a phase.


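Trick 1: Skip/Count Trick. There must be a γ path from s(v). Walking down along γ character by character takes time proportional to |γ|. The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path: at each node, the algorithm compares the number of characters still to be matched with the length of the outgoing edge and, if the edge is not longer, skips the whole edge at once.

[Figure: the path γ = zabcdefghy passes through nodes along edges of lengths 2, 2, 3, 3.]

But what does it buy in terms of worst-case bounds?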
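A minimal Python sketch of the skip/count walk is given below. The `Node` class, the `(start, end, child)` edge tuples, and the function name `skip_count_walk` are assumptions made for illustration, not the presentation's data structure. The edge lengths 2, 2, 3, 3 for γ = zabcdefghy are likewise assumed for the example. The walk relies on the fact that γ is known to be in the tree, so only the first character of each edge is inspected.

```python
class Node:
    """A tree node; each outgoing edge is stored under its first character."""

    def __init__(self):
        # first char -> (start, end, child); the edge label is text[start:end]
        self.children = {}


def skip_count_walk(text, node, lo, hi):
    """Walk down from `node` along text[lo:hi], which must exist in the tree.

    Returns (last node reached, characters left over inside the next edge,
    number of edges skipped).
    """
    hops = 0
    while lo < hi:
        start, end, child = node.children[text[lo]]
        edge_len = end - start
        if hi - lo < edge_len:
            break                  # gamma ends inside this edge
        lo += edge_len             # skip the whole edge in one step
        node = child
        hops += 1
    return node, hi - lo, hops


# Build a single path spelling "zabcdefghy" with edge lengths 2, 2, 3, 3.
text = "zabcdefghy"
root = Node()
node, pos = root, 0
for length in (2, 2, 3, 3):
    child = Node()
    node.children[text[pos]] = (pos, pos + length, child)
    node, pos = child, pos + length
last = node

end_node, remaining, hops = skip_count_walk(text, root, 0, len(text))
print(hops, remaining)
```

The loop runs once per edge rather than once per character, so the ten characters of γ cost four steps here.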

Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen's algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).

[Figure: three suffix links annotated v = 2, s(v) = 1; v = 3, s(v) = 3; v = 4, s(v) = 5.]


Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen's algorithm takes O(m) time.

In a single extension, the algorithm:

- walks up at most one edge, which decreases the current node-depth by at most one;
- finds a suffix link and traverses it, which by Lemma 6.1.2 decreases the node-depth by at most another one;
- walks down some number of nodes, each step moving to a greater node-depth;
- applies the suffix extension rules;
- and may add a suffix link.

All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed:

- Over the entire phase, the current node-depth is decremented at most 2m times.
- Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase.
- The total number of edge traversals is therefore bounded by 3m.
- Since each edge traversal takes constant time with the skip/count trick, all the down-walking in a phase is O(m).

Complexity: There are m phases. Each phase takes O(m). So the running time is O(m²). Two more tricks and we are done.

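Simple Implementation Detail: With edge labels written out explicitly, a suffix tree may require O(m²) space. Consider the string abcdefghijklmnopqrstuvwxyz. Every suffix begins with a distinct character, so there are 26 edges out of the root, each labeled with a complete suffix. Writing these labels requires 26 × 27/2 characters in all. So an O(m) bound is impossible to achieve in this representation.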
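The character count can be checked with a few lines of Python. This is a sketch under the slide's assumption that all characters are distinct, so every suffix hangs off the root on its own edge; the lowercase alphabet is used as a 26-character stand-in, and the function name `explicit_label_chars` is hypothetical.

```python
import string


def explicit_label_chars(s):
    """Total characters written on edges when every suffix is its own root edge."""
    if len(set(s)) != len(s):
        raise ValueError("this sketch assumes all characters are distinct")
    # the suffix starting at 0-based position j has len(s) - j characters
    return sum(len(s) - j for j in range(len(s)))


total = explicit_label_chars(string.ascii_lowercase)
print(total, 26 * 27 // 2)
```

For a string of m distinct characters the sum is m(m+1)/2, so merely writing the labels already takes quadratic time and space.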

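Alternative Representation of Suffix Tree: Edge Label Compression. Instead of writing out the substring on an edge, write a pair of indices giving the start and end positions of that substring in S.

[Figure: a fragment of the suffix tree (left) and the same fragment with compressed edge labels (right); one label is annotated "could be 8,9".]

The number of edges is at most 2m - 1, and two numbers are written on each edge, so the space is O(m).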
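Edge-label compression might look like the following in Python. The `EdgeLabel` type and the `expand` helper are illustrative names, not part of the presentation; indices are 1-based and inclusive to match the slides.

```python
from typing import NamedTuple


class EdgeLabel(NamedTuple):
    """A compressed edge label: 1-based, inclusive positions into the string."""
    start: int
    end: int


def expand(text, label):
    """Recover the substring that a compressed label stands for."""
    return text[label.start - 1:label.end]


text = "MISSISSIPI"
first = EdgeLabel(2, 5)
second = EdgeLabel(5, 8)
print(expand(text, first), expand(text, second))
```

Both pairs expand to the same substring, which is why a figure annotation can say that a label "could be" a different pair: any occurrence of the substring serves equally well, and each edge still costs only two integers.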

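Phases for MISSISSIPI (positions 1234567890): 1 : M, 2 : MI, 3 : MIS, 4 : MISS, 5 : MISSI, 6 : MISSIS, 7 : MISSISS, 8 : MISSISSI.

[Figure: the tree with leaves 1 to 4, with the extensions of phases 7 (1234567) and 8 (12345678) marked as implicit or explicit.]

Observation 1: Rule 2 is a show stopper. This is the rule under which the string to be added is already in the tree and nothing is done; the reference text numbers it rule 3. Once it applies in some extension of a phase, it applies in all later extensions of that phase, so we stop further extension.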

Observation 2: Once a leaf, always a leaf.

[Figure: in phase 7 (1234567) the leaf edges are labeled (1,7), (2,7), (3,7), (4,7); in phase 8 (12345678) they are extended implicitly by setting the global end index e = 8. The explicit extensions are the major cost.]

At any phase, the cost is only for explicit extensions.

Phase 9 : MISSISSIP (123456789).

[Figure: with e = 9 the existing leaf edges become (1,9), (2,9), (3,9), (4,9); the explicit extensions add leaves 5 to 9, each reached by an edge ending in P, with edge labels such as (2,5), (6,9), and (9,9).]

Once a leaf, always a leaf: at any phase, the cost is only for explicit extensions.
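One common way to realize "once a leaf, always a leaf" is to let every leaf edge share a single end index, so that advancing it extends all leaves at once. The sketch below assumes that approach; `GlobalEnd` and `LeafEdge` are hypothetical names, and the start positions 1 to 4 are chosen only to mirror the edge labels shown on the slide.

```python
class GlobalEnd:
    """The shared end index e; every leaf edge refers to the same object."""

    def __init__(self, value):
        self.value = value


class LeafEdge:
    """A leaf edge (start, e) whose end is the shared global index."""

    def __init__(self, start, end):
        self.start = start   # 1-based start position of the edge label
        self.end = end       # reference to the shared GlobalEnd, not a copy

    def label(self, text):
        return text[self.start - 1:self.end.value]


text = "MISSISSIPI"
e = GlobalEnd(7)                                   # state at the end of phase 7
leaves = [LeafEdge(start, e) for start in (1, 2, 3, 4)]
before = [leaf.label(text) for leaf in leaves]

e.value = 8                                        # phase 8: one assignment
after = [leaf.label(text) for leaf in leaves]
print(before)
print(after)
```

The single assignment to `e.value` performs every leaf extension of the phase in constant time, leaving only the explicit extensions to pay for.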

[Figure: the extensions examined for MISSISSIPI (1234567890) in phases 8 and 9.]

Since there are only m phases, and consecutive phases share at most one explicit extension index, the total number of explicit extensions is bounded by 2m. So the total number of down-walks is bounded by O(m); in other words, the time to construct the suffix tree is bounded by O(m).
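The 2m bound can be checked empirically with a small counting sketch. This is not Ukkonen's algorithm: it uses naive substring tests instead of a tree, and it assumes that an extension is a show stopper exactly when the suffix being added already occurs in the previous prefix. The function name `count_explicit_extensions` is made up for illustration.

```python
def count_explicit_extensions(s):
    """Count explicit extensions over all phases, using naive substring checks.

    Returns (number of explicit extensions, number of leaves created).
    """
    leaves = 0       # extensions 1..leaves are handled implicitly via the global end
    explicit = 0
    for end in range(1, len(s) + 1):          # phase `end` builds the prefix s[:end]
        j = leaves + 1                        # first explicit extension (1-based)
        while j <= end:
            explicit += 1
            suffix = s[j - 1:end]
            if suffix in s[:end - 1]:         # already in the tree: show stopper
                break
            leaves = j                        # a new leaf j was created
            j += 1
    return explicit, leaves


explicit, leaves = count_explicit_extensions("MISSISSIPI")
print(explicit, leaves, 2 * len("MISSISSIPI"))
```

For MISSISSIPI this counts 15 explicit extensions and 9 leaves, within the bound of 2m = 20. The extension that stops one phase is the first one examined in the next, which is where the factor of two comes from.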

Reference: Chapter 6 of Algorithms on Strings, Trees and Sequences (Dan Gusfield).
