On-line Construction of Suffix Trees Chairman : Prof. R.C.T. Lee Speaker : C. S. Wu ( ) June 10, 2004 Dept. of CSIE National Chi Nan University.


2 On-line Construction of Suffix Trees. Chairman: Prof. R.C.T. Lee. Speaker: C. S. Wu. June 10, 2004. Dept. of CSIE, National Chi Nan University.

3 Source. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-260, 1995.

4 Outline. Introduction; suffix tries and suffix trees; constructing suffix tries (quadratic time); on-line construction of suffix trees (linear time).

5 Notations. Let T = t_1 t_2 ... t_n be a string over an alphabet Σ. T^i denotes the prefix t_1 ... t_i of T for 0 ≤ i ≤ n. T_i denotes the suffix t_i ... t_n of T for 1 ≤ i ≤ n + 1. Example: for T = abcde, T^3 = abc and T_3 = cde.

6 Notations (cont.). T_{n+1} = ε is the empty suffix. The set of all suffixes of T is denoted σ(T). Example: for T = abcde, σ(T) = {abcde, bcde, cde, de, e, ε}.
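The notation above maps directly onto string slicing. The following Python sketch is an illustration only; the helper names prefix, suffix and sigma are assumptions made for this example and do not appear in the slides.

```python
def prefix(T: str, i: int) -> str:
    """T^i = t_1 ... t_i for 0 <= i <= n (1-based positions, as on the slides)."""
    return T[:i]


def suffix(T: str, i: int) -> str:
    """T_i = t_i ... t_n for 1 <= i <= n + 1; T_{n+1} is the empty suffix."""
    return T[i - 1:]


def sigma(T: str) -> list[str]:
    """All suffixes of T, longest first, ending with the empty suffix."""
    return [suffix(T, i) for i in range(1, len(T) + 2)]


print(prefix("abcde", 3))   # abc
print(suffix("abcde", 3))   # cde
print(sigma("abcde"))       # ['abcde', 'bcde', 'cde', 'de', 'e', '']
```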

7 Suffix Tries & Suffix Trees. [Figure: the suffix trie and the suffix tree of T = ababc, side by side.]

8 Suffix Tries. The suffix trie of T is a trie representing σ(T). We write STrie(T) = (Q ∪ {⊥}, root, F, g, f) and define such a trie as an augmented deterministic finite-state automaton.

9 Suffix Tries (cont.). In STrie(T) = (Q ∪ {⊥}, root, F, g, f): Q is the set of states of STrie(T), in one-to-one correspondence with the substrings of T; x̄ denotes the state that corresponds to a substring x. ⊥ is an auxiliary state. root is the initial state and corresponds to the empty string ε. F is the set of final states, which correspond to the suffixes in σ(T).

10 Suffix Tries (cont.). g is the transition function: g(x̄, a) = ȳ for all x̄, ȳ in Q such that y = xa, where a ∈ Σ. f is the suffix function: for x̄ ≠ root we have x = ay for some a ∈ Σ, and we set f(x̄) = ȳ; in addition, f(root) = ⊥. We call f(r) the suffix link of state r.

11 Suffix Tries (cont.). [Figure: STrie(T) = (Q ∪ {⊥}, root, F, g, f) for T = abcabd, with suffix links. Note: only the last layer of suffix links is shown explicitly.]

12 Boundary Path. The path that starts from the deepest state t_1 ... t_{i-1} and follows suffix links until it ends at ⊥ is called the boundary path. The boundary path consists of the last layer of suffix links.

13 Constructing Suffix Tries. Observation: σ(T^i) = σ(T^{i-1}) t_i ∪ {ε}. Example: σ(T^{i-1}) = {abcd, bcd, cd, d, ε} and t_i = e give σ(T^i) = {abcde, bcde, cde, de, e, ε}. [Figure: the new character is appended along the boundary path.]
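The observation can be checked mechanically on small strings. This is a minimal Python sketch, assuming suffix sets are modeled as plain Python sets; suffix_set and extend are names chosen for this example only.

```python
def suffix_set(T: str) -> set[str]:
    """sigma(T): every suffix of T, including the empty suffix."""
    return {T[i:] for i in range(len(T) + 1)}


def extend(suffixes: set[str], t: str) -> set[str]:
    """sigma(T^{i-1}) t_i united with {empty string}."""
    return {x + t for x in suffixes} | {""}


T = "abcde"
for i in range(1, len(T) + 1):
    # Appending t_i to every old suffix and adding the empty suffix
    # gives exactly the suffix set of the longer prefix T^i.
    assert extend(suffix_set(T[:i - 1]), T[i - 1]) == suffix_set(T[:i])
```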

14 Constructing Suffix Tries (cont.). Algorithm 1:
    r ← top;
    while g(r, t_i) is undefined do
        create new state r' and new transition g(r, t_i) = r';
        if r ≠ top then create new suffix link f(oldr') = r';
        oldr' ← r';
        r ← f(r);
    create new suffix link f(oldr') = g(r, t_i);
    top ← g(top, t_i).
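Algorithm 1 can be sketched in Python as below. This is an illustrative translation, not the speaker's code. The class State and the helpers build_suffix_trie and spelled_strings are hypothetical names. The sketch also assumes, as in Ukkonen's paper, that the auxiliary state ⊥ has a transition to root on every character of the alphabet, which the slide does not state.

```python
class State:
    def __init__(self):
        self.g = {}     # transition function: character -> State
        self.f = None   # suffix link


def build_suffix_trie(T: str) -> State:
    root, bottom = State(), State()
    # Assumption from the paper: g(bottom, a) = root for every character a.
    bottom.g = {a: root for a in set(T)}
    root.f = bottom
    top = root                      # deepest state, corresponds to T^{i-1}
    for t in T:                     # one pass of Algorithm 1 per character t_i
        r, oldr = top, None
        while t not in r.g:         # walk the boundary path
            new = State()
            r.g[t] = new
            if oldr is not None:    # same test as "r != top"
                oldr.f = new
            oldr = new
            r = r.f
        oldr.f = r.g[t]             # the loop always runs at least once
        top = top.g[t]
    return root


def spelled_strings(root: State) -> list[str]:
    """Every string spelled out from root, one per state of the trie."""
    out, stack = [], [(root, "")]
    while stack:
        s, x = stack.pop()
        out.append(x)
        stack.extend((child, x + a) for a, child in s.g.items())
    return out


print(sorted(spelled_strings(build_suffix_trie("ab"))))  # ['', 'a', 'ab', 'b']
```

Each state corresponds to one distinct substring, so listing the spelled strings is a quick way to inspect the result on short inputs.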

15 Constructing Suffix Tries (cont.). [Figure: Algorithm 1 on T = a, with the pointers r and top marked; the boundary path is colored orange.]

16 Constructing Suffix Tries (cont.). [Figure: Algorithm 1 on T = ab, with the pointers r and top marked; the boundary path is colored orange.]

17 Constructing Suffix Tries (cont.). [Figure: Algorithm 1 on T = abc, with the pointers r and top marked; the boundary path is colored orange.]

18 Constructing Suffix Tries (cont.). [Figure: Algorithm 1 on T = abca, with the pointers r and top marked; the boundary path is colored orange.]

19 Constructing Suffix Tries (cont.). [Figure: Algorithm 1 on T = abcab, with the pointers r and top marked; the boundary path is colored orange.]

20 Constructing Suffix Tries (cont.). [Figure: Algorithm 1 on T = abcabd, with the pointers r and top marked.]

21 Constructing Suffix Tries (cont.). [Figure: the completed suffix trie STrie(T) for T = abcabd.]

22 Constructing Suffix Tries (cont.). Theorem 1: The suffix trie STrie(T) can be constructed in time proportional to the size of STrie(T), which in the worst case is O(|T|^2). Note: the number of states of STrie(T) is the number of distinct substrings of T. A string of length n has O(n^2) substrings, so the size of STrie(T) is O(n^2).
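The quadratic worst case is easy to observe numerically. The sketch below counts distinct substrings by brute force rather than building the trie; the function name trie_states and the test family a^k b^k are choices made for this illustration.

```python
def trie_states(T: str) -> int:
    """Number of distinct substrings of T, the empty string included.

    This equals |Q|, since STrie(T) has one state per distinct substring.
    """
    n = len(T)
    return len({T[i:j] for i in range(n + 1) for j in range(i, n + 1)})


for k in (1, 2, 4, 8):
    T = "a" * k + "b" * k
    # The distinct substrings are exactly a^i b^j with 0 <= i, j <= k,
    # so the count is (k + 1)^2, quadratic in n = 2k.
    print(len(T), trie_states(T), (k + 1) ** 2)
```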

23 Suffix Trees. The suffix tree STree(T) represents STrie(T) in space linear in the length |T|. It represents only a subset Q' ∪ {⊥} of the states of STrie(T), where Q' consists of all branching states and all leaves of STrie(T). The states in Q' ∪ {⊥} are called explicit states. The other states of STrie(T) are called implicit states of STree(T); they are not explicitly present in STree(T).

24 Suffix Trees (cont.). [Figure: the suffix trie and the suffix tree of T = ababc, with implicit and explicit states marked.]

25 Suffix Trees (cont.). The string w = t_k ... t_p spelled out between two explicit states s and r is represented in STree(T) as a generalized transition g'(s, w) = r. To save space, w is actually represented as a pair (k, p) of pointers into T. A transition g'(s, (k, p)) = r is called an a-transition if t_k = a. Each s can have at most one a-transition for each a ∈ Σ.
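As a small illustration of the pointer-pair representation, the sketch below recovers the label of a transition from the pair (k, p). The helper name label is an assumption for this example; positions are 1-based and inclusive, as on the slides.

```python
def label(T: str, k: int, p: int) -> str:
    """The string t_k ... t_p named by the pointer pair (k, p)."""
    return T[k - 1:p]


T = "ababc"
print(label(T, 1, 2))  # ab
print(label(T, 2, 2))  # b
print(label(T, 3, 5))  # abc
print(label(T, 5, 5))  # c
```

Whatever the length of w, each transition stores only two integers, which is what keeps the total space linear in |T|.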

26 Suffix Trees (cont.). The suffix function f' is defined only for branching states x̄ ≠ root: f'(x̄) = ȳ, where ȳ is the branching state such that x = ay for some a ∈ Σ; in addition, f'(root) = ⊥. If x̄ is a branching state, then f'(x̄) is also a branching state. These suffix links are explicitly represented. The suffix tree of T is denoted STree(T) = (Q' ∪ {⊥}, root, g', f').

27 Size of Suffix Trees. [Figure: the suffix tree of T = ababc with its a-, b- and c-transitions labeled by pointer pairs: (1,2) = ab, (2,2) = b, (3,5) = abc, (5,5) = c.]

28 Size of Suffix Trees (cont.). The size of STree(T) is linear in |T|. Q' has at most |T| leaves (at most one per non-empty suffix) and therefore contains at most |T| - 1 branching states (when |T| > 1). There can be at most 2|T| - 2 transitions between the states in Q'.

29 Reference to a State. We refer to a state r of a suffix tree, explicit or implicit, by a reference pair (s, w), where s is some explicit state that is an ancestor of r and w is the string spelled out by the transitions from s to r in the corresponding suffix trie. A reference pair is canonical if s is the closest explicit ancestor of r. A pair (s, ε) is represented as (s, (p + 1, p)).

30 States on the Boundary Path. Let s_1 = t_1 ... t_{i-1}, s_2, s_3, ..., s_i = root, s_{i+1} = ⊥ be the states of STrie(T^{i-1}) on the boundary path. Let j be the smallest index such that s_j is not a leaf, and let j' be the smallest index such that s_{j'} has a t_i-transition. We call s_j the active point and s_{j'} the end point of STrie(T^{i-1}).

31 States on the Boundary Path. Lemma 1: Algorithm 1 adds to STrie(T^{i-1}) a t_i-transition for each of the states s_h, 1 ≤ h < j'. For 1 ≤ h < j, the new transition expands an old branch of the trie that ends at leaf s_h. For j ≤ h < j', the new transition initiates a new branch from s_h. Algorithm 1 does not create any other transitions.

32 States on the Boundary Path. Algorithm 1 inserts two different groups of t_i-transitions into STrie(T^{i-1}). First group: the states on the boundary path before the active point s_j are leaves, so each of them gets a transition that extends an existing branch. Second group: the states from the active point s_j up to, but excluding, the end point s_{j'} are not leaves, so each of them gets a new transition that starts a new branch.

33 States on the Boundary Path. [Figure: STrie(T^{i-1}) for T^{i-1} = abcab and t_i = d, showing the boundary path (last layer of suffix links), the active point, the end point, and the first and second groups of states.]

34 States on the Boundary Path. [Figure: STrie(T^i) after adding t_i = d to T^{i-1} = abcab; new transitions and new nodes are colored green, and the first group, second group, active point and end point are marked.]

35 Adding Transitions to STree(T^{i-1}). The first group requires no explicit changes to STree(T^{i-1}). A transition of STree(T^{i-1}) leading to a leaf is called an open transition. Such a transition has the form g'(s, (k, i-1)) = r. Instead, open transitions are represented as g'(s, (k, ∞)) = r, where ∞ indicates that the transition is 'open to grow'.
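The effect of an open transition can be illustrated with a few lines of Python. This is a sketch under the assumption that ∞ is modeled by math.inf and clipped to the length of the text read so far; open_label is a hypothetical helper name.

```python
import math


def open_label(T: str, k: int, p) -> str:
    """Label of the transition (k, p); p may be math.inf for an open transition."""
    end = min(p, len(T))        # an open transition ends at the current end of T
    return T[k - 1:end]


edge = (3, math.inf)                  # stored once, never updated afterwards
print(open_label("abcab", *edge))     # cab
print(open_label("abcabd", *edge))    # cabd
```

The stored pair does not change when the next character is read, which is why the first group of transitions costs nothing in the suffix tree.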

36 Open Transitions. [Figure: STree(T^{i-1}) for T^{i-1} = abcab and t_i = d, with transitions ab = (1,2) and b = (2,2) and open transitions (3, ∞) spelling cab to the leaves abcab, bcab and cab; the active point, end point, first group and second group are marked.]

37 Open Transitions. [Figure: STree(T^i) after adding t_i = d to T^{i-1} = abcab; the open transitions (3, ∞) now lead to the leaves abcabd, bcabd and cabd, and the new d-transitions and new nodes are colored green.]

38 Adding Transitions to STree(T^{i-1}) (cont.). New branches are created for the second group. The states s_h of this group may be explicit or implicit; they are found along the boundary path using reference pairs and suffix links. Let (s, w) be the canonical reference pair for s_h, j ≤ h < j'. Then (s, w) = (s, (k, i-1)) for some k ≤ i. If (s, (k, i-1)) already refers to the end point s_{j'}, we are done. Otherwise a new branch has to be created: if (s, (k, i-1)) refers to an implicit state, a new explicit state is first created by splitting the transition; then a t_i-transition is created.

39 On-Line Construction of Suffix Trees (cont.). Lemma 2: Let (s, (k, i-1)) be a reference pair of the end point s_{j'} of STree(T^{i-1}). Then (s, (k, i)) is a reference pair of the active point of STree(T^i). Proof: s_j is the active point of STree(T^{i-1}) if and only if s_j corresponds to the longest suffix of T^{i-1} that occurs at least twice in T^{i-1}. s_{j'} is the end point of STree(T^{i-1}) if and only if s_{j'} corresponds to the longest suffix t_{j'} ... t_{i-1} of T^{i-1} such that t_{j'} ... t_{i-1} t_i is a substring of T^{i-1}. Hence, if s_{j'} is the end point of STree(T^{i-1}), then t_{j'} ... t_{i-1} t_i is the longest suffix of T^i that occurs at least twice in T^i; that is, the state g(s_{j'}, t_i) is the active point of STree(T^i).
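The string characterizations used in the proof make Lemma 2 easy to test by brute force. The sketch below works on strings rather than tree states; the function names are hypothetical, and None is used here to stand for the auxiliary state ⊥, whose t_i-transition leads to root.

```python
def occurrences(T: str, x: str) -> int:
    """Number of (possibly overlapping) occurrences of x in T."""
    return sum(1 for i in range(len(T) - len(x) + 1) if T.startswith(x, i))


def active_point(T: str) -> str:
    """Longest suffix of T that occurs at least twice in T."""
    for i in range(len(T) + 1):
        if occurrences(T, T[i:]) >= 2:
            return T[i:]
    return ""                   # only reached for the empty string: root


def end_point(T: str, t: str):
    """Longest suffix x of T such that x + t is a substring of T, else None."""
    for i in range(len(T) + 1):
        if T[i:] + t in T:
            return T[i:]
    return None                 # the end point is the auxiliary state


def lemma2_holds(T: str, t: str) -> bool:
    x = end_point(T, t)
    expected = "" if x is None else x + t   # g(end point, t)
    return active_point(T + t) == expected


print(active_point("abcab"))     # ab
print(end_point("abcab", "d"))   # None
print(lemma2_holds("abcab", "d"))
```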

40 Constructing Suffix Tries (cont.). [Figure: the suffix tree built step by step for T = abcabd, i = 0, ..., 6, showing the pair (s, k), the active point, the end point and the open transitions (k, ∞) at each step; the final tree has transitions (1,2) and (2,2) and open transitions starting at positions 3 and 6.]

41 On-Line Construction of Suffix Trees. Algorithm 2: construction of STree(T) for the string T = t_1 t_2 ... # over the alphabet Σ = {t_{-1}, ..., t_{-m}}; # is the end marker.
    create states root and ⊥;
    for j ← 1, ..., m do create transition g'(⊥, (-j, -j)) = root;
    create suffix link f'(root) = ⊥;
    s ← root; k ← 1; i ← 0;
    while t_{i+1} ≠ # do
        i ← i + 1;
        (s, k) ← update(s, (k, i));
        (s, k) ← canonize(s, (k, i)).

42 On-Line Construction of Suffix Trees (cont.).
    procedure update(s, (k, i)):
        (s, (k, i-1)) is the canonical reference pair for the active point;
        oldr ← root; (end-point, r) ← test-and-split(s, (k, i-1), t_i);
        while not (end-point) do
            create new transition g'(r, (i, ∞)) = r' where r' is a new state;
            if oldr ≠ root then create new suffix link f'(oldr) = r;
            oldr ← r;
            (s, k) ← canonize(f'(s), (k, i-1));
            (end-point, r) ← test-and-split(s, (k, i-1), t_i);
        if oldr ≠ root then create new suffix link f'(oldr) = s;
        return (s, k).

43 On-Line Construction of Suffix Trees (cont.).
    procedure test-and-split(s, (k, p), t):
        if k ≤ p then
            let g'(s, (k', p')) = s' be the t_k-transition from s;
            if t = t_{k'+p-k+1} then return (true, s)
            else
                replace the t_k-transition above by transitions
                    g'(s, (k', k'+p-k)) = r and g'(r, (k'+p-k+1, p')) = s'
                where r is a new state;
                return (false, r)
        else
            if there is no t-transition from s then return (false, s)
            else return (true, s).

44 On-Line Construction of Suffix Trees (cont.).
    procedure canonize(s, (k, p)):
        if p < k then return (s, k)
        else
            find the t_k-transition g'(s, (k', p')) = s' from s;
            while p' - k' ≤ p - k do
                k ← k + p' - k' + 1;
                s ← s';
                if k ≤ p then find the t_k-transition g'(s, (k', p')) = s' from s;
            return (s, k).
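Algorithm 2 and its three procedures can be put together as a runnable Python sketch. It is one possible translation, not the speaker's or Ukkonen's code, and it makes several labeled assumptions. The names Node, build_suffix_tree, find and contains are hypothetical. The transitions from ⊥ are simulated in find instead of being stored under negative indices. ∞ is modeled by math.inf. The loop reads every character of the given text instead of stopping at an end marker, so the caller may append one if desired.

```python
import math


class Node:
    def __init__(self):
        self.edges = {}    # first character -> (k, p, child), i.e. g'(s, (k, p)) = child
        self.link = None   # suffix link f'


def build_suffix_tree(text: str) -> Node:
    T = " " + text                      # pad so that T[1] is t_1, as on the slides
    root, bottom = Node(), Node()
    root.link = bottom

    def find(s, k):
        # Assumption: every character leads from bottom to root by a length-1 transition.
        return (k, k, root) if s is bottom else s.edges[T[k]]

    def canonize(s, k, p):
        if p < k:
            return s, k
        k1, p1, s1 = find(s, k)
        while p1 - k1 <= p - k:
            k += p1 - k1 + 1
            s = s1
            if k <= p:
                k1, p1, s1 = find(s, k)
        return s, k

    def test_and_split(s, k, p, t):
        if k <= p:
            k1, p1, s1 = find(s, k)
            if t == T[k1 + p - k + 1]:
                return True, s
            r = Node()                  # make the implicit state explicit
            s.edges[T[k1]] = (k1, k1 + p - k, r)
            r.edges[T[k1 + p - k + 1]] = (k1 + p - k + 1, p1, s1)
            return False, r
        return (s is bottom or t in s.edges), s

    def update(s, k, i):
        oldr = root
        end_point, r = test_and_split(s, k, i - 1, T[i])
        while not end_point:
            r.edges[T[i]] = (i, math.inf, Node())   # open transition to a new leaf
            if oldr is not root:
                oldr.link = r
            oldr = r
            s, k = canonize(s.link, k, i - 1)
            end_point, r = test_and_split(s, k, i - 1, T[i])
        if oldr is not root:
            oldr.link = s
        return s, k

    s, k = root, 1
    for i in range(1, len(text) + 1):
        s, k = update(s, k, i)
        s, k = canonize(s, k, i)
    return root


def contains(root: Node, text: str, pattern: str) -> bool:
    """True if pattern is a substring of text, found by walking down from root."""
    T, s, i = " " + text, root, 0
    while i < len(pattern):
        if pattern[i] not in s.edges:
            return False
        k, p, s = s.edges[pattern[i]]
        label = T[k:min(p, len(text)) + 1]
        chunk = pattern[i:i + len(label)]
        if not label.startswith(chunk):
            return False
        i += len(chunk)
    return True


tree = build_suffix_tree("abcabd")
print(contains(tree, "abcabd", "cab"), contains(tree, "abcabd", "ba"))  # True False
```

The contains helper is included only to make the result inspectable; it relies on the fact that every substring of the text is spelled out along some path from root.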

45 Time Complexity. Theorem 2: Algorithm 2 constructs the suffix tree STree(T) for a string T = t_1 ... t_n on-line in time O(n). Proof: update is called n times. It takes time proportional to the total number of visited states.

46 Time Complexity Analysis. [Figure: diagram for the string aababcabcaabcababcabd, with height = n and width n + 1.]

47 Time Complexity Analysis. [Figure: the boundary path from the active point s_j to the end point s_{j'}.] Let r_{i-1} be the string corresponding to the active point of STree(T^{i-1}). By Lemma 2, the string corresponding to the end point is r_i without its last character t_i, so its length is length(r_i) - 1. The number of states visited in loop i is therefore length(r_{i-1}) - (length(r_i) - 1) + 1 = length(r_{i-1}) - length(r_i) + 2. The total number of visited states is the sum over i = 1, ..., n of (length(r_{i-1}) - length(r_i) + 2) = length(r_0) - length(r_n) + 2n ≤ 2n.
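The telescoping step behind this bound can be written out as follows. The notation |r_i| for length(r_i) is shorthand introduced here. Each suffix link on the boundary path shortens the current string by one character, so walking from the active point to the end point follows |r_{i-1}| - (|r_i| - 1) links and visits one state more than that.

```latex
\begin{aligned}
\sum_{i=1}^{n} \bigl( |r_{i-1}| - (|r_i| - 1) + 1 \bigr)
  &= \sum_{i=1}^{n} \bigl( |r_{i-1}| - |r_i| \bigr) + 2n \\
  &= |r_0| - |r_n| + 2n \\
  &\le 2n ,
\end{aligned}
```

The last step uses that r_0 is the empty string, so |r_0| = 0, and that |r_n| is non-negative.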

48 Conclusion. A suffix tree can be constructed in linear time by employing suffix links, open transitions for leaf nodes, and implicit nodes, and by relying on active points and end points.

49 Suffix trees have many applications, such as string searching and finding repeated substrings. Many applications appear in Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield, Cambridge, 1997.

50 Any Questions?

51 Thank You

