On-line Construction of Suffix Trees. Chairman: Prof. R.C.T. Lee. Speaker: C. S. Wu. June 10, 2004. Dept. of CSIE, National Chi Nan University.


2 Source
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14, 1995.

3 Outline
- Introduction
- Suffix tries and suffix trees
- Constructing suffix tries (quadratic time)
- On-line construction of suffix trees (linear time)

4 Notations
- Let T = t_1 t_2 ... t_n be a string over an alphabet Σ.
- T^i denotes the prefix t_1 ... t_i of T, for 0 ≤ i ≤ n.
- T_i denotes the suffix t_i ... t_n of T, for 1 ≤ i ≤ n + 1.
- Example: for T = abcde, the prefix T^3 is abc and the suffix T_3 is cde.

5 Notations (cont.)
- T_{n+1} = ε is the empty suffix.
- The set of all suffixes of T is denoted σ(T).
- Example: for T = abcde, σ(T) = {abcde, bcde, cde, de, e, ε}.
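
The prefix and suffix notation can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper names prefix, suffix and sigma are not from the slides, and positions are 1-based to match the slide notation.

```python
def prefix(T: str, i: int) -> str:
    """Prefix t_1 ... t_i of T, for 0 <= i <= n (1-based positions)."""
    return T[:i]


def suffix(T: str, i: int) -> str:
    """Suffix t_i ... t_n of T, for 1 <= i <= n + 1; i = n + 1 gives the empty suffix."""
    return T[i - 1:]


def sigma(T: str) -> list[str]:
    """All suffixes of T, longest first, ending with the empty suffix."""
    return [suffix(T, i) for i in range(1, len(T) + 2)]


if __name__ == "__main__":
    print(prefix("abcde", 3))   # prefix of length 3
    print(suffix("abcde", 3))   # suffix starting at position 3
    print(sigma("abcde"))
```

For T = abcde this reproduces the slide example: the prefix of length 3 is abc and the suffix starting at position 3 is cde.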

6 Suffix Tries & Suffix Trees
[Figure: a suffix trie and the corresponding suffix tree, side by side.]

7 Suffix Tries
- The suffix trie of T is a trie representing σ(T).
- We write STrie(T) = (Q ∪ {⊥}, root, F, g, f) and define such a trie as an augmented deterministic finite-state automaton.

8 Suffix Tries (cont.)
STrie(T) = (Q ∪ {⊥}, root, F, g, f), where:
- Q is the set of states of STrie(T). The states are in one-to-one correspondence with the substrings of T; the state that corresponds to a substring x is also written x.
- ⊥ is an auxiliary state.
- root is the initial state; it corresponds to the empty string ε.
- F is the set of final states; it corresponds to σ(T).

9 Suffix Tries (cont.)
- g is the transition function: g(x, a) = y for all x, y in Q such that y = xa, where a ∈ Σ.
- f is the suffix function: for x ≠ root we have x = ay for some a ∈ Σ, and we set f(x) = y. In addition, f(root) = ⊥.
- We call f(r) the suffix link of state r.

10 Suffix Tries (cont.)
[Figure: STrie(T) = (Q ∪ {⊥}, root, F, g, f) for T = abcabd, with suffix links. Note: only the last layer of suffix links is shown explicitly.]

11 Boundary Path
- The path of suffix links that starts from the deepest state t_1 ... t_{i-1} and ends at ⊥ is called the boundary path.
- The boundary path consists of the last layer of suffix links.

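12 Constructing Suffix Tries
- Observation: σ(T^i) = σ(T^{i-1}) t_i ∪ {ε}. Every suffix of T^i is either a suffix of T^{i-1} followed by t_i, or the empty suffix.
- Example: if σ(T^{i-1}) = {abcd, bcd, cd, d, ε} and t_i = e, then σ(T^i) = {abcde, bcde, cde, de, e, ε}.
[Figure: the two suffix sets, listed along the boundary path.]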

13 Constructing Suffix Tries (cont.)
Algorithm 1.
    r ← top;
    while g(r, t_i) is undefined do
        create new state r' and new transition g(r, t_i) = r';
        if r ≠ top then create new suffix link f(oldr') = r';
        oldr' ← r';
        r ← f(r);
    create new suffix link f(oldr') = g(r, t_i);
    top ← g(top, t_i).
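
Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the speaker's code: the class names State and Bottom and the helpers goto, build_suffix_trie and count_states are made up here. The sketch assumes, as in Ukkonen's paper, that the auxiliary state ⊥ has a transition to root on every character, which is what makes the while loop stop.

```python
class State:
    """A trie state with its transitions g and its suffix link f."""

    def __init__(self):
        self.g = {}    # transition function: character -> State
        self.f = None  # suffix link


class Bottom(State):
    """Auxiliary state ⊥ (assumption: it leads to root on every character)."""

    def __init__(self, root):
        super().__init__()
        self.root = root


def goto(state, ch):
    """g(state, ch), or None when the transition is undefined."""
    if isinstance(state, Bottom):
        return state.root
    return state.g.get(ch)


def build_suffix_trie(T):
    """Build STrie(T) on-line, one character at a time; returns root."""
    root = State()
    root.f = Bottom(root)           # f(root) = ⊥
    top = root                      # deepest state, i.e. the whole text read so far
    for ch in T:
        r, old = top, None
        while goto(r, ch) is None:  # walk the boundary path
            new = State()
            r.g[ch] = new
            if old is not None:     # corresponds to "if r ≠ top"
                old.f = new
            old = new
            r = r.f
        if old is not None:
            old.f = goto(r, ch)
        top = goto(top, ch)
    return root


def count_states(root):
    """Number of states reachable from root (⊥ is not counted)."""
    stack, n = [root], 0
    while stack:
        s = stack.pop()
        n += 1
        stack.extend(s.g.values())
    return n


if __name__ == "__main__":
    print(count_states(build_suffix_trie("abcabd")))
```

Every state is created exactly once and every loop iteration creates a state, so the running time is proportional to the size of the trie.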

14 Constructing Suffix Tries (cont.)
[Figure: STrie(T) for T = a, showing top and the positions of r. The boundary path is colored orange.]

15 Constructing Suffix Tries (cont.)
[Figure: STrie(T) for T = ab, showing top and the positions of r. The boundary path is colored orange.]

16 Constructing Suffix Tries (cont.)
[Figure: STrie(T) for T = abc, showing top and the positions of r. The boundary path is colored orange.]

17 Constructing Suffix Tries (cont.)
[Figure: STrie(T) for T = abca, showing top and the positions of r. The boundary path is colored orange.]

18 Constructing Suffix Tries (cont.)
[Figure: STrie(T) for T = abcab, showing top and the positions of r. The boundary path is colored orange.]

19 Constructing Suffix Tries (cont.)
[Figure: STrie(T) for T = abcabd, showing top and the positions of r.]

20 Constructing Suffix Tries (cont.)
[Figure: the finished STrie(T) for T = abcabd.]

21 Constructing Suffix Tries (cont.)
Theorem 1. The suffix trie STrie(T) can be constructed in time proportional to the size of STrie(T), which in the worst case is O(|T|^2).
Note: the number of nodes in STrie(T) is the number of distinct substrings of T. A string of length n has O(n^2) substrings, so the size of STrie(T) is O(n^2).
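
The quadratic worst case is actually reached. As a small check (an illustration added here, not part of the slides), the string consisting of n copies of a followed by n copies of b has length 2n and exactly (n + 1)^2 distinct substrings, the empty string included, because its substrings are exactly p copies of a followed by q copies of b with 0 ≤ p, q ≤ n. The helper names distinct_substrings and trie_state_count are hypothetical.

```python
def distinct_substrings(T: str) -> set[str]:
    """All distinct substrings of T, including the empty string."""
    n = len(T)
    return {T[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def trie_state_count(T: str) -> int:
    """Number of states of the suffix trie of T other than ⊥: one per distinct substring."""
    return len(distinct_substrings(T))


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        T = "a" * n + "b" * n
        # columns: |T|, number of trie states, (n + 1)^2
        print(len(T), trie_state_count(T), (n + 1) ** 2)
```

The last two columns printed by the loop agree, so the trie size grows quadratically in the length of this family of strings.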

22 Suffix Trees
- The suffix tree STree(T) represents STrie(T) in space linear in the length |T|.
- It represents only a subset Q' ∪ {⊥} of the states of STrie(T).
- Q' consists of all branching states and all leaves of STrie(T).
- The states in Q' ∪ {⊥} are called the explicit states.
- The other states of STrie(T) are called implicit states of STree(T); they are not explicitly present in STree(T).

23 Suffix Trees (cont.)
[Figure: a suffix trie and the corresponding suffix tree, with the implicit states and the explicit states marked.]

24 Suffix Trees (cont.)
- The string w = t_k ... t_p spelled out between two explicit states s and r is represented in STree(T) as the generalized transition g'(s, w) = r.
- To save space, the string w = t_k ... t_p is actually represented as a pair (k, p) of pointers into T.
- A transition g'(s, (k, p)) = r is called an a-transition if t_k = a. Each state s can have at most one a-transition for each a ∈ Σ.
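
The pointer-pair representation can be illustrated with a short Python sketch. The helper names label and is_a_transition are assumptions made for this example; positions are 1-based, as on the slides.

```python
def label(T: str, k: int, p: int) -> str:
    """String t_k ... t_p spelled out by the pointer pair (k, p); 1-based positions."""
    return T[k - 1:p]


def is_a_transition(T: str, k: int, a: str) -> bool:
    """A transition labeled (k, p) is an a-transition when t_k equals a."""
    return T[k - 1] == a


if __name__ == "__main__":
    T = "ababc"
    for k, p in [(1, 2), (2, 2), (3, 5), (5, 5)]:
        print((k, p), label(T, k, p))
```

Each transition costs two integers regardless of how long its label is, which is what keeps the tree linear in size. A pair of the form (p + 1, p) spells the empty string.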

25 Suffix Trees (cont.)
- Suffix function: f' is defined only for branching states. For a branching state x ≠ root, f'(x) = y, where y is the branching state such that x = ay for some a ∈ Σ. In addition, f'(root) = ⊥.
- If x is a branching state, then f'(x) is also a branching state. These suffix links are explicitly represented.
- The suffix tree of T is denoted STree(T) = (Q' ∪ {⊥}, root, g', f').

26 Size of Suffix Trees
[Figure: STree(T) for T = ababc. The transitions are labeled with the pointer pairs (1,2), (2,2), (3,5) and (5,5), and a-transitions, b-transitions and c-transitions are distinguished.]

27 Size of Suffix Trees (cont.)
- The size of STree(T) is linear in |T|.
- Q' has at most |T| leaves and therefore contains at most |T| - 1 branching states.
- There can be at most 2|T| - 2 transitions between the states in Q'.

28 Reference to a State
- We refer to a state r of a suffix tree by a reference pair (s, w).
- s is some explicit state that is an ancestor of r.
- w is the string spelled out by the transitions from s to r in the corresponding suffix trie.
- A reference pair is canonical if s is the closest explicit ancestor of r.
- The pair (s, ε) is represented as (s, (p + 1, p)).

29 States on the Boundary Path
- Let s_1 = t_1 ... t_{i-1}, s_2, s_3, ..., s_i = root, s_{i+1} = ⊥ be the states of STrie(T^{i-1}) on the boundary path.
- Let j be the smallest index such that s_j is not a leaf.
- Let j' be the smallest index such that s_{j'} has a t_i-transition.
- We call state s_j the active point and s_{j'} the end point of STrie(T^{i-1}).

30 States on the Boundary Path
Lemma 1. Algorithm 1 adds to STrie(T^{i-1}) a t_i-transition for each of the states s_h, 1 ≤ h < j'.
- For 1 ≤ h < j, the new transition expands an old branch of the trie that ends at leaf s_h.
- For j ≤ h < j', the new transition initiates a new branch from s_h.
Algorithm 1 does not create any other transitions.

31 States on the Boundary Path
Algorithm 1 inserts two different groups of t_i-transitions into STrie(T^{i-1}):
- First group: the states on the boundary path before the active point s_j get a transition.
- Second group: the states from the active point s_j to the end point s_{j'}, the end point excluded, get a new transition.

32 States on the Boundary Path
[Figure: STrie(T^{i-1}) for T^{i-1} = abcab and t_i = d, showing the last layer of suffix links (the boundary path), the active point, the end point, and the first and second groups of states.]

33 States on the Boundary Path
[Figure: STrie(T^i) after adding t_i = d to T^{i-1} = abcab, with the active point, the end point, and the first and second groups marked. The new transitions and new nodes are colored green.]

34 Adding Transitions to STree(T^{i-1})
- The first group can be handled without changing STree(T^{i-1}) explicitly.
- A transition of STree(T^{i-1}) leading to a leaf is called an open transition.
- Such a transition is of the form g'(s, (k, i - 1)) = r.
- Instead of updating the right pointer at every step, open transitions are represented as g'(s, (k, ∞)) = r.
- ∞ indicates that the transition is "open to grow".
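
A short sketch shows why an open transition never needs updating. It assumes, for illustration only, that ∞ is modeled by math.inf and resolved against the current text length when the label is read; the names OPEN and label are hypothetical.

```python
import math

OPEN = math.inf  # stands for the right pointer ∞ of an open transition


def label(T: str, k: int, p) -> str:
    """Label of the transition (k, p) over the text read so far; 1-based positions."""
    end = len(T) if p == OPEN else p
    return T[k - 1:end]


if __name__ == "__main__":
    # The same stored pair (3, ∞) spells a longer label once the text has grown.
    print(label("abcab", 3, OPEN))
    print(label("abcabd", 3, OPEN))
```

The stored pair stays the same while the text grows from abcab to abcabd, so all leaves are extended at no cost.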

35 Open Transitions
[Figure: STree(T^{i-1}) for T^{i-1} = abcab and t_i = d, with pointer pairs such as (1,2), (2,2) and the open transitions (3, ∞); the active point, the end point, and the first and second groups are marked.]

36 Open Transitions
[Figure: STree(T^i) for T^{i-1} = abcab and t_i = d. The transitions ab = (1,2) and b = (2,2) lead to explicit states, the open transitions (3, ∞) lead to the leaves abcabd, bcabd and cabd, and new d-transitions have been added. The new transitions and new nodes are colored green.]

37 Adding Transitions to STree(T^{i-1}) (cont.)
- New branches are created for the second group.
- The states of the second group are represented explicitly or implicitly.
- They are found along the boundary path using reference pairs and suffix links.
- Let (s, w) be the canonical reference pair for s_h, j ≤ h < j'. Then (s, w) = (s, (k, i - 1)) for some k ≤ i.
- If (s, (k, i - 1)) already refers to the end point s_{j'}, we are done.
- Otherwise a new branch has to be created. If (s, (k, i - 1)) refers to an implicit state, a new explicit state is first created by splitting the transition. Then a t_i-transition is created.

38 On-Line Construction of Suffix Trees (cont.)
Lemma 2. Let (s, (k, i - 1)) be a reference pair of the end point s_{j'} of STree(T^{i-1}). Then (s, (k, i)) is a reference pair of the active point of STree(T^i).
Proof.
- s_j is the active point of STree(T^{i-1}) if and only if s_j is the longest suffix of T^{i-1} that occurs at least twice in T^{i-1}.
- s_{j'} is the end point of STree(T^{i-1}) if and only if s_{j'} is the longest suffix of T^{i-1} such that t_{j'} ... t_{i-1} t_i is a substring of T^{i-1}.
- If s_{j'} is the end point of STree(T^{i-1}), then t_{j'} ... t_{i-1} t_i is the longest suffix of T^i that occurs at least twice in T^i, that is, state g(s_{j'}, t_i) is the active point of STree(T^i).
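
The two characterizations used in the proof can be checked by brute force. The sketch below is an illustration added here, not part of the slides: the function names are hypothetical, the empty string stands for root, and None stands for the auxiliary state ⊥ (from which, following Ukkonen's paper, every character leads to root).

```python
def occurrences(T: str, x: str) -> int:
    """Number of (possibly overlapping) occurrences of x in T."""
    return sum(1 for i in range(len(T) - len(x) + 1) if T[i:i + len(x)] == x)


def active_point(T: str) -> str:
    """Longest suffix of T that occurs at least twice in T; root ("") as fallback."""
    for i in range(len(T) + 1):
        x = T[i:]
        if x == "" or occurrences(T, x) >= 2:
            return x


def end_point(T: str, c: str):
    """Longest suffix x of T such that x + c is a substring of T; None stands for ⊥."""
    for i in range(len(T) + 1):
        x = T[i:]
        if x + c in T:
            return x
    return None


def lemma2_holds(T: str) -> bool:
    """Check, for every step i, that the end point followed by t_i is the next active point."""
    for i in range(1, len(T) + 1):
        before, c = T[:i - 1], T[i - 1]
        ep = end_point(before, c)
        expected = "" if ep is None else ep + c   # the transition from ⊥ leads to root
        if active_point(T[:i]) != expected:
            return False
    return True


if __name__ == "__main__":
    print(active_point("abcab"), end_point("abcab", "d"), lemma2_holds("abcabd"))
```

For the text abcab and the next character d, the active point is ab and the end point is ⊥, which agrees with the running example of the slides.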

39 Constructing Suffix Tries (cont.)
[Figure: step-by-step on-line construction for T = a, ab, abc, abca, abcab and abcabd (i = 0, ..., 6), showing the values of s and k, the pointer pairs on the transitions, including open transitions such as (1, ∞), (2, ∞) and (3, ∞), and the active point and end point.]

40 On-Line Construction of Suffix Trees
Algorithm 2. Construction of STree(T) for string T = t_1 t_2 ... # in alphabet Σ = {t_{-1}, ..., t_{-m}}; # is the end marker.
    create states root and ⊥;
    for j ← 1, ..., m do create transition g'(⊥, (-j, -j)) = root;
    create suffix link f'(root) = ⊥;
    s ← root; k ← 1; i ← 0;
    while t_{i+1} ≠ # do
        i ← i + 1;
        (s, k) ← update(s, (k, i));
        (s, k) ← canonize(s, (k, i)).

41 On-Line Construction of Suffix Trees (cont.)
procedure update(s, (k, i)):
    (s, (k, i - 1)) is the canonical reference pair for the active point;
    oldr ← root; (end-point, r) ← test-and-split(s, (k, i - 1), t_i);
    while not (end-point) do
        create new transition g'(r, (i, ∞)) = r' where r' is a new state;
        if oldr ≠ root then create new suffix link f'(oldr) = r;
        oldr ← r;
        (s, k) ← canonize(f'(s), (k, i - 1));
        (end-point, r) ← test-and-split(s, (k, i - 1), t_i);
    if oldr ≠ root then create new suffix link f'(oldr) = s;
    return (s, k).

42 On-Line Construction of Suffix Trees (cont.)
procedure test-and-split(s, (k, p), t):
    if k ≤ p then
        let g'(s, (k', p')) = s' be the t_k-transition from s;
        if t = t_{k'+p-k+1} then return (true, s)
        else
            replace the t_k-transition above by transitions
                g'(s, (k', k' + p - k)) = r and g'(r, (k' + p - k + 1, p')) = s'
            where r is a new state;
            return (false, r)
    else
        if there is no t-transition from s then return (false, s)
        else return (true, s).

43 On-Line Construction of Suffix Trees (cont.)
procedure canonize(s, (k, p)):
    if p < k then return (s, k)
    else
        find the t_k-transition g'(s, (k', p')) = s' from s;
        while p' - k' ≤ p - k do
            k ← k + p' - k' + 1;
            s ← s';
            if k ≤ p then find the t_k-transition g'(s, (k', p')) = s' from s;
        return (s, k).
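
Algorithm 2 and its three procedures can be transcribed into Python roughly as follows. This is a sketch under stated assumptions, not the speaker's or Ukkonen's code: states are integers, with ⊥ as state 0 and root as state 1; the transitions from ⊥ to root are not stored but simulated by the helper transition; math.inf plays the role of ∞; no end marker is required because the loop simply runs over the whole text; and the method contains is an addition used only to check the result.

```python
import math

INF = math.inf
BOTTOM, ROOT = 0, 1


class SuffixTree:
    def __init__(self, text):
        self.t = " " + text        # pad so that t[1..n] is 1-based as on the slides
        self.g = [{}, {}]          # g[s][a] = (k, p, target): the a-transition from s
        self.f = [None, BOTTOM]    # suffix links; f'(root) = ⊥
        s, k = ROOT, 1
        for i in range(1, len(text) + 1):
            s, k = self.update(s, k, i)
            s, k = self.canonize(s, k, i)

    def new_state(self):
        self.g.append({})
        self.f.append(None)
        return len(self.g) - 1

    def transition(self, s, k):
        """The t_k-transition from s; ⊥ leads to root on every single character."""
        if s == BOTTOM:
            return k, k, ROOT
        return self.g[s][self.t[k]]

    def update(self, s, k, i):
        oldr = ROOT
        end_point, r = self.test_and_split(s, k, i - 1, self.t[i])
        while not end_point:
            self.g[r][self.t[i]] = (i, INF, self.new_state())  # open transition
            if oldr != ROOT:
                self.f[oldr] = r
            oldr = r
            s, k = self.canonize(self.f[s], k, i - 1)
            end_point, r = self.test_and_split(s, k, i - 1, self.t[i])
        if oldr != ROOT:
            self.f[oldr] = s
        return s, k

    def test_and_split(self, s, k, p, t):
        if k <= p:
            k1, p1, s1 = self.transition(s, k)
            if t == self.t[k1 + p - k + 1]:
                return True, s
            r = self.new_state()   # make the implicit state explicit
            self.g[s][self.t[k1]] = (k1, k1 + p - k, r)
            self.g[r][self.t[k1 + p - k + 1]] = (k1 + p - k + 1, p1, s1)
            return False, r
        if s == BOTTOM:            # ⊥ has a transition on every character
            return True, s
        return (t in self.g[s]), s

    def canonize(self, s, k, p):
        if p < k:
            return s, k
        k1, p1, s1 = self.transition(s, k)
        while p1 - k1 <= p - k:
            k += p1 - k1 + 1
            s = s1
            if k <= p:
                k1, p1, s1 = self.transition(s, k)
        return s, k

    def contains(self, x):
        """True if x is a substring of the text (walk down from root)."""
        s, j, n = ROOT, 0, len(self.t) - 1
        while j < len(x):
            if x[j] not in self.g[s]:
                return False
            k, p, s = self.g[s][x[j]]
            edge = self.t[k:min(p, n) + 1]
            step = x[j:j + len(edge)]
            if not edge.startswith(step):
                return False
            j += len(step)
        return True


if __name__ == "__main__":
    tree = SuffixTree("abcabd")
    print(tree.contains("cab"), tree.contains("ba"))
```

A dictionary per state is used for the transitions, so finding the t_k-transition takes expected constant time; with a different container the alphabet size would enter the running time.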

44 Time Complexity
Theorem 2. Algorithm 2 constructs the suffix tree STree(T) for a string T = t_1 ... t_n on-line in time O(n).
Proof. The procedure update is called n times. Its running time is proportional to the total number of visited states.

45 Time Complexity Analysis
[Figure: the prefixes a, ab, abc, abca, abcab and abcabd stacked in rows, labeled height = n and width n + 1.]

46 Time Complexity Analysis
- Let r_{i-1} be the string corresponding to the active point s_j of STree(T^{i-1}), and let r_i be the string corresponding to the active point of STree(T^i).
- By Lemma 2, the string corresponding to the end point s_{j'} is r_i with its last character t_i removed.
- Each suffix link on the boundary path shortens the string by one character, so the number of visited states in loop i is
    length(r_{i-1}) - (length(r_i) - 1) + 1 = length(r_{i-1}) - length(r_i) + 2.
- Summing over i = 1, ..., n, the total number of visited states is
    length(r_0) - length(r_n) + 2n ≤ 2n.

47 Conclusion
A suffix tree can be constructed in linear time by employing
- suffix links,
- open transitions for leaf nodes,
- implicit nodes,
and by relying on active points and end points.

48 Suffix trees have many applications:
- string searching
- finding repeated substrings
Many applications appear in Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield, Cambridge University Press, 1997.

49 Any Questions?

50 Thank You