1. 2 Overview  Suffix tries  On-line construction of suffix tries in quadratic time  Suffix trees  On-line construction of suffix trees in linear.

Slides:



Advertisements
Similar presentations
On-line Construction of Suffix Trees Chairman : Prof. R.C.T. Lee Speaker : C. S. Wu ( ) June 10, 2004 Dept. of CSIE National Chi Nan University.
Advertisements

Boosting Textual Compression in Optimal Linear Time.
CS 336 March 19, 2012 Tandy Warnow.
Parametrized Matching Amir, Farach, Muthukrishnan Orgad Keller.
1 Turing Machines and Equivalent Models Section 13.2 The Church-Turing Thesis.
WSPD Applications.
Longest Common Subsequence
1 1 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 3 School of Innovation, Design and Engineering Mälardalen University 2012.
Heaps1 Part-D2 Heaps Heaps2 Recall Priority Queue ADT (§ 7.1.3) A priority queue stores a collection of entries Each entry is a pair (key, value)
Suffix Trees Construction and Applications João Carreira 2008.
Suffix Tree. Suffix Tree Representation S=xabxac Represent every edge using its start and end text location.
Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST.
On-line Linear-time Construction of Word Suffix Trees Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda.
Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.
1 Suffix Trees Charles Yan Suffix Trees: Motivations Substring problem: One is given a text T of length m. After O (m) preprocessing time, one.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.
Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i ate is.
Pattern Matching1. 2 Outline and Reading Strings (§9.1.1) Pattern matching algorithms Brute-force algorithm (§9.1.2) Boyer-Moore algorithm (§9.1.3) Knuth-Morris-Pratt.
Goodrich, Tamassia String Processing1 Pattern Matching.
Submitted by : Estrella Eisenberg Yair Kaufman Ohad Lipsky Riva Gonen Shalom.
1 Efficient String Matching : An Aid to Bibliographic Search Alfred V. Aho and Margaret J. Corasick Bell Laboratories.
Department of Computer Eng. & IT Amirkabir University of Technology (Tehran Polytechnic) Data Structures Lecturer: Abbas Sarraf Search.
Aho-Corasick String Matching An Efficient String Matching.
On-line Construction of Suffix Tree Esko Ukkonen Algorithmica Vol. 14, No. 3, pp , 1995.
Context Free Pumping Lemma Zeph Grunschlag. Agenda Context Free Pumping Motivation Theorem Proof Proving non-Context Freeness Examples on slides Examples.
1 Separator Theorems for Planar Graphs Presented by Shira Zucker.
1 Background Information for the Pumping Lemma for Context-Free Languages Definition: Let G = (V, T, P, S) be a CFL. If every production in P is of the.
Aho-Corasick Algorithm Generalizes KMP to handle sets of strings New ideas –keyword trees –failure functions/links –output links.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Pattern Matching1. 2 Outline Strings Pattern matching algorithms Brute-force algorithm Boyer-Moore algorithm Knuth-Morris-Pratt algorithm.
1 Exact Matching Charles Yan Na ï ve Method Input: P: pattern; T: Text Output: Occurrences of P in T Algorithm Naive Align P with the left end.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
Ch. 8 & 9 – Linear Sorting and Order Statistics What do you trade for speed?
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
CSC401 – Analysis of Algorithms Chapter 9 Text Processing
Boyer Moore Algorithm Idan Szpektor. Boyer and Moore.
 Rooted tree and binary tree  Theorem 5.19: A full binary tree with t leaves contains i=t-1 internal vertices.
Segment Trees - motivation.  When one wants to inspect a small portion of a large and complex objects. Example: –GIS: windowing query in a map.
Book: Algorithms on strings, trees and sequences by Dan Gusfield Presented by: Amir Anter and Vladimir Zoubritsky.
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU.
Strings Basic data type in computational biology A string is an ordered succession of characters or symbols from a finite set called an alphabet Sequence.
Chapter 8 Properties of Context-free Languages These class notes are based on material from our textbook, An Introduction to Formal Languages and Automata,
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
Applications of Suffix Trees Dr. Amar Mukherjee CAP 5937 – ST: Bioinformatics University of central Florida.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Approximation Algorithms based on linear programming.
5.6 Prefix codes and optimal tree Definition 31: Codes with this property which the bit string for a letter never occurs as the first part of the bit string.
Dr Nazir A. Zafar Advanced Algorithms Analysis and Design Advanced Algorithms Analysis and Design By Dr. Nazir Ahmad Zafar.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
15-853:Algorithms in the Real World
Advanced Algorithms Analysis and Design
Definition: Let G = (V, T, P, S) be a CFL
KMP String Matching Donald Knuth Jim H. Morris Vaughan Pratt 1997.
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Suffix Trees String … any sequence of characters.
Tries 2/27/2019 5:37 PM Tries Tries.
Instructor: Aaron Roth
Sequences 5/17/ :43 AM Pattern Matching.
Switching Lemmas and Proof Complexity
Presentation transcript:

1

2 Overview  Suffix tries  On-line construction of suffix tries in quadratic time  Suffix trees  On-line construction of suffix trees in linear time  Applications

3 Suffix Trees A suffix tree is a trie-like data structure representing all suffixes of a string. g o o o o goo

4 Notations  Let T = t 1 …t n be a string.  For 0  i  n, let T i = t 1 …t i denote the i-length prefix of T.  For 1  i  n + 1, let T i = t i …t n denote the suffix of T that starts at the i th position.  Let  (T) = {T i | 1  i  n + 1}.

5 Suffix Tries The suffix trie of T, denoted by STrie(T), is a trie representing  (T).

6 Suffix Tries (cont.) Definition: STrie(T) is an augmented DFA, STrie(T) = (Q  {  }, root, F, g, f) where:  Q = {x | x is a substring of T} is the set of the states of the DFA.  is an auxiliary state.  root is the initial state, corresponding to the empty string .  F =  (T) is the set of finite states.

7 Suffix Tries (cont.)  g : Q  {  }    Q (a partial function) is the transition function, defined as follows:  g(x,a) = y for all x,y  Q and a , s.t. y = xa.  g( ,a) = root for all a .  f : Q  Q  {  } is the suffix function defined as follows:  f(x) = y for all x,y  Q, x  root, s.t  a , s.t. x = ay.  f(root) = .

8 An Example – STrie(cacao) cacao aca acao ao   c cac caca cao ca o a ac o a a c c c o o o o  a a

9 The Size of Suffix Tries Theorem: The size of STrie(T), where |T| = n, is O(n 2 ). Proof: The size of STrie(T) is linear in the number of substrings of T. T has at most O(n 2 ) substrings. Thus the size of STrie(T) is O(n 2 ).

10 On-Line Construction of Suffix Tries  Let T = t 1 …t n.  1  i  n, the algorithm constructs STrie(T i ).  First we construct STrie(T 0 ) = STrie(  ).  Then,  1  i  n, we obtain STrie(T i ) from STrie(T i-1 ).

11 On-Line Construction of Suffix Tries (cont.) Observation 1:  (T i ) = {xt i | x   (T i-1 )}  {  }. Observation 2: The suffixes of T i can be found by starting at the state T i and following the suffix links, until . Thus,  (T i ) = {f j (T i ) | 0  j  i}. Definition: The path from T i to  following the suffix links is called the boundary path of STrie(T i ).

12 On-Line Construction of Suffix Tries (cont.) cacao aca acao ao   c cac caca cao ca o a ac o a a c c c o o o o  a a

13 a STrie(T i-1 )  STrie(T i ) a c a a c c   cac  caca

14 The Algorithm create STrie(  ) top   for i  1 to n do r  top while g(r,t i ) is undefined do create new state r’ and g(r,t i )  r’ if r  top then f(old-r’)  r’ old-r’  r’ r  f(r) f(old-r’)  g(r,t i ) top  g(top,t i )

15 a o The Algorithm (cont.) c a a c a c o o o o   c o c a a

16 Running Time Theorem: The running time of the algorithm is linear in the size of STrie(T), which is, in worst case, O(|T| 2 ).

17 Running Time (cont.) create STrie(  ) top   for i  1 to n do r  top while g(r, t i ) is undefined do create new state r’ and g(r, t i )  r’ if r  top then f(old-r’)  r’ old-r’  r’ r  f(r) f(old-r’)  g(r, t i ) top  g(top, t i ) O(1) for each node added to STrie(T)

18 Suffix Trees  A suffix tree STree(T) represents STrie(T) in space linear in |T|.  This is achieved by representing only a subset of Q’  {  } of Q  {  }, called the explicit states.

19 Explicit and Implicit States Definition: A state q is called explicit in the following cases:  q is a leaf  q is a branching state (has at least two transitions)  root and  are also defined to be branching states. Otherwise (if q has exactly one transitions and is not the root or  ), q is called implicit.

20 Explicit and Implicit States (cont). a o c a a c a c o o o o  

21 Generalized Transition Function  The string w spelled out by the transition path in STrie(T) between two explicit states s and r is represented in STree(T) as a generalized transition g’(s,w) = r.  A generalized transition g’(s,w) = r is called an a-transition if  a  and v  * s.t. w = av.  Note that for each explicit state s and a  there is at most one a-transition from s.

22 STrie(T)  STree(T) a o c a a c a c o o o o  

23 STrie(T)  STree(T) a o c a a c a c o o o o  

24 STrie(T)  STree(T) cao ca a cao o o o  

25 Suffix Links Definition: If x  Q’ is a branching state and x = ay, where a , then the suffix link of x is defined by f’(x) = y, and f’(  ) = . Proposition: If x  Q’ is a branching state and f’(x) = y then y is also a branching state. Proof:  a  b  s.t. xa and xb are substrings of T. y is a suffix of x. Thus ya and yb are also substrings of T.

26 STree(T) STree(T) = (Q’  {  }, root, g’, f’). cao ca a cao o o o  

27 The Size of Suffix Trees Theorem: The size of STree(T), where |T| = n, is O(n). Proof: Since we represent each substring w = t k …t p of T by a pair pointers (k,p), the size of STree(T) is linear in the number of explicit states. STree(T) has at most n leaves, and thus at most n - 1 branching states. Therefore, the size of STree(T) is O(n).

28 Reference Pairs Definition: Let r be an explicit or implicit state. (s,w) is called a reference pair for r if:  s is an explicit state and an ancestor of r.  w is the string spelled out by the transitions from s to r in the corresponding suffix trie. Definition: A reference pair (s,w) for r is called canonical if s is the closest explicit ancestor of r (or r itself, if it is explicit).

29 Active Point and Endpoint Let s 1 = T i-1, s 2, …, s i = root, s i+1 =  be the boundary path of STrie(T i-1 ). Definition: s j is called the active point of STrie(T i-1 ) if j is the smallest index for which s j is not a leaf. Definition: s j’ is called the endpoint of STrie(T i-1 ) if j’ is the smallest index for which g(s j’,t i ) is defined.

30 Active Point and Endpoint (cont.) The endpoint The active point a a c a a c c  

31 Active Point and Endpoint (cont.) Proposition: s j and s j’ are well defined and j  j’. Proof:  root is not a leaf  s j is defined.  g( ,t i ) is defined  s j’ is defined.  g(s j’,t i ) is defined  s j’ is not a leaf  j  j’.

32 Adding t i -Transitions to STrie(T i-1 ) Lemma: When obtaining STrie(T i ) from STrie(T i-1 ) the algorithm adds a t i -transition to each state s h s.t. 1  h < j’, and only to these states, as follows:  For 1  h < j, the new transition expands an old branch of the trie that ends at s h.  For j  h < j’, the new transition initiates a new branch from s h.

33 Adding t i -Transitions to STrie(T i-1 ) (cont.) The endpoint The active point a a c a a c c   o o o o o

34 On-Line Construction of Suffix Trees  We create STree(  ), and then  1  i  n we obtain STree(T i ) from STree(T i-1 ).  When obtaining STree(T i ) from STree(T i-1 ), we update STree(T i-1 ) according to the transitions we would add to STrie(T i-1 ).  Note that s 1,…,s i-1 are not necessarily explicit states.

35 On-Line Construction of Suffix Trees (cont.) For 1  h < j:  s h is a leaf. Thus,  s, 0  k  i-1 s.t. g’(s,(k,i-1)) = s h. We replace this transition by g’(s,(k,i)) = s h.  This would take too much time. Thus, we denote transitions of the type g’(s,(k,i-1)) in STree(T i-1 ) by g’(s,(k,  )). Hence, no updates are needed.

36 On-Line Construction of Suffix Trees (cont.) For j  h < j’:  If s h is an implicit state, we turn it into an explicit state by splitting the transition containing it.  We create a new leaf s h t i and add a new transition g’(s h,(i,  )).

37 EP a ac aca acao a c ca cac caca cacao ca On-Line Construction of Suffix Trees (cont.) cao o o   o c o c a a a a a a c c o o o o o c   AP EP

38 Lemma 1 Lemma 1: Let (s,(k,p)) be some reference pair for a state r. Then  s’, k’ s.t. (s’,(k’,p)) is the canonical reference pair for r. Proof: Let s’ be the closest explicit ancestor of r, or r itself if r is explicit. t k …t p is the path from the explicit state s to r. Thus, the path from s’ to r is a suffix t k’ …t p of t k …t p.

39 Lemma 2 Lemma 2: Let r be a state on the boundary path of STrie(T i ). Then  s, k s.t. (s,(k,i)) is the canonical reference pair for r. Proof: r is on the boundary path of STrie(T i ).  r refers to some suffix t k’ …t i of T i.  ( ,(k’,i)) is a reference pair for r.  the claim holds by lemma 1.

40 Lemma 3 Lemma 3: Let (s,(k,i-1)) be a reference pair for the endpoint of STrie(T i-1 ). Then (s,(k,i)) is a reference pair for the active point of STrie(T i ). Proof:  s j is the active point of STrie(T i-1 ) iff t j …t i-1 is the longest suffix of T i-1 that occurs at least twice in T i-1.

41 Lemma 3 (cont.) Proof (cont.):  s j’ is the endpoint of STrie(T i-1 ) iff t j’ …t i-1 is the longest suffix of T i-1 such that t j’ …t i-1 t i is a substring of T i-1.  Thus, if s j’ is the endpoint of STrie(T i-1 ), then t j’ …t i-1 t i is the longest suffix of T i that occurs at least twice in T i. Therefore, s j’ t i is the active point of STrie(T i ).

42 The Algorithm create STree(  ) s  root k  1 for i  1 to n do (s,k)  update(s,(k,i)) (s,k)  canonize(s,(k,i)) Transforms STree(T i-1 ) into STree(T i ). Input: (s,(k,i)) s.t. (s,(k,i-1) is the active point of STrie(T i-1 ). Output: (s’,k’) s.t. (s’,(k’,i-1) is the endpoint of STrie(T i-1 ). Input: a reference pair (s,(k,p)) for some state r. Output: (s’,k’) s.t. (s’,(k’,p)) is the canonical reference pair for r.

43 update(s,(k,i)) old-r  root (endpoint,r)  test-and-split(s,(k,i-1),t i ) while not endpoint do create new state r’; g’(r,(i,  ))  r’ if old-r  root then f’(old-r)  r old-r  r (s,k)  canonize(f’(s),(k,i-1)) (endpoint,r)  test-and-split(s,(k,i-1),t i ) if old-r  root then f’(old-r)  s return (s,k) Input: the canonical reference pair for some state r, and t i. Output: true/false if r is the endpoint or not, and the explicit state r (creating it if needed).

44 (2,  ) (2,2) (1,  ) (1,2) update   (3,  ) (5,  ) c o c a a s = root k = 1 i = 1 s =  s = root k = 2 i = 2 s =  s = root k = 3 i = 3 i = 4 i = 5 k = 4k = 5 s = 

45 test-and-split(s,(k,p),t) if k  p then find the t k -transition g’(s,(k’,p’)) = s’ from s if t = t k’+p-k+1 then return (true,s) else create a new state r replace g’(s,(k’,p’)) = s’ by g’(s,(k’,k’+p-k)) = r and g’(r,(k’+p-k+1,p’)) = s’ return (false,r) else if   t-transition from s then return (false,s) else return (true,s)

46 canonize(s,(k,p)) if p < k then return (s,k) else find the t k -transition g’(s,(k’,p’)) = s’ from s while p’ – k’  p – k do k  k + p’ – k’ + 1 s  s’ if k  p then find the t k -transition g’(s,(k’,p’)) = s’ from s return (s,k)

47 Running Time Theorem: The running time of the algorithm is O(n). Proof: We divide the running time into two components: 1.The total time of the procedure canonize. 2.The rest.

48 update old-r  root (endpoint,r)  test-and-split(s,(k,i-1),t i ) while not endpoint do create new state r’; g’(r,(i,  ))  r’ if old-r  root then f’(old-r)  r old-r  r (s,k)  canonize(f’(s),(k,i-1)) (endpoint,r)  test-and-split(s,(k,i-1),t i ) if old-r  root then f’(old-r)  s return (s,k) In each execution of the loop, a new state is created. O(1) Called n times

49 canonize if p < k then return (s,k) else find the t k -transition g’(s,(k’,p’)) = s’ from s while p’ – k’  p – k do k  k + p’ – k’ + 1 s  s’ if k  p then find the t k -transition g’(s,(k’,p’)) = s’ from s return (s,k) In each execution of the loop, the value of k increases. Called O(n) times

50 Applications - Exact String Matching Input: two strings: a text T and a pattern P. Output: all the occurrences of P in T. This problem can be solved in O(|T|+|P|) time (Boyer-Moore, Knuth-Morris-Pratt).

51 Applications - Exact String Matching (cont.)  We look at the case where we have a text T first, and then a sequence of patterns P 1,…,P r.  This problem can be solved using suffix trees.  Preprocessing time: O(|T|).  Finding a pattern P: O(|P|+k), where k is the number of occurrences of P in T.

52 Applications - Exact String Matching (cont.) abbababb # ababb# b b ab abb# ab b abb# b# # ababb# # #

53 Applications in Biology

54 Finding Repeats in DNA  The DNA contains many repetitive sequences with different biological functions.  We want to find all maximal repeats in a DNA sequence. ACCAGTTCGCGCATGAACGTTCGACCGGTTCGAT

55 Finding Repeats in DNA (cont.) Theorem: All maximal repeats in a sequence T can be found in O(|T|) time using suffix trees.

56 Finding Repeats in DNA (cont.) Lemma: If w is a maximal repeat in T, then the state w in STree(T) is explicit. Proof: If w is a maximal repeat then there are at least two occurrences of w in T s.t. the character following w is different. Thus w is a branching state, and therefore it is explicit.

57 Finding Repeats in DNA (cont.) Corollary: There are at most O(|T|) maximal repeats in T. Proof: By the above lemma, each maximal repeat corresponds to an explicit state. Since STree(T) has O(|T|) explicit states, T has O(|T|) maximal repeats.

58 Finding Repeats in DNA (cont.) Definition: The left character of a leaf t i …t n of STree(T) is t i-1. Definition: A node w of STree(T) is called left diverse if there are at least two leaves in w’s subtree with different left characters. Note that, by definition, a left diverse node is not a leaf.

59 Finding Repeats in DNA (cont.) Lemma: A substring w of T is a maximal repeat iff w is a left diverse explicit state in STree(T).

60 Finding Repeats in DNA (cont.) Proof: 1.Suppose w is a maximal repeat. i.By the previous lemma w is explicit. ii.  a  b  s.t aw and bw are substrings of T. Let awu and bwv be the corresponding suffixes.  wu and wv are two leaves in the subtree of w with different left characters.

61 Finding Repeats in DNA (cont.) 2.Suppose that w is explicit and left diverse. bwbw awaw bwdbwd awcawc (i) bwcbwc awcawc (ii) wdwd

62 Finding Repeats in DNA (cont.) TAGC# GCAT AGC# A A C # GC TAGC# ATAGC# # GC ATAGC# # TAGC# # CAGCATAGC - G LD G C T C A A A A C The maximal repeats: , C, CA, A, AGC

63 Bibliography  On-Line Construction of Suffix Trees E. Ukkonen  Algorithms on String, Trees, and Sequences Dan Gusfield