Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST.

Slides:



Advertisements
Similar presentations
On-line Construction of Suffix Trees Chairman : Prof. R.C.T. Lee Speaker : C. S. Wu ( ) June 10, 2004 Dept. of CSIE National Chi Nan University.
Advertisements

Introduction to Computer Science 2 Lecture 7: Extended binary trees
Longest Common Subsequence
QuickSort Average Case Analysis An Incompressibility Approach Brendan Lucier August 2, 2005.
Chapter 4: Trees Part II - AVL Tree
Suffix Trees Construction and Applications João Carreira 2008.
Suffix Tree. Suffix Tree Representation S=xabxac Represent every edge using its start and end text location.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
B+-Trees (PART 1) What is a B+ tree? Why B+ trees? Searching a B+ tree
Two implementation issues Alphabet size Generalizing to multiple strings.
Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.
Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.
What about the trees of the Mississippi? Suffix Trees explained in an algorithm for indexing large biological sequences Jacob Kleerekoper & Marjolijn Elsinga.
Suffix Trees and Suffix Arrays
1 Suffix Trees Charles Yan Suffix Trees: Motivations Substring problem: One is given a text T of length m. After O (m) preprocessing time, one.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i ate is.
Goodrich, Tamassia String Processing1 Pattern Matching.
1 Applications of Suffix Trees Charles Yan Exact String Matching |P|=n, |T|=m P and T are both known at the same time Boyer-Moore, or Suffix.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Suffix trees and suffix arrays presentation by Haim Kaplan.
Department of Computer Eng. & IT Amirkabir University of Technology (Tehran Polytechnic) Data Structures Lecturer: Abbas Sarraf Search.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
B + -Trees (Part 1) Lecture 20 COMP171 Fall 2006.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
DAST 2005 Week 4 – Some Helpful Material Randomized Quick Sort & Lower bound & General remarks…
Algorithms and Data Structures. /course/eleg67701-f/Topic-1b2 Outline  Data Structures  Space Complexity  Case Study: string matching Array implementation.
Aho-Corasick Algorithm Generalizes KMP to handle sets of strings New ideas –keyword trees –failure functions/links –output links.
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
Chapter Tow Search Trees BY HUSSEIN SALIM QASIM WESAM HRBI FADHEEL CS 6310 ADVANCE DATA STRUCTURE AND ALGORITHM DR. ELISE DE DONCKER 1.
UNC Chapel Hill M. C. Lin Point Location Reading: Chapter 6 of the Textbook Driving Applications –Knowing Where You Are in GIS Related Applications –Triangulation.
Chapter 19: Binary Trees. Objectives In this chapter, you will: – Learn about binary trees – Explore various binary tree traversal algorithms – Organize.
“On an Algorithm of Zemlyachenko for Subtree Isomorphism” Yefim Dinitz, Alon Itai, Michael Rodeh (1998) Presented by: Masha Igra, Merav Bukra.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
1. 2 Overview  Suffix tries  On-line construction of suffix tries in quadratic time  Suffix trees  On-line construction of suffix trees in linear.
Binary Trees, Binary Search Trees RIZWAN REHMAN CENTRE FOR COMPUTER STUDIES DIBRUGARH UNIVERSITY.
20/10/2015Applied Algorithmics - week31 String Processing  Typical applications: pattern matching/recognition molecular biology, comparative genomics,
Boyer Moore Algorithm Idan Szpektor. Boyer and Moore.
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Chapter 18: Searching and Sorting Algorithms. Objectives In this chapter, you will: Learn the various search algorithms Implement sequential and binary.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Comp. Genomics Recitation 3 (week 4) 26/3/2009 Multiple Hypothesis Testing+Suffix Trees Based in part on slides by William Stafford Noble.
Applications of Suffix Trees Dr. Amar Mukherjee CAP 5937 – ST: Bioinformatics University of central Florida.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
15-853:Algorithms in the Real World
Azita Keshmiri CS 157B Ch 12 indexing and hashing
13 Text Processing Hongfei Yan June 1, 2016.
Enumerating Distances Using Spanners of Bounded Degree
Indexing and Hashing Basic Concepts Ordered Indices
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Suffix Trees String … any sequence of characters.
Tries 2/27/2019 5:37 PM Tries Tries.
Presentation transcript:

Suffix Trees

2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

3 Introduction

4 Substrings String is any sequence of characters. Substring of string S is a string composed of characters i through j, i  j of S.  S = cater;ate is a substring.  car is not a substring.  Empty string is a substring of S.

5 Subsequences Subsequence of string S is a string composed of characters i 1 < i 2 < … < i k of S.  S = cater; ate is a subsequence.  car is a subsequence.  Empty string is a subsequence of S.

6 String/Pattern Matching - I You are given a source string S. Suppose we have to answer queries of the form: is the string p i a substring of S? Knuth-Morris-Pratt (KMP) string matching.  O(|S| + | p i |) time per query.  O(n|S| +  i | p i |) time for n queries. Suffix tree solution.  O(|S| +  i | p i |) time for n queries.

7 String/Pattern Matching - II KMP preprocesses the query string p i, whereas the suffix tree method preprocesses the source string (text) S. The suffix tree for the text is built in O(m) time during a pre-processing stage; thereafter, whenever a string of length O(n) is input, the algorithm searches it in O(n) time using that suffix tree.

String Matching: Prefixes & Suffixes Substrings of S beginning at the first position of S are called prefixes of S, and substrings that end at its last position are called suffixes of S. S=AACTAG  Prefixes: AACTAG,AACTA,AACT,AAC,AA,A  Suffixes: AACTAG,ACTAG,CTAG,TAG,AG,G p i is a substring of S iff p i is a prefix of some suffix of S.

9 Suffix Trees

10 Definition: Suffix Tree (ST) T for S of length m 1. A rooted tree with m leaves numbered from 1 to m. 2. Each internal node, excluding the root, of T has at least 2 children. 3. Each edge of T is labeled with a nonempty substring of S. 4. No two edges out of a node can have edge-labels starting with the same character. 5. For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S, namely S[ i, m ], that starts at position i.

11 Example: Suffix Tree for S=xabxac

12 Existence of a suffix tree S If one suffix S j of S matches a prefix of another suffix S i of S, then the path for S j would not end at a leaf. S = xabxa S 1 = xabxa and S 4 = xa How to avoid this problem?  Assume that the last character of S appears nowhere else in S.  Add a new character $ not in the alphabet to the end of S.

13 Example: Suffix Tree for S=xabxac$

14 Building STs in linear time: Ukkonen’s algorithm

15 Building STs in linear time Weiner’s algorithm [FOCS, 1973]  ”The algorithm of 1973” called by Knuth  First algorithm of linear time, but much space McGreight’s algorithm [JACM, 1976]  Linear time and quadratic space  More readable Ukkonen’s algorithm [Algorithmica, 1995]  Linear time algorithm and less space  This is what we will focus on

16 Implicit Suffix Trees Ukkonen’s algorithm constructs a sequence of implicit STs, the last of which is converted to a true ST of the given string. An implicit suffix tree for string S is a tree obtained from the suffix tree for S$ by  removing $ from all edge labels  removing any edges that now have no label  removing any node that does not still have at least two children An implicit suffix tree for prefix S[1, i ] of S is similarly defined based on the suffix tree for S[1, i ]$. I i will denote the implicit suffix tree for S[1, i ]. Each suffix is in the tree, but may not end at a leaf.

17 Example: Construction of the Implicit ST Implicit tree for xabxa from tree for xabxa$ {xabxa$, abxa$, bxa$, xa$, a$, $} 1 b b b x x x x a a a a a $ $ $ $ $$

18 Construction of the Implicit ST: Remove $ Remove $ {xabxa$, abxa$, bxa$, xa$, a$, $} 1 b b b x x x x a a a a a $ $ $ $ $$

19 Construction of the Implicit ST: After the Removal of $ Remove $ {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a

20 Construction of the Implicit ST: Remove unlabeled edges Remove unlabeled edges {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a

21 Construction of the Implicit ST: After the Removal of Unlabeled Edges Remove unlabeled edges {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3

22 Construction of the Implicit ST: Remove interior nodes Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a} 1 b b b x x x x a a a a a 2 3

23 Construction of the Implicit ST: Final implicit tree Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a} 1 b b b x x xx a a a a a 2 3

24 Ukkonen’s Algorithm (UA) I i is the implicit suffix tree of the string S[1, i ] Construct I 1 /* Construct I i+1 from I i */ for i = 1 to m-1 do /* phase i+1 */  for j = 1 to i+1 do /* extension j */ Find the end of the path P from the root whose label is S[ j, i ] in I i and extend P with S[ i +1] by suffix extension rules; Convert I m into a suffix tree S

25 Example S = xabxacd$ i+1 = 1 xx i+1 = 2  extend x to xa aa i+1 = 3  extend xa to xab  extend a to ab bb …

26 Extension Rules Goal: extend each S[ j,i ] into S[ j,i+1 ] Rule 1: S[ j,i ] ends at a leaf  Add character S( i+1 ) to the end of the label on that leaf edge Rule 2: S[ j,i ] doesn’t end at a leaf, and the following character is not S( i+1 )  Split a new leaf edge for character S( i+1 )  May need to create an internal node if S[ j,i ] ends in the middle of an edge Rule 3: S[ j,i+1 ] is already in the tree  No update

27 Example: Extension Rules Implicit tree for axabxb from tree for axabx 1 b b b x x x x a a a 2 3x b x 4 b b b b Rule 1: at a leaf node b Rule 3: already in treeRule 2: add a leaf edge (and an interior node) b 5

28 UA for axabxc (1) S[1,3]=axa ES(j,i)S(i+1) 1axa 2 xa 3 a

29 UA for axabxc (2)

30 UA for axabxc (3)

31 UA for axabxc (4)

32 Observations Once S[ j,i ] is located in the tree, extension rules take only constant time Naively we could find the end of any suffix S[ j,i ] in O(S[j,i]) time by walking from the root of the current tree. By that approach, I m could be created in O(m 3 ) time. Making Ukkonen’s algorithm O(m)  Suffix links  Skip and count trick  Edge-label compression  A stopper  Once a leaf, always a leaf

33 Suffix Links Consider the two strings a and xa Suppose some internal node v of the tree is labeled with xa and another node s(v) in the tree is labeled with a The edge (v,s(v)) is called a suffix link Do all internal nodes (the root is not considered an internal node) have suffix links?

34 Example: suffix links S = ACACACAC$ AC AC$ AC $ $ $ $ $ $ $ C AC$

35 Suffix Link Lemma If a new internal node v with path-label xa is added to the current tree in extension j of some phase i+1, then  the path labeled a already ends at an internal node of the tree or  the internal node labeled a will be created in the extension of j+1 in the same phase i+1  string a is empty and s(v) is the root

36 Proof of Suffix Link Lemma A new internal node is created only by the extension rule 2 This means that there are two distinct suffixes of S[ 1,i+1 ] that start with xa  xa S( i+1 ) and xacb where c is not S( i+1 ) This means that there are two distinct suffixes of S[ 1,i+1 ] that start with a  a S( i+1 ) and acb where c is not S( i+1 ) Thus, if a is not empty, a will label an internal node once extension j+1 is processed which is the extension of a

37 Corollary of Suffix Link Lemma Every internal node of an implicit suffix tree has a suffix link from it.

38 How to use suffix links - 1 S[1,i] must end at a leaf since it is the longest string in the implicit tree I i  Keep a pointer to this leaf in all cases and extend according to rule 1 Locating S[j+1,i] from S[j,i] which is at node w  If w is an internal node, set v to w  Otherwise, set v = parent(w)  If v is the root, you must traverse from the root to find S[j+1,i]  If not, go to s(v) and begin search for the remaining portion of S[j,i] from there

39 How to use suffix links - 2

40 Skip and Count Trick – (1) Problem: Moving down from s(v), directly implemented, takes time proportional to the number of characters compared Solution: To make running time proportional to the number of nodes in the path searched, instead of the number of characters

41 Skip and Count Trick – (2) After 4 nodes down-skips, the end of S[j, i] is found.

42 Skip and Count Trick – (3) Node-depth of v, denoted (ND(v)), is the number of nodes on the path from the root to the node v Lemma: For any suffix link (v, s(v)) traversed in Ukkonen’s algorithm, at that moment, ND(v)  ND(s(v))+1

43 Skip and Count Trick – (4) At the moment of traversing (v,s(v)): ND(v)  ND(s(v))+1

44 Skip and Count Trick – (5) The current node-depth of the algorithm is the node depth of the node most recently visited by the algorithm Lemma: Using the skip and count trick, any phase of Ukkonen’s algorithm takes O(m) time.  Up-walk: decreases the current node-depth by  1  Suffix link traversal: same as up-walk  Totally, the current node-depth is decreased by  2m.  No node has depth >m.  The total possible increment to the current node-depth is  3m.

45 Edge Label Representation Potential Problem  Size of edge labels may require  (m 2 ) space  Thus, the time for the algorithm is at least as large as the size of its output  Example S = abcdefghijklmnopqrstuvwxyz  Total length is  j<m+1 j = m(m+1)/2  Similar problem can happen when the length of the string is arbitrarily larger than the alphabet size Solution  Label edges with pair of indices indicating beginning and end of positions of the substring in S

46 Modified Extension Rules  Rule 2: new leaf edge (phase i+1 ) create edge ( i+1, i+1 ) split edge (p, q) => (p, w) and (w + 1, q)  Rule 1: leaf edge extension label had to be ( p,i ) before extension given rule 2 above and an induction argument:  (p, q) => (p, q + 1)  Rule 3 Do nothing

47 Full edge label representation String S = xabxa$ 1 b b b x x x x a a a a a $ $ $ $ $$

48 Edge-label Compression String S = xabxa$ 1 (1,2) [or (4,5)?] (3,6) (2,2) (6,6) (3,6) (6,6) (3,6)

49 A Stopper In any phase, if suffix extension rule 3 applies in extension j, it will also apply in all extensions k, where k>j, until the end of the phase. The extensions in phase i+1 that are done after the first execution of rule 3 are said to be done implicitly. This is in contrast to any extension j where the end of S[j, i] is explicitly found. An extension of that kind is called and explicit extension. Hence, we can end any phase i+1 when the first extension rule 3 applies.

50 Once a leaf, always a leaf – (1) If at some point in UA a leaf is created and labeled j (for the suffix starting at position j of S), then that leaf will remain a leaf in all successive trees created during the algorithm. In any phase i, there is an initial sequence of consecutive extensions (starting with extension 1) in which only rule 1 or 2 applies, where let j i be the last extension in this sequence. Note that j i  j i+1.

51 Once a leaf, always a leaf – (2) Implicit extensions: for extensions 1 to j i, write [·, e] on the leaf edge, where e is a symbol denoting the current end and is set to i + 1 once at the beginning. In later phases, we will not need to explicitly extend this leaf but rather can implicitly extend it by incrementing e once in its global location. Explicit extensions: from extension j i+1 till first rule 3 extension is found (or until extension i+1 is done)

52 Once a leaf, always a leaf – (3): Single phase algorithm j* is the last explicit extension computed in phase i+1 Phase i+1  Increment e to i+1 (implicitly extending all existing leaves)  Explicitly compute successive extensions starting at j i+1 and continuing until reaching the first extension j* where rule 3 applies or no more extensions needed  Set j i+1 to j* -1, to prepare to the next phase Observation  Phase i and i+1 share at most 1 explicit extension

53 Example: S=axaxbb$ - (1)  e = 1, a  J 1 = (1,e) I1I1  e = 2, ax  S[1,2] : skip  S[2,2] : rule 2, create(2, e)  j 2 = (1,e) I2I2 2 (2,e)  e = 3, axa  S[1,3].. S[2,3] : skip  S[3,3] : rule 3  j 3 = (1,e) I3I3 2 (2,e)

54 Example: S=axaxbb$ - (2)  e = 4, axax  S[1,4].. S[2,4] : skip  S[3,4] : rule 3  S[4,4] : auto skip  j 4 = (1,e) I4I4 2 (2,e)  e = 5, axaxb  S[1,5].. S[2,5] : skip  S[3,5] : rule 2, split (1,e) → (1, 2) and (3,e), create (5,e)  S[4,5] : rule 2, split (2,e) → (2,2) and (3,e), create (5,e)  S[5,5] : rule 2, create (5,e)  j 5 = 5 1 (1,2) I5I5 2 (2,2) 13 (3,e) (5,e) 4 (3,e) (5,e) 5

55 Example: S=axaxbb$ - (3)  e = 6, axaxbb  S[1,6].. S[5,6] : skip  S[6,6] : rule 3  j 6 = 5  e = 7, axaxbb$  S[1,7].. S[5,7] : skip  S[6,7] : rule 2, split (5,e) → (5,5) and (6,e), create (6,e)  S[7,7] : rule 2, create (7,e)  j 7 = 7 1 (1,2) I6I6 2 (2,2) 13 (3,e) (5,e) 4 (3,e) (5,e) 5 1 (1,2) I7I7 2 (2,2) 13 (3,e) (5,e) 4 (3,e) (5,e) (5,5) 5 6 (6,e) (7,e) 7

56 Complexity of UA Since all the implicit extensions in any phase is constant, their total cost is O(m). Totally, only 2m explicit extensions are executed. The max number of down-walking skips is O(m). Time-complexity of Ukkonen’s algorithm: O(m)

57 Finishing up Convert final implicit suffix tree to a true suffix tree Add $ using just another phase of execution  Now all suffixes will be leaves Replace e in every leaf edge with m  Just requires a traversal of tree which is O(m) time

58 Implementation Issues – (1) When the size of the alphabet grows:  For large trees suffix links allow an algorithm to move quickly from one part of the tree to a distant part of the tree. This is great for worst-case time bounds, but its horrible if the tree isn't entirely in memory.  Thus, implementing ST to reduce practical space use can be a serious concern. The main design issues are how to represent and search the branches out of the nodes of the tree. A practical design must balance between constraints of space and need for speed

59 Implementation Issues – (2) There are four basic choices to represent branches  An array of size  (|  |) at each non-leaf node v  A linked list at node v of characters that appear at the beginning of the edge-labels out of v. If its kept in sorted order it reduces the average time to search for a given character In the worst case it, adds time |  | to every node operation. If the number of children of v is large, then little space is saved over the array while noticeably degrading performance  A balanced tree implements the list at node v Additions and searches take O(logk) time and O(k) space, where k is the number of children of v. This alternative makes sense only when k is fairly large.  A hashing scheme. The challenge is to find a scheme balancing space with speed. For large trees and alphabets hashing is very attractive at least for some of the nodes

60 Implementation Issues – (3) When m and  are large enough, the best design is probably a mixture of the above choices.  Nodes near the root of the tree tend to have the most children, so arrays are sensible choice at those nodes.  For nodes in the middle of a suffix tree, hashing or balanced trees may be the best choice. Sometimes the alphabet size is explicitly presented in the time and space bounds  Construction time is O( m log|  |),using  ( m |  |) space. m – is the size of the string.

61 Applications of Suffix Trees

62 Applications of Suffix Trees Exact string matching Substring problem for a database Longest common substring Suffix arrays

63 Exact String Matching – (1) Exact matching problem: given a pattern P of length n and a text T of length m, find all occurrences of P in T in O(n+m) time. Overview of the ST approach:  Built a ST for text T in O(m) time  Match the characters of P along the unique path in ST until either P is exhausted or no more matches are possible  In the latter case, P doesn’t appear anywhere in T  In the former case, every leaf in the subtree below the point of the last match is numbered with a starting location of P in T, and every starting location of P in T numbers such a leaf  ST approach spends O(m) preprocessing time and then O(n+k) search time, where k is the number of occurrences of P in T

64 Exact String Matching – (2) When search terminates at an element node, P appears exactly once in the source string T. When the search for P terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of P. 6 x a b x a c c a b x a c c c b x a c Notes: P = xa; T = xabxac Pattern P matches a path down to the point shown by an arrow, and as required, the leaves below that point are numbered 1 and 4.

65 Substring Problem for a Database Input: a database D (a set of strings) and a string S Output: find all the strings in D containing S as a substring Usage: identity of the person Exact string matching methods: cannot work Suffix tree:  D is stored in O(m) space, where m is the total length of all the strings in D.  Suffix tree is built in O(m) time  Each lookup of S (the length of S is n) costs O(n) time

66 Longest common substring –(1) Input: two strings S 1 and S 2 Output: find the longest substring S common to S 1 and S 2 Example:  S 1 =common-substring  S 2 =common-subsequence  Then, S=common-subs

67 Longest common substring – (2) Build a suffix tree for S 1 and S 2 Each leaf represents either a suffix from one of S 1 and S 2, or a suffix from both S 1 and S 2 Mark each internal node v with a 1 (2) if there is a leaf in the subtree of v representing a suffix from S 1 (S 2 ) The path-label of any internal node marked both 1 and 2 is a substring common to both S 1 and S 2, and the longest such string is the longest common substring.

68 Longest common substring – (3) S 1 $ = xabxac$, S 2 $ = abx$, S = abx

69 Suffix Arrays – (1) A suffix array of an m-character string S, is an array of integers in the range 1 to m, specifying the lexicographic order of the m suffixes of S. Example: S = xabxac

70 Suffix Arrays – (2) A suffix array of S can be obtained from the suffix tree T of S by performing a lexical depth-first search of T

71 Suffix Arrays: Exact string matching Given two strings T and P, where |T | = m and |P| = n, find all the occurrences of P in T ? Using the binary search on the suffix array of T, all the occurrences of P in T can be found in O(n logm) time. Example: let T = xabxac and P = ac

72 References Dan Gusfield: Algorithms on Strings, Trees, and Sequences. University of California, Davis.Cambridge University Press,1997 Ela Hunt et al.: A database index to large biological sequences. Slides at VLDB, 2001 R.C.T. Lee and Chin Lung Lu.: Suffix Trees. Slides at CS 5313 Algorithms for Molecular Biology