Building Suffix Trees in O(m) time

Presentation transcript:

Building Suffix Trees in O(m) time
Weiner had the first linear-time algorithm in 1973
McCreight developed a more space-efficient algorithm in 1976
Ukkonen developed a simpler-to-understand variant in 1995
  – This is what we will focus on

Implicit Suffix Trees
An implicit suffix tree for string S is a tree obtained from the suffix tree for S$ by
  – removing $ from all edge labels
  – removing any edges that now have no label
  – removing any node that does not still have at least two children
Some suffixes may no longer be leaves
An implicit suffix tree for prefix S[1..i] of S is similarly defined based on the suffix tree for S[1..i]$
I_i will denote the implicit suffix tree for S[1..i]

Example
Implicit tree for xabxa from tree for xabxa$
Suffixes of xabxa$: {xabxa$, abxa$, bxa$, xa$, a$, $}
[Figure sequence, one tree per step]
  – Remove $ from all edge labels, leaving {xabxa, abxa, bxa, xa, a}
  – Remove unlabeled edges
  – Remove internal nodes with only one child
  – Final implicit tree
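
The outcome of the example can be checked by brute force. The sketch below is only an illustration (the function name `implicit_leaf_suffixes` is made up here). It relies on the observation that a suffix of S keeps its leaf in the implicit tree exactly when it is not a prefix of some other suffix of S.

```python
def implicit_leaf_suffixes(s):
    """Suffixes of s that still end at a leaf in the implicit suffix tree."""
    suffixes = [s[j:] for j in range(len(s))]
    return [
        suf
        for suf in suffixes
        # A suffix that is a prefix of another suffix ends inside the tree,
        # so its "$"-only leaf edge disappears once "$" is stripped.
        if not any(other != suf and other.startswith(suf) for other in suffixes)
    ]


print(implicit_leaf_suffixes("xabxa"))  # ['xabxa', 'abxa', 'bxa']
```

For xabxa the suffixes xa and a lose their leaves, which matches the three leaves left in the final implicit tree.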

Basic Structure of Algorithm
Initialization
  – I_1 has one edge labeled S(1)
For i = 1 to m-1 (build I_{i+1})
  – For j = 1 to i+1
      – Find location of string S[j..i] in tree
      – “Extend” to incorporate character S(i+1)
Expand final implicit tree to make full suffix tree

Order of operations visualization
S = xabxacdefghixabcab$
i+1 = 1
  – x
i+1 = 2
  – extend x to xa
  – a
i+1 = 3
  – extend xa to xab
  – extend a to ab
  – b
…

Extension Rules
Case 1: S[j..i] ends at a leaf
  – Add character S(i+1) to end of label on leaf edge
Case 2: Not a leaf, but no path from end of S[j..i] location continues with S(i+1)
  – Split a new leaf edge for character S(i+1)
  – May need to create an internal node if S[j..i] ends in the middle of an edge
Case 3: S[j..i+1] is already in the tree
  – No update
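
The phase/extension loop and the three rules can be sketched naively as follows. This is a deliberately slow illustration, not Ukkonen's algorithm itself: it assumes an uncompressed trie (nested dicts, one character per edge) instead of a suffix tree, so "ends in the middle of an edge" never arises, and the name `naive_implicit_trie` is invented for this example. Indices are 0-based.

```python
def naive_implicit_trie(s):
    """Build a trie of all suffixes of s phase by phase, logging the rule used."""
    root = {s[0]: {}}                 # I_1: a single edge labeled S(1)
    log = []                          # log[k] lists the rules used in one phase
    for i in range(1, len(s)):        # phase that adds character s[i]
        c = s[i]
        rules = []
        for j in range(i + 1):        # extension j handles the suffix s[j:i]
            node = root
            for ch in s[j:i]:         # locate the end of s[j:i] from the root
                node = node[ch]
            if c in node:             # rule 3: s[j:i+1] is already present
                rules.append(3)
            elif not node:            # rule 1: the path ends at a leaf
                rules.append(1)
                node[c] = {}
            else:                     # rule 2: start a new leaf below this node
                rules.append(2)
                node[c] = {}
        log.append(rules)
    return root, log


_, log = naive_implicit_trie("axabxb")
print(log[-1])  # [1, 1, 1, 1, 2, 3]
```

The last phase of axabxb uses rule 1 four times, then rule 2 for the suffix x, then rule 3 for the empty suffix.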

Visualization
Implicit tree for axabxb from tree for axabx
[Figure: the tree for axabx extended with character b, annotated with the three rules]
  – Rule 1: at a leaf node
  – Rule 2: add a leaf edge (and an interior node)
  – Rule 3: already in tree

Observations
Once S[j..i] is located in the tree, extending to accommodate S(i+1) is constant time
Making Ukkonen’s algorithm O(m^2)
  – Finding the S[j..i] locations in the suffix trees quickly when explicit computation is needed

Edge Label Representation
Potential Problem
  – Size of edge labels may be $\Theta(m^2)$
  – Example S = abcdefghijklmnopqrstuvwxyz
  – Total length is $\sum_{j=1}^{m} j = m(m+1)/2$
  – Similar problem can happen when the length of the string is arbitrarily larger than the alphabet size
Solution
  – Label edges with pair of indices indicating beginning and end of positions of the substring in S

Full edge label illustration
String S = xabxa$
[Figure: suffix tree for xabxa$ with each edge labeled by its full substring]

Compact edge label illustration
String S = xabxa$
[Figure: suffix tree for xabxa$ with each edge labeled by an index pair: (1,2) [or (4,5)?], (3,6), (2,2), (6,6), (3,6), (6,6), (3,6)]
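
A small sketch of the index-pair representation, using the slide's 1-based inclusive positions. The helper names `edge_label` and `total_explicit_label_length` are assumptions made for this example.

```python
def edge_label(s, start, end):
    """Substring of s named by the 1-based, inclusive index pair (start, end)."""
    return s[start - 1:end]


def total_explicit_label_length(s):
    """Characters needed if every suffix of s were written out on its own edge."""
    return sum(len(s[j:]) for j in range(len(s)))


S = "xabxa$"
for pair in [(1, 2), (4, 5), (3, 6), (2, 2), (6, 6)]:
    print(pair, repr(edge_label(S, *pair)))

# With 26 distinct characters every suffix is its own edge: 26 * 27 / 2 characters,
# while index pairs need only two integers per edge.
print(total_explicit_label_length("abcdefghijklmnopqrstuvwxyz"))  # 351
```

Both (1,2) and (4,5) spell xa, so either pair is a valid label for that edge.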

Modified Extension Rules
Rule 2: new leaf edge
  – label new leaf edge (i+1, i+1)
Rule 1: leaf edge extension
  – label had to be (p,i) before extension given rule 2 above and an induction argument
  – now will be (p, i+1)
Rule 3: still nothing needs to be done

Suffix Links
Consider the two strings α and xα, where x is a single character
Suppose some internal node v of the tree is labeled with xα and another node s(v) in the tree is labeled with α
Then the edge (v, s(v)) is a suffix link
Do all internal nodes (the root is not considered an internal node) have suffix links?

Suffix Link Lemma
If a new internal node v with path-label xα is added to the current tree in extension j of some phase i+1, then
  – the path labeled α already ends at an internal node of the tree, or
  – the internal node labeled α will be created in extension j+1 of the same phase, or
  – string α is empty and s(v) is the root

Proof of Suffix Link Lemma
A new internal node is created only by extension rule 2
This means there are two distinct suffixes of S[1..i+1] that start with xα
  – xαS(i+1) and xαcβ where c is not S(i+1)
This means there are two distinct suffixes of S[1..i+1] that start with α
  – αS(i+1) and αcβ where c is not S(i+1)
Thus, if α is not empty, α will label an internal node once extension j+1 is processed, which is the extension of α
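
The consequence of the lemma for a finished tree can be checked by brute force. The sketch below assumes that the internal nodes of the suffix tree are exactly the non-empty substrings followed by at least two different characters; the function name `suffix_links` is invented for this illustration, and the empty string stands for the root.

```python
from collections import defaultdict


def suffix_links(text):
    """Map each internal node's path-label xα to α ("" denotes the root)."""
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n):
            followers[text[i:j]].add(text[j])   # text[i:j] is followed by text[j]
    # A substring is an internal node when two different characters follow it.
    internal = {w for w, nxt in followers.items() if len(nxt) >= 2}
    links = {}
    for w in internal:
        target = w[1:]                          # drop the first character x
        if target and target not in internal:
            raise ValueError(f"no internal node for {target!r}")
        links[w] = target
    return links


print(suffix_links("xabxa$"))
```

For xabxa$ the internal nodes are xa and a; xa links to a, and a links to the root.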

Using suffix links to speed up location of S[j..i]
S[1..i] must end at a leaf since it is the longest string in implicit tree I_i
  – Keep a pointer to this leaf in all cases and extend according to rule 1
Locating S[j+1..i] from S[j..i], which is at node w
  – If w is an internal node, set v to w
  – Otherwise, set v = parent(w)
  – If v is the root, you must traverse from root to find S[j+1..i]
  – If not, go to s(v) and begin search for remaining portion of S[j..i] from there
      – Remaining portion is the label of the edge we traversed up (see figure 6.5 on page 100)

Skip/count Trick
Problem: Moving down from s(v) naively takes time proportional to the number of characters compared
Solution
  – At each node, only compare the first character in an edge label to the next character to be checked
  – Then, use the number of characters on that edge to update search in constant time
  – Running time is now proportional to the number of nodes in the path searched rather than the number of characters
See Figure 6.6 on page 102
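
A minimal sketch of the skip/count walk. It assumes a tree stored as a dict of dicts whose edges are `(start, end, child)` triples with 0-based inclusive indices; the tree for xabxa$ is built by hand, and the names `skip_count` and `TREE` are made up for this example. The string being located must already be in the tree, which is what makes skipping safe.

```python
def skip_count(children, s, node, lo, hi):
    """Walk down from `node` along s[lo:hi], assumed to be present below it.

    Returns (node, first_char, offset, hops): the last node reached, the edge
    (named by its first character) on which the string ends, how many
    characters of that edge are used, and how many whole edges were skipped.
    """
    hops = 0
    while lo < hi:
        start, end, child = children[node][s[lo]]  # one comparison per edge
        span = end - start + 1                     # edge length from its indices
        if span > hi - lo:                         # the string ends inside this edge
            return node, s[lo], hi - lo, hops
        lo += span                                 # skip the whole edge at once
        node = child
        hops += 1
    return node, None, 0, hops


S = "xabxa$"
TREE = {
    "root": {"x": (0, 1, "xa"), "a": (1, 1, "a"), "b": (2, 5, "leaf3"), "$": (5, 5, "leaf6")},
    "xa": {"b": (2, 5, "leaf1"), "$": (5, 5, "leaf4")},
    "a": {"b": (2, 5, "leaf2"), "$": (5, 5, "leaf5")},
}
print(skip_count(TREE, S, "root", 0, 3))  # locating "xab": ('xa', 'b', 1, 1)
```

Locating xab costs one hop over the edge xa plus one first-character lookup, regardless of how long the edge labels are.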

O(m^2) argument
node-depth of v: number of nodes on path from root to node v
Lemma: For any suffix link (v, s(v)) traversed in Ukkonen’s algorithm, at that moment, nd(v) <= nd(s(v))+1
  – If xβ is an ancestral internal node of v where β is not empty, then it has a suffix link to a node with path-label β
See Figure 6.7 on page 103

O(m^2) argument
Lemma: Any phase takes O(m) time with skip/count trick
Proof
  – Decrements to node depth at most 2m
      – i+1 <= m extensions per phase
      – Walking up decreases node depth by at most 1
      – Suffix link traversal decreases node depth by at most 1
  – At most 3m downward edge traversals
      – Max node depth is m
      – Node depth is never negative
      – Each downward traversal increases depth by at least 1

Observation
Making Ukkonen’s algorithm O(m)
  – Implicit computations of many extensions
  – Need to take argument for a single phase and extend to multiple phases

Rule 3
Suppose suffix extension rule 3 applies to S[j..i+1]
  – This means S[j..i+1] already appears in the implicit suffix tree as a prefix of a larger suffix (and is thus a substring of S[1..i])
Then it applies to S[k..i+1] for all k > j
  – Clearly, S[k..i+1] must also be a substring of S[1..i] and thus must be in the tree
Thus, stop a phase once the first application of rule 3 occurs
  – All future rule 3 extensions are done implicitly, not explicitly
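
The claim that rule 3 ends a phase can be checked by brute force. The sketch below tests rule 3 directly as "S[j..i+1] is a substring of S[1..i]" with 0-based Python slices; the function name `rule3_ends_every_phase` is made up for this illustration.

```python
def rule3_ends_every_phase(s):
    """True if, in every phase, rule 3 applies to all extensions after its first use."""
    for i in range(1, len(s)):                   # phase that adds character s[i]
        # flags[j] is True when s[j:i+1] already occurs in the previous prefix s[:i]
        flags = [s[j:i + 1] in s[:i] for j in range(i + 1)]
        first = flags.index(True) if True in flags else len(flags)
        if not all(flags[first:]):               # a later extension escaped rule 3
            return False
    return True


print(rule3_ends_every_phase("xabxacdefghixabcab$"))  # True
```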

Implicit expansion of leaf nodes
Once a leaf, always a leaf
  – It will always be extended using rule 1
Implicit expansion of leaf nodes
  – When a leaf edge is created in phase i+1, instead of labeling it with (p, i+1), label it with (p,e)
  – e is a global index that is set to i+1 once in each phase
  – In later phases, we will not need to explicitly extend this leaf but rather can implicitly extend it by incrementing e once in its global location
  – How can we easily identify leaf nodes to avoid explicitly expanding them in later phases?
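
One way to realize the global index e is to let every leaf edge hold a reference to the same mutable object. The classes `GlobalEnd` and `LeafEdge` below are hypothetical names for this sketch, and positions are 1-based as on the slides. The string xab is chosen because all its characters differ, so each phase adds exactly one new leaf.

```python
class GlobalEnd:
    """The shared index e; every leaf edge points at this one object."""

    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, start, end):
        self.start = start        # p, fixed when the leaf is created
        self.end = end            # shared GlobalEnd, never copied

    def label(self, s):
        return s[self.start - 1:self.end.value]


S = "xab"
e = GlobalEnd()
leaves = []
for phase in range(1, len(S) + 1):
    e.value = phase               # one assignment extends every existing leaf
    leaves.append(LeafEdge(phase, e))

print([leaf.label(S) for leaf in leaves])  # ['xab', 'ab', 'b']
```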

Avoiding leaf nodes
For phase i, let last(i) denote the last extension of phase i that is not by rule 3
Observation
  – All the suffixes S[j..i] for 1 <= j <= last(i) end at a leaf node
  – All the extensions for 1 <= j <= last(i) in phase i+1 and higher are rule 1 extensions by previous slide
  – Therefore, last(i+1) >= last(i)
Trick
  – In phase i+1, only explicitly compute extensions starting at last(i)+1 and continuing until the first rule 3 extension is found

Single phase algorithm
Phase i+1
  – Increment e to i+1 (implicitly extending all existing leaves)
  – Explicitly compute successive extensions starting at last(i)+1 and continuing until a rule 3 extension or no more extensions needed
      – The location of the first explicit extension is known from where the previous phase stopped
  – Set last(i+1) appropriately (last rule 1 or 2 extension)
Observation
  – Phase i and i+1 share at most 1 explicit extension
  – In all phases, there will be at most 2m explicit extensions
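
The 2m bound can be observed with a brute-force simulation of which extensions are explicit. This is only a counting sketch, not the tree construction: it decides rule 3 by a substring test, uses 0-based phases, and counts the creation of the very first edge as a new-leaf extension. The name `explicit_extensions` is invented here.

```python
def explicit_extensions(s):
    """Per phase, the (suffix start, rule) pairs that must be computed explicitly."""
    phases = []
    leaves = 0                           # suffixes 0..leaves-1 already end at leaves
    for i in range(len(s)):
        done = []
        for j in range(leaves, i + 1):   # start right after the last known leaf
            if s[j:i + 1] in s[:i]:      # rule 3: already in the tree, stop the phase
                done.append((j, 3))
                break
            done.append((j, 2))          # rule 2: a new leaf for the suffix starting at j
            leaves = j + 1
        phases.append(done)
    return phases


phases = explicit_extensions("xabxac")
print([len(p) for p in phases])          # [1, 1, 1, 1, 1, 3]
```

Each suffix becomes a leaf at most once and each phase stops at most once on rule 3, which is where the bound of 2m comes from.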

Visualization of explicit extensions
S = xabxacdefghixabcab$
Phase 1: extend x (rule 1)
Phase 2: extend a (rule 2)
Phase 3: extend b (rule 2)
Phase 4: extend x (rule 3)
Phase 5: extend x (rule 3)
Phase 6: extend x (rule 2), extend a (rule 2), extend c (rule 2)
Phase 7: extend d (rule 2)
…

O(m) argument
Lemma: All phases take O(m) time
Proof
  – Similar to previous node-depth argument
  – Key new observation
      – From one phase to the next, we either start at root or we work with same node we ended with last time, so the node depth is identical
      – With 2m total extensions at most, we get O(m) upward and downward traversals of links

Finishing up
Convert final implicit suffix tree to a true suffix tree
Add $ using just another phase of execution
  – Now all suffixes will be leaves
Replace e in every leaf edge with m
  – Just requires a traversal of tree which is O(m) time
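
Putting the pieces together, a compact sketch of the whole construction might look like the following. It is written in the "active point" style common in implementations rather than in the last(i) bookkeeping of the slides, so treat it as one possible rendering under that assumption. All names (`Node`, `build_suffix_tree`, `contains`, `count_leaves`) are invented for the example, indices are 0-based and inclusive, and the input is expected to end with a unique terminator such as $.

```python
class Node:
    def __init__(self, start, end, link=None):
        self.start = start       # first index of the incoming edge label
        self.end = end           # one-element list; leaves share the global one
        self.children = {}       # first character of an edge -> child node
        self.link = link         # suffix link


def build_suffix_tree(s):
    root = Node(-1, [-1])
    root.link = root
    leaf_end = [-1]                       # the global index e
    node, edge, length = root, 0, 0       # active point
    remaining = 0                         # suffixes still to be added explicitly
    for i, c in enumerate(s):
        leaf_end[0] = i                   # implicit rule 1 for every leaf
        remaining += 1
        last_new = None                   # internal node awaiting its suffix link
        while remaining > 0:
            if length == 0:
                edge = i
            first = s[edge]
            child = node.children.get(first)
            if child is None:             # rule 2: new leaf directly below a node
                node.children[first] = Node(i, leaf_end, root)
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                span = child.end[0] - child.start + 1
                if length >= span:        # skip/count: hop over the whole edge
                    edge += span
                    length -= span
                    node = child
                    continue
                if s[child.start + length] == c:   # rule 3: stop the phase
                    if last_new is not None and node is not root:
                        last_new.link = node
                    length += 1
                    break
                # rule 2 in the middle of an edge: split it, then add the leaf
                split = Node(child.start, [child.start + length - 1], root)
                node.children[first] = split
                split.children[c] = Node(i, leaf_end, root)
                child.start += length
                split.children[s[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remaining -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remaining + 1
            elif node is not root:
                node = node.link          # follow the suffix link
    return root


def contains(root, s, p):
    """True if p labels a path from the root, i.e. p is a substring of s."""
    node, k = root, 0
    while k < len(p):
        child = node.children.get(p[k])
        if child is None:
            return False
        label = s[child.start:child.end[0] + 1]
        chunk = p[k:k + len(label)]
        if not label.startswith(chunk):
            return False
        k += len(chunk)
        node = child
    return True


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children.values())


S = "xabxacdefghixabcab$"
ROOT = build_suffix_tree(S)
print(count_leaves(ROOT) == len(S), contains(ROOT, S, "ixabc"), contains(ROOT, S, "xabd"))
```

Because the terminator is part of the input, the last loop iteration plays the role of the extra phase that adds $, and sharing `leaf_end` means no final pass is needed to fix the leaf labels.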