296.3: Algorithms in the Real World

Slides:



Advertisements
Similar presentations
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Advertisements

Suffix Trees Construction and Applications João Carreira 2008.
Suffix Tree. Suffix Tree Representation S=xabxac Represent every edge using its start and end text location.
Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
Two implementation issues Alphabet size Generalizing to multiple strings.
Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.
Suffix Trees and Suffix Arrays
1 Suffix Trees Charles Yan Suffix Trees: Motivations Substring problem: One is given a text T of length m. After O (m) preprocessing time, one.
SUFFIX TREES From exact to approximate string matching. 17 dicembre 2003 Luca Bortolussi.
15-853Page : Algorithms in the Real World Suffix Trees.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
3 -1 Chapter 3 String Matching String Matching Problem Given a text string T of length n and a pattern string P of length m, the exact string matching.
Tries Standard Tries Compressed Tries Suffix Tries.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Motivation  DNA sequencing processes large chains into subsequences of ~500 characters long  Assembling all pieces, produces a single sequence but… –At.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i ate is.
1 Applications of Suffix Trees Charles Yan Exact String Matching |P|=n, |T|=m P and T are both known at the same time Boyer-Moore, or Suffix.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer:
Suffix trees and suffix arrays presentation by Haim Kaplan.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
On-line Construction of Suffix Tree Esko Ukkonen Algorithmica Vol. 14, No. 3, pp , 1995.
Algorithms and Data Structures. /course/eleg67701-f/Topic-1b2 Outline  Data Structures  Space Complexity  Case Study: string matching Array implementation.
Aho-Corasick Algorithm Generalizes KMP to handle sets of strings New ideas –keyword trees –failure functions/links –output links.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.
C++ Programming: Program Design Including Data Structures, Third Edition Chapter 20: Binary Trees.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
Binary Trees Chapter 6.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
Chapter Tow Search Trees BY HUSSEIN SALIM QASIM WESAM HRBI FADHEEL CS 6310 ADVANCE DATA STRUCTURE AND ALGORITHM DR. ELISE DE DONCKER 1.
Data Structures - CSCI 102 Binary Tree In binary trees, each Node can point to two other Nodes and looks something like this: template class BTNode { public:
UNC Chapel Hill M. C. Lin Point Location Reading: Chapter 6 of the Textbook Driving Applications –Knowing Where You Are in GIS Related Applications –Triangulation.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
1. 2 Overview  Suffix tries  On-line construction of suffix tries in quadratic time  Suffix trees  On-line construction of suffix trees in linear.
Computer Algorithms Submitted by: Rishi Jethwa Suvarna Angal.
CSIT 402 Data Structures II
MCS 101: Algorithms Instructor Neelima Gupta
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 1: Exact String Matching.
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
MCS 101: Algorithms Instructor Neelima Gupta
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
1 Multi-Level Indexing and B-Trees. 2 Statement of the Problem When indexes grow too large they have to be stored on secondary storage. However, there.
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
1 Binary Search Trees  Average case and worst case Big O for –insertion –deletion –access  Balance is important. Unbalanced trees give worse than log.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Nov String algorithms, Q Ukkonen’s suffix tree algorithm ● Recall McCreight’s approach: – For i = 1.. n+1, build compressed trie of {x[j..n]$
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
15-853:Algorithms in the Real World
Andrzej Ehrenfeucht, University of Colorado, Boulder
B+ Tree.
Ukkonen's suffix tree construction algorithm
String Processing.
Strings: Tries, Suffix Trees
Suffix trees.
Suffix Trees String … any sequence of characters.
Tries 2/27/2019 5:37 PM Tries Tries.
Chap 3 String Matching 3 -.
Strings: Tries, Suffix Trees
String Processing.
Presentation transcript:

296.3: Algorithms in the Real World Suffix Trees 296.3

Exact String Matching Given a text S of length m and pattern P of length n “Quickly” find an occurrence (or all occurrences) of P in S A Naïve solution: Compare P with S[i…i+n-1] for all i --- O(nm) time How about O(n+m) time? (Knuth Morris Pratt) How about O(m) preprocessing time and O(n) search time? 296.3

Suffix Trees Preprocess the text in O(m) time and search in O(n) time Idea: Construct a tree containing all suffixes of text along the paths from the root to the leaves For search, just follow the appropriate path 296.3

Suffix Trees A suffix tree for the string x a b x a 1 a b x b a x a 2 3 Notice no leaves for suffixes xa or a 296.3

Suffix Trees A suffix tree for the string x a b x a c 4 x a b x a c 1 Search for the string a b x a b x b a x c c a c c 5 2 3 6 296.3

Constructing Suffix trees Naive O(m2) algorithm For every i, add the suffix S[i .. m] to the current tree x a b x a c 1 c a x b 3 a b x c 2 296.3

Constructing Suffix trees Naive O(m2) algorithm For every i, add the suffix S[i .. m] to the current tree x a b x a c 1 b c a 4 x b a x a c c 2 3 296.3

Constructing Suffix trees Naive O(m2) algorithm For every i, add the suffix S[i .. m] to the current tree x a b x a c 1 c 6 a b c 4 x b a x c a c c 5 2 3 296.3

Ukkonen’s linear-time algorithm We will start with an O(m3) algorithm and then give a series of improvements In stage i, we construct a suffix tree Ti for S[1..i] As we will see, buildingTi+1 from Ti naively takes O(i2) time because we insert each of the i+1 suffixes S[j..i+1] Thus a total of O(m3) time 296.3

Going from Ti to Ti+1 In the jth substage of stage i+1, for j = 1 to i+1, we insert S[j..i+1] into Ti. Let S[j..i] = b. Three cases Rule 1: The path b ends on a leaf  add S[i+1] to the label of the last edge Rule 2: The path b continues with characters other than S[i+1]  create a new leaf node and split the path labeled b Rule 3: A path labeled b S[i+1] already exists  do nothing. 296.3

Idea #1 : Suffix Links Note that in each substage, we first search for some string in the tree and then insert a new node/edge/label Can we speed up looking for strings in the tree? Note that in any substage, we look for a suffix of the strings searched in previous substages Idea: Put a pointer from an internal node labeled xa to the node labeled a Such a link is called a “Suffix Link” 296.3

Idea #1 : Suffix Links Add the letter d SP 1 SP 4 SP 5 SP 6 SP 3 SP 2 x a b x a c d SP 1 c a b d SP 4 x b x a c c a c d SP 5 c d SP 6 d SP 3 d SP 2 296.3

Suffix Links – Bounding the time Steps in each substage Go up 1 link to the nearest internal node Follow a suffix link to the suffix node Follow path link for the remaining string First and second steps happen once per substage. Suffix links ensure that in third step, each character in S[1..i+1] is used at most once to traverse a downward tree edge to an internal node. Hence O(m) time over stage. Thus the total time per stage is O(m) 296.3

Maintaining Suffix Links Whenever a node labeled xa is created, in the following substage a node labeled a is created. Why? When a new node is created, add a suffix link from it to the root, and if required, add a suffix link from its predecessor to it. 296.3

Going from O(m2) to O(m) Can we even hope to do better than O(m2)? Size of the tree itself can be O(m2) But notice that there are only 2m edges! – Why? (still O(m) even if we double count edges for all suffixes that are prefixes of other suffixes) Idea: represent labels of edges as intervals Can easily modify the entire process to work on intervals 296.3

Idea #2 : Getting rid of Rule 3 Recall Rule 3: A path labeled S[j .. i+1] already exists ) do nothing. If S[j .. i+1] already exists, then S[j+1 .. i+1] exists too and we will again apply Rule 3 in the next substage Whenever we encounter Rule 3, this stage is over – skip to the next stage. 296.3

Idea #3 : Fast-forwarding Rules 1 & 2 Rule 1 applies whenever a path ends in a leaf Note that a leaf node always stays a leaf node – the only change is to append the new character to its edge using Rule 1 An application of Rule 2 in substage j creates a new leaf node This node is then accessed using Rule 1 in substage j in all the following stages 296.3

Idea #3 : Fast-forwarding Rules 1 & 2 Fast-forward Rule 1 and 2 Whenever Rule 2 creates a node, instead of labeling the last edge with only one character, implicitly label it with the entire remaining suffix Each leaf edge is labeled only once! 296.3

Loop Structure i Rule 2 gets applied once per j rule 2 (follow suffix link) j+1 rule 2 (follow suffix link) j+2 rule 3 Rule 2 gets applied once per j Rule 3 gets applied once per i i+1 j+2 rule 3 i+2 j+2 rule 3 i+3 j+2 rule 2 j+3 rule 3 296.3

Another Way to Think About It S j i insert finger increment when S[j..i] not in tree (rule 2) insert S[j..n] into tree by branching at S[j..i-1] create suffix pointer to new node at S[j..i-1] if there is one use parent suffix pointer to move finger to j+1 search finger increment when S[j..i] in tree (rule 3) Invariants: j is never after i S[j..i-1] is always in the tree 296.3

An example x a b x a c c a x x a b 1 c b c a 4 c 5 3 2 6 Leaf edge labels are updated by using a variable to denote the start of the interval 296.3

Complexity Analysis Rule 3 is used only once in every stage For every j, Rule 1 & 2 are applied only once in the jth substage of all the stages. Each application of a rule takes O(1) steps Other overheads are O(1) per stage Total time is O(m) 296.3

Extending to multiple texts Suppose we want to match a pattern with a dictionary of k texts Concatenate all the texts (separated by special characters) and construct a common suffix tree Time taken = O(km) Unnecessarily complicated tree; needs special characters 296.3

Multiple texts – Better algorithm First construct a suffix tree on the first text, then insert suffixes of the second text and so on Each leaf node should store values corresponding to each text O(km) as before 296.3

Longest Common Substring Find the longest string that is a substring of both S1 and S2 Construct a common suffix tree for both Any node that has leaf nodes labeled by S1 and S2 in the subtree it roots gives a common substring The “deepest” such node is the required substring Can be found in linear time by a tree traversal. 296.3

Common substrings of M strings Given M strings of total length n, find for every k, the length lk of the longest string that is a substring of at least k of the strings Construct a common suffix tree For every internal node, find the number of distinctly labeled leaves in the subtree rooted at the node Report lk by a single tree traversal O(Mn) time – not linear! 296.3

Lempel-Ziv compression Recall that at each stage, we output a pair (pi, li) where S[pi .. pi+li] = S[i .. i+li] Find all pairs (pi,li) in linear time Construct a suffix tree for S Let the position of each internal node be the minimum of the positions of all leaves below it – this is the first place in S where the node’s label occurs. Call this position cv. For every i, search for the string S[i .. m] stopping just before cv¸i. This gives us li and pi. 296.3