Suffix Tree. 6 Mar. 2004. MinKoo Seo.

Presentation transcript:

Suffix Tree, 6 Mar. 2004, MinKoo Seo

Contents
• Basic Text Searching
• Introduction to Suffix Tree
• Suffix Trees and Exact Matching
• Longest Common Substring of Two Strings
• Circular String Linearization
• Suffix Array
• Where do large alphabet problems arise?
• Reference

Basic Text Searching
• Naïve method
• Time complexity: O(mn) in the worst case, for a text T of length m and a pattern P of length n
• Example (^ marks a matching character, * marks a mismatch):

    T: xabxyabxyabxz
    P: abxyabxz
       *
        abxyabxz
        ^^^^^^^*
         abxyabxz
         *
          abxyabxz
          *
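The naïve method above can be sketched in a few lines. This is a minimal illustration, not code from the slides; the function name `naive_search` is an assumption, and positions are reported 1-based to match the slide notation.

```python
def naive_search(text, pattern):
    """Return all 1-based start positions of pattern in text.

    Tries every alignment of the pattern against the text and compares
    character by character, so the worst case is O(m * n) comparisons
    for len(text) == m and len(pattern) == n.
    """
    m, n = len(text), len(pattern)
    hits = []
    for start in range(m - n + 1):
        k = 0
        # Compare left to right until a mismatch or the pattern is exhausted.
        while k < n and text[start + k] == pattern[k]:
            k += 1
        if k == n:
            hits.append(start + 1)  # convert to 1-based position
    return hits


print(naive_search("xabxyabxyabxz", "abxyabxz"))  # [6]
```

For the slide's example the only occurrence starts at position 6 of T, which is the alignment the smarter methods on the following slides jump to directly.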

Basic Text Searching (cont)
• Skip shifts/compares (T: text, P: pattern)
• The first character of P (namely a) does not occur again in P until position 5 of P, so the next three alignments cannot match and P can be moved directly to position 6 of T.

    T: xabxyabxyabxz
    P: abxyabxz
       *
        abxyabxz
        ^^^^^^^*
            abxyabxz
            ^^^^^^^^

Basic Text Searching (cont)
• Even smarter algorithm
• We already know that "abx" at positions 6 to 8 of T matches the first three characters of P. Therefore, we do not compare these again, and comparison resumes at position 4 of P.

    T: xabxyabxyabxz
    P: abxyabxz
       *
        abxyabxz
        ^^^^^^^*
            abxyabxz
               ^^^^^

Introduction to Suffix Tree
• Definition: Suffix Tree
  - A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m.
  - Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of S.
  - No two edges out of a node can have edge labels beginning with the same character.
  - The key feature of the suffix tree is that for any leaf i, the concatenation of the edge labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i..m].

Introduction to Suffix Tree (cont)
• Example: S = xabxac
  [Figure: suffix tree for S = xabxac, with leaves numbered 1 to 6.]

A Motivating Example: Suffix Trees and Exact Matching
• Exact matching algorithm (|T| = m, |P| = n)
  1. Build a suffix tree for text T in O(m) time.
  2. Match the characters of P along the unique path in the suffix tree until either
     - P is exhausted: every leaf in the subtree below the point of the last match is numbered with a starting location of P in T; or
     - no more matches are possible: P does not appear anywhere in T.
• Time complexity
  - O(m) preprocessing time.
  - O(n + k) search time, where k is the number of occurrences of P in T.
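The two steps above can be sketched as follows. As a simplifying assumption, this sketch builds an uncompressed suffix trie by inserting every suffix, which takes O(m^2) time and space rather than the O(m) of a real suffix tree construction; only the matching step mirrors the slide. The names `build_suffix_trie` and `find_occurrences`, the nested-dict representation, and the `"leaf"` key are all hypothetical, and the text is assumed not to contain the terminal symbol `$`.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie.

    Children are keyed by single characters; the node that ends suffix i
    stores its 1-based start position under the key "leaf".
    NOTE: this is a quadratic-size trie, not a linear-size suffix tree.
    """
    text += "$"  # terminal symbol, assumed absent from the input
    root = {}
    for i in range(len(text) - 1):  # skip the suffix that is only '$'
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1
    return root


def find_occurrences(root, pattern):
    """Return the sorted 1-based start positions of pattern in the text."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # no more matches possible: pattern does not occur
        node = node[ch]
    # Pattern exhausted: every leaf below this point is an occurrence.
    hits, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)


trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))  # [1, 4]
```

Sorting the result is a convenience for display only; collecting the leaves without sorting keeps the search step proportional to the size of the subtree below the match point.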

A Motivating Example: Suffix Trees and Exact Matching
• Example: S = xabxac, P = xa
  [Figure: suffix tree for S = xabxac, with leaves numbered 1 to 6.]

Example: Longest Common Substring of Two Strings
• Longest common substring problem
  - Find the longest substring common to two given strings S1 and S2.
  - For example, if S1 = superiorcalifornialives and S2 = sealiver, then the longest common substring of S1 and S2 is alive.
• Path label
  - The path label of a node is the label of the path from the root of T to that node, that is, the concatenation of the edge labels along that path.

Example: Longest Common Substring of Two Strings (cont)
• String-depth
  - For any node v in a suffix tree, the string-depth of v is the number of characters in v's path label.
• Example
  [Figure: suffix tree for S = xabxac; the highlighted node has path label xa and string-depth 2.]

Example: Longest Common Substring of Two Strings (cont)
• Longest common substring problem using a generalized suffix tree
  - Each leaf of the tree represents either a suffix from one of the two strings or a suffix that occurs in both strings.
  - Mark each internal node v with a 1 (or 2) if there is a leaf in the subtree of v representing a suffix from S1 (or S2).
  - The path label of any internal node marked both 1 and 2 is a substring common to both S1 and S2.
• The algorithm has to find the node with the greatest string-depth that is marked both 1 and 2.
• Construction can be done in linear time. Node marking and calculation of string-depth can be done by standard linear-time traversal methods.
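The marking idea above can be illustrated with a small sketch. It assumes an uncompressed generalized suffix trie instead of a linear-time generalized suffix tree, so it is quadratic in the input size and only suitable for short strings. In a trie, marking every node on the path of an inserted suffix is equivalent to marking nodes that have a leaf from that string below them. The function name and the dict-based node layout are hypothetical.

```python
def longest_common_substring(s1, s2):
    """Return one longest substring common to s1 and s2.

    Builds a generalized suffix trie, marks each node with the string
    numbers (1, 2) whose suffixes pass through it, then finds the deepest
    node marked with both.
    """
    def new_node():
        return {"children": {}, "marks": set()}

    root = new_node()
    for mark, s in ((1, s1), (2, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["children"].setdefault(ch, new_node())
                node["marks"].add(mark)

    best = ""
    stack = [(root, "")]  # (node, path label of node)
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            # Only nodes marked both 1 and 2 spell common substrings;
            # nothing below an unshared node can be shared.
            if child["marks"] == {1, 2}:
                path = label + ch
                if len(path) > len(best):
                    best = path
                stack.append((child, path))
    return best


print(longest_common_substring("superiorcalifornialives", "sealiver"))  # alive
```

When several common substrings share the maximum length, this sketch returns whichever it encounters first.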

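Example: Longest Common Substring of Two Strings (cont)
• Example: S1 = xabxac, S2 = axayz
  [Figure: generalized suffix tree for S1 and S2. Each leaf is labeled with a pair (string number, start position), from 1,1 to 1,6 and from 2,1 to 2,5; internal nodes with leaves from both strings below them are marked 1,2.]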

Example: Circular String Linearization
• Circular string
  - A circular string of length n is a string in which character n is considered to precede character 1.
• Example
  [Figure: a circular string drawn as a ring of characters.]

Example: Circular String Linearization (cont)
• Lexically smaller than
  - Given an ordering of the characters in the alphabet, a string S1 is lexically (or lexicographically) smaller than a string S2 if S1 would appear before S2 in a normal dictionary ordering of the two strings.
• Circular string linearization problem
  - Given a circular string S of length n, choose a place to cut S so that the resulting linear string is the lexically smallest of all the n possible linear strings created by cutting S.

Example: Circular String Linearization (cont)
• Example
  [Figure: the example circular string.]
• Motivation for circular string linearization
  - This problem arises in chemical databases for circular molecules. To allow faster lookup and comparison of molecules, one wants to store each circular string as a canonical linear string. A natural choice for the canonical linear string is the one that is lexically least.

Example: Circular String Linearization (cont)
• Solution via suffix trees
  1. Arbitrarily cut the circular string S, giving a linear string L.
  2. Double L, creating the string LL, and build the suffix tree T for LL.
  3. Affix the terminal symbol $ at the end of LL, but interpret it to be lexically greater than any other character in the alphabet.
  4. Traverse tree T with the rule that, at every node, the traversal follows the edge whose first character is lexically smallest over all first characters on edges out of the node.
  5. This traversal continues until the traversed path has string-depth n. The label of the traversed path is the lexically smallest linear string.
• This method runs in linear time and is therefore optimal.
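The traversal in the steps above can be sketched as follows. This sketch assumes an uncompressed suffix trie of LL$ in place of a linear-time suffix tree, so it uses quadratic time and space and does not achieve the linear bound stated on the slide. The function name `linearize` is hypothetical, and the terminal symbol is modeled by the largest Unicode code point, which is assumed not to occur in the input.

```python
def linearize(circular):
    """Return the lexically smallest linear string obtainable by cutting
    the circular string, given as any one of its linear cuts L."""
    n = len(circular)
    if n == 0:
        return ""
    # Terminal symbol interpreted as greater than every other character.
    sentinel = chr(0x10FFFF)
    doubled = circular + circular + sentinel  # LL$

    # Suffix trie of LL$ as nested dicts keyed by single characters.
    root = {}
    for i in range(len(doubled)):
        node = root
        for ch in doubled[i:]:
            node = node.setdefault(ch, {})

    # Follow the lexically smallest outgoing character until depth n.
    node, path = root, []
    for _ in range(n):
        ch = min(node)
        path.append(ch)
        node = node[ch]
    return "".join(path)


print(linearize("cdba"))  # acdb
```

Because every substring of LL shorter than n that ends at the end of LL also ends at position n of LL, the traversal always has a non-terminal edge to follow before reaching depth n.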

Example: Circular String Linearization (cont)
• Example: circular string with characters b, a, c, d
  - L = cdba
  - LL$ = cdbacdba$
  [Figure: suffix tree for LL$ with leaves numbered 1 to 9. Following the lexically smallest edge at each node down to string-depth 4 spells acdb.]

Suffix Array
• When alphabet size is included in the time and space bounds, a suffix tree may require too much space to be practical in some applications.
• Suffix array
  - Very space efficient, and yet can be used to solve the exact matching problem or the substring problem almost as efficiently as with a suffix tree.
  - Likely to be an important contribution to certain string problems in computational biology, where the alphabet can be very large.

Suffix Array (cont)
• Suffix array
  - Given an m-character string T, a suffix array for T, called Pos, is an array of the integers in the range 1 to m, specifying the lexicographic order of the m suffixes of string T.
  - That is, the suffix starting at position Pos(1) of T is the lexically smallest suffix, and in general suffix Pos(i) of T is lexically smaller than suffix Pos(i+1).

Suffix Array (cont)
• Example: T = mississippi

    Pos   Suffix
    11    i
     8    ippi
     5    issippi
     2    ississippi
     1    mississippi
    10    pi
     9    ppi
     7    sippi
     4    sissippi
     6    ssippi
     3    ssissippi
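The example above can be reproduced with a short sketch. It assumes the simplest possible construction, sorting the suffixes directly, which is far from the most efficient way to build a suffix array; the function names are hypothetical. The search routine uses two binary searches over Pos, comparing only the first n characters of each suffix against the pattern.

```python
def suffix_array(text):
    """Return Pos: 1-based suffix start positions in lexicographic order.

    Sorting the suffixes directly is simple but slow for long texts.
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, pos, pattern):
    """Return sorted 1-based positions where pattern occurs in text."""
    n = len(pattern)

    def prefix(rank):
        start = pos[rank] - 1
        return text[start:start + n]

    # First rank whose n-character prefix is >= pattern.
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose n-character prefix is > pattern.
    hi = len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])


pos = suffix_array("mississippi")
print(pos)                                          # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(find_occurrences("mississippi", pos, "ssi"))  # [3, 6]
```

All suffixes that begin with the pattern occupy a contiguous block of Pos, which is why two boundary searches are enough.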

Suffix Array (cont)
• Notice that the suffix array holds only integers and hence contains no information about the alphabet used in string T.
• Therefore, the space required by suffix arrays is modest.

Where do large alphabet problems arise?
• Motivation for suffix arrays
  - Suffix arrays come from problems that arise in using suffix trees when the underlying alphabet is large.
• Large alphabet problems
  - Natural languages, such as Chinese
  - Strings containing numbers, each of which is treated as a character

Where do large alphabet problems arise? (cont)
• Map matching problem
  - A restriction enzyme map for a single enzyme specifies the locations in a DNA string where copies of a certain substring occur.
  - The restriction enzyme map for that single enzyme is represented as a string consisting of a sequence of integers specifying the distances between successive enzyme sites. More generally, a map may display the sites of many different patterns of interest.
  - The alphabet is huge because the range of integers is huge, and since distances are often known with high precision, the numbers are not rounded off.
  - In addition, most DNA substrings that are studied are not fully sequenced; maps are easier and cheaper to obtain than sequences.
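A tiny sketch may help show why such maps lead to large alphabets. The site positions below are made-up illustrative values, not data from the slides, and the function name `restriction_map` is hypothetical. Each distance in the result plays the role of one "character" of the map string.

```python
def restriction_map(site_positions):
    """Turn enzyme site positions into the sequence of distances between
    successive sites; each distance is one symbol of the map string."""
    sites = sorted(site_positions)
    return [b - a for a, b in zip(sites, sites[1:])]


# Hypothetical site positions along a DNA string (illustrative only).
print(restriction_map([120, 1570, 1602, 9875]))  # [1450, 32, 8273]
```

Since any positive integer up to the length of the DNA string can appear as a distance, the effective alphabet of the map string can be very large.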

Reference
• Dan Gusfield, "Algorithms on Strings, Trees, and Sequences", Cambridge University Press