Suffix Trees

Presentation transcript:

Suffix Trees String: any sequence of characters. Substring of string S: a string composed of characters i through j of S, i <= j. S = cater => ate is a substring; car is not a substring. The empty string is a substring of S.

Subsequence Subsequence of string S: a string composed of characters i1 < i2 < … < ik of S. S = cater => ate is a subsequence; car is a subsequence. The empty string is a subsequence.
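The difference between the two definitions can be checked directly. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and not part of the slides.

```python
def is_substring(s, p):
    """True if p occupies consecutive positions i..j of s (empty p counts)."""
    return p in s

def is_subsequence(s, p):
    """True if the characters of p appear in s in order, gaps allowed."""
    remaining = iter(s)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so later characters of p must be found further to the right.
    return all(ch in remaining for ch in p)

if __name__ == "__main__":
    for p in ("ate", "car", ""):
        print(repr(p), is_substring("cater", p), is_subsequence("cater", p))
```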

String/Pattern Matching You are given a source string S. Answer queries of the form: is the string pi a substring of S? Knuth-Morris-Pratt (KMP) string matching: O(|S| + |pi|) time per query, so O(n|S| + Σi |pi|) time for n queries. Suffix tree solution: O(|S| + Σi |pi|) time for n queries. Very significant run-time reduction in situations where S is very long and the pi are very short.

String/Pattern Matching KMP preprocesses the query string pi, whereas the suffix tree method preprocesses the source string S. An application of string matching: genome project. Databank of strings (gene sequences). Character set is A, T, G, C. Determine if a “new” sequence is a substring of a databank sequence.

Definition Of Suffix Tree Compressed trie with edge information. Keys are the nonempty suffixes of a given string S. Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r.

String Matching & Suffixes pi is a substring of S iff pi is a prefix of some suffix of S. Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r. Which of these are substrings of S? leep, eepe, pe, leap, peel
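The prefix-of-a-suffix characterization can be tried out on the candidate strings from this slide. This is only a brute-force sketch of the definition, not the suffix tree search; `suffixes` and `is_prefix_of_some_suffix` are illustrative names.

```python
def suffixes(s):
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]

def is_prefix_of_some_suffix(s, p):
    """p is a substring of s iff some suffix of s starts with p."""
    return any(suffix.startswith(p) for suffix in suffixes(s))

if __name__ == "__main__":
    s = "sleeper"
    print(suffixes(s))
    for p in ("leep", "eepe", "pe", "leap", "peel"):
        print(p, is_prefix_of_some_suffix(s, p))
```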

Last Character Of S Repeats When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix: the length 1 suffix (i.e., the last character) is a proper prefix of the suffix that begins at each of the other occurrences of this last character. S = creeper: creeper, reeper, eeper, eper, per, er, r. To overcome this problem, append an end of string character # that appears nowhere else in S. S = creeper#: creeper#, reeper#, eeper#, eper#, per#, er#, r#, #.
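The effect of the end of string character can be seen by listing the suffixes that are proper prefixes of other suffixes, with and without #. A small sketch follows, assuming # does not otherwise occur in S; the helper name `prefix_suffixes` is illustrative.

```python
def prefix_suffixes(s):
    """Suffixes of s that are proper prefixes of some other suffix of s."""
    all_suffixes = [s[i:] for i in range(len(s))]
    return [a for a in all_suffixes
            if any(b != a and b.startswith(a) for b in all_suffixes)]

if __name__ == "__main__":
    print(prefix_suffixes("creeper"))    # the last character r repeats
    print(prefix_suffixes("creeper#"))   # # occurs only once
```

With the terminator in place no suffix is a proper prefix of another, so every suffix can end in its own element node.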

Suffix Tree For S = abbbabbbb# [Figure: suffix tree for S = abbbabbbb#.] Edges are labeled with the branch character plus the skipped over characters. When # is added, the char# of the root is always 1.

Suffix Tree For S = abbbabbbb# [Figure: suffix tree with edge labels abbb, b, #, abbbb#, b# and element nodes holding indices into S; the positions of S are numbered 1 through 10.] Element nodes index into the string rather than keep complete suffixes. The index is to the first character of the suffix that would otherwise be in the element node.

Suffix Tree For S = abbbabbbb# [Figure: the same suffix tree with an index stored in each branch node as well as in each element node.] Edge information (an edge is labeled with the branch character plus the skipped over characters) is extracted using an index in the child branch/element node. The branch node index is the same as that in any descendant element node (i.e., first char of any suffix in the subtree). Use the indicated character in the search string to reach a branch node; use the index in the reached branch node to figure out the skipped characters. Need to check characters from the previous node's char# to the current node's char# - 1. Note that by simply looking at the char#s on the path from the root, you can’t figure out what the edge info is because you can’t tell where you are in the string S.

Suffix Tree Construction See Web write up for algorithm. Time complexity |S| = n, alphabet size = r. O(nr) using array nodes. This is O(n) for r a constant (or r <= c). O(n) expected time using a hash table. O(n) time algorithm for large r in reference cited in Web write up.
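For experimenting with the structure, a straightforward construction that inserts the suffixes one at a time is enough. The sketch below is a naive O(n^2) build, not the algorithm from the Web write up; it stores each edge label as a pair of indices into S instead of the char# scheme of the slides, and the names `Node`, `build_suffix_tree`, `contains` and `element_nodes` are all illustrative.

```python
class Node:
    """Branch or element node; the edge into it is labeled s[start:end]."""
    def __init__(self, start, end, suffix=None):
        self.start = start
        self.end = end
        self.suffix = suffix   # 1-based suffix start for element nodes
        self.children = {}     # branch character -> child Node

def build_suffix_tree(s):
    """Naive quadratic build; s must end with a unique terminator such as #."""
    if not s or s.count(s[-1]) != 1:
        raise ValueError("s must end with a unique end of string character")
    n = len(s)
    root = Node(0, 0)
    for i in range(n):                       # insert suffix s[i:]
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:                # no edge: new element node
                node.children[s[j]] = Node(j, n, suffix=i + 1)
                break
            k = child.start                  # walk along the edge label
            while k < child.end and s[k] == s[j]:
                k += 1
                j += 1
            if k == child.end:               # whole label matched: descend
                node = child
                continue
            mid = Node(child.start, k)       # mismatch inside label: split
            node.children[s[mid.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, suffix=i + 1)
            break
    return root

def contains(root, s, p):
    """Follow p from the root; the walk inspects at most len(p) characters."""
    node, j = root, 0
    while j < len(p):
        child = node.children.get(p[j])
        if child is None:
            return False
        k = child.start
        while k < child.end and j < len(p):
            if s[k] != p[j]:
                return False
            k += 1
            j += 1
        node = child
    return True

def element_nodes(node):
    """Suffix indices stored in the element nodes below node."""
    if not node.children:
        return [node.suffix]
    found = []
    for child in node.children.values():
        found.extend(element_nodes(child))
    return found

if __name__ == "__main__":
    s = "abbbabbbb#"
    tree = build_suffix_tree(s)
    for p in ("babb", "abbba", "baba"):
        print(p, contains(tree, s, p))
```

Children are kept in a Python dict here, which plays the role of the hash table option mentioned on the slide.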

O(|pi|) Time Substring Matching [Figure: suffix tree for S = abbbabbbb# with element nodes 1 through 10.] Example search strings: babb, abbba, baba.

Find All Occurrences Of pi Search suffix tree for pi. Suppose the search for pi is successful. When search terminates at an element node, pi appears exactly once in the source string S.

Search Terminates At Element Node [Figure: suffix tree for S = abbbabbbb# with element nodes 1 through 10.] Example search string: abbbb#.

Search Terminates At Branch Node When the search for pi terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of pi.

Search Terminates At Branch Node [Figure: suffix tree for S = abbbabbbb# with element nodes 1 through 10.] Example search string: ab.

Find All Occurrences Of pi To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment the suffix tree: Link all element nodes into a chain in inorder. Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
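Why two pointers per branch node suffice can be illustrated without building the tree. If children are visited in character order, the inorder chain of element nodes lists the suffixes in lexicographic order, and the element nodes below any branch node form one contiguous stretch of that chain. The sketch below uses a sorted list of suffix start positions as a stand-in for the chain and binary search as a stand-in for the tree walk, so it shows the contiguity property but not the stated time bound; `occurrences` is an illustrative name.

```python
from bisect import bisect_left, bisect_right

def occurrences(s, p):
    """1-based start positions of p in s, via a lexicographically sorted chain."""
    # Stand-in for the inorder chain of element nodes.
    chain = sorted(range(len(s)), key=lambda i: s[i:])
    # Each suffix truncated to len(p); truncation preserves the sorted order.
    keys = [s[i:i + len(p)] for i in chain]
    left = bisect_left(keys, p)     # plays the role of the leftmost pointer
    right = bisect_right(keys, p)   # one past the rightmost pointer
    return sorted(i + 1 for i in chain[left:right])

if __name__ == "__main__":
    s = "abbbabbbb#"
    for p in ("ab", "abbbb#", "bb", "baba"):
        print(p, occurrences(s, p))
```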

Augmented Suffix Tree [Figure: augmented suffix tree for S = abbbabbbb#.]

Longest Repeating Substring Find the longest substring of S that occurs at least m > 1 times in S. Label branch nodes with the number of element nodes in the subtree. Find the branch node with label >= m and max char# field.
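The counting idea can be tried on a simpler structure. The sketch below swaps the compressed suffix tree for an uncompressed suffix trie, which costs O(n^2) time and space, and keeps in every node the number of suffixes that pass through it; that number is the occurrence count of the node's string, so the deepest node with count >= m gives the answer. The function name and the dict-based node layout are illustrative.

```python
def longest_repeating_substring(s, m=2):
    """Longest substring of s that occurs at least m times ('' if none)."""
    root = {"count": 0, "children": {}}
    best = ""
    for i in range(len(s)):                  # insert suffix s[i:]
        node = root
        for j in range(i, len(s)):
            node = node["children"].setdefault(
                s[j], {"count": 0, "children": {}})
            node["count"] += 1               # one more occurrence of s[i:j+1]
            if node["count"] >= m and j - i + 1 > len(best):
                best = s[i:j + 1]
    return best

if __name__ == "__main__":
    s = "abbbabbbb"
    for m in (2, 5):
        print(m, longest_repeating_substring(s, m))
```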

Longest Repeating Substring [Figure: suffix tree for S = abbbabbbb# with circled counts on the branch nodes; the cases m = 2 and m = 5 are marked.] Circled values are the number of suffixes (excluding #) in the subtree.

Longest Common Substring Given two strings S and T. Find the longest common substring. S = carport, T = airports Longest common substring = rport Longest common subsequence = arport Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming. Longest common substring may be found in O(|S|+|T|) time using a suffix tree.

Longest Common Substring Let $ be a new symbol. Construct the suffix tree for the string U = S$T#. U = carport$airports# No repeating substring includes $. Find longest repeating substring that is both to left and right of $. Find branch node that has max char# and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
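The same marking scheme can be sketched on an uncompressed suffix trie of U = S$T#. This is a quadratic stand-in for the suffix tree, so it does not achieve the O(|S|+|T|) bound; it assumes $ and # occur in neither input, and the function name and node layout are illustrative.

```python
def longest_common_substring(s, t):
    """Longest string occurring in both s and t, via a trie of s + '$' + t + '#'."""
    u = s + "$" + t + "#"
    dollar = len(s)                          # position of $ in u
    root = {"in_s": False, "in_t": False, "children": {}}
    best = ""
    for i in range(len(u)):                  # insert suffix u[i:]
        node = root
        for j in range(i, len(u)):
            node = node["children"].setdefault(
                u[j], {"in_s": False, "in_t": False, "children": {}})
            if i < dollar:
                node["in_s"] = True          # this suffix begins in S
            elif i > dollar:
                node["in_t"] = True          # this suffix begins in T
            # A path containing $ can only come from a suffix starting in S
            # (or at $ itself), so it is never marked as being in both.
            if node["in_s"] and node["in_t"] and j - i + 1 > len(best):
                best = u[i:j + 1]
    return best

if __name__ == "__main__":
    print(longest_common_substring("carport", "airports"))
```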