Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.


Advanced Data Structures Lecture 8 Mingmin Xie

Agenda
o Overview
o Trie
o Suffix Tree
o Suffix Array, LCP
o Construction
o Applications

Overview of String Matching
Given an alphabet Σ, a text T, and a pattern P: is there a substring of T that matches P?
Algorithmic approach: scan T in O(|T|) time per query, e.g. KMP.
Data-structural approach:
o If T is large, immutable, and searched often, why not preprocess it?
o Answer queries in O(|P|) time using O(|T||Σ|) space and O(|T| + sort(Σ)) preprocessing time.

Trie
The name comes from "retrieval".
A set of words: { ana, ann, anna, anne }
o Is "ann" in the set? O(|P|)
o Which words start with "ann"? O(|P| + output)
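The two queries above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the names `TrieNode`, `Trie`, `contains`, and `starts_with` are assumptions, and children are kept in a dict rather than a |Σ|-sized pointer array.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # letter -> TrieNode
        self.is_word = False    # True if a stored word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from the root; return the node reached, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """Is the word in the set? One step per letter of the query."""
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        """All stored words with the given prefix (sorted only for display)."""
        node = self._walk(prefix)
        if node is None:
            return []
        found, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.is_word:
                found.append(text)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return sorted(found)


words = Trie(["ana", "ann", "anna", "anne"])
print(words.contains("ann"))
print(words.starts_with("ann"))
```

The dict-based children trade the O(|Σ|) array per node for hashing; the pointer-array layout is what the later space bound on the slides counts.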

Compressed Trie
o Coalesce non-branching paths into single edges
o Store only indices into the text on each edge, not the substring itself

Space complexity of Compressed Trie There are only branching nodes and leaf nodes Remove a leaf and its parent until there is no branching node:

Space complexity of Compressed Trie
In the end at least one leaf node remains, so:
o number of branching nodes < number of leaf nodes = number of words T
o number of nodes = branching + leaf < 2T
o number of edges = number of nodes - 1 < 2T
Every node has an array of pointers, one per letter of the alphabet Σ. Space: O(|T||Σ|)
So far we have a dictionary which takes O(|T||Σ|) space and serves queries in O(|P|) time.

Back to String Matching
Now for the original string matching problem: find P in T.
We can treat all suffixes of T as a dictionary and query the pattern P in it. If P is a prefix of any suffix, then P is in T.
For example: T = "banana"
The set of suffixes: { banana$, anana$, nana$, ana$, na$, a$, $ }

Suffix Tree Making use of the trie, we can put all suffixes of T into a trie.

Suffix Tree
Is P in T? O(|P|)
Given a pattern P, all occurrences of P in T can be found and reported in O(|P| + output) time.
Space: O(|T||Σ|)
o length of T = |T| = number of suffixes

How to Construct?
The simple way: insert the suffixes (words) into the trie one by one.
Time: O(|T|^2)
Worst case: T = "aaaa", { $, a$, aa$, aaa$, aaaa$ }
We walk the path of the common prefix "aa.." again and again, which can be avoided by preprocessing common prefixes.
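A minimal sketch of the simple construction, assuming an uncompressed trie stored as nested dicts and a text that does not itself contain "$". The function names `build_suffix_trie` and `occurs` are illustrative only.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie.

    Each suffix is walked letter by letter, so the total work is
    quadratic in len(text), as on the slide.
    """
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def occurs(root, pattern):
    """P is in T exactly when P is a prefix of some suffix."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


root = build_suffix_trie("banana")
print(occurs(root, "nan"), occurs(root, "nab"))
```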

Suffix Array T = "banana"

Suffix Array
The suffix array is a powerful structure in its own right. It is easier to implement, so it can sometimes be used instead of a suffix tree.
If SA[i] = j, then suffix Sj has rank i among the set of strings {S0, ..., Sn−1}.
We can search for occurrences of P directly on the suffix array using binary search in O(|P| lg |T|) time.
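The binary search can be sketched as below. This is an illustration under stated assumptions: the suffix array is built by plain sorting (not a linear-time method), no "$" terminator is appended, and the names `suffix_array` and `find_occurrences` are hypothetical.

```python
def suffix_array(text):
    """Naive construction: sort start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern, using two binary searches.

    Each probe compares at most len(pattern) letters, so the search costs
    O(|P| lg |T|) letter comparisons.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "banana"
sa = suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "ana"))
```

All matches sit in one contiguous block of the suffix array, which is why two boundary searches suffice.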

Longest Common Prefix Array

Equivalence

Construction of Suffix Tree From Suffix Array and LCP:

Construction of Suffix Array and LCP Skew algorithm[2]. Overview:

Skew Algorithm
Suppose T = "mississippi".
1. Sort Σ = {i, m, p, s} => {1, 2, 3, 4}, at cost O(sort(Σ)). The first iteration may use any sorting algorithm, which gives the O(sort(Σ)) term. The following iterations can use radix sort to sort in linear time (see below).
2. Alphabet reduction: replace each letter in the text with its rank among the letters in the text.
T = 2 1 4 4 1 4 4 1 3 3 1
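Steps 1 and 2 can be sketched in a few lines. The function name `reduce_alphabet` is an assumption, and Python's built-in sort stands in for "any sorting algorithm".

```python
def reduce_alphabet(text):
    """Replace each letter by its 1-based rank among the distinct letters."""
    letters = sorted(set(text))                       # step 1: sort the alphabet
    rank = {ch: r for r, ch in enumerate(letters, start=1)}
    return [rank[ch] for ch in text], rank            # step 2: rename the text


reduced, rank = reduce_alphabet("mississippi")
print(rank)
print(reduced)
```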

Skew Algorithm
3. Divide the text T into 3 parts and treat each triple of letters as one megaletter, i.e. change the alphabet. More formally, form T0, T1, and T2 as follows:
T0 = < (T[3i], T[3i+1], T[3i+2]) for i = 0, 1, 2, ... >
T1 = < (T[3i+1], T[3i+2], T[3i+3]) for i = 0, 1, 2, ... >
T2 = < (T[3i+2], T[3i+3], T[3i+4]) for i = 0, 1, 2, ... >
Positions past the end of T are padded with $. Note that the Ti's are just texts with n/3 letters over a new alphabet Σ3. The text size has become a third of the original, while the alphabet size has cubed. Here we have:
T0 = mis sis sip pi$
T1 = iss iss ipp i$$
T2 = ssi ssi ppi
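A small sketch of step 3, assuming "$" padding past the end of the text. The helper name `megaletters` is hypothetical.

```python
def megaletters(text, offset, pad="$"):
    """Cut text into triples starting at offset (0, 1, or 2), padding the tail."""
    padded = text + pad * 3
    return [padded[i:i + 3] for i in range(offset, len(text), 3)]


text = "mississippi"
for k in range(3):
    print(f"T{k} =", " ".join(megaletters(text, k)))
```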

Skew Algorithm
Suffixes(T) ≅ Suffixes(T0) ∪ Suffixes(T1) ∪ Suffixes(T2)

T0                 T1                 T2
mis sis sip pi$    iss iss ipp i$$    ssi ssi ppi
sis sip pi$        iss ipp i$$        ssi ppi
sip pi$            ipp i$$            ppi
pi$                i$$

Skew Algorithm
4. Recurse on the concatenation of T0 and T1: sort the new alphabet Σ3 => {1, 2, 3, 4, 5, 6, 7} and compute the suffix array of the renamed text.
Here rank[i] = j if SA[j] = i. We use it to compare suffixes later.
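The renaming and the rank array can be illustrated on the running example. This is a sketch only: a plain sort of the renamed text stands in for the recursive call, ranks are 0-based, and the helper names `triples` and `rank_from_sa` are assumptions. It relies on the last megaletter of T0 containing "$", as it does here, so that suffixes of T0 do not compare past their own end.

```python
def triples(text, offset, pad="$"):
    """Megaletters of text starting at offset, padded with '$'."""
    padded = text + pad * 3
    return [padded[i:i + 3] for i in range(offset, len(text), 3)]


def rank_from_sa(sa):
    """Inverse permutation: rank[i] = j exactly when sa[j] = i."""
    rank = [0] * len(sa)
    for j, i in enumerate(sa):
        rank[i] = j
    return rank


text = "mississippi"
t01 = triples(text, 0) + triples(text, 1)             # T0 followed by T1
names = {m: r for r, m in enumerate(sorted(set(t01)), start=1)}
renamed = [names[m] for m in t01]                     # text over {1, ..., 7}

# Stand-in for the recursion: sort suffixes of the renamed text directly.
sa01 = sorted(range(len(renamed)), key=lambda i: renamed[i:])
rank01 = rank_from_sa(sa01)

print(renamed)
print(sa01)
print(rank01)
```

Index k < 4 of `rank01` refers to T0[k:], and index 4 + k refers to T1[k:].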

Skew Algorithm
5. Sort the suffixes of T2 using radix sort: for a suffix T2[i:], pull off the first letter; what remains is a suffix of T0, namely T0[i+1:], whose rank is already known.
SA2: {2, 1, 0}
LCP: simply check the first letter:
o distinct: 0
o same: look up LCP01 for T0[i+1:] and add 1
LCP2 = {0, 1 + LCP01[1]} = {0, 3}
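A sketch of step 5 on text positions, where T2[i:] is the suffix at position 3i+2. Assumptions: Python's comparison sort replaces the radix sort (so this is not linear time), the ranks of the T0/T1 suffixes come from a naive sort rather than the recursion, and the name `sort_mod2_suffixes` is hypothetical.

```python
def sort_mod2_suffixes(text, rank01):
    """Sort positions p with p % 3 == 2 by (first letter, rank of suffix p+1).

    rank01 maps a position q with q % 3 != 2 to the rank of its suffix;
    a position past the end gets rank -1 so the empty suffix sorts first.
    """
    return sorted(range(2, len(text), 3),
                  key=lambda p: (text[p], rank01.get(p + 1, -1)))


text = "mississippi"
sa01 = sorted((p for p in range(len(text)) if p % 3 != 2),
              key=lambda p: text[p:])
rank01 = {p: r for r, p in enumerate(sa01)}

sa2 = sort_mod2_suffixes(text, rank01)
print(sa2)                      # text positions
print([p // 3 for p in sa2])    # indices into T2
```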

Skew Algorithm
6. Merge SA01 and SA2 linearly (as in merge sort). We need to compare T0[i:] or T1[i:] with T2[j:] in constant time. In the pairs and triples below, a suffix of T0 or T1 stands for its rank from step 4.
T0[i:] vs T2[j:] ~= (T[3i], T1[i:]) vs (T[3j+2], T0[j+1:]), e.g.:
o T0[1:] ~= sis sip pi$ ~= s iss ipp i$$ ~= (s, T1[1:]) ~= (s, 2)
o T2[1:] ~= ssi ppi ~= s sip pi$ ~= (s, T0[2:]) ~= (s, 6)
o so T0[1:] < T2[1:]
T1[i:] vs T2[j:] ~= (T[3i+1], T[3i+2], T0[i+1:]) vs (T[3j+2], T[3j+3], T1[j+1:]), e.g.:
o T1[1:] ~= iss ipp i$$ ~= i s sip pi$ ~= (i, s, T0[2:]) ~= (i, s, 6)
o T2[1:] ~= ssi ppi ~= s s ipp i$$ ~= (s, s, T1[2:]) ~= (s, s, 1)
o so T1[1:] < T2[1:]
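The merge can be sketched on text positions (T0[i:] is position 3i, T1[i:] is 3i+1, T2[j:] is 3j+2). This is an illustration under assumptions: the two input arrays are produced by naive sorting instead of the earlier steps, ranks are 0-based, and the name `merge_skew` is hypothetical.

```python
def merge_skew(text, sa01, sa2):
    """Merge sorted mod-0/1 positions with sorted mod-2 positions.

    Every comparison looks at one or two letters plus one precomputed
    rank, so it takes constant time and the merge is linear.
    """
    n = len(text)
    rank = {p: r for r, p in enumerate(sa01)}

    def ch(p):
        return text[p] if p < n else ""     # "" sorts before any letter

    def rk(p):
        return rank.get(p, -1)              # past the end sorts first

    def keys(p, q):
        if p % 3 == 0:                      # T0 vs T2: letter + rank
            return (ch(p), rk(p + 1)), (ch(q), rk(q + 1))
        # T1 vs T2: two letters + rank
        return (ch(p), ch(p + 1), rk(p + 2)), (ch(q), ch(q + 1), rk(q + 2))

    out, i, j = [], 0, 0
    while i < len(sa01) and j < len(sa2):
        a, b = keys(sa01[i], sa2[j])
        if a < b:
            out.append(sa01[i])
            i += 1
        else:
            out.append(sa2[j])
            j += 1
    return out + sa01[i:] + sa2[j:]


text = "mississippi"
n = len(text)
sa01 = sorted((p for p in range(n) if p % 3 != 2), key=lambda p: text[p:])
sa2 = sorted((p for p in range(n) if p % 3 == 2), key=lambda p: text[p:])
print(merge_skew(text, sa01, sa2))
```

The offsets are chosen so that the suffix left after pulling off one or two letters always starts at a mod-0 or mod-1 position, whose rank is already known.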

Skew Algorithm
Finally, the merged suffix array is SA: {10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2}
LCP: done exactly as above.
T0[i:] vs T2[j:]
o 1st character different: 0
o 1st character same: look up LCP01 and add 1
T1[i:] vs T2[j:]
o 1st character different: 0
o 1st same, 2nd different: 1
o both equal: look up LCP01 and add 2

Time Complexity
The recursive execution time obeys the recurrence:
T(n) = O(n) + T(2n/3)
T(n) = O(1) for n < 3
Solution: T(n) = O(n)
The first sort of Σ costs O(sort(Σ)). Overall: O(|T| + sort(Σ)).
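The solution of the recurrence follows by unrolling it into a geometric series. Here c is an assumed constant bounding the linear work per level:

```latex
\begin{aligned}
T(n) &\le cn + T\!\left(\tfrac{2n}{3}\right) \\
     &\le cn + c\,\tfrac{2}{3}\,n + T\!\left(\tfrac{4n}{9}\right) \\
     &\le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k} + O(1) \\
     &= 3cn + O(1) = O(n).
\end{aligned}
```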

Applications
Count the number of occurrences of P
o augment each subtree: record its number of leaves
Longest repeated substring: O(|T|)
o find the branching node of maximum letter depth
Multiple documents via multiple $'s: T = T1$1T2$2...
Longest common substring between two documents: O(|T|)
o combine them as above
o find the deepest node with both $1 and $2 below it
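Two of these applications can be sketched with a suffix array plus LCP array in place of the suffix tree. This is a swapped-in technique, not the slides' tree-based method: the deepest branching node corresponds to the largest LCP entry. The constructions below are naive (not linear time), the function names are hypothetical, and the separators "\x01" and "\x02" are assumed not to occur in either document.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[i] = length of the common prefix of suffixes sa[i] and sa[i+1]."""
    def lcp(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]


def longest_repeated_substring(text):
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    if not lcp or max(lcp) == 0:
        return ""
    i = max(range(len(lcp)), key=lcp.__getitem__)
    return text[sa[i]:sa[i] + lcp[i]]


def longest_common_substring(a, b):
    """Combine the documents with distinct separators, then take the largest
    LCP between neighbouring suffixes that start in different documents."""
    text = a + "\x01" + b + "\x02"
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    best_len, best_pos = 0, 0
    for i, length in enumerate(lcp):
        p, q = sa[i], sa[i + 1]
        if p == len(a) or q == len(a):
            continue                        # suffix starting at the separator
        if (p < len(a)) != (q < len(a)) and length > best_len:
            best_len, best_pos = length, p
    return text[best_pos:best_pos + best_len]


print(longest_repeated_substring("mississippi"))
print(longest_common_substring("abcd", "xbcy"))
```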

Document Retrieval

Q & A Thank you