Strings: Tries, Suffix Trees

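Trie (Prefix Tree)
A trie is a tree for storing keys, usually strings. Each edge is labeled with one letter, so the path from the root to a node spells out a prefix.
A node is designated an end node when the path to it spells a complete stored string.
Example: cat, car, call, dog, dag, do.
Every leaf is an end node, but an end node need not be a leaf: "do" ends at an internal node because "dog" continues below it.
[Figure: trie for the example words, with edges labeled c, a, t, r, l, d, o, g.]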
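
The structure described above can be sketched in a few lines of C++. This is a minimal illustration, not code from the slides: the names TrieNode, trie_insert, and trie_contains are hypothetical, and children are kept in a std::map so that they stay in alphabetical order.

```cpp
#include <map>
#include <memory>
#include <string>

// One trie node: children keyed by letter, plus an end-of-string flag.
// (TrieNode and the two functions below are illustrative names.)
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> child;
    bool is_end = false;
};

// Insert a word by walking (and creating where needed) one edge per letter.
void trie_insert(TrieNode& root, const std::string& word) {
    TrieNode* cur = &root;
    for (char c : word) {
        auto& next = cur->child[c];
        if (!next) next = std::make_unique<TrieNode>();
        cur = next.get();
    }
    cur->is_end = true;  // designate the end node
}

// True only if this exact word was inserted, not merely a prefix of one.
bool trie_contains(const TrieNode& root, const std::string& word) {
    const TrieNode* cur = &root;
    for (char c : word) {
        auto it = cur->child.find(c);
        if (it == cur->child.end()) return false;
        cur = it->second.get();
    }
    return cur->is_end;
}
```

With the example words inserted, trie_contains finds "do" (an internal end node) but rejects "ca", which is only a prefix of stored words.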

Tries Compared to Hash Tables (for Strings)
No hash function is needed.
No chaining or other collision handling is needed.
Keys can be kept in alphabetical order.
Like hash tables, tries work on data other than strings, though they might not work as well in those cases.
Relative efficiency depends on how the structures end up being stored in memory, caches, and so on. Hash tables are generally likely to be more efficient; tries are better in the worst case.
A terminating node can store more information, such as a link to other data.

Suffix Trie
A suffix trie is a trie into which you enter every suffix of a word.
Example: the suffixes of "reverse" are reverse, everse, verse, erse, rse, se, e.
Once built, a suffix trie allows faster processing for various later tasks. It has to be built first, though, so for some tasks the total time can be longer.

[Figure: suffix trie for "reverse", with one letter per edge.]

Suffix Tree from a Suffix Trie
Suffix tries tend to have long chains of nodes that each have a single child, which is inefficient.
Instead of one letter per edge, a suffix tree labels each edge with a string of letters.
An edge can still be split in two if a new string differs part-way through its label.
String comparison proceeds letter by letter anyway, so matching along a multi-letter edge wastes no time.
Instead of O(n^2) nodes for a suffix trie, the suffix tree has at most 2n nodes.

[Figure: the suffix trie for "reverse" compressed into a suffix tree, with multi-letter edge labels such as "verse", "rse", "everse", and "se".]

Using a Suffix Trie/Tree
Substring test: to find out whether a string is a substring of the text, search for it from the root. It must match the beginning of some suffix. The search does not have to end at a terminating node; as long as every edge along the way exists, the string is a substring.
Counting occurrences: keep a count in each node of how many suffixes end at that node or below it. The count at the node where the search ends is the number of occurrences.
Longest repeated substring: find the deepest internal node of the suffix tree (not trie), measuring depth in letters. An internal node spells a prefix shared by two or more suffixes.
Longest common substring (not subsequence) of two strings: create a joint suffix tree for both strings, mark each node as having descendants from one string, the other, or both, and take the deepest node marked with both strings.
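
The substring test and the per-node counts can be sketched with a plain suffix trie (one letter per edge). This is a minimal illustration under stated assumptions: SuffixTrieNode, build_suffix_trie, and count_occurrences are hypothetical names, and the uncompressed trie takes O(n^2) time and space, so it only suits short strings.

```cpp
#include <map>
#include <memory>
#include <string>

// Suffix trie node; count = number of suffixes ending at or below this node.
// (All names here are illustrative, not from the slides.)
struct SuffixTrieNode {
    std::map<char, std::unique_ptr<SuffixTrieNode>> child;
    int count = 0;
};

// Insert every suffix of s, bumping the count of each node on its path.
std::unique_ptr<SuffixTrieNode> build_suffix_trie(const std::string& s) {
    auto root = std::make_unique<SuffixTrieNode>();
    for (std::size_t start = 0; start < s.size(); ++start) {
        SuffixTrieNode* cur = root.get();
        for (std::size_t i = start; i < s.size(); ++i) {
            auto& next = cur->child[s[i]];
            if (!next) next = std::make_unique<SuffixTrieNode>();
            cur = next.get();
            ++cur->count;
        }
    }
    return root;
}

// Occurrences of a non-empty pattern in the text; 0 means "not a substring".
// The walk may stop at any node, not only at a terminating one.
int count_occurrences(const SuffixTrieNode& root, const std::string& pattern) {
    const SuffixTrieNode* cur = &root;
    for (char c : pattern) {
        auto it = cur->child.find(c);
        if (it == cur->child.end()) return 0;
        cur = it->second.get();
    }
    return cur->count;
}
```

For "reverse", the pattern "e" is counted three times and "ver" once, while "sr" falls off the trie and yields zero.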

Suffix Arrays
Suffix trees can be constructed in O(n) time, but the algorithm is complicated and hard to code quickly and accurately.
Suffix arrays provide nearly as good operations and are much simpler to implement.
Idea: identify each suffix by its starting index (0 through n-1 for a string of length n; index n, if included, is the empty suffix). Then sort the indices so that the suffixes they refer to are in alphabetical order.

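Example: suffix array for "reverse" (positions numbered from 1; position 8 is the empty suffix)

Unsorted                Sorted
Position  Suffix        Position  Suffix
1         reverse       8         (empty)
2         everse        7         e
3         verse         4         erse
4         erse          2         everse
5         rse           1         reverse
6         se            5         rse
7         e             6         se
8         (empty)       3         verse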

Implementing a Suffix Array (1)
Let the original string be S, with n suffixes.
Initialize SA[i] = i for every suffix.
Sort SA with sort(SA, SA+n, cmp), where the comparison is:
    bool cmp(int a, int b) { return strcmp(S+a, S+b) < 0; }
That is, cmp is true when the string starting at a is less than the string starting at b.
This is very easy to code, but it takes too long for long strings (more than 1000 characters or so): each strcmp is O(n), so the whole sort is O(n^2 lg n).
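
The same idea as a complete function, as a minimal sketch: it uses std::string::compare in place of strcmp so that it works on a std::string, and the name suffix_array_naive is hypothetical. Indices are 0-based here, unlike the 1-based example table.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort suffix start indices by comparing whole suffixes.
// Each comparison is O(n), so the sort is O(n^2 lg n) in the worst case.
std::vector<int> suffix_array_naive(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // SA[i] = i
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        // True when the suffix starting at a sorts before the one at b.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}
```

For "reverse" this gives the order 6, 3, 1, 0, 4, 5, 2, which is the example's sorted order with every position shifted down by one and the empty suffix left out.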

Implementing a Suffix Array (2)
Performance improves if each sorting pass looks at only a limited prefix of each suffix.
First sort by the first letter only, then by the first two letters, then by the first four, then by the first eight, and so on by powers of 2.
See the book for very efficient code and a more detailed explanation of the process.
Each pass requires a stable sort. Counting sort is fine and runs in linear time.
With lg n passes of linear-time sorting, the end result is O(n lg n), so long strings are possible.
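
A simplified sketch of the doubling idea follows. It is not the book's code: to stay short it uses std::sort on pairs of ranks in each pass instead of a counting sort, which makes it O(n lg^2 n) rather than O(n lg n). The function name suffix_array_doubling is hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Prefix doubling: after the pass with step k, rank[] orders suffixes by
// their first 2k letters. Uses std::sort per pass, so O(n lg^2 n) overall.
std::vector<int> suffix_array_doubling(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; ++i) rank[i] = static_cast<unsigned char>(s[i]);

    for (int k = 1;; k *= 2) {
        // Key = (rank of first k letters, rank of the next k letters).
        // A suffix that runs out of letters gets -1 and sorts first.
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + k < n ? rank[a + k] : -1;
            int rb = b + k < n ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);

        // Re-rank: suffixes with equal keys share a rank for the next pass.
        if (n > 0) tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (n == 0 || rank[sa[n - 1]] == n - 1) break;  // all ranks distinct
    }
    return sa;
}
```

Swapping the std::sort call for a two-pass counting (radix) sort on the rank pairs is what brings each pass down to linear time.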

Using a Suffix Array
Every use below requires computing the suffix array first.
Finding occurrences of a substring: binary search for the last suffix that sorts before the target string and the first suffix that sorts after every match. All the suffixes in between start with the target, so each one is a match. Finding a substring of length m takes O(m lg n).
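
The two binary searches can be sketched with std::lower_bound and std::upper_bound, comparing only the first m letters of each suffix. The names build_sa and find_range are hypothetical, and build_sa is the naive builder, included only to keep the sketch self-contained.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Naive suffix array builder, used here only for self-containment.
std::vector<int> build_sa(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Half-open range [lo, hi) of suffix-array slots whose suffixes start with
// pattern. Each of the O(lg n) probes compares at most m letters.
std::pair<int, int> find_range(const std::string& s, const std::vector<int>& sa,
                               const std::string& pattern) {
    const std::size_t m = pattern.size();
    // First suffix whose first m letters are >= pattern.
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&](int suffix, const std::string& p) { return s.compare(suffix, m, p) < 0; });
    // First suffix whose first m letters are > pattern.
    auto hi = std::upper_bound(lo, sa.end(), pattern,
        [&](const std::string& p, int suffix) { return s.compare(suffix, m, p) > 0; });
    return {static_cast<int>(lo - sa.begin()), static_cast<int>(hi - sa.begin())};
}
```

The match positions in the text are then sa[lo] through sa[hi - 1], and hi - lo is the number of occurrences; an empty range means the pattern does not occur.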

Using a Suffix Array: Longest Common Prefix (LCP)
The LCP is computed between suffixes that are adjacent in suffix array order.
Binary search on letters: first letter, then second, and so on. Each letter narrows the range that has to be searched subsequently.
By amortized analysis, the total is O(n).
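
One well-known way to reach the O(n) amortized bound is Kasai's algorithm, sketched below. Treat the choice as an assumption: the slide does not name a specific method, and this is not necessarily the one it has in mind. The names build_sa and lcp_kasai are hypothetical, and build_sa is the naive builder, included only for self-containment.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array builder, used here only for self-containment.
std::vector<int> build_sa(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Kasai's algorithm. lcp[i] is the length of the longest common prefix of
// the suffixes at sa[i-1] and sa[i]; lcp[0] is 0 by convention.
std::vector<int> lcp_kasai(const std::string& s, const std::vector<int>& sa) {
    const int n = static_cast<int>(s.size());
    std::vector<int> rank(n), lcp(n, 0);
    for (int i = 0; i < n; ++i) rank[sa[i]] = i;  // inverse of sa
    int h = 0;  // current match length, carried between iterations
    for (int i = 0; i < n; ++i) {  // visit suffixes in text order
        if (rank[i] == 0) { h = 0; continue; }
        int j = sa[rank[i] - 1];   // suffix just before i in sorted order
        while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h;
        lcp[rank[i]] = h;
        if (h > 0) --h;  // the next suffix shares at least h-1 letters
    }
    return lcp;
}
```

Because h drops by at most one per iteration and never exceeds n, the inner while loop does O(n) work in total, which is where the amortized bound comes from.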

Using a Suffix Array: Longest Repeated Substring
Once the LCP values are available, the longest repeated substring is given by the maximum LCP value encountered.
This takes no more time than computing the LCP values.
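
A compact sketch of the maximum-LCP idea. To stay short it builds the suffix array naively and compares adjacent suffixes directly, so it is only meant for short strings; the function name longest_repeated_substring is hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Longest substring that occurs at least twice (occurrences may overlap).
// Naive suffix array plus direct adjacent comparisons: simple, not fast.
std::string longest_repeated_substring(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });

    int best_len = 0, best_pos = 0;
    for (int i = 1; i < n; ++i) {
        // LCP of the two suffixes adjacent in sorted order.
        int a = sa[i - 1], b = sa[i], h = 0;
        while (a + h < n && b + h < n && s[a + h] == s[b + h]) ++h;
        if (h > best_len) { best_len = h; best_pos = b; }
    }
    return s.substr(best_pos, best_len);  // empty if nothing repeats
}
```

For "banana" the maximum adjacent LCP is 3, between "ana" and "anana", so the result is "ana".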

Using a Suffix Array: Longest Common Substring of Two Strings
Concatenate one string onto the end of the other, with a unique terminating character between them.
Then find the longest repeated substring, but count only pairs of adjacent suffix array entries whose suffixes start in different strings.
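
The concatenation trick can be sketched as follows. The assumptions are loud ones: the separator '\x01' is assumed to appear in neither input, the suffix array is built naively, and the function name longest_common_substring is hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Longest common substring of a and b via one suffix array over a + sep + b.
// Assumes the separator '\x01' occurs in neither a nor b.
std::string longest_common_substring(const std::string& a, const std::string& b) {
    const std::string s = a + '\x01' + b;
    const int n = static_cast<int>(s.size());
    const int split = static_cast<int>(a.size());  // index of the separator

    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int x, int y) {
        return s.compare(x, std::string::npos, s, y, std::string::npos) < 0;
    });

    int best_len = 0, best_pos = 0;
    for (int i = 1; i < n; ++i) {
        int x = sa[i - 1], y = sa[i];
        // Skip adjacent pairs that start in the same source string.
        if ((x < split) == (y < split)) continue;
        int h = 0;  // the unique separator stops any match at the boundary
        while (x + h < n && y + h < n && s[x + h] == s[y + h]) ++h;
        if (h > best_len) { best_len = h; best_pos = x; }
    }
    return s.substr(best_pos, best_len);  // empty if nothing is shared
}
```

Checking only adjacent entries is enough: the common prefix of any two suffixes is the minimum of the adjacent LCP values between them, and somewhere in that stretch a suffix from one string sits next to a suffix from the other.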