Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.

Slides:



Advertisements
Similar presentations
Data Compressor---Huffman Encoding and Decoding. Huffman Encoding Compression Typically, in files and messages, Each character requires 1 byte or 8 bits.
Advertisements

Two implementation issues Alphabet size Generalizing to multiple strings.
Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Tries Standard Tries Compressed Tries Suffix Tries.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
Advanced Algorithms Piyush Kumar (Lecture 12: String Matching/Searching) Welcome to COT5405 Source: S. Šaltenis.
Goodrich, Tamassia String Processing1 Pattern Matching.
1 prepared from lecture material © 2004 Goodrich & Tamassia COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING This material.
© 2004 Goodrich, Tamassia Greedy Method and Compression1 The Greedy Method and Text Compression.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
© 2004 Goodrich, Tamassia Greedy Method and Compression1 The Greedy Method and Text Compression.
CSC 213 Lecture 18: Tries. Announcements Quiz results are getting better Still not very good, however Average score on last quiz was 5.5 Every student.
Department of Computer Eng. & IT Amirkabir University of Technology (Tehran Polytechnic) Data Structures Lecturer: Abbas Sarraf Search.
© 2004 Goodrich, Tamassia Tries1 Chapter 7 Tries Topics Basics Standard tries Compressed ( 壓縮 ) tries Suffix ( 尾字 ) tries.
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
Data Structures & Algorithms Radix Search Richard Newman based on slides by S. Sahni and book by R. Sedgewick.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Text Processing 1 Last Update: July 31, Topics Notations & Terminology Pattern Matching – Brute Force – Boyer-Moore Algorithm – Knuth-Morris-Pratt.
Huffman Codes. Encoding messages  Encode a message composed of a string of characters  Codes used by computer systems  ASCII uses 8 bits per character.
Data Compression1 File Compression Huffman Tries ABRACADABRA
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
CSC401 – Analysis of Algorithms Chapter 9 Text Processing
 The amount of data we deal with is getting larger  Not only do larger files require more disk space, they take longer to transmit  Many times files.
Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.
Compression and Huffman Coding. Compression Reducing the memory required to store some information. Lossless compression vs lossy compression Lossless.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
15-853:Algorithms in the Real World
MA/CSSE 473 Day 26 Student questions Boyer-Moore B Trees.
Binary Search Trees < > =
Design & Analysis of Algorithm Huffman Coding
Red-Black Trees 5/17/2018 Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
COMP261 Lecture 22 Data Compression 2.
Tries 07/28/16 11:04 Text Compression
Red-Black Trees 5/22/2018 Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Tries 5/27/2018 3:08 AM Tries Tries.
Binary Search Tree A Binary Search Tree is a binary tree that is either an empty or in which every node contain a key and satisfies the following conditions.
Binary Search Trees < > =
Greedy Method 6/22/2018 6:57 PM Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.
Andrzej Ehrenfeucht, University of Colorado, Boulder
B-Trees 7/5/2018 4:26 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Mark Redekopp David Kempe
The Greedy Method and Text Compression
Heaps 8/2/2018 Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and M. H. Goldwasser,
The Greedy Method and Text Compression
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Pattern Matching 9/14/2018 3:36 AM
13 Text Processing Hongfei Yan June 1, 2016.
Radix search trie (RST) R-way trie (RT) De la Briandias trie (DLB)
Binary Search Trees < > =
Optimal Merging Of Runs
Data Structure and Algorithms
Binary Search Trees < > =
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Suffix Trees String … any sequence of characters.
Tries 2/27/2019 5:37 PM Tries Tries.
Binary Search Trees < > =
Chap 3 String Matching 3 -.
Sequences 5/17/ :43 AM Pattern Matching.
Algorithms CSCI 235, Spring 2019 Lecture 31 Huffman Codes
Presentation transcript:

Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and M. H. Goldwasser, Wiley, 2014 Tries Tries

Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries Last() in Boyer-Moore’s algorithm If the text is large, immutable and searched for often (e.g., works by Shakespeare) preprocess the text instead of the pattern A trie represents a set of strings E.g. all the words in a text supports pattern matching queries in time proportional to the pattern size, ie O(m) Tries

Standard Tries The standard trie for a set of strings S is an ordered tree such that: Each node but the root is labeled with a character The children of a node are alphabetically ordered The paths from the external nodes to the root yield the strings of S Example: standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop } Tries

Analysis of Standard Tries A standard trie uses O(n) space supports searches, insertions and deletions in time O(dm), where: n total size of the strings in S m size of the string parameter of the operation d size of the alphabet Tries

Word Matching with a Trie insert the words of the text into trie Each leaf is associated w/ one particular word leaf stores indices where associated word begins (“see” starts at index 0 & 24, leaf for “see” stores those indices) s e b a r ? l t o c k ! u y i d h p 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 a e b l s u t 0, 24 o c i r 6 78 d 47, 58 30 y 36 12 k 17, 40, 51, 62 p 84 h 69 Tries

Compressed Tries internal nodes of degree at least two obtained from standard trie by compressing chains of nodes with only one child ex. the “i” and “d” in “bid” are “redundant” because they signify the same word Tries

Compact Representation Compact representation of a compressed trie for an array of strings: Stores at the nodes ranges of indices instead of substrings Uses O(s) space, where s is the number of strings in the array Serves as an auxiliary index structure indexOfS,start,end Tries

Suffix Trie The suffix trie of a string X is the compressed trie of all the suffixes of X Tries

Suffix matching (naturally) suffix = “ize” Is suffix in text? Tries

Pattern matching pattern = “min” Is pattern in text? Tries

Frequency of pattern pattern = “mi” How many times “mi” occurs? Tries

Suffix Trie Add locations of suffix at the leaves 7 2 6 3 1 5 4 Tries

Locations of a pattern 7 2 6 3 1 5 4 Pattern = “mi” What are the locations of the pattern? 7 2 6 3 1 5 4 Tries

Locations of a pattern 7 2 6 3 1 5 4 Pattern = “mini” What are the locations of the pattern? 7 2 6 3 1 5 4 Tries

Compact Representation Each node stores the begin and end indices in the array Tries

Analysis of Suffix Tries string X of size n from an alphabet of size d Uses O(n) space Can be constructed in O(n) time Supports pattern/suffix matching in X O(dm) time, where m is the size of the pattern Tries

Skipping the rest Tries

Encoding Trie (1) A code is a mapping of each character of an alphabet to a binary code-word A prefix code is a binary code such that no code-word is the prefix of another code-word An encoding trie represents a prefix code Each leaf stores a character The code word of a character is given by the path from the root to the leaf storing the character (0 for a left child and 1 for a right child a b c d e 00 010 011 10 11 a b c d e Tries

Encoding Trie (2) c a r d b T1 a c d b r T2 Given a text string X, we want to find a prefix code for the characters of X that yields a small encoding for X Frequent characters should have short code-words Rare characters should have long code-words Example X = abracadabra T1 encodes X into 29 bits T2 encodes X into 24 bits c a r d b T1 a c d b r T2 Tries