Tries 07/28/16 11:04 Text Compression

Slides:



Advertisements
Similar presentations
Applied Algorithmics - week7
Advertisements

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Greedy Algorithms Amihood Amir Bar-Ilan University.
Data Compressor---Huffman Encoding and Decoding. Huffman Encoding Compression Typically, in files and messages, Each character requires 1 byte or 8 bits.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Tries Standard Tries Compressed Tries Suffix Tries.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
Compression & Huffman Codes
Advanced Algorithms Piyush Kumar (Lecture 12: String Matching/Searching) Welcome to COT5405 Source: S. Šaltenis.
Goodrich, Tamassia String Processing1 Pattern Matching.
1 Huffman Codes. 2 Introduction Huffman codes are a very effective technique for compressing data; savings of 20% to 90% are typical, depending on the.
Optimal Merging Of Runs
© 2004 Goodrich, Tamassia Greedy Method and Compression1 The Greedy Method and Text Compression.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
© 2004 Goodrich, Tamassia Greedy Method and Compression1 The Greedy Method and Text Compression.
A Data Compression Algorithm: Huffman Compression
CSC 213 Lecture 18: Tries. Announcements Quiz results are getting better Still not very good, however Average score on last quiz was 5.5 Every student.
Compression & Huffman Codes Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Department of Computer Eng. & IT Amirkabir University of Technology (Tehran Polytechnic) Data Structures Lecturer: Abbas Sarraf Search.
© 2004 Goodrich, Tamassia Tries1 Chapter 7 Tries Topics Basics Standard tries Compressed ( 壓縮 ) tries Suffix ( 尾字 ) tries.
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
Chapter 9: Huffman Codes
CS 206 Introduction to Computer Science II 12 / 10 / 2008 Instructor: Michael Eckmann.
Greedy Algorithms Huffman Coding
Data Structures Arrays both single and multiple dimensions Stacks Queues Trees Linked Lists.
Algorithm Design & Analysis – CS632 Group Project Group Members Bijay Nepal James Hansen-Quartey Winter
Huffman Codes. Encoding messages  Encode a message composed of a string of characters  Codes used by computer systems  ASCII uses 8 bits per character.
Data Compression1 File Compression Huffman Tries ABRACADABRA
Lecture Objectives  To learn how to use a Huffman tree to encode characters using fewer bytes than ASCII or Unicode, resulting in smaller files and reduced.
Data Structures Week 6: Assignment #2 Problem
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
CSC401 – Analysis of Algorithms Chapter 9 Text Processing
 The amount of data we deal with is getting larger  Not only do larger files require more disk space, they take longer to transmit  Many times files.
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.
Huffman Codes Juan A. Rodriguez CS 326 5/13/2003.
Bahareh Sarrafzadeh 6111 Fall 2009
بسم الله الرحمن الرحيم My Project Huffman Code. Introduction Introduction Encoding And Decoding Encoding And Decoding Applications Applications Advantages.
Compression and Huffman Coding. Compression Reducing the memory required to store some information. Lossless compression vs lossy compression Lossless.
1 COMP9024: Data Structures and Algorithms Week Ten: Text Processing Hui Wu Session 1, 2016
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Design & Analysis of Algorithm Huffman Coding
HUFFMAN CODES.
COMP261 Lecture 22 Data Compression 2.
Compression & Huffman Codes
Tries 5/27/2018 3:08 AM Tries Tries.
Binary Search Tree A Binary Search Tree is a binary tree that is either an empty or in which every node contain a key and satisfies the following conditions.
Greedy Method 6/22/2018 6:57 PM Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.
Applied Algorithmics - week7
Mark Redekopp David Kempe
The Greedy Method and Text Compression
The Greedy Method and Text Compression
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
13 Text Processing Hongfei Yan June 1, 2016.
Optimal Merging Of Runs
Chapter 8 – Binary Search Tree
Chapter 9: Huffman Codes
Optimal Merging Of Runs
Trees Lecture 9 CS2110 – Fall 2009.
Data Structure and Algorithms
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Tries 2/27/2019 5:37 PM Tries Tries.
File Compression Even though disks have gotten bigger, we are still running short on disk space A common technique is to compress files so that they take.
Sequences 5/17/ :43 AM Pattern Matching.
Huffman Coding Greedy Algorithm
Algorithms CSCI 235, Spring 2019 Lecture 31 Huffman Codes
Presentation transcript:

Tries 07/28/16 11:04 Text Compression Efficiently encode a string X into a smaller string Y. Save memory and/or bandwidth. Huffman encoding Compute frequency f(c) for each character c. Encode characters with bit strings. Use short strings for frequent characters. No string is a prefix for another string. Compute an optimal encoding expressed as a binary tree. Huffman Encoding 1

Encoding Tree Example a b c d e Tries 07/28/16 11:04 Encoding Tree Example A code is a mapping of each character of an alphabet to a binary code-word. A prefix code is a binary code such that no code-word is the prefix of another code-word. An encoding tree represents a prefix code. Each external node stores a character. The code word of a character is given by the path from the root to the external node storing the character (0 for a left child and 1 for a right child). a b c d e 00 010 011 10 11 a b c d e Huffman Encoding 2

Encoding Tree Optimization Tries 07/28/16 11:04 Encoding Tree Optimization Given a text string X, we want to find a prefix code for the characters of X that yields a small encoding for X. Frequent characters should have short code-words. Rare characters should have long code-words. Example X = abracadabra T1 encodes X into 29 bits T2 encodes X into 24 bits c a r d b T1 a c d b r T2 Huffman Encoding 3

Tries 07/28/16 11:04 Huffman’s Algorithm Algorithm HuffmanEncoding(X) Input string X of size n Output optimal encoding trie for X C  distinctCharacters(X) computeFrequencies(C, X) Q  empty heap for all c  C T  new leaf storing c Q.insert(getFrequency(c), T) while Q.size() > 1 f1  Q.minKey() T1  Q.removeMin() f2  Q.minKey() T2  Q.removeMin() T  join(T1, T2) Q.insert(f1 + f2, T) return Q.removeMin() Constructs a prefix code the minimizes the encoding size of a string X. Runs in time O(nd log d), where n is the size of X and d is the number of characters in X. A heap-based priority queue is used as an auxiliary structure. 4

Example X = abracadabra Frequencies a b c d r 5 2 1 c a b d r 2 4 6 11 Tries 07/28/16 11:04 Example c a b d r 2 4 6 11 X = abracadabra Frequencies a b c d r 5 2 1 c a b d r 2 5 4 6 c a r d b 5 2 1 c a r d b 2 5 c a b d r 2 5 4 Huffman Encoding 5

Tries 07/28/16 11:04 Larger Huffman Tree Huffman Encoding 6

Pattern Matching Pattern matching: find a pattern in a text. Tries 07/28/16 11:04 Pattern Matching Pattern matching: find a pattern in a text. The text is large, immutable, and often searched. Preprocess the text to make search faster. A trie is a compact data structure for representing a set of strings, such as all the words in a text. Search time is linear in pattern size. Tries 7

Tries The trie for a set of strings S is an ordered tree. 07/28/16 11:04 Tries The trie for a set of strings S is an ordered tree. Each node but the root is labeled with a character. The children of a node are alphabetically ordered. The paths from the root to the external nodes yield the strings of S. Example: S = { bear, bell, bid, bull, buy, sell, stock, stop } Tries 8

Tries 07/28/16 11:04 Analysis of Tries A trie uses O(n) space and supports search, insertion, and deletion in time O(dm). n total length of the strings in S m length of the query string d size of the alphabet Tries 9

Word Matching with a Trie Tries 07/28/16 11:04 Word Matching with a Trie insert the words of the text into a trie. A leaf stores the indices where a word begins. see starts at index 0 & 24. Its leaf stores those indices. s e b a r ? l t o c k ! u y i d h p 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 a e b l s u t 0, 24 o c i r 6 78 d 47, 58 30 y 36 12 k 17, 40, 51, 62 p 84 h 69 10 Tries

Tries 07/28/16 11:04 Compressed Tries A compressed trie has internal nodes of degree at least two. It is obtained by compressing chains of redundant nodes. The “i” and “d” in “bid” are redundant because they signify the same word. Tries 11

Compact Representation Tries 07/28/16 11:04 Compact Representation Compact representation of a compressed trie for an array of strings: Stores at the nodes ranges of indices instead of substrings. Uses O(s) space, where s is the number of strings in the array. Serves as an auxiliary index structure. Tries 12

Tries 07/28/16 11:04 Suffix Trie The suffix trie of a string X is the compressed trie of the non- null suffixes of X. Tries 13

Analysis of Suffix Tries 07/28/16 11:04 Analysis of Suffix Tries Compact representation of the suffix trie for a string X of length n from an alphabet of size d: Uses O(n) space. Supports arbitrary pattern matching queries in X in O(dm) time, where m is the length of the pattern. Can be constructed in O(n) time (algorithm not covered). Tries 14

Huffman Encoding Revisited Tries 07/28/16 11:04 Huffman Encoding Revisited Huffman encoding trees are tries. Each leaf stores a character instead of a string. The strings are generated by the algorithm, not input. a b c d e 00 010 011 10 11 a b c d e Tries 15