Huffman coding
Content:
1. Encoding and decoding messages: fixed-length coding, variable-length coding
2. Huffman coding

Fixed-length coding
Problem: Consider a message containing only characters from the alphabet {'a', 'b', 'c', 'd'}. The ASCII representation of the characters in the message is not appropriate: we have only 4 characters, yet each is coded using 8 bits. A code that uses only 2 bits per character is enough to store any message built from these 4 characters.
Fixed-length coding: the code of each character (or symbol) has the same number of bits.
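A minimal sketch of such a 2-bit fixed-length code (the code table and function names below are illustrative, not taken from the slides):

```python
# Illustrative 2-bit fixed-length code for the alphabet {'a', 'b', 'c', 'd'}.
FIXED = {'a': '00', 'b': '01', 'c': '10', 'd': '11'}

def encode_fixed(message):
    """Concatenate the 2-bit code of each character."""
    return ''.join(FIXED[ch] for ch in message)

def decode_fixed(bits):
    """Read the bit string back two bits at a time."""
    reverse = {code: ch for ch, code in FIXED.items()}
    return ''.join(reverse[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(encode_fixed("abcd"))                # 00011011 -> 8 bits instead of 32 with 8-bit ASCII
print(decode_fixed(encode_fixed("abcd")))  # abcd
```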

Question: How many bits do we need to encode each character uniquely in a message made up of characters from an n-letter alphabet?
Answer: at least ⌈log₂ n⌉ bits.

Variable-length coding: each character is assigned a code that may have a different length.
Problem when decoding: with encoding scheme 1, the same bit string can be decoded into more than one possible message: Message = aaabcabc..., which is correct, or Message = bcbcbcbc..., which is incorrect.

Consider now the following encoding scheme (scheme 2).
Decoding: when decoding, only one message is possible: Message = aaabcabc..., which is the correct message.

Prefix-free codes... What is a prefix? A binary string x is a prefix of a binary string y if y starts with x; for example, 00 is a prefix of 001 and of 0010.
Prefix-free codes: a prefix-free code is an encoding scheme in which the code of a character is not a prefix of the code of any other character. Encoding scheme 1 in the previous example is not a prefix-free code. Encoding scheme 2 in the previous example is a prefix-free code.

Prefix-free codes: advantage. A prefix-free code allows the message to be decoded uniquely: the encoded bit string represents only one possible message. This is the case with encoding scheme 2.
Variable-length prefix-free code vs. fixed-length code: compared to a fixed-length code, a variable-length prefix-free code can produce a shorter encoding of a message.
Example: consider Message = addd over the alphabet {'a', 'b', 'c', 'd'}.
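A sketch of unique decoding with a prefix-free code; the code table below is illustrative (the slide's scheme 2 is an image), chosen so that 'd', the frequent character in Message = addd, gets the shortest code:

```python
# Illustrative prefix-free code: no code word is a prefix of another code word.
PREFIX_FREE = {'d': '0', 'c': '10', 'b': '110', 'a': '111'}

def decode(bits, code):
    """Greedy decoding: because the code is prefix-free, at most one code word
    can match the bits accumulated so far, so the message is recovered
    unambiguously."""
    reverse = {v: k for k, v in code.items()}
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in reverse:
            out.append(reverse[buf])
            buf = ''
    return ''.join(out)

encoded = ''.join(PREFIX_FREE[ch] for ch in "addd")
print(encoded)                       # 111000 -> 6 bits, vs 8 bits with a 2-bit fixed-length code
print(decode(encoded, PREFIX_FREE))  # addd
```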

Huffman coding... Objective: Huffman coding is an algorithm used for lossless data compression. By lossless, we mean that the exact original data can be recovered by decoding the compressed data.
Applications: several data compression programs (WinZip, zip, gzip, ...) use lossless encoding.

Basic idea: each symbol in the original data to be compressed (for example, a character in a file) is assigned a code. The length of the code varies from one symbol to another (variable-length coding) and depends on the symbol's frequency, i.e. the number of times the symbol appears in the original data. Symbols with a high frequency (appearing more often than others) are assigned shorter codes; symbols with a low frequency (appearing less often) are assigned longer codes.

Consider the alphabet {'a', 'b', 'c', 'd', 'e', 'f', 'g'}. Count the frequency of each character in the message (the number of times the character appears). For example: a: 1, b: 3, c: 4, d: 10, e: 13, f: 12, g: 15, meaning that 'a' appears only once in the message, 'b' appears 3 times, and so on.

From the frequency table, build a forest of binary trees. Initially, each tree in the forest contains only a root corresponding to a character of the alphabet, together with its frequency (which we will call its weight). Then, repeatedly apply the following rule until a single tree remains: merge the two trees with the smallest weights, label the left edge out of the new root 0 and the right edge 1, and give the new root a weight equal to the sum of the weights (frequencies) of its left and right children.
Remark: the algorithm is non-deterministic, since no rule specifies which trees to merge when several weights are identical.
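The merging procedure can be sketched with a binary heap; the function below is an illustration under my own naming and tree representation (a leaf is a symbol, an internal node is a (left, right) pair), not code from the slides:

```python
import heapq

def huffman_codes(freq):
    """Build a Huffman tree from a {symbol: frequency} table and return a
    {symbol: code} dictionary. Ties between equal weights are broken
    arbitrarily, so several equally good code tables are possible."""
    # Heap entries are (weight, tie_breaker, tree).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # the two trees with the smallest weights
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))  # merged tree, weight = sum
        next_id += 1
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node
            walk(node[0], prefix + '0')   # left edge labelled 0
            walk(node[1], prefix + '1')   # right edge labelled 1
        else:                             # leaf: a single symbol
            codes[node] = prefix or '0'   # degenerate one-symbol alphabet
    walk(root, '')
    return codes
```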

The code of each character is obtained by concatenating the labels of the edges on the path from the root to the leaf representing that character. Let f_i be the frequency of a character and d_i the number of bits in its code. The total number of bits required to encode the message is the sum of d_i · f_i over all characters: 5·1 + 5·3 + 4·4 + 3·10 + 2·13 + 2·12 + 2·15 = 146 bits. We need 146 bits to encode the message with the given frequency table.
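As a quick check, the 146-bit total can be recomputed from the (d_i, f_i) pairs appearing in the sum above, and also rederived with the huffman_codes sketch from the previous slide, assuming the frequency table given earlier:

```python
# (code length d_i, frequency f_i) pairs, exactly as in the 146-bit sum above.
pairs = [(5, 1), (5, 3), (4, 4), (3, 10), (2, 13), (2, 12), (2, 15)]
print(sum(d * f for d, f in pairs))                # 146

# Same total, derived from the tree built by huffman_codes (previous sketch).
freq = {'a': 1, 'b': 3, 'c': 4, 'd': 10, 'e': 13, 'f': 12, 'g': 15}
codes = huffman_codes(freq)
print(sum(freq[s] * len(codes[s]) for s in freq))  # 146
```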

The average code word length is L = (1/N) · (f_1·d_1 + f_2·d_2 + ... + f_n·d_n), where n is the number of leaves (the number of symbols) in the binary tree and N = f_1 + f_2 + ... + f_n is the total number of characters in the message. Thus, we can also write L = p_1·d_1 + p_2·d_2 + ... + p_n·d_n, where p_i = f_i / N is the probability of occurrence of the i-th symbol.
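A small numeric check of the two equivalent formulas, using the example frequencies and the code lengths assumed above:

```python
freq   = {'a': 1, 'b': 3, 'c': 4, 'd': 10, 'e': 13, 'f': 12, 'g': 15}
length = {'a': 5, 'b': 5, 'c': 4, 'd': 3, 'e': 2, 'f': 2, 'g': 2}
N = sum(freq.values())                                         # 58 characters in total

avg_from_freqs = sum(freq[s] * length[s] for s in freq) / N    # (1/N) * sum of f_i * d_i
avg_from_probs = sum((freq[s] / N) * length[s] for s in freq)  # sum of p_i * d_i
print(avg_from_freqs, avg_from_probs)                          # both ~2.52 bits per symbol
```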

Link with information theory: the quantity of information carried by a message is called its entropy. The entropy is defined by E = -( P(C_1)·log₂ P(C_1) + P(C_2)·log₂ P(C_2) + ... + P(C_n)·log₂ P(C_n) ), where n is the size of the alphabet, C_i is the i-th character and P(C_i) is its associated probability. It can be interpreted as the optimal (minimal) average number of bits per symbol for a given alphabet with associated probabilities, and can therefore be compared with the average code word length obtained by Huffman coding. Huffman coding is near-optimal.

Example: consider a message using the characters a, b, c, d and e with associated probabilities .39, .21, .19, .12 and .09. We can compute the associated entropy: E = .39·log₂(1/.39) + .21·log₂(1/.21) + .19·log₂(1/.19) + .12·log₂(1/.12) + .09·log₂(1/.09) ≈ 2.14. Huffman coding for this situation gives code lengths for a, b, c, d and e of 2, 2, 2, 3 and 3 respectively. We can then compute the associated average code word length: L = .39·2 + .21·2 + .19·2 + .12·3 + .09·3 = 2.21. The obtained code is almost as compact as the optimal code.
Remark: many different encoding schemes can be obtained from the Huffman tree by exchanging edge labels at the same depth; all of them have the same code word lengths.
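A quick check of these numbers (probabilities and code lengths as given above; log2 from Python's math module):

```python
from math import log2

p      = {'a': .39, 'b': .21, 'c': .19, 'd': .12, 'e': .09}
length = {'a': 2,   'b': 2,   'c': 2,   'd': 3,   'e': 3}

entropy = sum(pi * log2(1 / pi) for pi in p.values())  # ~2.14 bits per symbol
avg_len = sum(p[s] * length[s] for s in p)             # 2.21 bits per symbol
print(round(entropy, 2), round(avg_len, 2))
```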