CPS 100, Spring 2008: Huffman Coding


CPS 100, Spring 2008: Huffman Coding
- D. A. Huffman, early 1950s
- Before compressing data, analyze the input stream
- Represent data using variable-length codes
- Variable-length codes via prefix codes
  - Each letter is assigned a codeword
  - The codeword for a given letter is produced by traversing the Huffman tree
  - Property: no codeword produced is the prefix of another
  - Letters appearing frequently have short codewords, while those that appear rarely have longer ones
- Huffman coding is an optimal per-character coding method
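The prefix property above is easy to check mechanically: no codeword may begin another codeword. A minimal sketch (class and method names are mine, not from the course code):

```java
import java.util.Arrays;
import java.util.List;

public class PrefixFree {
    // A code is prefix-free if no codeword is a prefix of a different codeword.
    public static boolean isPrefixFree(List<String> codewords) {
        for (int i = 0; i < codewords.size(); i++) {
            for (int j = 0; j < codewords.size(); j++) {
                if (i != j && codewords.get(j).startsWith(codewords.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefixFree(Arrays.asList("0", "10", "11"))); // prints true
        System.out.println(isPrefixFree(Arrays.asList("0", "01")));       // prints false
    }
}
```

Prefix-freeness is what lets a decoder emit a character the moment it reaches a leaf, with no lookahead.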

CPS 100, Spring 2008: Huffman coding: go go gophers
- Encoding uses the tree:
  - 0 = go left, 1 = go right
  - How many bits? 37!
  - Savings? Worth it?
[table: per-character codes under ASCII, a fixed 3-bit code, and Huffman, for g, o, p, h, e, r, s, space]
[figure: Huffman tree built from leaf weights g:3, o:3, space:2, p:1, h:1, e:1, r:1, s:1]
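The 37-bit figure can be checked with a short sketch: the total encoded length equals the sum of the weights of all internal nodes the greedy merging creates, assuming the leaf weights shown (g:3, o:3, space:2, and 1 each for p, h, e, r, s). Class and method names here are mine:

```java
import java.util.PriorityQueue;

public class HuffmanCost {
    // Total encoded length in bits = sum of the weights of every merged
    // (internal) node created by the greedy algorithm, since each merge
    // adds one bit to the codeword of every leaf beneath it.
    public static int encodedBits(int[] freqs) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        for (int f : freqs) pq.add(f);
        int total = 0;
        while (pq.size() > 1) {
            int merged = pq.remove() + pq.remove(); // two smallest weights
            total += merged;
            pq.add(merged);
        }
        return total;
    }

    public static void main(String[] args) {
        // "go go gophers": g:3, o:3, space:2, and p, h, e, r, s once each
        System.out.println(encodedBits(new int[]{3, 3, 2, 1, 1, 1, 1, 1})); // prints 37
    }
}
```

Compare with 13 characters at 3 bits each (39 bits) or at 8 ASCII bits each (104 bits).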

CPS 100, Spring 2008: Coding/Compression/Concepts
- We can use a fixed-length encoding, e.g., ASCII or 3 bits per character
  - What is the minimum number of bits to represent N values?
  - Repercussions for representing genomic data (a, c, g, t)?
- We can use a variable-length encoding, e.g., Huffman
  - How do we decide on lengths? How do we decode?
  - Values for Morse code encodings: why?
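The first question above has a closed-form answer: N distinct values need ceil(log2 N) bits, so the four DNA bases need only 2 bits each rather than 8. A minimal sketch (the method name is mine):

```java
public class BitsNeeded {
    // Minimum bits to distinguish n distinct values: ceil(log2(n)).
    public static int bitsNeeded(int n) {
        if (n <= 1) return 0; // a single value carries no information
        return 32 - Integer.numberOfLeadingZeros(n - 1);
    }

    public static void main(String[] args) {
        System.out.println(bitsNeeded(4));   // prints 2: genomic data (a, c, g, t)
        System.out.println(bitsNeeded(256)); // prints 8: one byte per extended-ASCII character
    }
}
```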

CPS 100, Spring 2008: Building a Huffman tree
- Begin with a forest of single-node trees (leaves)
  - Each node/tree/leaf is weighted with a character count
  - A node stores two values: character and count
  - There are n nodes in the forest, where n is the size of the alphabet
- Repeat until there is only one node left: the root of the tree
  - Remove the two minimally weighted trees from the forest
  - Create a new tree with the minimal trees as children; the new root's weight is the sum of its children's weights (character ignored)
- Does this process terminate? How do we get minimal trees?
  - Remove minimal trees, hmmm...

CPS 100, Spring 2008: Mary Shaw
- Software engineering and software architecture
  - Tools for constructing large software systems
  - Development is a small piece of total cost; maintenance is larger, and depends on well-designed and well-developed techniques
- Interested in computer science, programming, curricula, canoeing, and health-care costs
- ACM Fellow, Alan J. Perlis Professor of Computer Science at CMU

CPS 100, Spring 2008: Huffman coding: go go gophers
- Choose the two smallest weights
  - Combine nodes and weights
  - Repeat
  - Priority queue?
- Encoding uses the tree:
  - 0 = go left, 1 = go right
  - How many bits?
[table: per-character codes under ASCII, a fixed 3-bit code, and partial Huffman codes, for g, o, p, h, e, r, s, space]
[figure: successive merges of the lightest trees, e.g. p:1 + h:1 makes 2, e:1 + r:1 makes 2, and so on up to the root]

CPS 100, Spring 2008: How do we create the Huffman tree/trie?
- Insert weighted values into a priority queue
  - What are the initial weights? Why?
- Remove minimal nodes, weight by sums, re-insert
  - Total number of nodes?

    // TreeNode must implement Comparable, ordered by weight
    PriorityQueue<TreeNode> pq = new PriorityQueue<TreeNode>();
    for (int k = 0; k < myCounts.length; k++) {
        pq.add(new TreeNode(k, myCounts[k], null, null));
    }
    while (pq.size() > 1) {
        TreeNode left = pq.remove();
        TreeNode right = pq.remove();
        pq.add(new TreeNode(0, left.weight + right.weight, left, right));
    }
    TreeNode root = pq.remove();

CPS 100, Spring 2008: Other Huffman issues
- What do we need to decode?
  - How did we encode?
  - How will we decode?
- How do we know when to stop reading bits? How do we write bits?
  - We can write 3 bits, but the system will write 8; what do we do?
  - PSEUDO_EOF
- Decode: read a bit; if 0 go left, else go right
  - When do we stop?
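The read-a-bit, walk-the-tree loop above can be sketched as follows. The tiny nested TreeNode and the use of a '0'/'1' string in place of a real bit stream are my simplifications; a real decoder would read from BitInputStream and stop at PSEUDO_EOF rather than at the end of the input:

```java
public class HuffmanDecode {
    // Minimal tree node for the sketch; a real implementation would reuse
    // the encoder's TreeNode class.
    static class TreeNode {
        char ch;
        TreeNode left, right;
        TreeNode(char ch) { this.ch = ch; }
        TreeNode(TreeNode l, TreeNode r) { left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Walk the tree: 0 = left, 1 = right. Emit a character at each leaf,
    // then restart from the root.
    public static String decode(String bits, TreeNode root) {
        StringBuilder out = new StringBuilder();
        TreeNode cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.left : cur.right;
            if (cur.isLeaf()) {
                out.append(cur.ch);
                cur = root; // a real decoder stops here on PSEUDO_EOF
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tiny example code: 0 -> 'g', 10 -> 'o', 11 -> ' '
        TreeNode root = new TreeNode(new TreeNode('g'),
                new TreeNode(new TreeNode('o'), new TreeNode(' ')));
        System.out.println(decode("01011010", root)); // prints "go go"
    }
}
```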

CPS 100, Spring 2008: Huffman complexities
- How do we measure? Size of input file, size of alphabet
  - Which is typically bigger?
- Accumulating character counts: ______
  - How can we do this in O(1) time (though not really)?
- Building the heap/priority queue from counts: ______
  - Initializing the heap: guaranteed
- Building the Huffman tree: ______
  - Why?
- Creating the table of encodings from the tree: ______
  - Why?
- Writing the tree and the compressed file: ______
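The counting step above is one pass over the input with constant work per character. A minimal sketch, assuming 8-bit characters and a String in place of a real file stream (the class name is mine):

```java
public class CharCounts {
    // One pass: O(n) in the input size, O(1) per character,
    // using a fixed 256-slot array (one slot per 8-bit character).
    public static int[] counts(String input) {
        int[] counts = new int[256];
        for (char c : input.toCharArray()) {
            counts[c]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] c = counts("go go gophers");
        System.out.println(c['g'] + " " + c['o'] + " " + c[' ']); // prints 3 3 2
    }
}
```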

CPS 100, Spring 2008: Writing code out to file
- How do we go from characters to encodings?
  - Build the Huffman tree
  - Each root-to-leaf path generates an encoding
- Need a way of writing bits out to a file
  - Platform dependent?
  - Complicated to write bits and read them back in the same ordering
- See the BitInputStream and BitOutputStream classes
  - They depend on each other; bit ordering is preserved
- How do we know the bits come from a compressed file?
  - Store a magic number
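The magic-number idea can be sketched with the standard data streams (the course's BitInputStream/BitOutputStream classes would play this role in the real assignment; the MAGIC value below is a hypothetical choice, not the course's):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class MagicNumber {
    // Any fixed value the encoder and decoder agree on works.
    static final int MAGIC = 0x48554646; // the bytes "HUFF" in ASCII

    // Written first in every compressed file.
    static void writeHeader(DataOutputStream out) throws IOException {
        out.writeInt(MAGIC);
    }

    // Checked first when decompressing; reject the file if it doesn't match.
    static boolean isCompressed(DataInputStream in) throws IOException {
        return in.readInt() == MAGIC;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeHeader(new DataOutputStream(bytes));
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(isCompressed(in)); // prints true
    }
}
```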

CPS 100, Spring 2008: Other methods
- Adaptive Huffman coding
- Lempel-Ziv algorithms
  - Build the coding table on the fly while reading the document
  - The coding table changes dynamically
  - Protocol between encoder and decoder so that everyone is always using the right coding scheme
  - Works well in practice (compress, gzip, etc.)
- More complicated methods
  - Burrows-Wheeler (bzip2)
  - PPM statistical methods
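Java ships the Lempel-Ziv family in java.util.zip: DEFLATE, the scheme behind zip and gzip, combines LZ77 matching with Huffman coding of the output. A round-trip sketch (class and method names are mine):

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class LZRoundTrip {
    // Compress with DEFLATE (LZ77 matches + Huffman-coded output).
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length * 2 + 64]; // generous for small inputs
        int len = deflater.deflate(buf);
        return Arrays.copyOf(buf, len);
    }

    // Decompress, given the original length (a real format stores it in a header).
    public static byte[] decompress(byte[] compressed, int originalLength)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength];
        inflater.inflate(out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] input = "go go gophers go go gophers go go gophers".getBytes("US-ASCII");
        byte[] packed = compress(input);
        byte[] unpacked = decompress(packed, input.length);
        System.out.println(Arrays.equals(input, unpacked)); // prints true
        // packed.length is typically smaller than input.length for repetitive input
    }
}
```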