Data Compression Section 4.8 of [KT].


Data formats
Storage space runs out very fast: we need to compress files.

Encodings
Fixed-length encodings: 8 bits per character (ASCII). Decoding is simple: every 8 bits forms a character.
Morse code: an encoding using dots (0) and dashes (1), e.g. e: 0, t: 1, a: 01. Frequent letters are encoded by shorter strings.
Ambiguity: 0101 could translate to eta, aa, etet, aet, ...
Problem: the encoding of one letter (e) is a prefix of the encoding of another (a).
Morse handles this by adding pauses between letters, so it is really an encoding using dots, dashes and pauses.
We need a code in which decoding is unambiguous.

Prefix Codes
Code: a function γ : S → {0,1}*, where S is the alphabet and {0,1}* is the set of all possible 0/1 strings.
Prefix code: γ is prefix-free if for all x, y in S, γ(x) is not a prefix of γ(y).
Encoding: the string x1x2x3… is encoded as γ(x1) γ(x2) γ(x3)…
Decoding: read the shortest prefix that matches some character's code, output that character, delete the prefix, and repeat.
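
A minimal sketch of this decoding loop in Python, assuming the code is given as a dict from letters to bit strings (the function name and representation are illustrative, not from the slides):

```python
def decode(bits, code):
    """Decode a bit string under a prefix code.

    code: dict mapping each letter to its codeword, e.g. {'a': '11', ...}.
    Repeatedly read the shortest prefix of the remaining bits that matches
    some codeword, emit the corresponding letter, and drop that prefix.
    """
    inverse = {w: x for x, w in code.items()}  # codeword -> letter
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:            # shortest matching prefix found
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("leftover bits do not form a codeword")
    return "".join(out)
```

Because the code is prefix-free, the first codeword that matches the buffer is the only possible one, so the greedy loop is correct. For example, with the code γ1 from the next slide, decode("0010000011101", {'a': '11', 'b': '01', 'c': '001', 'd': '10', 'e': '000'}) returns "cecab".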

Prefix Code: Example
γ1(a) = 11, γ1(b) = 01, γ1(c) = 001, γ1(d) = 10, γ1(e) = 000.
This is a prefix code: the string cecab → 0010000011101, and it can be decoded unambiguously.
Multiple prefix codes are possible; which one is better?
γ2(a) = 11, γ2(b) = 10, γ2(c) = 01, γ2(d) = 001, γ2(e) = 000.

Cost of a Prefix Code
For each x in S, fx = frequency of x = fraction of times x appears in an average text; ∑x in S fx = 1.
ABL(γ) = average number of bits per letter for prefix code γ = ∑x in S fx · |γ(x)|.
Average encoding length for a text of n letters = n·ABL(γ) = ∑x in S n·fx·|γ(x)|.
Example: fa = 0.32, fb = 0.25, fc = 0.20, fd = 0.18, fe = 0.05.
ABL(γ1) = 0.32·2 + 0.25·2 + 0.20·3 + 0.18·2 + 0.05·3 = 2.25.
Cost of a fixed-length encoding? (Five letters need 3 bits each, so ABL = 3.)
ABL(γ2) = 0.32·2 + 0.25·2 + 0.20·2 + 0.18·3 + 0.05·3 = 2.23.
γ is optimal if ABL(γ) is the minimum possible.
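
A small sketch that recomputes these averages, with the codes and frequencies stored in plain dicts (the names are illustrative):

```python
# Frequencies and the two example codes from the slides.
freq = {'a': 0.32, 'b': 0.25, 'c': 0.20, 'd': 0.18, 'e': 0.05}
gamma1 = {'a': '11', 'b': '01', 'c': '001', 'd': '10', 'e': '000'}
gamma2 = {'a': '11', 'b': '10', 'c': '01', 'd': '001', 'e': '000'}

def abl(code, freq):
    """Average bits per letter: sum of f_x * |code(x)| over the alphabet."""
    return sum(freq[x] * len(code[x]) for x in freq)

print(abl(gamma1, freq))  # 2.25 (up to floating-point rounding)
print(abl(gamma2, freq))  # 2.23
```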

Prefix Codes to Binary Trees
Constructing a tree corresponding to a prefix code γ, recursively:
All letters x whose encoding γ(x) starts with 0 go in the left subtree of the root.
All letters x whose encoding γ(x) starts with 1 go in the right subtree of the root.
Recursively construct the left and right subtrees.

Prefix codes to binary trees
(Figure: the tree for γ1 built by this recursion. The root splits S into {b, c, e}, whose codes start with 0, and {a, d}, whose codes start with 1; the set {c, e} is then split off from b, and so on down to the leaves.)

Deriving a prefix code from a binary tree
Let T be a binary tree with |S| leaves, and label each leaf with a distinct letter x in S.
For each x in S, follow the path from the root to the leaf labeled x:
each time the path goes from a node to its left child, write 0;
each time the path goes from a node to its right child, write 1.
Example (for the tree shown, with leaves e, d, c, b, a): a → 1, b → 011, c → 010, d → 001, e → 000.
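
A brief sketch of this root-to-leaf traversal, assuming a tree is stored as nested nodes with letters at the leaves (the Node class and the concrete tree layout are illustrative):

```python
class Node:
    """A binary tree node; leaves carry a letter, internal nodes carry children."""
    def __init__(self, letter=None, left=None, right=None):
        self.letter, self.left, self.right = letter, left, right

def codewords(node, prefix="", table=None):
    """Walk root-to-leaf paths, appending 0 for left edges and 1 for right edges."""
    if table is None:
        table = {}
    if node.letter is not None:          # leaf: record the accumulated path
        table[node.letter] = prefix
    else:
        codewords(node.left, prefix + "0", table)
        codewords(node.right, prefix + "1", table)
    return table

# The tree from the example: a is the right child of the root,
# b, c, d, e sit at depth 3 in the left subtree.
tree = Node(left=Node(left=Node(left=Node('e'), right=Node('d')),
                      right=Node(left=Node('c'), right=Node('b'))),
            right=Node('a'))
print(codewords(tree))  # {'e': '000', 'd': '001', 'c': '010', 'b': '011', 'a': '1'}
```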

Different codes from different trees

Codes constructed from a binary tree
Lemma: The encoding of S constructed from a binary tree T is a prefix code.
Proof: Suppose the encoding of x were a prefix of the encoding of y. Then the root-to-x path is a prefix of the root-to-y path, so x is not a leaf, contradicting the fact that every letter labels a leaf.

ABL(T)
Length of the encoding of x in S = length of the path from the root to x = depthT(x).
ABL(γ) = ∑x in S fx · depthT(x) = ABL(T).
Choosing an optimal code is equivalent to choosing a tree T with minimum ABL(T).

Structure of optimal trees
Full binary tree: a binary tree T is full if every non-leaf node in T has two children.
Lemma: The binary tree corresponding to an optimal code is full.
Proof: Suppose T is not full. Then there is a non-leaf node u with only one child v.
If u is the root, form T' by deleting u and making v the root.
If u is not the root, with parent w, bypass u by making v a child of w directly, forming T'.
In either case every leaf's depth stays the same or decreases, and the leaves below v get strictly closer to the root, so ABL(T') < ABL(T), contradicting optimality.

Attempt I: Top-down approach
Intuition: produce a tree whose leaves are as close to the root as possible, i.e. with low average depth.
Shannon-Fano code:
Split S into sets S1 and S2 so that the total frequency in each set is as close to 1/2 as possible.
Recursively form subtrees T1 and T2 for S1 and S2, respectively.
Make T1 and T2 the children of a root node.
This performs fairly well in practice, but is not necessarily optimal (see the sketch below).
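
A rough sketch of this recursive split, assuming frequencies are given as a dict. The greedy pass below (sort by frequency, fill one half until it reaches roughly half the total) is one simple way to approximate the "as close to 1/2 as possible" split; it is not necessarily the exact split the slides intend:

```python
def shannon_fano(freq):
    """Return a code dict for a {letter: frequency} map (a sketch, not the optimal code)."""
    letters = sorted(freq, key=freq.get, reverse=True)
    if len(letters) == 1:
        return {letters[0]: ""}            # a single letter gets the empty suffix here

    # Greedily put letters into S1 until it holds roughly half the total frequency.
    total, running, split = sum(freq.values()), 0.0, 0
    for i, x in enumerate(letters):
        if running + freq[x] > total / 2 and i > 0:
            break
        running += freq[x]
        split = i + 1
    s1, s2 = letters[:split], letters[split:]

    code = {}
    for x, w in shannon_fano({x: freq[x] for x in s1}).items():
        code[x] = "0" + w                  # S1 goes into the left subtree
    for x, w in shannon_fano({x: freq[x] for x in s2}).items():
        code[x] = "1" + w                  # S2 goes into the right subtree
    return code
```

Note that finding the split whose two halves are truly closest to 1/2 each (e.g. {a, d} versus {b, c, e} in the example on the next slide) is a balanced-partition problem; the greedy pass above only approximates it.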

Attempt I: Example
S = {a, b, c, d, e}, fa = 0.32, fb = 0.25, fc = 0.20, fd = 0.18, fe = 0.05.
(Figure: splitting S into S1 = {a, d} and S2 = {b, c, e} and recursing gives a tree with ABL = 2.25; the optimal tree achieves ABL = 2.23.)

Structure of the optimal tree
Suppose we knew the optimal binary tree T*. How would we assign letters to the leaves?
Lemma: Suppose u, v are leaves of T* with depth(u) < depth(v), and suppose that in the optimal labeling of T*, leaf u is labeled with letter y and leaf v with letter z. Then fy ≥ fz.
Proof (exchange argument): Suppose fy < fz. Consider the new code obtained by exchanging y and z:
new ABL - old ABL = depth(u)·fz + depth(v)·fy - depth(u)·fy - depth(v)·fz = (depth(v) - depth(u))·(fy - fz) < 0,
contradicting the optimality of the labeling.
High-frequency letters must be at lower-depth leaves of T*.

Optimal labeling of the optimal tree
Order the leaves of T* in non-decreasing order of depth.
Order the letters in non-increasing order of frequency.
Match letters to leaves in this order.
This matching cannot be suboptimal; letters assigned to leaves at the same depth can be interchanged.
Example: fa = 0.32, fb = 0.25, fc = 0.20, fd = 0.18, fe = 0.05. (Figure: the optimal labeling assigns a to the shallowest leaf, then b, c, d, e in order of increasing depth.)

Properties of Optimal Prefix Codes
Lemma: There is an optimal prefix code, with tree T*, in which the two lowest-frequency letters are assigned to leaves that are siblings in T*.
Suppose v is a leaf at maximum depth in T*. Since T* is full, v must have a sibling w.
The two lowest-frequency letters y, z can be assigned to v and w, since labels at the same depth can be interchanged without changing ABL.
So it is safe to "lock up" y and z together.

Huffman's algorithm
If |S| = 2, encode one letter as 0 and the other as 1.
Otherwise, let y*, z* be the two lowest-frequency letters in S. Form S' = S - {y*, z*} ∪ {ω}, where ω is a new letter with fω = fy* + fz*. Recursively build an optimal tree T' for S', then obtain T by replacing the leaf labeled ω with an internal node whose two children are leaves labeled y* and z*.
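
A compact sketch of this procedure in Python, using a heap as the priority queue and merging bottom-up instead of recursing (function and variable names are illustrative):

```python
import heapq
from itertools import count

def huffman(freq):
    """Build a codeword table for {letter: frequency} by repeatedly merging
    the two lowest-frequency items, as in Huffman's algorithm."""
    tie = count()                         # tie-breaker so the heap never compares trees
    heap = [(f, next(tie), letter) for letter, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # lowest frequency (y*)
        f2, _, t2 = heapq.heappop(heap)   # second lowest (z*)
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))  # merged letter ω
    _, _, tree = heap[0]

    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                             # leaf: a letter
            code[node] = prefix or "0"    # single-letter alphabet edge case
        return code
    return walk(tree, "")

freq = {'a': 0.32, 'b': 0.25, 'c': 0.20, 'd': 0.18, 'e': 0.05}
print(huffman(freq))  # a, b, c get 2-bit codes; d, e get 3-bit codes (ABL = 2.23)
```

Each pass through the loop pulls out the current y*, z* and pushes the merged letter ω back with frequency fy* + fz*; expanding the nested tuples afterwards recovers the tree T.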

Huffman's algorithm: example
S = {a, b, c, d, e}, fa = 0.32, fb = 0.25, fc = 0.20, fd = 0.18, fe = 0.05.
The two lowest-frequency letters d and e are merged, giving S' = {a, b, c, (de)} with f(de) = 0.23.
(Figure: the optimal tree T' for S', and the tree T* for S obtained by expanding the leaf (de) into the two leaves d and e.)

Proving optimality of Huffman's algorithm
Lemma: ABL(T') = ABL(T) - fω.
Proof: The depth of every letter x ≠ y*, z* is the same in T and T'.
ABL(T) = ∑x in S fx·depthT(x)
= fy*·depthT(y*) + fz*·depthT(z*) + ∑x≠y*,z* fx·depthT(x)
= (fy* + fz*)(1 + depthT'(ω)) + ∑x≠y*,z* fx·depthT(x)
= fω(1 + depthT'(ω)) + ∑x≠y*,z* fx·depthT(x)
= fω + fω·depthT'(ω) + ∑x≠y*,z* fx·depthT'(x)
= fω + ABL(T').

Proving optimality of Huffman's algorithm
Lemma: The Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code.
Proof: By induction on the size of the alphabet.
Let T be the tree produced by Huffman's algorithm and Z an optimum tree, and suppose ABL(Z) < ABL(T).
Let y*, z* be the two lowest-frequency letters. W.l.o.g. the leaves labeled y*, z* are siblings in both T and Z.
Z': delete the leaves labeled y*, z* from Z and label their parent with the new letter ω.
T': delete the leaves labeled y*, z* from T and label their parent with the new letter ω.
Then ABL(T') = ABL(T) - fω and ABL(Z') = ABL(Z) - fω, so ABL(Z') < ABL(T').
This contradicts the optimality (by the induction hypothesis) of T' for S' = S - {y*, z*} ∪ {ω}.

Implementation and Running Time
Main operations per iteration: identify the two lowest-frequency letters and merge them.
Array implementation: O(k) time per iteration with k letters remaining ⇒ O(n²) time overall, where n = |S|.
Priority-queue implementation: O(log k) time per iteration ⇒ O(n log n) time overall.

Extensions
Encode sparse information: a 1000 × 1000 image with very few black pixels can be stored by listing the coordinates of the black pixels explicitly.
Adaptive coding: the frequencies of letters may change over the text, so change the encoding locally, depending on the current frequencies.

Lempel-Ziv Coding
Basis of zip, gzip, compress, etc.
Maintain a dictionary D of some of the patterns seen so far.
Encode the longest possible pattern W by its index in the dictionary, if it exists.
When pattern W is coded, add Wa to D if there is space, where a is the letter that follows W in the text.
LZ encodes variable-length blocks to fixed-length codes; Huffman encodes fixed-length blocks to variable-length codes. (See the sketch below.)
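
A minimal sketch of this dictionary scheme in the style of LZW, assuming the single letters of the alphabet are pre-loaded into the dictionary and ignoring any bound on dictionary size (both are simplifications relative to the description above):

```python
def lz_encode(text, alphabet):
    """Encode text as a list of dictionary indices (an LZW-style sketch)."""
    d = {ch: i for i, ch in enumerate(sorted(alphabet))}  # start with single letters
    out, w = [], ""
    for a in text:
        if w + a in d:                 # grow the current pattern while it is known
            w += a
        else:
            out.append(d[w])           # emit index of the longest known pattern W
            d[w + a] = len(d)          # add Wa to the dictionary
            w = a
    if w:
        out.append(d[w])
    return out

print(lz_encode("abababab", "ab"))     # [0, 1, 2, 4, 1]
```

The decoder can rebuild the same dictionary as it reads the indices, so only the index stream needs to be stored.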

MP3 format
Stands for MPEG (Moving Picture Experts Group) audio layer-3.
Raw audio format (e.g. on a CD):
Sample the signal 44,100 times per second, giving a sequence of real numbers s1, s2, …, sT.
Quantization: approximate each sample by one of 2^B values, i.e. B bits per sample (B = 16 for CD audio).
One sample for each channel; two channels for stereo.
44,100 × 16 × 2 = 1,411,200 bits per second.
MP3: a fixed-length to variable-length encoding.
Further compression by exploiting properties of the human ear:
Some sounds cannot be heard by the ear at all; some sounds are heard much better than others; when two sounds are played simultaneously, we hear only the louder one.

JPEG format
Designed by the Joint Photographic Experts Group.
Uses Huffman coding.

Forward Discrete Cosine Transform
Like the Fourier transform, it separates the image into sub-bands of differing importance.

Sample image
(Figure: the same sample image at the highest and at the lowest compression settings.)