Bahareh Sarrafzadeh 6111 Fall 2009

Presentation on theme: "Bahareh Sarrafzadeh 6111 Fall 2009"— Presentation transcript:

1 Huffman Codes
Bahareh Sarrafzadeh, 6111, Fall 2009

2 Overview
What are Huffman codes, and how can they be helpful?
Fixed-length codes vs. variable-length codes
Encoding vs. decoding
Prefix codes
How to construct a Huffman code? Greedy works!
Problem definition
Proof of correctness
Huffman codes and entropy

3 Huffman Codes - Intro
Huffman codes are a widely used and very effective technique for compressing data: savings of 20% to 90% are typical, depending on the characteristics of the data being compressed.
Proposed by David Huffman, 1952.
Huffman's greedy algorithm uses a table of the frequencies of occurrence of the characters to build up an optimal way of representing each character as a binary string.

4 Fixed-length vs. variable-length

Character:                   a    b    c    d    e     f
Frequency (in thousands):    45   13   12   16   9     5
Fixed-length codeword:       000  001  010  011  100   101
Variable-length codeword:    0    101  100  111  1101  1100
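The difference between the two codes in the table can be made concrete with a short calculation (a sketch, assuming a 100,000-character file matching the frequency table):

```python
# Comparing total encoded size under the two codes from the table.
# Frequencies are in thousands of occurrences, per the slide.
freq = {'a': 45_000, 'b': 13_000, 'c': 12_000,
        'd': 16_000, 'e': 9_000, 'f': 5_000}

fixed = {'a': '000', 'b': '001', 'c': '010',
         'd': '011', 'e': '100', 'f': '101'}
variable = {'a': '0', 'b': '101', 'c': '100',
            'd': '111', 'e': '1101', 'f': '1100'}

def total_bits(code, freq):
    """Total number of bits needed to encode the whole file."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)

print(total_bits(fixed, freq))     # 300000
print(total_bits(variable, freq))  # 224000
```

The variable-length code saves about 25% on this file, which is where the "savings of 20% to 90%" claim comes from.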

5 An Example
Each leaf is labeled with a character and its frequency of occurrence; each internal node is labeled with the sum of the frequencies of the leaves in its subtree.
(a) The tree corresponding to the fixed-length code a = 000, ..., f = 101.
(b) The tree corresponding to the optimal prefix code a = 0, b = 101, ..., f = 1100.
[Figure: the two code trees, with internal-node sums such as 100, 86, 58, 28, 55, 30, 25, and 14 over the leaves a:45, b:13, c:12, d:16, e:9, f:5.]

6 Encoding vs. Decoding
Encoding is always simple for any binary character code: we just concatenate the codewords representing each character of the file.
With variable-length codes, however, we may confront difficulties during decoding.
[Figure: the word TENNIS encoded under two codes. In the first (T = 11, N = 100, I = 1010, S = 1011, ...), each codeword boundary can be identified unambiguously. In the second (T = 10, N = 100, I = 0111, S = 1010, ...), T's codeword is a prefix of N's and S's, so the same bit string can be decoded in more than one way.]

7 Prefix Codes
A symbol code is called a prefix code if no codeword is a prefix of any other codeword.
Prefix codes are desirable because they simplify decoding: since no codeword is a prefix of any other, we can identify the end of a codeword as soon as it arrives, and the codeword that begins an encoded file is unambiguous.
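The prefix property is easy to check mechanically; a minimal sketch (function name is illustrative):

```python
def is_prefix_code(codewords):
    """True iff no codeword is a prefix of another, so decoding is unambiguous."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

# The variable-length code from the earlier table is a prefix code:
print(is_prefix_code(['0', '101', '100', '111', '1101', '1100']))  # True
# A code where 10 is a prefix of 100 and 1010 is not:
print(is_prefix_code(['10', '100', '0111', '1010']))  # False
```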

8 A Convenient Data Structure
The decoding process needs a convenient representation for the prefix code so that the initial codeword can be easily picked off. A binary tree whose leaves are the given characters provides one such representation:
Leaves: characters
Paths: codewords
We interpret the binary codeword for a character as the path from the root to that character, where 0 means "go to the left child" and 1 means "go to the right child."
Note that these are not binary search trees: the leaves need not appear in sorted order, and internal nodes do not contain character keys.
[Figure: a code tree with leaves A, B, C, D, E, F.]
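The tree-walking decoder described above can be sketched in Python (a nested-dict tree is an illustrative choice, not the slides' implementation; the input is assumed to be a valid encoding):

```python
# Decoding with the binary-tree representation of a prefix code:
# 0 means "go to the left child", 1 means "go to the right child";
# characters sit at the leaves.

def build_tree(code):
    """Build a nested-dict binary tree from a {char: codeword} map."""
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = ch  # leaf holds the character
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf: codeword complete
            out.append(node)
            node = root
    return ''.join(out)

code = {'a': '0', 'b': '101', 'c': '100',
        'd': '111', 'e': '1101', 'f': '1100'}
tree = build_tree(code)
print(decode('0101100', tree))  # 'abc'  (0 | 101 | 100)
```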

9 Optimal Code
We saw that a binary tree can represent a set of codes assigned to the input characters. What about an optimal code?
An optimal code for a file is always represented by a full binary tree, in which every non-leaf node has two children.
The fixed-length code in our example is not optimal, since its tree is not a full binary tree: there are codewords beginning 10..., but none beginning 11....
Since we can restrict our attention to full binary trees: if C is the alphabet from which the characters are drawn and all character frequencies are positive, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter of the alphabet, and exactly |C| - 1 internal nodes.

10 Cost of a Tree
Given a tree T corresponding to a prefix code, it is a simple matter to compute the number of bits required to encode a file. For each character c in the alphabet C, let f(c) denote the frequency of c in the file and let d_T(c) denote the depth of c's leaf in the tree; note that d_T(c) is also the length of the codeword for character c. The number of bits required to encode the file is thus

B(T) = Σ_{c ∈ C} f(c) · d_T(c)
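This cost formula is a one-line sum; a sketch using the depths from the optimal tree of the running example (depths taken from the codeword lengths seen earlier):

```python
# B(T) = sum over c in C of f(c) * d_T(c)
f = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}   # in thousands
depth = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}   # leaf depths

cost = sum(f[c] * depth[c] for c in f)
print(cost)  # 224 (thousand bits)
```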

11 Greedy Algorithm - Overview
This algorithm builds a binary tree in a bottom-up manner, repeating two steps:
1. Take the two least probable symbols in the alphabet.
2. Combine these two symbols into a single symbol, and repeat.
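One iteration of the two steps above can be sketched with Python's heapq (weights are the running example's frequencies; the tuple payloads are an illustrative representation):

```python
import heapq

# One greedy step: pop the two least probable symbols and
# merge them into a single combined symbol.
heap = [(5, 'f'), (9, 'e'), (12, 'c'), (13, 'b'), (16, 'd'), (45, 'a')]
heapq.heapify(heap)

w1, s1 = heapq.heappop(heap)   # (5, 'f')
w2, s2 = heapq.heappop(heap)   # (9, 'e')
heapq.heappush(heap, (w1 + w2, (s1, s2)))  # combined symbol, weight 14

print(heap[0])  # (12, 'c') is now the least probable
```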

12 An example
[Figure: the six leaves a:45, b:13, c:12, d:16, e:9, f:5; the algorithm first merges f:5 and e:9 into a node of frequency 14.]

13 [Figure: next, c:12 and b:13 are merged into a node of frequency 25.]

14 [Figure: the 14-node and d:16 are merged into a node of frequency 30.]

15 [Figure: the 25-node and the 30-node are merged into a node of frequency 55.]

16 [Figure: finally, a:45 and the 55-node are merged into the root of frequency 100, completing the Huffman tree.]

17 Specifications
Preconditions: We have an alphabet C of size n (a set of characters), and we can derive the frequency table f[c] for its characters.
Postconditions: We have a full binary tree which corresponds to the list of prefix codes assigned to the characters in C, such that the total cost (the total length of the codes) is minimum.
Greedy Choice: Take the two nodes in the tree with the least frequencies and merge them. This is an adaptive choice: each merge changes the set of nodes available to later choices.

18 Specifications - Cont.
Loop Invariant: We have not gone wrong: if there is a solution, then there is at least one optimal solution consistent with the choices made so far by the algorithm. Concretely, with the characters sorted in ascending order of frequency, the partial binary tree built so far is consistent with an optimal tree in which:
the sum of the leaves' depths weighted by frequency (i.e. the total code length) is minimum,
no codeword is a prefix of any other, and
characters with higher frequencies sit at smaller depths.
Establishing the LI (Pre → LI): Initially no choices have been made, so all optimal solutions are consistent with these (empty) choices.
Maintaining the LI (LI + Code → LI'): We build the tree T corresponding to the optimal code in a bottom-up manner: we begin with a set of |C| leaves and perform a sequence of |C| - 1 merging operations to create the final tree.

19 Maintaining the LI
optS_LI: The loop invariant states that there is at least one optimal solution consistent with the choices made by the algorithm before this iteration; let optS_LI denote one such solution. (This is a full solution, and as such it specifies a decision about each object in the instance.)
Taking a step: During the iteration, the algorithm chooses the "best" object from amongst those not considered so far and makes an irrevocable decision about it.
Instructions for modifying optS_LI: The prover gives detailed instructions on how the fairy godmother should modify optS_LI; we use optS_ours to denote what she constructs.

20 Instructions for the Fairy Godmother!
In the optimal tree T, leaves a and b are two of the deepest leaves and are siblings. Leaves x and y are the two leaves that Huffman's algorithm merges together first; they appear in arbitrary positions in T.
First, leaves a and x are swapped to obtain tree T′; then, leaves b and y are swapped to obtain tree T″. Since neither swap increases the cost, the resulting tree T″ is also an optimal tree.
[Figure: trees T, T′, and T″, showing the two swaps.]

21 Proof of Correctness: 1. Validity
Proving optS_ours is a valid solution: By the loop invariant, optS_LI is a valid solution (i.e. its decisions do not conflict in any way). Our modifications did not make it invalid: we only changed the structure of the tree, so we still have a tree, and any conflicts the changes may have introduced were fixed.
The three obligations are:
Validity: we still have a full binary tree.
Consistency: the subtree built from the first (i - 1) characters is consistent with the optimal solution.
Optimality: the subtree corresponds to an optimal solution for characters 0 to i - 1: the total code length is minimum, no codeword is a prefix of any other, and characters with higher frequencies sit at smaller depths.

22 2. Consistency
Proving optS_ours is consistent with the algorithm: By the loop invariant, optS_LI is consistent with all the decisions made by the algorithm before this iteration. The prover made sure that the modifications did not change any of these decisions, and changed optS_LI's decision about the latest object to be consistent with what the algorithm did. Hence optS_ours is consistent both with the earlier decisions made by the algorithm and with this most recent decision.
Lemma: Let C be an alphabet in which each character c ∈ C has frequency f[c], and let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.
Proof idea: Take the tree T representing an arbitrary optimal prefix code and modify it into a tree representing another optimal prefix code in which x and y appear as sibling leaves of maximum depth; then their codewords have the same length and differ only in the last bit. In T, let a and b be two sibling leaves of maximum depth. Swap a with x to obtain T′, then swap b with y to obtain T″. Since each swap does not increase the cost, T″ is also an optimal tree.
[Figure: trees T, T′, and T″, showing the swaps.]

23 3. Optimality
We need to prove that Cost(T′) is not more than Cost(T).
[Figure: tree T with leaves x, y, a, b.]
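The cost comparison reduces to a small algebraic identity: swapping a low-frequency leaf x with a deeper, higher-frequency leaf a changes the cost by (f_a - f_x)(d_a - d_x) ≥ 0 in our favor. A numeric sketch (the function and the sample values are illustrative):

```python
# Why a swap in the exchange argument cannot increase the cost.
# x has the lowest frequency and a is a deepest leaf,
# so f_x <= f_a and d_x <= d_a.
def cost_change_after_swap(f_x, d_x, f_a, d_a):
    """Cost(T) - Cost(T') when leaves x and a trade depths."""
    before = f_x * d_x + f_a * d_a
    after = f_x * d_a + f_a * d_x
    return before - after  # equals (f_a - f_x) * (d_a - d_x) >= 0

print(cost_change_after_swap(f_x=5, d_x=2, f_a=13, d_a=4))  # 16
```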

24 Running Time
Q: a binary min-heap.

 1  Huffman(C)
 2      n ← |C|
 3      Q ← C                              ▹ build the heap: O(n)
 4      for i = 1 to n - 1 do
 5          allocate a new node z
 6          left[z] ← x ← Extract-Min(Q)   ▹ O(lg n)
 7          right[z] ← y ← Extract-Min(Q)  ▹ O(lg n)
 8          f[z] ← f[x] + f[y]
 9          Insert(Q, z)                   ▹ O(lg n)
10      return Extract-Min(Q)

Running Time: O(n lg n)
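A runnable Python version of this pseudocode, using heapq as the binary min-heap (an illustrative sketch, not code from the slides; the tiebreak counter is an implementation detail to avoid comparing payloads when frequencies tie):

```python
import heapq
from itertools import count

def huffman(freq):
    """Build the Huffman tree; internal nodes are (left, right) pairs,
    leaves are the character strings themselves."""
    tiebreak = count()
    q = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(q)                       # Q <- C, O(n)
    for _ in range(len(freq) - 1):         # n - 1 merges
        fx, _, x = heapq.heappop(q)        # x <- Extract-Min(Q), O(lg n)
        fy, _, y = heapq.heappop(q)        # y <- Extract-Min(Q), O(lg n)
        heapq.heappush(q, (fx + fy, next(tiebreak), (x, y)))  # Insert(Q, z)
    return q[0][2]                         # return Extract-Min(Q)

def codes(tree, prefix=''):
    """Read codewords off the tree: left = 0, right = 1."""
    if isinstance(tree, str):
        return {tree: prefix or '0'}
    left, right = tree
    return {**codes(left, prefix + '0'), **codes(right, prefix + '1')}

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
cw = codes(huffman(freq))
print(cw['a'])       # the most frequent symbol gets the shortest codeword
print(len(cw['a']))  # 1
```

On the running example this reproduces the optimal total cost of 224 (thousand bits).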

25 Conclusion
Huffman coding: introduction and application
Greedy algorithm and proof of correctness
Entropy

26

27 Huffman's Code and Entropy
As defined by Shannon, the information content h (in bits) of a symbol ci with non-null probability pi is h(ci) = -log2(pi).
The entropy H (in bits) is the weighted sum, across all symbols ci with non-zero probability pi, of the information content of each symbol: H = -Σi pi log2(pi).
The entropy H(X) is a lower bound on the expected codeword length, and there is no better symbol code for a source than the Huffman code.
Constructing a binary tree top-down, by contrast, is suboptimal.

28 Huffman Code Optimality

Input (C, f):
Symbol (ci):                       a      b      c      d      e      Sum
Probability (pi):                  0.10   0.15   0.30   0.16   0.29   = 1

Huffman code:
Codeword (cwi):                    000    001    10     01     11
Codeword length in bits (li):      3      3      2      2      2
Cost (li · pi):                    0.30   0.45   0.60   0.32   0.58   L(C) = 2.25

Optimality:
Probability budget (2^-li):        1/8    1/8    1/4    1/4    1/4    = 1.00
Information content (-log2 pi):    3.32   2.74   1.74   2.64   1.79
Entropy (-pi log2 pi):             0.332  0.411  0.521  0.423  0.518  H(A) = 2.205
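The table's two summary figures can be recomputed directly from the probabilities and codeword lengths (a verification sketch):

```python
import math

# Verifying the table: expected codeword length L(C) and entropy H(A).
p = {'a': 0.10, 'b': 0.15, 'c': 0.30, 'd': 0.16, 'e': 0.29}
length = {'a': 3, 'b': 3, 'c': 2, 'd': 2, 'e': 2}

L = sum(p[s] * length[s] for s in p)            # expected codeword length
H = sum(-p[s] * math.log2(p[s]) for s in p)     # entropy

print(round(L, 2))   # 2.25
print(round(H, 3))   # 2.205
```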

29 Huffman reaches the entropy limit when all probabilities are negative powers of 2
i.e., 1/2, 1/4, 1/8, 1/16, etc.
In general, H ≤ Code Length < H + 1.
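The dyadic case is easy to check numerically: when every probability is a negative power of 2, the optimal codeword lengths are exactly -log2(p), so the expected length meets the entropy bound (a sketch with an illustrative distribution):

```python
import math

# Probabilities that are negative powers of 2: Huffman hits the entropy.
p = [1/2, 1/4, 1/8, 1/8]
lengths = [int(-math.log2(x)) for x in p]       # [1, 2, 3, 3]

L = sum(pi * li for pi, li in zip(p, lengths))  # expected length
H = sum(-pi * math.log2(pi) for pi in p)        # entropy

print(L, H)  # both equal 1.75, so H <= L < H + 1 holds with equality
```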

