Algorithms CSCI 235, Spring 2019 Lecture 31 Huffman Codes


1 Algorithms CSCI 235, Spring 2019 Lecture 31 Huffman Codes

2 Huffman Codes Huffman codes are a way to compress data.
They are widely used and effective, yielding substantial space savings, depending on the data. A greedy algorithm uses the frequency of occurrence of each character to build an optimal code that represents each character as a binary string.

3 Variable length codes We can save space by assigning frequently occurring characters short codes and infrequently occurring characters long codes. Example (frequencies in thousands):

Character:        a     b     c     d     e     f    Total
Frequency:       45    13    12    16     9     5      100
Codeword:         0   101   100   111  1101  1100

Number of bits = (45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4) * 1000 = 2.24 x 10^5 bits. (This is optimal.)
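The bit count above can be checked with a short Python sketch (frequencies in thousands, as in the table):

```python
# Frequencies (in thousands) and codewords from the table above.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

# Total bits = sum over all characters of frequency * codeword length.
total_bits = sum(freq[ch] * len(code[ch]) for ch in freq) * 1000
print(total_bits)  # 224000, i.e. 2.24 x 10^5 bits
```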

4 Tree for variable length code

                  (100)
                 0/    \1
              a:45     (55)
                      0/   \1
                   (25)    (30)
                  0/  \1  0/   \1
               c:12 b:13 (14)  d:16
                        0/  \1
                      f:5   e:9

Only full binary trees (every non-leaf node has exactly 2 children) represent optimal codes.

5 Size of Tree The tree for an optimal prefix code has 1 leaf for each letter in the alphabet. Let C be the alphabet from which the set of characters is drawn, and let |C| be the size of C (the number of characters in the alphabet). Then the number of leaves in the tree = |C|, and the number of internal nodes in the tree = |C| - 1.

6 Cost of the tree The number of bits to encode a given character is the depth dT(c) of the leaf containing that character, where the depth is the length of the path from root to leaf. Given a tree T corresponding to a prefix code, the number of bits to encode a file is:

B(T) = sum over all c in C of f(c) * dT(c)

where f(c) is the frequency of c and dT(c) is the depth of c. This is the cost of the tree.
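The cost formula can be checked against the earlier tree. A minimal sketch (leaves are (character, frequency) pairs, internal nodes are (left, right) pairs; the representation is illustrative, not from the slides):

```python
# The tree from the earlier slide, built bottom-up.
n14 = (("f", 5), ("e", 9))
n25 = (("c", 12), ("b", 13))
n30 = (n14, ("d", 16))
n55 = (n25, n30)
root = (("a", 45), n55)

def cost(node, depth=0):
    """B(T) = sum of f(c) * dT(c) over all leaf characters c."""
    left, right = node
    if isinstance(left, str):          # leaf: (character, frequency)
        return right * depth
    return cost(left, depth + 1) + cost(right, depth + 1)

print(cost(root))  # 224 (thousand bits), matching the earlier count
```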

7 Greedy Algorithm to create a Huffman code
The greedy algorithm to create an optimal prefix code works as follows: We store the characters in a priority queue Q, implemented as a binary min-heap. (A binary min-heap is a heap with the minimum at the top, in which every child's value is >= its parent's value.) At each stage, the greedy choice is to combine the two nodes with the smallest frequencies into a single new node whose frequency is the sum of the frequencies of the original two nodes. (The new node has pointers to the original two nodes as its left and right children.)

8 Greedy Pseudocode

Huffman(C)
    n = |C|
    Q = C                             // store characters of C in a min-heap Q
    for i = 1 to n - 1
        z = Allocate-Node()           // create an empty node
        z.left = x = Extract-Min(Q)   // node with lowest freq
        z.right = y = Extract-Min(Q)  // next lowest freq
        z.freq = x.freq + y.freq
        Insert(Q, z)                  // insert z in its appropriate place in Q
    return Extract-Min(Q)             // last remaining node in Q is root of tree
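The pseudocode above can be sketched in Python using the standard heapq module as the min-heap (the tie-break counter is an implementation detail to keep heap entries comparable, not part of the pseudocode):

```python
import heapq
from itertools import count

def huffman(freqs):
    """Build a Huffman tree from {char: freq}; returns the root node.
    A node is either a character (leaf) or a (left, right) pair."""
    tiebreak = count()                    # avoids comparing nodes on freq ties
    q = [(f, next(tiebreak), c) for c, f in freqs.items()]
    heapq.heapify(q)                      # build-heap
    for _ in range(len(freqs) - 1):       # n - 1 merges
        fx, _, x = heapq.heappop(q)       # Extract-Min: lowest frequency
        fy, _, y = heapq.heappop(q)       # Extract-Min: next lowest
        heapq.heappush(q, (fx + fy, next(tiebreak), (x, y)))  # Insert
    return q[0][2]                        # last remaining node is the root

def codes(node, prefix=""):
    """Read codewords off the tree: 0 = left, 1 = right."""
    if isinstance(node, str):
        return {node: prefix or "0"}
    left, right = node
    return {**codes(left, prefix + "0"), **codes(right, prefix + "1")}

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
cw = codes(huffman(freqs))
print(sum(freqs[ch] * len(cw[ch]) for ch in freqs))  # 224: the optimal cost
```

The exact codewords may differ from the slide's (ties can break either way), but any Huffman tree for these frequencies has the same optimal cost.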

9 Example Q <- <f:5, e:9, c:12, b:13, d:16, a:45>
We will work this out in class.

10 Running time of Huffman algorithm
Assume Q is implemented as a binary-min-heap. The time to build the heap = ? The for loop is executed n-1 times (once for each node in tree) Time for each extract min? Time for insert? T(n) = ?

11 Showing the greedy choice property
We can show that this algorithm leads to optimal trees if we can show the greedy choice property and the optimal substructure property for this problem. The Greedy Choice Property: Let C be an alphabet for which each character c in C has frequency f(c). Let x and y be two characters in C having the lowest frequencies. We need to show that there exists an optimal prefix code for C in which the code words for x and y have the same length and differ only by the last bit (i.e. x and y are siblings of the same parent in an optimal tree). We will show this in class.

12 Using Huffman codes Decode files by starting at the root and proceeding down the tree according to the bits in the message (0 = left, 1 = right). When a leaf is encountered, output the character at that leaf and restart at the root. Each message has a different tree, so the tree must be saved with the message. Huffman codes are effective for long files, where the savings in the message can offset the cost of storing the tree. They are also effective when the tree can be precomputed and used for a large number of messages (e.g. a tree based on the frequency of occurrence of characters in the English language). Huffman codes are not very good for random files, in which each character occurs with about the same frequency.
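The decoding loop described above, as a minimal sketch (leaves are single characters, internal nodes are (left, right) pairs; the tree is the one from the earlier slide):

```python
def decode(root, bits):
    """Walk from the root: 0 = left, 1 = right; emit the character at
    each leaf reached, then restart at the root."""
    out, node = [], root
    for b in bits:
        node = node[0] if b == "0" else node[1]
        if isinstance(node, str):      # reached a leaf
            out.append(node)
            node = root                # restart at the root
    return "".join(out)

tree = ("a", (("c", "b"), (("f", "e"), "d")))   # tree from the earlier slide
print(decode(tree, "0101100"))  # abc  (0 -> a, 101 -> b, 100 -> c)
```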

