
1 Huffman coding
Content:
1. Encoding and decoding messages (fixed-length coding, variable-length coding)
2. Huffman coding

2 Fixed-length coding
Problem: Consider a message containing only characters from the alphabet {'a', 'b', 'c', 'd'}. The 8-bit ASCII representation of the characters is wasteful here: we have only 4 characters, yet each is coded using 8 bits. A code that uses only 2 bits per character is enough to store any message built from these 4 characters.
Fixed-length coding: the code of each character (or symbol) has the same number of bits.
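As a minimal sketch, a 2-bit fixed-length encoding of such a message could look like the following (the particular code table is an illustrative assumption, not taken from the slides):

```python
# Hypothetical 2-bit fixed-length code for the alphabet {a, b, c, d}.
FIXED_CODE = {'a': '00', 'b': '01', 'c': '10', 'd': '11'}

def encode_fixed(message):
    """Encode a message by concatenating the 2-bit code of each character."""
    return ''.join(FIXED_CODE[ch] for ch in message)

print(encode_fixed('addd'))  # '00111111': 8 bits instead of 32 with ASCII
```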

3 Question: How many bits do we need to uniquely encode each character of a message over an n-letter alphabet?
Answer: at least $\lceil \log_2 n \rceil$ bits.
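A quick sketch of this formula (the function name is mine):

```python
import math

def bits_needed(n):
    """Minimum bits per symbol for a fixed-length code over an n-letter alphabet."""
    return math.ceil(math.log2(n))

print(bits_needed(4))   # 2
print(bits_needed(7))   # 3 (the 7-letter alphabet used later in these slides)
print(bits_needed(26))  # 5
```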

4 Variable-length coding
Each character is assigned a code of possibly different length. Problem when decoding: with some encoding schemes, the same bit string can be decoded as more than one message:
Message = aaabcabc..., which is correct, or
Message = bcbcbcbc..., which is incorrect.

5 Consider now the following encoding scheme: when decoding, only one message is possible:
Message = aaabcabc..., which is the correct message.

6 Prefix-free codes
What is a prefix?
00 is a prefix of 001.
110 is a prefix of 110111.
111101 is a prefix of 111101001.
A prefix-free code is an encoding scheme in which the code of a character is not a prefix of the code of any other character:
Encoding scheme 1 in the previous example is not a prefix-free code.
Encoding scheme 2 in the previous example is a prefix-free code.
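A small sketch of one way to test this property (the two example code tables are illustrative; they are not the schemes from the slides, whose tables are not reproduced in this transcript):

```python
def is_prefix_free(codes):
    """Return True if no code word is a prefix of another code word."""
    words = sorted(codes.values())  # a prefix sorts immediately before its extensions
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free({'a': '0', 'b': '00', 'c': '01'}))  # False: '0' prefixes '00' and '01'
print(is_prefix_free({'a': '0', 'b': '10', 'c': '11'}))  # True
```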

7 Prefix-free codes: advantage
A prefix-free code allows the message to be decoded uniquely: the encoded bits represent only one possible message. This is the case in encoding scheme 2.
Variable-length prefix-free code vs. fixed-length code: compared to a fixed-length code, a variable-length prefix-free code yields shorter encodings of messages.
Example: consider Message = addd over the alphabet {'a', 'b', 'c', 'd'}: the 2-bit fixed-length code needs 8 bits, while a variable-length code that gives the frequent character 'd' a shorter code uses fewer.

8 Huffman coding
Objective: Huffman coding is an algorithm used for lossless data compression. Lossless means that the exact original data can be recovered by decoding the compressed data.
Applications: several data compression programs (WinZip, zip, gzip, ...) use lossless data encoding.

9 Basic idea
Each symbol in the original data to be compressed (for example, a character in a file) is assigned a code. The length of the code varies from one symbol to another (variable-length coding) and depends on the symbol's frequency, i.e., the number of times the symbol appears in the original data:
Symbols with high frequency (appearing more often than others) are assigned shorter codes.
Symbols with low frequency (appearing less often than others) are assigned longer codes.

10 Consider the alphabet {'a', 'b', 'c', 'd', 'e', 'f', 'g'}. Count the frequency of each character in the message (the number of times the character appears). For example:

character:  a   b   c   d   e   f   g
frequency:  1   3   4  10  13  12  15

meaning that 'a' appears only once in the message, 'b' appears 3 times, and so on.

11 From the frequency table, build a forest of binary trees. Initially, each tree in the forest contains only a root corresponding to a character of the alphabet and its frequency (which we will call its weight). Then, repeatedly apply the following rule: merge the two trees with the smallest weights, label the left edge from the root of the merged tree 0 and the right edge 1, and set the weight of the merged tree's root to the sum of the weights of its left and right children.
Remark: the algorithm is non-deterministic, as no rule specifies which trees to pick in case of equal weights.
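A minimal sketch of this procedure in Python, using a min-heap as the forest (the insertion counter is my addition to make tie-breaking deterministic; as the slide notes, the abstract algorithm leaves ties unspecified):

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build a Huffman tree from a {symbol: frequency} table; return {symbol: code}."""
    tiebreak = count()  # breaks ties between equal weights (unspecified in the slides)
    heap = [(weight, next(tiebreak), sym) for sym, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two trees with the smallest weights; the merged root's
        # weight is the sum of the weights of its two children.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w0 + w1, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):    # internal node: (left_child, right_child)
            walk(node[0], path + '0')  # label left edges 0
            walk(node[1], path + '1')  # label right edges 1
        else:                          # leaf: a symbol
            codes[node] = path or '0'
    walk(heap[0][2], '')
    return codes

# Frequencies from slide 10:
freq = {'a': 1, 'b': 3, 'c': 4, 'd': 10, 'e': 13, 'f': 12, 'g': 15}
print(huffman_codes(freq))  # code lengths: a:5, b:5, c:4, d:3, e:2, f:2, g:2
```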

12–17 [Figures: the trees in the forest are merged step by step into a single Huffman tree.]

18 The code of each character is obtained by concatenating the labels of the edges on the path from the root to the leaf representing the character. Let $f_i$ be the frequency of a character and $d_i$ the number of bits in the code of that character. The total number of bits required to encode the message is

$M = \sum_{i=1}^{n} f_i \, d_i = 5\cdot1 + 5\cdot3 + 4\cdot4 + 3\cdot10 + 2\cdot13 + 2\cdot12 + 2\cdot15 = 146$ bits.

We need 146 bits to encode the message with the given frequency table.
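The arithmetic can be checked directly (a sketch using the slide's frequencies and code lengths):

```python
freqs = [1, 3, 4, 10, 13, 12, 15]  # f_i for a..g (slide 10)
lengths = [5, 5, 4, 3, 2, 2, 2]    # d_i read off the Huffman tree
print(sum(f * d for f, d in zip(freqs, lengths)))  # 146
```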

19 The average code word length:

$\bar{d} = \dfrac{\sum_{i=1}^{n} f_i \, d_i}{\sum_{i=1}^{n} f_i}$,

where $n$ is the number of leaves (number of symbols) in the binary tree. Writing $p_i = f_i / \sum_{j=1}^{n} f_j$ for the probability of occurrence of the $i$-th symbol, we can also write

$\bar{d} = \sum_{i=1}^{n} p_i \, d_i$.
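A sketch showing that the two forms of the formula agree on the running example:

```python
freqs = [1, 3, 4, 10, 13, 12, 15]   # f_i (slide 10)
lengths = [5, 5, 4, 3, 2, 2, 2]     # d_i (slide 18)
total = sum(freqs)                  # 58 characters in the message
avg = sum(f * d for f, d in zip(freqs, lengths)) / total
probs = [f / total for f in freqs]  # p_i = f_i / sum_j f_j
assert abs(avg - sum(p * d for p, d in zip(probs, lengths))) < 1e-9
print(avg)  # 146 / 58 = 2.517... bits per symbol
```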

20 Link with information theory
The quantity of information carried by a message is called entropy. The entropy is defined by

$E = -\sum_{i=1}^{n} P(C_i) \log_2 P(C_i)$,

where $n$ is the size of the alphabet, $C_i$ is character $i$, and $P(C_i)$ is its associated probability. It can be interpreted as the average optimal (minimal) length of a message over a given alphabet with associated probabilities, and it can be compared with the average code word length obtained with Huffman coding. Huffman coding is near-optimal.
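A minimal sketch of the entropy formula (the function name is mine):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: E = -sum over i of P(C_i) * log2(P(C_i))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0: four equally likely symbols need 2 bits each
```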

21 Example: we have a message using characters a, b, c, d, and e with associated probabilities .39, .21, .19, .12, and .09.
We can compute the associated entropy (each factor is $-\log_2$ of the corresponding probability):
E = .39 · 1.358 + .21 · 2.252 + .19 · 2.396 + .12 · 3.059 + .09 · 3.474 ≈ 2.14
Huffman coding for this situation gives code lengths for a, b, c, d, and e of 2, 2, 2, 3, and 3, respectively. We can then compute the associated average code word length:
$\bar{d}$ = .39 · 2 + .21 · 2 + .19 · 2 + .12 · 3 + .09 · 3 = 2.21
The obtained code is almost as compact as the optimal one.
Remark: many different encoding schemes can be obtained from the Huffman tree by exchanging edge labels at the same depth.
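These numbers can be reproduced as follows (a sketch):

```python
import math

probs = [0.39, 0.21, 0.19, 0.12, 0.09]  # probabilities of a..e
lengths = [2, 2, 2, 3, 3]               # Huffman code lengths for a..e

E = -sum(p * math.log2(p) for p in probs)         # entropy
avg = sum(p * d for p, d in zip(probs, lengths))  # average code word length
print(round(E, 2), round(avg, 2))  # 2.14 2.21 -> within about 0.07 bit of optimal
```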

