
Presentation on theme: "Data Compression If you’ve ever sent a large file to a friend, you may have compressed it into a zip archive like the one on this slide before doing so."— Presentation transcript:

1 Data Compression If you’ve ever sent a large file to a friend, you may have compressed it into a zip archive like the one on this slide before doing so. Basically, file compression relies on representing data with fewer bits than the standard encoding would use.
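You can see the effect directly with Python’s standard-library `zlib` module, which implements DEFLATE (a scheme that combines Huffman coding with other techniques). This is a minimal sketch; the sample data is made up for illustration:

```python
import zlib

# Highly repetitive data gives the compressor a lot to exploit
# (the sample bytes here are made up for illustration).
data = b"ABRACADABRA " * 100
compressed = zlib.compress(data)

# The compressed form is far smaller, and decompression restores
# the original bytes exactly (lossless compression).
print(len(data), len(compressed))
```

Running this shows the 1200-byte input shrinking to a few dozen bytes, precisely because repeated content can be described with fewer bits.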

2 Huffman Example

  character   ASCII (fixed, 8 bits)
  A           01000001
  B           01000010
  C           01000011
  D           01000100

  Variable-length example:  A → 01
There are many ways to compress data, but we’ll focus here on a method called Huffman Coding. ASCII is a fixed-length encoding scheme in which files are decoded by looking at file blocks of constant length (8 bits). Huffman encodings, on the other hand, are variable-length encodings, meaning that each character can have a representation of a different length. If we can somehow figure out which characters occur most often, and encode them with fewer than 8 bits, we will be able to use less space than ASCII. We will then have to encode some of the less frequently occurring characters with more than 8 bits, but that’s okay as long as the net effect is a decrease in the total number of bits used!
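A quick back-of-the-envelope sketch of this trade-off (the sample string and the variable-length code here are made-up illustrations, not the slides’ data):

```python
# Fixed-length (8 bits per character) vs. a hypothetical
# variable-length, prefix-free code that favors the frequent 'A'.
text = "AAAAAABBC"                        # made-up sample: 'A' dominates
var_code = {"A": "0", "B": "10", "C": "11"}

fixed_bits = 8 * len(text)                # 9 chars * 8 bits = 72
var_bits = sum(len(var_code[ch]) for ch in text)   # 6*1 + 2*2 + 1*2 = 12
print(fixed_bits, var_bits)               # prints 72 12
```

The rare characters B and C cost 2 bits each instead of 1, but the frequent A more than makes up for it, so the total shrinks.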

3 Prefix Property

  Ambiguous coding (left)    Prefix-free coding (right)
  A  0                       A  0
  B  1                       B  10
  C  01                      C  110
  D  11                      D  111
An important property of Huffman coding is that no bit representation for any of the characters is a prefix of any other character’s representation. This is called the prefix property, and an encoding scheme with the prefix property is said to be immediately decodable. Why is the prefix property so important? Consider the following example: If we use the coding on the left to encode “CAB”, we would get the bit-string 0101. However, presented with the bit-string 0101, there is no way to tell whether it represents “ABAB”, “CAB”, “ABC”, or “CC”. In contrast, the coding on the right is unambiguous. The bit-string “110010” can only represent “CAB” (confirm it for yourself!).
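A small sketch of why immediate decodability works, using the prefix-free coding on the right (the helper name `decode` is mine):

```python
# The prefix-free coding from the right-hand table on the slide.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
decode_map = {bits: ch for ch, bits in code.items()}

def decode(bitstring):
    """Read bits left to right; the prefix property guarantees that
    the first codeword we complete is the only possible match."""
    result, current = [], ""
    for bit in bitstring:
        current += bit
        if current in decode_map:
            result.append(decode_map[current])
            current = ""
    return "".join(result)

print(decode("110010"))  # prints CAB
```

Because no codeword is a prefix of another, the decoder never has to backtrack or guess: the moment the accumulated bits match a codeword, that character is the only possible reading.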

4 Step 1: Frequency Analysis
String: “ECEABEADCAEDEEEECEADEEEEEDBAAEABDBBAAEAAACDDCCEABEEDCBEEDEAEEEEEAEEDBCEBEEADEAEEDAEBCDEDEAEEDCEEAEEE” We will use a binary tree to construct Huffman encodings for each character. First, however, we must figure out how often each character occurs in the file by building a frequency table. Test Yourself! After Huffman encoding, which of the letters in this string will be represented by the shortest bit-string?

  character   A     B     C     D     E
  frequency   0.2   0.1   0.1   0.15  0.45
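Building the frequency table is a one-liner with Python’s `collections.Counter`; here it is applied to the slide’s example string:

```python
from collections import Counter

# The 100-character example string from the slide.
s = ("ECEABEADCAEDEEEECEADEEEEEDBAAEABDBBAAEAAACDDCCEABE"
     "EDCBEEDEAEEEEEAEEDBCEBEEADEAEEDAEBCDEDEAEEDCEEAEEE")

counts = Counter(s)                                   # absolute counts
freq = {ch: counts[ch] / len(s) for ch in sorted(counts)}
print(freq)
# prints {'A': 0.2, 'B': 0.1, 'C': 0.1, 'D': 0.15, 'E': 0.45}
```

The counts confirm the table: E is by far the most frequent character, so it is the one Huffman coding will reward with the shortest bit-string.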

5 Step 2: Forest of Single Trees
Next, we’ll create a forest of trees, where each tree is a single node containing a character and the frequency of that character. [Diagram: five single-node trees — B 0.1, C 0.1, D 0.15, A 0.2, E 0.45.]

6 Step 3: Combine Trees

Then, we combine trees until the forest contains only one tree. While there’s still more than one tree:
1. Remove the two trees from the forest that have the smallest frequencies in their roots.
2. Create a new node to be the root of a new tree, with the two trees from step 1 as its children. The frequency of this parent node is set to the sum of the frequencies of its two children.
3. Insert this new tree back into the forest.
[Diagram: B (0.1) and C (0.1) are merged under a new parent with frequency 0.2; D (0.15), A (0.2), and E (0.45) remain as single trees.]

7 Step 3: Combine Trees

The two smallest roots are now D (0.15) and the new 0.2 node; we remove them and merge them under a new parent with frequency 0.35. [Diagram: the forest now holds the 0.35 tree, A (0.2), and E (0.45).]

8 Step 3: Combine Trees

The two smallest roots are now A (0.2) and the 0.35 node; we merge them under a new parent with frequency 0.55. [Diagram: the forest now holds the 0.55 tree and E (0.45).]

9 Step 3: Combine Trees

Finally, we merge the 0.55 node with E (0.45) under a root with frequency 1.0. Nice! We now only have one tree! Notice that all the frequencies add up to a total of 1.0, as they must, since every character in the file is accounted for exactly once.
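The combining loop above maps naturally onto a min-heap, which always hands back the two smallest roots. Here is a sketch using Python’s `heapq`; note that ties between equal frequencies may merge in a different order than the slides show, which is fine — any tie-breaking yields an optimal code:

```python
import heapq

# The "forest": heap entries are (frequency, tiebreaker, tree), where a
# tree is either a character (leaf) or a (left, right) pair of subtrees.
# The integer tiebreaker keeps heapq from ever comparing trees directly.
freqs = {"A": 0.2, "B": 0.1, "C": 0.1, "D": 0.15, "E": 0.45}
forest = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freqs.items()))]
heapq.heapify(forest)

tiebreak = len(forest)
while len(forest) > 1:
    # 1. Remove the two trees with the smallest root frequencies.
    f1, _, t1 = heapq.heappop(forest)
    f2, _, t2 = heapq.heappop(forest)
    # 2. Make them children of a new root whose frequency is their sum,
    # 3. and insert the combined tree back into the forest.
    heapq.heappush(forest, (f1 + f2, tiebreak, (t1, t2)))
    tiebreak += 1

root_freq, _, tree = forest[0]
print(round(root_freq, 2))  # prints 1.0 — all frequencies summed
```

Each iteration shrinks the forest by one tree, so after four merges only the single Huffman tree remains.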

10 Step 4: Traverse Tree

To convert each letter into its Huffman encoding, we traverse the tree from the root, recording a 0 for each step left and a 1 for each step right. For the letter “E,” we step right just once, so “E”’s encoding is 1. For the letter “A,” we step left once and then right once, so “A”’s encoding is 01. “D” is 001. What are the Huffman representations of “C” and “B”?
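The traversal can be sketched as a recursive walk over the finished tree. The nested-tuple tree below hand-encodes the tree built on the previous slides; the left/right ordering of B and C within their parent is an assumption on my part (either order gives an equally valid code). Running it also answers the question above:

```python
def assign_codes(node, prefix="", codes=None):
    """Walk the tree, appending '0' for a left step and '1' for a
    right step; at each leaf, the accumulated path is the code."""
    if codes is None:
        codes = {}
    if isinstance(node, str):       # leaf: a single character
        codes[node] = prefix
        return codes
    left, right = node              # internal node: (left, right) pair
    assign_codes(left, prefix + "0", codes)
    assign_codes(right, prefix + "1", codes)
    return codes

# The slides' final tree: root = (0.55 subtree, E);
# 0.55 = (0.35 subtree, A); 0.35 = ((B, C) subtree, D).
tree = (((("B", "C"), "D"), "A"), "E")
print(assign_codes(tree))
# prints {'B': '0000', 'C': '0001', 'D': '001', 'A': '01', 'E': '1'}
```

Note how the most frequent character, E, gets the shortest code (1 bit) while the rarest, B and C, get the longest (4 bits) — exactly the trade-off that makes the encoded file smaller overall.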

