Chapter 3: Compression (Part 1)
Storage size for media types
A typical encyclopedia
50,000 pages of text (2 KB/page): ≈ 0.1 GB
3,000 color pictures (on average 640×480×24 bits ≈ 1 MB/picture): ≈ 3 GB
500 maps/drawings (on average 640×480×16 bits ≈ 0.6 MB/map): ≈ 0.3 GB
60 minutes of stereo sound (176 KB/s): ≈ 0.6 GB
30 animations, on average 2 minutes in duration (640×320 pixels × 16 bits/frame, 6.5 MB/s): ≈ 23.4 GB
50 digitized movies, on average 1 min. in duration (640×480, 30 frames/sec, 27.6 MB/s): ≈ 82.8 GB
TOTAL: ≈ 110 GB (or 170 CDs, or about 2 feet high of CDs)
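These figures can be sanity-checked with a few lines of arithmetic; a rough sketch using decimal units (the GB column above is computed the same way):

```python
# Rough storage estimate for the encyclopedia example (per-item sizes from the slide).
MB = 1_000_000          # decimal megabytes, good enough for a back-of-the-envelope estimate
GB = 1_000 * MB

items = {
    "text":       50_000 * 2_000,            # 50,000 pages x 2 KB/page
    "pictures":   3_000 * 1 * MB,            # 3,000 pictures x ~1 MB each
    "maps":       500 * 0.6 * MB,            # 500 maps x ~0.6 MB each
    "sound":      60 * 60 * 176_000,         # 60 min of stereo at 176 KB/s
    "animations": 30 * 2 * 60 * 6.5 * MB,    # 30 clips x 2 min x 6.5 MB/s
    "movies":     50 * 1 * 60 * 27.6 * MB,   # 50 clips x 1 min x 27.6 MB/s
}

for name, size in items.items():
    print(f"{name:10s} {size / GB:6.1f} GB")
total = sum(items.values())
print(f"total      {total / GB:6.1f} GB  (~{total / (650 * MB):.0f} CDs of 650 MB)")
```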
Compressing an encyclopedia
Compression and encoding
We can view compression as an encoding method which transforms a sequence of bits/bytes into an artificial format such that the output (in general) takes less space than the input.
[Diagram: "fat" data enters the compressor/encoder and comes out as "slim" data; the removed "fat" is the redundancy.]
Redundancy
Compression is made possible by exploiting redundancy in the source data. For example, redundancy is found in:
- character distribution
- character repetition
- high-usage patterns
- positional redundancy
(These categories overlap to some extent.)
Symbols and Encoding
We assume a piece of data (such as text, image, etc.) is represented by a sequence of symbols. Each symbol is encoded in a computer by a code (or value). Examples:
- English text: abc (symbols), ASCII (coding)
- Chinese text: 多媒體 (symbols), BIG5 (coding)
- Image: color (symbols), RGB (coding)
Character distribution
Some symbols are used more frequently than others. In English text, 'e' and space occur most often.
Fixed-length encoding: use the same number of bits to represent each symbol. With fixed-length encoding, to represent n distinct symbols we need ⌈log₂ n⌉ bits for each code.
Character distribution
Fixed-length encoding may not be the most space-efficient. Example: “abacabacabacabacabacabacabac” With fixed-length encoding, we need 2 bits for each of the 3 symbols. Alternatively, we could use ‘1’ to represent ‘a’ and ‘00’ and ‘01’ to represent ‘b’ and ‘c’, respectively. How much saving did we achieve?
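A quick worked check of the saving, using the message and codes from the slide (a minimal sketch):

```python
# Compare fixed-length encoding against the variable-length code from the slide.
msg = "abac" * 7                      # "abacabac...abac", 28 symbols
fixed_bits = 2 * len(msg)             # 2 bits/symbol suffice for 3 distinct symbols

code = {"a": "1", "b": "00", "c": "01"}
variable_bits = sum(len(code[ch]) for ch in msg)

print(fixed_bits, variable_bits)      # 56 vs. 42 bits
print(f"saving: {100 * (fixed_bits - variable_bits) / fixed_bits:.0f}%")  # 25%
```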
Character distribution
Technique: whenever symbols occur with different frequencies, we use variable-length encoding.
For your interest: in 1838, Samuel Morse and Alfred Vail designed the Morse Code for telegraphy. Each letter of the alphabet is represented by dots (electric current of short duration) and dashes (three times the duration of a dot), e.g., E = "." and Z = "--..". Morse and Vail determined the code by counting the letters in each bin of a printer's type box.
Morse code table
[Table of Morse codes for each character.] Example: SOS = · · · — — — · · ·
Character distribution
It can be shown that Morse Code can be improved by no more than 15%. Principle: use fewer bits to represent frequently occurring symbols, use more bits to represent those rarely occurring ones.
Character repetition
When a string of repetitions of a single symbol occurs, e.g., "aaaaaaaaaaaaaabc".
Example: long strings of spaces/tabs in forms/tables.
Example: in graphics, it is common to have a line of pixels using the same color.
Technique: run-length encoding.
High-usage patterns
Certain patterns (i.e., sequences of symbols) occur very often.
Example: "qu", "th", ..., in English text.
Example: "begin", "end", ..., in Pascal source code.
Technique: use a single special symbol to represent a common sequence of symbols, e.g., a single new code for "th": instead of using 2 codes for 't' and 'h', use only 1 code for "th".
Positional redundancy
A certain thing appears persistently at a predictable place, so sometimes symbols can be predicted.
Example: the left-uppermost pixel of my slides is always white.
Technique: tell me the "changes" only.
Example: differential pulse code modulation (DPCM), sketched below.
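A minimal sketch of the "send only the changes" idea (plain differential coding; the function names are illustrative, and real DPCM typically adds prediction and quantization):

```python
# Differential coding: transmit the first sample, then only the changes (deltas).
def dpcm_encode(samples):
    deltas, prev = [], 0
    for s in samples:
        deltas.append(s - prev)   # difference from the previous sample
        prev = s
    return deltas

def dpcm_decode(deltas):
    samples, prev = [], 0
    for d in deltas:
        prev += d                 # rebuild each sample from the running sum
        samples.append(prev)
    return samples

pixels = [100, 101, 101, 102, 104, 104, 103]
deltas = dpcm_encode(pixels)      # [100, 1, 0, 1, 2, 0, -1] -- small values, cheap to encode
assert dpcm_decode(deltas) == pixels
```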
Categories of compression techniques
Entropy encoding
- regards the data stream as a simple digital sequence; its semantics is ignored
- lossless compression
Source encoding
- takes into account the semantics of the data and the characteristics of the specific medium to achieve a higher compression ratio
- lossy compression
Categories of compression techniques
Hybrid encoding
- a combination of the above two
A categorization
[Diagram: categorization of compression techniques.]
Information theory
Entropy is a measure of uncertainty or randomness. According to Shannon, the entropy of an information source S is defined as:
H(S) = Σᵢ pᵢ log₂(1/pᵢ)
where pᵢ is the probability that symbol Sᵢ in S occurs. log₂(1/pᵢ) indicates the amount of information contained in Sᵢ, or the number of bits needed to encode it.
Information theory
The entropy is the lower bound for lossless compression: when the probability of each symbol is fixed, each symbol should be represented with at least H(S) bits on average.
A guideline: the code lengths for different symbols should vary according to the information carried by the symbol. Coding techniques based on this are called entropy encoding.
Information theory
For example, in an image with a uniform distribution of gray-level intensities, each pᵢ is 1/256. The number of bits needed to code each gray level is log₂(1/(1/256)) = log₂ 256 = 8 bits. The entropy of this image is 256 × (1/256) × 8 = 8 bits per pixel.
Information theory
Q: How about an image in which half of the pixels are white and the other half are black?
Ans: p(white) = 0.5, p(black) = 0.5, Entropy = 1.
Q: How about an image such that white occurs ¼ of the time, black ¼ of the time, and gray ½ of the time?
Ans: Entropy = 0.25×2 + 0.25×2 + 0.5×1 = 1.5.
Entropy encoding is about wisely using just enough bits to represent the occurrence of certain patterns.
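These examples are easy to check numerically; a minimal sketch of Shannon's formula:

```python
from math import log2

def entropy(probs):
    """H(S) = sum of p_i * log2(1/p_i) over all symbols with p_i > 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([1 / 256] * 256))    # uniform gray levels -> 8.0 bits
print(entropy([0.5, 0.5]))         # half white, half black -> 1.0 bit
print(entropy([0.25, 0.25, 0.5]))  # white 1/4, black 1/4, gray 1/2 -> 1.5 bits
```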
Variable-length encoding
A problem with variable-length encoding is ambiguity in the decoding process. e.g., 4 symbols: a (‘0’), b (‘1’), c (‘01’), d (‘10’) the bit stream: ‘0110’ can be interpreted as abba, or abd, or cba, or cd
Prefix code
The idea is to translate fixed-sized pieces of input data into variable-length codes. We wish to encode each symbol into a sequence of 0's and 1's so that no code for a symbol is the prefix of the code of any other symbol – the prefix property.
With the prefix property, we are able to decode a sequence of 0's and 1's into a sequence of symbols unambiguously.
Morse code is not a prefix code.
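A minimal sketch of why the prefix property makes decoding unambiguous; the four codewords below form a hypothetical prefix code (not the ambiguous code from the previous slide):

```python
# With a prefix code, a greedy left-to-right scan decodes uniquely: as soon as the
# accumulated bits match a codeword, that codeword is the only possible reading.
prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}  # no codeword is a prefix of another
decode_map = {v: k for k, v in prefix_code.items()}

def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_map:          # a complete codeword has been read
            out.append(decode_map[buf])
            buf = ""
    assert buf == "", "bit stream ended in the middle of a codeword"
    return "".join(out)

print(decode("0101100"))               # 0|10|110|0 -> "abca", the only possible reading
```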
Shannon-Fano algorithm
Example: 5 characters are used in a text, with the following statistics: a: 15, b: 7, c: 6, d: 6, e: 5 (39 symbols in total).
Algorithm (top-down approach):
1. Sort the symbols according to their probabilities.
2. Divide the symbols into 2 half-sets, each with roughly equal counts.
3. All symbols in the first group are given '0' as the first bit of their codes; '1' for the second group.
4. Recursively perform steps 2-3 for each group, with the additional bits appended.
Shannon-Fano algorithm
Example:
Gp1: {a, b}; Gp2: {c, d, e}
subdivide Gp1: Gp11: {a}; Gp12: {b}
subdivide Gp2: Gp21: {c}; Gp22: {d, e}
subdivide Gp22: Gp221: {d}; Gp222: {e}
Shannon-Fano algorithm
Result (codes read off the groupings above): a: 00, b: 01, c: 10, d: 110, e: 111.
Shannon-Fano algorithm
The coding scheme in a tree representation:
[Tree: the root (39) splits into {a, b} (22) and {c, d, e} (17); leaves A (15), B (7), C (6), D (6), E (5), with D and E under an internal node of weight 11.]
The total number of bits used for encoding is 89. An uncompressed, fixed-length code needs 3 bits for each symbol, a total of 39 × 3 = 117 bits.
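A small sketch of the top-down procedure for this example (the function name is illustrative, and the split rule simply picks the most balanced cut, which reproduces the groupings above):

```python
# Shannon-Fano (top-down): sort symbols by count, split into two groups with
# roughly equal total counts, give '0' to one group and '1' to the other, recurse.
def shannon_fano(freqs):
    def build(group, prefix, codes):
        if len(group) == 1:
            codes[group[0]] = prefix or "0"
            return codes
        total, running = sum(freqs[s] for s in group), 0
        best_split, best_diff = 1, float("inf")
        for i in range(1, len(group)):           # find the most balanced split point
            running += freqs[group[i - 1]]
            diff = abs(2 * running - total)      # |left count - right count|
            if diff < best_diff:
                best_diff, best_split = diff, i
        build(group[:best_split], prefix + "0", codes)
        build(group[best_split:], prefix + "1", codes)
        return codes

    symbols = sorted(freqs, key=freqs.get, reverse=True)
    return build(symbols, "", {})

freqs = {"a": 15, "b": 7, "c": 6, "d": 6, "e": 5}
codes = shannon_fano(freqs)
print(codes)                                          # a:00 b:01 c:10 d:110 e:111
print(sum(freqs[s] * len(codes[s]) for s in freqs))   # 89 bits
```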
Shannon-Fano algorithm
It can be shown that the average number of bits used to encode a symbol with the Shannon-Fano algorithm is between H(S) and H(S) + 1. Hence it is fairly close to the theoretical optimum.
Huffman code
Used in many compression programs, e.g., PKZIP, ARJ.
Each code length in bits is approximately log₂(1/pⱼ). Example (English text): log₂(1/Pr[space]) ≈ 2.4 bits, log₂(1/Pr[Z]) ≈ 10.3 bits.
Huffman code Algorithm (bottom-up approach):
We start with a forest of n trees; each tree has only 1 node, corresponding to a symbol. Define the weight of a tree as the sum of the probabilities of all the symbols in the tree.
Repeat the following steps until the forest has one tree left:
1. Pick the 2 trees of lightest weight, say T1 and T2.
2. Replace T1 and T2 by a new tree T, such that T1 and T2 are T's children.
Huffman code: building the tree for the example
Start: a(15/39), b(7/39), c(6/39), d(6/39), e(5/39).
Step 1: merge the two lightest trees, d(6/39) and e(5/39), into a tree of weight 11/39.
Step 2: merge b(7/39) and c(6/39) into a tree of weight 13/39.
Step 3: merge the 11/39 and 13/39 trees into a tree of weight 24/39.
Step 4: merge a(15/39) and the 24/39 tree into the final tree of weight 39/39.
Huffman code
Result from the previous example: A: 0, B: 100, C: 101, D: 110, E: 111.
Huffman code
The code of a symbol is obtained by traversing the tree from the root to the symbol, taking a '0' when a left branch is followed and a '1' when a right branch is followed.
Huffman code
char  code  bits/char  # occ.  # bits used
A     0     1          15      15
B     100   3          7       21
C     101   3          6       18
D     110   3          6       18
E     111   3          5       15
Total: 87 bits.
"ABAAAE" → "0100000111"
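A small sketch of the bottom-up construction (a heap-based version; tie-breaking among equal weights may assign the 3-bit patterns to different letters than the slide, but the code lengths and the 87-bit total are the same):

```python
import heapq
from itertools import count

# Bottom-up Huffman construction: repeatedly merge the two lightest trees.
def huffman(freqs):
    tiebreak = count()                       # avoids comparing trees of unequal shape
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)      # the two lightest trees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))

    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):          # internal node: (left, right)
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2])
    return codes

freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codes = huffman(freqs)
print(codes)                                          # A gets 1 bit; B-E get 3 bits each
print(sum(freqs[s] * len(codes[s]) for s in freqs))   # 87 bits
print("".join(codes[ch] for ch in "ABAAAE"))          # 10 bits, same length as the slide's 0100000111
```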
Huffman code (advantages)
Decoding is trivial as long as the coding table is sent before the data; the overhead is relatively negligible if the data file is big.
Prefix property: no code is a prefix of any other code, so forward scanning is never ambiguous. Good for the decoder.
Huffman code (advantages)
"Optimal code size" (among coding schemes that use a fixed code for each symbol).
For the example: the entropy is H(S) ≈ 2.19 bits/symbol; the average bits/symbol needed by the Huffman code is 87/39 ≈ 2.23.
Huffman code (disadvantages)
A table is needed that lists each input symbol and its corresponding code.
We need to know the symbol frequency distribution (this might be OK for English text, but not for arbitrary data files), which requires two passes over the data.
The frequency distribution may also change over the content of the source.
Arithmetic coding Motivation:
Huffman coding uses an integer number (k) of bits for each symbol, and k is at least 1. Sometimes, e.g., when sending bi-level images, compression therefore does not help. In fact, Huffman coding is optimal (i.e., attains the theoretical information lower bound) if and only if the symbol probabilities are integral powers of 1/2.
Arithmetic coding works by representing a message by an interval of real numbers between 0 and 1. As the message becomes longer, the interval needed to represent it becomes smaller, and the number of bits needed to specify the interval increases.
Idea: Suppose the alphabet is {X, Y} and p(X) = 2/3, p(Y) = 1/3.
Map X, Y to subintervals of [0..1]: X → [0, 2/3), Y → [2/3, 1).
To encode length-2 messages, map all possible messages to subintervals of [0..1]: XX → [0, 4/9), XY → [4/9, 6/9), YX → [6/9, 8/9), YY → [8/9, 1).
Now, just send enough bits of a binary fraction to specify the interval (e.g., .01 for XX, .10 for XY, .110 for YX, .1111 for YY).
Arithmetic coding
Similarly, we can map all length-3 messages as follows:

Length 1           Length 2             Length 3
X  [0, 18/27)      XX [0, 12/27)        XXX [0, 8/27)
                                        XXY [8/27, 12/27)
                   XY [12/27, 18/27)    XYX [12/27, 16/27)
                                        XYY [16/27, 18/27)
Y  [18/27, 1)      YX [18/27, 24/27)    YXX [18/27, 22/27)
                                        YXY [22/27, 24/27)
                   YY [24/27, 1)        YYX [24/27, 26/27)
                                        YYY [26/27, 1)
Arithmetic coding: decoding .110
On receiving '1': marker .1????, interval [0.5, 1), symbol: ? (the interval still overlaps both X's and Y's ranges).
Arithmetic coding: decoding .110
On receiving '11': marker .11???, interval [0.75, 1), symbol: Y (the interval now lies entirely within Y's range [2/3, 1)).
Arithmetic coding: decoding .110
On receiving '110': marker .110?, interval [0.75, 0.875), symbols: YX (the interval lies entirely within YX's range [6/9, 8/9)).
Arithmetic coding: decoding .110
With all of '110' received: interval [0.75, 0.875), symbols: YX.
If no more bits are received, there is insufficient data to deduce a 3rd symbol. Hence, '110' represents the symbol string "YX".
Arithmetic coding
The number of bits in the codeword is determined by the size of the interval. The first interval (XXX) has size 8/27 and needs 2 bits, i.e., 2/3 bit per symbol (efficient). The last interval (YYY) has size 1/27 and needs 5 bits. In general, it takes about log₂(1/p) bits to represent an interval of size p.
The code length approaches the optimal encoding as the message unit gets longer.
Although arithmetic coding gives good compression efficiency, it is very CPU-demanding (in generating the code).
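A minimal sketch of the interval arithmetic for this two-symbol example, using exact fractions; it shows only the interval narrowing and the inverse mapping, not a full bit-level arithmetic coder (function names are illustrative):

```python
from fractions import Fraction as F

# Interval view of arithmetic coding for the slide's alphabet: p(X)=2/3, p(Y)=1/3.
probs = {"X": F(2, 3), "Y": F(1, 3)}
cum   = {"X": F(0),    "Y": F(2, 3)}      # cumulative probability below each symbol

def encode_interval(msg):
    """Narrow [0,1) down to the subinterval that represents the whole message."""
    low, width = F(0), F(1)
    for sym in msg:
        low += cum[sym] * width
        width *= probs[sym]
    return low, low + width

def decode_value(value, n):
    """Map a number in [0,1) back to its first n symbols."""
    out = []
    for _ in range(n):
        sym = "X" if value < probs["X"] else "Y"      # which subinterval contains value?
        out.append(sym)
        value = (value - cum[sym]) / probs[sym]       # rescale that subinterval to [0,1)
    return "".join(out)

low, high = encode_interval("YX")
print(low, high)                     # 2/3 and 8/9, matching the YX row of the table
print(decode_value(F(3, 4), 2))      # binary .110 = 0.75 lies in [2/3, 8/9) -> "YX"
```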
Run-length encoding (RLE)
Image, video and audio data streams often contain sequences of the same bytes, e.g., background areas in images and silence in sound files. Substantial compression can be achieved by replacing a repeated byte sequence with the byte and a count of its occurrences.
Run-length encoding
If a byte occurs at least 4 consecutive times, the run is encoded as the byte followed by a special flag (i.e., a special byte) and the number of its occurrences, e.g.,
Uncompressed: ABCCCCCCCCDEFGGGHI
Compressed: ABC!8DEFGGGHI
If 4 Cs (CCCC) are encoded as C!0, 5 Cs as C!1, and so on, then runs of 4 to 259 bytes can be compressed into only 3 bytes.
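A minimal sketch of this scheme; it stores the count with the offset-by-4 convention from the last bullet (rather than the literal count shown in the example) and, for brevity, does not handle a literal '!' appearing in the data:

```python
# RLE: a run of 4..259 identical bytes becomes <byte> <flag '!'> <count - 4>;
# shorter runs are copied verbatim.
FLAG = ord("!")

def rle_encode(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 259:
            j += 1                      # extend the current run (capped at 259)
        run = j - i
        if run >= 4:
            out += bytes([data[i], FLAG, run - 4])
        else:
            out += data[i:j]
        i = j
    return bytes(out)

print(rle_encode(b"ABCCCCCCCCDEFGGGHI"))   # b'ABC!\x04DEFGGGHI' (8 Cs -> count byte 4)
```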
Run-length encoding: problems
Ineffective with text, except perhaps "Ahhhhhhhhhhhhhhhhh!" or "Noooooooooo!".
Needs an escape byte ('!'): OK with text, not OK with arbitrary binary data. Solution?
RLE is used in ITU-T Group 3 fax compression: a combination of RLE and Huffman coding, with compression ratios of 10:1 or better.
Lempel-Ziv-Welch algorithm
Method due to Ziv and Lempel in 1977, with an improvement by Welch in 1984; hence the name LZW compression. Used in UNIX compress, GIF, and V.42bis compression modems.
Motivation: Suppose we want to encode an English text. We first encode Webster's English dictionary, which contains about 159,000 entries; 18 bits are surely enough to index a word. Then why don't we just transmit each word in the text as an 18-bit number, and so obtain a compressed version of the text?
LZW
Problem: we would need to preload a copy of the dictionary before decoding, and the dictionary is huge (159,000 entries).
Solution: build the dictionary on-line while encoding and decoding, so no bandwidth or storage is wasted on preloading. The dictionary is built adaptively: words not used in the text are never added to it.
LZW Example: encode “^WED^WE^WEE^WEB^WET” using LZW.
Assume each symbol (character) is encoded by an 8-bit code (i.e., the codes for the characters run from 0 to 255), and that the dictionary is initially loaded with these 256 codes. While scanning the string, sub-string patterns are added to the dictionary; any pattern that appears again after its first occurrence is replaced by its index into the dictionary.
Example 1 (compression)
[Table: step-by-step trace of the encoder's buffer, the codes output (to be interpreted as numeric codes, not symbols), and the new dictionary entries created while scanning the input string.]
LZW (encoding rules)
Rule 1: whenever the pattern in the buffer is in the dictionary, we try to extend it.
Rule 2: whenever a pattern is added to the dictionary, we output the codeword of the pattern before the last extension.
Example 1
Original: 19 symbols, 8-bit code each: 19 × 8 = 152 bits.
LZW'ed: 12 codes, 9 bits each: 12 × 9 = 108 bits.
Some compression is achieved, and the effectiveness is more substantial for long messages. The dictionary is not explicitly stored; in a sense, the compressor encodes the dictionary into the output stream implicitly.
LZW decompression
"^WED{256}E{260}{261}{257}B{260}T" is read in (recall that these are numeric codes). Assume the decoder also knows the first 256 entries of the dictionary.
While scanning the code stream for decoding, the dictionary is rebuilt. When an index is read, the corresponding dictionary entry is retrieved and spliced into position. The decoder deduces what the encoder's buffer has gone through and rebuilds the dictionary.
Example 1 (decompression)
LZW (decoding rule) Whenever there are 2 codes C1C2 in the buffer, add a new entry to the dictionary: symbols of C1 + first symbol of C2
Compression code / Decompression code
[Pseudocode for the LZW compressor and decompressor.]
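The slide's pseudocode is not reproduced here; the following is a compact sketch of an LZW coder and decoder consistent with the encoding and decoding rules above (function names are illustrative, not the slide's exact code):

```python
# A compact LZW coder/decoder following the buffer rules above.
def lzw_compress(text):
    dictionary = {chr(i): i for i in range(256)}      # preloaded single-character codes
    buffer, out = "", []
    for ch in text:
        if buffer + ch in dictionary:                 # Rule 1: extend the buffer
            buffer += ch
        else:
            out.append(dictionary[buffer])            # Rule 2: output code before extension
            dictionary[buffer + ch] = len(dictionary) # add the new pattern
            buffer = ch
    out.append(dictionary[buffer])
    return out

def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # tricky case: the code may refer to the entry being built right now
        entry = dictionary[code] if code in dictionary else prev + prev[0]
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[0] # symbols of C1 + first symbol of C2
        prev = entry
    return "".join(out)

codes = lzw_compress("^WED^WE^WEE^WEB^WET")
print(codes)            # [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84], i.e. ^WED{256}E{260}{261}{257}B{260}T
print(lzw_decompress(codes) == "^WED^WE^WEE^WEB^WET")   # True
```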
Example 2 ababcbababaaaaaaa
Example 2 (compression)
LZW: Pros and Cons
LZW (advantages): effective at exploiting:
- character-repetition redundancy
- high-usage-pattern redundancy
It is an adaptive algorithm (no need to learn the statistical properties of the symbols beforehand) and a one-pass procedure.
LZW: Pros and Cons
LZW (disadvantages): messages must be long for effective compression, because it takes time for the dictionary to build up. Moreover, if the message is not homogeneous (its redundancy characteristics shift during the message), compression efficiency drops.
LZW (implementation)
Decompression is generally faster than compression. Why?
Implementations usually use 12-bit codes, i.e., 4096 table entries, and need a hash table for compression.
LZW and GIF
Example: a 320 × 200 image with 11 colors. Create a color map of 11 entries, mapping each code 0..10 to a color [r, g, b]. Convert the image into a sequence of codes, then apply LZW to the code stream.
GIF is a file format, not a compression algorithm by itself; it uses LZW for compression. GIF uses a color lookup table with up to 256 entries. If there are more than 256 distinct colors in an image, each color is mapped to the "closest" entry in the table.
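A toy sketch of the color-map step; the 3-entry palette and the pixel values are made up for illustration, and real GIF encoders use more careful quantization:

```python
# Replace each RGB pixel by the index of the nearest palette entry,
# producing the code stream that LZW would then compress.
palette = [(255, 255, 255), (0, 0, 0), (200, 30, 30)]   # hypothetical 3-entry color map

def nearest_index(pixel):
    # squared Euclidean distance in RGB space picks the "closest" palette entry
    return min(range(len(palette)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(palette[i], pixel)))

image = [(250, 250, 250), (10, 5, 0), (190, 40, 35), (255, 255, 250)]
codes = [nearest_index(p) for p in image]
print(codes)        # [0, 1, 2, 0] -- these indices are what gets LZW-compressed
```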
GIF: 5,280 bytes (0.66 bpp); JPEG: 2.04 bpp (bpp = bits per pixel).
GIF image
JPEG image
V.42bis
Based on the LZW algorithm; dictionary entries can be recycled.
Transparent mode: compression is disabled but the dictionary is still updated (used when transmitting already-compressed files).
"bis" is a suffix meaning a modified standard, derived from the Latin "bis", meaning twice.
An experiment
Hayes Microcomputer Products, Inc.'s implementation of V.42bis compression, run on 4 different types of files:
- a 176,369-byte English essay (Essays)
- a 210,931-byte collection of USENET news articles (News)
- a 119,779-byte Pascal source (Pascal)
- a 201,746-byte LaTeX manual source (Tex)
Experiment result
[Chart: compression ratios for each file type as a function of dictionary size.]