Download presentation
Published byAriel Anabel Cooper Modified over 9 years ago
1
Lecture 6 Source Coding and Compression Dr.-Ing. Khaled Shawky Hassan
Room: C3-222, ext: 1204, 1 1
2
Dictionary Techniques
Statistical compression methods Use the statistical model of the data, and the quality of compression they achieve depends on how good that model is. Dictionary-based compression methods Do not use a statistical model The dictionary holds strings of symbols and it may be static or dynamic (adaptive). Static is permanent (allowing the addition of strings but no deletions). Dynamic holds strings previously found in the input stream, allowing for additions and deletions of strings as new input is being read. 2
3
Dictionary Techniques
So far we assumed independent symbol with some statistical model: Not true for many common data types, e.g.: text, images, code the quality of compression they achieve depends on how good that model is Basic idea (ADAPTIVE) Identify frequent symbol patterns. Encode those frequent symbols more efficiently. Use these encoded pattern as a default (less efficient) encoding for the rest! Notes This looks reasonable for things like “text, image” … Also, it looks reasonable for things which is not very random. 3
4
Lempel-Ziv Compression Techniques
Static coding requires two-passes: one pass to compute probabilities (or frequencies) and determine the mapping, and a second pass to encode. Examples of Static techniques: Static Huffman Coding All of the adaptive methods are one-pass methods; only one scan of the message (to encode) is required. Examples of adaptive techniques: LZ77?, LZ78?, and Adaptive Huffman Coding 4
5
Lempel-Ziv Compression Techniques
LZ77 (Sliding Window) Variants: LZSS (Lempel-Ziv-Storer-Szymanski) Applications: gzip, Squeeze, LHA, PKZIP, ZOO LZ78 (Full Dictionary Based) Variants: LZW (Lempel-Ziv-Welch), LZC (Lempel-Ziv-Compress) Applications: compress, ARC , V.42bis (LZ78), zlib, GIF, CCITT (modems), PAK (LZ77) Traditionally: LZ77 was better but slower, but the gzip version is almost as fast as any LZ78. 5
6
Example Dictionary All/some letters from the alphabet +
As many diagrams(pairs/group of letters) as possible Example: A= {a, b, c, d, r} 3 pairs: ab, ac, ad Encode the following: abracadabra 6
7
Example 7
8
Example 8
9
Example 9
10
Example 10
11
Example 11
12
Example 12
13
Example 13
14
Example 14
15
Example 15
16
LZ78 Compression Algorithm
LZ78 inserts one- or multi-character, distinct patterns of the message to be encoded in a Dictionary. The multi-character patterns are of the form: C0C Cn-1Cn. The prefix of a pattern refers to all the pattern characters except the last: C0C Cn-1 LZ78 Output: Note: The dictionary is usually implemented as a hash table.
17
LZ78 Compression Algorithm (cont’d)
Dictionary empty ; Prefix empty ; DictionaryIndex 1; while(characterStream is not empty) { Char next character in characterStream; if(Prefix + Char exists in the Dictionary) Prefix Prefix + Char ; else if(Prefix is empty) CodeWordForPrefix 0 ; CodeWordForPrefix DictionaryIndex for Prefix ; Output: (CodeWordForPrefix, Char) ; insertInDictionary( ( DictionaryIndex , Prefix + Char) ); DictionaryIndex++ ; } if(Prefix is not empty) CodeWordForPrefix DictionaryIndex for Prefix; Output: (CodeWordForPrefix , ) ;
18
Example 1: LZ78 Compression
Encode (i.e., compress) the string ABBCBCABABCAABCAAB using the LZ78 algorithm. The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B) Note: The above is just a representation, the commas and parentheses are not transmitted; we will discuss the actual form of the compressed message later!
19
Example 1: LZ78 Compression (cont’d)
How to solve it :-) 1. A is not in the Dictionary; insert it 2. B is not in the Dictionary; insert it 3. B is in the Dictionary. BC is not in the Dictionary; insert it. 4. B is in the Dictionary. BC is in the Dictionary. BCA is not in the Dictionary; insert it. 5. B is in the Dictionary. BA is not in the Dictionary; insert it. 6. B is in the Dictionary. BCA is in the Dictionary. BCAA is not in the Dictionary; insert it. 7. B is in the Dictionary, BC is in the Dictionary, BCA is in the Dictionary, BCAA is in the Dictionary. However, BCAAB is not in the Dictionary; insert it.
20
Example 2: LZ78 Compression
Encode (i.e., compress) the string BABAABRRRA using the LZ78 algorithm. The compressed message is: (0,B)(0,A)(1,A)(2,B)(0,R)(5,R)(2, )
21
Example 2: LZ78 Compression (cont’d)
1. B is not in the Dictionary; insert it 2. A is not in the Dictionary; insert it 3. B is in the Dictionary. BA is not in the Dictionary; insert it. 4. A is in the Dictionary. AB is not in the Dictionary; insert it. 5. R is not in the Dictionary; insert it. 6. R is in the Dictionary. RR is not in the Dictionary; insert it. 7. A is in the Dictionary and it is the last input character; output a pair containing its index: (2, )
22
Example 3: LZ78 Compression
Encode (i.e., compress) the string AAAAAAAAA using the LZ78 algorithm. 1. A is not in the Dictionary; insert it 2. A is in the Dictionary AA is not in the Dictionary; insert it 3. A is in the Dictionary. AA is in the Dictionary. AAA is not in the Dictionary; insert it. 4. A is in the Dictionary. AAA is in the Dictionary and it is the last pattern; output a pair containing its index: (3, )
23
LZ78 Compression: Number of bits transmitted
Example: Uncompressed String: ABBCBCABABCAABCAAB Number of bits = Total number of characters * 8 = 18 * 8 = 144 bits Suppose the codewords are indexed starting from 1: Compressed string( codewords): (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B) Codeword index Each code word consists of an integer and a character: The character is represented by 8 bits. The number of bits n required to represent the integer part of the codeword with index i is given by: or 0 Alternatively number of bits required to represent the integer part of the codeword with index i is the number of significant bits required to represent the integer i – 1
24
LZ78 Compression: Number of bits transmitted
Codeword (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B) index Bits: (1 + 8) + (1 + 8) + (2 + 8) + (2 + 8) + (2 + 8) + (3 + 8) + (3 + 8) = 70 bits The actual compressed message is: 0A0B10C11A010A100A110B where each character is replaced by its binary 8-bit ASCII code.
25
output: currentCharacter else
LZ78 Decompression Algorithm (self study) input: (CI, character) pairs output: if(CI == 0) output: currentCharacter else output: stringAtIndex CI + currentCharacter Insert: current output in dictionary
26
Example 1: LZ78 Decompression
Decode (i.e., decompress) the sequence (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B) The decompressed message is: ABBCBCABABCAABCAAB
27
Example 2: LZ78 Decompression
Decode (i.e., decompress) the sequence (0, B) (0, A) (1, A) (2, B) (0, R) (5, R) (2, ) The decompressed message is: BABAABRRRA
28
Example 3: LZ78 Decompression
Decode (i.e., decompress) the sequence (0, A) (1, A) (2, A) (3, ) The decompressed message is: AAAAAAAAA
29
Exercises 1. Use LZ78 to trace encoding the string SATATASACITASA.
2. Write a MATLAB program that encodes a given string using LZ78. 3. Write a MATLAB program that decodes a given set of encoded codewords using LZ78.
30
LZ77: Sliding Window Lempel-Ziv
(o)ffset= search_ptr – match_ptr = 7 (l)ength= number of consecutive letters matched = 4 (c)odeword(r) Encoding Notes <o, l, c> = <7, 4, C(‘r’)> |search buff| = S, S + |LA buff| = |W(indow)| Page30
31
LZ77: Example (0,0,a) (1,1,c) (3,3,a) (0,0,b) (3,3,a) (1,1,a)
|LA buf| a c b (1,1,c) a c b (3,3,a) |search buff| a c b (0,0,b) O = 3 a c b (3,3,a) | search buff | | LA buff | O=1 a c b (1,1,a) (S)earch Buffer (size = 6) Page31 l (Length of matched Letters) Next character
32
LZ77 Decoding Decoder keeps the same dictionary window as encoder.
For each message it looks it up in the dictionary and inserts a copy (which one at the initialization?) What if l > o? (only part of the message is in the dictionary.) E.g. input = abcd, codeword = (2,9,e) Simply copy starting at the cursor for (i = 0; i < length; i++) out[cursor+i] = out[cursor-offset+i] Out = abcdcdcdcdcdce Page32
33
LZ77 Optimizations used by gzip
LZSS: Output one of the following formats (0, position, length) or (1,char) Typically use the second format if length < 3. (1,a) a a c a a c a b c a b a a a c (1,a) a a c a a c a b c a b a a a c (1,c) a a c a a c a b c a b a a a c (0,3,4) a a c a a c a b c a b a a a c 15-853 Page33
34
Optimizations used by gzip (cont.)
Huffman code the positions, lengths and chars (as an outer code) Non greedy: possibly use shorter match so that next match is better Use hash table to store dictionary: Hash is based on strings of length 3. Find the longest match within the correct hash bucket. Limit on length of search. Store within bucket in order of position 15-853 Page34
35
Theory behind LZ77 The Problem: “long enough” is really really long.
Sliding Window LZ is Asymptotically Optimal [Wyner-Ziv,94] Will compress “long enough” strings to the source entropy as the window size goes to infinity. The Problem: “long enough” is really really long. Page35
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.