Source Coding Data Compression A.J. Han Vinck
DATA COMPRESSION NO LOSS of information and exact reproduction (low compression ratio 1:4) general problem statement: “find a means for spending as little time as possible on packing as much of data as possible into as little space as possible, and with no loss of information”
GENERAL IDEA: represent likely symbols with short binary words, where "likely" is derived from:
- prediction of the next symbol in the source output. Example: after q, what comes next (q ?) is one of q-ue, q-ua, q-ui, q-uo, so the continuation can be coded with two bits: q-00, q-01, q-10, q-11
- context between the source symbols: words, sounds, context in pictures
Why compress? Lossless compression often reduces file size by 40% to 80%.
1. More economical to transport and store
2. Most Internet content is compressed for transmission
3. Compression before encryption can make code-breaking difficult
4. Conserve battery power and storage space on mobile devices
5. Compression and decompression can be hardwired
Some history
1948 – Shannon-Fano coding
1952 – Huffman coding
 – reduced redundancy in symbol coding
 – demonstrably optimal variable-length coding
1977 – Lempel-Ziv coding
 – first major “dictionary method”
 – maps repeated word patterns to code words
MODEL KNOWLEDGE best performance: exact prediction! exact prediction: no new information! no new information: no message to transmit
Example: no prediction. source → C → code (table columns: message, code). Representation length: L = 3
Example with prediction: ENCODE THE DIFFERENCE (table columns: difference in {-1, 0, 1}, probability, code). $L = 0.25 \cdot 2 + 0.5 \cdot 1 + 0.25 \cdot 2 = 1.5$ bit/difference symbol
binary tree codes: the relation between source symbols and codewords. A:= 11 B:=10 C:=
General Properties:
- every node has two successors: leaves and/or nodes
- the path to a leaf gives the connected codeword
- source letters are only assigned to leaves, i.e. no codeword is a prefix of another codeword
tree codes: Tree codes are prefix codes and uniquely decodable, i.e. a string of codewords can be uniquely decomposed into the individual codewords. Non-prefix codes may also be uniquely decodable. Example: A:=1 B:=10 C:=100
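The non-prefix example above can be checked with a short sketch. The function names `is_prefix_code` and `decode_leading_one` are illustrative and not from the slides; the decoder relies on the assumption that every codeword is a single 1 followed only by 0s, which holds for this particular example but not for non-prefix codes in general.

```python
def is_prefix_code(code):
    """Return True if no codeword is a proper prefix of another codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


def decode_leading_one(bits, code):
    """Decode a code in which every codeword is '1' followed only by '0's.

    Under that assumption a new codeword starts exactly at every '1',
    so the bit string can be cut in front of each '1'.
    """
    inverse = {word: symbol for symbol, word in code.items()}
    chunks = ["1" + tail for tail in bits.split("1")[1:]]
    return "".join(inverse[chunk] for chunk in chunks)


non_prefix = {"A": "1", "B": "10", "C": "100"}
print(is_prefix_code(non_prefix))                 # A is a prefix of B and C
print(decode_leading_one("1101001", non_prefix))  # still decodes uniquely
```

The decoder has to look one symbol ahead (to the next 1) before it can close a codeword, which is the price of giving up the prefix property.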
binary tree codes: The average codeword length is $L = \sum_i P(i)\, n_i$, where $n_i$ is the length of the codeword for message i. Property: an optimal code has minimum L. Property: for an optimal code the two least probable codewords have the same length, are the longest and, by manipulating the assignment, differ only in the last code digit.
Tree encoding (1): for data / text the compression should be lossless (no errors).
– STEP 1: assign messages to nodes (table columns: codeword, n_i, P(i); messages a, b, c, d, e)
AVERAGE CODEWORD LENGTH: L = 2.75 bit/source symbol
Tree encoding (2):
– STEP 2: OPTIMIZE ASSIGNMENT (MINIMIZE average length) (table columns: codeword, n_i, P(i); messages in the order e, d, c, b, a)
AVERAGE CODEWORD LENGTH: L = 2.35 bit/source symbol!
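Kraft inequality: Prefix codes with M code words satisfy the Kraft inequality $\sum_{k=1}^{M} 2^{-n_k} \le 1$, where $n_k$ is the code word length for message k. Proof: let $n_M$ be the longest codeword length. Then, in a code tree of depth $n_M$, the terminal node for message k eliminates $2^{n_M - n_k}$ of the $2^{n_M}$ available nodes at depth $n_M$, so $\sum_{k=1}^{M} 2^{n_M - n_k} \le 2^{n_M}$. Dividing by $2^{n_M}$ gives the inequality.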
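Example: depth = 4, i.e. 16 nodes at the deepest level. A code word of length 1 eliminates 8 of them, a code word of length 2 eliminates 4, and a code word of length 3 eliminates 2. Homework: can we replace ≤ by = in the Kraft inequality?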
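Kraft inequality (converse): Suppose that the length specification of M code words satisfies the Kraft inequality. Then $\sum_{k=1}^{M} 2^{-n_k} = \sum_{i=1}^{n_M} N_i\, 2^{-i} \le 1$, where $N_i$ is the number of code words of length i and $n_M$ is the longest length. Then we can construct a prefix code with the specified lengths. Note that the sum up to any level j also satisfies $\sum_{i=1}^{j} N_i\, 2^{-i} \le 1$.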
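Kraft inequality: From this, multiplying the sum up to level j by $2^j$ gives $N_1 \le 2$, $N_2 \le 2^2 - 2 N_1$, $N_3 \le 2^3 - 2^2 N_1 - 2 N_2$, and in general $N_j \le 2^j - \sum_{i=1}^{j-1} N_i\, 2^{j-i}$. Interpretation: at every level, no more nodes are used than are available! E.g. for level 3, we have 8 nodes minus the nodes cancelled by levels 1 and 2.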
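The construction behind this argument can be sketched in a few lines: walk down the tree level by level and always take the next node that has not been cancelled by a shorter code word. The names `kraft_sum` and `prefix_code_from_lengths` are illustrative, and the sketch assumes all lengths are at least 1.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Exact value of the sum of 2**(-n) over all code word lengths n."""
    return sum(Fraction(1, 2 ** n) for n in lengths)


def prefix_code_from_lengths(lengths):
    """Build binary code words with the given lengths, shortest first."""
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    codewords = []
    value = 0      # next free node at the current depth, read as an integer
    previous = 0   # depth of the previously assigned code word
    for n in sorted(lengths):
        value <<= n - previous                       # descend to depth n
        codewords.append(format(value, "0{}b".format(n)))
        value += 1                                   # skip the used subtree
        previous = n
    return codewords


print(kraft_sum([1, 2, 3, 3]))
print(prefix_code_from_lengths([3, 1, 3, 2]))
```

Because the lengths are processed in increasing order, each assigned word cancels a whole subtree, and the Kraft inequality guarantees that a free node is still available at every level.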
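performance: Suppose that we select the code word lengths as $n_k = \lceil -\log_2 P(k) \rceil$. Then a prefix code exists, since $\sum_k 2^{-n_k} \le \sum_k P(k) = 1$, with average length $L = \sum_k P(k)\, n_k < \sum_k P(k) \left( -\log_2 P(k) + 1 \right) = H(U) + 1$.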
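A small numerical check of this length choice, as a sketch only: the four-symbol distribution below is a made-up example, not taken from the slides, and the helper names are illustrative.

```python
import math


def shannon_lengths(probs):
    """Code word lengths n_k = ceil(-log2 P(k))."""
    return [math.ceil(-math.log2(p)) for p in probs]


def entropy(probs):
    """Entropy H(U) in bits."""
    return -sum(p * math.log2(p) for p in probs)


def average_length(probs, lengths):
    """Average code word length, sum of P(k) * n_k."""
    return sum(p * n for p, n in zip(probs, lengths))


probs = [0.4, 0.3, 0.2, 0.1]          # hypothetical source distribution
lengths = shannon_lengths(probs)
kraft = sum(2.0 ** -n for n in lengths)
H = entropy(probs)
L = average_length(probs, lengths)
print(lengths, kraft, H, L)
```

For this distribution the Kraft sum stays below 1 and the average length lands between H(U) and H(U) + 1, as the slide states.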
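Lower bound for prefix codes: We show that $L \ge H(U)$. We write $H(U) - L = \sum_k P(k) \log_2 \frac{2^{-n_k}}{P(k)} \le \log_2 e \sum_k P(k) \left( \frac{2^{-n_k}}{P(k)} - 1 \right) = \log_2 e \left( \sum_k 2^{-n_k} - 1 \right) \le 0$, using $\ln x \le x - 1$ and the Kraft inequality. Equality can be established for $P(k) = 2^{-n_k}$.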
Huffman Coding: (JPEG, MPEG, MP3)
1. take together the two smallest probabilities: P(i) + P(j)
2. replace symbols i and j by a new symbol
3. go to 1, until end
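The merging procedure can be sketched with a priority queue. This is a minimal illustration, not the slide's own example: the function name `huffman_code` and the four-symbol distribution are assumptions, and ties between equal probabilities are broken arbitrarily by insertion order.

```python
import heapq
from itertools import count


def huffman_code(probs):
    """Map each symbol in `probs` (symbol -> probability) to a binary code word."""
    if len(probs) == 1:
        return {symbol: "0" for symbol in probs}
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Step 1: take together the two smallest probabilities.
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        # Step 2: replace them by one new symbol; prepend the branch digit.
        merged = {s: "0" + w for s, w in code1.items()}
        merged.update({s: "1" + w for s, w in code2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]


code = huffman_code({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1})
print(code)
```

The exact bit patterns depend on how the two branches are labelled, but the code word lengths (here 1, 2, 3, 3) are what determine the average length.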
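Huffman Coding: optimality. Given code C with average length L and M symbols. Construct C': replace the 2 least probable symbols $C_M$ and $C_{M-1}$ in C by the symbol $C'_{M-1}$ with probability P(M) + P(M-1). The two replaced code words are one digit longer than the code word of $C'_{M-1}$, so $L = L' + P(M) + P(M-1)$. Hence, to minimize L, we have to minimize L'.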
Properties
ADVANTAGES:
– uniquely decodable code
– smallest average codeword length
DISADVANTAGES:
– LARGE tables give complexity
– variable word length
– sensitive to channel errors
Conclusion Huffman: Tree coding (Huffman) is not universal! It is only valid for one particular type of source! For COMPUTER DATA, data reduction should be:
– lossless: no errors at reproduction
– universal: effective for different types of data
Performance Huffman: Using the probability distribution for the source U, a prefix code exists with average length $L < H(U) + 1$. Since Huffman is optimum, this bound is also true for Huffman codes. Improvements can be made when we take J symbols together; then
– $J\,H(U) \le L < J\,H(U) + 1$, and
– $H(U) \le L' = L/J < H(U) + 1/J$
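The effect of taking J symbols together can be illustrated numerically. The sketch below assumes a memoryless binary source with probabilities (0.9, 0.1), which is a made-up example, and uses illustrative helper names; it computes only Huffman code word lengths, since the bound concerns lengths alone.

```python
import heapq
import math
from itertools import count, product


def huffman_lengths(probs):
    """Code word lengths of a binary Huffman code for a list of probabilities."""
    tiebreak = count()
    # Heap entry: (probability, tiebreak, indices of the symbols in the subtree)
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for i in group1 + group2:
            lengths[i] += 1  # each merge adds one digit to these symbols
        heapq.heappush(heap, (p1 + p2, next(tiebreak), group1 + group2))
    return lengths


def bits_per_symbol(probs, J):
    """Average Huffman length per source symbol when coding blocks of J symbols."""
    block_probs = [math.prod(block) for block in product(probs, repeat=J)]
    lengths = huffman_lengths(block_probs)
    return sum(p * n for p, n in zip(block_probs, lengths)) / J


source = [0.9, 0.1]  # hypothetical memoryless binary source
H = -sum(p * math.log2(p) for p in source)
for J in (1, 2, 3):
    print(J, round(bits_per_symbol(source, J), 4), "entropy:", round(H, 4))
```

The number of block symbols grows as $2^J$ for a binary source, which is the table-size disadvantage mentioned earlier.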
Encoding idea Lempel-Ziv-Welch (LZW): Assume we have just read a segment w from the text, and a is the next symbol.
● If wa is not in the dictionary: write the index of w in the output file, add wa to the dictionary, and set w ← a.
● If wa is in the dictionary: process the next symbol with segment wa.
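The two rules above translate almost directly into code. This is a minimal sketch: the function name `lzw_encode` is illustrative, the dictionary is unbounded (no overflow handling), and every symbol of the text is assumed to be in the initial alphabet.

```python
def lzw_encode(text, alphabet):
    """Return the list of dictionary addresses that LZW outputs for `text`."""
    # The basic symbol set occupies addresses 0, 1, 2, ...
    dictionary = {symbol: index for index, symbol in enumerate(alphabet)}
    output = []
    w = ""
    for a in text:
        if w + a in dictionary:
            w = w + a                              # keep extending the segment
        else:
            output.append(dictionary[w])           # write the index of w
            dictionary[w + a] = len(dictionary)    # add wa at the next address
            w = a                                  # set w <- a
    if w:
        output.append(dictionary[w])               # flush the last segment
    return output


print(lzw_encode("aabaacabcabcb", "abc"))
```

With the alphabet a, b, c at addresses 0, 1, 2 and the string a a b a a c a b c a b c b, the first outputs are 0, 0, 1, 3, 2, 4, 7, and the new dictionary entries are aa, ab, ba, aac, ca, abc, cab at addresses 3 to 9.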
Encoding example: address 0: a, address 1: b, address 2: c. String: a a b a a c a b c a b c b
string read so far | remark | output | update (entry, address)
a a | aa not in dictionary, output 0, add aa to dictionary | 0 | aa 3
a a b | continue with a, store ab in dictionary | 0 | ab 4
a a b a | continue with b, store ba in dictionary | 1 | ba 5
a a b a a c | aa in dictionary, aac not | 3 | aac 6
a a b a a c a | | 2 | ca 7
a a b a a c a b c | | 4 | abc 8
a a b a a c a b c a b | | 7 | cab 9
UNIVERSAL (LZW) (decoder)
1. Start with the basic symbol set.
2. Read a code c from the compressed file.
 - The address c in the dictionary determines the segment w.
 - Write w in the output file.
3. Add wa to the dictionary, where a is the first letter of the next segment.
Decoding example: address 0: a, address 1: b, address 2: c
string decoded so far | remark | input | update (entry, address)
a ? | output a | 0 |
a a ! | output a determines ? = a, update aa | 0 | aa 3
a a b . | output b (input 1) determines ! = b, update ab | 1 | ab 4
a a b a a . | | 3 | ba 5
a a b a a c . | | 2 | aac 6
a a b a a c a b . | | 4 | ca 7
a a b a a c a b c a . | | 7 | abc 8
Conclusion (LZW). IDEA: TRY to copy long parts of source output
– on dictionary overflow, throw the least recently used entry away in both encoder and decoder
– universal
– lossless
Homework: encode/decode the sequence Try to solve the problem that occurs!
Some history (GIF, TIFF, V.42bis modem compression standard, PostScript Level 2)
– 1977: published by Abraham Lempel and Jacob Ziv
– 1984: LZ-Welch algorithm published in IEEE Computer
– Sperry patent transferred to Unisys (1986)
– GIF file format required use of the LZW algorithm
Summary of operations
ENCODING: input | output | update | location
W1 A | loc(W1) | W1A | N
W2 F | loc(W2) | W2F | N+1
W3 X | loc(W3) | W3X | N+2
DECODE: INPUT | decoded | update | location
loc(W1) | W1 ? | |
loc(W2) | W2 ? | W1A | N
loc(W3) | W3 ? | W2F | N+1
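Problem and solution
ENCODING: input | output | update | location
W1 A | loc(W1) | W1A | N
W2 = W1A, F | loc(W2) | W2F | N+1
DECODE: INPUT | decoded | update | location
loc(W1) | W1 ? | |
loc(W2 = W1A) | W2 ? | W1A | N
Problem: the decoder receives location N before that location has been filled. Since W2 = W1A, the ? can be solved: A is the first symbol of W2, and W2 starts with W1, so A is the first symbol of W1. W2 is entered at location N as W1A.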
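A decoder that includes this special case can be sketched as follows. The function name `lzw_decode` is illustrative, the dictionary is unbounded, and the code list is assumed to come from an encoder that starts with the same basic symbol set.

```python
def lzw_decode(codes, alphabet):
    """Rebuild the text from a list of LZW dictionary addresses."""
    dictionary = {index: symbol for index, symbol in enumerate(alphabet)}
    if not codes:
        return ""
    w = dictionary[codes[0]]
    output = [w]
    for c in codes[1:]:
        if c in dictionary:
            entry = dictionary[c]
        elif c == len(dictionary):
            # Address not filled yet: the segment must be W1A, where A is
            # the first symbol of W1 (the previously decoded segment).
            entry = w + w[0]
        else:
            raise ValueError("invalid LZW code: {}".format(c))
        output.append(entry)
        # Add the previous segment plus the first letter of the new segment.
        dictionary[len(dictionary)] = w + entry[0]
        w = entry
    return "".join(output)


print(lzw_decode([0, 0, 1, 3, 2, 4, 7, 1, 2, 1], "abc"))
print(lzw_decode([0, 3], "abc"))  # address 3 is used before it is filled
```

The second call exercises the problem case: the text a a a is encoded as 0, 3, and address 3 (aa) does not exist in the decoder yet when it arrives.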
Shannon-Fano coding: Suppose that we have a source with M symbols. Every symbol $u_i$ occurs with probability $P(u_i)$. We try to encode symbol $u_i$ with $n_i = \lceil -\log_2 P(u_i) \rceil$ bits. Then the average representation length is $L = \sum_i P(u_i)\, n_i$.
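code realization: Define the ordering $P(u_0) \ge P(u_1) \ge \dots \ge P(u_{M-1})$.
continued: Define $Q(u_0) = 0$ and $Q(u_i) = \sum_{k=0}^{i-1} P(u_k)$. The codeword for $u_i$ is the binary expansion of $Q(u_i)$, truncated to length $n_i$. Property: the code is a prefix code with the promised length. Proof: let $i \ge k+1$. Then $Q(u_i) - Q(u_k) \ge P(u_k) \ge 2^{-n_k}$.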
continued
1. The binary radix-2 representations of $Q(u_i)$ and $Q(u_k)$ differ in at least one of the first $n_k$ positions.
2. The codewords for $Q(u_i)$ and $Q(u_k)$ have lengths $n_i \ge n_k$.
3. The truncated representation of $Q(u_k)$ can therefore never be a prefix of the codeword for $u_i$.
Example: $(P(u_0), P(u_1), \dots, P(u_7))$ = (5/16, 3/16, 1/8, 1/8, 3/32, 1/16, 1/16, 1/32)
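The code words for this distribution can be worked out with a short sketch. It assumes the construction described above (lengths $\lceil -\log_2 P(u_i) \rceil$ and code words taken from the binary expansion of the cumulative probability of the preceding symbols); the function name `shannon_code` is illustrative, and exact fractions are used to avoid rounding problems.

```python
from fractions import Fraction


def shannon_code(probs):
    """Code words for probabilities given in non-increasing order."""
    codewords = []
    Q = Fraction(0)  # cumulative probability of the preceding symbols
    for p in probs:
        # n = smallest integer with 2**(-n) <= p, i.e. ceil(-log2 p)
        n = 0
        while Fraction(1, 2 ** n) > p:
            n += 1
        # Take the first n digits of the binary expansion of Q.
        bits = []
        frac = Q
        for _ in range(n):
            frac *= 2
            bit = int(frac >= 1)
            bits.append(str(bit))
            frac -= bit
        codewords.append("".join(bits))
        Q += p
    return codewords


probs = [Fraction(5, 16), Fraction(3, 16), Fraction(1, 8), Fraction(1, 8),
         Fraction(3, 32), Fraction(1, 16), Fraction(1, 16), Fraction(1, 32)]
for p, word in zip(probs, shannon_code(probs)):
    print(p, word)
```

Under these assumptions the lengths come out as 2, 3, 3, 3, 4, 4, 4, 5 and the code words as 00, 010, 100, 101, 1100, 1101, 1110, 11111, which is a prefix code.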