Algorithms in the Real World


Algorithms in the Real World (296.3): Data Compression, Lectures 1 and 2

Compression in the Real World
Generic file compression
  Files: gzip (LZ77), bzip (Burrows-Wheeler), BOA (PPM)
  Archivers: ARC (LZW), PKZip (LZW+)
  File systems: NTFS
Communication
  Fax: ITU-T Group 3 (run-length + Huffman)
  Modems: V.42bis protocol (LZW), MNP5 (run-length + Huffman)
  Virtual connections

Compression in the Real World
Multimedia
  Images: gif (LZW), jbig (context), jpeg-ls (residual), jpeg (transform + RL + arithmetic)
  TV: HDTV (mpeg-4)
  Sound: mp3
Other structures
  Indexes: google, lycos
  Meshes (for graphics): edgebreaker
  Graphs
  Databases

Compression Outline
  Introduction: lossless vs. lossy, model and coder, benchmarks
  Information Theory: entropy, etc.
  Probability Coding: Huffman + arithmetic coding
  Applications of Probability Coding: PPM + others
  Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
  Other Lossless Algorithms: Burrows-Wheeler
  Lossy algorithms for images: JPEG, MPEG, ...
  Compressing graphs and meshes: BBK

Encoding/Decoding
We will use "message" in a generic sense to mean the data to be compressed. The pipeline is: Input Message → Encoder → Compressed Message → Decoder → Output Message. The encoder and decoder need to agree on a common compressed format.

Lossless vs. Lossy
Lossless: input message = output message
Lossy: input message ≠ output message
Lossy does not necessarily mean loss of quality. In fact the output could be "better" than the input:
  Drop random noise in images (dust on the lens)
  Drop background in music
  Fix spelling errors in text; put it into better form
Writing is the art of lossy text compression.

How much can we compress?
For lossless compression, assuming all input messages are valid, if even one string is compressed, some other string must expand (a pigeonhole argument: a lossless code is an injective map, so it cannot shorten every string).

Model vs. Coder
To compress we need a bias on the probability of messages; the model determines this bias. The encoder is a model followed by a coder: Messages → Model → Probabilities → Coder → Bits.
Example models:
  Simple: character counts, repeated strings
  Complex: models of a human face

Quality of Compression
Runtime vs. compression ratio vs. generality. There are several standard corpora for comparing algorithms, e.g. the Calgary Corpus: 2 books, 5 papers, 1 bibliography, 1 collection of news articles, 3 programs, 1 terminal session, 2 object files, 1 set of geophysical data, and 1 black-and-white bitmap image. The Archive Comparison Test maintains a comparison of just about all publicly available algorithms.

Comparison of Algorithms

Compression Outline
  Introduction: lossy vs. lossless, benchmarks, ...
  Information Theory: entropy, conditional entropy, entropy of the English language
  Probability Coding: Huffman + arithmetic coding
  Applications of Probability Coding: PPM + others
  Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
  Other Lossless Algorithms: Burrows-Wheeler
  Lossy algorithms for images: JPEG, MPEG, ...
  Compressing graphs and meshes: BBK

Information Theory
An interface between modeling and coding.
  Entropy: a measure of information content
  Conditional entropy: information content based on a context
  Entropy of the English language: how much information does each character in "typical" English text contain?

Entropy (Shannon 1948)
For a set of messages S with probability p(s), s ∈ S, the self information of s is
  i(s) = log2(1/p(s)) = -log2 p(s),
measured in bits if the log is base 2. The lower the probability, the higher the information. Entropy is the weighted average of self information:
  H(S) = Σ_{s∈S} p(s) log2(1/p(s)).

Entropy Example
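The worked example on this slide was a figure and is not reproduced here. As a stand-in, here is a minimal Python sketch (my own illustration, not taken from the slides) computing self information and entropy for an assumed distribution p(S) = {.25, .25, .25, .125, .125}:

    import math

    def self_information(p):
        # i(s) = -log2 p(s): rarer messages carry more information
        return -math.log2(p)

    def entropy(probs):
        # H(S) = sum over s of p(s) * log2(1/p(s))
        return sum(p * self_information(p) for p in probs)

    probs = [0.25, 0.25, 0.25, 0.125, 0.125]   # assumed example distribution
    print(entropy(probs))                      # 2.25 bits per message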

Conditional Entropy
The conditional probability p(s|c) is the probability of s in a context c. The conditional self information is
  i(s|c) = log2(1/p(s|c)),
which can be either more or less than the unconditional information. The conditional entropy is the weighted average of the conditional self information:
  H(S|C) = Σ_{c∈C} p(c) Σ_{s∈S} p(s|c) log2(1/p(s|c)).

Example of a Markov Chain .1 p(b|w) p(w|w) w p(b|b) b .9 .8 p(w|b) .2 296.3

Entropy of the English Language
How can we measure the information per character?
  ASCII code = 7
  Entropy = 4.5 (based on character probabilities)
  Huffman codes (average) = 4.7
  Unix Compress = 3.5
  Gzip = 2.6
  Bzip = 1.9
  Entropy = 1.3 (for "text compression test")
The entropy of the English language must be less than 1.3 bits per character.

Shannon's experiment
Shannon asked humans to predict the next character given the whole previous text, and used the resulting guesses as conditional probabilities to estimate the entropy of the English language. He tabulated the number of guesses required for the right answer, and from the experiment he predicted H(English) = 0.6 to 1.3 bits per character.

Compression Outline
  Introduction: lossy vs. lossless, benchmarks, ...
  Information Theory: entropy, etc.
  Probability Coding: prefix codes and their relationship to entropy, Huffman codes, arithmetic codes
  Applications of Probability Coding: PPM + others
  Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
  Other Lossless Algorithms: Burrows-Wheeler
  Lossy algorithms for images: JPEG, MPEG, ...
  Compressing graphs and meshes: BBK

Assumptions and Definitions
Communication (or a file) is broken up into pieces called messages. Each message comes from a message set S = {s1, ..., sn} with a probability distribution p(s). Probabilities must sum to 1; the set can be infinite.
Code C(s): a mapping from a message set to codewords, each of which is a string of bits.
Message sequence: a sequence of messages.
Note: adjacent messages might be of different types and come from different probability distributions.

Discrete or Blended
We will consider two types of coding:
  Discrete: each message is coded as a fixed set of bits (Huffman coding, Shannon-Fano coding), e.g. messages 1, 2, 3, 4 coded separately as 01001, 11, 011, 0001.
  Blended: bits can be "shared" among messages (arithmetic coding), e.g. messages 1, 2, 3, and 4 coded together as the single string 010010111010.
Warning: different books and authors use different terminology; here is what I will be using.

Uniquely Decodable Codes
A variable length code assigns a bit string (codeword) of variable length to every message value, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence of bits 1011? Is it aba, ca, or ad? A uniquely decodable code is a variable length code in which every bit string can be uniquely decomposed into codewords.

Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another codeword, e.g. a = 0, b = 110, c = 111, d = 10. All prefix codes are uniquely decodable.

Prefix Codes: as a tree
A prefix code can be viewed as a binary tree with message values at the leaves and 0s and 1s on the edges. For a = 0, b = 110, c = 111, d = 10: taking 0 as the left edge and 1 as the right edge, a sits at path 0, d at path 10, and b and c at paths 110 and 111.

Some Prefix Codes for Integers
Later it will be very important to come up with coding schemes for the integers that give shorter codes to smaller integers. Imagine, for example, wanting to specify small errors from an expectation. Many such fixed prefix codes exist: Golomb, phased-binary, subexponential, ...
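The slide's table of specific integer codes was a figure. As one standard example of such a scheme (my illustration, not necessarily the code shown on the slide), here is the Elias gamma code, which spends fewer bits on smaller integers:

    def elias_gamma(n):
        # Elias gamma code for n >= 1: (len-1) zeros, then n in binary
        assert n >= 1
        binary = bin(n)[2:]              # binary representation, starts with '1'
        return "0" * (len(binary) - 1) + binary

    for n in [1, 2, 3, 4, 9]:
        print(n, elias_gamma(n))
    # prints: 1 1 / 2 010 / 3 011 / 4 00100 / 9 0001001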

Average Length
For a code C with associated probabilities p(c), the average length is defined as
  la(C) = Σ_{c∈C} p(c) l(c),
where l(c) is the length of the codeword c (a positive integer). We say that a prefix code C is optimal if for all prefix codes C', la(C) ≤ la(C').
What does average length mean? If we send a very large number of messages from the distribution, it is the average length (in bits) of those messages.

Relationship to Entropy
Theorem (lower bound): For any probability distribution p(S) with associated uniquely decodable code C,
  H(S) ≤ la(C).
Theorem (upper bound): For any probability distribution p(S) with associated optimal prefix code C,
  la(C) ≤ H(S) + 1.
The upper bound can actually be made tighter if the largest probability is small. We are going to prove the upper bound.

Kraft-McMillan Inequality
Theorem (Kraft-McMillan): For any uniquely decodable binary code C,
  Σ_{c∈C} 2^(-l(c)) ≤ 1.
Also, for any set of lengths L such that
  Σ_{l∈L} 2^(-l) ≤ 1,
there is a prefix code C with exactly those codeword lengths.
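A minimal sketch (my own, not from the slides) of both directions: checking the inequality for a set of codeword lengths, and constructing a prefix code from lengths that satisfy it by handing out codewords in order of increasing length:

    def kraft_sum(lengths):
        # sum of 2^(-l) over all codeword lengths
        return sum(2.0 ** -l for l in lengths)

    def prefix_code_from_lengths(lengths):
        # canonical construction: assign codewords in order of increasing length
        assert kraft_sum(lengths) <= 1.0
        codes, code, prev_len = [], 0, 0
        for l in sorted(lengths):
            code <<= (l - prev_len)          # extend to the new length
            codes.append(format(code, "0{}b".format(l)))
            code += 1                        # next codeword at this length
            prev_len = l
        return codes

    print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']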

Proof of the Upper Bound (Part 1)
Assign each message a length l(s) = ⌈log2(1/p(s))⌉. We then have
  Σ_{s∈S} 2^(-⌈log2(1/p(s))⌉) ≤ Σ_{s∈S} 2^(-log2(1/p(s))) = Σ_{s∈S} p(s) = 1,
so by the Kraft-McMillan inequality there is a prefix code with lengths l(s).

Proof of the Upper Bound (Part 2)
Now we can calculate the average length given l(s):
  la(S) = Σ p(s) l(s) = Σ p(s) ⌈log2(1/p(s))⌉ ≤ Σ p(s) (1 + log2(1/p(s))) = 1 + Σ p(s) log2(1/p(s)) = 1 + H(S),
and we are done.

Another property of optimal codes
Theorem: If C is an optimal prefix code for the probabilities {p1, ..., pn}, then pi > pj implies l(ci) ≤ l(cj).
Proof (by contradiction): Assume l(ci) > l(cj) and consider switching the codewords ci and cj. If la is the average length of the original code, the length of the new code is
  la' = la + pj (l(ci) - l(cj)) + pi (l(cj) - l(ci)) = la - (pi - pj)(l(ci) - l(cj)) < la,
which contradicts the optimality of C. This fact is important in the proof of optimality of Huffman codes.

Huffman Codes
Invented by Huffman as a class assignment in 1950. Used in many, if not most, compression algorithms: gzip, bzip, jpeg (as an option), fax compression, ...
Properties:
  Generates optimal prefix codes
  Cheap to generate codes
  Cheap to encode and decode
  la = H if the probabilities are powers of 2
A really good example of how great theory meets great practice, and also an example of how it takes many years to make it into practice and how the basic method is improved over the years. It should give you hope that you can invent something great as a student. There are other codes: Shannon-Fano (in fact, Fano was the instructor of the course). Shannon-Fano codes guarantee that l(c) ≤ ⌈log2(1/p(c))⌉; Huffman codes do not.

Huffman Codes
Huffman Algorithm (a code sketch follows below):
  Start with a forest of trees, each consisting of a single vertex corresponding to a message s with weight p(s).
  Repeat until one tree is left:
    Select the two trees with minimum weight roots p1 and p2.
    Join them into a single tree by adding a root with weight p1 + p2.
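A minimal Python sketch of this algorithm (my own illustration, consistent with but not taken from the slides), using a heap for the forest and a tie-breaking counter so heap comparisons stay deterministic:

    import heapq

    def huffman_codes(probs):
        # probs: dict mapping message -> probability; returns message -> codeword
        heap = [(p, i, (s,)) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        tick = len(heap)
        codes = {s: "" for s in probs}
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)    # two minimum-weight roots
            p2, _, t2 = heapq.heappop(heap)
            for s in t1:                       # joining adds one bit to every
                codes[s] = "0" + codes[s]      # leaf under the new root
            for s in t2:
                codes[s] = "1" + codes[s]
            heapq.heappush(heap, (p1 + p2, tick, t1 + t2))
            tick += 1
        return codes

    print(huffman_codes({"a": .1, "b": .2, "c": .2, "d": .5}))
    # {'a': '110', 'b': '111', 'c': '10', 'd': '0'}: same lengths as the
    # slide's a=000, b=001, c=01, d=1, just a different 0/1 labeling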

Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
  Step 1: join a(.1) and b(.2) into a tree of weight .3.
  Step 2: join (.3) and c(.2) into a tree of weight .5.
  Step 3: join (.5) and d(.5) into the final tree of weight 1.0.
Reading the 0/1 labels off the edges gives a = 000, b = 001, c = 01, d = 1.

Encoding and Decoding
Encoding: start at the leaf of the Huffman tree and follow the path to the root; reverse the order of the bits and send.
Decoding: start at the root of the Huffman tree and take the branch for each bit received; when at a leaf, output the message and return to the root.
There are even faster methods that can process 8 or 32 bits at a time. Any ideas how the faster methods work? It was these faster ways that eventually made Huffman codes really practical.
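A sketch (mine, not the slides') of the bit-at-a-time decoder described above, building the tree from the codeword table of the earlier example:

    def build_tree(codes):
        # nested dicts as a binary tree; leaves hold the message
        root = {}
        for msg, word in codes.items():
            node = root
            for bit in word[:-1]:
                node = node.setdefault(bit, {})
            node[word[-1]] = msg
        return root

    def decode(bits, root):
        out, node = [], root
        for bit in bits:
            node = node[bit]                  # take the branch for this bit
            if not isinstance(node, dict):    # reached a leaf
                out.append(node)
                node = root                   # return to the root
        return "".join(out)

    tree = build_tree({"a": "000", "b": "001", "c": "01", "d": "1"})
    print(decode("0010001", tree))            # "bad"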

Huffman codes are "optimal"
Theorem: The Huffman algorithm generates an optimal prefix code.
Proof outline: induction on the number of messages n (base case: n = 2, or n = 1?). Consider a message set S with n+1 messages.
  The two least probable messages m1 and m2 of S are neighbors in the Huffman tree.
  Replace the two least probable messages with one message of probability p(m1) + p(m2), making a new message set S'.
  Any optimal prefix code for S can be transformed so that m1 and m2 are neighbors.
  The cost of an optimal code for S is the cost of an optimal code for S' plus p(m1) + p(m2).
  The Huffman code for S' is optimal by induction.
  The cost of the Huffman code for S is the cost of the optimal code for S' plus p(m1) + p(m2).
The full proof is in the handout; it is a neat proof and you should definitely go through it.

Problem with Huffman Coding
Consider a message with probability .999. The self information of this message is
  i = -log2(.999) ≈ .00144 bits.
If we were to send 1000 such messages we might hope to use 1000 * .00144 ≈ 1.44 bits. Using Huffman codes we require at least one bit per message, so we would need 1000 bits.
It might seem like we cannot do better than this. Assuming Huffman codes, how could we improve this? Assuming there is only one possible other message (with probability .001), what would the expected length be for sending 1000 messages picked from this distribution? (About 10 bits + 1.44 bits = 11.44 bits.)
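A quick check of these numbers (my own arithmetic, not from the slides):

    import math

    p_common, p_rare, n = 0.999, 0.001, 1000
    info = -math.log2(p_common)              # self information of the common message
    print(info * n)                          # ~1.44 bits of actual information
    # expected information for 1000 draws from the two-message distribution:
    print(n * (p_common * -math.log2(p_common) + p_rare * -math.log2(p_rare)))
    # ~11.4 bits, versus >= 1000 bits with one Huffman codeword per message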

Arithmetic Coding: Introduction
Allows "blending" of bits in a message sequence; it only requires about 3 bits for the example above. The total number of bits required can be bounded in terms of the sum of the self information of the messages (the bound is made precise later). Used in PPM, JPEG/MPEG (as an option), zip (PPMd). More expensive than Huffman coding, but an integer implementation is not too bad.

Arithmetic Coding: message intervals
Assign each message of a probability distribution an interval in the range from 0 (inclusive) to 1 (exclusive), using the cumulative probabilities f. e.g. for a = .2, b = .5, c = .3 we have f(a) = .0, f(b) = .2, f(c) = .7, so a = [0, .2), b = [.2, .7), c = [.7, 1.0). The interval for a particular message will be called the message interval (e.g. for b the interval is [.2, .7)).
We are going to make a distinction between message, sequence, and code intervals; it is important to keep them straight. I will also give different meanings to p(i) and pi, and it is important to keep these straight.

Arithmetic Coding: sequence intervals
Code a message sequence by composing intervals. For example, with the distribution above, coding bac: b gives [.2, .7), a then narrows it to [.2, .3), and c narrows that to [.27, .3). The final interval [.27, .3) is called the sequence interval. If the notation gets confusing, the intuition is still clear, as shown by this example.

Arithmetic Coding: sequence intervals
To code a sequence of messages with probabilities pi and cumulative probabilities fi (i = 1..n), use the following recurrences:
  l1 = f1 and s1 = p1,
  li = l(i-1) + fi * s(i-1) and si = s(i-1) * pi,
where li is the bottom of the interval and si its size after the i-th message. Each message narrows the interval by a factor of pi, so the final interval size is sn = p1 * p2 * ... * pn. How does the interval size relate to the probability of that message sequence?
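A sketch in Python (my own) of these recurrences, reproducing the bac example:

    # message -> (f, p): cumulative probability and probability
    DIST = {"a": (0.0, 0.2), "b": (0.2, 0.5), "c": (0.7, 0.3)}

    def sequence_interval(msgs, dist=DIST):
        l, s = 0.0, 1.0
        for m in msgs:
            f, p = dist[m]
            l = l + f * s     # bottom moves up by f_i * current size
            s = s * p         # size shrinks by a factor of p_i
        return l, s

    l, s = sequence_interval("bac")
    print(l, l + s)           # 0.27 0.3 (up to floating-point rounding)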

Warning
Three types of interval:
  message interval: interval for a single message
  sequence interval: composition of message intervals
  code interval: interval for a specific code used to represent a sequence interval (discussed later)

Uniquely defining an interval
Important property: the sequence intervals for distinct message sequences of length n never overlap. Therefore, specifying any number in the final interval uniquely determines the sequence. Decoding is similar to encoding, but on each step we need to determine what the message value is and then reduce the interval. Why will they never overlap? How should we pick a number in the interval, and how should we represent that number?

Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3: .49 falls in b's message interval [.2, .7), so the first message is b. Within [.2, .7) the sub-intervals are a = [.2, .3), b = [.3, .55), c = [.55, .7); .49 falls in [.3, .55), so the second message is b. Within [.3, .55) the sub-intervals are a = [.3, .35), b = [.35, .475), c = [.475, .55); .49 falls in [.475, .55), so the third message is c. The message is bbc. Basically we are running the coding example, but instead of paying attention to the symbol names we pay attention to the number.
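A matching decoder sketch (mine, not the slides'), which rescales the number into [0, 1) after each step instead of narrowing the interval:

    DIST = {"a": (0.0, 0.2), "b": (0.2, 0.5), "c": (0.7, 0.3)}   # (f, p) per message

    def decode_number(x, n, dist=DIST):
        out = []
        for _ in range(n):
            # find the message interval [f, f + p) containing x
            for m, (f, p) in dist.items():
                if f <= x < f + p:
                    out.append(m)
                    x = (x - f) / p      # rescale x into [0, 1) and repeat
                    break
        return "".join(out)

    print(decode_number(0.49, 3))        # "bbc"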

Representing Fractions
We use the binary fractional representation: the bit string .b1 b2 b3 ... represents the number Σ bi 2^(-i) (e.g. .101 = 5/8). So how about just using the shortest binary fractional representation that lies in the sequence interval, e.g. [0, .33) → .0, [.33, .66) → .1, [.66, 1) → .11? But what if you receive a 1? Should we wait for another 1? This is not a prefix code.

Representing an Interval
We can view a binary fractional number as an interval by considering all of its completions: a code b of length k corresponds to the interval [b, b + 2^(-k)), e.g. .101 corresponds to [.625, .75). We will call this the code interval. We have now mentioned all three kinds of intervals: message, sequence, and code. Can you give an intuition for the lemma on the next slide?

Code Intervals: example
Try the code [0, .33) → .0, [.33, .66) → .1, [.66, 1) → .11. The corresponding code intervals are .0 → [0, .5), .1 → [.5, 1), and .11 → [.75, 1); the last two overlap, and indeed .1 is a prefix of .11. Note that if code intervals overlap then one code is a prefix of the other.
Lemma: If a set of code intervals do not overlap then the corresponding codes form a prefix code.

Selecting the Code Interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval. We can use the fraction l + s/2 truncated to
  ⌈-log2(s/2)⌉ = 1 + ⌈-log2 s⌉ bits.
Recall that s is the size of the sequence interval; truncating to this many bits moves the number down by less than s/2, and the code interval it defines has size at most s/2, so the code interval stays inside [l, l + s). e.g. for the sequence interval [.61, .79): l + s/2 = .70 = .1011..., and the code .101 shown on the slide has code interval [.625, .75), which is contained in [.61, .79).
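A small sketch (my own) of this selection rule. Note it emits the full 1 + ⌈-log2 s⌉ = 4 bits for the [.61, .79) example, whereas the slide's figure shows the shorter code .101, which also happens to fit; the bit count here is just a sufficient bound:

    import math

    def select_code(l, s):
        # truncate l + s/2 to 1 + ceil(-log2 s) bits
        bits = 1 + math.ceil(-math.log2(s))
        value = math.floor((l + s / 2) * 2 ** bits) / 2 ** bits
        code = format(int(value * 2 ** bits), "0{}b".format(bits))
        return code, value, value + 2 ** -bits     # code and its code interval

    print(select_code(0.61, 0.18))
    # ('1011', 0.6875, 0.75): code interval [.6875, .75) sits inside [.61, .79)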

Selecting a code interval: example
With three-bit codes: [0, .33) → .001, [.33, .66) → .100, [.66, 1) → .110. e.g. for [.33..., .66...): l = .33..., s = .33..., l + s/2 = .5 = .1000..., which truncated to 3 bits is .100. Is this the best we can do for [0, .33)?

RealArith Encoding and Decoding
RealArithEncode: determine l and s using the original recurrences; code using l + s/2 truncated to 1 + ⌈-log2 s⌉ bits.
RealArithDecode: read bits as needed so that the code interval falls within a message interval, then narrow the sequence interval; repeat until n messages have been decoded.
Note that this requires that you know the number of messages n in the sequence; this needs to be predetermined or sent as a header.

Bound on Length
Theorem: For n messages with self information {s1, ..., sn}, RealArithEncode will generate at most
  2 + Σ_{i=1..n} si
bits.
Proof: The number of bits is 1 + ⌈-log2 s⌉ (from the previous slide, based on truncating to ⌈-log2(s/2)⌉ bits), and
  1 + ⌈-log2 s⌉ ≤ 2 - log2 s = 2 - log2(Π pi) = 2 + Σ (-log2 pi) = 2 + Σ si.
Note that s is overloaded here (size of the sequence interval, and self information of a message); I apologize.

Integer Arithmetic Coding
The problem with RealArithCode is that operations on arbitrary precision real numbers are expensive: s gets very small and requires many bits of precision, as many as the size of the final code. (If the input probabilities are binary floats, one could use extended floats; if they are rationals, arbitrary precision rationals.) Key ideas of the integer version:
  Keep integers in the range [0..R) where R = 2^k
  Use rounding to generate an integer sequence interval
  Whenever the sequence interval falls into the top, bottom, or middle half, expand the interval by a factor of 2
This integer algorithm is an approximation of the real algorithm.

Integer Arithmetic Coding: the probability distribution as integers
Probabilities are kept as counts, e.g. c(a) = 11, c(b) = 7, c(c) = 30. T is the sum of the counts, e.g. 48 (11 + 7 + 30). Partial sums f are as before: f(a) = 0, f(b) = 11, f(c) = 18. Require that R > 4T so that probabilities do not get rounded to zero. With R = 256 and T = 48, the counts partition [0, T) as a = [0, 11), b = [11, 18), c = [18, 48).

Integer Arithmetic (contracting)
Start with l0 = 0 and u0 = R - 1 (e.g. R = 256). Each message contracts [l, u] to the sub-range corresponding to that message's counts, using integer (floor) arithmetic, and the resulting [li, ui] ranges for distinct messages do not overlap. A sketch of the update follows below.
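The contraction formulas on this slide were a figure. The following is the standard integer update (my reconstruction; the slide's exact form may differ slightly), using the counts from the previous slide:

    R = 256
    T = 48
    # message -> (f, c): partial sum and count, as on the previous slide
    COUNTS = {"a": (0, 11), "b": (11, 7), "c": (18, 30)}

    def contract(l, u, msg):
        # narrow [l, u] to the sub-range for msg, using floor arithmetic
        f, c = COUNTS[msg]
        size = u - l + 1
        new_l = l + (size * f) // T
        new_u = l + (size * (f + c)) // T - 1
        return new_l, new_u

    l, u = 0, R - 1
    for m in "ab":
        l, u = contract(l, u, m)
    print(l, u)      # 13 20 for the sequence "ab"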

Integer Arithmetic (scaling)
If l ≥ R/2 (the interval is in the top half): output 1 followed by m 0s, set m = 0, and scale the message interval by expanding it by a factor of 2.
If u < R/2 (in the bottom half): output 0 followed by m 1s, set m = 0, and scale.
If l ≥ R/4 and u < 3R/4 (in the middle half): increment m and scale.
The hard part is when you keep narrowing down on the middle.
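A hedged sketch (mine; the precise expansion formulas were shown on the slide as a figure) of the three scaling cases, applied repeatedly until the interval is wide again:

    R = 256

    def scale(l, u, m, out):
        # expand [l, u] whenever it falls in the top, bottom, or middle half
        while True:
            if l >= R // 2:                       # top half: emit 1, then m zeros
                out.append("1" + "0" * m); m = 0
                l, u = 2 * l - R, 2 * u - R + 1
            elif u < R // 2:                      # bottom half: emit 0, then m ones
                out.append("0" + "1" * m); m = 0
                l, u = 2 * l, 2 * u + 1
            elif l >= R // 4 and u < 3 * R // 4:  # middle half: remember with m
                m += 1
                l, u = 2 * l - R // 2, 2 * u - R // 2 + 1
            else:
                return l, u, m

    out = []
    print(scale(200, 230, 0, out), out)   # (32, 155, 0) ['1', '1']: two top-half expansions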