15-583: Algorithms in the Real World
Data Compression I
  Introduction
  Information Theory
  Probability Coding

Compression in the Real World
Generic file compression
  files: gzip (LZ77), bzip (Burrows-Wheeler), BOA (PPM)
  archivers: ARC (LZW), PKZip (LZW+)
  file systems: NTFS
Communication
  Fax: ITU-T Group 3 (run-length + Huffman)
  Modems: V.42bis protocol (LZW), MNP5 (RL + Huffman)
  Virtual connections

Multimedia
  Images: gif (LZW), jbig (context), jpeg-ls (residual), jpeg (transform + RL + arithmetic)
  TV: HDTV (mpeg-4)
  Sound: mp3

Compression Outline
  Introduction: lossy vs. lossless, benchmarks, ...
  Information Theory: entropy, etc.
  Probability Coding: Huffman and arithmetic coding
  Applications of Probability Coding: PPM and others
  Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
  Other Lossless Algorithms: Burrows-Wheeler
  Lossy algorithms for images: JPEG, fractals, ...
  Lossy algorithms for sound: MP3, ...

Encoding/Decoding
We will use “message” in a generic sense to mean the data to be compressed.

  Input Message -> Encoder -> Compressed Message -> Decoder -> Output Message

The encoder and decoder need to agree on a common compressed format.

Lossless vs. Lossy
Lossless: Input message = Output message
Lossy: Input message ≠ Output message
Lossy does not necessarily mean loss of quality. In fact the output could be “better” than the input:
  Drop random noise in images (dust on lens)
  Drop background in music
  Fix spelling errors in text; put it into better form
Writing is the art of lossy text compression.

How much can we compress?
For lossless compression, assuming all input messages are valid, if even one string is compressed, some other must expand. (Counting argument: a lossless code is a one-to-one map on bit strings, and there are exactly 2^k strings of each length k, so if some string maps to a shorter string, another string must be pushed to a longer one.)

Model vs. Coder
To compress we need a bias on the probability of messages. The model determines this bias.
Example models:
  Simple: character counts, repeated strings
  Complex: models of a human face
The encoder is split in two:
  Messages -> Model -> Probabilities -> Coder -> Bits
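To make the model/coder split concrete, here is a minimal sketch (not from the slides): a zero-order adaptive model based on character counts, and a stand-in for the coder that simply accumulates the ideal code length -log2 p(s) per message. The class and function names, and the 256-symbol smoothing, are assumptions made for illustration.

```python
import math
from collections import defaultdict

class CountModel:
    """Zero-order adaptive model: probabilities from the character counts seen so far."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.total = 0

    def prob(self, symbol):
        # Add-one smoothing over an assumed 256-symbol alphabet, so unseen
        # symbols still get a nonzero probability.
        return (self.counts[symbol] + 1) / (self.total + 256)

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

def ideal_code_length(message, model):
    """Total of -log2 p(symbol): the bits an ideal coder would emit given this model."""
    bits = 0.0
    for ch in message:
        bits += -math.log2(model.prob(ch))
        model.update(ch)
    return bits

print(ideal_code_length("abracadabra" * 20, CountModel()))  # well below 8 bits per character
```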

Quality of Compression
Runtime vs. compression ratio vs. generality.
Several standard corpora are used to compare algorithms.
Calgary Corpus: 2 books, 5 papers, 1 bibliography, 1 collection of news articles, 3 programs, 1 terminal session, 2 object files, 1 set of geophysical data, 1 black-and-white bitmap image.
The Archive Comparison Test maintains a comparison of just about all publicly available algorithms.

Comparison of Algorithms

Information Theory
An interface between modeling and coding.
  Entropy: a measure of information content
  Conditional entropy: information content based on a context
  Entropy of the English language: how much information does each character in “typical” English text contain?

Entropy (Shannon 1948)
For a set of messages S with probability p(s), s ∈ S, the self information of s is
  i(s) = log2(1/p(s)) = -log2 p(s)
measured in bits if the log is base 2. The lower the probability, the higher the information.
Entropy is the weighted average of self information:
  H(S) = Σ_{s ∈ S} p(s) log2(1/p(s))

Entropy Example
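The example values on this slide were not captured in the transcript. As a stand-in, a small sketch that computes self information and entropy for an assumed distribution:

```python
import math

def self_information(p):
    """i(s) = log2(1/p(s)), in bits."""
    return math.log2(1.0 / p)

def entropy(probs):
    """H(S) = sum over s of p(s) * log2(1/p(s))."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]                  # assumed distribution
print([self_information(p) for p in probs])        # [1.0, 2.0, 3.0, 3.0] bits
print(entropy(probs))                              # 1.75 bits per message
```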

Conditional Entropy
The conditional probability p(s|c) is the probability of s in a context c. The conditional self information is
  i(s|c) = log2(1/p(s|c))
The conditional information can be either more or less than the unconditional information.
The conditional entropy is the average of the conditional self information:
  H(S|C) = Σ_{c ∈ C} p(c) Σ_{s ∈ S} p(s|c) log2(1/p(s|c))

Example of a Markov Chain
A two-state chain with states w and b, and transition probabilities p(w|w), p(b|w), p(w|b), p(b|b) labeling the edges (figure not reproduced in the transcript).
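The transition probabilities on the slide were not captured, so the numbers below are assumptions. The sketch computes the stationary distribution of the two-state chain and its conditional entropy H(S|C), which comes out lower than the unconditional entropy of the state distribution:

```python
import math

# Assumed transition probabilities for the two-state chain (states w and b).
p = {('w', 'w'): 0.9, ('b', 'w'): 0.1,   # leaving state w
     ('w', 'b'): 0.2, ('b', 'b'): 0.8}   # leaving state b

# Stationary distribution: balance condition pi_w * p(b|w) = pi_b * p(w|b).
pi_w = p[('w', 'b')] / (p[('w', 'b')] + p[('b', 'w')])
pi_b = 1.0 - pi_w

def H(probs):
    return sum(q * math.log2(1.0 / q) for q in probs if q > 0)

# Conditional entropy = average over contexts of the entropy within each context.
h_cond = pi_w * H([p[('w', 'w')], p[('b', 'w')]]) + pi_b * H([p[('w', 'b')], p[('b', 'b')]])
print(pi_w, pi_b)                      # 0.666..., 0.333...
print(h_cond, H([pi_w, pi_b]))         # about 0.55 vs. 0.92 bits per symbol
```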

Entropy of the English Language
How can we measure the information per character?
  ASCII code = 7 bits
  Entropy = 4.5 bits (based on character probabilities)
  Huffman codes (average) = 4.7 bits
  Unix compress = 3.5 bits
  gzip = 2.5 bits
  BOA = 1.9 bits (currently close to the best text compressors)
The true entropy must be less than 1.9 bits per character.

Shannon’s Experiment
Shannon asked humans to predict the next character given all of the previous text, and used the results as conditional probabilities to estimate the entropy of the English language. He recorded the distribution of the number of guesses required for the right answer (table not reproduced in the transcript).
From the experiment he estimated H(English) = 0.6 to 1.3 bits per character.

Coding
How do we use the probabilities to code messages?
  Prefix codes and their relationship to entropy
  Huffman codes
  Arithmetic codes
  Implicit probability codes

Assumptions
The communication (or file) is broken up into pieces called messages. Adjacent messages might be of different types and come from different probability distributions.
We will consider two types of coding:
  Discrete: each message is a fixed set of bits (Huffman coding, Shannon-Fano coding)
  Blended: bits can be “shared” among messages (arithmetic coding)

Uniquely Decodable Codes
A variable length code assigns a bit string (codeword) of variable length to every message value, e.g. a = 1, b = 01, c = 101, d = 011.
What if you get the sequence of bits 1011? Is it aba, ca, or ad?
A uniquely decodable code is a variable length code in which every bit string can be uniquely decomposed into codewords.
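A small sketch (not from the slides) that enumerates every way the bit string 1011 can be parsed under the code above, confirming the three readings and hence that this code is not uniquely decodable:

```python
code = {'a': '1', 'b': '01', 'c': '101', 'd': '011'}

def decodings(bits, code):
    """All messages whose concatenated codewords equal `bits`."""
    if not bits:
        return ['']
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            results += [symbol + rest for rest in decodings(bits[len(word):], code)]
    return results

print(decodings('1011', code))   # ['aba', 'ad', 'ca']
```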

Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another codeword, e.g. a = 0, b = 110, c = 111, d = 10.
A prefix code can be viewed as a binary tree with the message values at the leaves and 0s and 1s on the edges (here: 0 leads to the leaf a, 10 to d, 110 to b, and 111 to c).
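Because no codeword is a prefix of another, bits can be decoded greedily left to right: the moment the accumulated bits match a codeword, that codeword is the only possible one. A minimal sketch (mine, not from the slides) using the code above:

```python
code = {'a': '0', 'b': '110', 'c': '111', 'd': '10'}
inverse = {word: symbol for symbol, word in code.items()}

def decode(bits):
    out, current = [], ''
    for bit in bits:
        current += bit
        if current in inverse:       # safe: no codeword extends into another codeword
            out.append(inverse[current])
            current = ''
    assert current == '', 'input ended in the middle of a codeword'
    return ''.join(out)

print(decode('011010111'))           # 'abdc'
```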

Some Prefix Codes for Integers
Many other fixed prefix codes: Golomb, phased-binary, subexponential, ...
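The slide's table of integer codes was not captured in the transcript. As one concrete member of the Golomb family mentioned above, here is a sketch of a Golomb-Rice code with divisor M = 2^k; the parameter choice and function name are my own:

```python
def rice_encode(n, k):
    """Golomb-Rice code for n >= 0 with M = 2**k (k >= 1): the quotient n >> k
    in unary (that many 1 bits, then a terminating 0), then the k-bit remainder."""
    q, r = n >> k, n & ((1 << k) - 1)
    return '1' * q + '0' + format(r, '0{}b'.format(k))

print([rice_encode(n, 2) for n in range(6)])
# ['000', '001', '010', '011', '1000', '1001'] -- prefix-free, shorter codes for small n
```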

Average Length
For a code C with associated probabilities p(c), the average length is defined as
  la(C) = Σ_{c ∈ C} p(c) l(c)
where l(c) is the length of codeword c.
We say that a prefix code C is optimal if for all prefix codes C’, la(C) ≤ la(C’).

Relationship to Entropy
Theorem (lower bound): For any probability distribution p(S) with associated uniquely decodable code C,
  H(S) ≤ la(C)
Theorem (upper bound): For any probability distribution p(S) with associated optimal prefix code C,
  la(C) ≤ H(S) + 1
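A quick numeric check of the two bounds, using the prefix code a = 0, b = 110, c = 111, d = 10 from the Prefix Codes slide and an assumed probability distribution (not from the slides). Because every probability here is a power of two, the lower bound is met exactly:

```python
import math

code  = {'a': '0', 'b': '110', 'c': '111', 'd': '10'}
probs = {'a': 0.5, 'b': 0.125, 'c': 0.125, 'd': 0.25}    # assumed distribution

H  = sum(p * math.log2(1.0 / p) for p in probs.values())  # entropy of the distribution
la = sum(probs[s] * len(code[s]) for s in code)            # average codeword length

print(H, la)                      # 1.75 1.75
assert H <= la < H + 1            # the lower and upper bounds from this slide
```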

Kraft-McMillan Inequality
Theorem (Kraft-McMillan): For any uniquely decodable code C,
  Σ_{c ∈ C} 2^(-l(c)) ≤ 1
Also, for any set of lengths L such that
  Σ_{l ∈ L} 2^(-l) ≤ 1
there is a prefix code C of the same size such that l(ci) = li (i = 1, ..., |L|).
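A sketch of both directions (the codeword-assignment scheme is my own, in the style of canonical codes, and not from the slides): checking the sum for a given set of lengths, and building a prefix code from any lengths whose sum is at most 1:

```python
def kraft_sum(lengths):
    """Sum of 2^(-l) over the codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    """Assign codewords in order of increasing length; the running counter,
    shifted left whenever the length grows, guarantees the prefix property."""
    assert kraft_sum(lengths) <= 1.0
    codewords, next_code, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_code <<= (l - prev_len)
        codewords.append(format(next_code, '0{}b'.format(l)))
        next_code += 1
        prev_len = l
    return codewords

print(kraft_sum([1, 2, 3, 3]))                 # 1.0
print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```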

Proof of the Upper Bound (Part 1)
Assign to each message s the length
  l(s) = ⌈log2(1/p(s))⌉
We then have
  Σ_{s ∈ S} 2^(-⌈log2(1/p(s))⌉) ≤ Σ_{s ∈ S} 2^(-log2(1/p(s))) = Σ_{s ∈ S} p(s) = 1
So by the Kraft-McMillan inequality there is a prefix code with lengths l(s).

Proof of the Upper Bound (Part 2)
Now we can calculate the average length given l(s):
  la(C) = Σ_{s ∈ S} p(s) l(s)
        = Σ_{s ∈ S} p(s) ⌈log2(1/p(s))⌉
        < Σ_{s ∈ S} p(s) (1 + log2(1/p(s)))
        = 1 + Σ_{s ∈ S} p(s) log2(1/p(s))
        = 1 + H(S)
And we are done.

Another Property of Optimal Codes
Theorem: If C is an optimal prefix code for the probabilities {p1, ..., pn}, then pi < pj implies l(ci) ≥ l(cj).
Proof (by contradiction): Assume l(ci) < l(cj). Consider switching codewords ci and cj. If la is the average length of the original code, the length of the new code is
  la' = la + pj(l(ci) - l(cj)) + pi(l(cj) - l(ci))
      = la + (pj - pi)(l(ci) - l(cj))
      < la
This is a contradiction, since it means la was not optimal.

Huffman Codes
Invented by Huffman as a class assignment in 1950.
Used in many, if not most, compression algorithms: gzip, bzip, jpeg (as an option), fax compression, ...
Properties:
  Generates optimal prefix codes
  Cheap to generate codes
  Cheap to encode and decode
  la = H if the probabilities are powers of 2

Huffman Codes: the Huffman Algorithm
Start with a forest of trees, each consisting of a single vertex corresponding to a message s and with weight p(s).
Repeat until only one tree remains:
  Select the two trees with minimum weight roots p1 and p2.
  Join them into a single tree by adding a root with weight p1 + p2.

Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
  Step 1: join a(.1) and b(.2) into a tree of weight .3
  Step 2: join (.3) and c(.2) into a tree of weight .5
  Step 3: join (.5) and d(.5) into the final tree of weight 1.0
Resulting code: a = 000, b = 001, c = 01, d = 1
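A sketch of the algorithm run on this example, using a binary heap for the minimum-weight selection. Tie-breaking may assign the 0/1 labels differently from the slide, but the codeword lengths (3, 3, 2, 1) come out the same:

```python
import heapq
from itertools import count

def huffman(probs):
    """Return (codes, tree). Leaves are 1-tuples (symbol,); internal nodes are
    2-tuples (left_subtree, right_subtree)."""
    ties = count()                               # keeps the heap from comparing trees
    heap = [(p, next(ties), (sym,)) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)          # two minimum-weight roots
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(ties), (t1, t2)))
    tree = heap[0][2]

    codes = {}
    def walk(node, prefix):
        if len(node) == 1:                       # leaf: record its codeword
            codes[node[0]] = prefix or '0'
        else:
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
    walk(tree, '')
    return codes, tree

codes, tree = huffman({'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5})
print(codes)   # codeword lengths 3, 3, 2, 1 (labels may differ from a=000, b=001, c=01, d=1)
```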

Encoding and Decoding
Encoding: start at the leaf of the Huffman tree for the message and follow the path to the root; reverse the order of the bits and send.
Decoding: start at the root of the Huffman tree and take the branch for each bit received; when a leaf is reached, output its message and return to the root.
(The slide shows the tree from the previous example.) There are even faster methods that can process 8 or 32 bits at a time.
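Continuing the sketch above, a minimal encoder and decoder. Encoding here looks codewords up in a table (equivalent to the slide's leaf-to-root walk with the bits reversed); decoding walks the tree from the root as the slide describes. The `codes` and `tree` values are restated so the snippet stands alone:

```python
# One possible labeling produced by the Huffman sketch above.
codes = {'d': '0', 'c': '10', 'a': '110', 'b': '111'}
tree  = (('d',), (('c',), (('a',), ('b',))))

def encode(message, codes):
    return ''.join(codes[sym] for sym in message)

def decode(bits, tree):
    out, node = [], tree
    for bit in bits:
        node = node[int(bit)]        # 0 = left subtree, 1 = right subtree
        if len(node) == 1:           # reached a leaf
            out.append(node[0])
            node = tree              # restart at the root for the next codeword
    return ''.join(out)

bits = encode('dcba', codes)
print(bits, decode(bits, tree))      # 010111110 dcba
```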

Huffman Codes Are Optimal
Theorem: The Huffman algorithm generates an optimal prefix code.