An introduction to Data Compression

General information
Requirements: some programming skills (not that much...), knowledge of data structures, ... and some work!
Office hours: ... please write me an email, monfardini@dii.unisi.it

What is compression?
Intuitively, compression is a method "to press something into a smaller space". In our domain a better definition is "to make information shorter".

Some basic questions
What is information? How can we measure the amount of information? Why is compression useful? How do we compress? How much can we compress?

What is information? - I
Commonly, the term information refers to the knowledge of some fact, circumstance or thought. For example, when reading a newspaper, the news is the information.
Syntax: letters, punctuation marks, white spaces, grammar rules, ...
Semantics: the meaning of the words and of the sentences.

What is information? - II
In our domain, information is merely the syntax, i.e. we are interested in the symbols of the alphabet used to express the information. In order to give a mathematical definition of information we need some principles of Information Theory.

The fundamental concept
A key concept in Information Theory is that information is conveyed by randomness. How much information does a biased coin whose outcome is always heads give us? What about another biased coin whose outcome is heads with 90% probability? We need a way to measure the amount of information quantitatively, in some mathematical sense.

The Uncertainty - I
Suppose we have a discrete random variable X and x is a particular outcome with probability p(x). The uncertainty (self-information) of x is h(x) = log(1/p(x)) = -log p(x). The units are given by the base of the logarithm: base 2 → bits, base e → nats.
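For illustration, a minimal Python sketch of this definition (the function name and example probabilities are mine, not from the slides):

```python
import math

def self_information(p: float, base: float = 2) -> float:
    """Uncertainty h(x) = -log_base p(x) of an outcome with probability p."""
    return -math.log(p, base)

# A fair coin carries 1 bit per outcome; a 90%-heads coin carries much less for "heads".
print(self_information(0.5))   # 1.0 bit
print(self_information(0.9))   # ~0.152 bits
print(self_information(0.1))   # ~3.32 bits
```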

The Uncertainty - II
Suppose the random variable has two outcomes, 0 and 1. If P(0) = P(1) = 1/2, each outcome carries h = -log2(1/2) = 1 bit of information. If instead P(0) = 1, the outcome 0 gives no information at all, while if the outcome is 1 the information -log2 0 would be infinite.

The Entropy
More useful is the entropy of a random variable X with values in a space S:
H(X) = Σ_{x∈S} p(x) log(1/p(x)) = -Σ_{x∈S} p(x) log p(x)
The entropy is a measure of the average uncertainty of the random variable.
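A small sketch of the definition (function name and example distributions are illustrative):

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum p(x) log p(x), skipping zero-probability outcomes (0 log 0 = 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Four equally likely symbols -> 2 bits/symbol; a skewed source -> less.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.357
```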

The entropy - examples
Consider again a r.v. with only two possible outcomes, 0 and 1, with P(1) = p and P(0) = 1 - p. In this case H = -p log2 p - (1-p) log2(1-p): it is 0 for p = 0 or p = 1 and reaches its maximum of 1 bit at p = 1/2.
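A quick numerical check of the two-outcome case (the sampled values of p are chosen for illustration):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2(1-p), with 0 log 0 taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))   # peaks at 1.0 bit when p = 0.5
```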

Compression and loss
Lossless: the decompressed message (file) is an exact copy of the original. Useful for text compression.
Lossy: some information is lost in the decompressed message (file). Useful for image and sound compression.
We ignore lossy compression for a while.

Definitions - I
A source code C for a r.v. X is a mapping from the range of X to D*, the set of finite-length strings over a D-ary alphabet. C(x) is the codeword for x, and l(x) is the length of C(x).

Definitions - II
Non-singular code (... trivial ...): every element of the range of X is mapped to a different string in D*, i.e. x ≠ x' implies C(x) ≠ C(x').
Extension of a code: C(x1 x2 ... xn) = C(x1)C(x2)...C(xn), the concatenation of the codewords.
Uniquely decodable code: a code whose extension is non-singular.

Definitions - III
Prefix (better: prefix-free) or instantaneous code: no codeword is a prefix of any other codeword. The advantage is that decoding needs no look-ahead. For example, with codewords a = 11 and b = 110, after reading "11..." the decoder cannot tell yet whether it has seen an a or the start of a b.
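A minimal sketch of why prefix-free decoding needs no look-ahead (the example codebook is mine, chosen to be prefix-free):

```python
def decode_prefix_code(bits: str, codebook: dict) -> str:
    """Decode a bit string with a prefix-free code, emitting a symbol as soon as
    the current buffer matches a codeword (no look-ahead needed)."""
    reverse = {cw: sym for sym, cw in codebook.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in reverse:          # safe only because no codeword is a prefix of another
            out.append(reverse[buf])
            buf = ""
    if buf:
        raise ValueError("truncated or invalid input")
    return "".join(out)

code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # prefix-free
print(decode_prefix_code("010110111", code))            # -> "abcd"
```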

Examples
Four codes for the symbols {1, 2, 3, 4} illustrate the hierarchy: Code 1 is singular; Code 2 is non-singular, but not uniquely decodable; Code 3 is uniquely decodable, but not instantaneous; Code 4 is instantaneous.

Kraft Inequality - I
Theorem (Kraft Inequality). For any instantaneous code over an alphabet of size D, the codeword lengths l_1, l_2, ..., l_m must satisfy Σ_i D^(-l_i) ≤ 1. Conversely, given a set of codeword lengths that satisfies this inequality, there exists an instantaneous code with these word lengths.
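A tiny sketch of checking the inequality for a set of candidate lengths (the example lengths are illustrative):

```python
def kraft_sum(lengths, D=2):
    """Sum of D^(-l) over the codeword lengths; an instantaneous code with these
    lengths exists iff this sum is <= 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> e.g. the prefix code {0, 10, 110, 111}
print(kraft_sum([1, 1, 2]))      # 1.25 -> no binary instantaneous code has these lengths
```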

Kraft Inequality - II
Consider a complete D-ary tree: at level k there are D^k nodes, and a node at level l has D^(k-l) descendants at level k. (Figure: a complete binary tree with levels 0, 1, 2, 3.)

Kraft Inequality - III
Proof (direct part). Consider a D-ary tree (not necessarily complete) representing the codewords: each path down the tree is a sequence of symbols, and each leaf (with its unique path) is a codeword. Let l_max be the length of the longest codeword. A codeword of length l_i, being a leaf, implies that its D^(l_max - l_i) would-be descendants at level l_max are missing from the tree.

Kraft Inequality - IV
The total number of possible nodes at level l_max is D^(l_max). The descendant sets of distinct codewords are disjoint, so summing over all codewords: Σ_i D^(l_max - l_i) ≤ D^(l_max). Dividing by D^(l_max): Σ_i D^(-l_i) ≤ 1.

Kraft Inequality - V
Proof (converse). Suppose (without loss of generality) that codewords are ordered by length, i.e. l_1 ≤ l_2 ≤ ... ≤ l_m. Consider a D-ary tree and start assigning each codeword to a node, beginning with l_1; each assigned node becomes a leaf and its descendants are excluded. For a generic codeword i with length l_i, consider the set K of codewords with length ≤ l_i, except i itself. Suppose there is no available node at level l_i; that is, the nodes excluded by the codewords in K cover the whole level: Σ_{j∈K} D^(l_i - l_j) ≥ D^(l_i).

Kraft Inequality - VI
But this means that Σ_{j∈K} D^(-l_j) ≥ 1. Then Σ_j D^(-l_j) ≥ 1 + D^(-l_i) > 1, which contradicts the hypothesis that the lengths satisfy the Kraft inequality: absurd. So an available node always exists, and the resulting tree represents an instantaneous code with the desired codeword lengths.
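The converse is constructive. A compact sketch under my own assumptions (binary alphabet only, canonical-style assignment of codewords; the function name and example lengths are invented):

```python
def prefix_code_from_lengths(lengths):
    """Build a binary prefix-free code with the given codeword lengths,
    which is possible whenever the Kraft sum is <= 1."""
    assert sum(2 ** (-l) for l in lengths) <= 1, "lengths violate the Kraft inequality"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    code, prev_len = 0, lengths[order[0]]
    for i in order:
        l = lengths[i]
        code <<= (l - prev_len)              # append zeros when moving to a longer length
        codewords[i] = format(code, f"0{l}b")
        code += 1
        prev_len = l
    return codewords

print(prefix_code_from_lengths([2, 1, 3, 3]))   # ['10', '0', '110', '111']
```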

Models and coders
(Diagram: text → encoder → compressed text → decoder → text, with the same model feeding both encoder and decoder.)
The model supplies the probabilities of the symbols (or of groups of symbols, as we will see later). The coder encodes and decodes starting from these probabilities.

Good modeling is crucial
What happens if the true probabilities of the symbols to be coded are p(x), but we use estimates q(x)? Simply, the compressed text will be longer, i.e. the average number of bits/symbol will be greater. The extra cost in bits/symbol can be computed from the two probability mass functions p and q; it is known as the relative entropy, D(p || q) = Σ_x p(x) log2(p(x)/q(x)).
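A small sketch of that cost (the two example distributions are mine):

```python
import math

def relative_entropy(p, q):
    """D(p||q) = sum p(x) log2(p(x)/q(x)): the extra bits/symbol paid for coding
    a source with true distribution p using a code designed for q."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]          # true probabilities (illustrative)
q = [1/3, 1/3, 1/3]            # probabilities assumed by the model
print(relative_entropy(p, q))  # ~0.085 extra bits per symbol
```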

Finite-context models
In English text the probability of a letter depends on the letters that precede it (a 'u' is unlikely on its own, but almost certain right after a 'q'). A finite-context model of order m uses the previous m symbols to make the prediction. Better modeling, but we need to estimate many more probabilities.
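A rough sketch of estimating an order-m model from raw counts (the helper name and sample text are invented for illustration):

```python
from collections import Counter, defaultdict

def order_m_model(text: str, m: int):
    """Estimate P(symbol | previous m symbols) from counts in a sample text."""
    counts = defaultdict(Counter)
    for i in range(m, len(text)):
        context, symbol = text[i - m:i], text[i]
        counts[context][symbol] += 1
    return {ctx: {s: n / sum(c.values()) for s, n in c.items()} for ctx, c in counts.items()}

model = order_m_model("the quick brown fox jumps over the lazy dog. the queen quit.", 1)
print(model.get("q"))   # after 'q', 'u' dominates
print(model.get("t"))   # after 't', 'h' is likely
```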

Finite-state models
(Diagram: a two-state model; in state 1, P(a) = 0.5 and P(b) = 0.5; in state 2, P(a) = 0.99 and P(b) = 0.01.)
Although potentially more powerful (e.g. they can model whether an odd or even number of a's have occurred consecutively), they are not so popular. Obviously the decoder uses the same model, so encoder and decoder are always in the same state.
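A hedged sketch of the two-state model: the per-state probabilities come from the diagram, while the transition rule (toggle state on 'a', to track the parity of a's) is my assumption for illustration.

```python
# State probabilities from the slide; the transition rule below is assumed.
PROBS = {1: {"a": 0.5, "b": 0.5}, 2: {"a": 0.99, "b": 0.01}}

def symbol_probabilities(text: str):
    """Yield the model's probability for each symbol; encoder and decoder run
    this same state machine, so they always agree on the current state."""
    state = 1
    for sym in text:
        yield sym, PROBS[state][sym]
        if sym == "a":                      # assumed transition: an 'a' flips the state
            state = 2 if state == 1 else 1

print(list(symbol_probabilities("aabab")))
```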

Static models
A model is static if we set up a reasonable probability distribution and use it for all the texts to be coded. Poor performance in case of different kinds of sources (English text, financial data, ...). One solution is to have K different models and to send the index of the model used ... but cf. the book Gadsby by E. V. Wright, a novel written without the letter 'e', which defeats any "typical English" model.

Adaptive models
In order to solve the problems of static modeling, adaptive (or dynamic) models begin with a bland probability distribution, which is refined as more symbols of the text become known. The encoder and the decoder start from the same initial distribution and apply the same rules to alter it. There can also be adaptive models of order m > 0.
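A minimal sketch of the idea (the class name and the uniform "bland" start are illustrative choices):

```python
from collections import Counter

class AdaptiveModel:
    """Adaptive order-0 model: start from a bland (uniform) distribution and refine
    the counts after every coded symbol. Encoder and decoder apply the same update
    rule, so their distributions never diverge."""
    def __init__(self, alphabet):
        self.counts = Counter({s: 1 for s in alphabet})   # bland start: all symbols equal

    def probability(self, symbol):
        return self.counts[symbol] / sum(self.counts.values())

    def update(self, symbol):
        self.counts[symbol] += 1                           # refine after coding the symbol

model = AdaptiveModel("ab")
for sym in "aaab":
    print(sym, round(model.probability(sym), 3))
    model.update(sym)
```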

The zero-frequency problem
The situation in which a symbol is predicted with probability zero must be avoided, as such a symbol could not be coded. One solution: increase the total symbol count by 1 and divide this 1/total probability among all unseen symbols. Another solution: add 1 to the count of every symbol. Many more solutions exist... Which is the best? If the text is sufficiently long, the compression is similar.
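A small sketch of the "add 1 to every count" solution (function name and sample data are mine):

```python
from collections import Counter

def smoothed_probability(symbol, seen: Counter, alphabet):
    """Add-one smoothing: no symbol of the alphabet ever gets probability zero."""
    total = sum(seen.values()) + len(alphabet)
    return (seen[symbol] + 1) / total

seen = Counter("abracadabra")          # 'z' never occurred in the text so far
print(smoothed_probability("a", seen, "abcdefghijklmnopqrstuvwxyz"))  # ~0.162
print(smoothed_probability("z", seen, "abcdefghijklmnopqrstuvwxyz"))  # ~0.027, small but nonzero
```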

Symbolwise and dictionary models
The set of all possible symbols of a source is called the alphabet. Symbolwise models provide an estimated probability for each symbol in the alphabet. Dictionary models instead replace substrings of the text with codewords that identify each substring in a collection, called the dictionary or codebook.
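A toy sketch of the dictionary idea (the codebook and sample text are invented): frequent substrings are replaced by short references into a shared codebook.

```python
codebook = ["the ", "compress", "ion", "data "]

def dictionary_encode(text: str):
    out, i = [], 0
    while i < len(text):
        for idx, phrase in enumerate(codebook):
            if text.startswith(phrase, i):
                out.append(idx)                 # emit a reference into the codebook
                i += len(phrase)
                break
        else:
            out.append(text[i])                 # literal symbol, not in the codebook
            i += 1
    return out

print(dictionary_encode("the data compression"))   # [0, 3, 1, 2]
```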