Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa

Uniquely Decodable Codes A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence 1011? It parses as c·a (101·1), as a·d (1·011), and even as a·b·a (1·01·1), so this code is ambiguous. A code is uniquely decodable if every encoded sequence can be decomposed into codewords in exactly one way.
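To see the ambiguity concretely, here is a minimal sketch (mine, not the lecture's) that enumerates every way of parsing 1011 under this code:

```python
# Enumerate all decompositions of a bit string into codewords of the example
# code; more than one parse means the code is not uniquely decodable.
CODE = {"a": "1", "b": "01", "c": "101", "d": "011"}

def parses(bits, prefix=()):
    """Yield every decomposition of `bits` into codewords."""
    if not bits:
        yield prefix
        return
    for sym, cw in CODE.items():
        if bits.startswith(cw):
            yield from parses(bits[len(cw):], prefix + (sym,))

print(list(parses("1011")))  # [('a','b','a'), ('a','d'), ('c','a')]: three parses
```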

Prefix Codes A prefix code is a variable-length code in which no codeword is a prefix of another one, e.g. a = 0, b = 100, c = 101, d = 11. Every prefix code is uniquely decodable, and it can be viewed as a binary trie with the symbols at its leaves: branch on 0/1 at each node, so that each root-to-leaf path spells a codeword.
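Decoding then amounts to walking the trie one edge per bit, restarting from the root at every leaf. A minimal sketch (an assumption of mine, not the lecture's code):

```python
# Decode a prefix code by walking a binary trie; codewords sit at the leaves.
CODE = {"a": "0", "b": "100", "c": "101", "d": "11"}

def build_trie(code):
    root = {}
    for sym, cw in code.items():
        node = root
        for bit in cw:
            node = node.setdefault(bit, {})
        node["sym"] = sym                  # leaf: store the decoded symbol
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if "sym" in node:                  # reached a leaf: emit and restart
            out.append(node["sym"])
            node = root
    return "".join(out)

print(decode("010011101", build_trie(CODE)))  # 0|100|11|101 -> "abdc"
```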

Average Length For a code C with codeword lengths L[s], the average length is defined as L_a(C) = Σ_s p(s) · L[s]. Example: p(A) = .7 with the 1-bit codeword 0, and p(B) = p(C) = p(D) = .1 with 3-bit codewords, gives L_a = .7 · 1 + .3 · 3 = 1.6 bits (Huffman achieves 1.5 bits). We say that a prefix code C is optimal if for all prefix codes C', L_a(C) ≤ L_a(C').
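A quick check of that arithmetic (the 3-bit codewords for B, C, D are the ones suggested by the slide's [1--] pattern):

```python
# Average codeword length La = sum_s p(s) * L[s] for the slide's example.
p = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
L = {"A": 1, "B": 3, "C": 3, "D": 3}      # bits per codeword
La = sum(p[s] * L[s] for s in p)
print(La)                                  # 0.7*1 + 3*(0.1*3) = 1.6
```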

Entropy (Shannon, 1948) For a source S emitting symbols with probability p(s), the self-information of s is i(s) = log₂(1/p(s)) bits: lower probability ⇒ higher information. Entropy is the weighted average of i(s): H(S) = Σ_s p(s) · log₂(1/p(s)). The 0-th order empirical entropy of a string T uses the empirical frequencies n_s/|T| in place of p(s): H₀(T) = Σ_s (n_s/|T|) · log₂(|T|/n_s). It holds 0 ≤ H ≤ log₂|Σ|: H → 0 on skewed distributions, H is maximum on the uniform distribution.
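A minimal sketch computing H₀(T) straight from this definition:

```python
# 0-th order empirical entropy H0(T): n_c = #occurrences of symbol c, n = |T|.
from collections import Counter
from math import log2

def h0(T):
    n = len(T)
    return sum((nc / n) * log2(n / nc) for nc in Counter(T).values())

print(h0("aaaaaaabbcd"))   # skewed string: well below the maximum log2(4) = 2
print(h0("abcdabcdabcd"))  # uniform over 4 symbols: exactly 2.0
```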

Performance: Compression ratio Compression ratio = #bits in output / #bits in input. Compression performance: Shannon's theory compares the average codeword length against the entropy H; in practice we compare the compression ratio against the empirical entropy. Example: p(A) = .7, p(B) = p(C) = p(D) = .1 gives H ≈ 1.36 bits, while Huffman uses ≈ 1.5 bits per symbol. An optimal code is surely one that…
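The gap can be checked directly. A minimal sketch (standard heap-based Huffman, an assumption of mine, not the lecture's code) comparing H with Huffman's average codeword length on this distribution:

```python
# Build Huffman codeword lengths with a min-heap and compare the average
# length La against the entropy H; Shannon guarantees H <= La < H + 1.
import heapq
from math import log2

def huffman_lengths(probs):
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    tiebreak = len(heap)                   # keeps tuple comparison off the lists
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1                # merged symbols sink one level deeper
        heapq.heappush(heap, (p1 + p2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths

p = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
L = huffman_lengths(p)
H = sum(pi * log2(1 / pi) for pi in p.values())
La = sum(p[s] * L[s] for s in p)
print(f"H = {H:.2f}, Huffman La = {La:.2f}")  # H = 1.36, Huffman La = 1.50
```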

Index construction: Compression of postings Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 5.3 and a paper

γ code for integer encoding x > 0: the first field writes Length = ⌊log₂ x⌋ + 1 in unary, i.e. Length − 1 zeros, and the second field is the binary representation of x; e.g., 9 is represented as 000 1001. The γ code for x takes 2⌊log₂ x⌋ + 1 bits (i.e., a factor of 2 from optimal). It is optimal for Pr(x) = 1/(2x²), and i.i.d. integers.
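A minimal sketch of γ-coding over plain Python bitstrings (an illustrative implementation, not the lecture's):

```python
# gamma(x): (|binary(x)| - 1) zeros followed by the binary representation of x.
def gamma_encode(x):
    assert x > 0
    b = bin(x)[2:]                         # binary(x), msb first
    return "0" * (len(b) - 1) + b          # 2*floor(log2 x) + 1 bits total

def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":              # unary part: count the leading zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode(gamma_encode(9) + gamma_encode(1)))   # [9, 1]
```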

It is a prefix-free encoding… Given the following sequence of γ-coded integers, reconstruct the original sequence:

δ code for integer encoding Use γ-coding to reduce the length of the first field: write γ(Length) followed by the binary representation of x without its msb. Useful for medium-sized integers; e.g., 19 = 10011₂ is represented as γ(5) followed by 0011, i.e. 00101 0011. The δ code for x takes about ⌊log₂ x⌋ + 2⌊log₂ log₂ x⌋ + 2 bits. It is optimal for Pr(x) = 1/(2x(log x)²), and i.i.d. integers.
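A minimal δ-encoder built on the γ-encoder above (again an illustrative sketch):

```python
# delta(x): gamma-code the length of binary(x), then append binary(x) minus
# its msb (the length already implies the leading 1).
def gamma_encode(x):
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def delta_encode(x):
    assert x > 0
    b = bin(x)[2:]
    return gamma_encode(len(b)) + b[1:]

print(delta_encode(19))   # gamma(5) + "0011" = "00101" + "0011" = "001010011"
```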

Variable-byte codes [10.2 bits per posting on TREC12] If we wish very fast (de)compression, we byte-align. Given the binary representation of an integer: prepend 0s to get a multiple-of-7 number of bits; form groups of 7 bits each; append to the last group the tag bit 0, and to the other groups the tag bit 1 (see the sketch below for a worked example). Note: we waste 1 bit per byte (and on average 4 bits in the first byte), but it is a prefix code, and it also encodes the value 0!
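A minimal sketch of this scheme (an illustrative implementation with my own example values, following the tagging convention above):

```python
# Variable-byte: 7 payload bits per byte; tag 1 on every group except the
# last, which carries tag 0 (the convention described on the slide).
def vbyte_encode(x):
    groups = []
    while True:
        groups.append(x & 0x7F)            # take the low 7 bits
        x >>= 7
        if x == 0:
            break
    groups.reverse()                       # most-significant group first
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

def vbyte_decode(data):
    out, x = [], 0
    for byte in data:
        x = (x << 7) | (byte & 0x7F)
        if not byte & 0x80:                # tag bit 0: the number is complete
            out.append(x)
            x = 0
    return out

enc = vbyte_encode(2**20) + vbyte_encode(5) + vbyte_encode(0)
print(vbyte_decode(enc))                   # [1048576, 5, 0] -- 0 works too
```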

PForDelta coding Pack a block of 128 numbers using b bits each (e.g., b = 2 gives 128 · 2 = 256 bits = 32 bytes); values that do not fit in b bits become exceptions, encoded separately (via an ESC value or a chain of pointers). Choose b so that ~90% of the values are encoded directly; the trade-off: b↑ wastes more bits on the common values, b↓ produces more exceptions. Data is first translated from [base, base + 2^b − 1] to [0, 2^b − 1].
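A much-simplified sketch of the idea (an assumption of mine: exceptions go verbatim into a side list, rather than through the ESC/pointer machinery of the real format):

```python
# Pick b to cover ~90% of the block, pack the fitting values in b bits each,
# and record the rest as (position, value) exceptions.
def pfor_encode(block, coverage=0.90):
    base = min(block)
    shifted = [v - base for v in block]            # translate to start at 0
    widths = sorted(v.bit_length() for v in shifted)
    b = max(1, widths[int(coverage * (len(block) - 1))])
    slots, exceptions = [], []
    for i, v in enumerate(shifted):
        if v < (1 << b):
            slots.append(v)                        # fits in b bits
        else:
            slots.append(0)                        # placeholder slot
            exceptions.append((i, v))
    return base, b, slots, exceptions              # slots get bit-packed on disk

print(pfor_encode([3, 5, 4, 6, 3, 1000, 4, 5]))
# (3, 2, [0, 2, 1, 3, 0, 0, 1, 2], [(5, 997)])
```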

A basic problem! Store a sequence of strings as T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#.... plus an array of pointers into T: log m bits per string (m = |T|), n log m bits overall, i.e. Space = 32 · n bits with machine-word pointers; we could then drop the separating NULL. This is independent of the string-length distribution and effective for few strings, but it is bad for medium/large sets of strings.

A basic problem! The slide's figure shows the alternatives: T with separators (Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#....), the numeric sequence 10#2#5#6#20#31#3#3#...., and X = AbacoBattleCarColdCodDefenseGoogleYahoo.... paired with a binary vector B whose 1s mark where each string begins (we could drop the msb). We aim at achieving ≈ n log(m/n) bits.

Rank/Select For a binary vector B, Rank_b(i) = number of occurrences of bit b in B[1, i], and Select_b(i) = position of the i-th occurrence of b in B; e.g., on the slide's example vector, Rank₁(6) = 2 and Select₁(3) = 8. Let m = |B| and n = #1s. There exist data structures that solve this problem in O(1) query time and n log(m/n) + o(m) bits of space.
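For intuition, here is a naive linear-time version (the O(1)-time structures precompute block/superblock counts instead); B = 01010001 is a hypothetical vector of mine, chosen to reproduce the slide's query values:

```python
# Naive rank/select over a bit string, 1-based as on the slide.
def rank1(B, i):
    """Number of 1s in B[1..i]."""
    return B[:i].count("1")

def select1(B, i):
    """Position of the i-th 1 in B."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        seen += bit == "1"
        if seen == i:
            return pos
    raise ValueError("B contains fewer than i ones")

B = "01010001"
print(rank1(B, 6), select1(B, 3))   # 2 8
```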

Elias-Fano, useful for Rank/Select Split the position of each 1 in B into its z high bits and w low bits (the slide's figure uses z = 3, w = 2). If w = log(m/n) and z = log n, where m = |B| and n = #1s, then L (the concatenated low parts) takes n log(m/n) bits, and H (the high parts, encoded in unary) takes 2n bits. Select₁ on B uses L and Select₁ on H, taking +o(n) space. Actually you can even do binary search over B, but compressed!
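A minimal sketch of the representation (an illustrative implementation: real ones bit-pack L and answer select on H in O(1) with o(n) extra bits):

```python
# Elias-Fano over the sorted positions of the 1s in B (0-based), |B| = m:
# low w bits go to L; high parts go to H as unary gaps (one 1 per element).
from math import ceil, log2

def ef_encode(positions, m):
    n = len(positions)
    w = max(1, ceil(log2(m / n)))          # low bits per element
    L = [p & ((1 << w) - 1) for p in positions]
    H, prev = [], 0
    for p in positions:
        h = p >> w                         # high part
        H.extend([0] * (h - prev) + [1])   # gap-encode it in unary
        prev = h
    return w, L, H                         # |H| ~ 2n bits when z = log n

def ef_access(i, w, L, H):
    """Return the i-th smallest position (0-based i), i.e. Select1(i+1) on B."""
    ones = -1
    for pos, bit in enumerate(H):
        ones += bit
        if ones == i:
            return ((pos - i) << w) | L[i]   # pos - i = zeros before this 1

w, L, H = ef_encode([1, 3, 7, 8, 12], m=16)
print([ef_access(i, w, L, H) for i in range(5)])   # [1, 3, 7, 8, 12]
```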

If you wish to play with Rank and Select: practical implementations take m/10 + n log(m/n) bits, answering Rank in 0.4 μsec and Select in < 1 μsec; for the binary-search (Elias-Fano) solution, compare its 2n + n log(m/n) bits, supporting only Select, vs the 32n bits of explicit pointers. A next lecture…