On Lossy Compression Paul Vitanyi CWI, University of Amsterdam, National ICT Australia Joint work with Kolya Vereshchagin

You can import music in a variety of formats, such as MP3 or AAC, and at whatever quality level you'd prefer. → Lossy compression
You can even choose the new Apple Lossless encoder. Music encoded with that option offers sound quality indistinguishable from the original CDs at about half the file size of the original. → Lossless compression

Lossy compression drives the Web. Pictures: JPEG. Sound: MP3. Video: MPEG. The majority of Web transfers are lossy-compressed data; HTTP traffic was overtaken by peer-to-peer music and video sharing in 2002.

Lena compressed by JPEG. [Figure: original Lena image (256 x 256 pixels, 24-bit RGB); JPEG compressed at a 43:1 ratio; JPEG2000 compressed at a 43:1 ratio.] As the comparison images show, at compression ratios above 40:1 the JPEG algorithm begins to lose its effectiveness, while the JPEG2000-compressed image shows very little distortion.

Rate-distortion theory underlies lossy compression. Claude Elwood Shannon defined rate distortion in 1948 and 1959. (Pictured with his learning mouse "Theseus".)

Rate distortion. X is a set of source words and Y is a set of code words. If |Y| < |X|, then no code can be faithful → distortion.

Distortion. Choose a distortion measure d: X × Y → real numbers. A source word x ∈ X is coded as a code word y ∈ Y, incurring distortion d(x,y). The distortion measures the fidelity of the coded version with respect to the source data.

Example distortion measures.
List distortion for bit rate R: the source word x is a finite binary string; the code word y is a finite set of source words containing x, and y can be described in ≤ R bits. Distortion d(x,y) = log |y| (rounded up to an integer).
Hamming distortion for bit rate R: the source word x and the code word y are binary strings of length n, and y can be described in ≤ R bits. Distortion d(x,y) = the number of bit flips between x and y.

Example distortion measures, continued.
Euclidean distortion for parameter R: the source word x is a real number; the code word y is a rational number that can be described in ≤ R bits. Distortion d(x,y) = |x - y|.
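As a concrete illustration of these three measures (not part of the original slides; the function names below are our own), a minimal sketch in Python:

```python
import math

def list_distortion(x: str, y: set) -> int:
    """List distortion: y is a finite set of strings containing x; d(x,y) = ceil(log2 |y|)."""
    assert x in y
    return math.ceil(math.log2(len(y)))

def hamming_distortion(x: str, y: str) -> int:
    """Hamming distortion: number of bit positions in which x and y differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def euclidean_distortion(x: float, y: float) -> float:
    """Euclidean distortion: absolute difference between the real x and the rational code y."""
    return abs(x - y)

print(hamming_distortion("10110", "10011"))          # 2 flipped bits
print(list_distortion("10110", {"10110", "10111"}))  # log2(2) = 1
print(euclidean_distortion(math.pi / 4, 0.75))       # about 0.035
```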

Distortion-rate function: minimal distortion as a function of a given rate R. Random source: x_1 x_2 ... x_n. Coding uses a sequence of codes c_1, c_2, ..., c_n from a prescribed code class: y_1 = c_1(x_1), y_2 = c_2(x_2), ..., y_n = c_n(x_n), with code length |y_1 y_2 ... y_n| ≤ nR bits. The distortion-rate function is

D(R) = lim_{n→∞} min ∑_{x_1 x_2 ... x_n} p(x_1 x_2 ... x_n) (1/n) ∑_{i=1}^{n} d(x_i, y_i),

where the minimum is over all code sequences c_1, c_2, ..., c_n satisfying the rate constraint.

Rate-distortion function: minimal rate as a function of the maximal allowed distortion D. Random source: x_1 x_2 ... x_n. Coding uses a sequence of codes c_1, c_2, ..., c_n from a prescribed code class: y_1 = c_1(x_1), y_2 = c_2(x_2), ..., y_n = c_n(x_n), with expected distortion ∑_{x_1 x_2 ... x_n} p(x_1 x_2 ... x_n) (1/n) ∑_{i=1}^{n} d(x_i, y_i) ≤ D. The rate-distortion function is

R(D) = lim_{n→∞} min ∑_{x_1 x_2 ... x_n} p(x_1 x_2 ... x_n) (1/n) ∑_{i=1}^{n} |y_i|,

where the minimum is over all code sequences c_1, c_2, ..., c_n satisfying the distortion constraint. Since D(R) is convex and nonincreasing, R(D) is its inverse.

Function graphs (rate-distortion curves for the example distortions):
- Hamming distortion: R(D) = n(1 - H(D/n)), where |x_i| = n and D is the expected number of bit flips.
- List distortion: R(D) = n - D, where |x_i| = n and D is the expected log-cardinality of the list (set).
- Euclidean distortion: R(D) = log 1/D, where x_i is a real in [0,1] and D is the expected distance between x_i and the rational code word y_i.
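A minimal numeric sketch of these three Shannon curves (our own illustration; H is the binary entropy function defined in the code):

```python
import math

def H(p: float) -> float:
    """Binary entropy in bits: H(p) = p log 1/p + (1-p) log 1/(1-p)."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_hamming(D: float, n: int) -> float:
    """R(D) = n(1 - H(D/n)) for Hamming distortion, 0 <= D <= n/2."""
    return n * (1 - H(D / n))

def rate_list(D: float, n: int) -> float:
    """R(D) = n - D for list distortion, D = expected log-cardinality."""
    return n - D

def rate_euclidean(D: float) -> float:
    """R(D) = log 1/D for Euclidean distortion on [0,1]."""
    return math.log2(1 / D)

n = 1000
for D in (50, 100, 250, 500):
    print(D, round(rate_hamming(D, n), 1), rate_list(D, n))
print(rate_euclidean(1 / 64))  # 6.0 bits suffice for distortion 1/64
```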

Problems with this approach:
- The functions give expectations, or at best a rate-distortion relation for a high-probability set of typical sequences.
- It is often assumed that the random source is stationary and ergodic (in order to be able to determine the curve).
- This is fine for data satisfying simple statistical properties, but not for complex data satisfying global relations, such as images, music, and video; such complex data are usually atypical.
- Just as lossless compression requires many tricks to compress meaningful data, so does lossy compression; there is a wealth of ad hoc theories and solutions for special application fields and problems.
Can we find a general theory for lossy compression of individual data?

Andrey Nikolaevich Kolmogorov (1903-1987, born in Tambov, Russia): measure theory, probability, analysis, intuitionistic logic, cohomology, dynamical systems, hydrodynamics, Kolmogorov complexity.

Background Kolmogorov complexity: Randomness of individual objects. First: A story of Dr. Samuel Johnson … Dr. Beattie observed, as something remarkable which had happened to him, that he chanced to see both No.1 and No.1000 hackney-coaches. “Why sir,” said Johnson “there is an equal chance for one’s seeing those two numbers as any other two.” Boswell’s Life of Johnson

Defining randomness: precursor ideas.
P. Laplace: a sequence is "extraordinary" (nonrandom) when it contains a rare "regularity".
Von Mises: a sequence is random if it has about the same number of 1s and 0s, and this also holds for its "reasonably" selected subsequences. But what is "reasonable"?
A. Wald: countably many selection functions.
A. Church: recursive selection functions.
J. Ville: von Mises-Wald-Church randomness is not good enough.

Kolmogorov complexity. Solomonoff (1960), Kolmogorov (1965), Chaitin (1969): the amount of information in a string is the size of the smallest program generating that string. Invariance theorem: it does not matter which universal Turing machine U we choose, i.e., all "encoding methods" are equivalent up to an additive constant.

Kolmogorov complexity:
- K(x) = the length of a shortest description of x.
- K(x|y) = the length of a shortest description of x given y.
- A string x is random if K(x) ≥ |x|.
- K(x) - K(x|y) is the information y knows about x.
- Theorem (symmetry of mutual information): K(x) - K(x|y) = K(y) - K(y|x), up to an additive logarithmic term.
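Kolmogorov complexity is not computable, but any real compressor gives an upper bound; this is the idea behind the "theory to practice" experiments later in the talk. A crude sketch (our own, using Python's zlib; the conditional version simply compresses y followed by x):

```python
import zlib

def C(x: bytes) -> int:
    """Crude upper bound on K(x): length in bits of the zlib-compressed string."""
    return 8 * len(zlib.compress(x, 9))

def C_cond(x: bytes, y: bytes) -> int:
    """Crude stand-in for K(x|y): extra bits needed for x once y is already known."""
    return max(C(y + x) - C(y), 0)

x = b"abab" * 1000   # highly regular: compresses to far fewer than 8 * 4000 bits
y = b"ab" * 2000     # the same data, written differently
print(C(x), C(y))
print(C_cond(x, y))                               # near 0: y leaves little to say about x
print(C(x) - C_cond(x, y), C(y) - C_cond(y, x))   # roughly symmetric "mutual information"
```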

Applications of Kolmogorov complexity:
- Mathematics: probability theory, logic.
- Physics: chaos, thermodynamics.
- Computer science.
- Biology: complex systems.
- Philosophy: randomness.
- Information theory: today's topic.

Individual rate-distortion. Given a datum x, a class of models (code words) Y = {y}, and a distortion measure d(x,y):
Rate-distortion function: r_x(d) = min_y {K(y) : d(x,y) ≤ d}.
Distortion-rate function: d_x(r) = min_y {d(x,y) : K(y) ≤ r}.

Individual characteristics give more detail, especially for meaningful (nonrandom) data. Example (list distortion): data x, y, z of length n with K(y) = n/2, K(x) = n/3, K(z) = n/9 have different individual curves. All of the more than (1 - 1/n)2^n data strings u of complexity n - log n ≤ K(u) ≤ n + O(log n) have individual rate-distortion curves approximately coinciding with Shannon's single curve; therefore the expected individual rate-distortion curve coincides with Shannon's curve (up to a small error). Those are typical data, essentially 'random' noise without meaning. The data with meaning that we may actually be interested in (music, text, pictures) are extraordinary (rare) and have regularities expressing that meaning, hence small Kolmogorov complexity and rate-distortion curves differing in size and shape from Shannon's.

Upper bound on the rate-distortion graph. For all data x the rate-distortion function is monotonic non-increasing, and:
r_x(d_max) ≤ K(y_0);
r_x(d) ≤ r_x(d') + log [α B(d')/B(d)] + O(small), for all d ≤ d'.
Here the cardinality of a ball is B_y(d) = |{x : d(x,y) ≤ d}|, the set of all data x within distortion d of the code word ('center') y; we often omit the center when it is understood. The set of source words X is a ball of radius d_max with center y_0. For all d ≤ d' such that B(d) > 0, every ball of radius d' in X can be covered by at most α B(d')/B(d) balls of radius d. [Figure: a ball of radius d' covered by balls of radius d ≤ d'.] For this to be useful, we require α to be polynomial in n, the number of bits in the datum x; this is satisfied for many distortion measures. It means that the function r_x(d) + log B(d) is monotonic non-decreasing, up to fluctuations of size O(log α).
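To make the covering step concrete, here is a small brute-force sketch (our own toy illustration, not the construction from the paper): for a modest n it greedily covers a Hamming ball of radius d' by Hamming balls of radius d and compares the number of balls used with the volume ratio B(d')/B(d).

```python
from itertools import combinations
from math import comb

n, d_small, d_big = 10, 1, 3   # toy parameters: cover a radius-3 ball by radius-1 balls

def ball(center: int, radius: int) -> set:
    """All n-bit strings (as ints) within Hamming distance `radius` of `center`."""
    pts = {center}
    for r in range(1, radius + 1):
        for flips in combinations(range(n), r):
            x = center
            for i in flips:
                x ^= 1 << i
            pts.add(x)
    return pts

big = ball(0, d_big)           # the ball of radius d' around the all-zero string
uncovered = set(big)
centers = []
while uncovered:               # greedy covering: any uncovered point becomes a new center
    c = uncovered.pop()
    centers.append(c)
    uncovered -= ball(c, d_small)

B_small = sum(comb(n, i) for i in range(d_small + 1))
print(len(centers), "small balls used; volume ratio B(d')/B(d) =", round(len(big) / B_small, 1))
```

On this toy instance the greedy cover uses a few times B(d')/B(d) balls, which illustrates why a polynomial overhead factor α is allowed in the bound.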

Lower bound on the rate-distortion graph: r_x(d) ≥ K(x) - log B(d) + O(small). Reason: given the center of the ball in r_x(d) bits, together with the value d in O(log d) bits, we can enumerate all B(d) elements of the ball and give the index of x among them in log B(d) bits, which yields a description of x.

Rate-distortion functions of every shape. Lemma: let r(d) be monotonic non-increasing with r(d_max) = 0, and let r(d) + log B(d) be monotonic non-decreasing. Then there is a datum x such that |r(d) - r_x(d)| ≤ O(small). That is, for every code class and distortion measure, every function between the lower and upper bounds is realized by some datum x, up to a small error and provided the function decreases at at least the proper slope.

Hamming distortion. Lemma: for n-bit strings, α = O(n^4), where D is the Hamming distance and the radius is d = D/n. That is, there is a cover of a ball of Hamming radius d' by O(n^4) B(d')/B(d) balls of Hamming radius d, for every d ≤ d'; as far as we know, this sparse covering of large Hamming balls by small Hamming balls is a new result.
Lemma: (i) log B(d) = nH(d) + O(log n), with d = D/n ≤ ½ and H(d) = d log 1/d + (1-d) log 1/(1-d); (ii) d_max = ½, i.e. D = n/2: every string is within n/2 bit flips of either a center y_0 or its bitwise complement.
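A quick numeric check of item (i) (our own illustration): compare the exact log-cardinality of a Hamming ball with the entropy approximation nH(d).

```python
from math import comb, log2

def log_ball(n: int, D: int) -> float:
    """Exact log2 of the number of n-bit strings within Hamming distance D of a fixed center."""
    return log2(sum(comb(n, i) for i in range(D + 1)))

def H(p: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

n = 1000
for D in (50, 100, 250, 500):
    print(D, round(log_ball(n, D), 1), round(n * H(D / n), 1))  # agree up to O(log n) bits
```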

Hamming distortion, continued. [Graph: the rate r_x(d) versus the distortion d = D/n; the rate axis runs from O(log n) up to n, with K(x) marked; upper bound n(1 - H(d)); lower bound K(x) - nH(d); the actual curve r_x(d) lies in between; the point where it meets the lower bound marks a minimum sufficient statistic.] At rate K(x) we can describe the datum x perfectly, with no distortion (d = D/n = 0). At distortion d = D/n = ½ we only need to specify the number of bits of x, in O(log n) bits. Every monotonic non-increasing function r(d) with r(½) = 0, such that r(d) + log B(d) is monotonic non-decreasing, that is, lying between the lower and upper bounds and descending at at least the proper slope, can be realized as the rate-distortion function of some datum x, with precision |r(d) - r_x(d)| ≤ O(√n log n) + K(r).
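A small sketch of the two bounding curves on this slide (our own illustration; the complexity K(x) is an assumed input, not something we can compute):

```python
from math import log2

def H(p: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def hamming_rd_bounds(n: int, K_x: float, d: float):
    """Upper and lower bounds on the individual rate-distortion curve r_x(d), 0 <= d <= 1/2."""
    upper = min(K_x, n * (1 - H(d)))   # describing x itself never takes more than K(x) bits
    lower = max(0.0, K_x - n * H(d))   # r_x(d) >= K(x) - log B(d), and log B(d) ~ nH(d)
    return upper, lower

n, K_x = 1000, 400                      # assumed: a fairly regular 1000-bit datum
for d in (0.0, 0.05, 0.1, 0.25, 0.5):
    print(d, hamming_rd_bounds(n, K_x, d))
```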

Theory to practice, using real compressors (joint work with Steven de Rooij). [Diagram: a datum x is encoded at increasing rates and the resulting distortion is measured, tracing out a rate-distortion curve.]
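The mouse and penguin experiments below approximate individual rate-distortion curves with a real compressor. A heavily simplified sketch of that idea for Hamming distortion on byte strings (our own, not the actual code behind the figures): candidate code words are copies of x with the last k bytes zeroed out, their rate is approximated by compressed length, and for each rate budget we record the least distortion achieved.

```python
import zlib

def rate(y: bytes) -> int:
    """Approximate K(y) in bits by the zlib-compressed length of y."""
    return 8 * len(zlib.compress(y, 9))

def hamming(a: bytes, b: bytes) -> int:
    """Hamming distortion between equal-length byte strings, counted in bits."""
    return sum(bin(p ^ q).count("1") for p, q in zip(a, b))

def rd_curve(x: bytes, budgets):
    """For each rate budget, the least distortion over a simple family of candidate code words:
    copies of x with the last k bytes zeroed (a crude stand-in for 'simpler models of x')."""
    candidates = [x[: len(x) - k] + b"\x00" * k for k in range(len(x) + 1)]
    scored = [(rate(y), hamming(x, y)) for y in candidates]
    return {r: min((d for c, d in scored if c <= r), default=None) for r in budgets}

x = bytes(range(256)) * 4   # a mildly structured 1024-byte "datum"
print(rd_curve(x, budgets=(200, 1000, 4000)))
```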

Mouse: Original Picture

Mouse: Increasing Rate of Codes

Mouse: MDL code-length

Penguin: Original (Linux)

Penguin: Rate of code-lengths

Euclidean distortion. Lemma: d = |x - y| (the Euclidean distance between the real datum x and the rational code word y); α = 2; d_max = ½; r_x(½) = O(1); and r_x(d) ≤ r_x(d') + log d'/d for all 0 < d ≤ d' ≤ ½. Every non-increasing function r(d) with r(½) = 0, such that r(d) + log d is monotonic non-decreasing, can be realized as the rate-distortion function of some real x, with precision |r(d) - r_x(d)| ≤ O(√(log 1/d)) for all 0 < d ≤ ½.
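A concrete way to see the upper-bound side r_x(d) ≲ log 1/d (our own illustration): truncating the binary expansion of x to r bits gives a rational code word within distortion 2^-r, and that code word needs only about r bits (plus the complexity of r) to describe.

```python
from fractions import Fraction
from math import pi, ceil, log2

def truncate(x: float, r: int) -> Fraction:
    """Rational code word: the binary expansion of x in [0,1) truncated to r bits."""
    return Fraction(int(x * 2**r), 2**r)

x = pi - 3                     # a real in [0,1): 0.14159...
for d in (1 / 8, 1 / 64, 1 / 1024):
    r = ceil(log2(1 / d))      # a rate of about log(1/d) bits suffices
    y = truncate(x, r)
    print(r, float(y), abs(x - float(y)) <= d)   # distortion |x - y| <= 2^-r <= d
```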

List distortion. Lemma: d = |y|, the cardinality of the finite set y (the code word) containing x, where |x| = n (so log d is the log-cardinality distortion used earlier); α = 2; d_max = 2^n; r_x(2^n) = O(log n); and r_x(d) ≤ r_x(d') + log d'/d + O(small) for all 0 < d ≤ d' ≤ 2^n. Every non-increasing function r(d) with r(2^n) = 0, such that r(d) + log d is monotonic non-decreasing, can be realized as the rate-distortion function of some string x of length n, with precision |r(d) - r_x(d)| ≤ O(log n + K(r)) for all 1 < d ≤ 2^n.
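A simple family of list-distortion code words (our own illustration): the "cylinder" of all n-bit strings that share a given k-bit prefix with x. It has cardinality 2^(n-k), and its complexity is roughly the complexity of the prefix, which a real compressor can upper-bound; sweeping k sketches a distortion-rate curve.

```python
import zlib

def prefix_rate(x_bits: str, k: int) -> int:
    """Approximate complexity (in bits) of the cylinder code word {strings with x's first k bits}:
    essentially the complexity of the k-bit prefix, upper-bounded by zlib-compressed length."""
    return 8 * len(zlib.compress(x_bits[:k].encode(), 9))

def distortion_rate_sketch(x_bits: str, budgets):
    """For each rate budget r, the least log-cardinality log|y| = n - k over prefix cylinders
    whose approximate complexity stays within r (a crude stand-in for d_x(r))."""
    n = len(x_bits)
    options = [(prefix_rate(x_bits, k), n - k) for k in range(n + 1)]
    return {r: min((d for c, d in options if c <= r), default=n) for r in budgets}

x = ("01" * 300) + "10011010111000101101"   # mostly regular, with an irregular 20-bit tail
print(distortion_rate_sketch(x, budgets=(100, 200, 400)))
```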

List distortion, continued: the distortion-rate graph. [Graph: distortion log |y| versus rate r; the distortion-rate function d_x(r) lies above the lower bound d_x(r) = K(x) - r.]

List distortion, continued: positive and negative randomness. [Graph: two distortion-rate curves d_x(r) and d_x'(r) for strings x and x' with |x| = |x'| and K(x) = K(x'); equal length and equal complexity do not imply equal curves.]

List distortion, continued: precision with which a given function d(r) can be followed. [Graph: a prescribed function d(r) and the realized distortion-rate curve d_x(r), plotted as distortion log |y| against rate r.]

Expected individual rate-distortion equals Shannon's rate-distortion. Lemma: given m repetitions of an i.i.d. random variable with probability f(x) of obtaining outcome x, where f is a total recursive function (so K(f) is finite),

lim_{m→∞} ∑_{x^m} p(x^m) (1/m) d_{x^m}(mR) = D(R),

where x^m = x_1 ... x_m and p(·) is the extension of f to the m repetitions of the random variable.

Algorithmic Statistics Paul Vitanyi CWI, University of Amsterdam, National ICT Australia Joint work with Kolya Vereshchagin

Kolmogorov’s Structure function

Non-Probabilistic Statistics

Classic Statistics--Recalled

Sufficient Statistic

Sufficient Statistic, Cont'd

Kolmogorov Complexity--Revisited

Kolmogorov complexity and Shannon Information

Randomness Deficiency

Algorithmic Sufficient Statistic

Maximum Likelihood Estimator, Best-Fit Estimator

Minimum Description Length estimator, Relations between estimators

Primogeniture of ML/MDL estimators. ML/MDL estimators can be approximated from above; the best-fit estimator cannot be approximated, either from above or below, to any precision. But the approximable ML/MDL estimators yield the best-fitting models, even though we do not know the quantity of goodness-of-fit → ML/MDL estimators implicitly optimize goodness-of-fit.

Positive- and Negative Randomness, and Probabilistic Models

List distortion continued

Recapitulation

Selected Bibliography
N.K. Vereshchagin, P.M.B. Vitanyi, A theory of lossy compression of individual data. Submitted.
P.D. Grunwald, P.M.B. Vitanyi, Shannon information and Kolmogorov complexity. IEEE Trans. Inform. Theory. Submitted.
N.K. Vereshchagin, P.M.B. Vitanyi, Kolmogorov's structure functions and model selection. IEEE Trans. Inform. Theory, 50:12(2004).
P. Gacs, J. Tromp, P. Vitanyi, Algorithmic statistics. IEEE Trans. Inform. Theory, 47:6(2001).
Q. Gao, M. Li, P.M.B. Vitanyi, Applying MDL to learning best model granularity. Artificial Intelligence, 121:1-2(2000).
P.M.B. Vitanyi, M. Li, Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inform. Theory, IT-46:2(2000).