Entropy and Shannon’s First Theorem


Chapter 6: Entropy and Shannon’s First Theorem

Information
A quantitative measure of the amount of information any event represents: I(p) = the amount of information in the occurrence of an event of probability p (single-symbol source).
Axioms:
A. I(p) ≥ 0 for any event of probability p
B. I(p1∙p2) = I(p1) + I(p2) when p1 and p2 are independent events (the Cauchy functional equation)
C. I(p) is a continuous function of p
Units of information: base 2 = a bit, base e = a nat, base 10 = a Hartley.
Existence: I(p) = log(1/p), taken in the chosen base, satisfies the axioms.
6.2
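As a quick illustration (not from the slides), here is a minimal Python sketch of these definitions; the helper name `information` is mine:

```python
import math

def information(p: float, base: float = 2) -> float:
    """I(p) = log_base(1/p): the information in an event of probability p."""
    return math.log(1 / p, base)

# Axiom B (additivity for independent events): I(p1*p2) = I(p1) + I(p2).
p1, p2 = 0.5, 0.25
assert math.isclose(information(p1 * p2), information(p1) + information(p2))

print(information(0.5, 2))        # 1.0 bit
print(information(0.5, math.e))   # ~0.693 nats
print(information(0.5, 10))       # ~0.301 Hartleys
```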

Uniqueness: Suppose I′(p) satisfies the axioms. Since I′(p) ≥ 0, take any 0 < p0 < 1 and define the base k = (1/p0)^(1/I′(p0)), so that k^I′(p0) = 1/p0 and hence logk(1/p0) = I′(p0). Now any z ∈ (0,1) can be written as p0^r, r a real number in R+ (r = logp0 z). The Cauchy functional equation implies that I′(p0^n) = n∙I′(p0) and, for m ∈ Z+, I′(p0^(1/m)) = (1/m)∙I′(p0), which gives I′(p0^(n/m)) = (n/m)∙I′(p0), and hence by continuity I′(p0^r) = r∙I′(p0). Hence I′(z) = r∙logk(1/p0) = logk(1/p0^r) = logk(1/z). ∎
Note: In this proof we introduce an arbitrary p0, show how any z relates to it, and then eliminate the dependency on that particular p0.
6.2

Entropy
The average amount of information received on a per-symbol basis from a source S = {s1, …, sq} of symbols, where si has probability pi; it measures the information rate. In radix r, when all the probabilities are independent:
Hr(S) = Σi pi logr(1/pi)
Entropy is the amount of information in the probability distribution.
Alternative approach: consider a long message of N symbols from S = {s1, …, sq} with probabilities p1, …, pq. You expect si to appear N∙pi times, and the probability of this typical message is:
P = p1^(N∙p1) ∙ p2^(N∙p2) ∙∙∙ pq^(N∙pq), so that I(P) = log(1/P) = N ∙ Σi pi log(1/pi) = N∙H(S).
6.3
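A small Python sketch (the helper names `entropy` and `P` are mine) checking the typical-message view numerically for a zero-memory source:

```python
import math

def entropy(probs, radix=2):
    """H_r(S) = sum_i p_i * log_r(1/p_i)."""
    return sum(p * math.log(1 / p, radix) for p in probs if p > 0)

# Typical-message view: a typical message of N symbols has probability
# P = p1^(N*p1) * ... * pq^(N*pq), so log(1/P)/N equals the entropy.
probs = [0.5, 0.25, 0.25]
N = 100
P = math.prod(p ** (N * p) for p in probs)
print(entropy(probs))              # 1.5 bits/symbol
print(math.log2(1 / P) / N)        # also 1.5 bits/symbol
```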

Consider f(p) = p ln(1/p) (this works for any base, not just e):
f′(p) = (−p ln p)′ = −p(1/p) − ln p = −1 + ln(1/p)
f″(p) = p∙(−p^(−2)) = −1/p < 0 for p ∈ (0,1), so f is concave down.
f(1) = 0, f′(1) = −1, f′(0+) = +∞, and f′(1/e) = 0 with maximum value f(1/e) = 1/e.
[Figure: plot of f(p) on (0,1), rising to its maximum 1/e at p = 1/e and falling back to 0 at p = 1.]
6.3
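A quick numerical check of this maximum, offered as an illustration rather than as part of the slides:

```python
import math

f = lambda p: p * math.log(1 / p)   # f(p) = p ln(1/p)

# Numerically locate the maximum on (0, 1): it sits at p = 1/e with value 1/e.
grid = [i / 100000 for i in range(1, 100000)]
p_star = max(grid, key=f)
print(p_star, 1 / math.e)           # both ~0.36788
print(f(p_star), 1 / math.e)        # both ~0.36788
```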

Basic information about the logarithm function
Tangent line to y = ln x at x = 1: (y − ln 1) = (ln)′|x=1 ∙ (x − 1), i.e. y = x − 1.
(ln x)″ = (1/x)′ = −1/x² < 0, so x ↦ ln x is concave down.
Conclusion: ln x ≤ x − 1, with equality only at x = 1.
[Figure: ln x lies below its tangent line y = x − 1, touching it only at x = 1.]
6.4

Fundamental Gibbs inequality: for any two probability distributions p1, …, pq and q1, …, qq over the same q symbols,
Σi pi log(qi/pi) ≤ Σi pi (qi/pi − 1) = 0 (using ln x ≤ x − 1), with equality only when pi = qi for all i.
Minimum entropy occurs when one pi = 1 and all the others are 0.
Maximum entropy occurs when? Consider the uniform distribution qi = 1/q:
H(S) − log q = Σi pi log(1/pi) − Σi pi log q = Σi pi log(1/(q∙pi)) ≤ 0.
Hence H(S) ≤ log q, and equality occurs only when pi = 1/q.
6.4
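A short Python check of both inequalities on a random distribution; the variable names are mine and the example is purely illustrative:

```python
import math
import random

def entropy(probs, radix=2):
    return sum(p * math.log(1 / p, radix) for p in probs if p > 0)

q = 6
w = [random.random() for _ in range(q)]
p = [x / sum(w) for x in w]                       # a random distribution over q symbols
u = [1 / q] * q                                   # the uniform distribution

# Gibbs: sum_i p_i * log(u_i / p_i) <= 0, with equality only when p = u.
print(sum(pi * math.log2(ui / pi) for pi, ui in zip(p, u)), "<= 0")
print(entropy(p), "<=", math.log2(q), "=", entropy(u))   # H(S) <= log2 q
```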

Entropy Examples
S = {s1}, p1 = 1: H(S) = 0 (no information).
S = {s1, s2}, p1 = p2 = ½: H2(S) = 1 (1 bit per symbol).
S = {s1, …, sr}, p1 = … = pr = 1/r: Hr(S) = 1, but H2(S) = log2 r.
Run-length coding (for instance, in binary predictive coding): p = 1 − q is the probability of a 0.
H2(S) = p log2(1/p) + q log2(1/q). As q → 0 the term q log2(1/q) dominates (compare slopes).
Cf. average run length = 1/q and average number of bits needed to describe a run = log2(1/q); so q∙log2(1/q) = the average amount of information per bit of the original code.
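A small Python sketch (the function name `h2` is mine) showing how q∙log2(1/q) comes to dominate H2(S) as q → 0:

```python
import math

def h2(q):
    """Binary entropy with P(1) = q and P(0) = p = 1 - q."""
    p = 1 - q
    return p * math.log2(1 / p) + q * math.log2(1 / q)

for q in (0.5, 0.1, 0.01, 0.001):
    dominant = q * math.log2(1 / q)      # the term that dominates as q -> 0
    print(q, round(h2(q), 5), round(dominant, 5), "avg run length:", 1 / q)
```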

Entropy as a Lower Bound for Average Code Length
Given an instantaneous code with lengths li in radix r, let K = Σi r^(−li) ≤ 1 (Kraft inequality) and L = Σi pi∙li (average code length). Applying the Gibbs inequality with qi = r^(−li)/K:
Hr(S) − L = Σi pi logr(1/pi) − Σi pi∙li = Σi pi logr(r^(−li)/pi) = logr K + Σi pi logr(qi/pi) ≤ logr K ≤ 0,
so Hr(S) ≤ L. Equality occurs when K = 1 (the decoding tree is complete) and pi = r^(−li). By the McMillan inequality, this holds for all uniquely decodable codes as well.
How do we know we can get arbitrarily close to the bound in all other cases? See Shannon-Fano coding, next.
6.5
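An illustrative Python check of the bound on a simple code; the code 0, 10, 110, 111 is my example, not one from the slides:

```python
import math

def entropy(probs, r=2):
    return sum(p * math.log(1 / p, r) for p in probs if p > 0)

probs   = [1/2, 1/4, 1/8, 1/8]      # source probabilities
lengths = [1, 2, 3, 3]              # an instantaneous binary code: 0, 10, 110, 111
K = sum(2 ** -l for l in lengths)   # Kraft sum
L = sum(p * l for p, l in zip(probs, lengths))
print(K)                            # 1.0: the decoding tree is complete
print(entropy(probs), "<=", L)      # 1.75 <= 1.75: equality since p_i = 2^(-l_i)
```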

Shannon-Fano Coding
The simplest variable-length method. Less efficient than Huffman, but it allows one to code symbol si with length li directly from the probability pi:
li = ⌈logr(1/pi)⌉, i.e. logr(1/pi) ≤ li < logr(1/pi) + 1.
From the left inequality, r^(−li) ≤ pi; summing this inequality over i gives Σi r^(−li) ≤ Σi pi = 1, so the Kraft inequality is satisfied and therefore there is an instantaneous code with these lengths. Multiplying the double inequality by pi and summing over i gives Hr(S) ≤ L < Hr(S) + 1.
6.6

Example
p’s: ¼, ¼, ⅛, ⅛, ⅛, ⅛   →   l’s: 2, 2, 3, 3, 3, 3
K = 1, L = 5/2, H2(S) = 2.5.
If K = 1, then the average code length = the entropy (put on final exam).
6.6
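A minimal Python sketch of Shannon-Fano length assignment reproducing this example; the function name `shannon_fano_lengths` and the floating-point epsilon are my choices:

```python
import math

def shannon_fano_lengths(probs, r=2):
    """l_i = ceil(log_r(1/p_i)); the small epsilon guards against spurious
    round-up when log_r(1/p_i) is an exact integer in floating point."""
    return [math.ceil(math.log(1 / p, r) - 1e-12) for p in probs]

probs = [1/4, 1/4, 1/8, 1/8, 1/8, 1/8]
lengths = shannon_fano_lengths(probs)
K = sum(2 ** -l for l in lengths)
L = sum(p * l for p, l in zip(probs, lengths))
H = sum(p * math.log2(1 / p) for p in probs)
print(lengths)   # [2, 2, 3, 3, 3, 3]
print(K, L, H)   # 1.0, 2.5, 2.5  (K = 1, so average length equals entropy)
```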

The Entropy of Code Extensions
Recall: the nth extension of a source S = {s1, …, sq} with probabilities p1, …, pq is the set of symbols T = S^n = {si1 ∙∙∙ sin : sij ∈ S, 1 ≤ j ≤ n}, where ti = si1 ∙∙∙ sin (concatenation) has probability Qi = pi1 ∙∙∙ pin (multiplication), assuming independent probabilities. Index the extension symbols by i = (i1−1, …, in−1)q + 1, an n-digit number base q. The entropy is:
H(T) = Σi Qi log(1/Qi) = Σi1 ∙∙∙ Σin pi1 ∙∙∙ pin [log(1/pi1) + ∙∙∙ + log(1/pin)] = n∙H(S),
i.e. H(S^n) = n∙H(S).
6.8

Hence the average S-F code length Ln for T = S^n satisfies:
H(T) ≤ Ln < H(T) + 1, and since H(S^n) = n∙H(S),
n∙H(S) ≤ Ln < n∙H(S) + 1, i.e. H(S) ≤ Ln/n < H(S) + 1/n. [Now let n go to infinity.]
So the per-symbol average code length can be brought arbitrarily close to the entropy: this is Shannon’s first (noiseless coding) theorem.
6.8

Extension Example
S = {s1, s2}, p1 = 2/3, p2 = 1/3. H2(S) = (2/3)log2(3/2) + (1/3)log2(3/1) ≈ 0.9182958…
Huffman: s1 = 0, s2 = 1. Average coded length = (2/3)∙1 + (1/3)∙1 = 1.
Shannon-Fano: l1 = 1, l2 = 2. Average length = (2/3)∙1 + (1/3)∙2 = 4/3.
2nd extension: p11 = 4/9, p12 = p21 = 2/9, p22 = 1/9.
S-F: l11 = ⌈log2(9/4)⌉ = 2, l12 = l21 = ⌈log2(9/2)⌉ = 3, l22 = ⌈log2(9/1)⌉ = 4.
LSF(2) = average coded length = (4/9)∙2 + (2/9)∙3∙2 + (1/9)∙4 = 24/9 = 2.666…, i.e. 4/3 per original symbol.
In general S^n = (s1 + s2)^n: the probabilities are the corresponding terms in the expansion of (p1 + p2)^n.
6.9
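A short Python sketch (the names `sf_len` and `block_probs` are mine) extending this example to larger n, showing Ln/n staying within 1/n of H(S):

```python
import math
from itertools import product

p = {"s1": 2/3, "s2": 1/3}
H = sum(q * math.log2(1 / q) for q in p.values())        # ~0.9183 bits/symbol

def sf_len(q):
    return math.ceil(math.log2(1 / q) - 1e-12)            # Shannon-Fano length for probability q

for n in (1, 2, 3, 4, 8):
    block_probs = [math.prod(p[s] for s in block) for block in product(p, repeat=n)]
    L_n = sum(q * sf_len(q) for q in block_probs)          # average S-F length for S^n
    print(n, round(L_n / n, 4))                            # stays below H + 1/n and approaches H
print(round(H, 4))
```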

Extension continued
The probabilities in S^n = (s1 + s2)^n are of the form 2^k/3^n, where k is the number of s1’s in the block; they sum to 1 since Σk C(n,k)∙2^k∙1^(n−k) = (2 + 1)^n = 3^n.
6.9

Markov Process Entropy
For an mth-order Markov source, condition on the previous m symbols (the state) and average over the equilibrium distribution of states:
H(S) = Σ states p(si1, …, sim) Σi p(si | si1, …, sim) log(1/p(si | si1, …, sim)) = Σ p(si1, …, sim, si) log(1/p(si | si1, …, sim)).
6.10

Example (a second-order binary Markov source)
previous state (si1, si2)   next symbol si   p(si | si1, si2)   p(si1, si2)   p(si1, si2, si)
0,0                         0                0.8                5/14          4/14
0,0                         1                0.2                5/14          1/14
0,1                         0                0.5                2/14          1/14
0,1                         1                0.5                2/14          1/14
1,0                         0                0.5                2/14          1/14
1,0                         1                0.5                2/14          1/14
1,1                         0                0.2                5/14          1/14
1,1                         1                0.8                5/14          4/14
Equilibrium probabilities: p(0,0) = 5/14 = p(1,1), p(0,1) = 2/14 = p(1,0).
[Figure: state diagram over the four states 0,0 / 0,1 / 1,0 / 1,1 with transition probabilities .8, .2, .5, .5.]
6.11
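An illustrative Python computation of this source’s entropy from the table above (variable names are mine); it comes out to roughly 0.80 bits per symbol:

```python
import math
from fractions import Fraction

# Transition probabilities p(next | previous two) and equilibrium state
# probabilities, copied from the table above.
P = {(0, 0): {0: 0.8, 1: 0.2},
     (0, 1): {0: 0.5, 1: 0.5},
     (1, 0): {0: 0.5, 1: 0.5},
     (1, 1): {0: 0.2, 1: 0.8}}
pi = {(0, 0): Fraction(5, 14), (0, 1): Fraction(2, 14),
      (1, 0): Fraction(2, 14), (1, 1): Fraction(5, 14)}

H = sum(float(pi[s]) * p * math.log2(1 / p) for s in P for p in P[s].values())
print(round(H, 4))   # ~0.8014 bits/symbol, less than the 1 bit of a memoryless fair binary source
```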

The Fibonacci numbers
Let f0 = 1, f1 = 2, f2 = 3, f3 = 5, f4 = 8, …, be defined by fn+1 = fn + fn−1. Then lim n→∞ fn+1/fn = (1 + √5)/2, the golden ratio, a root of the equation x² = x + 1.
Use these as the weights for a system of number representation with digits 0 and 1 and no adjacent 1’s (adjacent 1’s are never needed, because (100)phi = (11)phi).

Base Fibonacci Representation
Theorem: every number from 0 to fn − 1 can be uniquely written as an n-bit number with no adjacent ones.
Existence (induction on n). Basis: n = 0, so 0 ≤ i ≤ 0, and 0 = ()phi = ε, the empty representation. Induction: let 0 ≤ i < fn+1. If i < fn, we are done by the induction hypothesis. Otherwise fn ≤ i < fn+1 = fn−1 + fn, so 0 ≤ i − fn < fn−1, and i − fn is uniquely representable as i − fn = (bn−2 … b0)phi with bi ∈ {0, 1} and ¬(bi = bi+1 = 1). Hence i = (1 0 bn−2 … b0)phi, which also has no adjacent ones.
Uniqueness: let i be the smallest number ≥ 0 with two distinct representations (no leading zeros), i = (bn−1 … b0)phi = (b′n−1 … b′0)phi. By minimality of i, bn−1 ≠ b′n−1 (otherwise dropping the common leading bit would give a smaller counterexample), so without loss of generality bn−1 = 1 and b′n−1 = 0. But then (b′n−2 … b′0)phi = i ≥ fn−1, while an (n−1)-bit representation with no adjacent ones is at most fn−1 − 1, which can’t be true.
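A small Python sketch of the greedy (existence) construction; the function names `fib_weights` and `to_fib` are mine, and the check is only for n = 5:

```python
def fib_weights(n):
    """Weights f_0 = 1, f_1 = 2, ..., f_{n-1}, as defined on the Fibonacci slide."""
    w = [1, 2]
    while len(w) < n:
        w.append(w[-1] + w[-2])
    return w[:n]

def to_fib(i, n):
    """Greedy n-bit base-Fibonacci representation of i; never produces adjacent 1s."""
    bits = []
    for f in reversed(fib_weights(n)):
        if f <= i:
            bits.append(1)
            i -= f
        else:
            bits.append(0)
    return tuple(bits)

n = 5
f_n = fib_weights(n + 1)[-1]                 # f_5 = 13, so we cover 0 .. 12
reps = [to_fib(i, n) for i in range(f_n)]
assert len(set(reps)) == len(reps)           # all representations distinct
assert all(not (a and b) for r in reps for a, b in zip(r, r[1:]))   # no adjacent 1s
print(f_n, "numbers, each with a unique 5-bit representation and no adjacent 1s")
```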

Base Fibonacci
The golden ratio φ = (1 + √5)/2 is a solution of x² − x − 1 = 0 and is equal to the limit of the ratio of adjacent Fibonacci numbers.
[Figure: for comparison, a source with r equally likely symbols 0, …, r − 1 (each of probability 1/r) has H2 = log2 r; the no-adjacent-1’s source is modeled as a 1st-order Markov process with transition probabilities 1/φ (0 after a 0), 1/φ² (1 after a 0), and 1 (0 after a 1).]
Think of the source as emitting variable-length symbols: “0” with probability 1/φ and “10” with probability 1/φ² (note 1/φ + 1/φ² = 1). See accompanying file.
Entropy per emitted bit = [(1/φ)∙log φ + (1/φ²)∙log φ²] / [(1/φ)∙1 + (1/φ²)∙2] = log φ, which is maximal once the variable-length symbols are taken into account.
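A quick numerical check of this computation in Python (illustrative only, with variable names of my choosing):

```python
import math

phi = (1 + math.sqrt(5)) / 2
assert math.isclose(1 / phi + 1 / phi**2, 1.0)     # the two symbol probabilities sum to 1

# Variable-length symbols: "0" with probability 1/phi, "10" with probability 1/phi^2.
symbols = [(1 / phi, 1), (1 / phi**2, 2)]          # (probability, length in bits)
H_symbol = sum(p * math.log2(1 / p) for p, _ in symbols)
avg_len = sum(p * l for p, l in symbols)
print(round(H_symbol / avg_len, 6), round(math.log2(phi), 6))   # both ~0.694242
```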

The Adjoint System (skip)
For simplicity, consider a first-order Markov system S.
Goal: bound the entropy by a source with zero memory whose probabilities are the equilibrium probabilities.
Let p(si) = equilibrium probability of si, p(sj) = equilibrium probability of sj, and p(sj, si) = equilibrium probability of getting the pair sj si.
By the Gibbs inequality, Σi,j p(sj, si) log( p(sj)∙p(si) / p(sj, si) ) ≤ 0, with equality only if p(sj, si) = p(si)∙p(sj).
Now, p(sj, si) = p(si | sj)∙p(sj), so the left-hand side becomes Σi,j p(sj, si) log( p(si) / p(si | sj) ), and the inequality rearranges to
Σi,j p(sj, si) log(1/p(si | sj)) ≤ Σi,j p(sj, si) log(1/p(si)) = Σi p(si) log(1/p(si)),
i.e. the entropy of the Markov source is at most the entropy of its zero-memory adjoint.
6.12