Entropy and Shannon’s First Theorem


1 Entropy and Shannon’s First Theorem
Chapter 6 Entropy and Shannon’s First Theorem

2 Information
Information: a quantitative measure of the amount of information any event represents. I(p) = the amount of information in the occurrence of an event of probability p.
Axioms (single-symbol source):
A. I(p) ≥ 0 for any event of probability p
B. I(p1∙p2) = I(p1) + I(p2) when p1 & p2 are independent events (the Cauchy functional equation)
C. I(p) is a continuous function of p
Units of information: in base 2, a bit; in base e, a nat; in base 10, a Hartley.
Existence: I(p) = log(1/p) satisfies the axioms. 6.2
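A minimal sketch of this definition in Python (the function name and example values are illustrative, not from the slides):

```python
import math

def information(p, base=2):
    """I(p) = log(1/p): the information in an event of probability p
    (base 2 gives bits, base e nats, base 10 Hartleys)."""
    return math.log(1.0 / p, base)

# Axiom B for independent events: I(p1*p2) = I(p1) + I(p2).
print(information(0.5))         # 1.0 bit for a fair coin flip
print(information(0.5 * 0.5))   # 2.0 bits for two independent flips
```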

3 Uniqueness: Suppose I′(p) satisfies the axioms. Since I′(p) ≥ 0, take any 0 < p0 < 1 and let the base be k = (1/p0)^(1/I′(p0)). Then k^I′(p0) = 1/p0, and hence log_k(1/p0) = I′(p0). Now, any z ∈ (0,1) can be written as p0^r, with r a positive real number (r = log_p0 z). The Cauchy functional equation implies that I′(p0^n) = n∙I′(p0) and, for all m ∈ Z+, I′(p0^(1/m)) = (1/m)∙I′(p0), which gives I′(p0^(n/m)) = (n/m)∙I′(p0), and hence by continuity I′(p0^r) = r∙I′(p0). Hence I′(z) = r∙log_k(1/p0) = log_k(1/p0^r) = log_k(1/z). ∎ Note: in this proof, we introduce an arbitrary p0, show how any z relates to it, and then eliminate the dependency on that particular p0. 6.2

4 Entropy
The average amount of information received on a per-symbol basis from a source S = {s1, …, sq} of symbols, where si has probability pi. It measures the information rate. In radix r, when all the probabilities are independent:
H_r(S) = Σi pi log_r(1/pi)
Entropy is the amount of information in the probability distribution. Alternative approach: consider a long message of N symbols from S = {s1, …, sq} with probabilities p1, …, pq. You expect si to appear N∙pi times, and the probability of this typical message is:
P = p1^(N∙p1) ∙ p2^(N∙p2) ∙∙∙ pq^(N∙pq), so I(P) = log(1/P) = N ∙ Σi pi log(1/pi) = N∙H(S). 6.3
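A small sketch of the entropy formula above (the function name is assumed):

```python
import math

def entropy(probs, radix=2):
    """H_r(S) = sum_i p_i * log_r(1/p_i) for a zero-memory source."""
    return sum(p * math.log(1.0 / p, radix) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                                # 1.0 bit/symbol
print(entropy([0.25, 0.25, 0.125, 0.125, 0.125, 0.125]))  # 2.5 bits/symbol
```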

5 Consider f(p) = p ln(1/p) (works for any base, not just e):
f′(p) = (−p ln p)′ = −p(1/p) − ln p = ln(1/p) − 1
f″(p) = −1/p < 0 for p ∈ (0,1) ⇒ f is concave down
[Graph of f on (0,1): f′(0+) = +∞, f′(1/e) = 0 with maximum f(1/e) = 1/e, and f(1) = 0 with f′(1) = −1.] 6.3
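A quick numeric check of the maximum at p = 1/e (a sketch; none of these values are from the slides):

```python
import math

def f(p):
    return p * math.log(1.0 / p)    # f(p) = p ln(1/p)

print(f(1 / math.e), 1 / math.e)    # both ≈ 0.3679: the maximum value is 1/e
print(f(0.25), f(0.5))              # smaller on either side of p = 1/e
```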

6 Basic information about logarithm function
Tangent line to y = ln x at x = 1: (y − ln 1) = (ln x)′|x=1 ∙ (x − 1), i.e. y = x − 1. Since (ln x)″ = (1/x)′ = −1/x² < 0, x ↦ ln x is concave down. Conclusion: ln x ≤ x − 1, with equality only at x = 1. [Graph: the line y = x − 1 lies above y = ln x, touching it at (1, 0).] 6.4

7 Fundamental Gibbs inequality
Fundamental Gibbs inequality: for any two probability distributions {pi} and {qi},
Σi pi log(qi/pi) ≤ 0, with equality only when pi = qi for all i
(apply ln x ≤ x − 1 to x = qi/pi).
Minimum entropy occurs when one pi = 1 and all others are 0. Maximum entropy occurs when? Consider the Gibbs inequality with qi = 1/q: Σi pi log(1/(q∙pi)) ≤ 0, i.e. H(S) − log q ≤ 0. Hence H(S) ≤ log q, and equality occurs only when pi = 1/q. 6.4
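A numeric illustration of H(S) ≤ log q (a sketch; the entropy helper repeats the earlier formula so the block stands alone):

```python
import math

def entropy(probs, base=2):
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

q = 4
print(entropy([0.7, 0.1, 0.1, 0.1]))        # ≈ 1.357 < log2(4) = 2
print(entropy([1.0, 0.0, 0.0, 0.0]))        # 0.0: minimum entropy
print(entropy([0.25] * q), math.log2(q))    # 2.0 = 2.0: equality only for p_i = 1/q
```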

8 Entropy Examples
S = {s1}, p1 = 1: H(S) = 0 (no information).
S = {s1, s2}, p1 = p2 = ½: H2(S) = 1 (1 bit per symbol).
S = {s1, …, sr}, p1 = … = pr = 1/r: Hr(S) = 1, but H2(S) = log2 r.
Run-length coding (for instance, in binary predictive coding): p = 1 − q is the probability of a 0 and q the probability of a 1, so H2(S) = p log2(1/p) + q log2(1/q). As q → 0 the term q log2(1/q) dominates (compare slopes). Cf. average run length = 1/q and average # of bits needed = log2(1/q), so q log2(1/q) = avg. amount of information per bit of the original code.
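A small illustration of the run-length remark (a sketch; h2 is an assumed helper name):

```python
import math

def h2(q):
    """Entropy of a binary source with P(1) = q and P(0) = 1 - q."""
    p = 1 - q
    return p * math.log2(1 / p) + q * math.log2(1 / q)

# As q -> 0 the q*log2(1/q) term dominates H2(S); coding each run
# (average length 1/q) with about log2(1/q) bits captures most of it.
for q in (0.1, 0.01, 0.001):
    print(q, round(h2(q), 4), round(q * math.log2(1 / q), 4))
```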

9 Entropy as a Lower Bound for Average Code Length
Given an instantaneous code with lengths li in radix r, let K = Σi r^(−li) ≤ 1 (Kraft inequality). Applying the Gibbs inequality with qi = r^(−li)/K:
H_r(S) ≤ Σi pi log_r(1/qi) = Σi pi li + log_r K ≤ L = Σi pi li,
so the entropy is a lower bound for the average code length. How do we know we can get arbitrarily close in all other cases? (See the code extensions below.) By the McMillan inequality, this holds for all uniquely decodable codes. Equality occurs when K = 1 (the decoding tree is complete) and pi = r^(−li). 6.5
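A numeric check of H_r(S) ≤ L on a code where equality holds (a sketch; the probabilities and lengths are an assumed example):

```python
import math

probs   = [0.5, 0.25, 0.125, 0.125]      # source probabilities
lengths = [1, 2, 3, 3]                   # an instantaneous binary code
K = sum(2.0 ** -l for l in lengths)      # Kraft sum: 1.0, complete decoding tree
L = sum(p * l for p, l in zip(probs, lengths))
H = sum(p * math.log2(1 / p) for p in probs)
print(K, H, L)                           # K = 1 and p_i = 2^-l_i, so H = L = 1.75
```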

10 Shannon-Fano Coding
Simplest variable-length method. Less efficient than Huffman, but allows one to code symbol si with length li directly from probability pi:
li = ⌈log_r(1/pi)⌉, i.e. log_r(1/pi) ≤ li < log_r(1/pi) + 1, so r^(−li) ≤ pi.
Summing this inequality over i: Σi r^(−li) ≤ Σi pi = 1, so the Kraft inequality is satisfied, and therefore there is an instantaneous code with these lengths. Multiplying the double inequality by pi and summing over i gives H_r(S) ≤ L < H_r(S) + 1. 6.6
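A minimal sketch of the length rule (the function name and example probabilities are assumed):

```python
import math

def shannon_fano_lengths(probs, radix=2):
    """l_i = ceil(log_r(1/p_i)), read straight from the probabilities."""
    return [math.ceil(math.log(1.0 / p, radix)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]                     # illustrative source
lengths = shannon_fano_lengths(probs)            # [2, 2, 3, 4]
K = sum(2.0 ** -l for l in lengths)              # 0.6875 <= 1: Kraft holds
L = sum(p * l for p, l in zip(probs, lengths))   # 2.4
H = sum(p * math.log2(1 / p) for p in probs)     # ≈ 1.846
print(lengths, K, H, L)                          # H <= L < H + 1
```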

11 Example
p’s: ¼, ¼, ⅛, ⅛, ⅛, ⅛ → l’s: 2, 2, 3, 3, 3, 3, K = 1
L = 5/2, H2(S) = 2.5
[Figure: the complete binary decoding tree, with two leaves at depth 2 and four at depth 3.]
If K = 1, then the average code length = the entropy (put on final exam). 6.6
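Checking this example directly (a self-contained sketch):

```python
import math

probs = [1/4, 1/4, 1/8, 1/8, 1/8, 1/8]
lengths = [math.ceil(math.log2(1 / p)) for p in probs]   # [2, 2, 3, 3, 3, 3]
K = sum(2.0 ** -l for l in lengths)                      # 1.0: the tree is complete
L = sum(p * l for p, l in zip(probs, lengths))           # 5/2
H = sum(p * math.log2(1 / p) for p in probs)             # 2.5
print(lengths, K, L, H)                                  # K = 1, so L = H exactly
```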

12 The Entropy of Code Extensions
Recall: the nth extension of a source S = {s1, …, sq} with probabilities p1, …, pq is the set of symbols T = Sn = {si1 ∙∙∙ sin : sij ∈ S, 1 ≤ j ≤ n}, where ti = si1 ∙∙∙ sin has probability pi1 ∙∙∙ pin = Qi, assuming independent probabilities. (In si1 ∙∙∙ sin the juxtaposition means concatenation; in pi1 ∙∙∙ pin it means multiplication.) Let i = (i1−1, …, in−1)q + 1, reading (i1−1) … (in−1) as an n-digit number base q. The entropy is H(T) = Σi Qi log(1/Qi). 6.8
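A sketch that builds the extension probabilities and computes their entropy (helper names are assumed; it also previews the next slide's identity H(Sn) = n∙H(S)):

```python
import math
from itertools import product

def entropy(probs, radix=2):
    return sum(p * math.log(1.0 / p, radix) for p in probs if p > 0)

def extension(probs, n):
    """Probabilities Q_i of the nth extension S^n, assuming independence."""
    return [math.prod(t) for t in product(probs, repeat=n)]

probs = [2/3, 1/3]
for n in (1, 2, 3):
    # The two columns agree: H(S^n) = n * H(S).
    print(n, entropy(extension(probs, n)), n * entropy(probs))
```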

13 H(Sn) = n∙H(S)
Since the symbols of the extension are independent, H(Sn) = n∙H(S). Hence the average S-F code length Ln for T satisfies:
H(T) ≤ Ln < H(T) + 1
⇒ n∙H(S) ≤ Ln < n∙H(S) + 1
⇒ H(S) ≤ Ln/n < H(S) + 1/n  [now let n go to infinity]
So, by coding extensions, the average code length per source symbol can be made arbitrarily close to the entropy (Shannon's first theorem). 6.8

14 Extension Example
S = {s1, s2}, p1 = 2/3, p2 = 1/3; H2(S) = (2/3)log2(3/2) + (1/3)log2(3/1) ≈ 0.918
Huffman: s1 = 0, s2 = 1; avg. coded length = (2/3)∙1 + (1/3)∙1 = 1
Shannon-Fano: l1 = 1, l2 = 2; avg. length = (2/3)∙1 + (1/3)∙2 = 4/3
2nd extension: p11 = 4/9, p12 = p21 = 2/9, p22 = 1/9
S-F: l11 = ⌈log2(9/4)⌉ = 2, l12 = l21 = ⌈log2(9/2)⌉ = 3, l22 = ⌈log2(9/1)⌉ = 4
LSF(2) = avg. coded length = (4/9)∙2 + (2/9)∙3∙2 + (1/9)∙4 = 24/9 = 2.666…, i.e. 4/3 per original symbol
In general Sn = (s1 + s2)n: the probabilities are the corresponding terms in (p1 + p2)n. 6.9
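A sketch carrying this example to higher extensions (sf_avg_length is an assumed helper for the Shannon-Fano average length):

```python
import math
from itertools import product

def sf_avg_length(probs):
    """Average Shannon-Fano length with l_i = ceil(log2(1/p_i))."""
    return sum(p * math.ceil(math.log2(1.0 / p)) for p in probs)

p = [2/3, 1/3]
H = sum(x * math.log2(1 / x) for x in p)          # ≈ 0.918
for n in (1, 2, 3, 4):
    ext = [math.prod(t) for t in product(p, repeat=n)]
    print(n, sf_avg_length(ext) / n)              # per-symbol length stays within 1/n of H
```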

15 Extension cont.
[The slide works out the general nth extension of this source: the probabilities are the terms C(n, k)∙2^k/3^n of the expansion of (2/3 + 1/3)^n, and they sum to 1 because (2 + 1)^n = 3^n.] 6.9

16 Markov Process Entropy
For an mth-order Markov source, the entropy averages the information of the next symbol over the states (the preceding m symbols):
H(S) = Σ_states p(si1, …, sim) Σi p(si | si1, …, sim) log(1/p(si | si1, …, sim)) = Σ p(si1, …, sim, si) log(1/p(si | si1, …, sim))
6.10

17 Example
A second-order binary Markov process; the state is the previous two digits (si1, si2) and the next digit is si:

Si1 Si2  Si  p(si | si1, si2)  p(si1, si2)  p(si1, si2, si)
0   0    0   0.8               5/14         4/14
0   0    1   0.2               5/14         1/14
0   1    0   0.5               2/14         1/14
0   1    1   0.5               2/14         1/14
1   0    0   0.5               2/14         1/14
1   0    1   0.5               2/14         1/14
1   1    0   0.2               5/14         1/14
1   1    1   0.8               5/14         4/14

[State diagram over the states (0,0), (0,1), (1,0), (1,1): the self-loops at (0,0) and (1,1) have probability .8, leaving them has probability .2, and every transition out of (0,1) or (1,0) has probability .5.]
Equilibrium probabilities: p(0,0) = 5/14 = p(1,1), p(0,1) = 2/14 = p(1,0). 6.11
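A sketch computing the Markov entropy of this example (values transcribed from the table above):

```python
import math

# p(si | si1, si2) for each state, and the equilibrium state probabilities.
cond = {(0, 0): {0: 0.8, 1: 0.2},
        (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.5, 1: 0.5},
        (1, 1): {0: 0.2, 1: 0.8}}
state_prob = {(0, 0): 5/14, (0, 1): 2/14, (1, 0): 2/14, (1, 1): 5/14}

# H(S) = sum over states and next symbols of p(state, s) * log2(1/p(s | state))
H = sum(state_prob[st] * p * math.log2(1.0 / p)
        for st, dist in cond.items() for p in dist.values())
print(H)   # ≈ 0.801 bits per symbol, below the 1 bit of a memoryless fair source
```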

18 The Fibonacci numbers
Let f0 = 1, f1 = 2, f2 = 3, f3 = 5, f4 = 8, …, be defined by fn+1 = fn + fn−1. Then lim_{n→∞} fn+1/fn = φ = (1+√5)/2, the golden ratio, a root of the equation x² = x + 1. Use these as the weights for a system of number representation with digits 0 and 1, without adjacent 1's (because (100)phi = (11)phi).

19 Base Fibonacci Representation
Theorem: every number from 0 to fn − 1 can be uniquely written as an n-bit number with no adjacent ones.
Existence. Basis: n = 0, so 0 ≤ i ≤ 0 and 0 = ()phi = ε, the empty string. Induction: let 0 ≤ i < fn+1. If i < fn, we are done by the induction hypothesis. Otherwise fn ≤ i < fn+1 = fn−1 + fn, so 0 ≤ i − fn < fn−1, and i − fn is uniquely representable as (bn−2 … b0)phi with bi ∈ {0, 1} and ¬(bi = bi+1 = 1). Hence i = (1 0 bn−2 … b0)phi, which also has no adjacent ones.
Uniqueness. Let i be the smallest number ≥ 0 with two distinct representations (no leading zeros): i = (bn−1 … b0)phi = (b′n−1 … b′0)phi. By minimality of i, bn−1 ≠ b′n−1, so without loss of generality let bn−1 = 1 and b′n−1 = 0. But then (b′n−2 … b′0)phi = i ≥ fn−1, which can't be true, since an (n−1)-bit representation is at most fn−1 − 1.
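A small sketch of this representation via the greedy construction (all names and the choice n = 5 are illustrative):

```python
def fib_weights(n):
    """Weights f_0 = 1, f_1 = 2, ..., f_k = f_{k-1} + f_{k-2}, n of them."""
    w = [1, 2]
    while len(w) < n:
        w.append(w[-1] + w[-2])
    return w[:n]

def to_base_fib(i, n):
    """n-bit base-Fibonacci representation of i (greedy, no adjacent ones)."""
    bits = []
    for f in reversed(fib_weights(n)):
        if f <= i:
            bits.append(1)
            i -= f
        else:
            bits.append(0)
    return bits

# Every number 0 .. f_n - 1 gets a distinct n-bit string with no adjacent ones.
n = 5                                            # weights 1, 2, 3, 5, 8; f_5 = 13
reps = [tuple(to_base_fib(i, n)) for i in range(13)]
assert len(set(reps)) == 13
assert all(not (r[k] and r[k + 1]) for r in reps for k in range(n - 1))
print(reps[:5])
```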

20 Base Fibonacci
The golden ratio φ = (1+√5)/2 is a solution to x² − x − 1 = 0 and is equal to the limit of the ratio of adjacent Fibonacci numbers.
1st-order Markov process: after a 1 the next digit must be a 0 (probability 1); after a 0 the next digit is 0 with probability 1/φ and 1 with probability 1/φ².
[State diagram with transition probabilities 1, 1/φ and 1/φ².]
Think of the source as emitting variable-length symbols 0 and 10; note 1/φ + 1/φ² = 1. See accompanying file.
Entropy = (1/φ)∙log φ + ½(1/φ²)∙log φ² = log φ, which is maximal (taking into account the variable-length symbols).
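A numeric check that this chain's per-digit entropy is log2 φ (a sketch; the transition probabilities are as read above):

```python
import math

phi = (1 + math.sqrt(5)) / 2

# First-order Markov chain for base Fibonacci: after a 1 the next digit is 0;
# after a 0, emit 0 with probability 1/phi and 1 with probability 1/phi**2.
p00, p01 = 1 / phi, 1 / phi**2               # note p00 + p01 = 1
p0 = 1 / (1 + p01)                           # equilibrium probability of state 0
H = p0 * (p00 * math.log2(1 / p00) + p01 * math.log2(1 / p01))  # state 1 adds 0
print(H, math.log2(phi))                     # both ≈ 0.694 bits per digit
```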

21 The Adjoint System (skip)
For simplicity, consider a first-order Markov system S. Goal: bound the entropy by that of a source with zero memory whose probabilities are the equilibrium probabilities. Let p(si) = equilibrium prob. of si, p(sj) = equilibrium prob. of sj, and p(sj, si) = equilibrium probability of getting sj si. By the Gibbs inequality,
Σi,j p(sj, si) log [ p(si)∙p(sj) / p(sj, si) ] ≤ 0,
with = only if p(sj, si) = p(si) ∙ p(sj). Now, p(sj, si) = p(si | sj) ∙ p(sj); substituting this and expanding the logarithm shows that the entropy of the Markov system S is at most the entropy of its zero-memory adjoint. 6.12

