1 Information Theory Rong Jin

2 Outline
- Information
- Entropy
- Mutual information
- Noisy channel model

3 Information
- Information ≠ knowledge
- Information: reduction in uncertainty
- Example:
  1. Flip a coin
  2. Roll a die
  Outcome #2 is more uncertain than #1, so the outcome of #2 provides more information than the outcome of #1.

4 Definition of Information
- Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received I(E) = log2(1/P(E)) bits of information.
- Example:
  - Result of a fair coin flip: log2(2) = 1 bit
  - Result of a fair die roll: log2(6) = 2.585 bits
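
A minimal Python sketch of this definition (the helper name information_bits is mine, not from the slides):

    import math

    def information_bits(p):
        """I(E) = log2(1/P(E)): bits of information received when an event of probability p occurs."""
        return math.log2(1.0 / p)

    print(information_bits(1 / 2))   # fair coin flip -> 1.0 bit
    print(information_bits(1 / 6))   # fair die roll  -> ~2.585 bits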

5 Information is Additive
- I(k fair coin tosses) = log2(2^k) = k bits
- Example: information conveyed by words
  - A random word from a 100,000-word vocabulary: I(word) = log2(100,000) = 16.6 bits
  - A 1000-word document from the same source: I(document) = 16,600 bits
  - A 480x640-pixel, 16-greyscale picture: I(picture) = 307,200 * log2(16) = 1,228,800 bits
  - A picture is worth (more than) a thousand words!
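
The arithmetic on this slide can be checked directly; a small sketch using the slide's idealized figures:

    import math

    word_bits = math.log2(100_000)            # ~16.6 bits for a random word
    document_bits = 1000 * word_bits          # ~16,600 bits for a 1000-word document
    picture_bits = 480 * 640 * math.log2(16)  # 307,200 pixels * 4 bits = 1,228,800 bits

    print(word_bits, document_bits, picture_bits)
    print(picture_bits / word_bits)           # roughly how many random words one picture is worth (~74,000)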

9 Outline
- Information
- Entropy
- Mutual Information
- Cross Entropy and Learning

10 Entropy
- A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, ..., sk} with probabilities {p1, p2, ..., pk}, respectively, where the symbols emitted are statistically independent.
- What is the average amount of information in observing the output of the source S?
- Call this the entropy:
  H(S) = Σ_i p_i log2(1/p_i) = -Σ_i p_i log2(p_i)
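
A short Python sketch of this definition (the entropy helper is mine):

    import math

    def entropy(probs):
        """H(S) = -sum_i p_i * log2(p_i): average information, in bits, per emitted symbol."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # fair coin source: 1.0 bit/symbol
    print(entropy([1 / 6] * 6))   # fair die source: ~2.585 bits/symbol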

12 Explanation of Entropy
1. The average amount of information provided per symbol
2. The average number of bits needed to communicate each symbol

13 Properties of Entropy
1. Non-negative: H(P) ≥ 0
2. For any other probability distribution {q1, ..., qk} (Gibbs' inequality):
   H(P) = Σ_i p_i log2(1/p_i) ≤ Σ_i p_i log2(1/q_i)
3. H(P) ≤ log2(k), with equality iff p_i = 1/k for all i
4. The further P is from uniform, the lower the entropy.
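
A quick numerical check of properties 2 and 3; this is only a sketch, and the example distributions p and q are arbitrary choices of mine:

    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def cross_entropy(p, q):
        """The right-hand side of property 2: sum_i p_i * log2(1/q_i)."""
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.3, 0.5, 0.2]
    q = [0.2, 0.2, 0.6]
    print(entropy(p) <= cross_entropy(p, q))   # property 2 (Gibbs' inequality): True
    print(entropy(p) <= math.log2(len(p)))     # property 3: True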

14 Entropy: k = 2
[Plot of the binary entropy H(p, 1-p) against p:]
- Zero information at the edges (p = 0 or p = 1)
- Maximum information at p = 0.5 (1 bit)
- The curve drops off more quickly close to the edges than in the middle
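
A few values of the binary entropy make the shape of this curve concrete (sketch; the helper name is mine):

    import math

    def binary_entropy(p):
        """H(p, 1-p) in bits."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    for p in (0.0, 0.05, 0.1, 0.3, 0.5):
        print(p, round(binary_entropy(p), 3))
    # rises steeply near the edges (0.0 -> 0.286 between p = 0 and p = 0.05)
    # but flattens near the middle (0.881 -> 1.0 between p = 0.3 and p = 0.5)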

15 The Entropy of English
- 27 characters (A-Z, space); 100,000 words (average 6.5 characters each)
- Assuming independence between successive characters:
  - Uniform character distribution: log2(27) = 4.75 bits/char
  - True character distribution: 4.03 bits/char
- Assuming independence between successive words:
  - Uniform word distribution: log2(100,000)/6.5 = 2.55 bits/char
  - True word distribution: 9.45/6.5 = 1.45 bits/char
- The true entropy of English is much lower!
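
The per-character figures can be reproduced directly (a sketch; 4.03 bits/char and 9.45 bits/word are empirical values quoted on the slide, not computed here):

    import math

    chars_per_word = 6.5
    uniform_char = math.log2(27)                         # 4.75 bits/char
    uniform_word = math.log2(100_000) / chars_per_word   # ~2.55 bits/char
    true_word = 9.45 / chars_per_word                    # ~1.45 bits/char

    print(round(uniform_char, 2), round(uniform_word, 2), round(true_word, 2))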

17 Entropy of Two Sources
- Temperature T: P(T = hot) = 0.3, P(T = mild) = 0.5, P(T = cold) = 0.2
  H(T) = H(0.3, 0.5, 0.2) = 1.485 bits
- Humidity M: P(M = low) = 0.6, P(M = high) = 0.4
  H(M) = H(0.6, 0.4) = 0.971 bits
- The random variables T and M are not independent: P(T = t, M = m) ≠ P(T = t) P(M = m)

18 Joint Entropy
- Joint probability P(T, M) (table on the original slide)
- H(T) = 1.485, H(M) = 0.971, so H(T) + H(M) = 2.456
- Joint entropy: H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.321
- H(T, M) ≤ H(T) + H(M)
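
A sketch that checks these numbers. The joint table itself was a figure on the original slide, so the P(T, M) values below are a reconstruction of mine that is consistent with all of the marginals and entropies quoted here:

    import math

    # Reconstructed (assumed) joint distribution P(T, M)
    joint = {('hot', 'low'): 0.1, ('hot', 'high'): 0.2,
             ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
             ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    H_T = entropy([0.3, 0.5, 0.2])    # 1.485
    H_M = entropy([0.6, 0.4])         # 0.971
    H_TM = entropy(joint.values())    # 2.321
    print(H_T, H_M, H_TM, H_TM <= H_T + H_M)   # the joint entropy never exceeds the sum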

19 Conditional Entropy
- Conditional probability P(T | M) (table on the original slide)
- H(T | M = low) = 1.252, H(T | M = high) = 1.5
- Average conditional entropy: H(T | M) = P(M = low) H(T | M = low) + P(M = high) H(T | M = high) = 0.6 * 1.252 + 0.4 * 1.5 = 1.351
- How much is M telling us about T on average? H(T) - H(T | M) = 1.485 - 1.351 = 0.134 bits
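
Continuing with the same reconstructed joint table (an assumption, as noted above), the conditional entropies and the 0.134-bit gain can be verified:

    import math

    joint = {('hot', 'low'): 0.1, ('hot', 'high'): 0.2,
             ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
             ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}
    P_M = {'low': 0.6, 'high': 0.4}

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def H_T_given(m):
        """Entropy of T under the conditional distribution P(T | M = m)."""
        return entropy([p / P_M[m] for (t, mm), p in joint.items() if mm == m])

    H_T_given_M = sum(P_M[m] * H_T_given(m) for m in P_M)    # 0.6*1.252 + 0.4*1.5 = 1.351
    print(H_T_given('low'), H_T_given('high'), H_T_given_M)
    print(entropy([0.3, 0.5, 0.2]) - H_T_given_M)            # ~0.134 bits gained about T by observing M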

20 Mutual Information
- Definition: I(X; Y) = H(X) - H(X | Y) = Σ_{x,y} P(x, y) log2( P(x, y) / (P(x) P(y)) )
- Properties:
  - Indicates the amount of information one random variable provides about another
  - Symmetric: I(X; Y) = I(Y; X)
  - Non-negative
  - Zero iff X and Y are independent
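
A sketch of the definition above, again using the reconstructed joint table (an assumption); it also checks symmetry:

    import math

    joint = {('hot', 'low'): 0.1, ('hot', 'high'): 0.2,
             ('mild', 'low'): 0.4, ('mild', 'high'): 0.1,
             ('cold', 'low'): 0.1, ('cold', 'high'): 0.1}
    P_T = {'hot': 0.3, 'mild': 0.5, 'cold': 0.2}
    P_M = {'low': 0.6, 'high': 0.4}

    def mutual_information(joint, px, py):
        """I(X; Y) = sum_{x,y} P(x, y) * log2( P(x, y) / (P(x) P(y)) )."""
        return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

    I_TM = mutual_information(joint, P_T, P_M)
    I_MT = mutual_information({(m, t): p for (t, m), p in joint.items()}, P_M, P_T)
    print(I_TM, I_MT)   # both ~0.134 bits: symmetric and non-negative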

21 Relationship
[Venn diagram relating H(X, Y), H(X), H(Y), H(X | Y), H(Y | X), and I(X; Y):]
- H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)
- I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = H(X) + H(Y) - H(X, Y)

22 A Distance Measure Between Distributions
- Kullback-Leibler distance: KL(P_D || P_M) = Σ_x P_D(x) log2( P_D(x) / P_M(x) )
- Properties of the Kullback-Leibler distance:
  - Non-negative: KL(P_D || P_M) ≥ 0, with equality iff P_D = P_M
    Minimizing the KL distance makes P_M close to P_D
  - Non-symmetric: KL(P_D || P_M) ≠ KL(P_M || P_D)
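
A small sketch of the KL distance in bits, illustrating non-negativity and non-symmetry (the distributions are arbitrary examples of mine):

    import math

    def kl(p, q):
        """KL(p || q) = sum_x p(x) * log2( p(x) / q(x) ), in bits."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.3, 0.5, 0.2]
    q = [0.2, 0.2, 0.6]
    print(kl(p, q), kl(q, p))   # both >= 0, and in general not equal to each other
    print(kl(p, p))             # 0.0: the distance is zero iff the two distributions match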

23 Bregman Distance
- D_φ(x, y) = φ(x) - φ(y) - <∇φ(y), x - y>, where φ(x) is a convex function
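
A sketch of the Bregman distance for one particular choice of convex function; with φ(x) = Σ x_i^2 it reduces to the squared Euclidean distance (and with the negative entropy φ(x) = Σ x_i log x_i it recovers a KL-style distance). The implementation below is mine, not from the slides:

    def bregman(x, y, phi, grad_phi):
        """D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>, for a convex phi."""
        return phi(x) - phi(y) - sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))

    # phi(x) = sum of squares  ->  squared Euclidean distance
    phi = lambda v: sum(vi * vi for vi in v)
    grad_phi = lambda v: [2.0 * vi for vi in v]

    x, y = [1.0, 2.0], [0.5, 1.0]
    print(bregman(x, y, phi, grad_phi))              # 1.25
    print(sum((a - b) ** 2 for a, b in zip(x, y)))   # squared Euclidean distance: 1.25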

24 Compression Algorithm for TC
[Diagram: the Sports and Politics training examples are compressed on their own (109K, 116K) and again with the new document appended (126K, 129K); the new document is assigned to the topic whose compressed size grows the least. Conclusion shown: Topic: Sports]
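
A rough sketch of this idea with zlib; the toy training strings and the new document below are mine, not the slide's data:

    import zlib

    def compressed_size(text):
        return len(zlib.compress(text.encode('utf-8')))

    sports = "goal match team score league player coach season win " * 50
    politics = "vote election senate policy law parliament minister bill " * 50
    new_doc = "the team won the match after the coach changed the players at half time"

    # Assign the topic whose compressed size grows the least when the new document is appended.
    growth = {
        'Sports': compressed_size(sports + new_doc) - compressed_size(sports),
        'Politics': compressed_size(politics + new_doc) - compressed_size(politics),
    }
    print(growth, '->', min(growth, key=growth.get))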

26 The Noisy Channel
- Prototypical case:
  Input 0,1,1,1,0,1,0,1,... → The channel (adds noise) → Output (noisy) 0,1,1,0,0,1,1,0,...
- Model: probability of error (noise)
- Example: p(0|1) = 0.3, p(1|1) = 0.7, p(1|0) = 0.4, p(0|0) = 0.6
- The task: known: the noisy output; want to know: the input (decoding)
- Source coding theorem
- Channel coding theorem
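
Given the channel probabilities above, a bit-by-bit maximum a posteriori decoder can be sketched as follows; the uniform prior over the input bit is my assumption, not something stated on the slide:

    # Channel model from the slide: p(output | input)
    p_out_given_in = {(0, 1): 0.3, (1, 1): 0.7, (1, 0): 0.4, (0, 0): 0.6}
    p_in = {0: 0.5, 1: 0.5}   # assumed uniform prior over the input bit

    def decode_bit(out_bit):
        """Pick the input bit with the highest posterior: p(in | out) is proportional to p(out | in) * p(in)."""
        return max((0, 1), key=lambda b: p_out_given_in[(out_bit, b)] * p_in[b])

    output = [0, 1, 1, 0, 0, 1, 1, 0]
    print([decode_bit(o) for o in output])   # with these particular probabilities the MAP guess equals the observed bit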

27 Noisy Channel Applications
- OCR (straightforward): text → print (adds noise) → scan → image
- Handwriting recognition: text → neurons, muscles ("noise") → scan/digitize → image
- Speech recognition (dictation, commands, etc.): text → conversion to acoustic signal ("noise") → acoustic waves
- Machine Translation: text in target language → translation ("noise") → source language

