Some basic concepts of Information Theory and Entropy

1 Some basic concepts of Information Theory and Entropy
Information theory (IT), Entropy, Mutual Information, Use in NLP. Introduction: general framework and the motivation of this work

2 Entropy Entropy is related to coding theory: a more efficient code assigns fewer bits to more frequent messages, at the cost of more bits for the less frequent ones

3 EXAMPLE: You have to send messages about the two occupants in a house every five minutes
Equal probability: 0 = no occupants, 1 = first occupant, 2 = second occupant, 3 = both occupants. Different probability: a table giving, for each situation (no occupants, first occupant, second occupant, both occupants), its probability and its code
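As a sketch of the idea behind this example (the probabilities and codewords below are invented, since the original table's values are not in the transcript), a variable-length code that gives shorter codewords to more frequent situations beats the fixed 2-bit code on average:
# Hypothetical probabilities for the four situations (not from the original slide).
probs = {"no occupants": 0.5, "first occupant": 0.125, "second occupant": 0.125, "both": 0.25}
# A prefix-free variable-length code: frequent situations get shorter codewords.
var_code = {"no occupants": "0", "first occupant": "110", "second occupant": "111", "both": "10"}
expected_fixed = sum(p * 2 for p in probs.values())                 # fixed 2-bit code
expected_var = sum(p * len(var_code[s]) for s, p in probs.items())  # variable-length code
print(expected_fixed, expected_var)  # 2.0 bits vs 1.75 bits per message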

4 Let X be a random variable taking values x1, x2, ..., xn
Let X be a random variable taking values x1, x2, ..., xn from a domain according to a probability distribution p. We can define the expected value of X, E(X), as the sum of the possible values weighted by their probabilities: E(X) = p(x1)·x1 + p(x2)·x2 + ... + p(xn)·xn
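A minimal sketch of this weighted sum (the values and probabilities are invented for illustration):
values = [1, 2, 3, 4]
probs = [0.5, 0.25, 0.125, 0.125]   # must sum to 1
expected = sum(p * x for p, x in zip(probs, values))
print(expected)  # 0.5*1 + 0.25*2 + 0.125*3 + 0.125*4 = 1.875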

5 Entropy A message can be thought of as a random variable W that can take one of several values V(W), with a probability distribution P. Is there a lower bound on the number of bits needed to encode a message? Yes: the entropy. It is possible to get close to this minimum (lower bound). Entropy is also a measure of our uncertainty about what the message says (many bits = uncertain, few bits = certain)

6 Given an event, we want to associate with it an information content (I)
From Shannon in the 1940s. Two constraints: Significance: the less probable an event is, the more information it carries: P(x1) > P(x2) => I(x2) > I(x1). Additivity: if two events are independent, I(x1x2) = I(x1) + I(x2). (Shannon was interested in this problem because he wanted to determine theoretical maxima for how efficiently information can be encoded and transmitted.)

7 I(m) = 1/p(m) does not satisfy the second requirement
I(x) = - log p(x) satisfies both requirements, so we define the information content as I(x) = - log p(x)
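A small numeric check of the two requirements (the probabilities are made up): -log p is additive for independent events, while 1/p is not.
import math
p1, p2 = 0.5, 0.25
p_joint = p1 * p2                     # independence: p(x1 x2) = p(x1) p(x2)
print(-math.log2(p_joint), -math.log2(p1) - math.log2(p2))  # 3.0 3.0 (additive)
print(1 / p_joint, 1 / p1 + 1 / p2)                         # 8.0 vs 6.0 (not additive)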

8 Entropy is the expected value of I: E(I)
Let X be a random variable, described by p(X), with information content I. Entropy is the expected value of I: H(X) = E(I) = - Σx p(x) log2 p(x). Entropy measures the information content of a random variable. We can consider it as the average length of the message needed to transmit a value of this variable using an optimal coding. Entropy also measures the degree of disorder (uncertainty) of the random variable. Using the optimal code, the average message length equals the entropy, which is the minimum
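A minimal sketch of this computation (the distributions below are invented for illustration):
import math
def entropy(probs):
    # H = -sum p * log2 p, skipping zero-probability values
    return -sum(p * math.log2(p) for p in probs if p > 0)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform over 4 values
print(entropy([1.0]))                     # 0.0 bits: no uncertainty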

9 Uniform distribution of a variable X.
Each possible value xi ∈ X, with |X| = M, has the same probability pi = 1/M. If the value xi is encoded in binary, we need log2 M bits of information. Non-uniform distribution, by analogy: each value xi has a different probability pi. Let us assume the pi to be independent. If we set Mpi = 1/pi, we will need log2 Mpi = log2 (1/pi) = - log2 pi bits of information

10 Let X ={a, b, c, d} with pa = 1/2; pb = 1/4; pc = 1/8; pd = 1/8
entropy(X) = E(I) = -1/2 log2 (1/2) - 1/4 log2 (1/4) - 1/8 log2 (1/8) - 1/8 log2 (1/8) = 7/4 = 1.75 bits. Decision tree of yes/no questions: X = a? (yes → a; no → X = b? (yes → b; no → X = c? (yes → c; no → d))). Average number of questions: 1.75
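A quick check of this example in code (a sketch reproducing the numbers on the slide): the entropy and the expected number of yes/no questions under the tree above are both 1.75.
import math
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
entropy = -sum(p * math.log2(p) for p in probs.values())
questions = {"a": 1, "b": 2, "c": 3, "d": 3}   # questions asked by the tree for each value
avg_questions = sum(probs[v] * q for v, q in questions.items())
print(entropy, avg_questions)  # 1.75 1.75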

11 Let X follow a Bernoulli (binary) distribution
X = 0 with probability p; X = 1 with probability (1-p). H(X) = -p log2 (p) - (1-p) log2 (1-p). p = 0 => 1 - p = 1 => H(X) = 0. p = 1 => 1 - p = 0 => H(X) = 0. p = 1/2 => 1 - p = 1/2 => H(X) = 1. [Plot: H(X) as a function of p, ranging from 0 to 1 bit and peaking at p = 1/2.]
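A minimal sketch of this binary entropy function:
import math
def binary_entropy(p):
    # H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))  # maximum of 1.0 bit at p = 0.5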

12 H is a weighted average of -log p(x), where the weight of each value x is its probability p(x)
H INCREASES WITH MESSAGE LENGTH

13 The joint entropy of two random variables X and Y is the average information content needed to specify both variables: H(X,Y) = - Σx Σy p(x,y) log2 p(x,y)

14 The conditional entropy of a random variable Y given another random variable X describes how much information is needed on average to communicate Y when the reader already knows X: H(Y|X) = - Σx Σy p(x,y) log2 p(y|x) = H(X,Y) - H(X)
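A sketch computing both quantities from a toy joint distribution (the table values are invented for illustration):
import math
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}  # invented p(x, y)
def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}  # marginal of X
joint = H(p_xy.values())                 # H(X,Y)
conditional = joint - H(p_x.values())    # H(Y|X) = H(X,Y) - H(X)
print(joint, conditional)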

15 P(A,B) = P(A|B)P(B) = P(B|A)P(A)
Chain rule for probabilities: P(A,B) = P(A|B)P(B) = P(B|A)P(A). P(A,B,C,D,…) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) … The generalization of this rule to multiple events is the chain rule. The chain rule is used in many places in statistical NLP, such as Markov models.
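As a sketch of how this is used in statistical NLP (all probabilities invented), the probability of a word sequence under a bigram (Markov) approximation of the chain rule:
p_first = {"the": 0.2}                                     # P(w1), invented
p_bigram = {("the", "cat"): 0.05, ("cat", "sleeps"): 0.1}  # P(wi | wi-1), invented
def sequence_prob(words):
    # Chain rule with a bigram approximation: P(w1..wn) ~= P(w1) * prod P(wi | wi-1)
    prob = p_first[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram[(prev, cur)]
    return prob
print(sequence_prob(["the", "cat", "sleeps"]))  # 0.2 * 0.05 * 0.1 = 0.001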

16 Chain rule for entropies: H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
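A sketch checking this on the same invented joint table as above: H(Y|X) computed directly as Σx p(x)·H(Y|X=x) agrees with H(X,Y) - H(X).
import math
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)
p_x = {x: p_xy[(x, 0)] + p_xy[(x, 1)] for x in (0, 1)}
h_y_given_x = sum(p_x[x] * H([p_xy[(x, y)] / p_x[x] for y in (0, 1)]) for x in (0, 1))
print(round(H(p_xy.values()), 6), round(H(p_x.values()) + h_y_given_x, 6))  # equal: chain rule holds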

17 I(X,Y) is the mutual information between X and Y.
I(X,Y) measures the reduction of the uncertainty of X when Y is known. It also measures the amount of information that X carries about Y (or Y about X)

18 I = 0 only when X and Y are independent: H(X|Y) = H(X)
H(X) = H(X) - H(X|X) = I(X,X): entropy is the self-information (the mutual information of X with itself). For two dependent variables, I grows not only with the degree of their dependence but also with their entropy, since H(X) = I(X,X). This also explains why the mutual information between two totally dependent variables is not constant but depends on their entropy
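A sketch computing I(X;Y) = H(X) + H(Y) - H(X,Y) on the same kind of invented joint table used above; the result is positive because X and Y are not independent there.
import math
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}  # invented joint distribution
def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)
p_x = [0.75, 0.25]      # marginal of X
p_y = [0.625, 0.375]    # marginal of Y
mi = H(p_x) + H(p_y) - H(p_xy.values())   # I(X;Y)
print(mi)  # > 0: knowing Y reduces the uncertainty about X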

19

20 Pointwise Mutual Information
The PMI of a pair of outcomes x and y belonging to discrete random variables quantifies the discrepancy between the probability of their coincidence under their joint distribution and the probability of their coincidence under their individual distributions, assuming independence: PMI(x,y) = log2 ( p(x,y) / (p(x) p(y)) ). The mutual information of X and Y is the expected value of the pointwise (specific) mutual information over all possible outcomes.
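A sketch of PMI for a word pair estimated from corpus counts (all counts invented for illustration):
import math
n_total = 1_000_000                  # total word-pair observations (invented)
n_x, n_y, n_xy = 2_000, 3_000, 300   # counts of x, y, and their co-occurrence (invented)
p_x, p_y, p_xy = n_x / n_total, n_y / n_total, n_xy / n_total
pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))  # positive: x and y co-occur more often than independence predicts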

21 H: entropy of a language L
We do not know p(X). Let q(X) be a language model (LM). How good is q(X) as an estimate of p(X)?

22 Cross Entropy Measures the “surprise” of a model q when it describes events that actually follow a distribution p: H(p,q) = - Σx p(x) log2 q(x)
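A minimal sketch with an invented true distribution p and model q; the cross entropy is never smaller than the entropy of p.
import math
p = [0.5, 0.25, 0.25]   # true distribution (invented)
q = [0.4, 0.4, 0.2]     # model distribution (invented)
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
entropy_p = -sum(pi * math.log2(pi) for pi in p if pi > 0)
print(cross_entropy, entropy_p)  # about 1.57 vs 1.5 bits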

23 Relative Entropy Relative entropy, or Kullback-Leibler (KL) divergence
Measures the difference between two probability distributions: D(p || q) = Σx p(x) log2 ( p(x) / q(x) ) = H(p,q) - H(p)
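A sketch with the same invented p and q as above, showing that D(p || q) equals the cross entropy minus the entropy of p and is non-negative.
import math
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
entropy_p = -sum(pi * math.log2(pi) for pi in p if pi > 0)
print(round(kl, 6), round(cross_entropy - entropy_p, 6))  # equal, and >= 0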

