Some basic concepts of Information Theory and Entropy

Some basic concepts of Information Theory and Entropy. Topics: information theory (IT), entropy, mutual information, and their use in NLP. Introduction: general framework and motivation of this work.

Entropy. Related to coding theory: a more efficient code uses fewer bits for the more frequent messages at the cost of more bits for the less frequent ones.

EXAMPLE: You have to send a message about the two possible occupants of a house every five minutes.
Equal probabilities (fixed-length code, one symbol per situation):
  0  no occupants
  1  first occupant only
  2  second occupant only
  3  both occupants
Different probabilities (variable-length code):
  Situation              Probability   Code
  no occupants           .5            0
  first occupant only    .125          110
  second occupant only   .125          111
  both occupants         .25           10
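As a quick worked check (an addition, not on the original slide): the expected code length under the second distribution is .5·1 + .25·2 + .125·3 + .125·3 = 1.75 bits per message, versus 2 bits per message for a fixed-length code over four equally likely situations; this 1.75 reappears later as the entropy of the same distribution.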

Let X be a random variable taking values x1, x2, ..., xn from a domain according to a probability distribution p. We can define the expected value of X, E(X), as the sum of the possible values weighted by their probabilities: E(X) = p(x1)·x1 + p(x2)·x2 + ... + p(xn)·xn
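A small added example (not in the slides): for a fair six-sided die, E(X) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.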

Entropy. A message can be thought of as a random variable W that can take one of several values V(W) according to a probability distribution P. Is there a lower bound on the number of bits needed to encode a message? Yes: the entropy. It is possible to get close to this minimum (lower bound). Entropy is also a measure of our uncertainty about what the message says (many bits needed: uncertain; few bits: certain).

Given an event, we want to associate with it its information content, I (from Shannon in the 1940s). Two constraints: Significance: the less probable an event is, the more information it contains: P(x1) > P(x2) => I(x2) > I(x1). Additivity: if two events are independent, I(x1x2) = I(x1) + I(x2). (Shannon was interested in the problem of message transmission and wanted to determine its theoretical limits.)

I(m) = 1/p(m) does not satisfy the second requirement; I(x) = -log p(x) satisfies both. So we define I(x) = -log p(x).
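A small added numeric example (not on the original slide): with base-2 logarithms, an event with probability 1/8 carries I = -log2(1/8) = 3 bits, which matches the 3-bit codewords assigned to the two .125-probability situations in the house example above.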

Let X be a random variable, described by p(X), with information content I. Entropy is the expected value of I: H(X) = E(I). Entropy measures the information content of a random variable. We can see it as the average length of the message needed to transmit a value of this variable using an optimal coding. Entropy also measures the degree of disorder (uncertainty) of the random variable. Using the optimal code, the average message length equals the entropy, which is the minimum.
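A minimal Python sketch (my own illustration, not from the slides) of entropy as the expected value of I(x) = -log2 p(x):

import math

def entropy(probs):
    # H(X) = sum over x of p(x) * (-log2 p(x)); zero-probability values contribute nothing
    return sum(-p * math.log2(p) for p in probs if p > 0)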

Uniform distribution of a variable X: each possible value xi ∈ X, with |X| = M, has the same probability pi = 1/M. If the value xi is coded in binary, we need log2 M bits of information. Non-uniform distribution, by analogy: each value xi has a different probability pi (assume the values occur independently). Setting Mi = 1/pi, we will need log2 Mi = log2(1/pi) = -log2 pi bits of information for value xi.

Let X = {a, b, c, d} with pa = 1/2, pb = 1/4, pc = 1/8, pd = 1/8. entropy(X) = E(I) = -1/2 log2(1/2) - 1/4 log2(1/4) - 1/8 log2(1/8) - 1/8 log2(1/8) = 7/4 = 1.75 bits. Decision tree of yes/no questions: ask "X = a?" (yes: a); otherwise "X = b?" (yes: b); otherwise "X = c?" (yes: c, no: d). Average number of questions: 1.75
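Using the entropy sketch above (again an added illustration): entropy([0.5, 0.25, 0.125, 0.125]) returns 1.75, matching both the hand computation and the average number of yes/no questions.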

Let X have a binary (Bernoulli) distribution: X = 0 with probability p, X = 1 with probability (1-p). H(X) = -p log2(p) - (1-p) log2(1-p). p = 0 => 1-p = 1 => H(X) = 0; p = 1 => 1-p = 0 => H(X) = 0; p = 1/2 => 1-p = 1/2 => H(X) = 1. (Figure: plot of H(X) against p, rising from 0 at p = 0 to a maximum of 1 at p = 1/2 and falling back to 0 at p = 1.)
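As an added intermediate point (not on the slide), for p = 1/4: H(X) = -0.25 log2(0.25) - 0.75 log2(0.75) ≈ 0.5 + 0.311 ≈ 0.811 bits, already fairly close to the maximum of 1.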

H is a weighted average of -log p(x), where the weight of each x is its probability. H increases with message length.

The joint entropy of two random variables X and Y is the average information content needed to specify both variables.
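For reference (added here; the formula itself does not appear in the transcript): H(X,Y) = - sum over x,y of p(x,y) log2 p(x,y).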

The conditional entropy of a random variable Y given another random variable X describes how much information is needed, on average, to communicate Y when the receiver already knows X.
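For reference (added, standard definition): H(Y|X) = - sum over x,y of p(x,y) log2 p(y|x) = H(X,Y) - H(X).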

Chain rule for probabilities: P(A,B) = P(A|B)P(B) = P(B|A)P(A). The generalization of this rule to multiple events is the chain rule: P(A,B,C,D,...) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)... The chain rule is used in many places in statistical NLP, such as Markov models.
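An added NLP-flavoured illustration (mine, with a hypothetical three-word sentence w1 w2 w3): the chain rule gives P(w1,w2,w3) = P(w1) P(w2|w1) P(w3|w1,w2); a first-order Markov model then approximates P(w3|w1,w2) by P(w3|w2).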

Chain rule for entropies
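For reference (added, standard result): H(X,Y) = H(X) + H(Y|X), and more generally H(X1,...,Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1,...,Xn-1).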

I(X,Y) is the mutual information between X and Y. I(X,Y) measures the reduction in uncertainty about X when Y is known. It also measures the amount of information X carries about Y (or Y about X).
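For reference (added, standard identities): I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).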

I(X,Y) = 0 exactly when X and Y are independent (H(X|Y) = H(X)). Since H(X|X) = 0, H(X) = H(X) - H(X|X) = I(X,X): entropy is the self-information, the mutual information between X and itself. For two dependent variables, I grows not only with the degree of their dependence but also with their entropy. This also explains why the mutual information between two totally dependent variables is not constant but depends on their entropy.
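A small Python sketch (my own illustration, using a made-up joint distribution) that computes these quantities from a joint probability table:

import math

def H(probs):
    # Shannon entropy in bits (same helper as in the earlier sketch)
    return sum(-p * math.log2(p) for p in probs if p > 0)

# hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

H_X, H_Y, H_XY = H(p_x.values()), H(p_y.values()), H(joint.values())
I_XY = H_X + H_Y - H_XY   # mutual information via the identity above
print(H_X, H_Y, H_XY, I_XY)   # 1.0, 1.0, about 1.72, about 0.28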

Pointwise Mutual Information. The PMI of a pair of outcomes x and y of discrete random variables quantifies the discrepancy between the probability of their coincidence under their joint distribution and the probability of their coincidence under their individual distributions, assuming independence. The mutual information of X and Y is the expected value of the PMI (specific mutual information) over all possible outcomes.
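For reference (added, standard definition): pmi(x,y) = log2( p(x,y) / (p(x) p(y)) ), and I(X,Y) = sum over x,y of p(x,y) pmi(x,y).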

H: entropy of a language L. We do not know p(X), the true distribution of L. Let q(X) be a language model (LM). How good is q(X) as an estimate of p(X)?

Cross Entropy. Measures the "surprise" of a model q when it describes events that actually follow a distribution p.
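For reference (added, standard definition): H(p,q) = - sum over x of p(x) log2 q(x). It satisfies H(p,q) >= H(p), with equality exactly when q = p.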

Relative Entropy. Relative entropy, or Kullback-Leibler (KL) divergence, measures the difference between two probability distributions.
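For reference (added, standard definition): D(p||q) = sum over x of p(x) log2( p(x) / q(x) ) = H(p,q) - H(p). A minimal Python sketch (mine, with made-up distributions p and q):

import math

def kl_divergence(p, q):
    # D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # "true" distribution from the earlier example
q = [0.25, 0.25, 0.25, 0.25]    # a uniform model distribution
print(kl_divergence(p, q))      # 0.25 bits  (= H(p,q) - H(p) = 2.0 - 1.75)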