Chain Rules for Entropy

Chain Rules for Entropy
The entropy of a collection of random variables is the sum of conditional entropies.
Theorem: Let X1, X2, …, Xn be random variables with joint probability mass function p(x1, x2, …, xn). Then

H(X1, X2, …, Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, …, X1).

The proof is obtained by repeated application of the two-variable expansion rule H(X, Y) = H(X) + H(Y | X).
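
A minimal Python sketch of this identity, using an arbitrarily chosen two-variable joint pmf and a small helper H() for entropy in bits (both are mine, not from the slides): the joint entropy computed directly agrees with H(X1) + H(X2 | X1).

import numpy as np

def H(probs):
    """Entropy in bits, ignoring zero-probability entries."""
    probs = np.asarray(probs, float).ravel()
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Arbitrary joint pmf p(x1, x2) on a 3x3 alphabet (rows: x1, columns: x2).
p = np.array([[1/8, 1/16, 1/4],
              [1/16, 1/8, 1/8],
              [1/8, 1/16, 1/16]])

# Left-hand side: the joint entropy H(X1, X2).
H_joint = H(p)

# Right-hand side: H(X1) + H(X2 | X1), with
# H(X2 | X1) = sum over x1 of p(x1) * H(X2 | X1 = x1).
p_x1 = p.sum(axis=1)
H_cond = sum(p_x1[i] * H(p[i] / p_x1[i]) for i in range(len(p_x1)))
print(H_joint, H(p_x1) + H_cond)   # the two numbers agree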

Conditional Mutual Information
We define the conditional mutual information of random variables X and Y given Z as

I(X; Y | Z) = H(X | Z) − H(X | Y, Z).

Mutual information also satisfies a chain rule:

I(X1, X2, …, Xn; Y) = Σ_{i=1}^{n} I(Xi; Y | Xi−1, …, X1).
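
A short Python sketch computing I(X; Y | Z) via the equivalent identity I(X; Y | Z) = H(X, Z) + H(Y, Z) − H(X, Y, Z) − H(Z), on an arbitrarily chosen 2×2×2 joint pmf (the pmf values and the helper H() are illustrative assumptions):

import numpy as np

def H(probs):
    """Entropy in bits, ignoring zero-probability entries."""
    probs = np.asarray(probs, float).ravel()
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Arbitrary joint pmf p(x, y, z) on a 2x2x2 alphabet; axes are (x, y, z).
p = np.array([[[0.10, 0.05], [0.15, 0.10]],
              [[0.05, 0.20], [0.10, 0.25]]])

# I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)
I_xy_given_z = H(p.sum(axis=1)) + H(p.sum(axis=0)) - H(p) - H(p.sum(axis=(0, 1)))
print(I_xy_given_z)   # non-negative, as conditional mutual information must be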

Convex Function
We recall the definition of a convex function. A function f is said to be convex over an interval (a, b) if for every x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1,

f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2).

A function f is said to be strictly convex if equality holds only when λ = 0 or λ = 1.
Theorem: If the function f has a second derivative that is non-negative (positive) everywhere, then the function is convex (strictly convex).
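
A quick numerical check of the defining inequality, here for f(x) = x log2 x on (0, ∞), whose second derivative 1/(x ln 2) is positive there; the sample range and random seed are arbitrary choices of mine:

import numpy as np

# f(x) = x * log2(x) has second derivative 1 / (x ln 2) > 0 on (0, inf),
# so by the theorem it is strictly convex there; check the defining inequality.
f = lambda x: x * np.log2(x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x1, x2 = rng.uniform(0.01, 10.0, size=2)
    lam = rng.uniform(0.0, 1.0)
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    assert lhs <= rhs + 1e-12
print("f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2) held at all sampled points")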

Jensen’s Inequality
If f is a convex function and X is a random variable, then

E[f(X)] ≥ f(E[X]).

Moreover, if f is strictly convex, then equality implies that X = E[X] with probability 1, i.e. X is a constant.
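
A small sketch comparing E[f(X)] with f(E[X]) for the convex function f(x) = x², on an arbitrarily chosen discrete distribution (values and probabilities are illustrative):

import numpy as np

# Compare E[f(X)] with f(E[X]) for the convex f(x) = x**2 on an arbitrary
# discrete distribution: values x with probabilities p.
x = np.array([1.0, 2.0, 5.0])
p = np.array([0.5, 0.3, 0.2])
f = lambda t: t ** 2

E_fX = np.sum(p * f(x))     # E[f(X)] = 6.7
f_EX = f(np.sum(p * x))     # f(E[X]) = 4.41
print(E_fX >= f_EX)         # True, as Jensen's inequality requires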

Information Inequality
Theorem: Let p(x), q(x), x ∈ χ, be two probability mass functions. Then

D(p || q) ≥ 0,

with equality if and only if p(x) = q(x) for all x.
Corollary (non-negativity of mutual information): For any two random variables X, Y,

I(X; Y) ≥ 0,

with equality if and only if X and Y are independent.
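
A minimal sketch of the relative entropy D(p || q) in bits, on two arbitrarily chosen pmfs; it is strictly positive when p ≠ q and zero in the equality case p = q (the function name and example pmfs are mine):

import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))   # strictly positive, since p differs from q
print(kl_divergence(p, p))   # 0.0, the equality case p(x) = q(x) for all x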

Bounded Entropy
We show that the uniform distribution over the range χ is the maximum entropy distribution over this range. It follows that any random variable with this range has an entropy no greater than log|χ|.
Theorem: H(X) ≤ log|χ|, where |χ| denotes the number of elements in the range of X, with equality if and only if X has a uniform distribution over χ.
Proof: Let u(x) = 1/|χ| be the uniform probability mass function over χ and let p(x) be the probability mass function for X. Then

D(p || u) = Σ_x p(x) log( p(x) / u(x) ) = log|χ| − H(X).

Hence, by the non-negativity of the relative entropy, 0 ≤ D(p || u) = log|χ| − H(X), i.e. H(X) ≤ log|χ|.
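
A numerical sanity check of the bound, sampling random pmfs on an 8-symbol alphabet (the alphabet size, sampler, and seed are arbitrary choices): every sampled entropy stays below log2 8 = 3 bits, and the uniform pmf attains the bound.

import numpy as np

def H(probs):
    probs = np.asarray(probs, float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

size = 8                       # |chi| = 8, so the bound is log2(8) = 3 bits
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(size))          # a random pmf on 8 symbols
    assert H(p) <= np.log2(size) + 1e-9
print(H(np.full(size, 1 / size)))             # 3.0 bits: uniform attains the bound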

Conditioning Reduces Entropy
Theorem: H(X | Y) ≤ H(X), with equality if and only if X and Y are independent.
Proof: 0 ≤ I(X; Y) = H(X) − H(X | Y).
Intuitively, the theorem says that knowing another random variable Y can only reduce the uncertainty in X. Note that this is true only on the average. Specifically, H(X | Y = y) may be greater than, less than, or equal to H(X), but on the average H(X | Y) = Σ_y p(y) H(X | Y = y) ≤ H(X).

Example
Let (X, Y) have the following joint distribution:

p(x, y)    X = 1    X = 2
Y = 1        0       3/4
Y = 2       1/8      1/8

Then H(X) = H(1/8, 7/8) = 0.544 bits, H(X | Y = 1) = 0 bits and H(X | Y = 2) = 1 bit. We calculate H(X | Y) = 3/4 · H(X | Y = 1) + 1/4 · H(X | Y = 2) = 0.25 bits. Thus the uncertainty in X is increased if Y = 2 is observed and decreased if Y = 1 is observed, but uncertainty decreases on the average.
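
The numbers in this example can be reproduced with a few lines of Python (the helper H() and the array layout are my own conventions, not from the slides):

import numpy as np

def H(probs):
    probs = np.asarray(probs, float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Joint distribution from the table: rows are Y = 1, 2; columns are X = 1, 2.
p_xy = np.array([[0.0, 3/4],
                 [1/8, 1/8]])

p_x = p_xy.sum(axis=0)    # P(X) = (1/8, 7/8)
p_y = p_xy.sum(axis=1)    # P(Y) = (3/4, 1/4)

print(H(p_x))                                   # H(X)       ~ 0.544 bits
print(H(p_xy[0] / p_y[0]))                      # H(X|Y=1)   = 0 bits
print(H(p_xy[1] / p_y[1]))                      # H(X|Y=2)   = 1 bit
print(sum(p_y[y] * H(p_xy[y] / p_y[y]) for y in range(2)))   # H(X|Y) = 0.25 bits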

Independence Bound on Entropy
Let X1, X2, …, Xn be random variables with joint probability mass function p(x1, x2, …, xn). Then

H(X1, X2, …, Xn) ≤ Σ_{i=1}^{n} H(Xi),

with equality if and only if the Xi are independent.
Proof: By the chain rule for entropies,

H(X1, X2, …, Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, …, X1) ≤ Σ_{i=1}^{n} H(Xi),

where the inequality follows directly from the previous theorem (conditioning reduces entropy). We have equality if and only if Xi is independent of (Xi−1, …, X1) for all i, i.e. if and only if the Xi’s are independent.
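
A small numeric illustration of the bound for n = 2, with one correlated and one independent pair of arbitrarily chosen distributions (the pmfs and helper H() are mine):

import numpy as np

def H(probs):
    probs = np.asarray(probs, float).ravel()
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Correlated pair: the joint entropy is strictly below the sum of the marginals.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
print(H(p), H(p.sum(axis=1)) + H(p.sum(axis=0)))   # ~1.722 bits  <  2 bits

# Independent pair: equality holds.
q = np.outer([0.5, 0.5], [0.5, 0.5])
print(H(q), H(q.sum(axis=1)) + H(q.sum(axis=0)))   # 2 bits  ==  2 bits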

Fano’s Inequality
Suppose that we know a random variable Y and we wish to guess the value of a correlated random variable X. Fano’s inequality relates the probability of error in guessing the random variable X to its conditional entropy H(X | Y). It will be crucial in proving the converse to Shannon’s channel capacity theorem.
We know that the conditional entropy of a random variable X given another random variable Y is zero if and only if X is a function of Y. Hence we can estimate X from Y with zero probability of error if and only if H(X | Y) = 0. Extending this argument, we expect to be able to estimate X with a low probability of error only if the conditional entropy H(X | Y) is small. Fano’s inequality quantifies this idea.
Suppose that we wish to estimate a random variable X with a distribution p(x). We observe a random variable Y that is related to X by the conditional distribution p(y | x).

Fano’s Inequality
From Y we calculate a function g(Y) = X^, where X^ is an estimate of X and takes on values in the alphabet X^. We will not restrict the alphabet X^ to be equal to X, and we will also allow the function g(Y) to be random. We wish to bound the probability that X^ ≠ X. We observe that X → Y → X^ forms a Markov chain. Define the probability of error Pe = Pr{X^ ≠ X}.
Theorem (Fano’s inequality):

H(Pe) + Pe log|χ| ≥ H(X | X^) ≥ H(X | Y).

The inequality can be weakened to

1 + Pe log|χ| ≥ H(X | Y), i.e. Pe ≥ (H(X | Y) − 1) / log|χ|.

Remark: Note that Pe = 0 implies that H(X | Y) = 0, as intuition suggests.
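
A numeric sketch of Fano’s inequality on an arbitrarily chosen 3×3 joint pmf, using the MAP estimator g(y) = argmax_x p(x | y) as the guess; the pmf values and helper H() are illustrative assumptions, not taken from the slides:

import numpy as np

def H(probs):
    probs = np.asarray(probs, float).ravel()
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Arbitrary joint pmf p(x, y) on a 3-symbol alphabet; rows are x, columns are y.
p_xy = np.array([[0.30, 0.05, 0.05],
                 [0.05, 0.25, 0.05],
                 [0.05, 0.05, 0.15]])
p_y = p_xy.sum(axis=0)

# H(X|Y) = sum over y of p(y) * H(X | Y = y)
H_x_given_y = sum(p_y[j] * H(p_xy[:, j] / p_y[j]) for j in range(3))

# MAP guess g(y) = argmax_x p(x|y); probability of error Pe = Pr{g(Y) != X}.
g = p_xy.argmax(axis=0)
Pe = 1.0 - sum(p_xy[g[j], j] for j in range(3))

# Fano: H(Pe) + Pe * log2(|chi|) >= H(X|Y)
lhs = H([Pe, 1.0 - Pe]) + Pe * np.log2(3)
print(lhs, H_x_given_y, lhs >= H_x_given_y)   # ~1.357 >= ~1.169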