Statistical NLP: Lecture 5 Mathematical Foundations II: Information Theory

1 Statistical NLP: Lecture 5 Mathematical Foundations II: Information Theory

2 Entropy
- The entropy is the average uncertainty of a single random variable.
  - Let p(x) = P(X = x), where x ∈ X.
  - H(p) = H(X) = -Σ_{x ∈ X} p(x) log2 p(x)
- In other words, entropy measures the amount of information in a random variable. It is normally measured in bits.
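A minimal Python sketch of the entropy formula (the coin distributions are my own toy examples, not from the lecture):

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# A fair coin carries 1 bit of uncertainty; a biased coin carries less.
print(entropy({"H": 0.5, "T": 0.5}))   # 1.0
print(entropy({"H": 0.9, "T": 0.1}))   # ~0.47
```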

3 Joint Entropy and Conditional Entropy
- The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values.
  - H(X,Y) = -Σ_{x ∈ X} Σ_{y ∈ Y} p(x,y) log2 p(x,y)
- The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information you still need to supply on average to communicate Y given that the other party knows X.
  - H(Y|X) = -Σ_{x ∈ X} Σ_{y ∈ Y} p(x,y) log2 p(y|x)
- Chain Rule for Entropy: H(X,Y) = H(X) + H(Y|X)
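A short sketch of joint entropy, conditional entropy, and the chain rule on a made-up joint distribution p(x,y) (illustrative values, not from the slides):

```python
import math
from collections import defaultdict

def joint_entropy(pxy):
    """H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y)."""
    return -sum(p * math.log2(p) for p in pxy.values() if p > 0)

def conditional_entropy(pxy):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)."""
    px = defaultdict(float)
    for (x, _), p in pxy.items():
        px[x] += p
    return -sum(p * math.log2(p / px[x]) for (x, _), p in pxy.items() if p > 0)

# Toy joint distribution over (X, Y); the X-marginal is (0.5, 0.5), so H(X) = 1 bit.
pxy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.40, ("b", 1): 0.10}
H_X = 1.0
# Chain rule: H(X,Y) = H(X) + H(Y|X)
print(joint_entropy(pxy))              # ~1.86
print(H_X + conditional_entropy(pxy))  # ~1.86 as well
```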

4 Mutual Information
- By the chain rule for entropy, we have H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
- Therefore, H(X) - H(X|Y) = H(Y) - H(Y|X).
- This difference is called the mutual information between X and Y.
- It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.
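A sketch computing mutual information directly from a toy joint distribution; the sum used here, Σ p(x,y) log2[p(x,y) / (p(x)p(y))], equals the difference H(X) - H(X|Y) described on the slide (the numbers are made up):

```python
import math

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) * p(y)) )."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

pxy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.40, ("b", 1): 0.10}
print(mutual_information(pxy))  # ~0.07 bits: knowing Y removes a little uncertainty about X
```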

5 The Noisy Channel Model
- Assuming that you want to communicate messages over a channel of restricted capacity, optimize the communication (in terms of throughput and accuracy) in the presence of noise in the channel.
- A channel's capacity can be reached by designing an input code that maximizes the mutual information between the input and output over all possible input distributions.
- This model can be applied to NLP.
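To illustrate "capacity = maximum mutual information over all input distributions", here is a brute-force sketch for a binary symmetric channel with crossover probability 0.1; the channel is my own example, not one from the lecture (the known closed form is C = 1 - H(0.1) ≈ 0.53 bits):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bsc_mutual_information(p_one, crossover):
    """I(X;Y) = H(Y) - H(Y|X) for a binary symmetric channel with P(X=1) = p_one."""
    p_y_one = p_one * (1 - crossover) + (1 - p_one) * crossover
    return h2(p_y_one) - h2(crossover)

crossover = 0.1
# Search over input distributions; the maximum is attained at the uniform input P(X=1) = 0.5.
capacity = max(bsc_mutual_information(k / 1000, crossover) for k in range(1001))
print(capacity, 1 - h2(crossover))  # both ~0.531 bits per channel use
```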

6 Relative Entropy or Kullback-Leibler Divergence
- For two pmfs p(x) and q(x), their relative entropy is:
  - D(p||q) = Σ_{x ∈ X} p(x) log2 (p(x)/q(x))
- The relative entropy (also known as the Kullback-Leibler divergence) is a measure of how different two probability distributions (over the same event space) are.
- The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.
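A minimal sketch of relative entropy on two made-up pmfs over the same three events (assuming q(x) > 0 wherever p(x) > 0):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2( p(x)/q(x) ); requires q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"the": 0.5, "cat": 0.3, "sat": 0.2}   # "true" distribution
q = {"the": 0.4, "cat": 0.4, "sat": 0.2}   # not-quite-right model of p
print(kl_divergence(p, q))  # ~0.037 bits wasted per event when coding p with a code built for q
print(kl_divergence(p, p))  # 0.0: no waste when the model is exact
```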

7 The Relation to Language: Cross-Entropy
- Entropy can be thought of as a matter of how surprised we will be to see the next word, given the words we have already seen.
- The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by: H(X,q) = H(X) + D(p||q).
- Cross-entropy can help us find out what our average surprise for the next word is.
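A small numerical check of H(X,q) = H(X) + D(p||q), reusing the same kind of toy distributions (my own numbers, not the lecture's):

```python
import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): average surprise under model q when data follow p."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p = {"the": 0.5, "cat": 0.3, "sat": 0.2}   # "true" distribution
q = {"the": 0.4, "cat": 0.4, "sat": 0.2}   # model of p
print(cross_entropy(p, q))               # ~1.52 bits
print(entropy(p) + kl_divergence(p, q))  # the same value
```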

8 The Entropy of English
- We can model English using n-gram models (also known as Markov chains).
- These models assume limited memory, i.e., we assume that the next word depends only on the previous k words [kth-order Markov approximation].
- What is the entropy of English?
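As a rough illustration (a toy setup of my own, not the lecture's experiment), the per-character cross-entropy of held-out text under a simple character model gives an upper bound on the entropy rate of the language; richer kth-order models drive the bound down:

```python
import math
from collections import Counter

def unigram_cross_entropy(train_text, test_text):
    """Bits per character of test_text under a unigram character model estimated
    from train_text, with add-one smoothing so unseen characters keep nonzero mass."""
    counts = Counter(train_text)
    vocab = set(train_text) | set(test_text)
    total = len(train_text) + len(vocab)
    log_prob = sum(math.log2((counts[c] + 1) / total) for c in test_text)
    return -log_prob / len(test_text)

train = "the cat sat on the mat and the dog sat on the log"
test = "the dog sat on the mat"
print(unigram_cross_entropy(train, test))  # a few bits per character for this tiny example
```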

9 Perplexity
- A measure related to the notion of cross-entropy and used in the speech recognition community is called the perplexity.
- Perplexity(x_1n, m) = 2^H(x_1n, m) = m(x_1n)^(-1/n)
- A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step.
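A minimal sketch of the perplexity formula for a word sequence scored by a model m (the uniform 8-word model here is hypothetical):

```python
import math

def perplexity(model_probs, words):
    """Perplexity = 2^H = m(x_1..x_n)^(-1/n), where H is the per-word cross-entropy."""
    log_prob = sum(math.log2(model_probs[w]) for w in words)
    per_word_cross_entropy = -log_prob / len(words)
    return 2 ** per_word_cross_entropy

# Hypothetical model assigning each of 8 words probability 1/8.
model = {w: 1 / 8 for w in "the cat sat on a mat and dog".split()}
print(perplexity(model, "the cat sat on the mat".split()))  # 8.0: like guessing among 8 equiprobable words
```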