Statistical NLP: Lecture 5 Mathematical Foundations II: Information Theory

1 Statistical NLP: Lecture 5 Mathematical Foundations II: Information Theory

2 Entropy
- The entropy is the average uncertainty of a single random variable.
  - Let p(x) = P(X = x), where x ∈ X.
  - H(p) = H(X) = -Σ_{x ∈ X} p(x) log2 p(x)
- In other words, entropy measures the amount of information in a random variable. It is normally measured in bits.
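A minimal Python sketch of the entropy formula (the coin distributions are my own toy examples, not from the lecture):

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# A fair coin carries 1 bit of uncertainty; a biased coin carries less.
print(entropy({"H": 0.5, "T": 0.5}))   # 1.0
print(entropy({"H": 0.9, "T": 0.1}))   # ~0.47
```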

3 Joint Entropy and Conditional Entropy
- The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values.
  - H(X,Y) = -Σ_{x ∈ X} Σ_{y ∈ Y} p(x,y) log2 p(x,y)
- The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information you still need to supply on average to communicate Y given that the other party knows X.
  - H(Y|X) = -Σ_{x ∈ X} Σ_{y ∈ Y} p(x,y) log2 p(y|x)
- Chain Rule for Entropy: H(X,Y) = H(X) + H(Y|X)
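A short sketch of joint entropy, conditional entropy, and the chain rule on a made-up joint distribution p(x,y) (illustrative values, not from the slides):

```python
import math
from collections import defaultdict

def joint_entropy(pxy):
    """H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y)."""
    return -sum(p * math.log2(p) for p in pxy.values() if p > 0)

def conditional_entropy(pxy):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)."""
    px = defaultdict(float)
    for (x, _), p in pxy.items():
        px[x] += p
    return -sum(p * math.log2(p / px[x]) for (x, _), p in pxy.items() if p > 0)

# Toy joint distribution over (X, Y); the X-marginal is (0.5, 0.5), so H(X) = 1 bit.
pxy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.40, ("b", 1): 0.10}
H_X = 1.0
# Chain rule: H(X,Y) = H(X) + H(Y|X)
print(joint_entropy(pxy))              # ~1.86
print(H_X + conditional_entropy(pxy))  # ~1.86 as well
```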

4 Mutual Information
- By the chain rule for entropy, we have H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
- Therefore, H(X) - H(X|Y) = H(Y) - H(Y|X).
- This difference is called the mutual information between X and Y.
- It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.
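A sketch computing mutual information directly from a toy joint distribution; the sum used here, Σ p(x,y) log2[p(x,y) / (p(x)p(y))], equals the difference H(X) - H(X|Y) described on the slide (the numbers are made up):

```python
import math

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) * p(y)) )."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

pxy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.40, ("b", 1): 0.10}
print(mutual_information(pxy))  # ~0.07 bits: knowing Y removes a little uncertainty about X
```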

5 The Noisy Channel Model
- Assuming that you want to communicate messages over a channel of restricted capacity, optimize the communication (in terms of throughput and accuracy) in the presence of noise in the channel.
- A channel's capacity can be reached by designing an input code that maximizes the mutual information between the input and output over all possible input distributions.
- This model can be applied to NLP.
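To illustrate "capacity = maximum mutual information over all input distributions", here is a brute-force sketch for a binary symmetric channel with crossover probability 0.1; the channel is my own example, not one from the lecture (the known closed form is C = 1 - H(0.1) ≈ 0.53 bits):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bsc_mutual_information(p_one, crossover):
    """I(X;Y) = H(Y) - H(Y|X) for a binary symmetric channel with P(X=1) = p_one."""
    p_y_one = p_one * (1 - crossover) + (1 - p_one) * crossover
    return h2(p_y_one) - h2(crossover)

crossover = 0.1
# Search over input distributions; the maximum is attained at the uniform input P(X=1) = 0.5.
capacity = max(bsc_mutual_information(k / 1000, crossover) for k in range(1001))
print(capacity, 1 - h2(crossover))  # both ~0.531 bits per channel use
```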

6 Relative Entropy or Kullback-Leibler Divergence
- For two pmfs p(x) and q(x), their relative entropy is:
  - D(p||q) = Σ_{x ∈ X} p(x) log2 (p(x)/q(x))
- The relative entropy (also known as the Kullback-Leibler divergence) is a measure of how different two probability distributions (over the same event space) are.
- The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.
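A minimal sketch of relative entropy on two made-up pmfs over the same three events (assuming q(x) > 0 wherever p(x) > 0):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2( p(x)/q(x) ); requires q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"the": 0.5, "cat": 0.3, "sat": 0.2}   # "true" distribution
q = {"the": 0.4, "cat": 0.4, "sat": 0.2}   # not-quite-right model of p
print(kl_divergence(p, q))  # ~0.037 bits wasted per event when coding p with a code built for q
print(kl_divergence(p, p))  # 0.0: no waste when the model is exact
```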

7 The Relation to Language: Cross-Entropy
- Entropy can be thought of as a matter of how surprised we will be to see the next word, given the words we have already seen.
- The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by: H(X,q) = H(X) + D(p||q).
- Cross-entropy can help us find out what our average surprise for the next word is.
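A small numerical check of H(X,q) = H(X) + D(p||q), reusing the same kind of toy distributions (my own numbers, not the lecture's):

```python
import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): average surprise under model q when data follow p."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

p = {"the": 0.5, "cat": 0.3, "sat": 0.2}   # "true" distribution
q = {"the": 0.4, "cat": 0.4, "sat": 0.2}   # model of p
print(cross_entropy(p, q))               # ~1.52 bits
print(entropy(p) + kl_divergence(p, q))  # the same value
```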

8 The Entropy of English
- We can model English using n-gram models (also known as Markov chains).
- These models assume limited memory, i.e., we assume that the next word depends only on the previous k words [kth-order Markov approximation].
- What is the entropy of English?
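As a rough illustration (a toy setup of my own, not the lecture's experiment), the per-character cross-entropy of held-out text under a simple character model gives an upper bound on the entropy rate of the language; richer kth-order models drive the bound down:

```python
import math
from collections import Counter

def unigram_cross_entropy(train_text, test_text):
    """Bits per character of test_text under a unigram character model estimated
    from train_text, with add-one smoothing so unseen characters keep nonzero mass."""
    counts = Counter(train_text)
    vocab = set(train_text) | set(test_text)
    total = len(train_text) + len(vocab)
    log_prob = sum(math.log2((counts[c] + 1) / total) for c in test_text)
    return -log_prob / len(test_text)

train = "the cat sat on the mat and the dog sat on the log"
test = "the dog sat on the mat"
print(unigram_cross_entropy(train, test))  # a few bits per character for this tiny example
```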

9 Perplexity
- A measure related to the notion of cross-entropy and used in the speech recognition community is called the perplexity.
- Perplexity(x_1n, m) = 2^H(x_1n, m) = m(x_1n)^(-1/n)
- A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step.
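A minimal sketch of the perplexity formula for a word sequence scored by a model m (the uniform 8-word model here is hypothetical):

```python
import math

def perplexity(model_probs, words):
    """Perplexity = 2^H = m(x_1..x_n)^(-1/n), where H is the per-word cross-entropy."""
    log_prob = sum(math.log2(model_probs[w]) for w in words)
    per_word_cross_entropy = -log_prob / len(words)
    return 2 ** per_word_cross_entropy

# Hypothetical model assigning each of 8 words probability 1/8.
model = {w: 1 / 8 for w in "the cat sat on a mat and dog".split()}
print(perplexity(model, "the cat sat on the mat".split()))  # 8.0: like guessing among 8 equiprobable words
```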