A Brief Maximum Entropy Tutorial
Presenter: Davidson
Date: 2009/02/04
Original Author: Adam Berger, 1996/07/05

Outline
- Overview
  - Motivating example
- Maxent modeling
  - Training data
  - Features and constraints
  - The maxent principle
  - Exponential form
  - Maximum likelihood
- Skipped sections and further reading

Overview
- Statistical modeling
  - Models the behavior of a random process
  - Utilizes samples of output data to construct a representation of the process
  - Predicts the future behavior of the process
- Maximum entropy models
  - A family of distributions within the class of exponential models for statistical modeling

Motivating example (1/4)
- An English-to-French translator translates the English word in into one of 5 French phrases: dans, en, à, au cours de, pendant
- Goal
  1. Extract a set of facts about the decision-making process
  2. Construct a model of this process

Motivating example (2/4)
- First clue: the translator always chooses among those 5 French phrases, so any model p must satisfy
  p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
- The most intuitively appealing model is the most uniform model subject to our knowledge

Motivating example (3/4)
- A reasonable choice for p would be the most uniform one:
  p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
- If a second clue is discovered, say the translator chose either dans or en 30% of the time, then p must satisfy 2 constraints:
  p(dans) + p(en) = 3/10
  p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Motivating example (4/4)
- What if a third constraint is discovered, e.g. that the translator chose either dans or à half the time:
  p(dans) + p(à) = 1/2
- The choice for the model is no longer as obvious
- Two problems arise as complexity is added:
  - What "uniform" means, and how to measure the uniformity of a model
  - How to find the most uniform model subject to a set of constraints
- One solution: the Maximum Entropy (maxent) model; a numerical sketch of the toy problem follows below
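
To make the example concrete, here is a minimal sketch, not part of the original tutorial, that finds the "most uniform" distribution numerically with scipy. It uses only the first two clues as equality constraints; the phrase list and the 30% figure come from the toy example above.

```python
import numpy as np
from scipy.optimize import minimize

phrases = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)            # avoid log(0)
    return float(np.sum(p * np.log(p)))   # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # clue 1: probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},   # clue 2: p(dans) + p(en) = 3/10
]

res = minimize(neg_entropy, x0=np.full(5, 0.2),
               bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(dict(zip(phrases, res.x.round(4))))
# roughly: dans = en = 0.15, and the remaining 7/10 split evenly over the other three phrases
```

With only those two constraints, the maximum entropy solution spreads 3/10 evenly between dans and en and the remaining 7/10 evenly among the other three phrases, which matches the intuition of "as uniform as possible".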

Maxent modeling
- Consider a random process which produces an output value y, a member of a finite set Y
- In generating y, the process may be influenced by some contextual information x, a member of a finite set X
- The task is to construct a stochastic model that accurately represents the behavior of the random process
  - This model estimates the conditional probability that, given a context x, the process will output y
- We denote by P the set of all conditional probability distributions; a model is an element of P

Training data
- Training sample: (x1, y1), (x2, y2), ..., (xN, yN)
- The training sample's empirical probability distribution:
  p̃(x, y) = (1/N) × number of times that (x, y) occurs in the sample
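
As a small illustration, the empirical distribution is just normalized co-occurrence counts. The sample below is made up for the sketch; it is not data from the tutorial.

```python
from collections import Counter

# hypothetical training sample of (context x, translation y) pairs
sample = [("April", "en"), ("April", "en"), ("April", "en"),
          ("the", "dans"), ("the", "pendant")]

N = len(sample)
p_tilde = {xy: count / N for xy, count in Counter(sample).items()}  # empirical p~(x, y)
print(p_tilde)  # {('April', 'en'): 0.6, ('the', 'dans'): 0.2, ('the', 'pendant'): 0.2}
```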

Features and constraints (1/4)
- Use a set of statistics of the training sample to construct a statistical model of the process
- Some statistics are independent of the context, such as the frequencies used in the motivating example (e.g. the translator chose dans or en 3/10 of the time)
- Other statistics depend on the conditioning information x, e.g. in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10

Features and constraints (2/4)
- To express the event that in translates as en when April is the following word, we can introduce the indicator function
  f(x, y) = 1 if y = en and April follows in, and 0 otherwise
- The expected value of f with respect to the empirical distribution is exactly the statistic we are interested in. This expected value is given by
  p̃(f) = Σ_{x,y} p̃(x, y) f(x, y)
- We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function, or feature for short
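
The feature and its empirical expectation translate directly into code. This sketch reuses the hypothetical empirical distribution from the previous snippet (again made up, not from the tutorial).

```python
# hypothetical empirical distribution p~(x, y) over (following word, translation) pairs
p_tilde = {("April", "en"): 0.6, ("the", "dans"): 0.2, ("the", "pendant"): 0.2}

def f(x, y):
    """Indicator feature: in translates as en and April is the following word."""
    return 1.0 if (x == "April" and y == "en") else 0.0

# empirical expectation  p~(f) = sum over (x, y) of p~(x, y) * f(x, y)
f_empirical = sum(p * f(x, y) for (x, y), p in p_tilde.items())
print(f_empirical)  # 0.6
```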

Features and constraints (3/4)
- The expected value of f with respect to the model p(y|x) is
  p(f) = Σ_{x,y} p̃(x) p(y|x) f(x, y)
  where p̃(x) is the empirical distribution of x in the training sample
- We constrain this expected value to be the same as the expected value of f in the training sample:
  p(f) = p̃(f)
  This requirement is called a constraint equation, or simply a constraint
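
For contrast, here is the model expectation computed for a deliberately uniform candidate model, using the same made-up contexts and feature as before; the numbers are illustrative only.

```python
from itertools import product

contexts = ["April", "the"]
outputs = ["dans", "en", "à", "au cours de", "pendant"]
p_tilde_x = {"April": 0.6, "the": 0.4}   # hypothetical empirical p~(x)

def p_model(y, x):
    return 1.0 / len(outputs)            # uniform candidate model p(y|x)

def f(x, y):
    return 1.0 if (x == "April" and y == "en") else 0.0

# model expectation  p(f) = sum over x of p~(x) * sum over y of p(y|x) * f(x, y)
f_model = sum(p_tilde_x[x] * p_model(y, x) * f(x, y) for x, y in product(contexts, outputs))
print(f_model)  # 0.12, far below the empirical 0.6, so the uniform model violates this constraint
```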

Features and constraints (4/4)
- Combining the above 3 equations yields
  Σ_{x,y} p̃(x) p(y|x) f(x, y) = Σ_{x,y} p̃(x, y) f(x, y)
- By restricting attention to those models for which the constraint holds, we eliminate from consideration those models which do not agree with the training sample on how often the output of the process should exhibit the feature f
- What we have so far:
  - A means of representing statistical phenomena inherent in a sample of data, namely p̃(f)
  - A means of requiring that our model of the process exhibit these phenomena, namely the constraint p(f) = p̃(f)

The maxent principle (1/2)
- Suppose n feature functions f_i are given
- We would like our model to accord with these statistics, i.e. we would like p to lie in the subset C of P defined by
  C = { p ∈ P | p(f_i) = p̃(f_i) for i = 1, 2, ..., n }
- Among the models in C, we would like to select the distribution which is most uniform. But what does "uniform" mean?
- A mathematical measure of the uniformity of a conditional distribution p(y|x) is provided by the conditional entropy
  H(p) = -Σ_{x,y} p̃(x) p(y|x) log p(y|x)

The maxent principle (2/2)
- The entropy is bounded from below by zero, the entropy of a model with no uncertainty at all
- The entropy is bounded from above by log|Y|, the entropy of the uniform distribution over all |Y| possible values of y
- The principle of maximum entropy: to select a model from a set C of allowed probability distributions, choose the model p* ∈ C with maximum entropy H(p):
  p* = argmax_{p ∈ C} H(p)
- p* is always well-defined; that is, there is always a unique model with maximum entropy in any constrained set C
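
The conditional entropy used as the uniformity measure is straightforward to compute. This sketch reuses the made-up empirical p~(x) and the uniform candidate model from the earlier snippets.

```python
import math

def conditional_entropy(p_tilde_x, p_model, outputs):
    """H(p) = -sum over x of p~(x) * sum over y of p(y|x) * log p(y|x)."""
    h = 0.0
    for x, px in p_tilde_x.items():
        for y in outputs:
            pyx = p_model(y, x)
            if pyx > 0.0:
                h -= px * pyx * math.log(pyx)
    return h

outputs = ["dans", "en", "à", "au cours de", "pendant"]
p_tilde_x = {"April": 0.6, "the": 0.4}
uniform = lambda y, x: 1.0 / len(outputs)
print(conditional_entropy(p_tilde_x, uniform, outputs))  # log 5 ≈ 1.609, the upper bound
```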

Exponential form (1/3)
- The method of Lagrange multipliers is applied to impose the constraints on the optimization
- The constrained optimization problem is to find
  p* = argmax_{p ∈ C} H(p)
- That is, maximize H(p) subject to the following constraints:
  1. p(y|x) ≥ 0 for all x, y, and Σ_y p(y|x) = 1 for all x; these guarantee that p is a conditional probability distribution
  2. Σ_{x,y} p̃(x) p(y|x) f_i(x, y) = Σ_{x,y} p̃(x, y) f_i(x, y) for i = 1, ..., n; in other words, p ∈ C

Exponential form (2/3)
- When the Lagrange multipliers are introduced, the objective function becomes
  Λ(p, λ, γ) = H(p) + Σ_i λ_i ( p(f_i) - p̃(f_i) ) + γ ( Σ_y p(y|x) - 1 )
- The real-valued parameters γ and λ = {λ_1, ..., λ_n} correspond to the 1 + n constraints imposed on the solution
- The resulting unconstrained problem is solved numerically for the λ_i, e.g. by an iterative scaling algorithm (see "Computing the parameters" under the skipped sections)

Exponential form (3/3)
- The final result: the maximum entropy model subject to the constraints C has the parametric form
  p_λ(y|x) = (1 / Z_λ(x)) exp( Σ_i λ_i f_i(x, y) )
  where Z_λ(x) = Σ_y exp( Σ_i λ_i f_i(x, y) ) is a normalizing constant
- The optimal parameters λ* can be determined by maximizing the dual function Ψ(λ); a small sketch of the parametric form follows below
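
Here is the parametric form written as a plain function. The feature functions, weights, and example values are whatever the caller supplies; the ones shown are made up for illustration.

```python
import math

def p_lambda(y, x, features, lambdas, outputs):
    """Maxent model: p_lambda(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z_lambda(x)."""
    def unnormalized(candidate):
        return math.exp(sum(l * f(x, candidate) for l, f in zip(lambdas, features)))
    z = sum(unnormalized(candidate) for candidate in outputs)  # Z_lambda(x)
    return unnormalized(y) / z

outputs = ["dans", "en", "à", "au cours de", "pendant"]
features = [lambda x, y: 1.0 if (x == "April" and y == "en") else 0.0]
print(p_lambda("en", "April", features, lambdas=[1.5], outputs=outputs))  # ~0.53
```

With λ_1 = 0 this reduces to the uniform distribution over the 5 phrases; larger weights push probability toward outputs whose features fire.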

Maximum likelihood
- The log-likelihood of the empirical distribution p̃ as predicted by a model p is defined by
  L_p̃(p) = log Π_{x,y} p(y|x)^{p̃(x,y)} = Σ_{x,y} p̃(x, y) log p(y|x)
- The dual function Ψ(λ) of the previous section is just the log-likelihood of the exponential model p_λ; that is
  Ψ(λ) = L_p̃(p_λ)
- The result from the previous section can be rephrased as: the model with maximum entropy is the model in the parametric family p_λ that maximizes the likelihood of the training sample
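
Over a raw training sample the same quantity is just an average log-probability. In this sketch p_model stands for any conditional model, for example the p_lambda function above; the sample is again hypothetical.

```python
import math

def log_likelihood(sample, p_model):
    """(1/N) * sum over (x, y) in sample of log p(y|x), i.e. sum over (x, y) of p~(x, y) log p(y|x)."""
    return sum(math.log(p_model(y, x)) for x, y in sample) / len(sample)

sample = [("April", "en"), ("April", "en"), ("the", "dans")]
uniform = lambda y, x: 0.2   # uniform over the 5 French phrases
print(log_likelihood(sample, uniform))  # log(0.2) ≈ -1.609
```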

Skipped sections
- Computing the parameters: orial/node10.html#SECTION
- Algorithms for inductive learning: orial/node11.html#SECTION
- Further readings: orial/node14.html#SECTION