A Brief Maximum Entropy Tutorial
Presenter: Davidson
Date: 2009/02/04
Original Author: Adam Berger, 1996/07/05

Overview Statistical modeling addresses the problem of modeling the behavior of a random process. In constructing this model, we typically have at our disposal a sample of output from the process. From the sample, which constitutes an incomplete state of knowledge about the process, the modeling problem is to parlay this knowledge into a succinct, accurate representation of the process. We can then use this representation to make predictions about the future behavior of the process.

Motivating example Suppose we wish to model an expert translator’s decisions concerning the proper French rendering of the English word in. A model p of the expert’s decisions assigns to each French word or phrase f an estimate, p(f), of the probability that the expert would choose f as a translation of in. To develop p, we first collect a large sample of instances of the expert’s decisions

Motivating example Our goal is to –Extract a set of facts about the decision-making process from the sample (the first task of modeling) –Construct a model of this process (the second task)

Motivating example One obvious clue we might glean from the sample is the list of allowed translations –in → {dans, en, à, au cours de, pendant} With this information in hand, we can impose our first constraint on our model p: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 This equation represents our first statistic of the process; we can now proceed to search for a suitable model which obeys this equation –There are infinitely many models p for which this identity holds

Motivating example One model which satisfies the above equation is p(dans) = 1; in other words, the model always predicts dans. Another model which obeys this constraint predicts pendant with a probability of ½, and à with a probability of ½. But both of these models offend our sensibilities: knowing only that the expert always chose from among these five French phrases, how can we justify either of these probability distributions?

Motivating example Knowing only that the expert chose exclusively from among these five French phrases, the most intuitively appealing model is p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5 This model, which allocates the total probability evenly among the five possible phrases, is the most uniform model subject to our knowledge. It is not, however, the most uniform overall; that model would grant an equal probability to every possible French phrase.

Motivating example We might hope to glean more clues about the expert’s decisions from our sample. Suppose we notice that the expert chose either dans or en 30% of the time. Once again there are many probability distributions consistent with these two constraints. In the absence of any other knowledge, a reasonable choice for p is again the most uniform – that is, the distribution which allocates its probability as evenly as possible, subject to the constraints: p(dans) + p(en) = 3/10 and p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

Motivating example Say we inspect the data once more, and this time notice another interesting fact: in half the cases, the expert chose either dans or à. We can incorporate this information into our model as a third constraint: p(dans) + p(à) = 1/2 We can once again look for the most uniform p satisfying these constraints, but now the choice is not as obvious.

Motivating example As we have added complexity, we have encountered two problems: –First, what exactly is meant by “uniform,” and how can one measure the uniformity of a model? –Second, having determined a suitable answer to these questions, how does one find the most uniform model subject to a set of constraints like those we have described?

Motivating example The maximum entropy method answers both these questions. Intuitively, the principle is simple: –model all that is known and assume nothing about that which is unknown –In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible. This is precisely the approach we took in selecting our model p at each step in the above example
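To make this principle concrete, here is a small numerical sketch (not part of the original tutorial; the variable names and the use of scipy are illustrative assumptions) that finds the most uniform model of the five candidate translations subject to the three example constraints by maximizing entropy directly:

import numpy as np
from scipy.optimize import minimize

# The five candidate translations of "in" from the motivating example.
phrases = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    # Negative Shannon entropy; the small epsilon avoids log(0).
    return np.sum(p * np.log(p + 1e-12))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},           # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3.0 / 10.0},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1.0 / 2.0},   # p(dans) + p(à) = 1/2
]

result = minimize(
    neg_entropy,
    x0=np.full(5, 0.2),              # start from the uniform distribution
    bounds=[(0.0, 1.0)] * 5,
    constraints=constraints,
    method="SLSQP",
)

for phrase, prob in zip(phrases, result.x):
    print(f"p({phrase}) = {prob:.4f}")

The slides that follow derive the general, analytic answer to this kind of optimization problem.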

Maxent Modeling Consider a random process which produces an output value y, a member of a finite set Y. –y may be any word in the set {dans, en, à, au cours de, pendant} In generating y, the process may be influenced by some contextual information x, a member of a finite set X. –x could include the words in the English sentence surrounding in Our task is to construct a stochastic model that accurately represents the behavior of the random process –That is, a model of the conditional probability that, given a context x, the process will output y.

Training data Collect a large number of samples (x1, y1), (x2, y2), …, (xN, yN) –Each sample consists of a phrase x containing the words surrounding in, together with the translation y of in which the process produced We can summarize the training sample in terms of its empirical distribution p̃, defined by p̃(x, y) ≡ (1/N) × (number of times (x, y) occurs in the sample) Typically, a particular pair (x, y) will either not occur at all in the sample, or will occur at most a few times –This sparsity is why smoothing is needed
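As a small illustrative sketch (the sample pairs below are invented for illustration, and Python is an assumed choice), the empirical distribution can be tabulated directly from the training sample:

from collections import Counter

# Invented toy sample of (context, translation) pairs for the word "in".
sample = [
    ("in April", "en"),
    ("in April", "en"),
    ("in the house", "dans"),
    ("in the course of", "au cours de"),
    ("in time", "à"),
]

N = len(sample)
counts = Counter(sample)

# Empirical distribution: p̃(x, y) = count(x, y) / N
p_tilde = {pair: count / N for pair, count in counts.items()}
for (x, y), prob in p_tilde.items():
    print(f"p~(x={x!r}, y={y!r}) = {prob:.2f}")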

Features and constraints The goal is to construct a statistical model of the process which generated the training sample The building blocks of this model will be a set of statistics of the training sample, for example: –The frequency with which in translated to either dans or en was 3/10 –The frequency with which in translated to either dans or au cours de was ½ –And so on

Features and constraints The statistics of interest may also depend on the conditioning information x –E.g., in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10 To express this, define the indicator function f(x, y) = 1 if y = en and April follows in, and f(x, y) = 0 otherwise The expected value of f with respect to the empirical distribution is p̃(f) ≡ Σx,y p̃(x, y) f(x, y) (1)
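A sketch of this feature and its empirical expectation (the function names are illustrative, and p_tilde is the toy dictionary built in the previous sketch):

# The feature from the slide: f(x, y) = 1 if the translation is "en"
# and "April" appears in the context, else 0.
def f_april_en(x, y):
    return 1.0 if y == "en" and "April" in x else 0.0

def empirical_expectation(p_tilde, f):
    # p̃(f) = sum over (x, y) of p̃(x, y) * f(x, y)
    return sum(prob * f(x, y) for (x, y), prob in p_tilde.items())

print(empirical_expectation(p_tilde, f_april_en))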

Features and constraints We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f –We call such a function a feature function, or feature for short

Features and constraints When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring that our model accord with it We do this by constraining the expected value that the model assigns to the corresponding feature function f The expected value of f with respect to the model p(y | x) is p(f) ≡ Σx,y p̃(x) p(y | x) f(x, y) (2) where p̃(x) is the empirical distribution of x in the training sample

Features and constraints We constrain this expected value to be the same as the expected value of f in the training sample. That is, we require p(f) = p̃(f) (3) –We call the requirement (3) a constraint equation or simply a constraint Combining (1), (2) and (3) yields Σx,y p̃(x) p(y | x) f(x, y) = Σx,y p̃(x, y) f(x, y)
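As a sketch of what the constraint means computationally (the nested-dictionary representation of the model is an illustrative assumption, and empirical_expectation is the helper sketched earlier):

def model_expectation(p_tilde_x, model, f, labels):
    # p(f) = sum_{x,y} p̃(x) * p(y|x) * f(x, y), where p_tilde_x maps each
    # context x to p̃(x), model[x][y] gives p(y|x), and labels is the set
    # of possible outputs y.
    return sum(
        p_x * model[x].get(y, 0.0) * f(x, y)
        for x, p_x in p_tilde_x.items()
        for y in labels
    )

# Constraint (3): for each chosen feature f, we require
#   model_expectation(p_tilde_x, model, f, labels) == empirical_expectation(p_tilde, f)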

Features and constraints To sum up so far, we now have –A means of representing statistical phenomena inherent in a sample of data (namely, p̃(f)) –A means of requiring that our model of the process exhibit these phenomena (namely, p(f) = p̃(f)) Feature: –A binary-valued function of (x, y) Constraint: –An equation between the expected value of the feature function in the model and its expected value in the training data

The maxent principle Suppose that we are given n feature functions fi, which determine statistics we feel are important in modeling the process. We would like our model to accord with these statistics That is, we would like p to lie in the subset C of P (the space of all conditional probability distributions) defined by C ≡ { p ∈ P | p(fi) = p̃(fi) for i ∈ {1, 2, …, n} }

Figure 1: (a) If we impose no constraints, then all probability models p ∈ P are allowable. (b) Imposing one linear constraint C1 restricts us to those p ∈ P which lie in the region defined by C1. (c) A second linear constraint could determine p exactly, if the two constraints are satisfiable, i.e., the intersection of C1 and C2 is non-empty: p ∈ C1 ∩ C2. (d) Alternatively, a second linear constraint could be inconsistent with the first (i.e., C1 ∩ C2 = ∅); no p ∈ P can satisfy them both.

The maxent principle In the present setting, however, the linear constraints are extracted from the training sample and cannot, by construction, be inconsistent Furthermore, the linear constraints in our applications will not even come close to determining p ∈ P uniquely as they do in (c); instead, the set C = C1 ∩ C2 ∩ … ∩ Cn of allowable models will be infinite

The maxent principle Among the models p ∈ C, the maximum entropy philosophy dictates that we select the distribution which is most uniform A mathematical measure of the uniformity of a conditional distribution p(y | x) is provided by the conditional entropy H(p) ≡ −Σx,y p̃(x) p(y | x) log p(y | x)
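A direct sketch of this quantity in code (reusing the hypothetical p_tilde_x and nested-dictionary model from the earlier sketches):

import math

def conditional_entropy(p_tilde_x, model, labels):
    # H(p) = - sum_{x,y} p̃(x) * p(y|x) * log p(y|x)
    h = 0.0
    for x, p_x in p_tilde_x.items():
        for y in labels:
            p_yx = model[x].get(y, 0.0)
            if p_yx > 0.0:
                h -= p_x * p_yx * math.log(p_yx)
    return h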

The maxent principle The principle of maximum entropy –To select a model from a set C of allowed probability distributions, choose the model p★ ∈ C with maximum entropy H(p): p★ = argmax_{p ∈ C} H(p)

Exponential form The maximum entropy principle presents us with a problem in constrained optimization: find the p★ ∈ C which maximizes H(p) Find p★ = argmax_{p ∈ C} H(p)

Exponential form We refer to this as the primal problem; it is a succinct way of saying that we seek to maximize H(p) subject to the following constraints: –1. p(y | x) ≥ 0 for all x, y –2. Σy p(y | x) = 1 for all x. This and the previous condition guarantee that p is a conditional probability distribution –3. Σx,y p̃(x) p(y | x) fi(x, y) = Σx,y p̃(x, y) fi(x, y) for i ∈ {1, 2, …, n}. In other words, p ∈ C, and so p satisfies the active constraints C

Exponential form To solve this optimization problem, introduce the Lagrangian ξ(p, Λ, γ) ≡ H(p) + Σi λi ( p(fi) − p̃(fi) ) + γ ( Σy p(y | x) − 1 )

Exponential form
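As a sketch of the usual derivation in standard LaTeX notation (a reconstruction, consistent with the Lagrangian above and with the remark on the next slide about the second factor): hold Λ and γ fixed and maximize ξ(p, Λ, γ) over p by setting its derivative to zero,

\[
\frac{\partial \xi}{\partial p(y \mid x)}
  = -\tilde{p}(x)\bigl(1 + \log p(y \mid x)\bigr)
    + \sum_i \lambda_i\, \tilde{p}(x) f_i(x, y) + \gamma = 0 ,
\]
which yields
\[
p^{\star}(y \mid x)
  = \exp\Bigl(\sum_i \lambda_i f_i(x, y)\Bigr)\,
    \exp\Bigl(\tfrac{\gamma}{\tilde{p}(x)} - 1\Bigr).
\]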

Exponential form We have thus found the parametric form of p★, and so we now take up the task of solving for the optimal values Λ★, γ★. Recognizing that the second factor in this equation is the factor corresponding to the second of the constraints listed above, we can rewrite (10) as p★(y | x) = (1/Z(x)) exp( Σi λi fi(x, y) ) (11) where Z(x), the normalizing factor, is given by Z(x) = Σy exp( Σi λi fi(x, y) ) (12)
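A sketch of this parametric family in code (the function name and argument layout are illustrative assumptions):

import math

def p_exponential(x, y, lambdas, features, labels):
    # Exponential-form model (11): p(y|x) = exp(sum_i λi fi(x,y)) / Z(x),
    # with Z(x) computed as in (12). `features` is the list of feature
    # functions fi and `labels` the set of possible outputs y.
    def unnormalized(y_prime):
        return math.exp(sum(l * f(x, y_prime) for l, f in zip(lambdas, features)))
    z_x = sum(unnormalized(y_prime) for y_prime in labels)   # Z(x)
    return unnormalized(y) / z_x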

Exponential form We have found γ★ but not yet Λ★. Towards this end we introduce some further notation. Define the dual function Ψ(Λ) as Ψ(Λ) ≡ ξ(p★, Λ, γ★) (13) and the dual optimization problem as Λ★ = argmax_Λ Ψ(Λ) (14) Since p★ and γ★ are fixed, the right-hand side of (14) has only the free variables Λ = {λ1, λ2, …, λn}.
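Substituting the exponential form (11) back into the Lagrangian gives an explicit expression for the dual; as a sketch in standard notation (not taken verbatim from the slides):

\[
\Psi(\Lambda)
  = \sum_i \lambda_i\, \tilde{p}(f_i)
    - \sum_x \tilde{p}(x) \log Z(x),
\qquad
Z(x) = \sum_y \exp\Bigl(\sum_i \lambda_i f_i(x, y)\Bigr).
\]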

Exponential form Final result –The maximum entropy model subject to the constraints C has the parametric form p★ of (11), where Λ★ can be determined by maximizing the dual function Ψ(Λ)

Maximum likelihood
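As a sketch in standard notation (not taken verbatim from the slides), the log-likelihood of the empirical distribution p̃ under a conditional model p is

\[
L_{\tilde{p}}(p)
  \equiv \log \prod_{x, y} p(y \mid x)^{\tilde{p}(x, y)}
  = \sum_{x, y} \tilde{p}(x, y) \log p(y \mid x),
\]

and for the exponential models of (11) this quantity coincides with the dual function, \(\Psi(\Lambda) = L_{\tilde{p}}(p_{\Lambda})\), which is the equivalence summarized on the following slides.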

Outline (Maxent Modeling summary) We began by seeking the conditional distribution p(y | x) which had maximal entropy H(p) subject to a set of linear constraints (7) Following the traditional procedure in constrained optimization, we introduced the Lagrangian ξ(p, Λ, γ), where Λ, γ are a set of Lagrange multipliers for the constraints we imposed on p(y | x) To find the solution to the optimization problem, we appealed to the Kuhn-Tucker theorem, which states that we can (1) first solve ξ(p, Λ, γ) for p to get a parametric form for p★ in terms of Λ, γ; (2) then plug p★ back in to ξ(p, Λ, γ), this time solving for Λ★, γ★.

Outline (Maxent Modeling summary) The parametric form for p★ turns out to have the exponential form (11) The γ★ gives rise to the normalizing factor Z(x), given in (12) The Λ★ will be solved for numerically using the dual function (14). Furthermore, it so happens that this function, Ψ(Λ), is the log-likelihood for the exponential model p (11). So what started as the maximization of entropy subject to a set of linear constraints turns out to be equivalent to the unconstrained maximization of likelihood of a certain parametric family of distributions.

Outline (Maxent Modeling summary) Table 1 summarizes the primal-dual framework:

                    Primal                       Dual
problem             argmax_{p ∈ C} H(p)          argmax_Λ Ψ(Λ)
description         maximum entropy              maximum likelihood
type of search      constrained optimization     unconstrained optimization
search domain       p ∈ C                        real-valued vectors {λ1, λ2, …}
solution            p★                           Λ★

Kuhn-Tucker theorem: p★ = p_Λ★

Computing the parameters
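The remaining task is to compute Λ★ numerically. As a hedged, illustrative sketch only (plain gradient ascent on the dual, a simpler substitute for the iterative scaling algorithms typically used for this problem; it reuses the hypothetical empirical_expectation and p_exponential helpers sketched above and the fact that ∂Ψ/∂λi = p̃(fi) − p(fi)):

def fit_lambdas(p_tilde, p_tilde_x, features, labels,
                learning_rate=0.1, iterations=1000):
    # Illustrative gradient ascent on the dual Ψ(Λ) (equivalently, on the
    # log-likelihood). Gradient for each λi: p̃(fi) − p(fi).
    lambdas = [0.0] * len(features)
    empirical = [empirical_expectation(p_tilde, f) for f in features]
    for _ in range(iterations):
        gradients = []
        for i, f in enumerate(features):
            model_exp = sum(
                p_x * p_exponential(x, y, lambdas, features, labels) * f(x, y)
                for x, p_x in p_tilde_x.items()
                for y in labels
            )
            gradients.append(empirical[i] - model_exp)
        lambdas = [l + learning_rate * g for l, g in zip(lambdas, gradients)]
    return lambdas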