Maximum Entropy Model (I) LING 572 Fei Xia Week 5: 02/05-02/07/08


MaxEnt in NLP
The maximum entropy principle has a long history. The MaxEnt algorithm was introduced to the NLP field by Berger et al. (1996). It is used in many NLP tasks: tagging, parsing, PP attachment, …

Reference papers
(Ratnaparkhi, 1997), (Berger et al., 1996), (Ratnaparkhi, 1996), (Klein and Manning, 2003)
These papers often use different notations.

Notation
                             Input   Output   The pair
(Berger et al., 1996)        x       y        (x, y)
(Ratnaparkhi, 1997)          b       a        x
(Ratnaparkhi, 1996)          h       t        (h, t)
(Klein and Manning, 2003)    d       c        (c, d)
We follow the notation in (Berger et al., 1996).

Outline: Overview; The maximum entropy principle; Modeling**; Decoding; Training*; Case study: POS tagging

Overview

Joint vs. conditional models
Given training data {(x,y)}, we want to build a model to predict y for new x's. For each model, we need to estimate the parameters µ.
Joint (generative) models estimate P(x,y) by maximizing the likelihood P(X,Y|µ):
– Ex: n-gram models, HMM, Naïve Bayes, PCFG
– Choosing weights is trivial: just use relative frequencies.
Conditional (discriminative) models estimate P(y|x) by maximizing the conditional likelihood P(Y|X,µ):
– Ex: MaxEnt, SVM, etc.
– Choosing weights is harder.

Naïve Bayes Model
[Graphical model: class node C with children f_1, f_2, …, f_n]
Assumption: each f_m is conditionally independent of f_n given C.

The conditional independence assumption
f_m and f_n are conditionally independent given c: P(f_m | c, f_n) = P(f_m | c)
Counter-examples in the text classification task:
– P(“Iowa” | politics) != P(“Iowa” | politics, “caucus”)
– P(“delegate” | politics) != P(“delegate” | politics, “primary”)
Q: How do we deal with correlated features?
A: Many models, including MaxEnt, do not assume that features are conditionally independent.

Naïve Bayes highlights
Choose c* = arg max_c P(c) Π_k P(f_k | c)
Two types of model parameters:
– Class prior: P(c)
– Conditional probability: P(f_k | c)
The number of model parameters: |C| + |C| * |V|

P(f | c) in NB
        f1           f2           …    fj
c1      P(f1|c1)     P(f2|c1)     …    P(fj|c1)
c2      P(f1|c2)     …            …    …
…       …            …            …    …
ci      P(f1|ci)     …            …    P(fj|ci)
Each cell is a weight for a particular (class, feature) pair.

Weights in NB and MaxEnt
In NB:
– P(f | c) are probabilities (i.e., in [0,1])
– P(f | c) are multiplied at test time
In MaxEnt:
– the weights are real numbers; they can be negative
– the weights are added at test time

The highlights in MaxEnt
Training: to estimate the weights λ_j
Testing: to calculate P(y|x)
f_j(x,y) is a feature function, which normally corresponds to a (feature, class) pair.
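The training/testing formulas on this slide were images in the original. For reference, the standard conditional MaxEnt form (following Berger et al., 1996), which the decoding slides later in the deck assume, is:

P(y|x) = exp( Σ_j λ_j f_j(x, y) ) / Z_λ(x),   where   Z_λ(x) = Σ_{y'} exp( Σ_j λ_j f_j(x, y') )

Training estimates the weights λ_j; testing plugs them into this formula.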

Main questions
What is the maximum entropy principle?
What is a feature function?
Modeling: Why does P(y|x) have this form?
Training: How do we estimate λ_j?

Outline: Overview; The maximum entropy principle; Modeling**; Decoding; Training*; Case study

The maximum entropy principle

The maximum entropy principle
Related to Occam's razor and other similar justifications for scientific inquiry: make the minimum assumptions about unseen data.
Also related to Laplace's Principle of Insufficient Reason: when one has no information to distinguish between the probabilities of two events, the best strategy is to consider them equally likely.

Maximum Entropy
Why maximum entropy?
– Maximize entropy = minimize commitment.
Model all that is known and assume nothing about what is unknown:
– Model all that is known: satisfy a set of constraints that must hold.
– Assume nothing about what is unknown: choose the most “uniform” distribution → choose the one with maximum entropy.

Ex1: Coin-flip example (Klein & Manning, 2003)
Toss a coin: p(H) = p1, p(T) = p2.
Constraint: p1 + p2 = 1
Question: what is p(X)? That is, what is the value of p1?
Answer: choose the p that maximizes H(p).
[Plot of H(p) as a function of p1; the maximum is at p1 = 0.5]
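As a quick check that is not on the original slide, the answer p1 = 0.5 follows directly from maximizing the entropy under the constraint p2 = 1 - p1:

H(p) = - p1 log p1 - (1 - p1) log (1 - p1),   dH/dp1 = log( (1 - p1) / p1 ) = 0   =>   p1 = 1/2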

Coin-flip example (cont)
[Plots of H over (p1, p2), restricted to the constraint line p1 + p2 = 1.0; a second plot adds the constraint p1 = 0.3, which picks out a single distribution]

Ex2: An MT example (Berger et al., 1996)
Possible translations of the word “in”; a constraint on their probabilities; the intuitive answer. (See the reconstruction after this example.)

An MT example (cont)
Additional constraints; the intuitive answer.

An MT example (cont)
More constraints; now the intuitive answer is no longer obvious (??).
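The translation set and constraints on the three Ex2 slides were images. The following reconstruction follows the example as given in Berger et al. (1996); the exact numbers should be checked against the paper:

Step 1: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
        Intuitive answer: each translation gets probability 1/5.
Step 2: add the constraint p(dans) + p(en) = 3/10
        Intuitive answer: p(dans) = p(en) = 3/20, and the other three translations get 7/30 each.
Step 3: add the constraint p(dans) + p(à) = 1/2
        There is no longer an obvious “intuitive” answer; this is where the maximum entropy principle is needed.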

Ex3: POS tagging (Klein and Manning, 2003)

Ex3 (cont)

Ex4: Overlapping features (Klein and Manning, 2003)
[A 2x2 grid of cell probabilities: p1, p2, p3, p4]

Ex4 (cont)
[The grid after one constraint: p1, p2, 2/3-p1, 1/3-p2]

Ex4 (cont)
[The grid after adding an overlapping constraint: p1, 2/3-p1, 2/3-p1, p1-1/3]

Reading #3 (Q1): Let P(X=i) be the probability of getting an i when rolling a die. What is the maximum entropy distribution P(X) if the following is true?
(a) P(X=1) + P(X=2) = 1/2
    Answer: 1/4, 1/4, 1/8, 1/8, 1/8, 1/8
(b) P(X=1) + P(X=2) = 1/2 and P(X=6) = 1/3
    Answer: 1/4, 1/4, 1/18, 1/18, 1/18, 1/3
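A quick check of (b), not on the original slide: the constraints leave 1 - 1/2 - 1/3 = 1/6 of the probability mass for {3, 4, 5}, and maximum entropy splits it evenly, giving 1/18 each; likewise the 1/2 assigned to {1, 2} splits into 1/4 each.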

(Q2) In the text classification task, |V| is the number of features and |C| is the number of classes. How many feature functions are there? Answer: |C| * |V|

The MaxEnt Principle: summary
Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p).
Q1: How do we represent constraints?
Q2: How do we find such distributions?

Outline: Overview; The maximum entropy principle; Modeling**; Decoding; Training*; Case study

Modeling

The Setting
From the training data, collect (x, y) pairs:
– x ∈ X: the observed data
– y ∈ Y: the thing to be predicted (e.g., a class in a classification problem)
– Ex: in a text classification task, x is a document and y is the category of the document.
Goal: to estimate P(y|x)

The basic idea
Goal: to estimate p(y|x).
Choose p(x,y) with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”).

The outline for modeling: feature functions; calculating the expectation of a feature function; the forms of P(x,y) and P(y|x)

Feature function

The definition
A feature function is a binary-valued function on events: f_j : X × Y → {0, 1}.
A feature function corresponds to a (feature, class) pair (t, c):
f_j(x,y) = 1 if and only if t is present in x and y == c.
Ex: see the sketch below.
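A minimal sketch of what such feature functions look like in code (not from the original slides; the term and class names are made up for illustration):

# A MaxEnt feature function f_j(x, y) fires (returns 1) when a particular
# term t is present in the input x AND the class label y equals a particular c.
def make_feature_function(t, c):
    """Return f_j for the (feature, class) pair (t, c)."""
    def f(x, y):
        # x: the set of terms observed in the input; y: the class label
        return 1 if (t in x) and (y == c) else 0
    return f

# Hypothetical usage:
f_j = make_feature_function("caucus", "politics")
print(f_j({"caucus", "delegate"}, "politics"))  # 1
print(f_j({"caucus", "delegate"}, "sports"))    # 0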

The weights in NB
[An empty table with rows c1, c2, …, ci and columns f1, f2, …, fk; the next slide fills in the cells.]

The weights in NB
        f1           f2           …    fj
c1      P(f1|c1)     P(f2|c1)     …    P(fj|c1)
c2      P(f1|c2)     …            …    …
…       …            …            …    …
ci      P(f1|ci)     …            …    P(fj|ci)
Each cell is a weight for a particular (class, feature) pair.

The matrix in MaxEnt
        t1               t2           …    tk
c1      f_1              f_2          …    f_k
c2      f_{k+1}          f_{k+2}      …    f_{2k}
…       …                …            …    …
ci      f_{k*(i-1)+1}    …            …    f_{k*i}
Each feature function f_j corresponds to a (feature, class) pair.

The weights in MaxEnt
        t1      t2      …    tk
c1      λ_1     λ_2     …    λ_k
c2      …       …       …    …
…       …       …       …    …
ci      …       …       …    λ_{ki}
Each feature function f_j has a weight λ_j.

Feature function summary
A feature function in MaxEnt corresponds to a (feature, class) pair.
The number of feature functions in MaxEnt is approximately |C| * |F|.
A MaxEnt trainer learns the weights of the feature functions.

The outline for modeling: feature functions; calculating the expectation of a feature function; the forms of P(x,y) and P(y|x)

The expectation
Ex1: Flip a coin.
– If it is heads, you will get 100 dollars; if it is tails, you will lose 50 dollars.
– What is the expected return? P(X=H) * 100 + P(X=T) * (-50)
Ex2: If the outcome is x_i, you will receive v_i dollars.
– What is the expected return? Σ_i P(X=x_i) * v_i

Empirical expectation
Denoted as the expectation under the empirical distribution p̃.
Ex1: Toss a coin four times and get H, T, H, and H.
The average return: (100 - 50 + 100 + 100) / 4 = 62.5
Empirical distribution: p̃(H) = 3/4, p̃(T) = 1/4
Empirical expectation: 3/4 * 100 + 1/4 * (-50) = 62.5

Model expectation
Ex1: Toss a coin four times and get H, T, H, and H.
A model: p(H) = p(T) = 1/2
Model expectation: 1/2 * 100 + 1/2 * (-50) = 25

Some notations
Training data; empirical distribution; a model; model expectation of f_j; empirical expectation of f_j; the j-th feature. (The definitions are reconstructed below.)
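The definitions on this slide were images. The standard definitions (following Berger et al., 1996), which match the pseudocode on the next few slides, are:

Training data: {(x_1, y_1), …, (x_N, y_N)}
Empirical distribution: p̃(x, y) = count(x, y) / N
A model: p(y | x)
The j-th feature: f_j(x, y)
Empirical expectation of f_j:  E_p̃[f_j] = Σ_{x,y} p̃(x, y) f_j(x, y)
Model expectation of f_j:      E_p[f_j] = Σ_{x,y} p̃(x) p(y | x) f_j(x, y)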

Empirical expectation

An example
Training data:
x1 c1 t1 t2 t3
x2 c2 t1 t4
x3 c1 t4
x4 c3 t1 t3
Raw counts:
        t1    t2    t3    t4
c1      1     1     1     1
c2      1     0     0     1
c3      1     0     1     0

An example (cont)
Training data: (same as above)
Empirical expectation (raw counts divided by N = 4):
        t1     t2     t3     t4
c1      1/4    1/4    1/4    1/4
c2      1/4    0/4    0/4    1/4
c3      1/4    0/4    1/4    0/4

Calculating empirical expectation
Let N be the number of training instances.
for each instance x:
    let y be the true class label of x
    for each feature t in x:
        empirical_expect[t][y] += 1/N
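A runnable version of the pseudocode above, as a rough sketch; the data structures and the toy training set (taken from the earlier example slide) are my own choices:

from collections import defaultdict

# Toy training data from the earlier slide: (features, class) pairs.
training_data = [
    ({"t1", "t2", "t3"}, "c1"),
    ({"t1", "t4"}, "c2"),
    ({"t4"}, "c1"),
    ({"t1", "t3"}, "c3"),
]

def empirical_expectation(data):
    # E_p~[f] for every (feature, class) pair, i.e. raw counts divided by N.
    N = len(data)
    expect = defaultdict(float)      # keyed by (feature, class)
    for x, y in data:                # y is the true class label of x
        for t in x:
            expect[(t, y)] += 1.0 / N
    return expect

print(empirical_expectation(training_data)[("t1", "c1")])  # 0.25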

Model expectation

An example
Suppose P(y|x_i) = 1/3 for every class y and every instance x_i.
Training data: (same as above)
“Raw” counts, where each occurrence of a feature t contributes P(y|x) = 1/3 to every class:
        t1         t2     t3     t4
c1      1/3 * 3    1/3    2/3    2/3
c2      1/3 * 3    1/3    2/3    2/3
c3      1/3 * 3    1/3    2/3    2/3

Calculating model expectation
Let N be the number of training instances.
for each instance x:
    calculate P(y|x) for every y ∈ Y
    for each feature t in x:
        for each y ∈ Y:
            model_expect[t][y] += 1/N * P(y|x)
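A sketch of the model-expectation computation in the same style; the uniform P(y|x) = 1/3 mirrors the example above, and a trained model would supply real probabilities:

from collections import defaultdict

CLASSES = ["c1", "c2", "c3"]

def model_expectation(data, cond_prob):
    # E_p[f] for every (feature, class) pair under the model P(y|x).
    N = len(data)
    expect = defaultdict(float)
    for x, _gold in data:
        p = {y: cond_prob(x, y) for y in CLASSES}   # P(y|x) for every y
        for t in x:
            for y in CLASSES:
                expect[(t, y)] += (1.0 / N) * p[y]
    return expect

# With the toy training_data from the previous sketch and a uniform model:
uniform = lambda x, y: 1.0 / len(CLASSES)
# model_expectation(training_data, uniform)[("t1", "c1")] == 3 * (1/4) * (1/3) = 0.25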

The outline for modeling: feature functions; calculating the expectation of a feature function; the forms of P(x,y) and P(y|x)

Constraints
Model expectation = Empirical expectation
Why impose such constraints?
– The MaxEnt principle: model what is known.
– To maximize the conditional likelihood: see slides #24-28 in (Klein and Manning, 2003).

The conditional likelihood (**)
Given the data (X,Y), the conditional likelihood is a function of the parameters λ.
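The formula itself was an image; the standard log conditional likelihood for a model of the exponential form, which is presumably what the slide showed, is:

log L(λ) = Σ_i log p_λ(y_i | x_i) = Σ_i ( Σ_j λ_j f_j(x_i, y_i) - log Z_λ(x_i) )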

The effect of adding constraints
– Lower the entropy
– Raise the likelihood of the data
– Bring the distribution further away from uniform
– Bring the distribution closer to the data

Restating the problem
The task: find p* in P, the set of distributions that satisfy the constraints.
Objective function: H(p)
Constraints: model expectation = empirical expectation for every feature f_j
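Spelled out with the notation defined earlier (the original equations were images):

p* = argmax_{p ∈ P} H(p)
P  = { p : E_p[f_j] = E_p̃[f_j] for j = 1, …, k }
H(p) = - Σ_{x,y} p̃(x) p(y | x) log p(y | x)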

Questions
Is P empty?
Does p* exist?
Is p* unique?
What is the form of p*?
How do we find p*?

What is the form of p*? (Ratnaparkhi, 1997)
Theorem: if p* is in both P (the distributions satisfying the constraints) and Q (the distributions of the exponential form on the next slide), then p* maximizes H(p) over P. Furthermore, p* is unique.

Two equivalent forms (reconstructed below)
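The two forms were images in the original. The standard pair is the log-linear form and the equivalent product form obtained by setting μ_j = e^{λ_j}:

p(y | x) = (1 / Z_λ(x)) exp( Σ_j λ_j f_j(x, y) )
p(y | x) = (1 / Z_μ(x)) Π_j μ_j^{f_j(x, y)}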

Modeling summary
Goal: find p* in P, which maximizes H(p).
It can be proved that, when p* exists:
- it is unique
- it maximizes the conditional likelihood of the training data
- it is a model in Q

Outline: Overview; The maximum entropy principle; Modeling**; Decoding; Training*; Case study: POS tagging

Decoding

Decoding
Z is a constant with respect to y.
        t1      t2      …    tk
c1      λ_1     λ_2     …    λ_k
c2      …       …       …    …
…       …       …       …    …
ci      …       …       …    λ_{ki}

The procedure for calculating P(y | x)
Z = 0
for each y ∈ Y:
    sum = 0
    for each feature t present in x:
        sum += the weight for (t, y)
    result[y] = exp(sum)
    Z += result[y]
for each y ∈ Y:
    P(y|x) = result[y] / Z
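The same procedure as a small runnable sketch; the weight table here is a made-up toy example, not trained weights:

import math

def decode(x, weights, classes):
    # Compute P(y|x) for each class y from a MaxEnt weight table.
    # weights: dict mapping (feature, class) -> lambda; missing pairs count as 0.
    # x: the features present in the input.
    result = {}
    Z = 0.0
    for y in classes:
        s = sum(weights.get((t, y), 0.0) for t in x)   # sum of weights for (t, y)
        result[y] = math.exp(s)
        Z += result[y]
    return {y: result[y] / Z for y in classes}

# Toy example with made-up weights:
weights = {("t1", "c1"): 0.5, ("t1", "c2"): -0.2, ("t3", "c1"): 1.0}
print(decode({"t1", "t3"}, weights, ["c1", "c2"]))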

MaxEnt summary so far
Idea: choose the p* that maximizes entropy while satisfying all the constraints.
p* is also the model within a model family (Q) that maximizes the conditional likelihood of the training data.
MaxEnt handles overlapping features well.
In general, MaxEnt achieves good performance on many NLP tasks.
Next: training. There are many training methods (e.g., GIS, IIS, L-BFGS).

Hw5

Q1: run the Mallet MaxEnt learner
The format of the model file:
FEATURES FOR CLASS c
 t …
 t …
FEATURES FOR CLASS c
 t …
 t …

Q2: write a MaxEnt decoder
The formula for P(y|x): λ_0 is the weight for the default feature.
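The formula was an image. Assuming the default feature is a per-class bias that fires on every instance (my reading of the slide, not something it states), P(y|x) would take the form:

P(y | x) = exp( λ_{0,y} + Σ_j λ_j f_j(x, y) ) / Σ_{y'} exp( λ_{0,y'} + Σ_j λ_j f_j(x, y') )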

Q3: Write functions to calculate expectations

Q4-Q6: The stoplight example
Which model: Bernoulli or multinomial? Why?
What features?
– Q4: f1 and f2
– Q5: f1, f2, and f3
– Q6: f3 only