Maximum Entropy Model LING 572 Fei Xia 02/08/07

Topics in LING 572 Easy: –kNN, Rocchio, DT, DL –Feature selection, binarization, system combination –Bagging –Self-training

Topics in LING 572 Slightly more complicated –Boosting –Co-training Hard (to some people): –MaxEnt –EM

History The concept of Maximum Entropy can be traced back along multiple threads to Biblical times. Introduced to the NLP area by Berger et al. (1996). Used in many NLP tasks: tagging, parsing, PP attachment, LM, …

Outline Main idea Modeling Training: estimating parameters Feature selection during training Case study

Main idea

Maximum Entropy Why maximum entropy? –Maximize entropy = minimize commitment Model all that is known and assume nothing about what is unknown. –Model all that is known: satisfy a set of constraints that must hold –Assume nothing about what is unknown: choose the most “uniform” distribution → choose the one with maximum entropy

Ex1: Coin-flip example (Klein & Manning 2003) Toss a coin: p(H)=p1, p(T)=p2. Constraint: p1 + p2 = 1 Question: what’s your estimation of p=(p1, p2)? Answer: choose the p that maximizes H(p) p1 H p1=0.3

Coin-flip example (cont) [Figure: H plotted over (p1, p2), showing the constraint plane p1 + p2 = 1 with the point p1 = 0.3 marked on it.]
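A minimal Python sketch (not from the original slides) of the curve behind these two figures: under the single constraint p1 + p2 = 1, entropy is maximized by the uniform coin.

```python
import numpy as np

# Entropy of a coin with p(H) = p1, p(T) = 1 - p1, under the constraint
# p1 + p2 = 1.  H is maximized at the uniform distribution p1 = 0.5.
p1 = np.linspace(0.01, 0.99, 99)
p2 = 1.0 - p1
H = -(p1 * np.log2(p1) + p2 * np.log2(p2))

print(f"H is maximized at p1 = {p1[np.argmax(H)]:.2f}")   # -> 0.50
```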

Ex2: An MT example (Berger et al., 1996) Possible translations for the word “in”: {dans, en, à, au cours de, pendant}. Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1. Intuitive answer: the uniform distribution, 1/5 for each translation.

An MT example (cont) Constraints: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1 and p(dans) + p(en) = 3/10. Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30.

An MT example (cont) Constraints: the two above, plus p(dans) + p(à) = 1/2. Intuitive answer: ?? With overlapping constraints there is no longer an obvious “uniform” split; maximum entropy gives a principled way to pick the distribution.
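A sketch (assuming NumPy and SciPy are available) that handles this third case numerically: maximize H(p) over the five translations subject to the three constraints above.

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "a", "au cours de", "pendant"]    # "a" stands in for "à"

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))                  # minimize -H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(a)  = 1/2
]
res = minimize(neg_entropy, np.full(5, 0.2),             # start from uniform
               bounds=[(0.0, 1.0)] * 5, constraints=constraints)
print(dict(zip(words, np.round(res.x, 3))))
```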

Ex3: POS tagging (Klein and Manning, 2003)

Ex3 (cont)

Ex4: overlapping features (Klein and Manning, 2003)

Modeling the problem Objective function: H(p) Goal: Among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p). Question: How to represent constraints?

Modeling

Reference papers (Ratnaparkhi, 1997), (Ratnaparkhi, 1996), (Berger et al., 1996), (Klein and Manning, 2003) → note that they use different notations.

The basic idea Goal: estimate p Choose p with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”).

Setting From training data, collect (a, b) pairs: –a: the thing to be predicted (e.g., a class in a classification problem) –b: the context –Ex: POS tagging: a = NN, b = the words in a window and the previous two tags Learn the probability of each (a, b): p(a, b)
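A small sketch of this setting for POS tagging; the context fields (a one-word window on each side plus the previous two tags) are illustrative choices, not the exact feature set used later.

```python
# Collect (a, b) events from one tagged sentence: a is the class to predict
# (the tag), b is the context.  The context fields here are illustrative.
def collect_events(words, tags):
    events = []
    for i, (w, t) in enumerate(zip(words, tags)):
        b = {
            "word": w,
            "prev_word": words[i - 1] if i > 0 else "<s>",
            "next_word": words[i + 1] if i + 1 < len(words) else "</s>",
            "prev_tags": tuple(tags[max(0, i - 2):i]),
        }
        events.append((t, b))
    return events

print(collect_events(["time", "flies"], ["NN", "VBZ"]))
```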

Features in POS tagging (Ratnaparkhi, 1996) [Table of example features, each pairing a context (a.k.a. history) with its allowable classes.]

Features A feature (a.k.a. feature function or indicator function) is a binary-valued function on events, f_j: A x B → {0, 1}, where A is the set of possible classes (e.g., tags in POS tagging) and B is the space of contexts (e.g., neighboring words/tags in POS tagging). Ex: see the sketch below.
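A sketch of one such binary feature function; the specific predicate (tag NN together with a word ending in “-tion”) is just an illustration.

```python
# f_j: A x B -> {0, 1}.  This feature fires when the candidate class is NN
# and the current word in the context ends in "tion".
def f_j(a, b):
    return 1 if a == "NN" and b["word"].endswith("tion") else 0

print(f_j("NN", {"word": "information"}))   # 1
print(f_j("VB", {"word": "information"}))   # 0
```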

Some notation Finite training sample of events: S = {x_1, …, x_N}, where each event x = (a, b). Observed probability of x in S: p~(x) = count(x in S) / N. The model p’s probability of x: p(x). The j-th feature: f_j. Model expectation of f_j: E_p f_j = Σ_x p(x) f_j(x). Observed expectation of f_j: E_p~ f_j = Σ_x p~(x) f_j(x) (the empirical count of f_j divided by N).

Constraints Model’s feature expectation = observed feature expectation: E_p f_j = E_p~ f_j for j = 1, …, k. How do we calculate E_p~ f_j?

Training data → observed events
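A sketch of computing the observed feature expectation directly from the list of observed events; the tiny event list and feature are illustrative.

```python
# E_p~[f_j] = (1/N) * sum_i f_j(a_i, b_i), i.e., the empirical count of f_j
# divided by the number of training events.
def observed_expectation(f, events):
    return sum(f(a, b) for a, b in events) / len(events)

feature = lambda a, b: 1 if a == "NN" and b["word"].endswith("tion") else 0
events = [("NN", {"word": "information"}), ("VB", {"word": "run"})]
print(observed_expectation(feature, events))   # -> 0.5
```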

Restating the problem The task: find p* = argmax_{p in P} H(p), where P = {p | E_p f_j = E_p~ f_j, j = 1, …, k}. Objective function: H(p). Constraints: E_p f_j = E_p~ f_j. A feature can also be added so that the normalization requirement Σ_x p(x) = 1 takes the same form as the other constraints.

Questions Is P empty? Does p* exist? Is p* unique? What is the form of p*? How to find p*?

What is the form of p*? (Ratnaparkhi, 1997) Let Q be the set of models of the exponential form p(x) = π Π_j α_j^{f_j(x)}, with α_j > 0. Theorem: if p* is in P ∩ Q, then p* = argmax_{p in P} H(p). Furthermore, p* is unique.

Using Lagrange multipliers Minimize A(p) = -H(p) + Σ_j λ_j (E_p f_j - E_p~ f_j) + μ (Σ_x p(x) - 1); setting the derivative with respect to each p(x) to zero yields the exponential form of p*.

Two equivalent forms p*(x) = π Π_j α_j^{f_j(x)} and p*(x) = (1/Z) exp(Σ_j λ_j f_j(x)), where α_j = e^{λ_j} and π = 1/Z is the normalization constant.

Relation to Maximum Likelihood The log-likelihood of the empirical distribution p~ as predicted by a model q is defined as L(q) = Σ_x p~(x) log q(x). Theorem: if p* is in P ∩ Q, then p* = argmax_{q in Q} L(q). Furthermore, p* is unique.

Summary (so far) Goal: find p* in P, the distribution that maximizes H(p). The model p* in P with maximum entropy is also the model in Q that maximizes the likelihood of the training sample. It can be proved that, when p* exists, it is unique.

Summary (cont) Adding constraints (features): (Klein and Manning, 2003) –Lower maximum entropy –Raise maximum likelihood of data –Bring the distribution further away from uniform –Bring the distribution closer to data

Training

Algorithms Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972) Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)

GIS: setup Requirements for running GIS: Obey the form of the model and the constraints: p(x) = π Π_j α_j^{f_j(x)} with E_p f_j = E_p~ f_j. An additional constraint: Σ_{j=1..k} f_j(x) = C for all x, for some constant C. Let C = max_x Σ_{j=1..k} f_j(x). Add a new “slack” feature f_{k+1}: f_{k+1}(x) = C - Σ_{j=1..k} f_j(x), so the additional constraint holds for every event.

GIS algorithm Compute d_j = E_p~ f_j, j = 1, …, k+1. Initialize the weights λ_j^(0) (any values, e.g., 0). Repeat until convergence: for each j, compute the model expectation E_{p^(n)} f_j under the current weights, then update λ_j^(n+1) = λ_j^(n) + (1/C) log(d_j / E_{p^(n)} f_j).
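A compact sketch of GIS for a conditional model p(a|b); the model-expectation step uses the approximation described on the next slide (summing over the observed contexts). The function signature and helpers are assumptions for illustration, not the original implementation.

```python
import math

def gis(features, classes, events, C, iterations=100):
    """features: list of binary functions f_j(a, b); events: list of (a, b);
    C: constant such that sum_j f_j(a, b) = C for every event (the slack
    feature f_{k+1} would be included in `features` to guarantee this)."""
    k = len(features)
    lam = [0.0] * k                               # initialize all weights to 0
    N = len(events)
    # observed expectations d_j = E_p~[f_j]
    d = [sum(f(a, b) for a, b in events) / N for f in features]

    for _ in range(iterations):
        # approximate E_p[f_j] by summing over the observed contexts b_i
        model_E = [0.0] * k
        for _, b in events:
            scores = {a: math.exp(sum(lam[j] * features[j](a, b) for j in range(k)))
                      for a in classes}
            Z = sum(scores.values())
            for a in classes:
                p_ab = scores[a] / Z
                for j in range(k):
                    model_E[j] += p_ab * features[j](a, b) / N
        # GIS update: lambda_j += (1/C) * log(d_j / E_p[f_j])
        for j in range(k):
            if d[j] > 0 and model_E[j] > 0:
                lam[j] += (1.0 / C) * math.log(d[j] / model_E[j])
    return lam
```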

Approximation for calculating feature expectation E_p f_j = Σ_b p(b) Σ_a p(a|b) f_j(a, b) ≈ (1/N) Σ_{i=1..N} Σ_a p(a|b_i) f_j(a, b_i), i.e., sum over the contexts b_i observed in the training data rather than over all possible contexts.

Properties of GIS L(p^(n+1)) >= L(p^(n)). The sequence is guaranteed to converge to p*. The convergence can be very slow. The running time of each iteration is O(NPA): –N: the training set size –P: the number of classes –A: the average number of features that are active for a given event (a, b).

Feature selection

Throw in many features and let the machine select the weights –Manually specify feature templates Problem: too many features An alternative: greedy algorithm –Start with an empty set S –Add a feature at each iteration

Notation With the feature set S: the model p_S. After adding a feature f: the model p_{S ∪ {f}}. The gain in the log-likelihood of the training data: G_S(f) = L(p_{S ∪ {f}}) - L(p_S).

Feature selection algorithm (Berger et al., 1996) Start with S being empty; thus p_S is uniform. Repeat until the gain is small enough: –For each candidate feature f Compute the model using IIS Calculate the log-likelihood gain –Choose the feature with maximal gain, and add it to S → Problem: too expensive
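A sketch of this greedy loop. The callables train_model and log_likelihood are hypothetical helpers supplied by the caller (e.g., wrappers around GIS/IIS and L(q) from the earlier slides); they are shown only to make the control flow concrete.

```python
def select_features(candidates, events, train_model, log_likelihood, min_gain=1e-3):
    # train_model(S, events) and log_likelihood(model, events) are assumed
    # helpers, not real APIs from any library.
    S = []
    model = train_model(S, events)                # p_S is uniform when S is empty
    base_ll = log_likelihood(model, events)
    while True:
        gains = {}
        for f in candidates:
            if f in S:
                continue
            m = train_model(S + [f], events)      # expensive: retrain per candidate
            gains[f] = log_likelihood(m, events) - base_ll
        if not gains:
            return S, model
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:                # stop when the gain is small
            return S, model
        S.append(best)
        model = train_model(S, events)
        base_ll = log_likelihood(model, events)
```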

Approximating gains (Berger et al., 1996) Instead of recalculating all the weights, calculate only the weight of the new feature.

Training a MaxEnt Model Scenario #1: no feature selection during training Define feature templates Create the feature set Determine the optimum feature weights via GIS or IIS Scenario #2: with feature selection during training Define feature templates Create the candidate feature set S At every iteration, choose the feature from S with max gain and determine its weight (or choose the top-n features and their weights).

Case study

POS tagging (Ratnaparkhi, 1996) Notation variation: –f_j(a, b): a: class, b: context –f_j(h_i, t_i): h_i: history for the i-th word, t_i: tag for the i-th word History: h_i = {w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}} Training data: –Treat it as a list of (h_i, t_i) pairs. –How many pairs are there? One per word token in the corpus.

Using a MaxEnt Model Modeling: define p(t|h), the conditional probability of a tag given its history. Training: –Define feature templates –Create the feature set –Determine the optimum feature weights via GIS or IIS Decoding: beam search over tag sequences.

Modeling P(t_1, …, t_n | w_1, …, w_n) ≈ Π_i p(t_i | h_i), where each factor is given by the MaxEnt model p(t|h).

Training step 1: define feature templates [Table of feature templates, each pairing a predicate on the history h_i with the tag t_i.]

Step 2: Create the feature set Collect all the features from the training data, then throw away features that appear fewer than 10 times.

Step 3: determine the feature weights GIS Training time: –Each iteration: O(NTA): N: the training set size T: the number of allowable tags A: average number of features that are active for a (h, t). – About 24 hours on an IBM RS/6000 Model 380. How many features?

Decoding: Beam search Generate tags for w_1, find the top N, and set s_{1j} accordingly, j = 1, 2, …, N For i = 2 to n (n is the sentence length) –For j = 1 to N Generate tags for w_i, given s_{(i-1)j} as the previous tag context Append each tag to s_{(i-1)j} to make a new sequence. –Find the N highest-probability sequences generated above, and set s_{ij} accordingly, j = 1, …, N Return the highest-probability sequence s_{n1}.
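A sketch of this beam-search decoder; p_tag(tag, word, prev_tags) is an assumed callable returning p(t_i | h_i) from the trained model, not part of the original slides.

```python
import math

def beam_search(words, tagset, p_tag, N=5):
    """Keep the N best tag sequences after each word; p_tag is an assumed
    callable returning p(tag | current word, previous two tags)."""
    beams = [([], 0.0)]                            # (tag sequence, log probability)
    for w in words:
        candidates = []
        for seq, logp in beams:
            for t in tagset:                       # extend every kept sequence
                candidates.append((seq + [t],
                                   logp + math.log(p_tag(t, w, tuple(seq[-2:])))))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:N]
    return beams[0][0]                             # highest-probability sequence
```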

Beam search

Viterbi search

Decoding (cont) Tags for words: –Known words: use tag dictionary –Unknown words: try all possible tags Ex: “time flies like an arrow” Running time: O(NTAB) –N: sentence length –B: beam size –T: tagset size –A: average number of features that are active for a given event

Experiment results

Comparison with other learners vs. HMM: MaxEnt uses more context. vs. SDT (statistical decision trees): MaxEnt does not split the data. vs. TBL: MaxEnt is statistical and provides probability distributions.

MaxEnt Summary Concept: choose the p* that maximizes entropy while satisfying all the constraints. Max likelihood: p* is also the model, within the exponential model family Q, that maximizes the log-likelihood of the training data. Training: GIS or IIS, both of which can be slow. MaxEnt handles overlapping features well. In general, MaxEnt achieves good performance on many NLP tasks.

Additional slides

Ex4 (cont) ??

IIS algorithm Compute d_j = E_p~ f_j, j = 1, …, k+1, and f#(x) = Σ_j f_j(x). Initialize the weights λ_j^(0) (any values, e.g., 0). Repeat until convergence: for each j, let δ_j be the solution to Σ_b p~(b) Σ_a p^(n)(a|b) f_j(a, b) exp(δ_j f#(a, b)) = d_j, then update λ_j^(n+1) = λ_j^(n) + δ_j.

Calculating δ_j If f#(x) is the same constant C for every event x, then δ_j = (1/C) log(d_j / E_{p^(n)} f_j) and IIS is the same as GIS. Else δ_j must be calculated numerically, e.g., with Newton’s method, as sketched below.
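A sketch (with illustrative arrays, not the original code) of solving the IIS update equation for one δ_j by Newton’s method when f#(x) is not constant.

```python
import math

def solve_delta(d_j, p_x, fj_x, fsharp_x, iters=50):
    """Solve sum_x p(x) * f_j(x) * exp(delta * f#(x)) = d_j for delta.
    p_x, fj_x, fsharp_x are illustrative arrays of p(x), f_j(x), f#(x)."""
    delta = 0.0
    for _ in range(iters):
        g = sum(p * f * math.exp(delta * fs)
                for p, f, fs in zip(p_x, fj_x, fsharp_x)) - d_j
        g_prime = sum(p * f * fs * math.exp(delta * fs)
                      for p, f, fs in zip(p_x, fj_x, fsharp_x))
        if abs(g) < 1e-10 or g_prime == 0:
            break
        delta -= g / g_prime                       # Newton step
    return delta
```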