First-Order Probabilistic Models for Coreference Resolution
Aron Culotta
Computer Science Department, University of Massachusetts Amherst
Joint work with Andrew McCallum [advisor], Michael Wick, Robert Hall

Previous work: Conditional Random Fields for Coreference

A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]
[Figure: mention nodes x1 = "Mr Powell", x2 = "Powell", x3 = "she", with binary coreference variables y between pairs, e.g. Coreferent(x2, x3)?]
Pairwise compatibility score learned from training data
Hard transitivity constraints enforced by prediction algorithm

Prediction in PW-CRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
[Figure: partitioning the weighted graph over mentions x1 = "Mr Powell", x2 = "Powell", x3 = "she"]
Often approximated with agglomerative clustering

Parameter Estimation in PW-CRFs
Given labeled documents, generate all pairs of mentions
– Optionally prune distant mention pairs [Soon, Ng, Lim 2001]
Learn binary classifier to predict coreference
Edge weights proportional to classifier output
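
To make this concrete, here is a minimal Python sketch of the pairwise setup; the feature functions, pruning window, and logistic-regression classifier are illustrative assumptions, not the exact configuration from the talk.

# Sketch of pairwise coreference training: labeled documents -> mention pairs -> binary classifier.
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def pairwise_features(m1, m2):
    """Toy feature vector for a mention pair; mentions are dicts with 'text', 'pos', 'entity_id'."""
    return [
        1.0 if m1["text"].lower() == m2["text"].lower() else 0.0,  # exact string match
        float(abs(m1["pos"] - m2["pos"])),                          # distance within the document
    ]

def train_pairwise_classifier(documents, max_distance=10):
    """Generate labeled mention pairs from labeled documents and fit a binary classifier."""
    X, y = [], []
    for mentions in documents:                                # each document: list of gold mentions
        for m1, m2 in combinations(mentions, 2):
            if abs(m1["pos"] - m2["pos"]) > max_distance:     # optionally prune distant pairs
                continue
            X.append(pairwise_features(m1, m2))
            y.append(1 if m1["entity_id"] == m2["entity_id"] else 0)  # coreferent or not
    return LogisticRegression().fit(X, y)

# Edge weights for graph partitioning are then proportional to the classifier's output,
# e.g. clf.predict_proba([pairwise_features(m1, m2)])[0, 1].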

Sometimes pairwise comparisons are insufficient
Entities have multiple attributes (name, …, institution, location); need to measure "compatibility" among them.
Having 2 "given names" is common, but not 4.
– e.g. Howard M. Dean / Martin, Dean / Howard Martin
Need to measure the size of the clusters of mentions.
∃ a pair of name strings whose edit distance differs by > 0.5?
Maximum distance between mentions in the document?
An entity contains only pronoun mentions?
We need measures on hypothesized "entities"
We need first-order logic

First-Order Logic CRFs for Coreference

First-Order Logic CRFs for Coreference (FOL-CRF)
[Figure: mention nodes x1 = "Mr Powell", x2 = "Powell", x3 = "she" connected to a single factor y asking Coreferent(x1, x2, x3)?]
Clusterwise compatibility score learned from training data
Features are arbitrary FOL predicates over a set of mentions
As in PW-CRF, prediction can be approximated with agglomerative clustering
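
To make the clusterwise scoring and the agglomerative approximation concrete, here is a small sketch; the cluster feature function, the weight vector, and the merge threshold are illustrative assumptions rather than the talk's exact components.

# Sketch: greedy agglomerative clustering driven by a clusterwise (first-order) score.
import numpy as np

def cluster_score(cluster, weights, cluster_features):
    """Compatibility of a hypothesized entity: dot product of weights and cluster-level features."""
    return float(np.dot(weights, cluster_features(cluster)))

def greedy_agglomerative(mentions, weights, cluster_features, threshold=0.0):
    """Repeatedly merge the pair of clusters whose union scores highest, until no merge looks compatible."""
    clusters = [[m] for m in mentions]                     # start from singleton clusters
    while len(clusters) > 1:
        best_score, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_score(clusters[i] + clusters[j], weights, cluster_features)
                if best_score is None or s > best_score:
                    best_score, best_pair = s, (i, j)
        if best_score is None or best_score < threshold:   # stop when the best merge is too incompatible
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters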

Learning Parameters of FOL-CRFs
Generate classification examples where the input is a set of mentions
Unlike the pairwise CRF, we cannot generate all possible examples from the training data

Learning Parameters of FOL-CRFs
[Figure: mentions "He", "Powell", "Rice", "She", "he", "Secretary" with candidate queries Coreferent(x1, x2), Coreferent(x1, x2, x3), Coreferent(x1, x2, x3, x4), Coreferent(x1, x2, x3, x4, x5), Coreferent(x1, x2, x3, x4, x5, x6), …]
Combinatorial Explosion!
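
A quick back-of-the-envelope count shows why exhaustive enumeration is hopeless (a sketch: for n mentions there are 2^n - n - 1 candidate subsets of size >= 2, and the number of complete clusterings is the Bell number B(n)):

def bell_number(n):
    """Bell number B(n) via the Bell triangle; B(n) is the last entry of row n."""
    if n == 0:
        return 1
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]                 # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

for n in (6, 10, 20):
    subsets = 2**n - n - 1                  # candidate Coreferent(...) queries over >= 2 mentions
    print(n, subsets, bell_number(n))       # n=20 already gives ~1e6 subsets and ~5.2e13 clusterings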

This space complexity is common in probabilistic first-order logic: Gaifman 1964; Halpern 1990; Paskin 2002; Poole 2003; Richardson & Domingos 2006

Training in Probabilistic FOL (parameter estimation; weight learning)
Input
– First-order formulae, e.g. ∀x S(x) ⇒ T(x)
– Labeled data: constants a, b, c with facts S(a), T(a), S(b), T(b), S(c)
Output
– Weights for each formula, e.g.
∀x S(x) ⇒ T(x) [0.67]
∀x,y Coreferent(x,y) ⇒ Pronoun(x) [-2.3]
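
As a rough illustration of what a weighted formula contributes (a sketch in the style of Markov logic [Richardson & Domingos 2006], cited in this talk; the toy formula, facts, and the 0.67 weight are illustrative):

# A weighted formula contributes weight * (number of true groundings) to a world's score.
constants = ["a", "b", "c"]
facts = {("S", "a"), ("T", "a"), ("S", "b"), ("T", "b"), ("S", "c")}   # the labeled data above

def implication_true(antecedent, consequent, x):
    """One grounding of  forall x: S(x) => T(x)  for a single constant x."""
    return (antecedent, x) not in facts or (consequent, x) in facts

weight = 0.67                                                          # illustrative learned weight
n_true = sum(implication_true("S", "T", x) for x in constants)         # true for a and b, false for c
print(n_true, weight * n_true)                                         # 2 1.34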

Training in Probabilistic FOL: Previous Work
Maximum likelihood
– Requires an intractable normalization constant
Pseudo-likelihood [Richardson, Domingos 2006]
– Ignores uncertainty of relational information
E-M [Kersting, De Raedt 2001; Koller, Pfeffer 1997]
Sampling [Paskin 2002]
Perceptron [Singla, Domingos 2005]
– Can be inefficient when prediction is expensive
Piecewise training [Sutton, McCallum 2005]
– Train "pieces" of the world in isolation
– Performance sensitive to which pieces are chosen

Training in Probabilistic FOL (parameter estimation; weight learning)
Most methods require "unrolling" [grounding]
Unrolling has exponential space complexity
– E.g., for ∀x,y,z S(x,y,z) ⇒ T(x,y,z) with constants [a b c d e f g h], we must examine all triples
Sampling can be inefficient due to the large sample space.
Proposal: Let prediction errors guide sampling
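
A one-liner makes the grounding blow-up concrete (a sketch using the eight constants from the example above):

from itertools import product

constants = list("abcdefgh")
groundings = list(product(constants, repeat=3))   # every (x, y, z) triple must be examined
print(len(groundings))                            # 8**3 = 512, and |C|**arity in general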

Error-driven Training
Input
– Observed data O // input mentions
– True labeling P // true clustering
– Prediction algorithm A // clustering algorithm
– Initial weights W, prediction Q // initial clustering
Iterate until convergence
– Q' ← A(Q, W, O) // merge clusters
– If Q' introduces an error: UpdateWeights(Q, Q', P, O, W)
– Else: Q ← Q'
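
A minimal Python rendering of this loop; introduces_error, propose_better, and update_weights are hypothetical stand-ins for the talk's components, so this is a sketch of the control flow rather than the exact procedure.

def error_driven_train(O, P, A, W, Q, update_weights, introduces_error, propose_better, max_iters=100):
    """O: mentions, P: true clustering, A: one clustering step, W: weights, Q: current clustering."""
    for _ in range(max_iters):
        Q_prime = A(Q, W, O)                     # take one greedy step, e.g. merge two clusters
        if Q_prime is None:                      # the prediction algorithm has converged
            break
        if introduces_error(Q_prime, P):         # the step contradicts the true clustering P
            Q_better = propose_better(Q, P, O)   # a correct modification of Q (the Q'' below)
            W = update_weights(W, Q, Q_prime, Q_better, O)   # rank Q'' above Q'
        else:
            Q = Q_prime                          # accept the step and continue
    return W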

UpdateWeights(Q, Q', P, O, W): Learning to Rank Pairs of Predictions
Using the truth P, generate a new Q'' that is a better modification of Q than Q'.
Update W s.t. Q'' ← A(Q, W, O)
Update parameters so Q'' is ranked higher than Q'

Ranking vs. Classification Training
Instead of training
[Powell, Mr. Powell, he] --> YES
[Powell, Mr. Powell, she] --> NO
...rather...
[Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
In general, the higher-ranked example may contain errors
[Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]

Ranking Parameter Update
In our experiments, we use a large-margin update based on MIRA [Crammer, Singer 2003]:
W_{t+1} = argmin_W ||W_t - W|| s.t. Score(Q'', W) - Score(Q', W) ≥ 1
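
For a single ranking constraint this quadratic program has a standard closed-form solution; the sketch below assumes predictions are represented by feature vectors with Score(Q, W) = W · f(Q), which is an assumption made for illustration.

import numpy as np

def mira_update(W, feats_better, feats_worse, margin=1.0):
    """Smallest change to W that scores the better prediction above the worse one by `margin`."""
    W = np.asarray(W, dtype=float)
    delta = np.asarray(feats_better, dtype=float) - np.asarray(feats_worse, dtype=float)
    norm_sq = float(np.dot(delta, delta))
    if norm_sq == 0.0:
        return W                                   # identical feature vectors: nothing to update
    loss = margin - float(np.dot(W, delta))        # violation of Score(Q'') - Score(Q') >= margin
    tau = max(0.0, loss / norm_sq)                 # closed-form step size for a single constraint
    return W + tau * delta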

Advantages
Never need to unroll the entire network
– Only explore partial solutions the prediction algorithm is likely to produce
Weights tuned for the prediction algorithm
Adaptable to different prediction algorithms
– beam search, simulated annealing, etc.
Adaptable to different loss functions
Related: Incremental Perceptron [Collins, Roark 2004], LaSO [Daume, Marcu 2005]
– Extended here for FOL, ranking, and max-margin loss; rank partial, possibly mistaken predictions.

Disadvantages
Difficult to analyze exactly what global objective function is being optimized
Convergence issues
– Average weight updates

Experiments
ACE 2004 coreference
– 443 newswire documents
Standard feature set [Soon, Ng, Lim 2001; Ng & Cardie 2002]
– Text match, gender, number, context, WordNet
Additional first-order features
– Min/Max/Average/Majority of pairwise features, e.g., average string edit distance, max document distance
– Existential/Universal quantifications of pairwise features, e.g., there exists gender disagreement
Prediction: greedy agglomerative clustering
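
As an illustration of how such first-order features can be computed from pairwise ones, here is a small sketch; edit_distance, gender, and the mention representation are hypothetical helpers, not the exact feature set used in the experiments.

# Sketch: lifting pairwise features to cluster-level (first-order) features.
from itertools import combinations

def cluster_level_features(cluster, edit_distance, gender):
    """Min/Max/Average aggregates and existential/universal quantifications over mention pairs."""
    pairs = list(combinations(cluster, 2))
    dists = [edit_distance(m1["text"], m2["text"]) for m1, m2 in pairs]
    spans = [abs(m1["pos"] - m2["pos"]) for m1, m2 in pairs]
    return {
        "avg_edit_distance": sum(dists) / len(dists) if dists else 0.0,
        "max_edit_distance": max(dists, default=0.0),
        "max_document_distance": max(spans, default=0),
        # Existential quantification: does any pair disagree on gender?
        "exists_gender_disagreement": any(gender(m1) != gender(m2) for m1, m2 in pairs),
        # Universal quantification: is every mention a pronoun?
        "all_pronouns": all(m.get("is_pronoun", False) for m in cluster),
    }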

Experiments: B-Cubed F1 Score on ACE 2004 Noun Coreference
[Table: training method (Sampling + Classification vs. Error-driven + Ranking, i.e. better training) × model (PW-CRF vs. FOL-CRF, i.e. better representation); scores omitted in transcript]
[to our knowledge, best previously reported results ~ 69% (Ng, 2005)]

Conclusions
Combining logical and probabilistic approaches to AI can improve the state of the art in NLP.
Simple approximations can make these approaches practical for real-world problems.

Future Work
Fancier features
– Over entire clusterings
Less greedy inference
– Metropolis-Hastings sampling
Analysis of training
– Which positive/negative examples to select when updating
– Loss function sensitive to local minima of prediction
Analyze theoretical/empirical convergence

Thank you