State-of-the-art NLP Approaches to Coreference Resolution: Theory and Practical Recipes Simone Paolo Ponzetto University of Heidelberg Massimo Poesio University of Trento
Road map The “simple” model from Soon et al. (2001) has two major drawbacks: Decision locality Knowledge bottleneck
Global constraints for coreference Decision locality: coreference decisions are optimized only locally; no dependency between different local coreference decisions is modeled. We would like to enforce transitivity.
Overcoming the knowledge bottleneck Numerous knowledge sources play a role in coreference, e.g. world and common-sense knowledge … but the model relies on a small set of shallow, surface-level features
Twin-candidate model for anaphora resolution (Yang et al., 2008) Learn the preference relationship between competing candidates The antecedent is then the best, i.e. most preferred, candidate among a set of competing candidates
Twin-candidate model for anaphora resolution (Yang et al., 2008) The probability that a candidate is preferred over all the other competing candidates: assuming that the preferences between candidate pairs are independent of each other, P(candidate i is the antecedent) = ∏_{j ≠ i} P(i is preferred over j)
Twin-candidate model for anaphora resolution (Yang et al., 2008) The probability that a candidate is selected as the antecedent can be calculated from the preference classification results between the candidate and its opponents The actual antecedent for an anaphor is the one maximizing this probability
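Under the independence assumption above, the selected antecedent is the candidate maximizing the product of its pairwise preference probabilities. A minimal sketch; the `prefer` callable and the toy preference table are hypothetical illustrations, not from the paper:

```python
def select_antecedent(candidates, prefer):
    """Return the candidate maximizing the product of pairwise preference
    probabilities, per the independence assumption.
    prefer(i, j) is assumed to return P(i is preferred over j)."""
    best, best_score = None, -1.0
    for i in candidates:
        score = 1.0
        for j in candidates:
            if j != i:
                score *= prefer(i, j)  # P(i preferred over j)
        if score > best_score:
            best, best_score = i, score
    return best, best_score

# Hypothetical preference table for illustration only
table = {("Israel", "Iraq"): 0.8, ("Israel", "the US"): 0.7, ("Iraq", "the US"): 0.4}

def prefer(i, j):
    return table.get((i, j), 1.0 - table.get((j, i), 0.5))

print(select_antecedent(["Israel", "Iraq", "the US"], prefer))
```

With these numbers "Israel" wins (score 0.8 * 0.7 ≈ 0.56), since it is preferred over both opponents.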
Single-candidate vs. twin-candidate model Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)> Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>
Single-candidate vs. twin-candidate model Single-candidate class label: COREF, NOT COREF Twin-candidate class label: COMPETITOR_1, COMPETITOR_2
Yang et al. (2008): generating training instances
<Its, Friday, Israel> → 01
<Its, defense minister, Israel> → 01
<Its, non-conventional weapons, Israel> → 01
(the class label 10/01 records whether the first or the second competing candidate is the preferred antecedent)
Yang et al. (2008): classifier generation In the twin-candidate model, each feature “Candi_X” is replaced by “Candi1_X” and “Candi2_X” Classifiers include C5 and MaxEnt
Yang et al. (2008): antecedent identification as tournament elimination Candidates are compared linearly from the beginning of the document to the end. Each candidate in turn is paired with the next candidate and passed to the classifier to determine the preference. The “losing” candidate that is judged less preferred by the classifier is eliminated and never considered again. The “winner” is compared with the next candidate.
Yang et al. (2008): antecedent identification as tournament elimination The process continues until all the preceding candidates have been compared The candidate that wins the last comparison is selected as the antecedent Computational complexity of O(N) for N candidates
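The elimination pass can be sketched in a few lines; `prefer` stands in for the twin-candidate classifier, and the ranking used by the toy classifier is a hypothetical reconstruction of the matches shown on the following slides:

```python
def tournament(anaphor, candidates, prefer):
    """Linear single-elimination: O(N) classifier calls for N candidates.
    prefer(anaphor, c1, c2) returns 1 if c1 is preferred, 2 otherwise."""
    winner = candidates[0]
    for challenger in candidates[1:]:
        if prefer(anaphor, winner, challenger) == 2:
            winner = challenger  # the loser is eliminated for good
    return winner

# Toy classifier: candidates earlier in this list always win (hypothetical)
ranking = ["the Jewish state", "Iraq", "Israel", "the United States", "a military strike"]

def prefer(anaphor, c1, c2):
    return 1 if ranking.index(c1) < ranking.index(c2) else 2

print(tournament("Its", ["Israel", "the United States", "a military strike",
                         "Iraq", "the Jewish state"], prefer))
```

Here "Israel" wins its first two matches, loses to "Iraq", which in turn loses to "the Jewish state" in the final comparison.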
Yang et al. (2008): antecedent identification as tournament elimination <Its, Israel, the United States> => Israel
Yang et al. (2008): antecedent identification as tournament elimination <Its, Israel, a military strike> => Israel
Yang et al. (2008): antecedent identification as tournament elimination <Its, Israel, Iraq> => Iraq
Yang et al. (2008): antecedent identification as tournament elimination <Its, Iraq, the Jewish state> => the Jewish state
Yang et al. (2008): antecedent identification as round robin Compare all antecedent candidates with each other Select the antecedent with the best record of wins Computational complexity of O(N²) for N candidates
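Both round-robin variants can be sketched as follows, under the assumption that the classifier can be read as a probability; `prefer`, returning P(c1 preferred over c2), is a hypothetical stand-in:

```python
from itertools import combinations

def round_robin(anaphor, candidates, prefer, weighted=False):
    """All-pairs tournament: O(N^2) classifier calls for N candidates.
    Unweighted: winner +1, loser -1.
    Weighted: the signed classifier confidence (2p - 1) is used instead."""
    score = {c: 0.0 for c in candidates}
    for c1, c2 in combinations(candidates, 2):
        p = prefer(anaphor, c1, c2)          # P(c1 preferred over c2)
        delta = (2 * p - 1) if weighted else (1.0 if p > 0.5 else -1.0)
        score[c1] += delta
        score[c2] -= delta
    winner = max(candidates, key=lambda c: score[c])
    return winner, score
```

The weighted variant lets a confident win outweigh a marginal loss, which is the motivation for weighting the record of wins.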
Antecedent identification as simple round robin [Table: running round-robin scores for NP1 Israel, NP2 United States, NP3 a military strike, with entries +1, +2, -1, -2]
Antecedent identification as weighted round robin [Table: weighted round-robin scores for NP1 Israel, NP2 United States, NP3 a military strike, with entries +0.55, +0.9, 1.45, -0.45, +0.8, 0.35, -0.1, -0.2, -2]
Yang et al. (2008): Results
Twin-candidate model for coreference resolution (Yang et al., 2008) The model we have seen so far works for pronominal anaphora resolution For each NP it will always look for the best antecedent However, for coreference some NPs are not anaphoric Extend the classification model to include a special class for non-anaphors
Single-candidate vs. twin-candidate model (coreference) Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)> Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>
Single-candidate vs. twin-candidate model (coreference) Single-candidate class label: COREF, NOT COREF Twin-candidate class label: COMPETITOR_1, COMPETITOR_2, NONE
Yang et al. (2008): generating training instances (coreference)
<Israel, the Jewish state, Iraqi attack> → 10
<Israel, the Jewish state, non-conventional weapons> → 10
<Israel, the United States, the Jewish state> → 01
<Lipkin-Shahak, the United States, Iraq> → NONE
<Lipkin-Shahak, the United States, Friday> → NONE
Yang et al. (2008): classifier generation for coreference In the twin-candidate model, each feature “Candi_X” is replaced by “Candi1_X” and “Candi2_X” Classifiers include C5 and MaxEnt
Yang et al. (2008): antecedent identification as tournament elimination (coreference) Same as for pronominal anaphors Modification for non-anaphoric mentions: if an instance is classified as NONE, both competing candidates are discarded; if both candidates in the last match are judged to be in a NONE relation, the mention is left unresolved
Yang et al. (2008): antecedent identification as round robin (coreference) Same as for pronominal anaphors Modification for non-anaphoric mentions: if an instance is classified as NONE, both competing candidates receive a penalty of -1; a mention is considered non-anaphoric and left unresolved if no candidate has a positive final score
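The NONE extension of the round robin can be sketched like this; `classify`, a three-way twin-candidate classifier returning 1, 2, or "NONE", is a hypothetical stand-in:

```python
from itertools import combinations

def round_robin_coref(mention, candidates, classify):
    """Round robin with the NONE extension: a NONE verdict penalizes both
    competitors by 1; a mention with no positively-scored candidate is
    treated as non-anaphoric and left unresolved (returns None)."""
    score = {c: 0 for c in candidates}
    for c1, c2 in combinations(candidates, 2):
        label = classify(mention, c1, c2)
        if label == "NONE":
            score[c1] -= 1      # both competitors take the penalty
            score[c2] -= 1
        elif label == 1:
            score[c1] += 1
            score[c2] -= 1
        else:                   # label == 2
            score[c1] -= 1
            score[c2] += 1
    best = max(candidates, key=lambda c: score[c])
    return best if score[best] > 0 else None
```

A classifier that answers NONE for every pair leaves the mention unresolved, which is exactly the intended behavior for non-anaphors.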
Yang et al. (2008): Results for coreference
Bell-Tree Clustering (Luo et al., 2004) searches for the most probable partition of a set of mentions structures the search space as a Bell tree
Bell-Tree Clustering (Luo et al., 2004) [Figure: Bell tree over three mentions; level 1: [1]; level 2: [12], [1][2]; level 3: [123], [12][3], [13][2], [1][23], [1][2][3]]
Leaves contain all the possible partitions of all of the mentions
It is computationally infeasible to expand all nodes in the Bell tree
The search therefore expands only the most promising nodes
How to determine which nodes are promising?
Bell-Tree Clustering (Luo et al., 2004) The partial entity a mention considers linking with is the in-focus entity The current mention to link is the active mention [Figure: Bell tree with in-focus entities highlighted on the edges and active mentions marked with *]
Bell-Tree Clustering (Luo et al., 2004) The model we are after must estimate P(L = 1 | E_k, m_k, A_k), where E_k is the set of partially-established entities, m_k is the current mention to be linked or not, and A_k tells us which entity is in focus
Bell-Tree Clustering (Luo et al., 2004) Edge probabilities for mention 2: linking to [1] has probability P(L=1|E2={[1]},”2”,A2=[1]); starting the new partition [1][2] has probability P(L=0|E2={[1]},”2”)
Bell-Tree Clustering (Luo et al., 2004) Edge probabilities for mention 3: P(L=1|E3={[1,2]},”3”,A3=[1,2]); P(L=1|E3={[1],[2]},”3”,A3=[1]); P(L=1|E3={[1],[2]},”3”,A3=[2])
Bell-Tree Clustering (Luo et al., 2004) How to compute the probability of an entity-starting mention? Derive it from the linking probabilities: P(L=0|E3={[1,2]},”3”) = ? and P(L=0|E3={[1],[2]},”3”) = ?
Entity starting probability The probability of starting a new entity
Entity-mention model What about P(L = 1 | E_k, m_k, A_k)? ASSUMPTION: entities other than the one in focus have no influence on the linking decision, so P(L = 1 | E_k, m_k, A_k) ≈ P(L = 1 | A_k, m_k)
Mention-pair model What about P(L = 1 | A_k, m_k)? ASSUMPTION: the entity-mention score can be obtained from the maximum mention-pair score: P(L = 1 | A_k, m_k) ≈ max over mentions m in A_k of P(L = 1 | m, m_k)
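Under this assumption, the entity-mention linking probability reduces to a one-liner over a trained pairwise model; `pair_prob` is a hypothetical stand-in for that model:

```python
def link_prob(entity_mentions, active_mention, pair_prob):
    """Mention-pair approximation: P(L=1 | A, m) is taken to be the maximum
    pairwise coreference probability between m and any mention in A."""
    return max(pair_prob(m, active_mention) for m in entity_mentions)

# Illustration with the pairwise scores used in the worked example:
# Pc(1,3) = 0.2 and Pc(2,3) = 0.7, so linking mention 3 to the
# entity [1,2] scores max(0.2, 0.7)
Pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
print(link_prob([1, 2], 3, lambda i, j: Pc[(min(i, j), max(i, j))]))
```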
Classifier training Probabilities for both models are estimated from the training data using a maximum entropy model
Mention-pair model: features Lexical features, distance features, syntax features, a count feature (how many times a mention occurred in the document), and pronoun features
Entity-mention model: features Remove pair-specific features (e.g. PoS pairs) Lexical features test the active mention against all mentions in the in-focus entity Distance features take the minimum distance between the mentions in the in-focus entity and the active mention
Bell-Tree Clustering: example Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7
Bell-Tree Clustering: example Node scores (starting from score 1 for [1]):
[12] = 1 * Pc(1,2) = 0.6
[1][2] = 1 * (1 - Pc(1,2)) = 0.4
[123] = 0.6 * max(Pc(1,3), Pc(2,3)) = 0.6 * 0.7 = 0.42
[12][3] = 1 - 0.6 * max(Pc(1,3), Pc(2,3)) = 1 - 0.42 = 0.58
[13][2] = 0.4 * Pc(1,3) = 0.4 * 0.2 = 0.08
[1][23] = 0.4 * Pc(2,3) = 0.4 * 0.7 = 0.28
[1][2][3] = 0.4 * (1 - (0 * Pc(1,3) + 1 * Pc(2,3))) = 0.4 * (1 - 0.7) = 0.12
Bell-Tree Clustering: example The search expands only the N most probable nodes at each level
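The node scores of the worked example can be replayed in a few lines, following the slides' own arithmetic (including their 1 - 0.42 convention for the entity-starting node [12][3]):

```python
# Pairwise probabilities from the example
Pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
p = lambda i, j: Pc[(min(i, j), max(i, j))]

s_1 = 1.0                                            # [1]
s_12 = s_1 * p(1, 2)                                 # [12]      ≈ 0.6
s_1_2 = s_1 * (1 - p(1, 2))                          # [1][2]    ≈ 0.4
s_123 = s_12 * max(p(1, 3), p(2, 3))                 # [123]     ≈ 0.42
s_12_3 = 1 - s_12 * max(p(1, 3), p(2, 3))            # [12][3]   ≈ 0.58
s_13_2 = s_1_2 * p(1, 3)                             # [13][2]   ≈ 0.08
s_1_23 = s_1_2 * p(2, 3)                             # [1][23]   ≈ 0.28
s_1_2_3 = s_1_2 * (1 - (0 * p(1, 3) + 1 * p(2, 3)))  # [1][2][3] ≈ 0.12
```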
Bell Tree: search algorithm At each level, hypotheses are extended by mention-linking and entity-starting steps, followed by pruning
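The whole search can be sketched compactly, combining the mention-pair link score with beam pruning. This is an illustration, not Luo et al.'s exact algorithm: the entity-starting score here is the simple complement score * (1 - max link score), and `p(i, j)` stands in for a trained pairwise model:

```python
import heapq

def bell_tree_beam(n_mentions, p, beam=2):
    """Beam search over the Bell tree. A hypothesis is a partition
    (tuple of frozensets) with a score; each new mention m either joins
    an existing entity (mention-linking) or opens a new one
    (entity-starting). Only the `beam` best hypotheses survive a level."""
    level = [((frozenset([1]),), 1.0)]
    for m in range(2, n_mentions + 1):
        expanded = []
        for partition, score in level:
            # mention-linking: attach m to each existing entity in turn,
            # scoring the link by the max pairwise probability
            link_scores = [max(p(i, m) for i in e) for e in partition]
            for idx, entity in enumerate(partition):
                new_part = partition[:idx] + (entity | {m},) + partition[idx + 1:]
                expanded.append((new_part, score * link_scores[idx]))
            # entity-starting: m opens a new singleton entity
            expanded.append((partition + (frozenset([m]),),
                             score * (1 - max(link_scores))))
        # pruning: keep only the `beam` most probable hypotheses
        level = heapq.nlargest(beam, expanded, key=lambda h: h[1])
    return level

Pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
best_partition, best_score = bell_tree_beam(
    3, lambda i, j: Pc[(min(i, j), max(i, j))], beam=4)[0]
print(best_partition, best_score)
```

With the example probabilities, the best-scoring partition under this convention merges all three mentions into one entity, with score 0.6 * 0.7 ≈ 0.42.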
Bell Tree: results No statistically significant difference between MP and EM (at p-value 0.05) MP requires 20 times more features than EM Features for EM need more engineering…