State-of-the-art NLP Approaches to Coreference Resolution: Theory and Practical Recipes Simone Paolo Ponzetto University of Heidelberg Massimo Poesio University of Trento
Road map The “simple” model from Soon et al. (2001) has two major drawbacks: Decision locality Knowledge bottleneck
Global constraints for coreference Decision locality: coreference decisions are optimized only locally; no dependency between different local coreference decisions is modeled. We would like to enforce transitivity.
Overcoming the knowledge bottleneck Numerous knowledge sources play a role in coreference, e.g. world and common-sense knowledge … but the model relies on a small set of shallow, surface-level features
Twin-candidate model for anaphora resolution (Yang et al., 2008) Learn the preference relationship between competing candidates The antecedent is then the best, i.e. most preferred, candidate among a set of competing candidates
Twin-candidate model for anaphora resolution (Yang et al., 2008) The probability that a candidate is preferred over all the other competing candidates: assuming that the preferences between candidate pairs are independent of each other, P(candidate i is the antecedent) = ∏_{j ≠ i} P(i is preferred over j)
Twin-candidate model for anaphora resolution (Yang et al., 2008) The probability that a candidate is selected as the antecedent can be calculated from the preference classification results between the candidate and its opponents The actual antecedent for an anaphor is the one maximizing this probability
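Under the independence assumption above, the selected antecedent is the candidate maximizing the product of its pairwise preference probabilities. A minimal sketch; the `prefer` callable and the toy preference table are hypothetical illustrations, not from the paper:

```python
def select_antecedent(candidates, prefer):
    """Return the candidate maximizing the product of pairwise preference
    probabilities, per the independence assumption.
    prefer(i, j) is assumed to return P(i is preferred over j)."""
    best, best_score = None, -1.0
    for i in candidates:
        score = 1.0
        for j in candidates:
            if j != i:
                score *= prefer(i, j)  # P(i preferred over j)
        if score > best_score:
            best, best_score = i, score
    return best, best_score

# Hypothetical preference table for illustration only
table = {("Israel", "Iraq"): 0.8, ("Israel", "the US"): 0.7, ("Iraq", "the US"): 0.4}

def prefer(i, j):
    return table.get((i, j), 1.0 - table.get((j, i), 0.5))

print(select_antecedent(["Israel", "Iraq", "the US"], prefer))
```

With these numbers "Israel" wins (score 0.8 * 0.7 ≈ 0.56), since it is preferred over both opponents.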
Single-candidate vs. twin-candidate model Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)> Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>
Single-candidate vs. twin-candidate model Single-candidate class label: COREF, NOT COREF Twin-candidate class label: COMPETITOR_1, COMPETITOR_2
Yang et al. (2008): generating training instances
<Its, Friday, Israel> → 01
<Its, defense minister, Israel> → 01
<Its, non-conventional weapons, Israel> → 01
(the class label 10/01 records whether the first or the second competing candidate is the preferred antecedent)
Yang et al. (2008): classifier generation In the twin-candidate model, each feature “Candi_X” is replaced by “Candi1_X” and “Candi2_X” Classifiers include C5 and MaxEnt
Yang et al. (2008): antecedent identification as tournament elimination Candidates are compared linearly from the beginning of the document to the end. Each candidate in turn is paired with the next candidate and passed to the classifier to determine the preference. The “losing” candidate that is judged less preferred by the classifier is eliminated and never considered again. The “winner” is compared with the next candidate.
Yang et al. (2008): antecedent identification as tournament elimination The process continues until all the preceding candidates have been compared The candidate that wins the last comparison is selected as the antecedent Computational complexity of O(N) for N candidates
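The elimination pass can be sketched in a few lines; `prefer` stands in for the twin-candidate classifier, and the ranking used by the toy classifier is a hypothetical reconstruction of the matches shown on the following slides:

```python
def tournament(anaphor, candidates, prefer):
    """Linear single-elimination: O(N) classifier calls for N candidates.
    prefer(anaphor, c1, c2) returns 1 if c1 is preferred, 2 otherwise."""
    winner = candidates[0]
    for challenger in candidates[1:]:
        if prefer(anaphor, winner, challenger) == 2:
            winner = challenger  # the loser is eliminated for good
    return winner

# Toy classifier: candidates earlier in this list always win (hypothetical)
ranking = ["the Jewish state", "Iraq", "Israel", "the United States", "a military strike"]

def prefer(anaphor, c1, c2):
    return 1 if ranking.index(c1) < ranking.index(c2) else 2

print(tournament("Its", ["Israel", "the United States", "a military strike",
                         "Iraq", "the Jewish state"], prefer))
```

Here "Israel" wins its first two matches, loses to "Iraq", which in turn loses to "the Jewish state" in the final comparison.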
Yang et al. (2008): antecedent identification as tournament elimination <Its, Israel, the United States> => Israel
Yang et al. (2008): antecedent identification as tournament elimination <Its, Israel, a military strike> => Israel
Yang et al. (2008): antecedent identification as tournament elimination <Its, Israel, Iraq> => Iraq
Yang et al. (2008): antecedent identification as tournament elimination <Its, Iraq, the Jewish state> => the Jewish state
Yang et al. (2008): antecedent identification as round robin Compare all antecedent candidates with each other Select the antecedent with the best record of wins Computational complexity of O(N²) for N candidates
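Both round-robin variants can be sketched as follows, under the assumption that the classifier can be read as a probability; `prefer`, returning P(c1 preferred over c2), is a hypothetical stand-in:

```python
from itertools import combinations

def round_robin(anaphor, candidates, prefer, weighted=False):
    """All-pairs tournament: O(N^2) classifier calls for N candidates.
    Unweighted: winner +1, loser -1.
    Weighted: the signed classifier confidence (2p - 1) is used instead."""
    score = {c: 0.0 for c in candidates}
    for c1, c2 in combinations(candidates, 2):
        p = prefer(anaphor, c1, c2)          # P(c1 preferred over c2)
        delta = (2 * p - 1) if weighted else (1.0 if p > 0.5 else -1.0)
        score[c1] += delta
        score[c2] -= delta
    winner = max(candidates, key=lambda c: score[c])
    return winner, score
```

The weighted variant lets a confident win outweigh a marginal loss, which is the motivation for weighting the record of wins.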
Antecedent identification as simple round robin [Table: running round-robin scores for NP1 Israel, NP2 United States, NP3 a military strike, with entries +1, +2, -1, -2]
Antecedent identification as weighted round robin [Table: weighted round-robin scores for NP1 Israel, NP2 United States, NP3 a military strike, with entries +0.55, +0.9, 1.45, -0.45, +0.8, 0.35, -0.1, -0.2, -2]
Yang et al. (2008): Results
Twin-candidate model for coreference resolution (Yang et al., 2008) The model we have seen so far works for pronominal anaphora resolution For each NP it will always look for the best antecedent However, for coreference some NPs are not anaphoric Extend the classification model to include a special class for non-anaphors
Single-candidate vs. twin-candidate model (coreference) Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)> Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>
Single-candidate vs. twin-candidate model (coreference) Single-candidate class label: COREF, NOT COREF Twin-candidate class label: COMPETITOR_1, COMPETITOR_2, NONE
Yang et al. (2008): generating training instances (coreference)
<Israel, the Jewish state, Iraqi attack> → 10
<Israel, the Jewish state, non-conventional weapons> → 10
<Israel, the United States, the Jewish state> → 01
<Lipkin-Shahak, the United States, Iraq> → NONE
<Lipkin-Shahak, the United States, Friday> → NONE
Yang et al. (2008): classifier generation for coreference In the twin-candidate model, each feature “Candi_X” is replaced by “Candi1_X” and “Candi2_X” Classifiers include C5 and MaxEnt
Yang et al. (2008): antecedent identification as tournament elimination (coreference) Same as for pronominal anaphors Modification for non-anaphoric mentions: if an instance is classified as NONE, both competing candidates are discarded; if both candidates in the last match are judged to be in a NONE relation, the mention is left unresolved
Yang et al. (2008): antecedent identification as round robin (coreference) Same as for pronominal anaphors Modification for non-anaphoric mentions: if an instance is classified as NONE, both competing candidates receive a penalty of -1; a mention is considered non-anaphoric and left unresolved if no candidate has a positive final score
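The NONE extension of the round robin can be sketched like this; `classify`, a three-way twin-candidate classifier returning 1, 2, or "NONE", is a hypothetical stand-in:

```python
from itertools import combinations

def round_robin_coref(mention, candidates, classify):
    """Round robin with the NONE extension: a NONE verdict penalizes both
    competitors by 1; a mention with no positively-scored candidate is
    treated as non-anaphoric and left unresolved (returns None)."""
    score = {c: 0 for c in candidates}
    for c1, c2 in combinations(candidates, 2):
        label = classify(mention, c1, c2)
        if label == "NONE":
            score[c1] -= 1      # both competitors take the penalty
            score[c2] -= 1
        elif label == 1:
            score[c1] += 1
            score[c2] -= 1
        else:                   # label == 2
            score[c1] -= 1
            score[c2] += 1
    best = max(candidates, key=lambda c: score[c])
    return best if score[best] > 0 else None
```

A classifier that answers NONE for every pair leaves the mention unresolved, which is exactly the intended behavior for non-anaphors.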
Yang et al. (2008): Results for coreference
Bell-Tree Clustering (Luo et al., 2004) searches for the most probable partition of a set of mentions structures the search space as a Bell tree
Bell-Tree Clustering (Luo et al., 2004) [Figure: Bell tree over three mentions; level 1: [1]; level 2: [12], [1][2]; level 3: [123], [12][3], [13][2], [1][23], [1][2][3]]
Leaves contain all the possible partitions of all of the mentions
It is computationally infeasible to expand all nodes in the Bell tree
The search therefore expands only the most promising nodes
How to determine which nodes are promising?
Bell-Tree Clustering (Luo et al., 2004) The partial entity a mention considers linking with is the in-focus entity The current mention to link is the active mention [Figure: Bell tree with in-focus entities highlighted on the edges and active mentions marked with *]
Bell-Tree Clustering (Luo et al., 2004) The model we are after must estimate P(L = 1 | E_k, m_k, A_k), where E_k is the set of partially-established entities, m_k is the current mention to be linked or not, and A_k tells us which entity is in focus
Bell-Tree Clustering (Luo et al., 2004) Edge probabilities for mention 2: linking to [1] has probability P(L=1|E2={[1]},”2”,A2=[1]); starting the new partition [1][2] has probability P(L=0|E2={[1]},”2”)
Bell-Tree Clustering (Luo et al., 2004) Edge probabilities for mention 3: P(L=1|E3={[1,2]},”3”,A3=[1,2]); P(L=1|E3={[1],[2]},”3”,A3=[1]); P(L=1|E3={[1],[2]},”3”,A3=[2])
Bell-Tree Clustering (Luo et al., 2004) How to compute the probability of an entity-starting mention? Derive it from the linking probabilities: P(L=0|E3={[1,2]},”3”) = ? and P(L=0|E3={[1],[2]},”3”) = ?
Entity starting probability The probability of starting a new entity
Entity-mention model What about P(L = 1 | E_k, m_k, A_k)? ASSUMPTION: entities other than the one in focus have no influence on the linking decision, so P(L = 1 | E_k, m_k, A_k) ≈ P(L = 1 | A_k, m_k)
Mention-pair model What about P(L = 1 | A_k, m_k)? ASSUMPTION: the entity-mention score can be obtained from the maximum mention-pair score: P(L = 1 | A_k, m_k) ≈ max over mentions m in A_k of P(L = 1 | m, m_k)
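Under this assumption, the entity-mention linking probability reduces to a one-liner over a trained pairwise model; `pair_prob` is a hypothetical stand-in for that model:

```python
def link_prob(entity_mentions, active_mention, pair_prob):
    """Mention-pair approximation: P(L=1 | A, m) is taken to be the maximum
    pairwise coreference probability between m and any mention in A."""
    return max(pair_prob(m, active_mention) for m in entity_mentions)

# Illustration with the pairwise scores used in the worked example:
# Pc(1,3) = 0.2 and Pc(2,3) = 0.7, so linking mention 3 to the
# entity [1,2] scores max(0.2, 0.7)
Pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
print(link_prob([1, 2], 3, lambda i, j: Pc[(min(i, j), max(i, j))]))
```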
Classifier training Probabilities for both models are estimated from the training data using a maximum entropy model
Mention-pair model: features Lexical features, distance features, syntax features, a count feature (how many times a mention occurred in the document), and pronoun features
Entity-mention model: features Remove pair-specific features (e.g. PoS pairs) Lexical features test the active mention against all mentions in the in-focus entity Distance features take the minimum distance between the mentions in the in-focus entity and the active mention
Bell-Tree Clustering: example Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7
Bell-Tree Clustering: example Node scores (starting from score 1 for [1]):
[12] = 1 * Pc(1,2) = 0.6
[1][2] = 1 * (1 - Pc(1,2)) = 0.4
[123] = 0.6 * max(Pc(1,3), Pc(2,3)) = 0.6 * 0.7 = 0.42
[12][3] = 1 - 0.6 * max(Pc(1,3), Pc(2,3)) = 1 - 0.42 = 0.58
[13][2] = 0.4 * Pc(1,3) = 0.4 * 0.2 = 0.08
[1][23] = 0.4 * Pc(2,3) = 0.4 * 0.7 = 0.28
[1][2][3] = 0.4 * (1 - (0 * Pc(1,3) + 1 * Pc(2,3))) = 0.4 * (1 - 0.7) = 0.12
Bell-Tree Clustering: example The search expands only the N most probable nodes at each level
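The node scores of the worked example can be replayed in a few lines, following the slides' own arithmetic (including their 1 - 0.42 convention for the entity-starting node [12][3]):

```python
# Pairwise probabilities from the example
Pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
p = lambda i, j: Pc[(min(i, j), max(i, j))]

s_1 = 1.0                                            # [1]
s_12 = s_1 * p(1, 2)                                 # [12]      ≈ 0.6
s_1_2 = s_1 * (1 - p(1, 2))                          # [1][2]    ≈ 0.4
s_123 = s_12 * max(p(1, 3), p(2, 3))                 # [123]     ≈ 0.42
s_12_3 = 1 - s_12 * max(p(1, 3), p(2, 3))            # [12][3]   ≈ 0.58
s_13_2 = s_1_2 * p(1, 3)                             # [13][2]   ≈ 0.08
s_1_23 = s_1_2 * p(2, 3)                             # [1][23]   ≈ 0.28
s_1_2_3 = s_1_2 * (1 - (0 * p(1, 3) + 1 * p(2, 3)))  # [1][2][3] ≈ 0.12
```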
Bell Tree: search algorithm At each level, hypotheses are extended by mention-linking and entity-starting steps, followed by pruning
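The whole search can be sketched compactly, combining the mention-pair link score with beam pruning. This is an illustration, not Luo et al.'s exact algorithm: the entity-starting score here is the simple complement score * (1 - max link score), and `p(i, j)` stands in for a trained pairwise model:

```python
import heapq

def bell_tree_beam(n_mentions, p, beam=2):
    """Beam search over the Bell tree. A hypothesis is a partition
    (tuple of frozensets) with a score; each new mention m either joins
    an existing entity (mention-linking) or opens a new one
    (entity-starting). Only the `beam` best hypotheses survive a level."""
    level = [((frozenset([1]),), 1.0)]
    for m in range(2, n_mentions + 1):
        expanded = []
        for partition, score in level:
            # mention-linking: attach m to each existing entity in turn,
            # scoring the link by the max pairwise probability
            link_scores = [max(p(i, m) for i in e) for e in partition]
            for idx, entity in enumerate(partition):
                new_part = partition[:idx] + (entity | {m},) + partition[idx + 1:]
                expanded.append((new_part, score * link_scores[idx]))
            # entity-starting: m opens a new singleton entity
            expanded.append((partition + (frozenset([m]),),
                             score * (1 - max(link_scores))))
        # pruning: keep only the `beam` most probable hypotheses
        level = heapq.nlargest(beam, expanded, key=lambda h: h[1])
    return level

Pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
best_partition, best_score = bell_tree_beam(
    3, lambda i, j: Pc[(min(i, j), max(i, j))], beam=4)[0]
print(best_partition, best_score)
```

With the example probabilities, the best-scoring partition under this convention merges all three mentions into one entity, with score 0.6 * 0.7 ≈ 0.42.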
Bell Tree: results No statistically significant difference between MP and EM (at p-value 0.05) MP requires 20 times more features than EM Features for EM need more engineering…