
1 State-of-the-art NLP Approaches to Coreference Resolution: Theory and Practical Recipes
Simone Paolo Ponzetto University of Heidelberg Massimo Poesio University of Trento

2 Road map The “simple” model from Soon et al. (2001) has two major drawbacks: decision locality and a knowledge bottleneck

3 Global constraints for coreference
Decision locality: coreference decisions are only locally optimized; no dependency is assumed between different local coreference decisions  we would like to enforce transitivity

4 Overcoming the knowledge bottleneck
Numerous knowledge sources play a role in coreference, e.g. world and common-sense knowledge … but the model relies on a small set of shallow, surface-level features

5 Twin-candidate model for anaphora resolution (Yang et al., 2008)
Learn the preference relationship between competing candidates. The antecedent is then the best, i.e. most preferred, candidate among the set of competing candidates

6 Twin-candidate model for anaphora resolution (Yang et al., 2008)
The probability that a candidate is preferred over all the other competing candidates can be factorized by assuming that the preferences between candidate pairs are independent of each other:
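The slide's formulas did not survive extraction; the following is a reconstruction in our own notation, where the c_i are the competing candidates, ana is the anaphor, and c_i ≻ c_k means "c_i is preferred over c_k":

```latex
% Reconstructed sketch (notation ours): preference over all competitors,
% factorized under the pairwise-independence assumption, and the resulting
% antecedent choice.
P(c_i \succ \text{all others} \mid ana)
  \;=\; \prod_{k \neq i} P(c_i \succ c_k \mid ana)
\qquad
\hat{c} \;=\; \arg\max_{i} \prod_{k \neq i} P(c_i \succ c_k \mid ana)
```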

7 Twin-candidate model for anaphora resolution (Yang et al., 2008)
The probability that a candidate is selected as the antecedent can be calculated from the preference classification results between the candidate and its opponents. The actual antecedent for an anaphor is the one maximizing this probability

8 Twin-candidate model for anaphora resolution (Yang et al., 2008)
The probability that a candidate is selected the antecedent can be calculated using the preference classification results between the candidate and its opponents The actual antecedent for an anaphor is the one maximizing such probability

9 Single-candidate vs. twin-candidate model
Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)>. Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>

10 Single-candidate vs. twin-candidate model
Single-candidate class labels: COREF, NOT COREF. Twin-candidate class labels: COMPETITOR_1 (“10”), COMPETITOR_2 (“01”)

11 Yang et al. (2008): generating training instances

12 Yang et al. (2008): generating training instances
<Its, Friday, Israel> 01

13 Yang et al. (2008): generating training instances
<Its, defense minister, Israel> 01

14 Yang et al. (2008): generating training instances
<Its, non-conventional weapons, Israel> 01

15 Yang et al. (2008): classifier generation
In the twin-candidate model, replace each feature “Candi_X” with “Candi1_X” and “Candi2_X”. Classifiers include C5 and MaxEnt

16 Yang et al. (2008): antecedent identification as tournament elimination
Candidates are compared linearly from the beginning of the document to the end. Each candidate in turn is paired with the next candidate and passed to the classifier to determine the preference. The “losing” candidate that is judged less preferred by the classifier is eliminated and never considered again. The “winner” is compared with the next candidate.

17 Yang et al. (2008): antecedent identification as tournament elimination
The process continues until all the preceding candidates have been compared. The candidate that wins the last comparison is selected as the antecedent. Computational complexity of O(N) for N candidates
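The elimination procedure above can be sketched as follows; `prefer` stands in for the trained twin-candidate classifier, and the hard-coded outcomes are illustrative, taken from the slide example:

```python
# Sketch of tournament-style antecedent identification.
# `prefer(anaphor, c1, c2)` stands in for the trained twin-candidate
# classifier and returns the preferred candidate; it is hypothetical here.

def tournament(anaphor, candidates, prefer):
    """Linear elimination: O(N) classifier calls for N candidates."""
    if not candidates:
        return None
    winner = candidates[0]
    for challenger in candidates[1:]:
        # The losing candidate is eliminated and never considered again;
        # the winner meets the next candidate.
        winner = prefer(anaphor, winner, challenger)
    return winner

# Toy classifier: hard-coded pairwise outcomes from the slide example.
outcomes = {
    ("Israel", "the United States"): "Israel",
    ("Israel", "a military strike"): "Israel",
    ("Israel", "Iraq"): "Iraq",
    ("Iraq", "the Jewish state"): "the Jewish state",
}

def prefer(anaphor, c1, c2):
    return outcomes.get((c1, c2), outcomes.get((c2, c1), c1))

antecedent = tournament(
    "Its",
    ["Israel", "the United States", "a military strike", "Iraq", "the Jewish state"],
    prefer,
)
```

Because each losing candidate is discarded immediately, only N − 1 classifier calls are needed, matching the O(N) complexity stated above.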

18 Yang et al. (2008): antecedent identification as tournament elimination
<Its, Israel, the United States> => Israel

19 Yang et al. (2008): antecedent identification as tournament elimination
<Its, Israel, a military strike> => Israel

20 Yang et al. (2008): antecedent identification as tournament elimination
<Its, Israel, Iraq> => Iraq

21 Yang et al. (2008): antecedent identification as tournament elimination
<Its, Iraq, the Jewish state> => the Jewish state

22 Yang et al. (2008): antecedent identification as round robin
Compare all antecedent candidates with each other. Select the antecedent with the best record of wins. Computational complexity of O(N²) for N candidates
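The round-robin scheme can be sketched as follows; again `prefer` is a stand-in for the twin-candidate classifier, and the toy ranking used to drive it is illustrative:

```python
# Sketch of round-robin antecedent identification: every pair of candidates
# is compared; each win scores +1, each loss -1, and the candidate with the
# best record is chosen. O(N^2) classifier calls for N candidates.
from itertools import combinations

def round_robin(anaphor, candidates, prefer):
    scores = {c: 0 for c in candidates}
    for c1, c2 in combinations(candidates, 2):
        winner = prefer(anaphor, c1, c2)
        scores[winner] += 1
        scores[c2 if winner == c1 else c1] -= 1
    return max(candidates, key=scores.get)

# Toy stand-in for the classifier (hypothetical): prefer the candidate
# with the higher hand-assigned rank.
rank = {"a military strike": 0, "the United States": 1, "Israel": 2}

def prefer(anaphor, c1, c2):
    return c1 if rank[c1] > rank[c2] else c2

best = round_robin("Its", ["Israel", "the United States", "a military strike"], prefer)
```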

29 Antecedent identification as simple round robin
Candidates: NP1 Israel, NP2 United States, NP3 a military strike. Each win counts +1, each loss −1: Israel wins both matches (score +1 +1 = +2), a military strike loses both (score −1 −1 = −2)

30 Antecedent identification as weighted round robin
Candidates: NP1 Israel, NP2 United States, NP3 a military strike. Each win is credited with the classifier confidence p, each loss debited 1 − p: Israel +0.55 +0.9 = 1.45, United States −0.45 +0.8 = 0.35, a military strike −0.1 −0.2 = −0.3
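The weighted variant can be sketched like this; the scoring scheme (winner credited with the classifier's confidence p, loser debited 1 − p) is inferred from the slide's score table, and the pairwise confidences below reproduce that example:

```python
# Sketch of weighted round robin. `prefer_proba` stands in for the
# twin-candidate classifier and returns P(c1 preferred over c2).
from itertools import combinations

def weighted_round_robin(anaphor, candidates, prefer_proba):
    scores = {c: 0.0 for c in candidates}
    for c1, c2 in combinations(candidates, 2):
        p = prefer_proba(anaphor, c1, c2)
        # Winner credited with the confidence, loser debited its complement
        # (scoring scheme inferred from the slide example).
        if p >= 0.5:
            scores[c1] += p
            scores[c2] -= 1.0 - p
        else:
            scores[c2] += 1.0 - p
            scores[c1] -= p
    return scores

# Pairwise confidences from the slide example.
P = {("Israel", "the United States"): 0.55,
     ("Israel", "a military strike"): 0.9,
     ("the United States", "a military strike"): 0.8}

def prefer_proba(anaphor, c1, c2):
    return P.get((c1, c2), 1.0 - P.get((c2, c1), 0.5))

scores = weighted_round_robin(
    "Its", ["Israel", "the United States", "a military strike"], prefer_proba)
```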

31 Yang et al. (2008): Results

32 Twin-candidate model for coreference resolution (Yang et al., 2008)
The model we have seen so far works for pronominal anaphora resolution: for each NP it will always look for a best antecedent. However, for coreference some NPs are not anaphoric. Extend the classification model to include a special class for non-anaphors

33 Single-candidate vs. twin-candidate model (coreference)
Single-candidate instance: <ANAPHOR (j), ANTECEDENT (i)>. Twin-candidate instance: <ANAPHOR (j), COMPETITOR_1 (i), COMPETITOR_2 (k)>

34 Single-candidate vs. twin-candidate model (coreference)
Single-candidate class labels: COREF, NOT COREF. Twin-candidate class labels: COMPETITOR_1 (“10”), COMPETITOR_2 (“01”), NONE

35 Yang et al. (2008): generating training instances (coreference)
<Israel, the Jewish state, Iraqi attack> 10

36 Yang et al. (2008): generating training instances (coreference)
<Israel, the Jewish state, non-conventional weapons> 10

37 Yang et al. (2008): generating training instances (coreference)
<Israel, the United States, the Jewish state> 01

38 Yang et al. (2008): generating training instances (coreference)

39 Yang et al. (2008): generating training instances (coreference)
<Lipkin-Shahak, the United States, Iraq> NONE

40 Yang et al. (2008): generating training instances (coreference)
<Lipkin-Shahak, the United States, Friday> NONE

41 Yang et al. (2008): classifier generation for coreference
In the twin-candidate model, replace each feature “Candi_X” with “Candi1_X” and “Candi2_X”. Classifiers include C5 and MaxEnt

42 Yang et al. (2008): antecedent identification as tournament elimination (coreference)
Same as for pronominal anaphors. Modification for non-anaphoric mentions: if an instance is classified as NONE, both competing candidates are discarded. If both of the candidates in the last match are judged to be in a NONE relation, the mention is left unresolved

43 Yang et al. (2008): antecedent identification as round robin (coreference)
Same as for pronominal anaphors. Modification for non-anaphoric mentions: if an instance is classified as NONE, both competing candidates receive a penalty of −1. A mention is considered non-anaphoric and left unresolved if no candidate has a positive final score

44 Yang et al. (2008): Results for coreference

45 Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions; structures the search space as a Bell tree

46 Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions; structures the search space as a Bell tree. Root node: [1]

47 Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions; structures the search space as a Bell tree. Tree so far: [1]  [12], [1][2]

48 Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions; structures the search space as a Bell tree. Tree so far: [1]  [12], [1][2]; [12]  [123], [12][3]

49 Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions; structures the search space as a Bell tree. Full tree: [1]  [12], [1][2]; [12]  [123], [12][3]; [1][2]  [13][2], [1][23], [1][2][3]


51 Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions; structures the search space as a Bell tree. Leaves contain all the possible partitions of all of the mentions

52 Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions; structures the search space as a Bell tree. It is computationally infeasible to expand all nodes in the tree

53 Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions; structures the search space as a Bell tree. Expands only the most promising nodes

54 Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions; structures the search space as a Bell tree. Expands only the most promising nodes  how to determine which nodes are promising?

55 Bell-Tree Clustering (Luo et al., 2004)
The partial entity a mention considers linking with is the in-focus entity. The current mention to link is the active mention

56 Bell-Tree Clustering (Luo et al., 2004)
The partial entity a mention considers linking with is the in-focus entity. The current mention to link is the active mention. In the tree, in-focus entities are highlighted on the edges and active mentions are marked with * (e.g. [1] 2*, [12] 3*, [1][2] 3*)

57 Bell-Tree Clustering (Luo et al., 2004)
The model we are after must estimate P(L | Ek, mk, Ak): Ek is the set of partially-established entities, mk is the current mention to be linked or not, and Ak tells us which entity is in-focus

58 Bell-Tree Clustering (Luo et al., 2004)
Linking the active mention to the in-focus entity, e.g. [1] 2*  [12]: P(L=1|E2={[1]},”2”,A2=[1])

59 Bell-Tree Clustering (Luo et al., 2004)
Linking: [1] 2*  [12]: P(L=1|E2={[1]},”2”,A2=[1]). Starting a new entity: [1] 2*  [1][2]: P(L=0|E2={[1]},”2”)

60 Bell-Tree Clustering (Luo et al., 2004)
Linking mention 3: [12] 3*  [123]: P(L=1|E3={[1,2]},”3”,A3=[1,2]); [1][2] 3*  [13][2]: P(L=1|E3={[1],[2]},”3”,A3=[1]); [1][2] 3*  [1][23]: P(L=1|E3={[1],[2]},”3”,A3=[2])

61 Bell-Tree Clustering (Luo et al., 2004)
How to compute the probability of an entity-starting mention? Derive it from the linking probabilities: P(L=0|E3={[1,2]},”3”) = ? P(L=0|E3={[1],[2]},”3”) = ?

62 Entity starting probability
The probability of starting a new entity is derived from the linking probabilities: P(L=0|Ek, mk) = 1 − Σi αi · P(L=1|Ek, mk, Ak=ei), where the weights αi pick out the most probable in-focus entity — in practice, 1 minus the maximum linking probability

63 Entity-mention model What about P(L=1|Ek, mk, Ak)?
ASSUMPTION: entities other than the one in focus have no influence on the linking decision: P(L=1|Ek, mk, Ak=e) ≈ P(L=1|e, mk)

64 Mention-pair model What about P(L=1|e, mk)?
ASSUMPTION: the entity-mention score can be obtained from the maximum mention-pair score: P(L=1|e, mk) ≈ max over mentions m in e of Pc(m, mk)

65 Classifier training Probabilities for both models are estimated from the training data using a maximum entropy model

66 Mention-pair model: features
Lexical features

67 Mention-pair model: features
Distance features Syntax features

68 Mention-pair model: features
Count feature: how many times a mention occurred in the document. Pronoun features

69 Entity-mention model: features
Remove pair-specific features (e.g. PoS pairs). Lexical features test the active mention against all mentions in the in-focus entity. Distance features take the minimum distance between mentions in the in-focus entity and the active mention

70 Bell-Tree Clustering: example
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier. The classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7

71 Bell-Tree Clustering: example
Start: [1], score 1

72 Bell-Tree Clustering: example
Expand: [1]  [12], [1][2]

73 Bell-Tree Clustering: example
Score([12]) = 1 × Pc(1,2) = 1 × 0.6 = 0.6

74 Bell-Tree Clustering: example
Score([1][2]) = 1 × (1 − Pc(1,2)) = 1 × (1 − 0.6) = 0.4

75 Bell-Tree Clustering: example
Scores so far: [12] = 0.6, [1][2] = 0.4

76 Bell-Tree Clustering: example
Expand: [12]  [123], [12][3]; [1][2]  [13][2], [1][23], [1][2][3]

77 Bell-Tree Clustering: example
Score([123]) = 0.6 × max(Pc(1,3), Pc(2,3)) = 0.6 × max(0.2, 0.7) = 0.42

78 Bell-Tree Clustering: example
Score([12][3]): … × max(Pc(1,3), Pc(2,3)) = … × max(0.2, 0.7) = … × 0.7 = 0.58

79 Bell-Tree Clustering: example
Score([13][2]) = 0.4 × Pc(1,3) = 0.4 × 0.2 = 0.08

80 Bell-Tree Clustering: example
Score([1][23]) = 0.4 × Pc(2,3) = 0.4 × 0.7 = 0.28

81 Bell-Tree Clustering: example
Score([1][2][3]) = 0.4 × (1 − (0 × Pc(1,3) + 1 × Pc(2,3))) = 0.4 × (1 − 0.7) = 0.12

82 Bell-Tree Clustering: example
All node scores: [1] = 1; [12] = 0.6, [1][2] = 0.4; [123] = 0.42, [12][3] = 0.58, [13][2] = 0.08, [1][23] = 0.28, [1][2][3] = 0.12

83 Bell-Tree Clustering: example
Expands only the N most probable nodes at each level
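The beam-pruned expansion can be sketched as follows, using the example's pairwise probabilities and the mention-pair (max) scoring. Function names and the beam width are ours, and the entity-starting score follows the 1 − (best linking probability) formula from the worked example:

```python
# Sketch of beam-pruned Bell-tree search over mention partitions.
# Pairwise probabilities from the example; mentions are numbered 1..3.
Pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}

def pair(m1, m2):
    return Pc[(min(m1, m2), max(m1, m2))]

def entity_score(entity, mention):
    # Mention-pair model: best pairwise probability against the entity.
    return max(pair(m, mention) for m in entity)

def expand(nodes, mention, beam=2):
    """One Bell-tree level: link `mention` into each entity of each node,
    or start a new entity; keep only the `beam` best-scoring children."""
    children = []
    for score, part in nodes:
        link = [entity_score(e, mention) for e in part]
        for i, p in enumerate(link):
            child = [e | {mention} if j == i else e for j, e in enumerate(part)]
            children.append((score * p, child))
        # Entity-starting child: 1 minus the best linking probability.
        children.append((score * (1 - max(link)), part + [{mention}]))
    children.sort(key=lambda node: -node[0])
    return children[:beam]

level = [(1.0, [{1}])]    # root: mention 1 starts entity [1]
level = expand(level, 2)  # [12] scores 0.6, [1][2] scores 0.4
level = expand(level, 3)  # the beam keeps the two best partitions
```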

84 Bell Tree: search algorithm

85 Bell Tree: search algorithm
Two kinds of steps: mention-linking and entity-starting

86 Bell Tree: search algorithm
Pruning

87 Bell Tree: results No statistically significant difference between MP and EM (at p-value 0.05). MP requires 20 times more features than EM. Features for EM need more engineering…

