Knowledge Gaps for Entity Linking
Heng Ji

2 Outline
o Relation Clustering
o Remaining Challenges for Entity Linking

3 Relation Clustering: The paper from ten years ago (Hasegawa et al., 2004)
o Relation Clustering
o Remaining Challenges for Entity Linking

4 Relation Discovery Overview
Assume that pairs of entities occurring in similar contexts can be clustered, and that each pair in a cluster is an instance of the same relation.
o 1. Tag NEs in text corpora.
o 2. Get co-occurrence pairs of NEs and their contexts.
o 3. Measure context similarities among pairs of NEs.
o 4. Make clusters of pairs of NEs.
o 5. Label each cluster of pairs of NEs.
Example: run the NE tagger and get all context words within a certain distance. If the context words of the A-B pair and the C-D pair are similar, the two pairs are placed into the same cluster (the same relation); in this case the relation is merger and acquisition.

5 Relation Discovery

6 NE tagging: use the extended NE tagger (Sekine, 2001) to detect useful relations.
Collect the intervening words between two NEs for each co-occurrence.
o Two NEs are considered to co-occur if they appear within the same sentence and are separated by at most N intervening words.
o Different orders are considered different contexts. That is, e1 … e2 and e2 … e1 are collected as different contexts.
o Passive voice: collect the base forms of words, as stemmed by a POS tagger, but keep verb past participles distinct from other verb forms.
Less frequent pairs of NEs should be eliminated.
o Set a frequency threshold.
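The collection step above can be sketched roughly as follows. This is a minimal illustration, not the original system: the input format (sentences as lists of `(text, ne_type)` tuples), the function name `collect_contexts`, and the parameters `max_gap` and `min_freq` are all assumptions made for the example, and stemming to base forms is left out.

```python
from collections import defaultdict

def collect_contexts(sentences, max_gap=5, min_freq=2):
    """Collect intervening words for co-occurring named-entity pairs.

    sentences: list of sentences, each a list of (text, ne_type) tuples,
               where ne_type is None for ordinary words (assumed format).
    Returns {(name_a, name_b): [(direction, [words...]), ...]} where
    direction is +1 for the order name_a ... name_b and -1 for the reverse.
    """
    contexts = defaultdict(list)
    for sent in sentences:
        # Positions of all tagged entities in this sentence.
        ents = [i for i, (_, ne_type) in enumerate(sent) if ne_type is not None]
        for x, i in enumerate(ents):
            for j in ents[x + 1:]:
                if j - i - 1 > max_gap:
                    break  # later entities are even farther away
                first, second = sent[i][0], sent[j][0]
                if first == second:
                    continue
                # Keep only ordinary words between the two entities.
                words = [w.lower() for w, t in sent[i + 1:j] if t is None]
                pair = tuple(sorted((first, second)))
                direction = 1 if pair[0] == first else -1
                contexts[pair].append((direction, words))
    # Frequency threshold: drop pairs that co-occur too rarely.
    return {p: c for p, c in contexts.items() if len(c) >= min_freq}
```

Storing a direction flag with each context keeps e1 … e2 and e2 … e1 separate while still counting both toward the pair's frequency.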

7 Relation Discovery
Calculate similarity between the sets of contexts of NE pairs.
o Vector space model and cosine similarity.
o Only compare NE pairs which have the same types, e.g., one PERSON-GPE pair and another PERSON-GPE pair.
o Eliminate stop words, words in parallel expressions, and expressions peculiar to particular source documents.
A context vector for each NE pair consists of the bag of words formed from all intervening words from all co-occurrences of the two NEs.
o Different orders: if a word w_i occurred L times in e1 … e2 and M times in e2 … e1, the tf_i of w_i is defined as L - M.
o If the norm |α| is small due to the lack of context words, the similarity might be unreliable, so define a threshold to eliminate short context vectors.
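A small sketch of the context vector and similarity described above, under stated assumptions: the stop-word list, the function names, and the `min_norm` threshold value are illustrative only, and any term weighting beyond the raw L - M frequency is omitted. The same-type restriction on NE pairs is assumed to be applied by the caller.

```python
import math
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "of", "to", "by", "was"}

def context_vector(contexts, stop_words=STOP_WORDS):
    """Build a sparse vector from (direction, words) contexts.

    direction is +1 for e1 ... e2 and -1 for e2 ... e1, so each word ends
    up with tf = L - M, as on the slide.
    """
    vec = Counter()
    for direction, words in contexts:
        for w in words:
            if w not in stop_words:
                vec[w] += direction
    # Drop words whose two directions cancel out exactly.
    return {w: tf for w, tf in vec.items() if tf != 0}

def norm(vec):
    return math.sqrt(sum(v * v for v in vec.values()))

def cosine(a, b, min_norm=1.0):
    """Cosine similarity, or None when either vector is too short to trust."""
    na, nb = norm(a), norm(b)
    if na < min_norm or nb < min_norm:
        return None
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    return dot / (na * nb)
```

Because tf can be negative, two pairs that share the same words in opposite orders get a negative cosine, which keeps, for example, "A acquired B" apart from "B acquired A".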

8 Relation Discovery
We can cluster the NE pairs based on the similarity among their context vectors.
o We do not know the number of clusters in advance, so we adopt hierarchical clustering.
o Using complete linkage.
Label the cluster with the most frequent word in all combinations of the NE pairs in the same cluster.
o The frequencies are normalized.
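The clustering and labeling steps might look roughly like this. It is a naive sketch rather than the authors' implementation: the stopping rule (a similarity `threshold`), the function names, and the reading of "normalized" as "divided by the number of pair combinations" are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def complete_linkage(items, sim, threshold):
    """Agglomerative clustering with complete linkage.

    The similarity of two clusters is that of their *least* similar members.
    Merging stops once no two clusters reach `threshold` (assumed cutoff).
    """
    clusters = [[x] for x in items]

    def link(c1, c2):
        return min(sim(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

def label_cluster(cluster, words_of):
    """Pick the word shared by the most combinations of NE pairs.

    words_of maps each NE pair to the set of its context words. The count
    is normalized by the number of combinations in the cluster.
    """
    combos = list(combinations(cluster, 2))
    counts = Counter()
    for a, b in combos:
        counts.update(words_of[a] & words_of[b])
    if not counts:
        return None  # singleton cluster or no shared words
    word, freq = counts.most_common(1)[0]
    return word, freq / len(combos)
```

The loop recomputes every linkage on each pass, which is fine for a few hundred NE pairs; a library routine would be the better choice at corpus scale.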

9 Discussions
o How will embeddings play a role here?
o Did/Will we make fundamental changes to this “old” framework?

10 Outline
o Relation Clustering
o Remaining Challenges for Entity Linking (10% Errors for News and 15% Errors for Social Media)

11 Entity Linking
o It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the “N”.
o Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid
o Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

12 Knowledge Gap between Source and KB
Source: breaking news/new information/rumor
KB: bio, summary, snapshot of life
o Source: According to Darwin it is the Males who do the vamping.
o KB: Charles Robert Darwin, was an English naturalist and geologist best known for his contributions to evolutionary theory.
o Source: I had no idea the victim in the Jackson cases was publicized.
o KB: In the summer of 1993, Jackson was accused of child sexual abuse by a 13-year-old boy named Jordan Chandler and his father, Dr. Evan Chandler, a dentist.
o Source: I went to youtube and checked out the Gulf oil crisis: all of the posts are one month old, or older…
o KB: On April 20, 2010, the Deepwater Horizon oil platform, located in the Mississippi Canyon about 40 miles (64 km) off the Louisiana coast, suffered a catastrophic explosion; it sank a day-and-a-half later

13 Fill in the Gap with Background Knowledge
Source: breaking news/new information/rumors
KB: bio, summary, snapshot of life
o Source: Christies denial of marriage privledges to gays will alienate independents and his “I wanted to have the people vote on it” will ring hollow.
o KB: Christie has said that he favoured New Jersey's law allowing same-sex couples to form civil unions, but would veto any bill legalizing same-sex marriage in New Jersey
o Source: Translation out of hype-speak: some kook made threatening noises at Brownback and go arrested
o KB: Samuel Dale "Sam" Brownback (born September 12, 1956) is an American politician, the 46th and current Governor of Kansas.
Connect/Sort Background Knowledge
o Man Accused Of Making Threatening Phone Call To Kansas Gov. Sam Brownback May Face Felony Charge

14 Knowledge Synthesis
The Stockholm Institute stated that 23 of 25 major armed conflicts in the world in 2000 occurred in impoverished nations.
o Stockholm_International_Peace_Research_Institute
o Stockholm_Institute_of_Education

15 Morphs
They passed a bill, and Christie the Hutt decides he's stull sucking up to be RomBot's running mate.
o Chris Christie
o Mitt Romney

16 Commonsense Knowledge
During talks in Geneva attended by William J. Burns Iran refused to respond to Solana’s offers.
o William_J._Burns ( )
o William_Joseph_Burns (1956- )

17 Commonsense Knowledge
The petition demanded the introduction of a parliament elected by all adults - men and women in Saudi Arabia.
o Consultative Assembly of Saudi_Arabia

18 World Knowledge
Millions of Americans went to war for America, and came back broken or otherwise gave up a lot, and now we look to take a huge chunk of their hide because Washington no longer works.
o Federal government of the United States

19 Collective Inference: What We've Done Before (Pan et al., 2015)
Entity mentions involved in AMR conjunction relations should be linked jointly to the KB; their candidates in the KB should also be strongly connected to each other with high semantic relatedness.
o “and”, “or”, “contrast-01”, “either”, “compared to”, “prep along with”, “neither”, “slash”, “between” and “both”
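One way to picture joint linking of conjoined mentions is the toy sketch below. It is not the method of Pan et al. (2015): the scoring (sum of per-mention scores plus weighted pairwise relatedness), the exhaustive search, and the names `link_jointly`, `local_score`, `relatedness`, and `weight` are all assumptions made for illustration.

```python
from itertools import product

def link_jointly(candidates, local_score, relatedness, weight=1.0):
    """Choose one KB entry per mention so the whole group is coherent.

    candidates:  {mention: [kb_entry, ...]} for mentions in one conjunction.
    local_score: {(mention, kb_entry): float}, per-mention evidence (assumed).
    relatedness: function (kb_entry, kb_entry) -> float (assumed measure).
    Exhaustive search; only sensible for the handful of mentions that a
    single conjunction joins.
    """
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for combo in product(*(candidates[m] for m in mentions)):
        # Evidence for each mention on its own.
        score = sum(local_score[(m, e)] for m, e in zip(mentions, combo))
        # Reward candidate sets that are strongly connected to each other.
        for x in range(len(combo)):
            for y in range(x + 1, len(combo)):
                score += weight * relatedness(combo[x], combo[y])
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(mentions, best))
```

With `weight=0` each mention is linked independently; raising it lets a strongly related candidate pair override a slightly better standalone match.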

20 Collective Inference: Beyond Sentence and Beyond Syntax
I think Mitt drops out... Ok, my answer is no one and Obama wins the GE.

