
1 To Link or Not to Link? A Study on End-to-End Tweet Entity Linking Stephen Guo, Ming-Wei Chang, Emre Kıcıman

2 Motivation
 Microblogs are data gold mines! Twitter reports that it alone captures over 340M short messages per day
 Many applications of tweet information extraction: election results (Tumasjan et al., 2010), disease spreading (Paul and Dredze, 2011), tracking product feedback and sentiment (Asur and Huberman, 2010), ...
 Existing tools (for example, NER) are often too limited: Stanford NER achieves only 44% F1 on tweets [Ritter et al., 2011]

3 Entity Linking (Wikifier) in Tweets
Example tweet: "Oh Yes!! giants vs packers game now!! Touchdown!!"
Q1: Which phrases should be linked? (mention detection)
Q2: Which Wikipedia page should each selected phrase link to? (disambiguation)

4 Contributions
 Proposed a new evaluation scheme for entity linking, a natural evaluation scheme for microblogs
 A system that performs significantly better on tweets than other systems: it learns to detect mentions and perform linking jointly, and outperforms TagMe [Ferragina & Scaiella, 2010] and [Cucerzan, 2007] by 15% F1
 What we have learned: mention detection is a difficult problem, and entity information can help mention detection

5 Outline
 Task Definition (again!)
 Two-stage versus Joint
 Model + Features
 Results + Analysis

6 What should be linked?
Example tweet: "Oh Yes!! giants vs packers game now!! Touchdown!!"
Comparing different Wikifiers is a tough problem [Cornolti et al., WWW 2013]: there is really no good definition of what should be linked

7 Our Scenario
What are people saying about the movie “The Town” on Twitter?
 Assume our customers are only interested in entities of certain types: movies, video games, sports teams, ...
 Type information can be inferred directly from the corresponding Wikipedia page
 Now it is fair to compare different systems; we assume the types PER, LOC, ORG, BOOK, TVSHOW, MOVIE

8 The Desired Results
Example tweet: "Oh Yes!! giants vs packers game now!! Touchdown!!"
Desired output: "giants" is linked to the New York Giants Wikipedia page and "packers" to the Green Bay Packers page; the other phrases are left unlinked

9 Terminology
Example tweet: "Oh Yes!! giants vs packers game now!! Touchdown!!"
Key terms: mention candidates, entity mentions, assignment
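To make these three terms concrete, here is a minimal sketch using simple Python data structures; the span offsets, candidate lists, and the NIL convention are illustrative assumptions, not taken from the paper's data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class MentionCandidate:
    """A span of the tweet that might refer to an entity."""
    start: int             # token index where the span begins
    end: int               # token index just past the span
    surface: str           # the text of the span, e.g. "giants"
    entities: List[str]    # Wikipedia titles the span could link to

# Tweet: "Oh Yes!! giants vs packers game now!! Touchdown!!"
candidates = [
    MentionCandidate(2, 3, "giants", ["New York Giants", "San Francisco Giants"]),
    MentionCandidate(4, 5, "packers", ["Green Bay Packers", "Meat packing industry"]),
    MentionCandidate(5, 6, "game", ["Game", "Game (2011 film)"]),
]

# An assignment picks one entity, or None (NIL, "do not link"), per candidate.
assignment: Dict[str, Optional[str]] = {
    "giants": "New York Giants",
    "packers": "Green Bay Packers",
    "game": None,
}

# The entity mentions are the candidates that actually get linked.
entity_mentions = [c for c in candidates if assignment[c.surface] is not None]
print([m.surface for m in entity_mentions])  # ['giants', 'packers']
```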

10 Related Work
 Wikification [Cucerzan, 2007; Milne and Witten, 2008; ...]: given a document, create Wikipedia-like links; very difficult to evaluate/compare; mention detection and disambiguation are often treated separately
 NER [Li et al., 2012; Ritter et al., 2011; ...]: no linking; limited types
 KBP [Ji et al., 2010; Ji et al., 2011; ...]: focuses on the disambiguation aspect

11 Outline
 Task Definition (again!)
 Two-stage versus Joint
 Model + Features
 Results + Analysis

12 What approach should we use?
 Task: Wikify the entities of certain types (all named entities)
 Approach 1 (two-stage): train a general named entity recognizer for those types, then link to entities from the output of the first stage (issues: limited types, adaptation)
 Approach 2 (joint): learn to jointly detect mentions and disambiguate entities, taking advantage of Wikipedia information and incorporating type information into the model (requires a more advanced model)

13 The Necessity of the Joint Approach
Example tweet: "The town is so so good. Don’t worry Ben, we already forgave you for Gigli"
 Q: Is “the town” a mention?
 Deep analysis with knowledge is required: Gigli is a Ben Affleck movie that did not receive good reviews, and Ben Affleck is the lead actor in the movie “The Town”

14 Outline
 Task Definition (again!)
 Two-stage versus Joint
 Model + Features
 Results + Analysis

15 Features
Example tweet: "Oh Yes!! giants vs packers game now!! Touchdown!!"
Feature groups: mention-specific features; mention-entity pair features; second-order features; type features
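These four groups feed a single joint score over the whole assignment for a tweet. Below is a minimal sketch, assuming a simple linear model; the feature functions and the RELATED table are hypothetical placeholders, and in the actual system the weights are learned with a structural SVM (see the Procedure slide).

```python
# Sketch only (not the authors' released code): score one complete assignment
# of entities to mention candidates with a linear model over the four groups.

RELATED = {"New York Giants": {"Green Bay Packers"}}  # toy entity-entity relatedness

def mention_features(cand):
    # Mention-specific signals, e.g. how often the surface form is capitalized.
    return {"cap_rate": cand.get("cap_rate", 0.0)}

def pair_features(cand, entity):
    # Mention-entity pair signals, e.g. anchor-text link probability.
    return {"link_prob": cand.get("link_prob", {}).get(entity, 0.0)}

def type_features(cand, entity):
    # Whether the tweet context around the mention matches the entity's type.
    return {"type_context_match": cand.get("type_match", {}).get(entity, 0.0)}

def second_order_features(entity_a, entity_b):
    # Entity-entity signals between two chosen entities.
    return {"related": 1.0 if entity_b in RELATED.get(entity_a, set()) else 0.0}

def score(assignment, w):
    """assignment: list of (candidate_dict, chosen_entity_or_None) pairs."""
    chosen = [(c, e) for c, e in assignment if e is not None]
    total = 0.0
    for cand, entity in chosen:
        feats = {**mention_features(cand), **pair_features(cand, entity),
                 **type_features(cand, entity)}
        total += sum(w.get(name, 0.0) * value for name, value in feats.items())
    for i in range(len(chosen)):
        for j in range(i + 1, len(chosen)):
            for name, value in second_order_features(chosen[i][1], chosen[j][1]).items():
                total += w.get(name, 0.0) * value
    return total

# Toy usage: score one assignment that links both "giants" and "packers".
giants = {"cap_rate": 0.2, "link_prob": {"New York Giants": 0.6}}
packers = {"cap_rate": 0.3, "link_prob": {"Green Bay Packers": 0.9}}
weights = {"cap_rate": 1.0, "link_prob": 2.0, "related": 1.5}
print(score([(giants, "New York Giants"), (packers, "Green Bay Packers")], weights))
```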

16 Mention Specific Features

17 View Count
 Wikipedia page-view statistics: http://dumps.wikimedia.org/other/pagecounts-raw/ (a log exists for every hour; very valuable data)
 View count is useful: sometimes the most linked entity in Wikipedia is not the most popular one
 “jersey shore” ==> ?
   Jersey Shore: links 441, views 509,140
   Jersey Shore (TV series): links 324, views 5,081,377
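As a toy illustration of this point, the snippet below ranks the two candidate pages for "jersey shore" once by incoming-link count and once by view count, using the numbers quoted on this slide; ranking by views recovers the TV series.

```python
# Toy illustration (not the authors' code) of why page views help: the most
# linked candidate is the geographic region, but the most viewed is the TV show.
candidates = {
    "Jersey Shore": {"links": 441, "views": 509_140},
    "Jersey Shore (TV series)": {"links": 324, "views": 5_081_377},
}

by_links = max(candidates, key=lambda e: candidates[e]["links"])
by_views = max(candidates, key=lambda e: candidates[e]["views"])
print("most linked:", by_links)   # Jersey Shore
print("most viewed:", by_views)   # Jersey Shore (TV series)
```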

18 Second Order Features

19 Type Features
 The information content on Wikipedia is different from Twitter: Wikipedia is informational, tweets are actionable, and tweets contain misspelled words (“watchin”, “watchn”, ...)
 We want to find tweet context for PER, LOC, ORG, ... Step 1: train a first system. Step 2: label 10 million unlabeled tweets with it. Step 3: collect popular contextual words for each type (see the sketch after the next slide's word table). Step 4: train a new system with one new feature that checks whether the context matches the type.

20 Mining Contextual Words
Entity Type | Words appearing before the mention | Words appearing after the mention
Person      | wr, dominating, rip, quarterback, singer, featuring, defender, rb, minister, actress, twitition, secretary | tarde, format, noite, suffers, dire, admits, senators, urges, performs, joins
TV Show     | sbs, assistir, assistindo, otm, watching, nw, watchn, viagra, watchin, ver | skit, performances, premieres, finale, parody, marathon, season, episodes, spoilers, sketch
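Here is a minimal sketch of Step 3 from the previous slide, under the assumption that mentions have already been auto-labeled with types by the first-pass system; the function name, window size, and toy input are hypothetical.

```python
# Sketch (assumptions, not the authors' pipeline): count which words appear just
# before and just after auto-labeled mentions of each type, and keep the most
# popular ones as contextual words.
from collections import Counter, defaultdict

def mine_contextual_words(labeled_tweets, window=1, top_k=10):
    """labeled_tweets: iterable of (tokens, spans), where spans is a list of
    (start, end, entity_type) triples produced by the first-pass system."""
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for tokens, spans in labeled_tweets:
        for start, end, etype in spans:
            before[etype].update(tokens[max(0, start - window):start])
            after[etype].update(tokens[end:end + window])
    return {etype: ([w for w, _ in before[etype].most_common(top_k)],
                    [w for w, _ in after[etype].most_common(top_k)])
            for etype in set(before) | set(after)}

# Toy usage: one auto-labeled tweet, mention "glee" tagged as TVSHOW.
tweets = [("watchin glee season finale".split(), [(1, 2, "TVSHOW")])]
print(mine_contextual_words(tweets))  # {'TVSHOW': (['watchin'], ['season'])}
```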

21 Procedure
 Testing, step 1: given a tweet, tokenize it, remove symbols, and segment hashtags
 Testing, step 2: for all k-grams in the tweet, do a table lookup to find mention candidates and the entities they can link to (see the sketch below)
 Testing, step 3: construct features and output the assignment with the trained model
 Learning: structural SVM; Inference: exact / beam search; a rule-based system categorizes Wikipedia pages into types
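A hypothetical sketch of the k-gram lookup in testing step 2, assuming a surface-form table that maps lowercase phrases to candidate Wikipedia entities (such tables are typically built from Wikipedia anchor text); the entries below are toy data.

```python
# Sketch only: enumerate all k-grams of the tokenized tweet and look each one up
# in a surface-form table to collect mention candidates and their entities.
SURFACE_TABLE = {  # toy table
    "giants": ["New York Giants", "San Francisco Giants"],
    "packers": ["Green Bay Packers"],
    "green bay packers": ["Green Bay Packers"],
}

def find_mention_candidates(tokens, max_k=3):
    """Return (start, end, phrase, candidate_entities) for every k-gram hit."""
    hits = []
    for k in range(1, max_k + 1):
        for start in range(len(tokens) - k + 1):
            phrase = " ".join(tokens[start:start + k]).lower()
            if phrase in SURFACE_TABLE:
                hits.append((start, start + k, phrase, SURFACE_TABLE[phrase]))
    return hits

tokens = "oh yes giants vs packers game now touchdown".split()
for hit in find_mention_candidates(tokens):
    print(hit)  # (2, 3, 'giants', [...]) and (4, 5, 'packers', [...])
```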

22 Outline
 Task Definition (again!)
 Two-stage versus Joint
 Model + Features
 Results + Analysis

23 Data
         #Tweets   #Cand   #Mention   P@1
Train    473       8212    218        85.3%
Test 1   500       8950    249        87.7%
Test 2   488       7781    332        89.6%
 We sample two sets of tweets: Train and Test 1 are from [Ritter et al., 2011]; Test 2 is sampled from Twitter with entertainment keywords (“director”, “actress”, ...)
 P@1 is very high: many algorithms focus on disambiguation, but if the mentions are correctly extracted, the system is already very good
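A hedged sketch of how a P@1-style number could be computed, assuming P@1 here means the fraction of gold mentions whose top-ranked candidate entity (ranked best-first by some prior) is the gold entity; the rankings below are toy data.

```python
# Sketch only: precision@1 over gold mentions with ranked candidate lists.
def precision_at_1(gold_mentions):
    """gold_mentions: list of (ranked_candidates, gold_entity) pairs,
    where ranked_candidates is ordered best-first."""
    correct = sum(1 for ranked, gold in gold_mentions if ranked and ranked[0] == gold)
    return correct / len(gold_mentions)

mentions = [
    (["New York Giants", "San Francisco Giants"], "New York Giants"),
    (["Jersey Shore", "Jersey Shore (TV series)"], "Jersey Shore (TV series)"),
]
print(f"P@1 = {precision_at_1(mentions):.1%}")  # 50.0%
```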

24 Main Results
 Baselines: TagMe [Ferragina & Scaiella, 2010] and Cucerzan [Cucerzan, 2007]
 Cucerzan is designed for well-written documents
 We have a more principled way to handle mention detection than TagMe

25 Impact of Features
 Entity information helps mention detection
 Mining contextual words helps a bit
 Capturing entity-entity relations also improves the model
Feature Type       Test 1
Base + Cap. Rate   45.6
(Only the first row of the ablation table was captured in the transcript.)

26 Conclusion & Discussion
 We provide an experimental study on tweets: jointly detect mentions and disambiguate, using a structured learning approach
 What have we learned: mention detection is a difficult problem, and entity information could potentially help mention detection
 Future work: explore the connections between the joint and two-stage approaches [Illinois, ACL 2011; AIDA, VLDB 2011]; a more principled way to handle context

