Information Extraction from Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld Department of Computer Science & Engineering.


1 Information Extraction from Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld Department of Computer Science & Engineering University of Washington Seattle, WA, USA Intelligence in Wikipedia: Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers & Michael Skinner

2 Motivating Vision Next-Generation Search = Information Extraction + Ontology + Inference Which performing artists were born in Chicago? … Bob was born in Northwestern Memorial Hospital. … … Bob Black is an active actor who was selected as this year’s … Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …

3 Next-Generation Search Information Extraction … Ontology Actor ISA Performing Artist… Inference Born-In(A) ^ PartOf(A,B) => Born-In(B)… … Bob was born in Northwestern Memorial Hospital. … … Bob Black is an active actor who … Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …

4 Wikipedia – Bootstrap for the Web  Goal: search over the Web  Now: search over Wikipedia  Comprehensive  High-quality  (Semi-)Structured data

5 Infoboxes  Infoboxes are designed to present summary information about an article's subject, so that similar subjects have a uniform look and a common format  An infobox is a generalization of a taxobox (from taxonomy), which summarizes information for an organism or group of organisms

6 Infobox examples  Basic infobox  Taxobox (plant species)

7 More examples  Infobox People (Actor)  Infobox (Convention Center)

8 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

9 Kylin: Autonomously Semantifying Wikipedia  Fully autonomous, with no additional human effort  Forms its training dataset from existing infoboxes  Extracts semantic relations from Wikipedia articles  Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)

10 Kylin  A prototype self-supervised machine-learning system  It looks for classes of pages with similar infoboxes  It determines their common attributes  It creates training examples

11 Infobox Generation

12 Preprocessor: Schema Refinement  Free editing leads to schema drift:  Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)  Rarely used attributes  Duplicate attributes: "Census Yr", "Census Estimate Yr", "Census Est.", "Census Year"  Kylin's heuristics: strict name matching; keep attributes occurring in at least 15% of a class's infoboxes
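The 15% threshold above can be sketched as a small filter. This is a minimal illustration only (the function name, data, and omission of name normalization are my assumptions, not Kylin's actual implementation):

```python
from collections import Counter

def refine_schema(infoboxes, min_frac=0.15):
    """Keep only attributes used in at least min_frac of a class's infoboxes.

    `infoboxes` is a list of dicts mapping attribute name -> value; the 15%
    cutoff mirrors the slide's heuristic. Name normalization (merging
    "Census Yr" with "Census Year") is omitted here.
    """
    counts = Counter(attr for box in infoboxes for attr in box)
    n = len(infoboxes)
    return {attr for attr, count in counts.items() if count / n >= min_frac}

# Seven county infoboxes; "area" appears in only one (1/7 ≈ 14%), so it is dropped.
boxes = [{"seat": name} for name in
         ["Clearfield", "Bellefonte", "Smethport", "Ridgway",
          "Emporium", "Coudersport", "Wellsboro"]]
boxes[0]["area"] = "2,972 km²"
schema = refine_schema(boxes)
```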

13 Preprocessor: Training Dataset Construction  Its county seat is Clearfield. As of 2005, the population density was 28.2/km². Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812. 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
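The self-supervised training-set construction on this slide can be sketched by matching infobox values against article sentences. A toy version under my own naming (real matching is fuzzier than substring tests):

```python
def make_training_pairs(sentences, infobox):
    """Label sentences that contain an infobox attribute's value.

    A minimal sketch of Kylin's self-supervised training-set construction:
    a sentence mentioning the value of `seat` becomes a positive example
    for that attribute's extractor.
    """
    pairs = []
    for sent in sentences:
        for attr, value in infobox.items():
            if value in sent:
                pairs.append((attr, sent))
    return pairs

sentences = [
    "Its county seat is Clearfield.",
    "As of 2005, the population density was 28.2/km².",
]
infobox = {"seat": "Clearfield", "density": "28.2/km²"}
pairs = make_training_pairs(sentences, infobox)
```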

14 Classifier  Document Classifier: uses Lists and Categories  Fast  Precision: 98.5%  Recall: 68.8%  Sentence Classifier  Predicts which attribute values are contained in a given sentence  Uses a maximum-entropy model  To cope with the noisy, incomplete training dataset, Kylin applies bagging
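The slide names a maximum-entropy sentence classifier smoothed with bagging. Below is a dependency-free sketch: a tiny hand-rolled logistic-regression (equivalently, binary maximum-entropy) trainer over bags of words, plus majority-vote bagging over bootstrap resamples. All names, data, and hyperparameters are illustrative, not Kylin's:

```python
import math
import random

def train_maxent(examples, epochs=200, lr=0.5):
    """Tiny logistic-regression (maximum-entropy) trainer over bags of words."""
    vocab = sorted({tok for toks, _ in examples for tok in toks})
    weights = {tok: 0.0 for tok in vocab}
    bias = 0.0
    for _ in range(epochs):
        for toks, label in examples:
            z = bias + sum(weights[t] for t in toks if t in weights)
            p = 1.0 / (1.0 + math.exp(-z))
            grad = label - p          # gradient of the log-likelihood
            bias += lr * grad
            for t in toks:
                if t in weights:
                    weights[t] += lr * grad
    return weights, bias

def predict(model, toks):
    weights, bias = model
    return 1 if bias + sum(weights.get(t, 0.0) for t in toks) > 0 else 0

def bagged_predict(models, toks):
    """Majority vote over models trained on bootstrap resamples."""
    return 1 if sum(predict(m, toks) for m in models) * 2 > len(models) else 0

# Toy data: does the sentence contain a birth_place value?
data = [
    ("bob was born in chicago".split(), 1),
    ("ada was born in england".split(), 1),
    ("he starred in many films".split(), 0),
    ("the film won three awards".split(), 0),
]
single = train_maxent(data)
rng = random.Random(0)
bagged = [train_maxent([rng.choice(data) for _ in data]) for _ in range(5)]
```

Bagging trades a little bias for robustness: each bootstrap model sees a slightly different (noisy) sample, and the vote damps individual mislabelings.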

15 CRF Extractor  Attribute-value extraction as sequential data labeling, with a Conditional Random Fields (CRF) model per attribute, trained independently  Relabeling filters false-negative training examples: "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water."  Preprocessor: Water_area; Classifier: Water_area, Land_area  Though Kylin is successful on popular classes, its performance decreases on sparse classes with insufficient training data
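A CRF extractor consumes per-token features over the sentence. The feature function below is illustrative only; Kylin's real feature set (digits, units, neighboring words, capitalization, etc.) is considerably richer:

```python
def token_features(tokens, i):
    """Per-token features of the kind a CRF attribute extractor consumes."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        # numeric tokens like "2,972" are likely parts of attribute values
        "is_number": tok.replace(",", "").replace(".", "").isdigit(),
        # area units signal Water_area / Land_area attributes
        "is_area_unit": tok.endswith(("km²", "mi²")),
        # neighboring words give the sequential context a CRF exploits
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

sentence = "2,972 km² ( 1,147 mi² ) of it is land".split()
features = [token_features(sentence, i) for i in range(len(sentence))]
```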

16 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

17 Long-Tail 1: Sparse Infobox Classes  Kylin performs well on popular classes: precision from the mid-70% to the high-90% range, recall from the low-50% to the mid-90% range  Kylin flounders on sparse classes with little training data: e.g., for the "US County" class Kylin achieves 97.3% precision and 95.9% recall, while many other classes, such as "Irish Newspaper", contain very few infobox-bearing articles

18 Long-Tail 2: Incomplete Articles  Desired information is missing from Wikipedia: among its 1.8 million pages (as of July 2007), many are short articles, and almost 800,000 (44.2%) are marked as stubs, indicating that much-needed information is missing

19 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

20 Shrinkage  Attempts to improve Kylin's performance on sparse classes using shrinkage  When training an extractor for an instance-sparse infobox class, shrinkage aggregates training data from its parent and child classes
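The idea above can be sketched as pooling examples across the class hierarchy with a distance-based down-weighting. The function name, the decay scheme, and the toy data are my assumptions; the real system learns from the KOG ontology described on the following slides:

```python
def shrinkage_pool(examples_by_class, relatives, target, decay=0.5):
    """Pool training examples from ontologically related classes.

    A sparse class (e.g. performer, 44 instances) borrows examples from its
    parent and children, down-weighted by distance in the class hierarchy.
    `relatives` maps a class to (related_class, distance) pairs.
    """
    pooled = [(example, 1.0) for example in examples_by_class.get(target, [])]
    for related, distance in relatives.get(target, []):
        weight = decay ** distance
        pooled += [(ex, weight) for ex in examples_by_class.get(related, [])]
    return pooled

examples_by_class = {
    "performer": ["Ada, a performer, was born in England."],
    "person": ["Bob was born in Chicago."],
    "actor": ["Carol, an actor, was born in Paris."],
}
relatives = {"performer": [("person", 1), ("actor", 1)]}
pooled = shrinkage_pool(examples_by_class, relatives, "performer")
```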

21 [Class-hierarchy diagram: person (1201) → performer (44) → actor (8738), comedian (106); related attributes across the hierarchy: birth_place, birthplace, cityofbirth, origin, location] [McCallum et al., ICML98]

22 Shrinkage: KOG (Kylin Ontology Generator) [Wu & Weld, WWW08]  [Class-hierarchy diagram: person (1201) → performer (44) → actor (8738), comedian (106); attributes: birth_place, birthplace, cityofbirth, origin, location]

23 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

24 Retraining  Key: identifying relevant sentences amid the sea of Web data  Complementary to shrinkage: harvests extra training data from the broader Web  "Andrew Murray was born in Scotland in 1828 …"

25 Retraining  Kylin extraction vs. TextRunner extraction: query TextRunner for relevant sentences, e.g.: r1 = "Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870." r2 = "Ada Cambridge was born in Norfolk, England, in 1844." t=
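Harvesting sentences like r1 and r2 can be sketched as selecting corpus sentences that mention both the subject and a known attribute value. This stands in for querying TextRunner, which issues relational queries over its extraction index rather than substring matches:

```python
def harvest_sentences(corpus, subject, value):
    """Select corpus sentences mentioning both the subject and a known
    attribute value, as extra positive training data for retraining.
    """
    return [s for s in corpus if subject in s and value in s]

corpus = [
    "Ada Cambridge was born in England in 1844 and moved to Australia "
    "with her curate husband in 1870.",
    "Ada Cambridge was born in Norfolk, England, in 1844.",
    "Andrew Murray was born in Scotland in 1828.",
]
extra = harvest_sentences(corpus, "Ada Cambridge", "1844")
```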

26 Effect of Shrinkage & Retraining

27 1755% improvement for a sparse class; 13.7% improvement for a popular class

28 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

29 Extraction from the Web  Idea: apply Kylin extractors, trained on Wikipedia, to general Web pages  Challenge: maintaining high precision, since general Web pages are noisy and many describe multiple objects  Key: retrieve relevant sentences  Procedure: generate a set of search-engine queries; retrieve the top-k pages from Google; weight the extractions from these pages

30 Choosing Queries  Example: get the birth_date attribute for the article titled "Andrew Murray (minister)":  "andrew murray"  "andrew murray" birth date (attribute name)  "andrew murray" was born in (predicate from TextRunner)  …
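The query templates above can be sketched as follows. The fixed predicate tuple is my simplification; in the full system the predicate phrases come from TextRunner:

```python
def make_queries(title, attribute, predicates=("was born in",)):
    """Generate search-engine queries for one article/attribute pair,
    mirroring the slide's examples for "Andrew Murray"."""
    base = title.lower()
    queries = [
        f'"{base}"',                                  # title alone
        f'"{base}" {attribute.replace("_", " ")}',    # title + attribute name
    ]
    queries += [f'"{base}" {p}' for p in predicates]  # title + predicate phrase
    return queries

queries = make_queries("Andrew Murray", "birth_date")
```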

31 Weighting Extractions  Which extractions are more relevant? Features:  number of sentences between the extraction sentence and the closest occurrence of the title ("andrew murray")  rank of the page in Google's result list  Kylin's extractor confidence
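A toy combination of the three signals might look like this. The weights and the 1/(1+x) transforms are illustrative assumptions; the actual system combines the signals by learning rather than by hand:

```python
def score_extraction(sentence_distance, page_rank, confidence,
                     w_dist=0.3, w_rank=0.2, w_conf=0.5):
    """Combine the three relevance signals into one score."""
    dist_score = 1.0 / (1 + sentence_distance)  # nearer the title mention is better
    rank_score = 1.0 / (1 + page_rank)          # higher-ranked pages are better
    return w_dist * dist_score + w_rank * rank_score + w_conf * confidence

# An extraction next to the title mention on the top-ranked page beats one
# ten sentences away on the tenth result, at equal extractor confidence.
near = score_extraction(sentence_distance=0, page_rank=0, confidence=0.9)
far = score_extraction(sentence_distance=10, page_rank=9, confidence=0.9)
```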

32 Web Extraction Experiment  Extractor confidence alone performs poorly; the weighted combination is best

33 Combining Wikipedia & Web Recall Benefit from Shrinkage / Retraining…

34 Combining Wikipedia & Web Benefit from Shrinkage + Retraining + Web

35 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

36 Problem  Information extraction is imprecise, and Wikipedians don't want 90% precision  How to improve precision? People!

37 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

38 Intelligence in Wikipedia  What is IWP? › A project/system that aims to combine  IE (Information Extraction)  CCC (communal content creation)

39 Information Extraction  Examples: Zoominfo.com, Flipdog.com, Citeseer, Google  Advantage: autonomy  Disadvantage: expensive

40 IE system contributors  Contributors in this room? › Wikipedia IE systems › Citeseer › Rexa › DBlife

41 Communal Content Creation  Examples: Wikipedia, eBay, Netflix  Advantage: more accurate than IE  Disadvantage: bootstrapping, incentives, and management

42 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

43 Virtuous Cycle

44 Contributing as a Non-Primary Task  Encourage contributions  Without annoying or abusing readers › Compared 5 different interfaces

45

46 Results  Contribution rate: 1.6% → 13%  90% of positive labels were correct

47 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

48 IWP and Shrinkage, Retraining, and Extracting from the Web  Shrinkage improves IWP's precision and recall  Retraining improves the robustness of IWP's extractors  Web extraction further helps IWP's performance

49 Multi-Lingual Extraction  Idea: further leverage the virtuous feedback cycle  Use IE methods to add or update missing information by copying it from one language edition to another  Use CCC to validate and improve the updates  Example: Nombre = "Jerry Seinfeld" and Name = "Jerry Seinfeld"; Cónyuge = "Jessica Sklar" and Spouse = "Jessica Seinfeld"
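The cross-language reconciliation in the example can be sketched as a diff over aligned infoboxes: values missing on one side become copy candidates, disagreeing values become conflicts for human (CCC) review. Function and variable names here are illustrative:

```python
def cross_language_diff(en_box, es_box, attribute_map):
    """Compare an English infobox with its Spanish counterpart.

    `attribute_map` pairs attribute names across languages
    (e.g. Nombre/Name, Cónyuge/Spouse).
    """
    to_copy, conflicts = {}, {}
    for es_attr, en_attr in attribute_map.items():
        es_val, en_val = es_box.get(es_attr), en_box.get(en_attr)
        if es_val and not en_val:
            to_copy[en_attr] = es_val      # fill the gap from the other language
        elif es_val and en_val and es_val != en_val:
            conflicts[en_attr] = (es_val, en_val)  # flag for CCC review
    return to_copy, conflicts

es_box = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
en_box = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}
to_copy, conflicts = cross_language_diff(
    en_box, es_box, {"Nombre": "Name", "Cónyuge": "Spouse"})
```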

50 Summary  Kylin's initial performance on sparse classes is unacceptable  Methods for increasing recall: shrinkage; retraining; extraction from the Web

51 Summary  IWP: developing AI methods to facilitate the growth, operation, and use of Wikipedia  Initial goal: extraction of a giant knowledge base of semantic triples, supporting faceted browsing and input to a reasoning-based question-answering system  How? IE + CCC

52 Questions

