Information Extraction from Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld Department of Computer Science & Engineering.


1 Information Extraction from Wikipedia: Moving Down the Long Tail Fei Wu, Raphael Hoffmann, Daniel S. Weld Department of Computer Science & Engineering University of Washington Seattle, WA, USA Intelligence in Wikipedia: Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers & Michael Skinner

2 Motivating Vision Next-Generation Search = Information Extraction + Ontology + Inference Which performing artists were born in Chicago? … Bob was born in Northwestern Memorial Hospital. … … Bob Black is an active actor who was selected as this year’s … Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …

3 Next-Generation Search Information Extraction … Ontology Actor ISA Performing Artist… Inference Born-In(A) ^ PartOf(A,B) => Born-In(B)… … Bob was born in Northwestern Memorial Hospital. … … Bob Black is an active actor who … Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …

4 Wikipedia – Bootstrap for the Web  Goal: search over the Web  Now: search over Wikipedia  Comprehensive  High-quality  (Semi-)Structured data

5 Infoboxes  Infoboxes are designed to present summary information about an article's subject, so that similar subjects have a uniform look and a common format  An infobox is a generalization of a taxobox (from taxonomy), which summarizes information for an organism or group of organisms

6 Infobox examples  Basic infobox  Taxobox (plant species)

7 More examples  Infobox People (Actor)  Infobox (Convention Center)

8 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

9 Kylin: Autonomously Semantifying Wikipedia  Fully autonomous, with no additional human effort  Forms its training dataset from existing infoboxes  Extracts semantic relations from Wikipedia articles  Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage. (Wikipedia)

10 Kylin  A prototype self-supervised machine-learning system  It looks for classes of pages with similar infoboxes  It determines their common attributes  It creates training examples

11 Infobox Generation

12 Preprocessor: Schema Refinement  Free editing leads to schema drift:  Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)  Rarely used attributes  Duplicate attributes: "Census Yr", "Census Estimate Yr", "Census Est.", "Census Year"  Kylin's heuristics: strict name matching; keep attributes occurring in at least 15% of a class's infoboxes
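The 15% threshold above can be sketched as a small filter. This is a minimal illustration only (the function name, data, and omission of name normalization are my assumptions, not Kylin's actual implementation):

```python
from collections import Counter

def refine_schema(infoboxes, min_frac=0.15):
    """Keep only attributes used in at least min_frac of a class's infoboxes.

    `infoboxes` is a list of dicts mapping attribute name -> value; the 15%
    cutoff mirrors the slide's heuristic. Name normalization (merging
    "Census Yr" with "Census Year") is omitted here.
    """
    counts = Counter(attr for box in infoboxes for attr in box)
    n = len(infoboxes)
    return {attr for attr, count in counts.items() if count / n >= min_frac}

# Seven county infoboxes; "area" appears in only one (1/7 ≈ 14%), so it is dropped.
boxes = [{"seat": name} for name in
         ["Clearfield", "Bellefonte", "Smethport", "Ridgway",
          "Emporium", "Coudersport", "Wellsboro"]]
boxes[0]["area"] = "2,972 km²"
schema = refine_schema(boxes)
```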

13 Preprocessor: Training Dataset Construction  Its county seat is Clearfield. As of 2005, the population density was 28.2/km². Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812. 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
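The self-supervised training-set construction on this slide can be sketched by matching infobox values against article sentences. A toy version under my own naming (real matching is fuzzier than substring tests):

```python
def make_training_pairs(sentences, infobox):
    """Label sentences that contain an infobox attribute's value.

    A minimal sketch of Kylin's self-supervised training-set construction:
    a sentence mentioning the value of `seat` becomes a positive example
    for that attribute's extractor.
    """
    pairs = []
    for sent in sentences:
        for attr, value in infobox.items():
            if value in sent:
                pairs.append((attr, sent))
    return pairs

sentences = [
    "Its county seat is Clearfield.",
    "As of 2005, the population density was 28.2/km².",
]
infobox = {"seat": "Clearfield", "density": "28.2/km²"}
pairs = make_training_pairs(sentences, infobox)
```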

14 Classifier  Document Classifier: uses Lists and Categories  Fast  Precision: 98.5%  Recall: 68.8%  Sentence Classifier  Predicts which attribute values are contained in a given sentence  Uses a maximum-entropy model  To cope with the noisy, incomplete training dataset, Kylin applies bagging
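The slide names a maximum-entropy sentence classifier smoothed with bagging. Below is a dependency-free sketch: a tiny hand-rolled logistic-regression (equivalently, binary maximum-entropy) trainer over bags of words, plus majority-vote bagging over bootstrap resamples. All names, data, and hyperparameters are illustrative, not Kylin's:

```python
import math
import random

def train_maxent(examples, epochs=200, lr=0.5):
    """Tiny logistic-regression (maximum-entropy) trainer over bags of words."""
    vocab = sorted({tok for toks, _ in examples for tok in toks})
    weights = {tok: 0.0 for tok in vocab}
    bias = 0.0
    for _ in range(epochs):
        for toks, label in examples:
            z = bias + sum(weights[t] for t in toks if t in weights)
            p = 1.0 / (1.0 + math.exp(-z))
            grad = label - p          # gradient of the log-likelihood
            bias += lr * grad
            for t in toks:
                if t in weights:
                    weights[t] += lr * grad
    return weights, bias

def predict(model, toks):
    weights, bias = model
    return 1 if bias + sum(weights.get(t, 0.0) for t in toks) > 0 else 0

def bagged_predict(models, toks):
    """Majority vote over models trained on bootstrap resamples."""
    return 1 if sum(predict(m, toks) for m in models) * 2 > len(models) else 0

# Toy data: does the sentence contain a birth_place value?
data = [
    ("bob was born in chicago".split(), 1),
    ("ada was born in england".split(), 1),
    ("he starred in many films".split(), 0),
    ("the film won three awards".split(), 0),
]
single = train_maxent(data)
rng = random.Random(0)
bagged = [train_maxent([rng.choice(data) for _ in data]) for _ in range(5)]
```

Bagging trades a little bias for robustness: each bootstrap model sees a slightly different (noisy) sample, and the vote damps individual mislabelings.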

15 CRF Extractor  Attribute-value extraction as sequential data labeling, with a Conditional Random Fields (CRF) model per attribute, trained independently  Relabeling filters false-negative training examples: "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water."  Preprocessor: Water_area; Classifier: Water_area, Land_area  Though Kylin is successful on popular classes, its performance decreases on sparse classes with insufficient training data
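A CRF extractor consumes per-token features over the sentence. The feature function below is illustrative only; Kylin's real feature set (digits, units, neighboring words, capitalization, etc.) is considerably richer:

```python
def token_features(tokens, i):
    """Per-token features of the kind a CRF attribute extractor consumes."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        # numeric tokens like "2,972" are likely parts of attribute values
        "is_number": tok.replace(",", "").replace(".", "").isdigit(),
        # area units signal Water_area / Land_area attributes
        "is_area_unit": tok.endswith(("km²", "mi²")),
        # neighboring words give the sequential context a CRF exploits
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

sentence = "2,972 km² ( 1,147 mi² ) of it is land".split()
features = [token_features(sentence, i) for i in range(len(sentence))]
```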

16 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

17 Long-Tail 1: Sparse Infobox Classes  Kylin performs well on popular classes: precision from the mid-70% to the high-90% range, recall from the low-50% to the mid-90% range  Kylin flounders on sparse classes with little training data: e.g., for the "US County" class Kylin achieves 97.3% precision and 95.9% recall, while many other classes, such as "Irish Newspaper", contain very few infobox-bearing articles

18 Long-Tail 2: Incomplete Articles  Desired information is missing from Wikipedia: among its 1.8 million pages (as of July 2007), many are short articles, and almost 800,000 (44.2%) are marked as stubs, indicating that much-needed information is missing

19 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

20 Shrinkage  Attempts to improve Kylin's performance on sparse classes using shrinkage  When training an extractor for an instance-sparse infobox class, shrinkage aggregates training data from its parent and child classes
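The idea above can be sketched as pooling examples across the class hierarchy with a distance-based down-weighting. The function name, the decay scheme, and the toy data are my assumptions; the real system learns from the KOG ontology described on the following slides:

```python
def shrinkage_pool(examples_by_class, relatives, target, decay=0.5):
    """Pool training examples from ontologically related classes.

    A sparse class (e.g. performer, 44 instances) borrows examples from its
    parent and children, down-weighted by distance in the class hierarchy.
    `relatives` maps a class to (related_class, distance) pairs.
    """
    pooled = [(example, 1.0) for example in examples_by_class.get(target, [])]
    for related, distance in relatives.get(target, []):
        weight = decay ** distance
        pooled += [(ex, weight) for ex in examples_by_class.get(related, [])]
    return pooled

examples_by_class = {
    "performer": ["Ada, a performer, was born in England."],
    "person": ["Bob was born in Chicago."],
    "actor": ["Carol, an actor, was born in Paris."],
}
relatives = {"performer": [("person", 1), ("actor", 1)]}
pooled = shrinkage_pool(examples_by_class, relatives, "performer")
```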

21 [Class-hierarchy diagram: person (1201) → performer (44) → actor (8738), comedian (106); related attributes across the hierarchy: birth_place, birthplace, cityofbirth, origin, location] [McCallum et al., ICML98]

22 Shrinkage: KOG (Kylin Ontology Generator) [Wu & Weld, WWW08]  [Class-hierarchy diagram: person (1201) → performer (44) → actor (8738), comedian (106); attributes: birth_place, birthplace, cityofbirth, origin, location]

23 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

24 Retraining  Key: identifying relevant sentences amid the sea of Web data  Complementary to shrinkage: harvests extra training data from the broader Web  "Andrew Murray was born in Scotland in 1828 …"

25 Retraining  Kylin extraction vs. TextRunner extraction: query TextRunner for relevant sentences, e.g.: r1 = "Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870." r2 = "Ada Cambridge was born in Norfolk, England, in 1844." t=
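Harvesting sentences like r1 and r2 can be sketched as selecting corpus sentences that mention both the subject and a known attribute value. This stands in for querying TextRunner, which issues relational queries over its extraction index rather than substring matches:

```python
def harvest_sentences(corpus, subject, value):
    """Select corpus sentences mentioning both the subject and a known
    attribute value, as extra positive training data for retraining.
    """
    return [s for s in corpus if subject in s and value in s]

corpus = [
    "Ada Cambridge was born in England in 1844 and moved to Australia "
    "with her curate husband in 1870.",
    "Ada Cambridge was born in Norfolk, England, in 1844.",
    "Andrew Murray was born in Scotland in 1828.",
]
extra = harvest_sentences(corpus, "Ada Cambridge", "1844")
```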

26 Effect of Shrinkage & Retraining

27 1755% improvement for a sparse class; 13.7% improvement for a popular class

28 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

29 Extraction from the Web  Idea: apply Kylin extractors, trained on Wikipedia, to general Web pages  Challenge: maintaining high precision, since general Web pages are noisy and many describe multiple objects  Key: retrieve relevant sentences  Procedure: generate a set of search-engine queries; retrieve the top-k pages from Google; weight the extractions from these pages

30 Choosing Queries  Example: get the birth_date attribute for the article titled "Andrew Murray (minister)":  "andrew murray"  "andrew murray" birth date (attribute name)  "andrew murray" was born in (predicate from TextRunner)  …
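The query templates above can be sketched as follows. The fixed predicate tuple is my simplification; in the full system the predicate phrases come from TextRunner:

```python
def make_queries(title, attribute, predicates=("was born in",)):
    """Generate search-engine queries for one article/attribute pair,
    mirroring the slide's examples for "Andrew Murray"."""
    base = title.lower()
    queries = [
        f'"{base}"',                                  # title alone
        f'"{base}" {attribute.replace("_", " ")}',    # title + attribute name
    ]
    queries += [f'"{base}" {p}' for p in predicates]  # title + predicate phrase
    return queries

queries = make_queries("Andrew Murray", "birth_date")
```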

31 Weighting Extractions  Which extractions are more relevant? Features:  number of sentences between the extraction sentence and the closest occurrence of the title ("andrew murray")  rank of the page in Google's result list  Kylin's extractor confidence
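A toy combination of the three signals might look like this. The weights and the 1/(1+x) transforms are illustrative assumptions; the actual system combines the signals by learning rather than by hand:

```python
def score_extraction(sentence_distance, page_rank, confidence,
                     w_dist=0.3, w_rank=0.2, w_conf=0.5):
    """Combine the three relevance signals into one score."""
    dist_score = 1.0 / (1 + sentence_distance)  # nearer the title mention is better
    rank_score = 1.0 / (1 + page_rank)          # higher-ranked pages are better
    return w_dist * dist_score + w_rank * rank_score + w_conf * confidence

# An extraction next to the title mention on the top-ranked page beats one
# ten sentences away on the tenth result, at equal extractor confidence.
near = score_extraction(sentence_distance=0, page_rank=0, confidence=0.9)
far = score_extraction(sentence_distance=10, page_rank=9, confidence=0.9)
```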

32 Web Extraction Experiment  Extractor confidence alone performs poorly; the weighted combination is best

33 Combining Wikipedia & Web Recall Benefit from Shrinkage / Retraining…

34 Combining Wikipedia & Web Benefit from Shrinkage + Retraining + Web

35 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

36 Problem  Information extraction is imprecise, and Wikipedians don't want 90% precision  How to improve precision? People!

37 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

38 Intelligence in Wikipedia  What is IWP? › A project/system that aims to combine  IE (Information Extraction)  CCC (communal content creation)

39 Information Extraction  Examples: Zoominfo.com, Flipdog.com, Citeseer, Google  Advantage: autonomy  Disadvantage: expensive

40 IE system contributors  Contributors in this room? › Wikipedia IE systems › Citeseer › Rexa › DBlife

41 Communal Content Creation  Examples: Wikipedia, eBay, Netflix  Advantage: more accurate than IE  Disadvantage: bootstrapping, incentives, and management

42 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

43 Virtuous Cycle

44 Contributing as a Non-Primary Task  Encourage contributions  Without annoying or abusing readers › Compared 5 different interfaces

45

46 Results  Contribution rate: 1.6% → 13%  90% of positive labels were correct

47 Outline Background: Kylin Extraction Long-Tailed Challenges: Sparse infobox classes; Incomplete articles Moving Down the Long Tail: Shrinkage; Retraining; Extracting from the Web Problems with Information Extraction IWP (Intelligence in Wikipedia) CCC and IE Virtuous Cycle IWP (Shrinkage, Retraining, and Extracting from the Web) Multilingual Extraction Summary

48 IWP and Shrinkage, Retraining, and Extracting from the Web  Shrinkage improves IWP's precision and recall  Retraining improves the robustness of IWP's extractors  Web extraction further helps IWP's performance

49 Multi-Lingual Extraction  Idea: further leverage the virtuous feedback cycle  Use IE methods to add or update missing information by copying it from one language edition to another  Use CCC to validate and improve the updates  Example: Nombre = "Jerry Seinfeld" and Name = "Jerry Seinfeld"; Cónyuge = "Jessica Sklar" and Spouse = "Jessica Seinfeld"
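The cross-language reconciliation in the example can be sketched as a diff over aligned infoboxes: values missing on one side become copy candidates, disagreeing values become conflicts for human (CCC) review. Function and variable names here are illustrative:

```python
def cross_language_diff(en_box, es_box, attribute_map):
    """Compare an English infobox with its Spanish counterpart.

    `attribute_map` pairs attribute names across languages
    (e.g. Nombre/Name, Cónyuge/Spouse).
    """
    to_copy, conflicts = {}, {}
    for es_attr, en_attr in attribute_map.items():
        es_val, en_val = es_box.get(es_attr), en_box.get(en_attr)
        if es_val and not en_val:
            to_copy[en_attr] = es_val      # fill the gap from the other language
        elif es_val and en_val and es_val != en_val:
            conflicts[en_attr] = (es_val, en_val)  # flag for CCC review
    return to_copy, conflicts

es_box = {"Nombre": "Jerry Seinfeld", "Cónyuge": "Jessica Sklar"}
en_box = {"Name": "Jerry Seinfeld", "Spouse": "Jessica Seinfeld"}
to_copy, conflicts = cross_language_diff(
    en_box, es_box, {"Nombre": "Name", "Cónyuge": "Spouse"})
```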

50 Summary  Kylin's initial performance on sparse classes is unacceptable  Methods for increasing recall: shrinkage; retraining; extraction from the Web

51 Summary  IWP: developing AI methods to facilitate the growth, operation, and use of Wikipedia  Initial goal: extraction of a giant knowledge base of semantic triples, supporting faceted browsing and input to a reasoning-based question-answering system  How? IE + CCC

52 Questions

