Information Extraction from Wikipedia: Moving Down the Long Tail

Fei Wu, Raphael Hoffmann, Daniel S. Weld
Department of Computer Science & Engineering, University of Washington, Seattle, WA, USA

Motivating Vision
Next-Generation Search = Information Extraction + Ontology + Inference
Query: Which performing artists were born in Chicago?
Source sentences:
  … Bob was born in Northwestern Memorial Hospital. …
  … Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …
  … Bob Black is an active actor who was selected as this year’s …
The Semantic Web is a great idea: it makes web content machine-readable as well, which enables software agents to find, share, and integrate information more easily.

Next-Generation Search
Source sentences:
  … Bob was born in Northwestern Memorial Hospital. …
  … Bob Black is an active actor who …
  … Northwestern Memorial Hospital is one of the country’s leading hospitals in Chicago …
Information Extraction: <Bob, Born-In, NMH> <Bob Black, ISA, actor> <NMH, in, Chicago> …
Ontology: Actor ISA Performing Artist …
Inference: Born-In(A) ^ PartOf(A,B) => Born-In(B) …
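The three ingredients on this slide can be illustrated with a toy example. The following is a minimal sketch, not the authors' system: the facts mirror the slide, but it assumes that "Bob" and "Bob Black" are the same entity and that <NMH, in, Chicago> can be read as a PartOf fact. The function names `infer` and `born_in` are hypothetical.

```python
# Toy illustration of Information Extraction + Ontology + Inference.
# Facts mirror the slide; "Bob" and "Bob Black" are assumed to be one entity.
extracted = {
    ("Bob Black", "Born-In", "NMH"),
    ("Bob Black", "ISA", "actor"),
    ("NMH", "PartOf", "Chicago"),  # assumed reading of <NMH, in, Chicago>
}
ontology = {("actor", "ISA", "performing artist")}


def infer(facts):
    """Apply two rules until nothing new is derived:
    ISA transitivity, and Born-In(x, A) ^ PartOf(A, B) => Born-In(x, B)."""
    facts = set(facts)
    while True:
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                if o != s2:
                    continue
                if p == "ISA" and p2 == "ISA":
                    new.add((s, "ISA", o2))
                if p == "Born-In" and p2 == "PartOf":
                    new.add((s, "Born-In", o2))
        if new <= facts:
            return facts
        facts |= new


def born_in(cls, place):
    """Return entities of class `cls` that were born in `place`."""
    facts = infer(extracted | ontology)
    return sorted(s for (s, p, o) in facts
                  if p == "ISA" and o == cls and (s, "Born-In", place) in facts)


print(born_in("performing artist", "Chicago"))  # ['Bob Black']
```

No single source sentence answers the query; the answer only appears once the extracted triples, the ontology link, and the inference rule are combined.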

Next-Generation Search
Information Extraction + Ontology + Inference

Wikipedia – Bootstrap for the Web
Information Extraction + Ontology + Inference
Goal: search over the Web
Now: search over Wikipedia
  Comprehensive
  High-quality
  (Semi-)structured data

Outline
Background: Kylin Extraction [Wu & Weld CIKM07]
Long-Tailed Challenges
  Sparse infobox classes
  Incomplete articles
Moving Down the Long Tails
  Shrinkage
  Retraining
  Extracting from the Web
Conclusion

Kylin: Information Extraction from Wikipedia [Wu & Weld CIKM07]
Self-supervised learning -> autonomous
Form training dataset based on infoboxes
Extract semantic relations from Wikipedia articles
Example article text:
  Clearfield County was created on 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.
  Its county seat is Clearfield.
  2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
  As of 2005, the population density was 28.2/km².
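The idea of forming a training set from infoboxes can be sketched as follows. This is a simplified illustration under stated assumptions, not Kylin's actual matching procedure: it uses plain substring matching, a naive sentence splitter, and a hypothetical infobox whose attribute names (`year_founded`, `population_density`) are invented for the example.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_training_set(infobox, article_text):
    """Label each sentence with the infobox attributes whose value it contains.
    Sentences with an empty label list can serve as negative examples."""
    examples = []
    for sentence in split_sentences(article_text):
        labels = [attr for attr, value in infobox.items() if value in sentence]
        examples.append((sentence, labels))
    return examples


article = ("Clearfield County was created on 1804 from parts of Huntingdon "
           "and Lycoming Counties. Its county seat is Clearfield. "
           "As of 2005, the population density was 28.2/km².")
# Hypothetical infobox for the article above.
infobox = {"year_founded": "1804", "population_density": "28.2/km²"}

for sentence, labels in build_training_set(infobox, article):
    print(labels, "<-", sentence)
```

Because the labels come from the infobox rather than from a human annotator, the training data costs nothing to produce, which is what makes the approach autonomous.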

Long-Tail 1: Sparse Infobox Class
Kylin performs well on popular classes:
  Precision: mid-70% to high-90%
  Recall: low-50% to mid-90%
Kylin flounders on sparse classes, which offer little training data:
  1442/1756 (82%) of infobox classes have fewer than 100 instances
  709/1756 (40%) have fewer than 10 instances
  [July 2007 snapshot of Wikipedia]

Long-Tail 2: Incomplete Articles
Desired information is missing from Wikipedia
About 800,000 of 1,800,000 pages (44.2%) are stubs [July 2007 snapshot of Wikipedia]
[Figure: plot with axes labeled Length and ID]

Search over Wikipedia’s Whole Spectrum

Shrinkage [McCallum et al., ICML98]
Classes (count in parentheses): person (1201), performer (44), actor (8738), comedian (106)
Attribute names used across these classes: .location, .birthplace, .birth_place, .cityofbirth, .origin

Shrinkage
KOG (Kylin Ontology Generator) [Wu & Weld, WWW08]
Classes (count in parentheses): person (1201), performer (44), actor (8738), comedian (106)
Attribute names mapped across these classes: .birth_place, .location, .birthplace, .birth_place, .cityofbirth, .origin
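One way to picture shrinkage is as borrowing training sentences for a sparse class from related classes in the ontology. The sketch below is illustrative only: the subclass links, the pairing of attribute names with classes, and the 1/(1 + distance) down-weighting are all assumptions made for this example, since the slide lists the names but not the weighting scheme or which class owns which attribute.

```python
# Hypothetical subclass-of links (child -> parent).
PARENT = {"performer": "person", "actor": "performer", "comedian": "performer"}

# Hypothetical attribute mappings: (class, attribute) -> (target class, target attribute).
ATTR_MAP = {
    ("person", "birth_place"): ("performer", "location"),
    ("actor", "birthplace"): ("performer", "location"),
    ("comedian", "origin"): ("performer", "location"),
}


def ancestors(cls):
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def distance(a, b):
    """Number of subclass-of edges between a and b (both in the same tree)."""
    pa, pb = [a] + ancestors(a), [b] + ancestors(b)
    common = next(c for c in pa if c in pb)
    return pa.index(common) + pb.index(common)


def shrink(target, target_attr, examples):
    """Collect weighted training sentences for target.target_attr.
    examples: list of (class, attribute, sentence)."""
    out = []
    for cls, attr, sentence in examples:
        if cls == target and attr == target_attr:
            out.append((sentence, 1.0))  # the class's own data, full weight
        elif ATTR_MAP.get((cls, attr)) == (target, target_attr):
            # Borrowed data, down-weighted by ontology distance (assumed scheme).
            out.append((sentence, 1.0 / (1 + distance(target, cls))))
    return out


examples = [
    ("performer", "location", "s1"),
    ("person", "birth_place", "s2"),
    ("actor", "birthplace", "s3"),
    ("actor", "spouse", "s4"),
]
print(shrink("performer", "location", examples))
```

With only 44 performer instances but 8738 actor instances, even heavily down-weighted borrowed sentences greatly enlarge the training set for the sparse class.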

Shrinkage Experiment Settings
Dataset: 07/16/2007 snapshot of en.wikipedia.org
Testing cases: [table not preserved]

Shrinkage Experiment

Retraining
Complementary to Shrinkage: harvest extra training data from the broader Web
Key: how to identify relevant sentences in the sea of Web data?
TextRunner@UW example:
  Andrew Murray was born in Scotland in 1828 ……
  <Andrew Murray, was born in, Scotland>
  <Andrew Murray, was born in, 1828>

Retraining
Query TextRunner for relevant sentences:
Kylin Extraction:
  t=<Ada Cambridge, location, “St Germans , Norfolk , England”>
TextRunner Extraction:
  r1=<Ada Cambridge, was born in, England>
    Ada Cambridge was born in England in 1844 and moved to Australia with her curate husband in 1870.
  r2=<Ada Cambridge, was born in, “Norfolk , England”>
    Ada Cambridge was born in Norfolk , England , in …
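The matching between a Kylin tuple and TextRunner tuples can be sketched as below. The relevance test used here (same subject, and every token of the TextRunner object appears in the Kylin attribute value) is an assumption chosen to fit the slide's example, not a statement of the authors' criterion. The tuple `r3` is hypothetical and added only to show a non-match.

```python
def tokens(text):
    """Lowercased word set, ignoring commas."""
    return {tok.lower() for tok in text.replace(",", " ").split()}


def is_relevant(kylin_tuple, tr_tuple):
    """Assumed test: subjects agree and the TextRunner object is covered
    by the Kylin attribute value."""
    subj, _attr, value = kylin_tuple
    tr_subj, _pred, tr_obj = tr_tuple
    obj_tokens = tokens(tr_obj)
    return (subj.lower() == tr_subj.lower()
            and bool(obj_tokens)
            and obj_tokens <= tokens(value))


t = ("Ada Cambridge", "location", "St Germans , Norfolk , England")
r1 = ("Ada Cambridge", "was born in", "England")
r2 = ("Ada Cambridge", "was born in", "Norfolk , England")
r3 = ("Ada Cambridge", "moved to", "Australia")  # hypothetical non-matching tuple

relevant = [r for r in (r1, r2, r3) if is_relevant(t, r)]
predicates = sorted({pred for _s, pred, _o in relevant})
print(predicates)  # ['was born in']
```

The sentences behind the matching tuples become extra training data for the `location` extractor, and the harvested predicates indicate which Web sentences are worth retrieving.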

Retraining Experiment

Extraction from the Web
Idea: apply Kylin extractors trained on Wikipedia to general Web pages
Challenge: maintain high precision
  General Web pages are noisy
  Many Web pages describe multiple objects
Key: retrieve relevant sentences
Procedure:
  Generate a set of search engine queries
  Retrieve top-k pages from Google
  Weight extractions from these pages

Choosing Queries
Example: get the birth date attribute for the article titled “Andrew Murray (minister)”
  “andrew murray”
  “andrew murray” birth date (attribute name)
  “andrew murray” was born in (predicates from TextRunner)
  “andrew murray” …
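The query pattern in this example can be sketched as a small helper. This is an illustration under assumptions: the function name `make_queries` is hypothetical, and stripping the parenthetical disambiguator and lowercasing the title are inferred from the example rather than stated on the slide.

```python
import re


def make_queries(article_title, attribute_name, predicates):
    """Build search queries: the quoted title alone, the title plus the
    attribute name, and the title plus each predicate."""
    # Drop a trailing disambiguator such as "(minister)" and lowercase.
    name = re.sub(r"\s*\([^)]*\)\s*$", "", article_title).lower()
    quoted = f'"{name}"'
    queries = [quoted, f"{quoted} {attribute_name}"]
    queries += [f"{quoted} {p}" for p in predicates]
    return queries


for q in make_queries("Andrew Murray (minister)", "birth date", ["was born in"]):
    print(q)
```

Quoting the title keeps the search engine from matching pages that merely mention "andrew" and "murray" separately, which helps with the precision challenge.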

Weighting Extractions
Which extractions are more relevant?
Features:
  Number of sentences between the sentence and the closest occurrence of the title (‘andrew murray’)
  Rank of the page in Google’s result lists
  Kylin’s extractor confidence
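A weighted combination of these three features might look like the sketch below. The linear form, the feature transformations, and the weight values are assumptions for illustration; the slide names the features but gives no formula. The candidate values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    value: str
    sentence_distance: int  # sentences to the closest occurrence of the title
    result_rank: int        # 1-based rank of the page in the search results
    confidence: float       # extractor confidence in [0, 1]


def score(e, w_dist=0.3, w_rank=0.2, w_conf=0.5):
    """Linear combination of the three features (weights are placeholders)."""
    closeness = 1.0 / (1 + e.sentence_distance)  # 1.0 when in the same sentence
    rank_score = 1.0 / e.result_rank             # 1.0 for the top-ranked page
    return w_dist * closeness + w_rank * rank_score + w_conf * e.confidence


def best(extractions):
    return max(extractions, key=score)


# Hypothetical candidates for the birth date attribute.
a = Extraction("1828", sentence_distance=0, result_rank=1, confidence=0.6)
b = Extraction("1901", sentence_distance=5, result_rank=8, confidence=0.9)
print(best([a, b]).value)  # 1828
```

Here the candidate with the higher extractor confidence loses because it appears far from the title on a low-ranked page, which illustrates why confidence alone can mislead.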

Web Extraction Experiment
Extractor confidence alone performs poorly
Weighted combination performs best

Combining Wikipedia & Web
[Figure: comparison of two configurations whose labels were lost in extraction; surviving labels: Recall, AUC]
Benefit from Shrinkage / Retraining…

Combining Wikipedia & Web
[Figure: comparison of two configurations whose labels were lost in extraction; surviving label: AUC]
Benefit from Shrinkage + Retraining + Web

Summary
IE from Wikipedia: Moving Down the Long Tail
Sparse infobox classes:
  Shrinkage
  Retraining based on TextRunner
Incomplete articles:
  Extracting from the Web

Next-Generation Search = Information Extraction + Ontology + Inference

Next-Generation Search = Information Extraction
[Wu & Weld CIKM07]

Next-Generation Search = Information Extraction + Ontology + Inference
Ontology: KOG [Wu & Weld WWW08]

Related Work
Unsupervised Information Extraction
  SNOWBALL [Agichtein & Gravano ICDL00]
  MULDER [Kwok et al. TOIS01]
  AskMSR [Brill et al. EMNLP02]
Ontology Driven Information Extraction
  SemTag and Seeker [Dill WWW03]
  PANKOW [Cimiano WWW05]
  OntoSyphon [McDowell & Cafarella ISWC06]
Other Wikipedia Systems
  Yago [Suchanek et al. WWW07]
  DBpedia [Auer & Lehmann ESWC07]
  Wikipedia Reputation System [Adler & Alfaro WWW07]
