Presentation on theme: "Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology: Information Extraction from Wikipedia: Moving Down the Long Tail" — Presentation transcript:

1 Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. Information Extraction from Wikipedia: Moving Down the Long Tail. Presenter: Cheng-Feng Weng. Authors: Fei Wu, Raphael Hoffmann, Daniel S. Weld. KDD 2008. Presented 2008/11/18.

2 Outline: Motivation, Objective, Methods and Experiments, Conclusion, Comments

3 Introduction: KYLIN automatically constructs and completes infoboxes for Wikipedia articles.

4 Motivation: The number of article instances per infobox class has a long-tailed distribution. Many articles simply do not have much information to extract.

5 Objective: This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes:
- Shrinkage over an automatically learned subsumption taxonomy
- A retraining technique for improving the training data
- Supplementing results by extracting from the broader Web

6 Shrinkage: This paper uses shrinkage when training an extractor for an instance-sparse infobox class by aggregating training data from its parent and child classes. (Figure: example subsumption taxonomy with Person, Scientist, Performer, Actor, and Comedian classes, annotated with Person.birth_place = Taiwan and Performer.location = ?)
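To make the aggregation concrete, here is a minimal Python sketch (not KYLIN's actual code), assuming each class object exposes hypothetical name and sentences fields; the real extractors are learned classifiers, and this only illustrates how weighted examples from parent and child classes could be pooled.

```python
# Minimal sketch of shrinkage-style pooling: the sparse target class keeps its
# own examples at full weight, while related classes contribute down-weighted ones.
from dataclasses import dataclass

@dataclass
class Example:
    sentence: str      # training sentence labeled with an attribute value
    source_class: str  # infobox class the sentence came from
    weight: float      # importance of the example when training the extractor

def pooled_training_set(target, related_classes, weight_fn):
    """Aggregate the target class's examples with weighted ones from its relatives."""
    pool = [Example(s, target.name, 1.0) for s in target.sentences]
    for cls in related_classes:               # e.g. Person, Actor, Comedian for Performer
        w = weight_fn(cls)                    # one of the weighting strategies on slide 8
        pool.extend(Example(s, cls.name, w) for s in cls.sentences)
    return pool
```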

7 Shrinkage Using the KOG Ontology: The Kylin Ontology Generator (KOG) is an autonomous system that builds a rich ontology by combining Wikipedia infoboxes with WordNet using statistical-relational machine learning [27]. The overall shrinkage procedure is as follows:
- Collect the related class set
- Query KOG for the mapped attributes
- Assign weights to the training examples
(Figure: the Person/Scientist and Performer/Actor/Comedian taxonomy from the previous slide.)
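The second step can be pictured as a schema lookup: KOG tells the trainer which attribute of a related class corresponds to which attribute of the target class. The hand-written ATTRIBUTE_MAP below merely stands in for that KOG query, and the specific mapping shown is illustrative only.

```python
# Stand-in for a KOG lookup: map (related class, attribute) pairs onto the
# target class's attribute names so their training examples can be reused.
ATTRIBUTE_MAP = {
    ("Person", "birth_place"): "location",   # illustrative mapping only
}

def map_examples(related_class, labeled_sentences):
    """Relabel (attribute, sentence) pairs from a related class for the target class."""
    mapped = []
    for attribute, sentence in labeled_sentences:
        target_attribute = ATTRIBUTE_MAP.get((related_class, attribute))
        if target_attribute is not None:     # keep only attributes KOG can map
            mapped.append((target_attribute, sentence))
    return mapped
```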

8 Shrinkage Experiments: Three strategies are considered for determining the weights:
- Uniform: W = 1
- Size adjusted: W = min{1, k/(|C|+1)}
- Precision directed: W = p (the extraction precision)
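Written out as code, the three strategies might look as follows; |C| is taken to be the number of training examples contributed by the related class C, k is an assumed tuning constant, p is that class's measured extraction precision, and the examples and extraction_precision fields are hypothetical.

```python
def uniform_weight(C):
    return 1.0                                   # W = 1

def size_adjusted_weight(C, k=10.0):             # k is an assumed tuning constant
    return min(1.0, k / (len(C.examples) + 1))   # W = min{1, k/(|C|+1)}

def precision_directed_weight(C):
    return C.extraction_precision                # W = p, the extraction precision
```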

9 Shrinkage Experiments (cont.)

10 Retraining: A complementary idea is to harvest additional training data from the outside Web. It utilizes TextRunner, which extracts relations from a crawl of about 100 million Web pages.
- TextRunner's crawl includes the top ten pages returned by Google.

11 Using TextRunner for Retraining: The retrainer uses the mapped set C.a from TextRunner to augment and clean the training data for C's extractors in two ways:
- Adding positive examples
- Filtering negative examples
(Figure: the most common positive examples.)
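A minimal sketch of the two retraining steps, assuming TextRunner output arrives as (subject, relation, value) tuples and that simple substring matching decides whether a sentence mentions a Web-confirmed value; the paper's actual matching criteria are more careful than this.

```python
def retrain_training_data(positives, negatives, textrunner_tuples, subject):
    """Augment positives and clean negatives for one attribute using Web evidence."""
    confirmed_values = {value for subj, _rel, value in textrunner_tuples
                        if subj == subject}      # values TextRunner asserts for this entity
    # Adding positive examples: sentences that mention a confirmed value are
    # promoted from the (default-negative) pool to the positive set.
    promoted = [s for s in negatives if any(v in s for v in confirmed_values)]
    # Filtering negative examples: those same sentences are dropped from the
    # negative set, since they may actually express the attribute.
    cleaned_negatives = [s for s in negatives if s not in promoted]
    return positives + promoted, cleaned_negatives
```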

12 Retraining Experiments

13 Extracting from the Web: Extractors are trained on Wikipedia articles and applied to relevant Web pages. This involves:
- Choosing search engine queries
- Weighting extractions
- Combining Wikipedia and Web extractions

14 Extracting from the Web (cont.): Choosing search engine queries, weighting extractions, and combining Wikipedia and Web extractions. (Figure: a set of example queries, e.g., "Andrew Murray", "Andrew Murray" birth date, birthday of Andrew Murray.)
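A sketch of the query-generation and combination steps; the query templates mirror the slide's Andrew Murray example, while the discount factor and the max-based combination rule are assumptions rather than the paper's exact scheme.

```python
def make_queries(entity, attribute):
    """Generate a small set of search-engine queries for one missing attribute."""
    return [
        f'"{entity}"',                 # the entity alone
        f'"{entity}" {attribute}',     # e.g. "Andrew Murray" birth date
        f'{attribute} of {entity}',    # e.g. birth date of Andrew Murray
    ]

def combine_scores(wiki_score, web_score, web_discount=0.5):
    """Favor Wikipedia-internal evidence and discount Web evidence (discount assumed)."""
    if wiki_score is None and web_score is None:
        return None
    if wiki_score is None:
        return web_discount * web_score
    if web_score is None:
        return wiki_score
    return max(wiki_score, web_discount * web_score)
```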

15 Web Experiments

16 Combining Experiments

17 Conclusions: This paper describes three powerful methods for increasing recall with respect to the long-tail challenges above: shrinkage, retraining, and supplementing Wikipedia extractions with those from the Web.

18 Comments:
- Advantage: It uses a good idea to overcome the long-tail problem.
- Drawback: It is only about improving the performance of the KYLIN system the authors developed.
- Application: Constructing a knowledge network.

