
1 Web-scale Information Extraction in KnowItAll Oren Etzioni et al., U. of Washington, WWW’2004 Presented by Zheng Shao, CS591CXZ

2 Outline
 Motivation
 System Architecture
 Detailed Techniques
 Search Engine Interface
 Extractor
 Probabilistic Assessment
 Experimental Results
 Future Work
 Conclusion

3 Motivation
 Why Web-scale Information Extraction?
 The Web is the largest knowledge base.
 Extracting information by searching the Web is not easy: e.g., list the cities in the world whose population is above 400,000, or the humans who have visited space.
 Unless we find the “right” document, this becomes a tedious, error-prone process of piecemeal search.

4 Motivation (2)
 Previous Information Extraction Work
 Supervised Learning
 Difficult to scale to the Web:
 the diversity of the Web
 the prohibitive cost of creating an equally diverse set of hand-tagged documents
 Weakly Supervised and Bootstrapping
 Needs domain-specific seeds
 Learns rules from seeds, then uses the rules to find new seeds, and so on
 KnowItAll
 Domain-independent
 Uses the bootstrapping technique

5 System Architecture
 4 Components
 Extractor
 Search Engine Interface
 Assessor
 Database
 Data Flow (diagram)

6 System Architecture
 System Work Flow: Search Engine Interface → Extractor → Assessor → Database
 Rule template (NP1: noun phrase; NPList2: list of noun phrases):
NP1 “such as” NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) => instanceOf(Class1, head(each(NPList2)))
 Instantiated rule for the class Country:
NP1 “such as” NPList2 & head(NP1) = “countries” & properNoun(head(each(NPList2))) => instanceOf(Country, head(each(NPList2)))
 Keywords: “countries such as”

7 System Architecture
 System Work Flow (continued): Web pages matching the keywords yield extractions such as “the United Kingdom and Canada”, “India”, “North Korea, Iran, India and Pakistan”, “Japan”, “Iraq, Italy and Spain”, …
 Extracted instances: the United Kingdom, Canada, India, North Korea, Iran, …
 Discriminator phrases (the Assessor checks their hit frequency):
 Country AND X / “Countries such as X”
 Country AND the United Kingdom / “Countries such as the United Kingdom”
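The discriminator step on this slide can be sketched as follows: for a candidate instance, the Assessor issues one query per discriminator phrase (with the candidate substituted for X), plus the bare candidate itself, whose hit count later serves as a PMI denominator. A minimal sketch; the helper name and the phrase set are assumptions, not the paper's API.

```python
def discriminator_queries(class_name, instance, phrases):
    """Build the search queries the Assessor would issue for one
    candidate instance: the class co-occurrence query ("Country AND X"),
    one query per discriminator phrase, and the bare instance.
    (Hypothetical helper; phrase list is an assumption.)"""
    queries = [f"{class_name} AND {instance}"]
    # Substitute the candidate for the X placeholder in each phrase.
    queries += [p.replace("X", instance) for p in phrases]
    queries.append(instance)  # denominator for PMI
    return queries

qs = discriminator_queries("Country", "the United Kingdom",
                           ['"Countries such as X"'])
print(qs)
# ['Country AND the United Kingdom',
#  '"Countries such as the United Kingdom"',
#  'the United Kingdom']
```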

8 System Architecture
 Search Engine Interface
 Distributes jobs to different search engines
 Extractor
 Rule instantiation
 Information extraction
 Assessor
 Discriminator phrase construction
 Assessment of extracted information

9 Search Engine Interface
 Metaphor: the information food chain
 Search engine → herbivore
 KnowItAll → carnivore
 Why build on top of search engines?
 No need to duplicate existing work
 Low cost/time/effort
 Query Distribution
 Make sure not to overload the search engines
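The query-distribution idea above can be sketched as a round-robin dispatcher that enforces a per-engine cooldown, so no single engine is overloaded. This is an illustrative sketch only; the engine names and the cooldown interval are assumptions, not details from the paper.

```python
import time
from collections import deque

class QueryDistributor:
    """Rotate queries across search engines, respecting a minimum
    interval per engine so none is overloaded. (Sketch; engine names
    and interval are assumptions.)"""

    def __init__(self, engines, min_interval_s=1.0):
        self.engines = deque(engines)
        self.min_interval_s = min_interval_s
        self.last_query = {e: 0.0 for e in engines}

    def next_engine(self):
        # Try each engine once, in round-robin order, and pick the
        # first whose cooldown has expired.
        for _ in range(len(self.engines)):
            engine = self.engines[0]
            self.engines.rotate(-1)
            if time.monotonic() - self.last_query[engine] >= self.min_interval_s:
                self.last_query[engine] = time.monotonic()
                return engine
        return None  # every engine is still cooling down

dist = QueryDistributor(["google", "altavista", "fast"], min_interval_s=0.0)
print(dist.next_engine())  # google
```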

10 Extractor
 Extraction Template Examples
 NP1 {“,”} “such as” NPList2
 NPList1 {“,”} “and other” NP2
 NP1 {“,”} “is a” NP2
 All are domain-independent!
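A rough feel for the first template ("NP1 such as NPList2") can be given with a regular expression. KnowItAll uses a real noun-phrase chunker; here, as a stand-in assumption, capitalized word sequences approximate proper-noun phrases.

```python
import re

# Crude regex approximation of the "NP1 such as NPList2" template.
# Capitalized word runs stand in for proper-noun phrases (an assumption;
# KnowItAll uses an actual NP parser, not regexes).
SUCH_AS = re.compile(
    r"\b([a-z]+)\s+such as\s+"
    r"((?:[A-Z]\w*(?:\s+[A-Z]\w*)*(?:,\s*|\s+and\s+)?)+)"
)

def extract_instances(sentence):
    """Return (class word, candidate instances) or (None, [])."""
    m = SUCH_AS.search(sentence)
    if not m:
        return None, []
    class_word = m.group(1)
    items = re.split(r",\s*|\s+and\s+", m.group(2).strip())
    return class_word, [i for i in items if i]

print(extract_instances("He visited countries such as France, Spain and Italy."))
# ('countries', ['France', 'Spain', 'Italy'])
```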

11 Extractor (2)
 Noun phrase analysis
 A. “China is a country in Asia”
 B. “Garth Brooks is a country singer”
 In A, the word “country” is the head of a simple noun phrase.
 In B, the word “country” is not the head of the simple noun phrase (“singer” is).
 So China is indeed a country, while Garth Brooks is not.
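The head test on this slide can be approximated with a very simple heuristic: in a simple English noun phrase, the head is the last word. This is a sketch under that assumption; KnowItAll relies on a real noun-phrase chunker, not this helper.

```python
def is_head_of_np(noun_phrase, word):
    """Crude head test: the head of a simple English NP is its last
    word. (Heuristic sketch; not the paper's actual NP analysis.)"""
    return noun_phrase.split()[-1].lower() == word.lower()

# "China is a country" -> object NP "country": head matches "country".
print(is_head_of_np("country", "country"))         # True
# "Garth Brooks is a country singer" -> object NP "country singer":
# the head is "singer", so "country" is rejected.
print(is_head_of_np("country singer", "country"))  # False
```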

12 Extractor (3)
 Rule Template:
 NP1 “such as” NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) => instanceOf(Class1, head(each(NPList2)))
 The Extractor generates a rule for “Country” from this template by substituting “Country” for “Class1”.
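The substitution step described above can be sketched as plain string rewriting: replace the plural-name expression with the class's plural, replace the remaining `Class1` occurrences with the class name, and derive the search keywords. The function name and parameters are illustrative assumptions, not the paper's interface.

```python
def instantiate_rule(template, class_name, plural_name):
    """Substitute a concrete class into the generic rule template and
    derive the keyword phrase sent to the search engine.
    (Hypothetical helper; field names are assumptions.)"""
    rule = template.replace("plural(name(Class1))", f'"{plural_name}"')
    rule = rule.replace("Class1", class_name)
    keywords = f"{plural_name} such as"
    return rule, keywords

template = ('NP1 "such as" NPList2 & head(NP1) = plural(name(Class1)) '
            '& properNoun(head(each(NPList2))) '
            '=> instanceOf(Class1, head(each(NPList2)))')
rule, keywords = instantiate_rule(template, "Country", "countries")
print(keywords)  # countries such as
```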

13 Assessor
 Naïve Bayes model
 Features: hit counts returned by the search engine
 Event: whether the extracted information is a fact
 Adjusting the threshold
 Trade-off between precision and recall
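The naïve Bayes combination behind the Assessor can be written down directly: multiply the prior by each feature's likelihood under "fact" and under "not fact", then normalize. A minimal sketch assuming binary features and made-up likelihood values; the real system learns these from bootstrapped data.

```python
def assess(prior, likelihoods):
    """Naive Bayes combination of discriminator features.
    likelihoods: list of (P(f_i | fact), P(f_i | not fact)) pairs for
    the features that fired. Returns P(fact | features).
    (Sketch; the numbers below are illustrative, not learned.)"""
    p_fact, p_not = prior, 1.0 - prior
    for p_f_given_fact, p_f_given_not in likelihoods:
        p_fact *= p_f_given_fact
        p_not *= p_f_given_not
    # Normalize so the two hypotheses sum to 1.
    return p_fact / (p_fact + p_not)

# Two discriminators that fire far more often for true facts:
p = assess(0.5, [(0.8, 0.2), (0.7, 0.3)])
print(round(p, 3))  # 0.903
```

Moving the acceptance threshold on this probability is exactly the precision/recall trade-off the slide mentions: a high threshold keeps only confident extractions (high precision, lower recall), a low one keeps more of them.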

14 Assessor (2)
 Use bootstrapping to learn P(fi|Φ) and P(fi|¬Φ)
 Define PMI(I, D) = |Hits(D + I)| / |Hits(I)|
 I: the extracted NP
 D: discriminator phrase
 4 forms for P(fi|Φ) and P(fi|¬Φ):
 Hits-Thresh: P(hits > Hits(D + I) | Φ)
 Hits-Density: p(hits = Hits(D + I) | Φ)
 PMI-Thresh: P(pmi > PMI(I, D) | Φ)
 PMI-Density: p(pmi = PMI(I, D) | Φ)
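The PMI score itself is just a ratio of hit counts. A minimal sketch; in the real system both counts come from search-engine queries, whereas here they are passed in directly, and the example counts are invented for illustration.

```python
def pmi(hits_d_plus_i, hits_i):
    """PMI(I, D) = |Hits(D + I)| / |Hits(I)|, per slide 14.
    hits_d_plus_i: hits for the discriminator phrase with the instance
    substituted in; hits_i: hits for the instance alone.
    (Counts would come from a search engine; guarded against zero.)"""
    return hits_d_plus_i / hits_i if hits_i else 0.0

# Hypothetical counts for I = "France", D = "countries such as X":
print(pmi(1200, 3_000_000))  # 0.0004
```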

15 Experimental Results
 Precision vs. recall
 Thresh performs better than Density
 PMI performs better than Hits

16 Experimental Results (2)
 Run time: 4 days
 Web pages retrieved vs. time
 ~3000 pages/hour
 New facts vs. Web pages retrieved
 From 1 new fact per 3 pages down to 1 new fact per 7 pages

17 Conclusion & Future Work
 Conclusion:
 Domain-independent rule templates
 Rules generated from rule templates
 Built on top of search engines
 Assessor model: more data, higher accuracy
 Future work:
 Learn domain-specific rules to improve recall
 Automatically extend the ontology

18 Q & A  Thanks!

