
1 Web-scale Information Extraction in KnowItAll Oren Etzioni et al., U. of Washington, WWW’2004 Presented by Zheng Shao, CS591CXZ

2 Outline
 Motivation
 System Architecture
 Detailed Techniques
 Search Engine Interface
 Extractor
 Probabilistic Assessment
 Experimental Results
 Future Work
 Conclusion

3 Motivation
 Why Web-scale Information Extraction?
 The Web is the largest knowledge base.
 Extracting information by searching the Web is not easy: e.g., list the cities in the world whose population is above 400,000, or the humans who have visited space.
 Unless we find the “right” document, this becomes a tedious, error-prone process of piecemeal search.

4 Motivation (2)
 Previous Information Extraction Work
 Supervised Learning
 Difficult to scale to the Web:
 the diversity of the Web
 the prohibitive cost of creating an equally diverse set of hand-tagged documents
 Weakly Supervised and Bootstrapping
 Needs domain-specific seeds
 Learns rules from seeds, then uses the rules to find new seeds, and so on
 KnowItAll
 Domain-independent
 Uses the bootstrapping technique

5 System Architecture
 4 Components
 Extractor
 Search Engine Interface
 Assessor
 Database
 Data Flow (diagram)

6 System Architecture
 System Work Flow: Search Engine Interface → Extractor → Assessor → Database
 Rule template (NP1: noun phrase; NPList2: list of noun phrases):
NP1 “such as” NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) => instanceOf(Class1, head(each(NPList2)))
 Instantiated rule for the class Country:
NP1 “such as” NPList2 & head(NP1) = “countries” & properNoun(head(each(NPList2))) => instanceOf(Country, head(each(NPList2)))
 Keywords: “countries such as”

7 System Architecture
 System Work Flow (continued): Web pages matching the keywords yield extractions such as “the United Kingdom and Canada”, “India”, “North Korea, Iran, India and Pakistan”, “Japan”, “Iraq, Italy and Spain”, …
 Extracted instances: the United Kingdom, Canada, India, North Korea, Iran, …
 Discriminator phrases (the Assessor checks their hit frequency):
 Country AND X / “Countries such as X”
 Country AND the United Kingdom / “Countries such as the United Kingdom”
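The discriminator step on this slide can be sketched as follows: for a candidate instance, the Assessor issues one query per discriminator phrase (with the candidate substituted for X), plus the bare candidate itself, whose hit count later serves as a PMI denominator. A minimal sketch; the helper name and the phrase set are assumptions, not the paper's API.

```python
def discriminator_queries(class_name, instance, phrases):
    """Build the search queries the Assessor would issue for one
    candidate instance: the class co-occurrence query ("Country AND X"),
    one query per discriminator phrase, and the bare instance.
    (Hypothetical helper; phrase list is an assumption.)"""
    queries = [f"{class_name} AND {instance}"]
    # Substitute the candidate for the X placeholder in each phrase.
    queries += [p.replace("X", instance) for p in phrases]
    queries.append(instance)  # denominator for PMI
    return queries

qs = discriminator_queries("Country", "the United Kingdom",
                           ['"Countries such as X"'])
print(qs)
# ['Country AND the United Kingdom',
#  '"Countries such as the United Kingdom"',
#  'the United Kingdom']
```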

8 System Architecture
 Search Engine Interface
 Distributes jobs to different search engines
 Extractor
 Rule instantiation
 Information extraction
 Assessor
 Discriminator phrase construction
 Assessment of extracted information

9 Search Engine Interface
 Metaphor: the information food chain
 Search engine → herbivore
 KnowItAll → carnivore
 Why build on top of search engines?
 No need to duplicate existing work
 Low cost/time/effort
 Query Distribution
 Make sure not to overload the search engines
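The query-distribution idea above can be sketched as a round-robin dispatcher that enforces a per-engine cooldown, so no single engine is overloaded. This is an illustrative sketch only; the engine names and the cooldown interval are assumptions, not details from the paper.

```python
import time
from collections import deque

class QueryDistributor:
    """Rotate queries across search engines, respecting a minimum
    interval per engine so none is overloaded. (Sketch; engine names
    and interval are assumptions.)"""

    def __init__(self, engines, min_interval_s=1.0):
        self.engines = deque(engines)
        self.min_interval_s = min_interval_s
        self.last_query = {e: 0.0 for e in engines}

    def next_engine(self):
        # Try each engine once, in round-robin order, and pick the
        # first whose cooldown has expired.
        for _ in range(len(self.engines)):
            engine = self.engines[0]
            self.engines.rotate(-1)
            if time.monotonic() - self.last_query[engine] >= self.min_interval_s:
                self.last_query[engine] = time.monotonic()
                return engine
        return None  # every engine is still cooling down

dist = QueryDistributor(["google", "altavista", "fast"], min_interval_s=0.0)
print(dist.next_engine())  # google
```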

10 Extractor
 Extraction Template Examples
 NP1 {“,”} “such as” NPList2
 NPList1 {“,”} “and other” NP2
 NP1 {“,”} “is a” NP2
 All are domain-independent!
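A rough feel for the first template ("NP1 such as NPList2") can be given with a regular expression. KnowItAll uses a real noun-phrase chunker; here, as a stand-in assumption, capitalized word sequences approximate proper-noun phrases.

```python
import re

# Crude regex approximation of the "NP1 such as NPList2" template.
# Capitalized word runs stand in for proper-noun phrases (an assumption;
# KnowItAll uses an actual NP parser, not regexes).
SUCH_AS = re.compile(
    r"\b([a-z]+)\s+such as\s+"
    r"((?:[A-Z]\w*(?:\s+[A-Z]\w*)*(?:,\s*|\s+and\s+)?)+)"
)

def extract_instances(sentence):
    """Return (class word, candidate instances) or (None, [])."""
    m = SUCH_AS.search(sentence)
    if not m:
        return None, []
    class_word = m.group(1)
    items = re.split(r",\s*|\s+and\s+", m.group(2).strip())
    return class_word, [i for i in items if i]

print(extract_instances("He visited countries such as France, Spain and Italy."))
# ('countries', ['France', 'Spain', 'Italy'])
```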

11 Extractor (2)
 Noun phrase analysis
 A. “China is a country in Asia”
 B. “Garth Brooks is a country singer”
 In A, the word “country” is the head of a simple noun phrase.
 In B, the word “country” is not the head of the simple noun phrase (“singer” is).
 So China is indeed a country, while Garth Brooks is not.
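The head test on this slide can be approximated with a very simple heuristic: in a simple English noun phrase, the head is the last word. This is a sketch under that assumption; KnowItAll relies on a real noun-phrase chunker, not this helper.

```python
def is_head_of_np(noun_phrase, word):
    """Crude head test: the head of a simple English NP is its last
    word. (Heuristic sketch; not the paper's actual NP analysis.)"""
    return noun_phrase.split()[-1].lower() == word.lower()

# "China is a country" -> object NP "country": head matches "country".
print(is_head_of_np("country", "country"))         # True
# "Garth Brooks is a country singer" -> object NP "country singer":
# the head is "singer", so "country" is rejected.
print(is_head_of_np("country singer", "country"))  # False
```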

12 Extractor (3)
 Rule Template:
 NP1 “such as” NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) => instanceOf(Class1, head(each(NPList2)))
 The Extractor generates a rule for “Country” from this template by substituting “Country” for “Class1”.
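The substitution step described above can be sketched as plain string rewriting: replace the plural-name expression with the class's plural, replace the remaining `Class1` occurrences with the class name, and derive the search keywords. The function name and parameters are illustrative assumptions, not the paper's interface.

```python
def instantiate_rule(template, class_name, plural_name):
    """Substitute a concrete class into the generic rule template and
    derive the keyword phrase sent to the search engine.
    (Hypothetical helper; field names are assumptions.)"""
    rule = template.replace("plural(name(Class1))", f'"{plural_name}"')
    rule = rule.replace("Class1", class_name)
    keywords = f"{plural_name} such as"
    return rule, keywords

template = ('NP1 "such as" NPList2 & head(NP1) = plural(name(Class1)) '
            '& properNoun(head(each(NPList2))) '
            '=> instanceOf(Class1, head(each(NPList2)))')
rule, keywords = instantiate_rule(template, "Country", "countries")
print(keywords)  # countries such as
```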

13 Assessor
 Naïve Bayes model
 Features: hit counts returned by the search engine
 Event: whether the extracted information is a fact
 Adjusting the threshold
 Trade-off between precision and recall
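The naïve Bayes combination behind the Assessor can be written down directly: multiply the prior by each feature's likelihood under "fact" and under "not fact", then normalize. A minimal sketch assuming binary features and made-up likelihood values; the real system learns these from bootstrapped data.

```python
def assess(prior, likelihoods):
    """Naive Bayes combination of discriminator features.
    likelihoods: list of (P(f_i | fact), P(f_i | not fact)) pairs for
    the features that fired. Returns P(fact | features).
    (Sketch; the numbers below are illustrative, not learned.)"""
    p_fact, p_not = prior, 1.0 - prior
    for p_f_given_fact, p_f_given_not in likelihoods:
        p_fact *= p_f_given_fact
        p_not *= p_f_given_not
    # Normalize so the two hypotheses sum to 1.
    return p_fact / (p_fact + p_not)

# Two discriminators that fire far more often for true facts:
p = assess(0.5, [(0.8, 0.2), (0.7, 0.3)])
print(round(p, 3))  # 0.903
```

Moving the acceptance threshold on this probability is exactly the precision/recall trade-off the slide mentions: a high threshold keeps only confident extractions (high precision, lower recall), a low one keeps more of them.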

14 Assessor (2)
 Use bootstrapping to learn P(fi|Φ) and P(fi|¬Φ)
 Define PMI(I, D) = |Hits(D + I)| / |Hits(I)|
 I: the extracted NP
 D: discriminator phrase
 4 forms for P(fi|Φ) and P(fi|¬Φ):
 Hits-Thresh: P(hits > Hits(D + I) | Φ)
 Hits-Density: p(hits = Hits(D + I) | Φ)
 PMI-Thresh: P(pmi > PMI(I, D) | Φ)
 PMI-Density: p(pmi = PMI(I, D) | Φ)
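The PMI score itself is just a ratio of hit counts. A minimal sketch; in the real system both counts come from search-engine queries, whereas here they are passed in directly, and the example counts are invented for illustration.

```python
def pmi(hits_d_plus_i, hits_i):
    """PMI(I, D) = |Hits(D + I)| / |Hits(I)|, per slide 14.
    hits_d_plus_i: hits for the discriminator phrase with the instance
    substituted in; hits_i: hits for the instance alone.
    (Counts would come from a search engine; guarded against zero.)"""
    return hits_d_plus_i / hits_i if hits_i else 0.0

# Hypothetical counts for I = "France", D = "countries such as X":
print(pmi(1200, 3_000_000))  # 0.0004
```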

15 Experimental Results
 Precision vs. recall
 Thresh performs better than Density
 PMI performs better than Hits

16 Experimental Results (2)
 Run time: 4 days
 Web pages retrieved vs. time
 ~3000 pages/hour
 New facts vs. Web pages retrieved
 From 1 new fact per 3 pages down to 1 new fact per 7 pages

17 Conclusion & Future Work
 Conclusion:
 Domain-independent rule templates
 Rules generated from rule templates
 Built on top of search engines
 Assessor model: more data, higher accuracy
 Future work:
 Learn domain-specific rules to improve recall
 Automatically extend the ontology

18 Q & A  Thanks!

