Open Information Extraction from the Web Oren Etzioni

1 Open Information Extraction from the Web Oren Etzioni

2 Etzioni, University of Washington
KnowItAll Project (2003–): Rob Bart, Janara Christensen, Tony Fader, Tom Lin, Alan Ritter, Michael Schmitz; Dr. Niranjan Balasubramanian, Dr. Stephen Soderland, Prof. Mausam, Prof. Dan Weld.
PhD alumni: Michele Banko, Prof. Michael Cafarella, Prof. Doug Downey, Ana-Maria Popescu, Stefan Schoenmackers, and Prof. Alex Yates.
Funding: DARPA, IARPA, NSF, ONR, Google.

3 Outline
A "scruffy" view of Machine Reading
Open IE (overview, progress, new demo)
Critique of Open IE
Future work: Open, Open IE

4 I. Machine Reading (Etzioni, AAAI ‘06)
"MR is an exploratory, open-ended, serendipitous process."
"In contrast with many NLP tasks, MR is inherently unsupervised."
"Very large scale."
"Forming generalizations based on extracted assertions."
No ontology… ontology free!

5 Lessons from DB/KR Research
Declarative KR is expensive and difficult.
Formal semantics is at odds with broad scope and distributed authorship.
KBs are brittle: they "can only be used for tasks whose knowledge needs have been anticipated in advance" (Halevy, IJCAI '03).
A fortiori, for KBs extracted from text!

6 Machine Reading at Web Scale
A "universal ontology" is impossible: global consistency is like world peace.
Micro-ontologies: how do they scale? How do they interconnect?
Ontological "glass ceiling": a limited vocabulary and pre-determined predicates are swamped by reading at scale!

7 II. Open vs. Traditional IE
Traditional IE: input is a corpus plus O(R) hand-labeled data; relations are specified in advance; the extractor is relation-specific.
Open IE: input is just the corpus; relations are discovered automatically; the extractor is relation-independent.
How is Open IE possible?

8 Semantic Tractability Hypothesis
∃ an easy-to-understand subset of English.
Relations and arguments are characterized syntactically (Banko, ACL '08; Fader, EMNLP '11; Etzioni, IJCAI '11).
The characterization is compact and domain-independent, and covers 85% of binary, verb-based relations.
This is what makes relation-independent extraction possible.

9 SAMPLE RELATION PHRASES
invented; acquired by; has a PhD in; denied; voted for; inhibits tumor growth in; inherited; born in; mastered the art of; downloaded; aspired to; is the patron saint of; expelled; arrived from; wrote the book on

10 Number of Relations
DARPA MR Domains: <50
NYU, Yago: <100
NELL: ~500
DBpedia 3.2: 940
PropBank: 3,600
VerbNet: 5,000
Wikipedia Infoboxes (f > 10): ~5,000
TextRunner (phrases): 100,000+
ReVerb (phrases): 1,000,000+

11 TextRunner
TextRunner (2007): the first Web-scale Open IE system.
Distant supervision + CRF models of relations.
Output: (Arg1, Relation phrase, Arg2) tuples; 1,000,000,000 distinct extractions.

12 Relation Extraction from the Web

13 Open IE (2012)
"After beating the Heat, the Celtics are now the 'top dog' in the NBA." → (the Celtics, beat, the Heat)
"If he wins 5 key states, Romney will be president." → counterfactual context: "if he wins 5 key states"
Open-source ReVerb extractor; synonym detection.
Parser-based Ollie extractor (Mausam, EMNLP '12): verbs → nouns and more; analyzes context (beliefs, counterfactuals).
Sophistication of IE is a major focus. But what about entities, types, ontologies?
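The syntactic constraint behind ReVerb's relation phrases (roughly V | V P | V W* P over part-of-speech tags) can be sketched as a regular-expression check. A minimal sketch, assuming a simplified tag-to-class mapping and pre-tagged input; the real system also applies a lexical constraint and runs over chunked text:

```python
import re

# Map Penn Treebank POS tags to one-character classes:
# V = verb, W = noun/adjective/adverb/determiner, P = preposition/particle.
POS_CLASS = {"VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
             "NN": "W", "NNS": "W", "JJ": "W", "RB": "W", "DT": "W",
             "IN": "P", "TO": "P", "RP": "P"}

# ReVerb-style relation-phrase pattern: one or more verbs, optionally
# followed by intervening words and a final preposition/particle.
RELATION_PATTERN = re.compile(r"V+(W*P)?")

def find_relation_phrases(tagged):
    """Return token spans whose POS sequence matches the relation pattern."""
    classes = "".join(POS_CLASS.get(tag, "x") for _, tag in tagged)
    return [(m.start(), m.end()) for m in RELATION_PATTERN.finditer(classes)]

# "Einstein was born in Ulm" yields the relation phrase "was born in".
tagged = [("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"),
          ("in", "IN"), ("Ulm", "NNP")]
phrases = [" ".join(w for w, _ in tagged[s:e])
           for s, e in find_relation_phrases(tagged)]
# phrases == ["was born in"]
```

The Web-scale claim rests on this being a regular-language check: matching is linear in the corpus and requires no per-relation training.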

14 Towards “Ontologized” Open IE
Link arguments to Freebase (Lin, AKBC '12), when possible!
Associate types with the arguments.
No Noun Phrase Left Behind (Lin, EMNLP '12).

15 System Architecture
Input: Web corpus → Extractor (relation-independent extraction) → raw tuples, e.g. (XYZ Corp.; acquired; Go Inc.), (oranges; contain; Vitamin C), (Einstein; was born in; Ulm), (XYZ; buyout of; Go Inc.), (Albert Einstein; born in; Ulm), (Einstein Bros.; sell; bagels).
Assessor: synonyms and confidence, e.g. XYZ Corp. = XYZ; Albert Einstein = Einstein ≠ Einstein Bros.
Output extractions with support counts: Acquire(XYZ Corp., Go Inc.) [7]; BornIn(Albert Einstein, Ulm) [5]; Sell(Einstein Bros., bagels) [1]; Contain(oranges, Vitamin C) [1].
Index in Lucene; link entities; query processor. [Demo]
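The Assessor step can be sketched as follows. The synonym tables here are tiny hand-built stand-ins for the learned ones, and the slide's bracketed support counts come from many more source sentences than shown:

```python
from collections import Counter

# Raw tuples as produced by the extractor (the slide's examples).
raw_tuples = [
    ("XYZ Corp.", "acquired", "Go Inc."),
    ("XYZ", "buyout of", "Go Inc."),
    ("Einstein", "was born in", "Ulm"),
    ("Albert Einstein", "born in", "Ulm"),
    ("Einstein Bros.", "sell", "bagels"),
    ("oranges", "contain", "Vitamin C"),
]

# Hypothetical synonym tables; the real system learns these.
# Note "Einstein Bros." is deliberately NOT merged with "Albert Einstein".
ARG_SYNONYMS = {"XYZ": "XYZ Corp.", "Einstein": "Albert Einstein"}
REL_SYNONYMS = {"buyout of": "acquired", "was born in": "born in"}

def normalize(tup):
    a1, rel, a2 = tup
    return (ARG_SYNONYMS.get(a1, a1), REL_SYNONYMS.get(rel, rel),
            ARG_SYNONYMS.get(a2, a2))

# Redundancy as a confidence signal: count supporting extractions
# for each normalized tuple.
support = Counter(normalize(t) for t in raw_tuples)
# support[("XYZ Corp.", "acquired", "Go Inc.")] == 2
```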

16 III. Critique of Open IE
Lack of a formal ontology/vocabulary.
Inconsistent extractions.
Can it support reasoning?
What's the point of Open IE?

17 Perspectives on Open IE
"Search Needs a Shakeup" (Etzioni, Nature '11)
Textual resources
Reasoning over extractions

18 A. New Paradigm for Search
"Moving Up the Information Food Chain" (Etzioni, AAAI '96)
Retrieval → extraction; snippets and docs → entities and relations; keyword queries → questions; lists of docs → answers.
Essential for smartphones! (Siri meets Watson)

19 Case Study over Yelp Reviews
Map the review corpus to (attribute, value) pairs: (sushi = fresh), (parking = free).
Natural-language queries: "Where's the best sushi in Seattle?"
Sort results via sentiment analysis: exquisite > very good > so-so.
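A toy sketch of this pipeline stage; the sentiment scale, the extractions, and the business names are all invented for illustration:

```python
# Hypothetical sentiment scale matching the slide's ordering:
# exquisite > very good > so-so.
SENTIMENT = {"exquisite": 3, "very good": 2, "so-so": 1}

# (business, attribute, value) pairs of the kind mined from reviews.
extractions = [
    ("Umi Sake House", "sushi", "very good"),
    ("Shiro's", "sushi", "exquisite"),
    ("Food Court Sushi", "sushi", "so-so"),
    ("Umi Sake House", "parking", "free"),
]

def rank_by_attribute(attribute, pairs):
    """Answer "where's the best X?" by sorting on sentiment strength."""
    scored = [(SENTIMENT.get(value, 0), biz)
              for biz, attr, value in pairs if attr == attribute]
    return [biz for score, biz in sorted(scored, reverse=True)]

best = rank_by_attribute("sushi", extractions)[0]
# best == "Shiro's"
```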

20 RevMiner: Extractive Interface to 400K Yelp Reviews (Huang, UIST ’12)
revminer.com

21 B. Public Textual Resources (Leveraging Open IE)
94M Rel-grams: n-grams, but over relations in text, e.g. (police investigate X) → (police charge Y) (Balasubramanian, AKBC '12).
600K relation phrases (Fader, EMNLP '11).
Relation meta-data: 50K domain/range constraints for relations (Ritter, ACL '10); 10K functional relations (Lin, EMNLP '10).
30K learned Horn clauses (Schoenmackers, EMNLP '10).
CLEAN (Berant, ACL '12): 10M entailment rules (coming soon), with double the precision of DIRT.
See openie.cs.washington.edu.

22 C. Reasoning over Extractions
Identify synonyms (Yates & Etzioni, JAIR '09).
Linear-time first-order Horn-clause inference over 1,000,000,000 extractions (Schoenmackers, EMNLP '08).
Transitive inference (Berant, ACL '11).
Learn argument types via a generative model (Ritter, ACL '10).

23 Unsupervised, probabilistic model for identifying synonyms
P(Bill Clinton = President Clinton): count shared (relation, arg2) pairs.
P(acquired = bought): for relations, count shared (arg1, arg2) pairs.
Functions, mutual recursion.
Next step: unify with
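The shared-context counting behind these probabilities can be sketched directly. The score below is a raw overlap count rather than the model's actual probabilistic estimate, and the extraction store is a toy:

```python
# Toy extraction store: (arg1, relation, arg2) triples.
extractions = [
    ("Bill Clinton", "was elected", "President"),
    ("Bill Clinton", "vetoed", "the bill"),
    ("President Clinton", "was elected", "President"),
    ("President Clinton", "vetoed", "the bill"),
    ("Einstein", "was born in", "Ulm"),
]

def contexts(entity):
    """The (relation, arg2) pairs an entity appears with as arg1."""
    return {(rel, a2) for a1, rel, a2 in extractions if a1 == entity}

def shared(e1, e2):
    """Synonymy evidence: number of shared (relation, arg2) contexts."""
    return len(contexts(e1) & contexts(e2))

# "Bill Clinton" and "President Clinton" share 2 contexts;
# "Bill Clinton" and "Einstein" share none.
```

Relation synonymy (acquired = bought) works symmetrically by counting shared (arg1, arg2) pairs; the mutual recursion comes from letting entity merges strengthen the evidence for relation merges and vice versa.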

24 Scalable Textual Inference
Desiderata for inference: in text → probabilistic inference; on the Web → linear in |Corpus|.
Given the argument distributions of textual relations, inference is provably linear, and empirically linear!

25 Inference Scalability for Holmes

26 Extractions → Domain/Range
Much previous work (Resnik, Pantel, etc.).
Utilize generative topic models: the extractions of relation R correspond to a document; the domain/range of R correspond to topics.

27 TextRunner Extractions
Relations as documents. TextRunner extractions:
born_in(Sergey Brin, Moscow); headquartered_in(Microsoft, Redmond); born_in(Bill Gates, Seattle); born_in(Einstein, March); founded_in(Google, 1998); headquartered_in(Google, Mountain View); born_in(Sergey Brin, 1973); founded_in(Microsoft, Albuquerque); born_in(Einstein, Ulm); founded_in(Microsoft, 1973)

28 Argument "docs" → Type Models

29 Generative Story [LinkLDA, Erosheva et al., 2004]
For each relation, randomly pick a distribution over types, e.g. for X born_in Y: P(Topic1 | born_in) = 0.5, P(Topic2 | born_in) = 0.3.
For each extraction, pick a type for a1 and for a2 (e.g. Person born_in Location), then pick the arguments based on the types (e.g. Sergey Brin born_in Moscow).
The two argument slots get two separate sets of type distributions.
Bayesian inference (collapsed Gibbs sampling) recovers the hidden variables, i.e. the types of the relational arguments.
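A runnable toy of the generative story. The type inventories and probabilities are invented; the real model learns them by collapsed Gibbs sampling over millions of extractions:

```python
import random

# Hand-made type "topics" standing in for learned ones.
TYPES = {
    "Person": ["Sergey Brin", "Bill Gates", "Einstein"],
    "Location": ["Moscow", "Seattle", "Ulm"],
    "Date": ["1973", "1998", "March"],
}

# P(type | relation, arg-slot): arg2 of born_in is usually a Location,
# sometimes a Date, mirroring the learned domain/range examples.
ARG2_TYPE_DIST = {"born_in": [("Location", 0.7), ("Date", 0.3)]}

def generate_arg2(relation):
    """Generative story for one argument: pick a type, then a word from it."""
    r = random.random()
    cumulative = 0.0
    for type_name, prob in ARG2_TYPE_DIST[relation]:
        cumulative += prob
        if r < cumulative:
            break
    return type_name, random.choice(TYPES[type_name])

random.seed(0)
type_name, arg = generate_arg2("born_in")
# e.g. a Location like "Ulm", or occasionally a Date like "1973"
```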

30 Examples of Learned Domain/range
elect(Country, Person); predict(Expert, Event); download(People, Software); invest(People, Assets); was_born_in(Person, Location OR Date)

31 Summary: Trajectory of Open IE
2003: KnowItAll project.
2007: TextRunner, with 1,000,000,000 "ontology free" extractions.
2008–9: inference over extractions; open-source extractor; public textual resources.
2012: Freebase types; IE-based search; deeper analysis of sentences.
openie.cs.washington.edu

32 IV. Future: Open, Open IE
Open input: ingest (Tuple, Source, Confidence) triples from any source.
Linked open output: extractions → the Linked Open Data (LOD) cloud; relation normalization; follow LOD best practices.
LOD is philosophically compatible with Open IE, but it is missing probabilities, and Linked Data follows an open-world assumption.
Specialized reasoners rather than one overall mechanism?

33 Conclusions
Ontology is not necessary for reasoning.
Open IE is "gracefully" ontologized.
Open IE is boosting text analysis.
LOD has distribution and scale (but not text) = opportunity.
Thank you!

34 Discussion Questions
Why open? What's next?
Dimensions for analyzing systems.
What's worked, what's failed? (lessons)
What can we learn from Watson? What can we learn from DB/KR? (Alon)

35 Questions
What extraction mechanism is used? What corpus? What input knowledge?
What role for people/manual labeling?
What form does the extracted knowledge take? What size/scope?
What reasoning is done? Most unique aspect? Biggest challenge?

36 Scalability Notes
Interoperability and distributed authorship vs. a monolithic system.
Open IE meets RDF: predicates need URIs. How to obtain them? What about errors, ambiguity, and uncertainty in the mapping to URIs?

37 Reasoning
NELL: inter-class constraints to generate negative examples.

38 Dimensions of Scalability
Corpus size; syntactic coverage over text; semantic coverage over text; time, belief, n-ary relations, etc.; number of entities and relations; ability to reason; how much CPU? how much manual effort?
Bounding and ceiling effects: the ontological glass ceiling.

39 Example of Limiting Assumptions
NELL: "apple" has a single meaning; a single atom per entity; global computation to add an entity; can't be sure.
LOD best practice: same-as links.

40 Risk for a Scalable System
Limited semantics and reasoning, or no reasoning at all…

41 LOD Scale
LOD triples in Aug 2011: 31,634,213,770

42 The Open-World Assumption
The following statement appears in the last paragraph of the W3C Linked Library Data Group Final Report: "…Linked Data follows an open-world assumption: the assumption that data cannot generally be assumed to be complete and that, in principle, more data may become available for any given entity."


44 Entity Linking an Extraction Corpus
Example: "Einstein quit his job at the patent office."
Candidates (with Wikipedia inlink counts): US Patent Office (1,281), EU Patent Office (168), Japan Patent Office (56), Swiss Patent Office (101), Patent (4,620).
Three signals:
1. String match: obtain candidates and measure string similarity; an exact match is best, but also consider alternate capitalization, edit distance, word overlap, substring/superstring, known aliases, and potential abbreviations.
2. Prominence prior: the number of links in Wikipedia to that entity's article.
3. Context match: cosine similarity between the "document" of the extraction's source sentences (e.g. "Einstein quit his job at the patent office to become a professor.", "In 1909, Einstein quit his job at the patent office.") and the Wikipedia article texts.
Link score is a function of (String Match Score, Prominence Prior Score, Context Match Score), e.g. String Match Score × ln(Prominence Prior Score) × Context Match Score.
Link ambiguity = 2nd top link score / top link score.
Collective linking beats linking one extraction at a time in both speed and precision: a 2.53 GHz machine links 15 million text arguments in ~3 days (60+ per second).
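The slide's example scoring combination can be sketched numerically. The inlink counts are the ones shown on the slide, while the per-candidate string and context scores are illustrative guesses:

```python
import math

def link_score(string_match, inlinks, context_match):
    """The slide's example combination: string x ln(prominence) x context."""
    return string_match * math.log(inlinks) * context_match

# Candidates for "Einstein quit his job at the patent office";
# inlink counts from the slide, similarity scores invented.
candidates = {
    "US Patent Office": link_score(0.6, 1281, 0.2),
    "EU Patent Office": link_score(0.6, 168, 0.2),
    "Japan Patent Office": link_score(0.6, 56, 0.1),
    "Swiss Patent Office": link_score(0.6, 101, 0.9),
    "Patent": link_score(0.9, 4620, 0.1),
}

ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
top_entity, top_score = ranked[0]
ambiguity = ranked[1][1] / top_score  # near 1.0 means the link is uncertain
# top_entity == "Swiss Patent Office": context match beats raw prominence here
```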

45 Q/A with Linked Extractions
Linking supports Q/A over ambiguous entities, typed search, and linked resources.
Typed search: "Which sports originated in China?" uses Freebase types (e.g. Ping Pong → "Table Tennis", Dragon Boating → "Dragon Boat Racing") to return sports (Golf, Ping Pong, Dragon Boating, Wushu, Karate, Soccer) from extractions such as "Golf originated in China" and "Ping Pong originated in China" (534 more…), filtering out non-sports like "Noodles originated in China", "Printmaking originated in China", "Soy Beans originated in China", and "Taoism originated in China".
Ambiguous entities: "I need to learn about Titanic the ship for my homework." separates the ship ("The Titanic sank in 1912", "Titanic was built in Belfast", "The Titanic set sail from Southampton", "RMS Titanic weighed about 26 kt", "The Titanic was built for safety and comfort", "The Titanic sank in 12,460 feet of water", 1,902 more…) from the film ("The Titanic was released in 1998", "Titanic earned more than $1 billion worldwide", "Titanic represents the state-of-the-art in special effects", 3,761 more…).
This leverages KBs by linking textual arguments to entities found in the knowledge base.

46 Linked Extractions support Reasoning
In addition to question answering, linking can also benefit: functions [Ritter et al., 2008; Lin et al., 2010]; other relation properties [Popescu 2007; Lin et al., CSK 2010]; inference [Schoenmackers et al., 2008; Berant et al., 2011]; knowledge-base population [Dredze et al., 2010]; concept-level annotations [Christensen and Pasca, 2012]; basically anything using the output of extraction.
Other Web text containing entities (e.g., query logs) can also be linked to enable new experiences.

47 Desiderata for a Machine Reader
Reads everything: linear in |corpus|.
Syntactic breadth (Mausam, EMNLP '12), e.g. "Seattle Mayor Mike McGinn".
Semantic depth (Schubert '02).
Powerful reasoning; autonomous, open-ended discovery.
Note: language- and genre-independent, even Twitter.

48 Etzioni, University of Washington
Challenges of single-sentence extraction: "He believed the plan will work"; "John Glenn was the first American in space"; "Obama was elected President in 2008."; "American president Barack Obama asserted…"
The history of AI is littered with the corpses of sophisticated systems that failed to scale; notable examples: SHRDLU, Cyc, blocks-world planners. Fundamental assumptions, such as consistency, break.
The history of success is full of scalable systems with (initially) weak semantics: the WWW (vs. hypertext), Google.

