Gerhard Weikum Max Planck Institute for Informatics From Information to Knowledge: Harvesting Entities, Relationships,


1 Gerhard Weikum Max Planck Institute for Informatics From Information to Knowledge: Harvesting Entities, Relationships, and Temporal Facts from Web Sources

2 Acknowledgements

3 Goal: Turn Web into Knowledge Base
comprehensive DB of human knowledge; everything that Wikipedia knows; everything machine-readable; capturing entities, classes, relationships
Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009

4 Approach: Harvesting Facts from Web
Politician / Political Party: Angela Merkel / CDU; Karl-Theodor zu Guttenberg / CDU; Christoph Hartmann / FDP; …
Company / CEO: Google / Eric Schmidt; …
Movie / ReportedRevenue: Avatar / $ 2,718,444,933; The Reader / $ 108,709,522; …
PoliticalParty / Spokesperson: CDU / Philipp Wachholz; Die Grünen / Claudia Roth; …
Actor / Award: Christoph Waltz / Oscar; Sandra Bullock / Oscar; Sandra Bullock / Golden Raspberry; …
Politician / Position: Angela Merkel / Chancellor Germany; Karl-Theodor zu Guttenberg / Minister of Defense Germany; Christoph Hartmann / Minister of Economy Saarland; …
Company / AcquiredCompany: Google / YouTube; Yahoo / Overture; Facebook / FriendFeed; Software AG / IDS Scheer; …
YAGO-NAGA, IWP, Cyc, TextRunner, ReadTheWeb, WikiTax2WordNet, SUMO

5 Knowledge for Intelligence
entity recognition & disambiguation
understanding natural language & speech
knowledge services & reasoning for semantic apps (e.g. deep QA)
semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.):
FIFA 2010 finalists who played in a Champions League final?
Politicians who are also scientists?
Enzymes that inhibit HIV?
Influenza drugs for teens with high blood pressure?
German football coach when Bastian Schweinsteiger was born?
Relationships between Manfred Pinkal, Edsger Dijkstra, Michael Dell, and Renee Zellweger?

6 Outline: What and Why; Automatic KB Construction; Growing & Maintaining the KB; Temporal Knowledge; Wrap-up

7 What is Knowledge (in a KB)?
facts / assertions: bornIn (BastianSchweinsteiger, Kolbermoor), hasWon (BastianSchweinsteiger, BronzeFIFAWorldCup2010), playedInFinal (BastianSchweinsteiger, ChampionsLeague2010), …
taxonomic: instanceOf (BastianSchweinsteiger, footballPlayer), subclassOf (footballPlayer, athlete), …
lexical / terminology: means (Big Apple, NewYorkCity), means (Apple, AppleComputerCorporation), means (MS, Microsoft), means (MS, MultipleSclerosis), …
common-sense properties: apples are green, red, juicy, sweet, sour … but not fast, smart …; balls are round, smooth, slippery … but not square, funny …
common-sense axioms: ∀x: human(x) ⇒ male(x) ∨ female(x); ∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x)); ∀x: animal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x))); …
procedural: how to fix/install/prepare/remove …
epistemic / beliefs: believes (Ptolemy, shape(Earth, disc)), believes (Copernicus, shape(Earth, sphere)), …

8 Tapping on Wikipedia Categories

9 KBs: Example YAGO (Suchanek et al.: WWW07)
[Figure: excerpt of the YAGO graph. Class nodes: Entity, Person, Scientist, Physicist, Biologist, Politician, Location, City, Country, State, Organization. Entity nodes: Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Nobel Prize, Max_Planck Society. Name nodes: "Max Planck", "Max Karl Ernst Ludwig Planck", "Angela Merkel", "Angela Dorothea Merkel". Date nodes: Apr 23, 1858; Oct 4, 1947; Oct 23, 1944. Edge labels: subclass, instanceOf, means (with weights 0.9 and 0.1 for the ambiguous name "Max Planck"), bornOn, diedOn, bornIn, hasWon, FatherOf, citizenOf, locatedIn]
Accuracy 95%
2 Mio. entities, classes
40 Mio. RDF triples (facts) (entity1-relation-entity2, subject-predicate-object)

10 KBs: Example YAGO (F. Suchanek et al.: WWW07)

11 KBs: Example DBpedia (Auer, Bizer, et al.: ISWC07)
3 million entities, 1 billion facts (RDF triples); 1.5 million entities mapped to a hand-crafted taxonomy of 259 classes with 1200 properties

12 Outline: What and Why; Automatic KB Construction; Growing & Maintaining the KB; Temporal Knowledge; Wrap-up

13 French Marriage Problem
facts in KB and new facts or fact candidates: married (Hillary, Bill), married (Carla, Nicolas), married (Angelina, Brad), married (Cecilia, Nicolas), married (Carla, Benjamin), married (Carla, Mick), married (Michelle, Barack), married (Yoko, John), married (Kate, Leonardo), married (Carla, Sofie), married (Larry, Google)
1) for recall: pattern-based harvesting
2) for precision: consistency reasoning

14 Pattern-Based Harvesting (Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)
Facts & Fact Candidates: (Hillary, Bill), (Carla, Nicolas), (Angelina, Brad), (Victoria, David), (Yoko, John), (Carla, Benjamin), (Larry, Google), (Kate, Pete), …
Patterns: X and her husband Y; X and Y on their honeymoon; X and Y and their children; X has been dating with Y; X loves Y; …
good for recall; noisy, drifting; not robust enough for high precision
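The fact-pattern loop on this slide can be illustrated with a minimal sketch. The toy corpus, the infix-only pattern form, and the function names are assumptions made for illustration; none of them come from the cited systems.

```python
import re

# Hypothetical toy corpus; the sentences are illustrative only.
CORPUS = [
    "Hillary and her husband Bill attended the dinner.",
    "Carla and her husband Nicolas visited Berlin.",
    "Angelina and her husband Brad arrived late.",
    "Victoria and David on their honeymoon were photographed.",
    "Angelina and Brad on their honeymoon were photographed.",
    "Larry loves Google.",
]

SEEDS = {("Hillary", "Bill"), ("Carla", "Nicolas")}


def induce_patterns(corpus, facts):
    """Collect the text between X and Y for every sentence mentioning a known pair."""
    patterns = set()
    for sentence in corpus:
        for x, y in facts:
            match = re.search(rf"\b{re.escape(x)}\b(.+?)\b{re.escape(y)}\b", sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(corpus, patterns):
    """Find new (X, Y) pairs of capitalized words connected by a known infix."""
    candidates = set()
    for sentence in corpus:
        for infix in patterns:
            regex = rf"\b([A-Z][a-z]+){re.escape(infix)}([A-Z][a-z]+)\b"
            candidates.update(re.findall(regex, sentence))
    return candidates


def harvest(corpus, seeds, rounds=2):
    """Alternate between pattern induction and fact extraction."""
    facts = set(seeds)
    for _ in range(rounds):
        facts |= apply_patterns(corpus, induce_patterns(corpus, facts))
    return facts
```

In the second round the very generic infix " and " is induced from (Angelina, Brad), which hints at why such patterns drift and why a separate precision step is needed.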

15 Reasoning about Fact Candidates
Use consistency constraints to prune false candidates
FOL rules (restricted):
spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,x) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))
ground atoms: spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie); f(Hillary), f(Carla), f(Cecilia), f(Sofie); m(Bill), m(Nicolas), m(Ben), m(Mick)
Rules reveal inconsistencies
Find consistent subset(s) of atoms (possible world(s), the truth)
Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule)
uncertain / probabilistic data: compute prob. distr. of subset of atoms being the truth
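A minimal sketch of how these rules expose inconsistencies among the ground atoms. It checks the restricted rules spouse(x,y) ⇒ f(x), spouse(x,y) ⇒ m(y) and the two uniqueness rules; the function names are illustrative assumptions.

```python
from itertools import combinations

# Ground atoms taken from the slide.
SPOUSE = [
    ("Hillary", "Bill"), ("Carla", "Nicolas"), ("Cecilia", "Nicolas"),
    ("Carla", "Ben"), ("Carla", "Mick"), ("Carla", "Sofie"),
]
FEMALE = {"Hillary", "Carla", "Cecilia", "Sofie"}
MALE = {"Bill", "Nicolas", "Ben", "Mick"}


def violates_types(x, y):
    """spouse(x,y) => f(x) and spouse(x,y) => m(y)."""
    return not (x in FEMALE and y in MALE)


def conflicting(a, b):
    """Two spouse atoms clash if they share one partner but differ in the other."""
    (x1, y1), (x2, y2) = a, b
    return (x1 == x2 and y1 != y2) or (y1 == y2 and x1 != x2)


def find_inconsistencies(atoms):
    """Return atoms that break a type rule and pairs that break a uniqueness rule."""
    type_errors = [a for a in atoms if violates_types(*a)]
    clashes = [(a, b) for a, b in combinations(atoms, 2) if conflicting(a, b)]
    return type_errors, clashes
```

The clashing pairs only say that both atoms cannot be true together; choosing which one to keep is the job of the weighted reasoning discussed on the following slides.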

16 Markov Logic Networks (MLNs) (M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)
Rules: s(x,y) ⇒ m(y); s(x,y) ⇒ f(x); s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z); s(x,y) ∧ diff(w,x) ⇒ ¬s(w,y); f(x) ⇒ ¬m(x); m(x) ⇒ ¬f(x)
Fact candidates: s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
Grounding: ¬s(Ca,Nic) ∨ ¬s(Ce,Nic); ¬s(Ca,Nic) ∨ ¬s(Ca,Ben); ¬s(Ca,Nic) ∨ ¬s(Ca,So); ¬s(Ca,Ben) ∨ ¬s(Ca,So); ¬s(Ca,Nic) ∨ m(Nic); ¬s(Ce,Nic) ∨ m(Nic); ¬s(Ca,Ben) ∨ m(Ben); ¬s(Ca,So) ∨ m(So)
Literal → Boolean Var; Literal → binary RV

17 Markov Logic Networks (MLNs) (M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)
Rules: s(x,y) ⇒ m(y); s(x,y) ⇒ f(x); s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z); s(x,y) ∧ diff(w,x) ⇒ ¬s(w,y); f(x) ⇒ ¬m(x); m(x) ⇒ ¬f(x)
Fact candidates: s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
MRF nodes: m(Ben), m(Nic), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So), m(So)
RVs coupled by MRF edge if they appear in same clause
MRF assumption: P[X_i | X_1..X_n] = P[X_i | N(X_i)]
joint distribution has product form over all cliques
Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
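For reference, the product form mentioned on this slide can be written out as follows. This is the standard Markov Logic formulation; the symbols \phi_k (clique potential), w_i (weight of formula i), n_i(x) (number of true groundings of formula i in world x) and Z (normalization constant) are notation introduced here, not taken from the slide.

```latex
P(X = x) \;=\; \frac{1}{Z} \prod_{k} \phi_k\bigl(x_{\{k\}}\bigr)
        \;=\; \frac{1}{Z} \exp\Bigl( \sum_{i} w_i \, n_i(x) \Bigr),
\qquad
Z \;=\; \sum_{x'} \exp\Bigl( \sum_{i} w_i \, n_i(x') \Bigr)
```

A world that violates a heavily weighted rule is thus exponentially less probable, but not impossible, which is what allows soft constraints.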

18 Related Alternative Probabilistic Models
Constrained Conditional Models [D. Roth et al. 2007]: log-linear classifiers with constraint-violation penalty; mapped into Integer Linear Programs
Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008]: RVs share factors (joint feature functions); generalizes MRF, BN, CRF, …; inference via advanced MCMC; flexible coupling & constraining of RVs
software tools: alchemy.cs.washington.edu, code.google.com/p/factorie/, research.microsoft.com/en-us/um/cambridge/projects/infernet/
[Figure: graph over the RVs m(Ben), m(Nic), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So), m(So)]

19 Reasoning for KB Growth: Direct Route (F. Suchanek et al.: WWW09)
facts in KB and new fact candidates: married (Hillary, Bill), married (Carla, Nicolas), married (Angelina, Brad), married (Cecilia, Nicolas), married (Carla, Benjamin), married (Carla, Mick), married (Carla, Sofie), married (Larry, Google)
+ patterns: X and her husband Y; X and Y and their children; X has been dating with Y; X loves Y
Direct approach:
1. facts are true; fact candidates & patterns → hypotheses; grounded constraints → clauses with hypotheses as vars
2. type signatures of relations greatly reduce #clauses
3. cast into Weighted Max-Sat with weights from pattern stats; customized approximation algorithm
unifies: fact cand consistency, pattern goodness, entity disambig.
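The Weighted Max-Sat step can be sketched on a tiny instance. This sketch uses exhaustive search rather than the customized approximation algorithm the slide refers to, and the clause weights (including the large constant standing in for hard constraints) are made-up values for illustration.

```python
from itertools import product

HARD = 100.0  # assumed large weight standing in for a hard constraint

VARS = ["s(Ca,Nic)", "s(Ca,Ben)", "s(Ca,So)"]

# Each clause is (weight, [(variable, required truth value), ...]).
CLAUSES = [
    (0.9, [("s(Ca,Nic)", True)]),   # hypothetical pattern-based evidence
    (0.4, [("s(Ca,Ben)", True)]),
    (0.2, [("s(Ca,So)", True)]),
    (HARD, [("s(Ca,Nic)", False), ("s(Ca,Ben)", False)]),  # at most one spouse
    (HARD, [("s(Ca,Nic)", False), ("s(Ca,So)", False)]),
    (HARD, [("s(Ca,Ben)", False), ("s(Ca,So)", False)]),
]


def satisfied(literals, world):
    """A clause holds if at least one of its literals holds in the world."""
    return any(world[var] == value for var, value in literals)


def score(world):
    """Total weight of the clauses satisfied by the world."""
    return sum(weight for weight, literals in CLAUSES if satisfied(literals, world))


def best_world():
    """Enumerate all truth assignments and return the highest-scoring one."""
    worlds = (dict(zip(VARS, values))
              for values in product([False, True], repeat=len(VARS)))
    return max(worlds, key=score)
```

Exhaustive enumeration is exponential in the number of hypotheses, so it only serves to show the objective; it would not scale to the instance sizes reported later in the talk.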

20 Facts & Patterns Consistency with SOFIE (F. Suchanek et al.: WWW09, N. Nakashole et al.: WebDB10)
constraints to connect facts, fact candidates, patterns
pattern-fact duality:
occurs(p,x,y) ∧ expresses(p,R) ∧ type(x)=dom(R) ∧ type(y)=rng(R) ⇒ R(x,y)
occurs(p,x,y) ∧ R(x,y) ∧ type(x)=dom(R) ∧ type(y)=rng(R) ⇒ expresses(p,R)
name(-in-context)-to-entity mapping: means(n,e1) means(n,e2) …
functional dependencies: spouse(X,Y): X → Y, Y → X
relation properties: asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies: spouse ⊆ Person × Person; capitalOfCountry ⊆ cityOfCountry
domain-specific constraints: bornInYear(x) + 10years ≤ graduatedInYear(x); hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t

21 Entity Disambiguation Revisited
occurs (divorced from, Madonna, Guy Ritchie) ∧ expresses (divorced from, wasMarriedTo) ⇒ wasMarriedTo (Madonna, Guy Ritchie)
actually is:
occurs (divorced from, Madonna, Guy Ritchie) ∧ means (Madonna, Madonna Louise Ciccone) ∧ expresses (divorced from, wasMarriedTo) ⇒ wasMarriedTo (Madonna Louise Ciccone, Guy Ritchie) [0.7]
occurs (divorced from, Madonna, Guy Ritchie) ∧ means (Madonna, Madonna (Edvard Munch)) ∧ expresses (divorced from, wasMarriedTo) ⇒ wasMarriedTo (Madonna (Edvard Munch), Guy Ritchie) [0.3]
entity level vs. word/phrase level
use context-similarity as disambiguation prior; set clause weights accordingly; reduced to normal case
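One simple way to obtain a context-similarity prior is bag-of-words cosine similarity between the mention's context and a short description of each candidate entity. The sketch below assumes exactly that; the descriptions, the similarity measure, and the normalization are illustrative choices and are not claimed to be what SOFIE does.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguation_prior(mention_context, candidates):
    """Map each candidate entity to a prior proportional to its context similarity."""
    context = Counter(mention_context.lower().split())
    sims = {entity: cosine(context, Counter(description.lower().split()))
            for entity, description in candidates.items()}
    total = sum(sims.values())
    if total == 0:
        return {entity: 1.0 / len(sims) for entity in sims}  # uniform fallback
    return {entity: sim / total for entity, sim in sims.items()}


# Hypothetical one-line descriptions of the two candidate entities.
CANDIDATES = {
    "Madonna Louise Ciccone":
        "american singer and actress formerly married to director guy ritchie",
    "Madonna (Edvard Munch)":
        "painting by the norwegian painter edvard munch",
}

PRIORS = disambiguation_prior(
    "singer madonna divorced from director guy ritchie", CANDIDATES)
```

The resulting priors could then be used as the clause weights shown in brackets on the slide.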

22 Experimental Results
SOFIE (F. Suchanek et al.: WWW09):
input: biographies of 400 US senators, 3500 HTML files
output: birth/death date&place, politicianOf (state)
run-time: 7 h parsing, 6 h hypotheses, 2 h Max-Sat
precision: % (except for death place)
recall: ca. 750 extracted facts (300 politicianOf facts)
PROSPERA (N. Nakashole et al.: WebDB10):
input: Wikipedia articles and Web homepages of scientists
output: hasAdvisor, graduatedAt, hasCollaborator, facultyAt, wonAward
run-time: 1 h total (largely parallelized)
precision: %
recall: ca extracted facts (400 hasAdvisor facts)
Now running experiments on ClueWeb09 corpus (500 Mio. English Web pages) with Hadoop cluster of 10x16 cores and 10x48 GB

23 Outline: What and Why; Automatic KB Construction; Growing & Maintaining the KB; Temporal Knowledge; Wrap-up

24 Temporal Knowledge
Which facts for given relations hold at what time point or during which time intervals?
marriedTo (Madonna, Guy) [ 22Dec2000, Dec2008 ]
capitalOf (Berlin, Germany) [ 1990, now ]
capitalOf (Bonn, Germany) [ 1949, 1989 ]
hasWonPrize (JimGray, TuringAward) [ 1998 ]
graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ]
graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ]
hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ]
How can we query & reason on entity-relationship facts in a time-travel manner, with uncertain/incomplete KB?
US president when Barack Obama was born?
students of Hector Garcia-Molina while he was at Princeton?
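A time-travel lookup over such interval-annotated facts can be sketched in a few lines. The tuple layout, the year-level granularity, and the use of infinity for "now" are simplifying assumptions.

```python
import math

NOW = math.inf  # assumed encoding of an open-ended interval

# (relation, subject, object, (begin_year, end_year)), coarsened to years.
T_FACTS = [
    ("capitalOf", "Berlin", "Germany", (1990, NOW)),
    ("capitalOf", "Bonn", "Germany", (1949, 1989)),
    ("marriedTo", "Madonna", "Guy", (2000, 2008)),
]


def holds_at(relation, obj, year):
    """Return all subjects s for which relation(s, obj) is valid in the given year."""
    return [subject
            for rel, subject, o, (begin, end) in T_FACTS
            if rel == relation and o == obj and begin <= year <= end]
```

For example, `holds_at("capitalOf", "Germany", 1975)` selects the fact whose interval covers 1975. This sketch ignores the uncertainty and incompleteness that the slide names as the actual challenge.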

25 French Marriage Problem
facts in KB:
1: married (Hillary, Bill)
2: married (Carla, Nicolas)
3: married (Angelina, Brad)
new fact candidates:
4: married (Cecilia, Nicolas)
5: married (Carla, Benjamin)
6: married (Carla, Mick)
7: divorced (Madonna, Guy)
8: domPartner (Angelina, Brad)
validFrom (2, 2008), validFrom (4, 1996), validUntil (4, 2007), validFrom (5, 2010), validFrom (6, 2006), validFrom (7, 2008)

26 Challenge: Temporal Knowledge
for all people in Wikipedia ( ) gather all spouses, incl. divorced & widowed, and corresponding time periods! >95% accuracy, >95% coverage, in one night
1) recall: gather temporal scopes for base facts
2) precision: reason on mutual consistency
consistency constraints are potentially helpful:
functional dependencies: husband, time → wife
inclusion dependencies: marriedPerson ⊆ adultPerson
age/time/gender restrictions: birthdate + Δ < marriage < divorce

27 Difficult Dating

28 (Even More Difficult) Implicit Dating explicit dates vs. implicit dates relative to other dates

29 (Even More Difficult) Relative Dating: vague dates, relative dates; narrative text, relative order

30 Framework for T-Fact Extraction (Theobald et al.: MUD10, Wang et al.: EDBT10; Zhang et al.: WebDB08)
1) represent temporal scopes of facts in the presence of incompleteness and uncertainty
2) gather & filter candidates for t-facts: extract base facts R(e1, e2) first; then focus on sentences with e1, e2 and date or temporal phrase
3) aggregate & reconcile evidence from observations
4) reason on joint constraints about facts and time scopes

31 1) Representing T-Fact Evidence
event-style and state-style facts; meta-facts to capture temporal scopes:
1: married(Madonna, Sean), 2: married(Madonna, Guy), validSince (1, 16-Aug-1985), validUntil (1, 14-Sep-1989), validSince (2, 22-Dec-2000), validUntil (2, 15-Dec-2008)
3: wonAward(Sean, AcademyAwardForBestActor), validOn (3, 29-Feb-2004)
different resolutions, later refinement:
"After 4 years of happy marriage, Madonna and Sean got divorced in September": 1: married(Madonna, Sean), earliestSince (1, 1-Jan-1985), latestSince (1, 31-Dec-1985), earliestUntil (1, 1-Sep-1989), latestUntil (1, 30-Sep-1989)
uncertain & inconsistent evidence: confidence distribution (µ=1987, σ² = )
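The four-bound representation (earliestSince, latestSince, earliestUntil, latestUntil) lends itself to later refinement by intersecting bounds. The class below is a minimal sketch under that assumption; the class name and the `refine` and `is_consistent` methods are illustrative, not taken from the cited papers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TemporalScope:
    earliest_since: date
    latest_since: date
    earliest_until: date
    latest_until: date

    def refine(self, other):
        """Combine two pieces of evidence by tightening every bound."""
        return TemporalScope(
            max(self.earliest_since, other.earliest_since),
            min(self.latest_since, other.latest_since),
            max(self.earliest_until, other.earliest_until),
            min(self.latest_until, other.latest_until),
        )

    def is_consistent(self):
        """Bounds must be ordered; otherwise the evidence is contradictory."""
        return (self.earliest_since <= self.latest_since
                and self.earliest_until <= self.latest_until
                and self.earliest_since <= self.latest_until)


# Coarse scope from the slide's example sentence.
COARSE = TemporalScope(date(1985, 1, 1), date(1985, 12, 31),
                       date(1989, 9, 1), date(1989, 9, 30))
# Exact scope from the slide's validSince / validUntil meta-facts.
EXACT = TemporalScope(date(1985, 8, 16), date(1985, 8, 16),
                      date(1989, 9, 14), date(1989, 9, 14))
REFINED = COARSE.refine(EXACT)
```

Intersection only works for evidence that agrees; inconsistent observations need the aggregation step described on the later slides.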

32 2) Gather & Filter T-Fact Candidates
Choice of sources:
news-style: date in header; relative temp exprs; simple language; many pronouns
biography-style: many dates in text; explicit dates, narrative; elaborated language; pronouns for main entity
Example: Bruni met recently divorced president Sarkozy in November 2007 at a dinner party. She has said she is easily "bored with monogamy" … A romance is said to have started a few weeks ago between her and Biolay.
Naive approach: use deep NLP (dependency parser) on every sentence, then use classifier (or structured-output learner) to detect t-facts: too expensive

33 2) Gather & Filter: Multi-Stage Approach
stage 1: sentences with e1 and e2 from R: match noun phrases against YAGO means relation; use disambiguation prior for entity mentions
stage 2: sentences that contain a temporal expression: use TARSQI tool to extract relative t-expressions and map them to absolute dates or durations
stage 3: sentences where the t-expression refers to R(e1,e2): run dependency parser and check shortest path connecting e1, e2, verb, t-expr; alternatively, consider only sentences with two noun groups & short surface distances of e1, e2, t-expr
Example: Jim married Sue, but later left her and began an affair with Jane in 2005.
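The three stages can be mimicked with cheap surface checks. This sketch substitutes plain string matching for the YAGO means lookup, a four-digit-year regex for TARSQI, and a token-distance threshold for the dependency-path check; all three substitutions, the threshold, and the extra sentences are assumptions for illustration.

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")  # crude stand-in for a temporal tagger


def stage1(sentences, e1, e2):
    """Keep sentences that mention both entities (single-token names assumed)."""
    return [s for s in sentences if e1 in s and e2 in s]


def stage2(sentences):
    """Keep sentences that contain a temporal expression (here: a year)."""
    return [s for s in sentences if YEAR.search(s)]


def stage3(sentences, e1, e2, max_gap=6):
    """Keep sentences where e1, e2 and the year lie within max_gap tokens."""
    kept = []
    for s in sentences:
        tokens = re.findall(r"\w+", s)
        positions = [i for i, tok in enumerate(tokens)
                     if tok in (e1, e2) or YEAR.fullmatch(tok)]
        if positions and max(positions) - min(positions) <= max_gap:
            kept.append(s)
    return kept


def filter_candidates(sentences, e1, e2):
    """Run the three stages in order, cheapest first."""
    return stage3(stage2(stage1(sentences, e1, e2)), e1, e2)


SENTENCES = [
    "Jim married Sue, but later left her and began an affair with Jane in 2005.",
    "Jim married Sue in 1999.",   # hypothetical extra sentence
    "Jim met Sue at a party.",    # hypothetical extra sentence
    "Jane moved to Paris in 2005.",  # hypothetical extra sentence
]
```

On the slide's example sentence the year is far from "Jim" and "Sue", so the distance heuristic rejects it, which matches the point that 2005 does not date the marriage.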

34 3) Aggregate & Reconcile T-Fact Evidence
Ideal input:
Madonna and Sean were married from 16-Aug-85 until 12-Sep-89.
Madonna and Sean married on August 16,
Madonna and Sean got divorced in September
Imprecise input:
Madonna and Sean were married from 1985 through
Madonna and Sean were married four years in the late nineties.
Madonna and Sean got divorced in fall
Noisy input:
Madonna and Sean plan their wedding in summer
Madonna and Sean just returned from their honeymoon (in Jan 1986).
Madonna and Sean will be divorced by the end of the year (1989).
The marriage of Madonna and Sean will not survive this year (1987).
[Figure: time evidence]

35 3) Aggregate & Reconcile T-Fact Evidence Real input: … Madonna and Sean were chased during their honeymoon … (Jan 19, 1986) Madonna and her husband Sean opened the exhibition … (March 7, 1986) Madonna and her husband Sean were seen at … (April 1, 1986) Madonna and Sean met other couples at … (June 22, 1986) Madonna and Sean plan to have children … (July 4, 1986) Madonna and Sean would consider adopting a child … (July 14, 1986) Sean and his wife Madonna purchase another castle in … (November 5, 1986)... Madonna and Sean think about getting divorced … (April 21, 1989) The marriage of Madonna and Sean is in deep crisis … (May 11, 1989) … time evidence


37 3) Aggregate & Reconcile: Solution
Classifier for t-fact observations: begin vs. during vs. end
Build separate histogram for each class (and each t-fact): event histogram (begin), state histogram (during), event histogram (end)
Combine histograms & derive high-confidence time scope
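A minimal sketch of the histogram idea for a single t-fact. The labeled observations are invented, the labels are assumed to come from the begin/during/end classifier, and the combination rule (mode of the begin and end histograms, with the share of "during" observations inside the interval as support) is one simple possibility, not the method of the cited work.

```python
from collections import Counter

# Hypothetical (label, year) observations for married(Madonna, Sean).
OBSERVATIONS = [
    ("begin", 1985), ("begin", 1985), ("begin", 1986),
    ("during", 1986), ("during", 1986), ("during", 1986),
    ("during", 1987), ("during", 1988), ("during", 1990),
    ("end", 1989), ("end", 1989), ("end", 1987),
]


def build_histograms(observations):
    """One histogram of years per observation class."""
    histograms = {"begin": Counter(), "during": Counter(), "end": Counter()}
    for label, year in observations:
        histograms[label][year] += 1
    return histograms


def derive_scope(histograms):
    """Pick the most frequent begin and end year; report support from 'during'."""
    begin = histograms["begin"].most_common(1)[0][0]
    end = histograms["end"].most_common(1)[0][0]
    during = histograms["during"]
    inside = sum(count for year, count in during.items() if begin <= year <= end)
    support = inside / max(1, sum(during.values()))
    return begin, end, support
```

A low support value would signal that the begin and end histograms disagree with the state observations, which is a cue to lower the confidence of the derived scope.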

38 4) Joint Reasoning on Facts and T-Facts
Combine & reconcile t-scopes across different facts
constraint: marriedTo (m) is an injective function at any given point:
∀X, Y, Z, T1, T2: m(X,Y) ∧ m(X,Z) ∧ validTime(m(X,Y),T1) ∧ validTime(m(X,Z),T2) ⇒ ¬overlaps(T1, T2)
after grounding:
m(Carla, Nicolas) ∧ m(Cecilia, Nicolas) ⇒ ¬overlaps ([2008,2010], [1996,2007]), i.e. m(Ca,Nic) ∧ m(Ce,Nic) ⇒ ¬false
m(Carla, Nicolas) ∧ m(Carla, Benjamin) ⇒ ¬overlaps ([2008,2010], [2009,2011]), i.e. m(Ca,Nic) ∧ m(Ca,Ben) ⇒ ¬true
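The grounding step amounts to evaluating overlaps on concrete intervals for every pair of marriage facts that share a partner. A small sketch, assuming closed year intervals and illustrative function names:

```python
from itertools import combinations

# T-facts and intervals from the slide.
T_FACTS = {
    ("Carla", "Nicolas"): (2008, 2010),
    ("Cecilia", "Nicolas"): (1996, 2007),
    ("Carla", "Benjamin"): (2009, 2011),
}


def overlaps(t1, t2):
    """Closed intervals overlap if each starts no later than the other ends."""
    return t1[0] <= t2[1] and t2[0] <= t1[1]


def ground_conflicts(t_facts):
    """Return pairs of facts that share a partner and have overlapping scopes."""
    conflicts = []
    for (f1, t1), (f2, t2) in combinations(t_facts.items(), 2):
        shares_partner = f1[0] == f2[0] or f1[1] == f2[1]
        if shares_partner and overlaps(t1, t2):
            conflicts.append((f1, f2))
    return conflicts
```

Only pairs whose overlaps term evaluates to true yield an effective constraint; the other grounded clauses are trivially satisfied and can be dropped.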

39 4) Joint Reasoning on Facts and T-Facts
Conflict graph over the t-facts: m(Ca, Ben) [2009,2011], m(Ca, Nic) [2008,2010], m(Ce, Nic) [1996,2007], m(Ca, Mi) [2004,2008], m(Ce, Mi) [1998,2005]
Find maximal independent set: subset of nodes w/o adjacent pairs, with (evidence-)weighted nodes
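For a graph this small, the weighted independent set can be found by brute force. In the sketch below the intervals come from the slide, while the evidence weights are invented and the intervals are treated as closed (so [2004,2008] and [2008,2010] count as overlapping); both are assumptions.

```python
from itertools import combinations

# Intervals from the slide.
T_FACTS = {
    ("Ca", "Ben"): (2009, 2011),
    ("Ca", "Nic"): (2008, 2010),
    ("Ce", "Nic"): (1996, 2007),
    ("Ca", "Mi"): (2004, 2008),
    ("Ce", "Mi"): (1998, 2005),
}

# Hypothetical evidence weights per node.
WEIGHTS = {
    ("Ca", "Ben"): 0.3,
    ("Ca", "Nic"): 0.9,
    ("Ce", "Nic"): 0.8,
    ("Ca", "Mi"): 0.4,
    ("Ce", "Mi"): 0.3,
}


def in_conflict(f1, f2):
    """Edge in the conflict graph: shared partner and overlapping time scopes."""
    (s1, e1), (s2, e2) = T_FACTS[f1], T_FACTS[f2]
    shares_partner = f1[0] == f2[0] or f1[1] == f2[1]
    return shares_partner and s1 <= e2 and s2 <= e1


def best_independent_set():
    """Try every subset and keep the conflict-free one with the largest weight."""
    nodes = list(T_FACTS)
    best, best_weight = set(), 0.0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if any(in_conflict(a, b) for a, b in combinations(subset, 2)):
                continue
            weight = sum(WEIGHTS[n] for n in subset)
            if weight > best_weight:
                best, best_weight = set(subset), weight
    return best
```

With these weights the two strongly supported marriages of Nicolas win over the larger but weaker set of three facts. Subset enumeration is exponential, so a real system would need an approximation.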


41 4) Joint Reasoning on Facts and T-Facts time m(Ca, Ben) m(Ca, Nic) m(Ce, Nic) m(Ca, Mi) m(Ce, Mi) alternative approach: split t-scopes and reason on consistency of t-fact partitions

42 Preliminary Results overlaps (T1,T2) teammates(X,Y) automatic extraction of t-facts about football/soccer from Wikipedia and news articles query answering by reasoning on t-facts

43 Outline: What and Why; Automatic KB Construction; Growing & Maintaining the KB; Temporal Knowledge; Wrap-up

44 KB Building: Where Do We Stand?
Knowledge Bases on Entities & Classes: strong success story, some problems left: large taxonomies of classes with individual entities; long tail calls for new methods; entity disambiguation remains grand challenge
Relationships: good progress, but many challenges left: recall & precision by patterns & reasoning; efficiency & scalability; soft rules, hard constraints, richer logics, …; open-domain discovery of new relation types
Temporal Knowledge: widely open (fertile) research ground: uncertain / incomplete temporal scopes of facts; joint reasoning on base-facts and time-scopes

45 Overall Take-Home
Historic opportunity: revive Cyc vision, make it real & large-scale! KB as enabler of macroscopic machine reading; challenging & risky, but high pay-off
Explore & exploit synergies between semantic, statistical, & social Web methods: statistical evidence + logical consistency!
Many interesting research topics for CS (+ CoLi): efficiency & scalability; constraints & reasoning on uncertain data; NLP for temporal statements; statistical ranking for semantic search; knowledge-base life-cycle: growth & maintenance

46 Thank You !

