Presentation is loading. Please wait.

Presentation is loading. Please wait.

Gerhard Weikum Max Planck Institute for Informatics From Information to Knowledge Harvesting Entities and Relationships.

Similar presentations


Presentation on theme: "Gerhard Weikum Max Planck Institute for Informatics From Information to Knowledge Harvesting Entities and Relationships."— Presentation transcript:

1 Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/ From Information to Knowledge Harvesting Entities and Relationships From Web Sources Martin Theobald Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~mtb/

2 Goal: Turn Web into Knowledge Base comprehensive DB of human knowledge everything that Wikipedia knows everything machine-readable capturing entities, classes, relationships Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009

3 Approach: Harvesting Facts from Web PoliticianPolitical Party Angela MerkelCDU Karl-Theodor zu GuttenbergCDU Christoph HartmannFDP … CompanyCEO GoogleEric Schmidt YahooOverture FacebookFriendFeed Software AGIDS Scheer … MovieReportedRevenue Avatar$ 2,718,444,933 The Reader$ 108,709,522 FacebookFriendFeed Software AGIDS Scheer … PoliticalPartySpokesperson CDU Philipp Wachholz Die GrünenClaudia Roth FacebookFriendFeed Software AGIDS Scheer … ActorAward Christoph WaltzOscar Sandra BullockOscar Sandra BullockGolden Raspberry … PoliticianPosition Angela MerkelChancellor Germany Karl-Theodor zu GuttenbergMinister of Defense Germany Christoph HartmannMinister of Economy Saarland … CompanyAcquiredCompany GoogleYouTube YahooOverture FacebookFriendFeed Software AGIDS Scheer … YAGO-NAGA IWP Cyc TextRunner ReadTheWeb

4 Knowledge as Enabling Technology entity recognition & disambiguation understanding natural language & speech knowledge services & reasoning for semantic apps (e.g. deep QA) semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.) Indy 500 winners who are still alive? Politicians who are also scientists? Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure?... US president when Barack Obama was born? Relationship between Angela Merkel, Jim Gray, Dalai Lama?

5 5/54 Knowledge Search (1) Who was US president when Barack Obama was born? http://www.wolframalpha.com

6 6/54 Knowledge Search (1) http://www.wolframalpha.com Who was mayor of Indianapolis when Barack Obama was born? not enough facts in KB !

7 7/54 Knowledge Search (2) http://www.google.com/squared/ Indy500 winners?

8 8/54 Knowledge Search (2) http://www.google.com/squared/ Indy500 winners?

9 9/54 Knowledge Search (2) http://www.google.com/squared/ Indy500 winners from Europe? no types no inference !

10 YAGO-NAGA Related Work communities Kylin KOG Cyc Freebase Cimple DBlife UIMA DBpedia Yago-Naga StatSnowball EntityCube Avatar System T Powerset START ontologies information extraction Answers SWSE Hakia TextRunner TrueKnowledge WolframAlpha Text2Onto sig.ma kosmix KnowItAll (Semantic Web) (Statistical Web) (Social Web) ReadTheWeb GoogleSquared 10/38 Cyc TextRunner ReadTheWeb IWP WebTables WorldWideTables PSOX EntityRank Cazoodle

11 Outline... Framework Entities and Classes Relationships Temporal Knowledge What and Why  Wrap-up

12 Framework: Types of Knowledge... facts / assertions: bornIn (JohnDillinger, Indianapolis) hasWon (JimGray, TuringAward), … taxonomic : instanceOf (JohnDillinger, bankRobbers), subclassOf (bankRobbers, criminals), … lexical / terminology: means (“Big Apple“, NewYorkCity), means (“Big Mike“, MichaelStonebraker) means (“MS“, Microsoft), means (“MS“, MultipleSclerosis) … common-sense properties: apples are green, red, juicy, sweet, sour … - but not fast, smart … balls are round, smooth, slippery … - but not square, funny … common-sense axioms:  x: human(x)  male(x)  female(x)  x: (male(x)   female(x))  (female(x) )   male(x))  x: animal(x)  (hasLegs(x)  isEven(numberOfLegs(x)) … procedural: how to fix/install/prepare/remove … epistemic / beliefs: believes (Ptolemy, shape(Earth, disc)), believes (Copernicus, shape(Earth, sphere)) …

13 Framework: Information Extraction (IE) many sources one source Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal … source- centric IE instanceOf (Surajit, scientist) inField (Surajit, computer science) hasAdvisor (Surajit, Jeff Ullman) almaMater (Surajit, Stanford U) workedFor (Surajit, HP) friendOf (Surajit, Umesh Dayal) … yield-centric harvesting Student Advisor hasAdvisor Student University almaMater Student Advisor 1) recall ! 2) precision 1) precision ! 2) recall near-human quality ! Student Advisor Surajit Chaudhuri Jeffrey Ullman Alon Halevy Jeffrey Ullman Jim Gray Mike Harrison … … Student University Surajit Chaudhuri Stanford U Alon Halevy Stanford U Jim Gray UC Berkeley … …

14 Framework: Knowledge Representation... RDF (Resource Description Framework, W3C): subject-property-object (SPO) triples, binary relations structure, but no (prescriptive) schema Relations, frames Description logics: OWL, DL-lite Higher-order logics, epistemic logics temporal & provenance annotations can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: “Color“  Sub  Prop  Obj) facts (RDF triples): (JimGray, hasAdvisor, MikeHarrison) (SurajitChaudhuri, hasAdvisor, JeffUllman) (Madonna, marriedTo, GuyRitchie) (NicolasSarkozy, marriedTo, CarlaBruni) facts (RDF triples) 1: 2: 3: 4: facts about facts: 5: (1, inYear, 1968) 6: (2, inYear, 2006) 7: (3, validFrom, 22-Dec-2000) 8: (3, validUntil, Nov-2008) 9: (4, validFrom, 2-Feb-2008) 10: (2, source, SigmodRecord)

15 http://www.mpi-inf.mpg.de/yago-naga/ KB‘s: Example YAGO (Suchanek et al.: WWW‘07) Entity Max_Planck Apr 23, 1858 Person City Country subclass Location subclass instanceOf subclass bornOn “Max Planck” means (0.9) subclass Oct 4, 1947 diedOn Kiel bornIn Nobel Prize Erwin_Planck FatherOf hasWon Scientist means “Max Karl Ernst Ludwig Planck” Physicist instanceOf subclass Biologist subclass Germany Politician Angela Merkel Schleswig- Holstein State “Angela Dorothea Merkel” Oct 23, 1944 diedOn Organization subclass Max_Planck Society instanceOf means(0.1) instanceOf subclass means “Angela Merkel” means citizenOf instanceOf locatedIn subclass Accuracy  95% 2 Mio. entities, 20 Mio. facts 40 Mio. RDF triples ( entity1-relation-entity2, subject-predicate-object )

16 KB‘s: Example YAGO (F. Suchanek et al.: WWW‘07) http://www.mpi-inf.mpg.de/yago-naga/

17 KB‘s: Example DBpedia (Auer, Bizer, et al.: ISWC‘07) 3 Mio. entities, 1 Bio. facts (RDF triples) 1.5 Mio. entities mapped to hand-crafted taxonomy of 259 classes with 1200 properties http://www.dbpedia.org

18 Outline... Framework Entities and Classes Relationships Temporal Knowledge What and Why  Wrap-up 

19 Entities & Classes... Which entity types (classes, unary predicates) are there? Which subsumptions should hold (subclass/superclass, hyponym/hypernym, inclusion dependencies) ? Which individual entities belong to which classes? Which names denote which entities? scientists, doctoral students, computer scientists, … female humans, male humans, married humans, … subclassOf (computer scientists, scientists), subclassOf (scientists, humans), … instanceOf (Surajit Chaudhuri, computer scientists), instanceOf (BarbaraLiskov, computer scientists), instanceOf (Barbara Liskov, female humans), … means (“Lady Di“, Diana Spencer), means (“Diana Frances Mountbatten-Windsor”, Diana Spencer), … means (“Madonna“, Madonna Louise Ciccone), means (“Madonna“, Madonna(painting by Edward Munch)), …

20 WordNet Thesaurus [Miller/Fellbaum 1998] http://wordnet.princeton.edu/ 3 concepts / classes & their synonyms (synset‘s)

21 WordNet Thesaurus [Miller/Fellbaum 1998] http://wordnet.princeton.edu/ subclasses (hyponyms) superclasses (hypernyms)

22 WordNet Thesaurus [Miller & Fellbaum 1998] scientist, man of science (a person with advanced knowledge) => cosmographer, cosmographist => biologist, life scientist => chemist => cognitive scientist => computer scientist... => principal investigator, PI … HAS INSTANCE => Bacon, Roger Bacon … but: only few individual entities (instances of classes) > 100 000 classes and lexical relations; can be cast into description logics or graph, with weights for relation strengths (derived from co-occurrence statistics) http://wordnet.princeton.edu/

23 Tapping on Wikipedia Categories

24

25 Mapping: Wikipedia  WordNet [Suchanek: WWW‘07, Ponzetto&Strube: AAAI‘07] Jim Gray (computer specialist) Computer Scientist American Scientist Sailor, Crewman Missing Person Chemist Artist

26 American Sailor, Crewman Mapping: Wikipedia  WordNet [Suchanek: WWW‘07, Ponzetto&Strube: AAAI‘07] Jim Gray (computer specialist) Computer Scientist Data- base Fellow (1), Comrade Fellow (2), Colleague Fellow (3) (of Society) Scientist Member (1), Fellow Member (2), Extremity American Computer Scientists Database Researcher Fellows of the ACM People Lost at Sea instanceOf subclassOf ? ? ? name similarity (edit dist., n-gram overlap) ? context similarity (word/phrase level) ? machine learning ? Computer Scientists by Nation Databases ACM Members of Learned Societies Engineering Societies ? ? ? Missing Person

27 Mapping: Wikipedia  WordNet [Suchanek: WWW‘07, Ponzetto & Strube:AAAI‘07] Analyzing category names  noun group parser: American Musicians of Italian Descent American Folk Music of the 20th Century American Indy 500 Drivers on Pole Positions Head word is key, should be in plural for instanceOf headpre-modifierpost-modifierheadpre-modifierpost-modifier headpre-modifierpost-modifier Given: entity e in Wikipedia categories c 1, …, c k Wanted: instanceOf(e,c) and subclassOf(c i,c) for WN class c Problem: vagueness & ambiguity of names c 1, …, c k

28 Mapping Wikipedia Entities to WordNet Classes Given: entity e in Wikipedia categories c 1, …, c k Wanted: instanceOf(e,c) and subclassOf(c i,c) for WN class c Problem: vagueness & ambiguity of names c 1, …, c k Heuristic Method: for each c i do if head word w of category name c i is plural { 1) match w against synsets of WordNet classes 2) choose best fitting class c and set e  c 3) expand w by pre-modifier and set c i  w +  c } can also derive features this way feed into supervised classifier [Suchanek: WWW‘07, Ponzetto & Strube: AAAI‘07] tuned conservatively: high precision, reduced recall

29 Learning More Mappings [ Wu & Weld: WWW‘08 ] Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using YAGO as training data advanced ML methods (MLN‘s, SVM‘s) rich features from various sources category/class name similarity measures category instances and their infobox templates: template names, attribute names (e.g. knownFor) Wikipedia edit history: refinement of categories Hearst patterns: C such as X, X and Y and other C‘s, … other search-engine statistics: co-occurrence frequencies > 3 Mio. entities > 1 Mio. w/ infoboxes > 500 000 categories

30 Goal: Comprehensive & Consistent ! Jim Gray (computer specialist) Madonna (entertainer) Jeffrey Ullman Bob Dylan … … American Computer Scientists Database Researcher Fellows of the ACM Databases Members of Learned Societies Artist Singer Italian American Musician Born Award Winner Scientist Known For Alma Mater Notable Awards Doctoral Students Academic Bell Labs Princeton Alumni Knuth Prize Laureate American People by Occupation Fellow(1) Fellow(2) World Record Holders American Songwriters Athlete Genres Years Active Hall of Fame Inductees U Michigan Alumni Also Known As Website Guitar Players Americans of Italian Descent People by Status Computer Data Telecomm. History

31 Goal: Comprehensive & Consistent ! Jim Gray (computer specialist) Madonna (entertainer) Jeffrey Ullman Bob Dylan … … American Computer Scientists Database Researcher Fellows of the ACM Databases Members of Learned Societies Artist Singer Italian American Musician Born Award Winner Scientist Known For Alma Mater Notable Awards Doctoral Students Academic Bell Labs Princeton Alumni Knuth Prize Laureate American People by Occupation Fellow(1) Fellow(2) World Record Holders American Songwriters Athlete Genres Years Active Hall of Fame Inductees U Michigan Alumni Also Known As Website Guitar Players Americans of Italian Descent People by Status Computer Data Telecomm. History

32 Goal: Comprehensive & Consistent ! Jim Gray (computer specialist) Madonna (entertainer) Jeffrey Ullman Bob Dylan … … American Computer Scientists Database Researcher Fellows of the ACM Databases Members of Learned Societies Artist Singer Italian American Musician Born Award Winner Scientist Known For Alma Mater Notable Awards Doctoral Students Academic Bell Labs Princeton Alumni Knuth Prize Laureate American People by Occupation Fellow(1) Fellow(2) World Record Holders American Songwriters Athlete Genres Years Active Hall of Fame Inductees U Michigan Alumni Also Known As Website Guitar Players Americans of Italian Descent People by Status Computer Data Telecomm. History

33 Goal: Comprehensive & Consistent ! Jim Gray (computer specialist) Madonna (entertainer) Jeffrey Ullman Bob Dylan … … American Computer Scientists Database Researcher Fellows of the ACM Databases Members of Learned Societies Artist Singer Italian American Musician Born Award Winner Scientist Known For Alma Mater Notable Awards Doctoral Students Academic Bell Labs Princeton Alumni Knuth Prize Laureate American People by Occupation Fellow(1) Fellow(2) World Record Holders American Songwriters Athlete Genres Years Active Hall of Fame Inductees U Michigan Alumni Also Known As Website Guitar Players Americans of Italian Descent People by Status Computer Data Telecomm. History Clean up the mess: graph algorithms ? random walk with restart dense subgraphs … statistical machine learning ? logical consistency reasoning ? gigantic schema integration ? ontology merging

34 Long Tail of Class Instances

35 [Etzioni et al. 2004, Cohen et al. 2008, Mitchell et al. 2010] But: Precision drops for classes with sparse statistics (DB profs, …) Harvested items are names, not entities Canonicalization (de-duplication) unsolved State-of-the-Art Approach (e.g. SEAL): Start with seeds: a few class instances Find lists, tables, text snippets (“for example: …“), … that contain one or more seeds Extract candidates: noun phrases from vicinity Gather co-occurrence stats (seed&cand, cand&className pairs) Rank candidates point-wise mutual information, … random walk (PR-style) on seed-cand graph

36 Individual Entity Disambiguation “Penn“ “U Penn“ University of Pennsylvania “Penn State“ Pennsylvania State University „PSU“ Pennsylvania (US State) Sean Penn Passenger Service Unit NamesEntities ?? ill-defined with zero context known as record linkage for names in record fields Wikipedia offers rich candidate mappings: disambiguation pages, re-directs, inter-wiki links, anchor texts of href links

37 Collective Entity Disambiguation Consider a set of names {n 1, n 2, …} in same context and sets of candidate entities E1 = {e 11, e 12, …}, E2 = {e 21, e 22, …}, … Define joint objective function (e.g. likelihood for prob. model) that rewards coherence of mappings n i  e ij [McCallum 2003, Doan 2005, Getoor 2006. Domingos 2007, Chakrabarti 2009, …] Solve optimization problem Stuart Russell Michael Jordan Stuart Russell (computer scientist) Stuart Russell (DJ) Michael Jordan (computer scientist) Michael Jordan (NBA)

38 Problems and Challenges Wikipedia categories reloaded Robust disambiguation Tags, tables, topics Long tail of entities comprehensive & consistent instanceOf and subClassOf across Wikipedia and WordNet (via consistency reasoning ?) tap on other sources: Web2.0, Web tables, directories, etc. near-real-time mapping of names to entities with near-human quality discover new entities, detect new names for known entities beyond Wikipedia: domain-specific entity catalogs

39 Outline... Framework Entities and Classes Relationships Temporal Knowledge What and Why  Wrap-up  

40 Relationships Which instances (pairs of individual entities) are there for given binary relations with specific type signatures? hasAdvisor (JimGray, MikeHarrison) hasAdvisor (HectorGarcia-Molina, Gio Wiederhold) hasAdvisor (Susan Davidson, Hector Garcia-Molina) graduatedAt (JimGray, Berkeley) graduatedAt (HectorGarcia-Molina, Stanford) hasWonPrize (JimGray, TuringAward) bornOn (JohnLennon, 9Oct1940) diedOn (JohnLennon, 8Dec1980) marriedTo (JohnLennon, YokoOno) Which additional & interesting relation types are there between given classes of entities? competedWith(x,y), nominatedForPrize(x,y), … divorcedFrom(x,y), affairWith(x,y), … assassinated(x,y), rescued(x,y), admired(x,y), …

41 Picking Low-Hanging Fruit (First)

42 Deterministic Pattern Matching... [Kushmerick 97, Califf & Mooney 99, Gottlob 01, …] Regular expressions matching Wrapper induction (grammar learning for restricted regular languages) Well understood

43 French Marriage Problem facts in KB: new facts or fact candidates: married (Hillary, Bill) married (Carla, Nicolas) married (Angelina, Brad) married (Cecilia, Nicolas) married (Carla, Benjamin) married (Carla, Mick) married (Michelle, Barack) married (Yoko, John) married (Kate, Leonardo) married (Carla, Sofie) married (Larry, Google) 1)for recall: pattern-based harvesting 2)for precision: consistency reasoning

44 Pattern-Based Harvesting FactsPatterns (Hillary, Bill) (Carla, Nicolas) & Fact Candidates X and her husband Y X and Y on their honeymoon X and Y and their children X has been dating with Y X loves Y … good for recall noisy, drifting not robust enough for high precision (Angelina, Brad) (Hillary, Bill) (Victoria, David) (Carla, Nicolas) (Angelina, Brad) (Yoko, John) (Carla, Benjamin) (Larry, Google) (Kate, Pete) (Victoria, David) (Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)

45 Reasoning about Fact Candidates Use consistency constraints to prune false candidates spouse(Hillary,Bill) spouse(Carla,Nicolas) spouse(Cecilia,Nicolas) spouse(Carla,Ben) spouse(Carla,Mick) Spouse(Carla, Sofie) spouse(x,y)  diff(y,z)   spouse(x,z) f(Hillary) f(Carla) f(Cecilia) f(Sofie) m(Bill) m(Nicolas) m(Ben) m(Mick) spouse(x,y)  f(x)spouse(x,y)  m(y) spouse(x,y)  (f(x)  m(y))  (m(x)  f(y)) FOL rules (restricted): ground atoms: Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule)  uncertain / probabilistic data  compute prob. distr. of subset of atoms being the truth Rules reveal inconsistencies Find consistent subset(s) of atoms (“possible world(s)“, “the truth“) spouse(x,y)  diff(w,y)   spouse(w,y)

46 Markov Logic Networks (MLN‘s) (M. Richardson / P. Domingos 2006) Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF) s(x,y)  m(y) s(x,y)  diff(y,z)   s(x,z) s(Carla,Nicolas) s(Cecilia,Nicolas) s(Carla,Ben) s(Carla,Sofie) … s(x,y)  diff(w,y)   s(w,y) s(x,y)  f(x)  s(Ca,Nic)   s(Ce,Nic)  s(Ca,Nic)   s(Ca,Ben)  s(Ca,Nic)   s(Ca,So)  s(Ca,Ben)   s(Ca,So)  s(Ca,Nic)  m(Nic) Grounding:  s(Ce,Nic)  m(Nic)  s(Ca,Ben)  m(Ben)  s(Ca,So)  m(So) f(x)   m(x) M(x)   f(x) Literal  Boolean Var Literal  binary RV

47 Markov Logic Networks (MLN‘s) (M. Richardson / P. Domingos 2006) Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF) s(x,y)  m(y) s(x,y)  diff(y,z)   s(x,z) s(Carla,Nicolas) s(Cecilia,Nicolas) s(Carla,Ben) s(Carla,Sofie) … s(x,y)  diff(w,y)   s(w,y) s(x,y)  f(x)f(x)   m(x) M(x)   f(x) m(Ben) m(Nic) s(Ca,Nic) s(Ce,Nic) s(Ca,Ben) s(Ca,So) m(So) RVs coupled by MRF edge if they appear in same clause MRF assumption: P[X i |X 1..X n ]=P[X i |N(X i )] Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, … joint distribution has product form over all cliques

48 Related Alternative Probabilistic Models software tools: alchemy.cs.washington.edu alchemy.cs.washington.edu code.google.com/p/factorie/ research.microsoft.com/en-us/um/cambridge/projects/infernet/ Constrained Conditional Models [D. Roth et al. 2007] Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008] log-linear classifiers with constraint-violation penalty mapped into Integer Linear Programs RV‘s share “factors“ (joint feature functions) generalizes MRF, BN, CRF, … inference via advanced MCMC flexible coupling & constraining of RV‘s m(Ben) m(Nic) s(Ca,Nic) s(Ce,Nic) s(Ca,Ben) s(Ca,So) m(So)

49 Reasoning for KB Growth: Direct Route facts in KB: new fact candidates: married (Hillary, Bill) married (Carla, Nicolas) married (Angelina, Brad) married (Cecilia, Nicolas) married (Carla, Benjamin) married (Carla, Mick) married (Carla, Sofie) married (Larry, Google) + patterns: X and her husband Y X and Y and their children X has been dating with Y X loves Y ? facts are true; fact candidates & patterns  hypotheses grounded constraints  clauses with hypotheses as vars cast into Weighted Max-Sat with weights from pattern stats customized approximation algorithm unifies: fact cand consistency, pattern goodness, entity disambig. (F. Suchanek et al.: WWW‘09) www.mpi-inf.mpg.de/yago-naga/sofie/ Direct approach:

50 Facts & Patterns Consistency constraints to connect facts, fact candidates, patterns (F. Suchanek et al.: WWW‘09) functional dependencies: spouse(X,Y): X  Y, Y  X relation properties: asymmetry, transitivity, acyclicity, … type constraints, inclusion dependencies: spouse  Person  PersoncapitalOfCountry  cityOfCountry domain-specific constraints: bornInYear(x) + 10years ≤ graduatedInYear(x) www.mpi-inf.mpg.de/yago-naga/sofie/ hasAdvisor(x,y)  graduatedInYear(x,t)  graduatedInYear(y,s)  s < t pattern-fact duality: occurs(p,x,y)  expresses(p,R)  R(x,y) name(-in-context)-to-entity mapping:  means(n,e1)   means(n,e2)  … occurs(p,x,y)  R(x,y)  expresses(p,R)

51 Soft Rules vs. Hard Constraints Enforce FD‘s (mutual exclusion) as hard constraints: Generalize to other forms of constraints: hard constraintsoft constraint hasAdvisor(x,y)  graduatedInYear(x,t)  graduatedInYear(y,s)  s < t firstPaper(x,p)  firstPaper(y,q)  author(p,x)  author(p,y) )  inYear(p) > inYear(q) + 5years  hasAdvisor(x,y) hasAdvisor(x,y)  diff(y,z)   hasAdvisor(x,z) combine with weighted constraints no longer MaxSat constrained MaxSat instead open issue for arbitrary constraints  rethink reasoning !

52 Problems and Challenges High precision & high recall at affordable cost Scale, dynamics, life-cycle Declarative, self-optimizing workflows Types and constraints robust pattern analysis & reasoning incorporate pattern & reasoning steps into IE queries/programs grow & maintain KB with near-human-quality over long periods explore & understand different families of constraints soft rules & hard constraints, rich DL, beyond CWA parallel processing, lazy / lifted inference, … Open-domain knowledge harvesting turn names, phrase & table cells into entities & relations

53 Outline... Framework Entities and Classes Relationships Temporal Knowledge What and Why  Wrap-up   

54 Temporal Knowledge Which facts for given relations hold at what time point or during which time intervals ? marriedTo (Madonna, Guy) [ 22Dec2000, Dec2008 ] capitalOf (Berlin, Germany) [ 1990, now ] capitalOf (Bonn, Germany) [ 1949, 1989 ] hasWonPrize (JimGray, TuringAward) [ 1998 ] graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ] graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ] hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ] How can we query & reason on entity-relationship facts in a “time-travel“ manner - with uncertain/incomplete KB ? US president when Barack Obama was born? students of Hector Garcia-Molina while he was at Princeton?

55 French Marriage Problem facts in KB new fact candidates: married (Hillary, Bill) married (Carla, Nicolas) married (Angelina, Brad) married (Cecilia, Nicolas) married (Carla, Benjamin) married (Carla, Mick) divorced (Madonna, Guy) domPartner (Angelina, Brad) 1: 2: 3: validFrom (2, 2008) validFrom (4, 1996) validUntil (4, 2007) validFrom (5, 2010) validFrom (6, 2006) validFrom (7, 2008) 4: 5: 6: 7: 8: JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC

56 Challenge: Temporal Knowledge for all people in Wikipedia (100,000‘s) gather all spouses, incl. divorced & widowed, and corresponding time periods! >95% accuracy, >95% coverage, in one night consistency constraints are potentially helpful: functional dependencies: husband, time  wife inclusion dependencies: marriedPerson  adultPerson age/time/gender restrictions: birthdate +  < marriage < divorce 1)recall: gather temporal scopes for base facts 2)precision: reason on mutual consistency

57 Difficult Dating

58 (Even More Difficult) Implicit Dating explicit dates vs. implicit dates relative to other dates

59 (Even More Difficult) Relative Dating vague dates relative dates vague dates relative dates narrative text relative order narrative text relative order

60 TARSQI: Extracting Time Annotations Hong Kong is poised to hold the first election in more than half a century that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro- democracy politician, Alan Leong, announced Wednesday that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re- election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the March 25 election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in 1997. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in 2005, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for another five years, said Mr. Leong, the former chairman of the Hong Kong Bar Association. (M. Verhagen et al.: ACL‘05)http://www.timeml.org/site/tarsqi/ extraction errors extraction errors

61 Representing Time: AI Perspective Instant –durationless piece of time Period –potentially unbounded continuum of instants Events –time as a sequence of events  E –precedence and overlap relations on E  E [Allen 1984, Allen & Hayes 1989, …]

62 Relations between Time Periods A Before B B After A A Meets B B MetBy A A Overlaps B B OverlappedBy A A Starts B B StartedBy A A During B B Contains A A Finishes B B FinishedBy A A Equal B AB A B A B A B A B A B A B

63 Representing Time: DB Perspective Time point: smallest time unit of fixed duration/granularity (e.g., a day, a year, a second) Interval: finite set of time points State relation: fact holds at every time point within interval isCapitalOf (Bonn, Germany) [1949, 1989] Event relation: fact holds at exactly one time point within interval wonCup (United, ChampionsLeague) [1999, 1999] intervals can also capture uncertainty of time points

64 Uncertainty and Time Point-probabilities for facts and intervals playsFor(Beckham, United)[1990, 2005]:0.9 –fact valid in interval [t b, t e ] with prob. p –fact not valid with prob. 1-p Continuous distributions playsFor(Beckham, United) [1990, 2005]:Gauss(µ=1996,σ 2 =1) Histograms playsFor(Beckham, United) [1990, 1992):0.1 [1992, 2004):0.6 [2004, 2005]:0.2 0.6 0.2 0.1 ‘90‘92 ‘05‘04 0.9 ‘90 ‘05 ‘90 ‘96 ‘05 µ=1996 σ 2 =1

65 0.3 0.6 Possible Worlds in Time 0.3 StateEvent ‘95 ‘98‘02 ‘96‘98 ‘00‘01 ‘96‘99‘00 ‘99 0.54 0.91.0 ‘01 ‘98 playsFor (Beckham, United)wonCup (United, ChampionsLeague) playsFor(Beckham, United)  wonCup(United, ChampionsLeague) Base Facts hasWon (Beckham, ChampionsLeague)  0.2 0.5 0.10.2 0.12 0.30 0.06 #P-complete per histogram bin linear in #bins

66 Joint Reasoning on Facts & Time marriedTo (Nicolas, Carla) 0.91 marriedTo (Nicolas, Cecilia) 0.65 divorcedFrom (Nicolas, Cecilia) 0.78 bornIn (Nicolas, Paris) 0.77 bornIn (Cecilia, Boulogne) 0.12 bornIn (Carla, Turin) 0.43 marriedTo (Carla, Ben) 0.18 marriedTo (Carla, Mick) 0.25 marriedTo(a,b,T1)  marriedTo(a,c,T2)  different(b,c)  disjoint(T1,T2) marriedTo(a,b,T1)  divorcedFrom(a,b,T2)  before(T1,T2) marriedTo(a,b,T1)  bornIn(a,c,T2)  before(T2,T1) Rules:Facts from KB (with confidence weights):

67 Joint Reasoning on Facts & Time bornIn(Nicolas, Paris) bornIn(Cecilia, Boulogne) bornIn(Carla, Turin) m(Nicolas, Cecilia) div(Nicolas, Cecilia) m(Nicolas, Carla) m(Carla, Mick) m(Carla, Ben) marriedTo (Nicolas, Carla) marriedTo (Nicolas, Cecilia) divorcedFrom (Nicolas, Cecilia) marriedTo (Carla, Mick) marriedTo (Carla, Ben) bornIn (Carla, Turin) bornIn (Cecilia, Boulogne) bornIn (Nicolas, Paris) 0.91 0.650.78 0.770.12 0.43 0.18 0.25 marriedTo(a,b,T1)  marriedTo(a,c,T2)  different(b,c)  disjoint(T1,T2) marriedTo(a,b,T1)  divorcedFrom(a,b,T2)  before(T1,T2) marriedTo(a,b,T1)  bornIn(a,c,T2)  before(T2,T1) Rules:Facts from KB (with confidence weights): time + more soft rules: hasChild (a,c)  hasChild (b,c)  different (a,b)  marriedTo(a,b) + recursive rules … Compute most likely possible world !

68 Problems and Challenges Temporal Querying (Revived) Consistency Reasoning Incomplete and Uncertain Temporal Scopes Gathering Implicit and Relative Time Annotations query language (T-SPARQL?), no schema confidence weights & ranking incorrect, incomplete, unknown begin/end vague dating biographies & news, relative orderings aggregate & reconcile observations extended MaxSat, extended Datalog, prob. graph. models, etc. for resolving inconsistencies on uncertain facts & uncertain time

69 Outline... Framework Entities and Classes Relationships Temporal Knowledge What and Why  Wrap-up    

70 KB Building: Where Do We Stand? Entities & Classes Relationships Temporal Knowledge widely open (fertile) research ground: uncertain / incomplete temporal scopes of facts joint reasoning on ER facts and time scopes good progress, but many challenges left: recall & precision by patterns & reasoning efficiency & scalability soft rules, hard constraints, richer logics, … open-domain discovery of new relation types strong success story, some problems left: large taxonomies of classes with individual entities long tail calls for new methods entity disambiguation remains grand challenge

71 Overall Take-Home... Historic opportunity: revive Cyc vision, make it real & large-scale ! challenging & risky, but high pay-off Explore & exploit synergies between semantic, statistical, & social Web methods: statistical evidence + logical consistency ! For DB researchers (theoreticians & normal ones): efficiency & scalability constraints & reasoning killer app for uncertain data management knowledge-base life-cycle: growth & maintenance

72 Thank You !


Download ppt "Gerhard Weikum Max Planck Institute for Informatics From Information to Knowledge Harvesting Entities and Relationships."

Similar presentations


Ads by Google