 Copyright 2009 Digital Enterprise Research Institute. All rights reserved. Digital Enterprise Research Institute www.deri.ie Using Datalog for Rule-Based.


1  Copyright 2009 Digital Enterprise Research Institute. All rights reserved. Digital Enterprise Research Institute Using Datalog for Rule-Based Reasoning over Web Data: Challenges and Next Steps Axel Polleres Digital Enterprise Research Institute, NationaI University of Ireland, Galway Joint work with Aidan Hogan, Andreas Harth, Stefan Decker

2 This talk is about…
the Semantic Web, in particular practical Web reasoning and how and why we apply Datalog there.
Misquoting Jim Hendler: “A little Datalog goes a long way”

3 The Web of Data
[Figure: the Web of Data, March 2008 vs. March 2009]
Structured knowledge on the Web…
- … in the order of billions of statements
- … growing fast!

4 Search Engines for the Web of Data
Promise: query answering over RDF Web data.
Typical assumptions for search engines remain:
- sub-second response times are expected
- obvious “garbage” should be filtered/ignored

5 Simplified “added value” proposition of Semantic Search…
Fig 1: RDF Web dataset
- “explicit” data: RDF
- “implicit” data? Via inference using OWL2, RDF Schema!

6 Problem: Synonymous Omissions
Data:
amazon:MSLam foaf:made amazon:Compilers.
dblp:M_S_Lam foaf:made dblp:SystArrayOptCompilers.
Query: Give me books written by Monica Lam?
amazon:MSLam foaf:made ?Book.

7 Problem: Different “Ontologies” used
Data:
amazon:Compilers dc:creator ex:MSLam.
Query: Give me books written by Monica Lam?
amazon:MSLam foaf:made ?Book.

8 Solution: Publish Complete Data?
Data:
amazon:Compilers dc:creator amazon:MSLam.
amazon:MSLam foaf:made amazon:Compilers.
Query: Give me books written by Monica Lam?
amazon:MSLam foaf:made ?Book.

9 Solution: Ask query in all possible ways?
Data:
amazon:Compilers dc:creator ex:MSLam.
Query: Give me books written by Monica Lam?
amazon:MSLam foaf:made ?Book. UNION ?Book dc:creator amazon:MSLam.

10 Solution: Exploit OWL and RDFS…
Explicit data:
amazon:Compilers dc:creator amazon:MSLam.
dc:creator owl:inverseOf foaf:made.
dblp:M_S_Lam foaf:made dblp:SystArrayOptCompilers.
amazon:MSLam owl:sameAs dblp:M_S_Lam.
Entailed via owl:inverseOf and owl:sameAs:
amazon:MSLam foaf:made amazon:Compilers.
amazon:MSLam foaf:made dblp:SystArrayOptCompilers.
Query: Give me books written by Monica Lam?
amazon:MSLam foaf:made ?Book.

11 Inference over OWL and RDFS…
Two of the “mainstream” directions:
- DL fragments of OWL/RDFS: OWL Lite, OWL DL, OWL2DL, etc.
  1. Reduce Web data to DL facts (A-Box) and terminological axioms (T-Box)
  2. Use a DL reasoner to answer queries
- Datalog-reducible fragments of OWL: RDFS, DLP, pD*, OWL2RL
  1. Encode the semantics of OWL/RDFS into Datalog rules. Both assertional and terminological knowledge remain just facts.
  2. Apply forward- or backward-chaining inference
Example facts:
amazon:Compilers dc:creator amazon:MSLam.
amazon:MSLam foaf:made amazon:SystArrayOptCompilers.
amazon:MSLam owl:sameAs dblp:M_S_Lam.
foaf:made rdfs:domain foaf:Person.
dc:creator owl:inverseOf foaf:made.

12 Inference over OWL and RDFS…
Two of the “mainstream” directions:
- Datalog-reducible fragments of OWL: RDFS, OWL-, DLP, pD*, OWL2RL
Facts:
amazon:Compilers dc:creator ex:MSLam.
amazon:MSLam foaf:made amazon:SystArrayOptCompilers.
amazon:MSLam owl:sameAs ex:MSLam.
foaf:made rdfs:domain foaf:Person.
dc:creator owl:inverseOf foaf:made.
Rules:
?o ?p2 ?s. :- ?p1 owl:inverseOf ?p2. ?s ?p1 ?o.
?s rdf:type ?c. :- ?p1 rdfs:domain ?c. ?s ?p1 ?o.
?p2 owl:inverseOf ?p1. :- ?p1 owl:inverseOf ?p2.
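As an illustration, the three rules on slide 12 can be run to a fixpoint with naive forward chaining. This is only a minimal sketch: the tuple encoding of triples and the function names `apply_rules` and `closure` are assumptions made for the example, not part of the talk. Only the three rules shown are encoded, so owl:sameAs is not interpreted here.

```python
# Triples are (subject, predicate, object) tuples of prefixed names.
FACTS = {
    ("amazon:Compilers", "dc:creator", "ex:MSLam"),
    ("amazon:MSLam", "foaf:made", "amazon:SystArrayOptCompilers"),
    ("amazon:MSLam", "owl:sameAs", "ex:MSLam"),
    ("foaf:made", "rdfs:domain", "foaf:Person"),
    ("dc:creator", "owl:inverseOf", "foaf:made"),
}

def apply_rules(facts):
    """Apply the three rules once and return every triple they derive."""
    derived = set()
    for (p1, rel, x) in facts:
        if rel == "owl:inverseOf":
            # ?p2 owl:inverseOf ?p1. :- ?p1 owl:inverseOf ?p2.
            derived.add((x, "owl:inverseOf", p1))
            # ?o ?p2 ?s. :- ?p1 owl:inverseOf ?p2. ?s ?p1 ?o.
            derived |= {(o, x, s) for (s, p, o) in facts if p == p1}
        elif rel == "rdfs:domain":
            # ?s rdf:type ?c. :- ?p1 rdfs:domain ?c. ?s ?p1 ?o.
            derived |= {(s, "rdf:type", x) for (s, p, o) in facts if p == p1}
    return derived

def closure(facts):
    """Naive forward chaining: repeat until no new triple appears."""
    facts = set(facts)
    while True:
        new = apply_rules(facts) - facts
        if not new:
            return facts
        facts |= new

inferred = closure(FACTS) - FACTS
for triple in sorted(inferred):
    print(" ".join(triple) + ".")
```

Note that both the terminological statements (owl:inverseOf, rdfs:domain) and the instance data live in the same fact set, which is exactly the "just facts" view of the Datalog approach.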

13 Web Reasoning: Rule Based Approach
Why do we focus on the Datalog approach?
- Massive A-Box/fact base
- Popular Web ontologies (T-Box) are fairly small/inexpressive…
- Forward chaining allows storing/indexing implicit answers for quick retrieval…
Hope:
- Datalog/rules scale well for instance retrieval
- OWL2RL is enough for most Web ontologies
… but how feasible is that?

14 Web Reasoning: Observations & Challenges
Scalability:
- Massive A-Box: tens of billions of statements (for the moment)
- Near-linear scale required
Noisy data:
- Inconsistencies galore
- Noisy data
- “Ontology hijacking”

15 (Accidental) Inconsistencies…
FOAF ontology:
foaf:Person owl:disjointWith foaf:Organization.
foaf:homepage rdf:type owl:InverseFunctionalProperty.
Rules:
?s1 owl:differentFrom ?s2. :- ?s1 rdf:type ?c1. ?s2 rdf:type ?c2. ?c1 owl:disjointWith ?c2.
?s1 owl:sameAs ?s2. :- ?s1 ?p ?o. ?s2 ?p ?o. ?p rdf:type owl:InverseFunctionalProperty.
ERROR :- ?x owl:sameAs ?y. ?x owl:differentFrom ?y.
Source1 (faulty):
TimBernersLee foaf:homepage […]
TimBernersLee rdf:type foaf:Person.
W3.org:
W3C foaf:homepage […]
W3C rdf:type foaf:Organization.

16 Noisy Data
foaf:mbox_sha1sum a owl:InverseFunctionalProperty.
?x foaf:mbox_sha1sum 08445a31a78661b5c746feff39a9db6e4e2cc5cf.
?s1 owl:sameAs ?s2. :- ?s1 ?p ?o. ?s2 ?p ?o. ?p rdf:type owl:InverseFunctionalProperty.
Every ?s1/?s2 pair of bindings in the body yields an inferred owl:sameAs statement, pair-wise and reflexive.
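To see why one widely shared value of an inverse-functional property is so costly, the rule can be applied to a synthetic dataset. The sketch below is illustrative only: the subjects `ex:person0` … `ex:person99` and the function name `same_as_from_ifp` are made up for the example.

```python
from itertools import product

def same_as_from_ifp(triples, ifp):
    """?s1 owl:sameAs ?s2. :- ?s1 ?p ?o. ?s2 ?p ?o.
    with ?p fixed to one inverse-functional property."""
    subjects_by_value = {}
    for s, p, o in triples:
        if p == ifp:
            subjects_by_value.setdefault(o, set()).add(s)
    # Every ordered pair of subjects sharing a value is inferred,
    # including the reflexive pairs (s, s).
    return {
        (s1, "owl:sameAs", s2)
        for subjects in subjects_by_value.values()
        for s1, s2 in product(subjects, repeat=2)
    }

SHARED = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"
# Hypothetical data: 100 distinct subjects that all carry the same hash.
data = [(f"ex:person{i}", "foaf:mbox_sha1sum", SHARED) for i in range(100)]

pairs = same_as_from_ifp(data, "foaf:mbox_sha1sum")
print(len(pairs))  # 100 subjects -> 100 * 100 = 10000 statements
```

The output grows quadratically in the number of subjects sharing the value, so a single noisy value can dominate the inferred data.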

17 “Ontology Hijacking”
More Noise: From type Type of resource
Ontology hijacking: a non-authoritative source trying to redefine existing properties & classes.
rdf:type rdfs:domain eiao:testRun.
rdf:type rdfs:domain eiao:pageSurvey.
rdf:type rdfs:domain eiao:siteSurvey.
rdf:type rdfs:domain eiao:Scenario.
rdf:type rdfs:domain eiao:rangeLocation.
rdf:type rdfs:domain eiao:startPointer.
rdf:type rdfs:domain eiao:endPointer.
rdf:type rdfs:domain eiao:header.
rdf:type rdfs:domain eiao:runs.

18 “Ontology Hijacking”
OWL 2 RL domain rule: ?s rdf:type ?c. :- ?p1 rdfs:domain ?c. ?s ?p1 ?o.
Adds 9 2 x |N| triples, where N is the set of “normal” rdf:type triples in the data!
rdf:type rdfs:domain eiao:testRun.
rdf:type rdfs:domain eiao:pageSurvey.
rdf:type rdfs:domain eiao:siteSurvey.
rdf:type rdfs:domain eiao:Scenario.
rdf:type rdfs:domain eiao:rangeLocation.
rdf:type rdfs:domain eiao:startPointer.
rdf:type rdfs:domain eiao:endPointer.
rdf:type rdfs:domain eiao:header.
rdf:type rdfs:domain eiao:runs.

19 SAOR: Scalable Authoritative OWL Reasoner
No systems available that can deal with that …
Goals:
Scalability
- Separate T-Box data, kept in memory
- Reduced output
- Incomplete reasoning!
Web tolerance
- Consider authority of T-Box statements
- Incomplete reasoning!

20 Scalable Reasoning: In-mem T-Box
Main optimisation: store the T-Box in memory.
- By far the most commonly accessed segment of data for reasoning
- Quite small (1-2%)
E.g. from a 100M statement Web crawl:
- A-Box: 3,753,791 × ?s foaf:name ?o.
vs.
- T-Box: <20 × foaf:name ?p ?o. + ?s ?p foaf:name.

21 Scalable Reasoning: Scans
Scan 1: scan input data, separate T-Box statements, load T-Box statements into memory.
Scan 2: scan all on-disk data, join with the in-memory T-Box.
With the in-memory T-Box, we avoid A-Box joins for many (*not all*) rules.
- A-Box joins are too expensive on large volumes of data
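The two scans can be sketched as follows for rules with a single A-Box pattern (subproperty, subclass and range). This is a rough in-memory illustration, not the SAOR implementation: the function names, the `seen` set used for de-duplication, and the small FOAF-style T-Box in the usage example are assumptions.

```python
TBOX_PREDICATES = {"rdfs:subPropertyOf", "rdfs:subClassOf", "rdfs:range"}

def scan1(statements):
    """Scan 1: pick out T-Box statements and index them in memory."""
    tbox = {pred: {} for pred in TBOX_PREDICATES}
    for s, p, o in statements:
        if p in TBOX_PREDICATES:
            tbox[p].setdefault(s, set()).add(o)
    return tbox

def infer(triple, tbox):
    """Join ONE A-Box triple with the T-Box; no other A-Box triple is needed."""
    s, p, o = triple
    out = set()
    for p2 in tbox["rdfs:subPropertyOf"].get(p, ()):
        out.add((s, p2, o))
    for c in tbox["rdfs:range"].get(p, ()):
        out.add((o, "rdf:type", c))
    if p == "rdf:type":
        for c2 in tbox["rdfs:subClassOf"].get(o, ()):
            out.add((s, "rdf:type", c2))
    return out

def scan2(statements, tbox):
    """Scan 2: stream the data once, feeding each inference back in."""
    seen, output = set(), []   # 'seen' guards against duplicates and cycles
    for triple in statements:
        todo = [triple]
        while todo:
            current = todo.pop()
            for new in infer(current, tbox) - seen:
                seen.add(new)
                output.append(new)
                todo.append(new)
    return output

# Usage with an assumed T-Box and one A-Box statement.
data = [
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:page"),
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:isPrimaryTopicOf"),
    ("foaf:homepage", "rdfs:range", "foaf:Document"),
    ("foaf:Document", "rdfs:subClassOf", "wordnet:Document"),
    ("ex:me", "foaf:homepage", "ex:home"),
]
result = scan2(data, scan1(data))
```

Because each inference depends on one A-Box triple plus the in-memory index, the second scan is a plain streaming pass over the data.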

22 Scalable Reasoning: No A-Box Joins
Layout: in-memory T-Box, on-disk A-Box, on-disk output.
On-disk A-Box (input):
ex:me foaf:homepage ex:home. ...
On-disk output:
ex:me foaf:page ex:home.
ex:me foaf:isPrimaryTopicOf ex:home.
ex:home rdf:type foaf:Document.
ex:home rdf:type wordnet:Document. ...
Execution of three rules:
OWL 2 RL rule prp-spo1: ?x ?p2 ?y. :- ?p1 rdfs:subPropertyOf ?p2. ?x ?p1 ?y.
OWL 2 RL rule cax-sco: ?x rdf:type ?c2. :- ?c1 rdfs:subClassOf ?c2. ?x rdf:type ?c1.
OWL 2 RL rule prp-rng: ?y rdf:type ?c. :- ?p rdfs:range ?c. ?x ?p ?y.

23 Scalable Reasoning: Joins
We focus on the rules that don’t need A-Box joins:
- [48 rules/76 OWL2RL rules]
- Covers e.g. all of RDF Schema!
- This fragment can easily be distributed!
However, some rules do require A-Box joins, e.g.
?x owl:sameAs ?z. :- ?x owl:sameAs ?y. ?y owl:sameAs ?z.
- Handled with backward chaining (storing pivot element lists).
?x1 owl:sameAs ?x2. :- ?p a owl:InverseFunctionalProperty. ?x1 ?p ?o. ?x2 ?p ?o.
- Currently ignored (see examples above); we are working on a statistical approach for inverse-functional properties.
No A-Box joins for SAOR reasoning over >1B statements as deployed in SWSE. We ran experiments on a smaller dataset with full OWL2RL:
- using in-memory transitivity indexes and semi-naïve evaluation for transitive properties (not that many)
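For readers unfamiliar with semi-naïve evaluation, here is a small generic sketch for a single transitive relation. It is a textbook formulation under assumed names (`transitive_closure`, `succ`, `pred`), not the index structure used in the experiments. Each round joins only the newly derived pairs (the delta) against everything known so far.

```python
def transitive_closure(pairs):
    """Semi-naive evaluation of  r(x, z) :- r(x, y), r(y, z)."""
    total = set(pairs)
    delta = set(pairs)
    while delta:
        # In-memory indexes over everything derived so far.
        succ, pred = {}, {}
        for x, y in total:
            succ.setdefault(x, set()).add(y)
            pred.setdefault(y, set()).add(x)
        new = set()
        for x, y in delta:
            for z in succ.get(y, ()):   # delta pair in the first body atom
                new.add((x, z))
            for w in pred.get(x, ()):   # delta pair in the second body atom
                new.add((w, y))
        delta = new - total             # keep only genuinely new pairs
        total |= delta
    return total

chain = {("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:c", "ex:d")}
closed = transitive_closure(chain)
```

Restricting each round to the delta avoids re-deriving pairs that earlier rounds already produced, which is the point of the technique on large inputs.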

24 Web Tolerance: Authoritative Reasoning
We check authority (on the T-Box statements only) before making inferences!
Document D is authoritative for class/property X iff:
- X is not identified by a URI, OR
- the de-referenced URI of X coincides with or redirects to D
Borrowing from DL the idea of separating T-Box and A-Box, we enable authority checking by so-called split-rules:
- Split-rule: antecedent divided into T-Box and A-Box statements.
- Split-rule application: at least one of the A-Box/T-Box join variables needs to be spoken about authoritatively for the rule to fire.
Example: ?s rdf:type ?d. :- ?c rdfs:subClassOf ?d. ?s rdf:type ?c.

25 Web Tolerance: Authoritative Reasoning
Example rule: ?s rdf:type ?d. :- ?c rdfs:subClassOf ?d. ?s rdf:type ?c.
- FOAF ontology is authoritative for foaf:Person ✓
- MY spec is not authoritative for foaf:Person ✘
Only allow extension in authoritative documents:
- my:Person rdfs:subClassOf foaf:Person. (MY spec) ✓
BUT: reduce obscure memberships:
- foaf:Person rdfs:subClassOf my:Person. (MY spec) ✘
ALSO: protect specifications:
- foaf:mbox rdf:type owl:SymmetricProperty. (MY spec) ✘
Similarly for other rules.
The in-memory T-Box only stores statements that are authoritative for rule execution.
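The authority test and its effect on the subclass rule can be sketched as below. Everything concrete here is an assumption for illustration: the `REDIRECTS` map stands in for real HTTP lookups, the document URIs are hypothetical, and the function names are invented. For the subclass rule, the variable joining the T-Box and A-Box patterns is ?c, so the serving document must be authoritative for the subclass.

```python
def dereference(uri, redirects):
    """Strip the fragment, then follow redirects from an (assumed) lookup table."""
    doc = uri.split("#", 1)[0]
    visited = set()
    while doc in redirects and doc not in visited:
        visited.add(doc)
        doc = redirects[doc]
    return doc

def is_authoritative(document, term, redirects):
    """D is authoritative for X iff X is not a URI (here: a blank node)
    or the de-referenced URI of X coincides with or redirects to D."""
    if term.startswith("_:"):
        return True
    return dereference(term, redirects) == document

def subclass_axiom_accepted(document, sub, sup, redirects):
    """Split-rule check for  ?c rdfs:subClassOf ?d : the join variable is ?c."""
    return is_authoritative(document, sub, redirects)

# Hypothetical documents and redirect behaviour.
FOAF_DOC = "http://xmlns.com/foaf/spec/"
MY_DOC = "http://example.org/my"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
MY_PERSON = "http://example.org/my#Person"
REDIRECTS = {FOAF_PERSON: FOAF_DOC}

# my:Person rdfs:subClassOf foaf:Person, published in MY spec: accepted.
ok = subclass_axiom_accepted(MY_DOC, MY_PERSON, FOAF_PERSON, REDIRECTS)
# foaf:Person rdfs:subClassOf my:Person, published in MY spec: rejected.
hijack = subclass_axiom_accepted(MY_DOC, FOAF_PERSON, MY_PERSON, REDIRECTS)
```

Only axioms that pass this check would be loaded into the in-memory T-Box, so non-authoritative statements never take part in rule execution.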

26 Runtime…
No A-Box joins + authoritative split-rule application:
- Linear scale for most rules
- Single machine: 1.1bn in => bn out, <10 hours
- Can be parallelized! [Weaver, Hendler 2009], [Urbani et al. 2009] => 113 minutes
With A-Box joins…
- … only scale up to ~100M statements so far

27 We would, if we could…
Use ranking of statements [Harth et al. ISWC2009] to rank inferences.
Ongoing work with Piero Bonatti:
- Rank inferences (by aggregation): s p o : f(v1, …, vn) :- t1:v1 … tn:vn
- Based on annotated programs [Kifer & Subrahmanian, JLP, 1992]
Main difficulty:
- Many possible inferences for the same statement; aggregation prevents the cheap file-scans we currently rely on.

28 Summary (So, why should you care?)
We need to care about scale:
- Throw away what we don’t need. Our choices are motivated empirically:
  - T-Box separation + file-scans
  - Split-rules notion + authoritativeness keep the “noise explosion” low
  … but applicable in similar domains?
Admittedly: rather a restriction of Datalog 1.0 to scale with rules of a certain shape.
But also: more Datalog on the Semantic Web horizon!
- W3C RIF: Web standard for rule exchange … RIF safe Core = safe Datalog with built-ins
- W3C SPARQL 1.0 is translatable to Datalog with stratified negation [Polleres 2007, Angles and Gutierrez 2008, Ianni et al. 2009]; the additional features of SPARQL 1.1 are well investigated in Datalog!
- Annotations/rank potentially boost accuracy of query results; other annotation domains: time, provenance, etc.

29 Le Fin…
Techniques used in running search engines…

30 Ok, here it is…
[Figure labels: 2RL, Core]

31 Evaluation: Authoritative Reasoning
Class ANAnn * An * NA
rss:item M0908M
foaf:Person M14.5M937M
foaf:Document M 531M
wordnet:Person M0258M
foaf:chatEvent 001.1M00
Total 71,3368.7M16M2.6B!!
Property
dc:title M01.1B!!
foaf:name M18.8M2.5B!!
dc:date M02.3B!!
foaf:nick M02B!!
dc:description 06313M01.9B!!
Total 52, M18.8M9.8B!!

33 Scalable Reasoning: Joins
However: some rules do require A-Box joins. We employ on-disk hashtables.
Layout: in-memory T-Box, on-disk A-Box, on-disk hashtable, on-disk output.
On-disk A-Box (input):
ex:me foaf:homepage ex:home. ...
ex:moi foaf:homepage ex:home. ...
On-disk output:
ex:me owl:sameAs ex:moi. ...

34 Scalable Reasoning: Equality
Use canonical “pivot” identifiers.
During Scan 2:
- Maintain an on-disk hashtable with equality chains
- Re-write G2 hashtable keys to reflect new equivalences
Scan 3: scan input and inferred data, re-write according to the owl:sameAs closure.
⇒ ex:home owl:sameAs ex2:home. ex:me owl:sameAs ex2:me.
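One common way to maintain canonical identifiers is a union-find structure, sketched here in memory. This is an assumed stand-in for the on-disk equality chains described on the slide: the class name `Equalities`, the rule of choosing the lexicographically smallest identifier as pivot, and the identifier `ex2:me` in the usage data are all illustrative choices.

```python
class Equalities:
    """Union-find over identifiers; each equivalence class has one pivot."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the pivot of x, compressing the chain on the way."""
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Record a owl:sameAs b; the smaller identifier becomes the pivot."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            pivot, other = sorted((ra, rb))
            self.parent[other] = pivot

def rewrite(triples, eq):
    """Final pass: replace subjects and objects by their pivot identifiers."""
    return {(eq.find(s), p, eq.find(o)) for s, p, o in triples}

eq = Equalities()
eq.union("ex:me", "ex:moi")
eq.union("ex:moi", "ex2:me")

triples = {
    ("ex:me", "foaf:homepage", "ex:home"),
    ("ex:moi", "foaf:homepage", "ex:home"),
}
canonical = rewrite(triples, eq)
```

After rewriting, statements about equivalent identifiers collapse onto one pivot, so the full pair-wise owl:sameAs closure never needs to be materialised.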

35 Rules Overview
G0: 1 rule: only T-Box in antecedent (no A-Box)
G1: 17 rules: at least one T-Box statement, only one A-Box statement in antecedent (no A-Box joins)
G2: 7 rules: at least one T-Box statement, multiple A-Box statements in antecedent (A-Box joins)
G3: 4 rules: only A-Box in antecedent (no T-Box)
ANTECEDENT ⇒ CONSEQUENT
?P owl:inverseOf ?Q. ?s ?P ?o. ⇒ ?o ?Q ?s. (≥1 TBOX, 1 ABOX ⇒ ABOX)
?P a :TransitiveProperty. ?x ?P ?y. ?y ?P ?z. ⇒ ?x ?P ?z. (≥1 TBOX, >1 ABOX ⇒ ABOX)
?x :sameAs ?y. ?x ?P ?o. ⇒ ?y ?P ?o. (0 TBOX, >1 ABOX ⇒ ABOX)
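The grouping depends only on how many T-Box and A-Box patterns a rule's antecedent contains, which a few lines can make concrete. This sketch assumes a particular (incomplete) list of terminological predicates and classes, and the helper names `is_tbox` and `group` are invented for the example.

```python
# Assumed, non-exhaustive vocabulary that marks a pattern as T-Box.
TBOX_PREDICATES = {
    "owl:inverseOf", "rdfs:subClassOf", "rdfs:subPropertyOf",
    "rdfs:domain", "rdfs:range",
}
TBOX_CLASSES = {
    "owl:TransitiveProperty", "owl:InverseFunctionalProperty",
    "owl:SymmetricProperty",
}

def is_tbox(pattern):
    """A pattern is terminological if its predicate or rdf:type object is."""
    s, p, o = pattern
    return p in TBOX_PREDICATES or (p == "rdf:type" and o in TBOX_CLASSES)

def group(antecedent):
    """Assign a rule to G0..G3 from the shape of its antecedent."""
    tbox = sum(1 for pattern in antecedent if is_tbox(pattern))
    abox = len(antecedent) - tbox
    if abox == 0:
        return "G0"   # only T-Box patterns
    if tbox == 0:
        return "G3"   # only A-Box patterns
    return "G1" if abox == 1 else "G2"

inverse = [("?P", "owl:inverseOf", "?Q"), ("?s", "?P", "?o")]
transitive = [("?P", "rdf:type", "owl:TransitiveProperty"),
              ("?x", "?P", "?y"), ("?y", "?P", "?z")]
same_as = [("?x", "owl:sameAs", "?y"), ("?x", "?P", "?o")]
inverse_symmetry = [("?P", "owl:inverseOf", "?Q")]

groups = [group(r) for r in (inverse_symmetry, inverse, transitive, same_as)]
print(groups)
```

Rules in G0 and G1 can be evaluated with file scans against the in-memory T-Box, while G2 and G3 are the ones that need A-Box joins.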

36 Evaluation: Scalable Reasoning
- G0, G1: 142M out, <1 hr
- G0, G1, G2, G3: 151M out, ~16 hr
- G0, G1, G2, G3: on-disk hashtables begin to struggle

37 Problem: Synonymous Duplicate Answers
Data:
amazon:MSLam foaf:made amazon:Compilers.
amazon:MSLam foaf:made dblp:AhoLamSethiUllman.
Query: Give me books written by Monica Lam?
amazon:MSLam foaf:made ?Book.

38 Web Reasoning: Forward Chaining
Forward-chaining materialisation:
- Avoid the runtime expense of backward chaining
  - Users taught impatience by Google
- Pre-compute & index answers for quick retrieval
- Web-scale systems should be scalable!
  - More data = more disk space

