Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.


1 Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of Aalborg, Denmark 3 University of Passau, Germany

2 Outline of this Tutorial Part I –RDF in Centralized Relational Databases Part II –RDF in Distributed Settings Part III –Managing Uncertain RDF Data

3 Outline for Part I Part I.1: Foundations –Introduction to RDF and Linked Open Data –A short overview of SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

4 bornOn(Jeff, 09/22/42) gradFrom(Jeff, Columbia) hasAdvisor(Jeff, Arthur) hasAdvisor(Surajit, Jeff) knownFor(Jeff, Theory) Information Extraction YAGO/DBpedia et al. >120 M facts for YAGO2 (mostly from Wikipedia infoboxes & categories)

5 YAGO2 Knowledge Base Entity Max_Planck Apr 23, 1858 Person City Country subclass Location subclass instanceOf subclass bornOn Max Planck means subclass Oct 4, 1947 diedOn Kiel bornIn Nobel Prize Erwin_Planck FatherOf hasWon Scientist means Max Karl Ernst Ludwig Planck Physicist instanceOf subclass Biologist subclass Germany Politician Angela Merkel Schleswig-Holstein State Angela Dorothea Merkel Oct 23, 1944 diedOn Organization subclass Max_Planck Society instanceOf means instanceOf subclass means Angela Merkel means citizenOf instanceOf locatedIn subclass. 3 M entities, 120 M facts, 100 relations, 200k classes; accuracy 95%

6 Why care about scalability? Rapid growth of available semantic data Sources: linkeddata.org wikipedia.org

7 Why care about scalability? Rapid growth of available semantic data. More than 30 billion triples in more than 200 sources across the LOD cloud. DBpedia: 3.4 million entities, 1 billion triples. As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase. Sources: linkeddata.org, wikipedia.org

8 … and still growing Billion triple challenge 2008: 1B triples Billion triple challenge 2010: 3B triples Billion triple challenge 2011: 2B triples | War stories from –BigOWLIM: 12B triples in Jun 2009 –Garlik 4Store: 15B triples in Oct 2009 –OpenLink Virtuoso: 15.4B+ triples –AllegroGraph: 1+ Trillion triples

9 Queries can be complex, too SELECT DISTINCT ?a ?b ?lat ?long WHERE { ?a dbpedia:spouse ?b. ?a dbpedia:wikilink dbpediares:actor. ?b dbpedia:wikilink dbpediares:actor. ?a dbpedia:placeOfBirth ?c. ?b dbpedia:placeOfBirth ?c. ?c owl:sameAs ?c2. ?c2 pos:lat ?lat. ?c2 pos:long ?long. } Q7 on BTC2008 in [Neumann & Weikum, 2009]

10 What effects does the financial crisis have on migration rates in the US?

11 Is there a significant increase of serious weather conditions in Europe over the past 20 years?

12 Which glutamic-acid proteases are inhibitors of HIV?

13 Question Answering (QA) Systems KB of curated, structured data 10 trillion (!) facts, 50k algorithms KB from Wikipedia and user edits 600 million facts, 25 million entities

14 IBM Watson: Deep Question Answering 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain This town is known as "Sin City" & its downtown is "Glitter Gulch" William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel As of 2010, this is the only former Yugoslav republic in the EU YAGO knowledge back-ends question classification & decomposition D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.

15 SPARQL 1.0 / 1.1
Query language for RDF suggested by the W3C.
3 ways to interpret RDF data:
– Instances of logical predicates (facts)
– Graphs (subjects/objects as nodes, predicates as directed and labeled edges)
– Relations (either multiple binary relations or a single, large ternary relation)
SPARQL main building block:
– select-project-join combination of relational triple patterns, equivalent to graph isomorphism queries over a potentially very large RDF graph

16 SPARQL – Example Example query: Find all actors from Ontario (that are in the knowledge base) vegetarian Albert_Einstein physicist Jim_Carrey actor Ontario Canada Ulm Germany scientist chemist Otto_Hahn Frankfurt Mike_Myers NewmarketScarborough Europe isA bornIn locatedIn isA

17 SPARQL – Example Example query: Find all actors from Ontario (that are in the knowledge base) vegetarian Albert_Einstein physicist Jim_Carrey actor Ontario Canada Ulm Germany scientist chemist Otto_Hahn Frankfurt Mike_Myers NewmarketScarborough Europe isA bornIn locatedIn isA actor Ontario ?person ?loc bornIn locatedIn isA Find subgraphs of this form: variables constants SELECT ?person WHERE { ?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario. }
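The subgraph-matching view of this query can be sketched in a few lines of Python. This is only an illustration: the edge list is an assumption reconstructed from the node and edge labels of the slide's figure, and `match` and `evaluate` are hypothetical helper names, not part of any SPARQL engine.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Assumed edges: the figure itself is lost, only its labels survive.
GRAPH: List[Triple] = [
    ("Jim_Carrey", "isA", "actor"),
    ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Newmarket", "locatedIn", "Ontario"),
    ("Mike_Myers", "isA", "actor"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Scarborough", "locatedIn", "Ontario"),
    ("Ontario", "locatedIn", "Canada"),
    ("Albert_Einstein", "isA", "physicist"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Ulm", "locatedIn", "Germany"),
]


def match(pattern: Triple, triple: Triple,
          binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):              # variable
            if new.setdefault(term, value) != value:
                return None                   # conflicts with earlier binding
        elif term != value:                   # constant must match exactly
            return None
    return new


def evaluate(patterns: List[Triple], graph: List[Triple]) -> List[Dict[str, str]]:
    """Naive nested-loop evaluation of a conjunction of triple patterns."""
    bindings: List[Dict[str, str]] = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in graph:
                b2 = match(pattern, t, b)
                if b2 is not None:
                    extended.append(b2)
        bindings = extended
    return bindings


QUERY = [
    ("?person", "isA", "actor"),
    ("?person", "bornIn", "?loc"),
    ("?loc", "locatedIn", "Ontario"),
]
actors = sorted(b["?person"] for b in evaluate(QUERY, GRAPH))
```

With the assumed edges, `actors` contains Jim_Carrey and Mike_Myers. A real engine would use indexes and join ordering instead of scanning the whole graph for every pattern.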

18 SPARQL 1.0 – More Features
– Eliminate duplicates in results: SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn ?c}
– Return results in some order, with optional LIMIT n clause: SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario} ORDER BY DESC(?person)
– Optional matches and filters on bound variables: SELECT ?person WHERE {?person isA actor. OPTIONAL{?person bornIn ?loc}. FILTER (!BOUND(?loc))}
– More operators: ASK, DESCRIBE, CONSTRUCT
See:

19 SPARQL 1.1 Extensions of the W3C
W3C SPARQL 1.1:
– Aggregations (COUNT, AVG, …) and grouping
– Subqueries in the WHERE clause
– Safe negation: FILTER NOT EXISTS {?x …}, essentially syntactic sugar for OPTIONAL {?x …} FILTER(!BOUND(?x))
– Expressions in the SELECT clause: SELECT (?a+?b AS ?sum)
– Property paths (label constraints on paths): ?x foaf:knows/foaf:knows/foaf:name ?name
– More functions and operators …

20 RDF+SPARQL: Centralized Engines
– BigOWLIM (now ontotext.com)
– OpenLink Virtuoso
– OntoBroker (now semafora-systems.com)
– Apache Jena (different main-memory/relational backends)
– Sesame (now openRDF.org)
– SW-Store, Hexastore, 3Store, RDF-3X (no reasoning)
System deployments with >10^11 triples ( see )

21 SPARQL: Extensions from Research (1)
More complex graph patterns:
– Transitive paths [Anyanwu et al., WWW07]: SELECT ?p, ?c WHERE { ?p isA scientist. ?p ??r ?c. ?c isA Country. ?c locatedIn Europe. PathFilter(cost(??r) < 5). PathFilter(containsAny(??r,?t)). ?t isA City. }
– Regular expressions [Kasneci et al., ICDE08]: SELECT ?p, ?c WHERE { ?p isA ?s. ?s isA scientist. ?p (bornIn | livesIn | citizenOf) locatedIn* Europe.}
Meanwhile mostly covered by the SPARQL 1.1 query proposal.

22 SPARQL: Extensions from Research (2) Queries over federated RDF sources: Determine distribution of triple patterns as part of query (for example in Jena ARQ). Automatically route triple predicates to useful sources

23 SPARQL: Extensions from Research (2) Queries over federated RDF sources: Determine distribution of triple patterns as part of query (for example in Jena ARQ). Automatically route triple predicates to useful sources. Potentially requires mapping of identifiers from different sources. SPARQL 1.1 explicitly supports federation of sources -federated-query/

24 Ranking is Essential! Queries often have a huge number of results: –scientists from Canada –publications in databases –actors from the U.S. Queries may have no matches at all: –Laboratoire d'informatique de Paris 6 –most beautiful railway stations Ranking is an integral part of search Huge number of app-specific ranking methods: paper/citation count, impact, salary, … Need for generic ranking of 1) entities and 2) facts

25 Extending Entities with Keywords Remember: entities occur in facts & in documents Associate entities with terms in those documents, keywords in URIs, literals, … (context of entity) chancellor Germany scientist election Stuttgart21 Guido Westerwelle France Nicolas Sarkozy

26 Extensions: Keywords
Problem: not everything is triplified!
– Consider witnesses/sources (provenance meta-facts)
– Allow text predicates with each triple pattern (à la XQ-FT)
Example: European composers who have won the Oscar, whose music appeared in dramatic western scenes, and who also wrote classical pieces?
Select ?p Where { ?p instanceOf Composer. ?p bornIn ?t. ?t inCountry ?c. ?c locatedIn Europe. ?p hasWon ?a. ?a Name AcademyAward. ?p contributedTo ?movie [western, gunfight, duel, sunset]. ?p composed ?music [classical, orchestra, cantata, opera]. }
Semantics: triples match structural predicates; witnesses match text predicates.

27 Extensions: Keywords
Problem: not everything is triplified!
– Consider witnesses/sources (provenance meta-facts)
– Allow text predicates with each triple pattern (à la XQ-FT)
Proximity of keywords or phrases boosts expressiveness.
French politicians married to Italian singers?
Select ?p1, ?p2 Where { ?p1 instanceOf ?c1 [France, politics]. ?p2 instanceOf ?c2 [Italy, singer]. ?p1 marriedTo ?p2. }
CS researchers whose advisors worked on the Manhattan project?
Select ?r, ?a Where { ?r instOf researcher [computer science]. ?a workedOn ?x [Manhattan project]. ?r hasAdvisor ?a. }
Select ?r, ?a Where { ?r ?p1 ?o1 [computer science]. ?a ?p2 ?o2 [Manhattan project]. ?r ?p3 ?a. }

28 Extensions: Keywords CLEF/INEX Linked Data Track Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor? German politicians successor other stepped down before actual term name ancestor SELECT ?s ?s1 WHERE { ?s rdf:type. ?s1 ?s. FILTER FTContains (?s, "stepped down early"). } https://inex.mmci.uni-saarland.de/tracks/lod/ Problem: not everything is triplified!

29 Extensions: Keywords / Multiple Languages Which river does the Brooklyn Bridge cross? Welchen Fluss überspannt die Brooklyn Bridge? ¿Por qué río cruza la Brooklyn Bridge? Quale fiume attraversa il ponte di Brooklyn? Quelle cours d'eau est traversé par le pont de Brooklyn? Welke rivier overspant de Brooklyn Bridge? river, cross, Brooklyn Bridge Fluss, überspannen, Brooklyn Bridge río, cruza, Brooklyn Bridge fiume, attraversare, ponte di Brooklyn cours d'eau, pont de Brooklyn rivier, Brooklyn Bridge, overspant PREFIX dbo: PREFIX res: SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri. } bielefeld.de/~cunger/qald/ Multilingual Question Answering over Linked Data (QALD-3), CLEF Problem: not everything is triplified!

30 What Makes a Fact Good? Confidence: Prefer results that are likely correct accuracy of info extraction trust in sources (authenticity, authority) bornIn (Jim Gray, San Francisco) from Jim Gray was born in San Francisco (en.wikipedia.org) livesIn (Michael Jackson, Tibet) from Fans believe Jacko hides in Tibet (www.michaeljacksonsightings.com) Informativeness: Prefer results with salient facts Statistical estimation from: frequency in answer frequency on Web frequency in query log q: Einstein isa ? Einstein isa scientist Einstein isa vegetarian q: ?x isa vegetarian Einstein isa vegetarian Whocares isa vegetarian Conciseness: Prefer results that are tightly connected size of answer graph weight of Steiner tree Einstein won NobelPrize Bohr won NobelPrize Einstein isa vegetarian Cruise isa vegetarian Cruise born 1962 Bohr died 1962 Diversity: Prefer variety of facts E won … E discovered … E played … E won … E won …

31 How Can We Implement This?
Confidence: Prefer results that are likely correct (accuracy of info extraction; trust in sources: authenticity, authority)
– Empirical accuracy of Information Extraction
– PageRank-style estimate of trust
– combine into: max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }
Informativeness: Prefer results with salient facts (statistical estimation from: frequency in answer, frequency on Web, frequency in query log)
– PageRank-style entity/fact ranking [V. Hristidis et al., S.Chakrabarti, …]
– or IR models: tf*idf … [K.Chang et al., …], Statistical Language Models [de Rijke et al.]
Conciseness: Prefer results that are tightly connected (size of answer graph, weight of Steiner tree)
– Graph algorithms (BANKS, STAR, …) [S.Chakrabarti et al., G.Kasneci et al., …]
Diversity: Prefer variety of facts
– Statistical Language Models [Zhai et al., Elbassuoni et al.]
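The confidence formula on this slide, max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }, can be written down directly. The sketch below is a minimal illustration; the source names and all scores are made-up placeholder values, and `confidence` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

Fact = Tuple[str, str, str]


def confidence(fact: Fact,
               witnesses: Dict[Fact, List[str]],
               accuracy: Dict[Tuple[Fact, str], float],
               trust: Dict[str, float]) -> float:
    """Best (extraction accuracy * source trust) over all witnesses of a fact.

    A fact without any witness gets confidence 0.0 (an assumption of this
    sketch, the slide does not say what happens in that case).
    """
    return max(
        (accuracy[(fact, s)] * trust[s] for s in witnesses.get(fact, [])),
        default=0.0,
    )


# Placeholder inputs, chosen only to make the arithmetic easy to follow.
f1: Fact = ("Jim_Gray", "bornIn", "San_Francisco")
f2: Fact = ("Michael_Jackson", "livesIn", "Tibet")
WITNESSES = {f1: ["source_a", "source_b"], f2: ["source_b"]}
ACCURACY = {(f1, "source_a"): 0.75, (f1, "source_b"): 0.5, (f2, "source_b"): 0.5}
TRUST = {"source_a": 1.0, "source_b": 0.25}

c1 = confidence(f1, WITNESSES, ACCURACY, TRUST)   # max(0.75 * 1.0, 0.5 * 0.25)
c2 = confidence(f2, WITNESSES, ACCURACY, TRUST)   # 0.5 * 0.25
```

Taking the maximum means one highly trusted, accurately extracted witness is enough to make a fact rank high, regardless of how many weak witnesses it also has.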

32 Outline for Part I Part I.1: Foundations –Introduction to RDF –A short overview of SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

33 RDF in Rowstores Rowstore: general relational database, storing relations (incl. facts) as complete rows (MySQL, PostgreSQL, Oracle, DB2, SQLServer, …) General principles: –store triples in one giant three-attribute table (subject, predicate, object) –convert SPARQL to equivalent SQL –The database will do the rest Strings often mapped to unique integer IDs Used by many TripleStores, including 3Store, Jena, HexaStore, RDF-3X, … Simple extension to quadruples (with graphid): (graph,subject,predicate,object) We consider only triples for simplicity!

34 Example: Single Triple Table
RDF data:
ex:Katja ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:TU_Ilmenau.
ex:Martin ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:Saarland_University.
ex:Ralf ex:teaches ex:Information_Retrieval; ex:PhD_from ex:Saarland_University; ex:works_for ex:Saarland_University, ex:MPI_Informatics.
Triple table:
subject | predicate | object
ex:Katja | ex:teaches | ex:Databases
ex:Katja | ex:works_for | ex:MPI_Informatics
ex:Katja | ex:PhD_from | ex:TU_Ilmenau
ex:Martin | ex:teaches | ex:Databases
ex:Martin | ex:works_for | ex:MPI_Informatics
ex:Martin | ex:PhD_from | ex:Saarland_University
ex:Ralf | ex:teaches | ex:Information_Retrieval
ex:Ralf | ex:PhD_from | ex:Saarland_University
ex:Ralf | ex:works_for | ex:Saarland_University
ex:Ralf | ex:works_for | ex:MPI_Informatics

35 Conversion of SPARQL to SQL General approach to translate SPARQL into SQL: (1) Each triple pattern is translated into a (self-) JOIN over the triple table (2) Shared variables create JOIN conditions (3) Constants create WHERE conditions (4) FILTER conditions create WHERE conditions (5) OPTIONAL clauses create OUTER JOINS (6) UNION clauses create UNION expressions
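Rules (1) to (3) can be sketched as a small translator for basic graph patterns, run here against an in-memory SQLite triple table filled with the example data from the single-triple-table slide. This is an illustrative sketch only: `bgp_to_sql` is a hypothetical helper, it ignores FILTER, OPTIONAL and UNION, and it inlines constants without escaping, which a real system must not do.

```python
import sqlite3
from typing import List, Tuple

Pattern = Tuple[str, str, str]


def bgp_to_sql(patterns: List[Pattern], select: List[str]) -> str:
    """Translate triple patterns into one SQL query over Triples(subject, predicate, object)."""
    first_use = {}            # variable -> first column reference it was bound to
    where = []
    for i, pattern in enumerate(patterns, start=1):
        for column, term in zip(("subject", "predicate", "object"), pattern):
            ref = f"P{i}.{column}"
            if term.startswith("?"):
                if term in first_use:
                    where.append(f"{ref} = {first_use[term]}")   # rule (2): shared variable
                else:
                    first_use[term] = ref
            else:
                where.append(f"{ref} = '{term}'")                # rule (3): constant
    tables = ", ".join(f"Triples P{i}" for i in range(1, len(patterns) + 1))  # rule (1)
    columns = ", ".join(f"{first_use[v]} AS {v[1:]}" for v in select)
    sql = f"SELECT {columns} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


DATA = [
    ("ex:Katja", "ex:teaches", "ex:Databases"),
    ("ex:Katja", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Katja", "ex:PhD_from", "ex:TU_Ilmenau"),
    ("ex:Martin", "ex:teaches", "ex:Databases"),
    ("ex:Martin", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Martin", "ex:PhD_from", "ex:Saarland_University"),
    ("ex:Ralf", "ex:teaches", "ex:Information_Retrieval"),
    ("ex:Ralf", "ex:PhD_from", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:MPI_Informatics"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", DATA)

sql = bgp_to_sql(
    [("?a", "ex:works_for", "?u"), ("?b", "ex:works_for", "?u"), ("?a", "ex:PhD_from", "?u")],
    select=["?a", "?b"],
)
rows = conn.execute(sql).fetchall()
```

On the example data only ex:Ralf works for the institution that granted his PhD, so `rows` holds the single pair (ex:Ralf, ex:Ralf).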

36 Example: Conversion to SQL Query
SPARQL query:
  SELECT ?a ?b ?t WHERE { ?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. OPTIONAL {?a teaches ?t} FILTER (regex(?u, "Saar")) }
Step 1, one copy of the triple table per triple pattern:
  SELECT … FROM Triples P1, Triples P2, Triples P3
Step 2, constants create WHERE conditions:
  WHERE P1.predicate='works_for' AND P2.predicate='works_for' AND P3.predicate='phd_from'
Step 3, shared variables create JOIN conditions:
  AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object
Step 4, the FILTER creates another WHERE condition:
  AND REGEXP_LIKE(P1.object, 'Saar')
Step 5, projection:
  SELECT P1.subject AS A, P2.subject AS B
Step 6, the OPTIONAL clause creates a LEFT OUTER JOIN; complete query:
  SELECT R1.A, R1.B, R2.T
  FROM ( SELECT P1.subject AS A, P2.subject AS B
         FROM Triples P1, Triples P2, Triples P3
         WHERE P1.predicate='works_for' AND P2.predicate='works_for' AND P3.predicate='phd_from'
           AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object
           AND REGEXP_LIKE(P1.object, 'Saar') ) R1
  LEFT OUTER JOIN
       ( SELECT P4.subject AS A, P4.object AS T
         FROM Triples P4
         WHERE P4.predicate='teaches' ) R2
  ON (R1.A=R2.A)
Operator tree (figure): scans of P1, P2, P3, P4; joins on ?u, on ?a,?u and on ?a; Filter regex(?u,Saar); Projection

37 Is that all? Well, no. Which indexes should be built? (to support efficient evaluation of triple patterns) How can we reduce storage space? How can we find the best execution plan? Existing databases need modifications: flexible, extensible, generic storage not needed here cannot deal with multiple self-joins of a single table often generate bad execution plans

38 Dictionary for Strings Map all strings to unique integers (e.g., via hashing) Regular size (4-8 bytes), much easier to handle Dictionary usually small, can be kept in main memory This may break original lexicographic sorting order RANGE conditions (not in SPARQL) are difficult! FILTER conditions may be more expensive!
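A dictionary of this kind fits in a few lines. The sketch below assigns ids with a running counter instead of the hashing mentioned on the slide (a deliberate simplification), and the class name `Dictionary` is hypothetical. It also shows why the lexicographic order can break: ids follow insertion order, not string order.

```python
from typing import Dict, List


class Dictionary:
    """Bidirectional mapping between RDF terms (strings) and integer ids."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._terms: List[str] = []

    def encode(self, term: str) -> int:
        """Return the id of `term`, assigning the next free id on first use."""
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def decode(self, term_id: int) -> str:
        """Return the string that belongs to `term_id`."""
        return self._terms[term_id]


d = Dictionary()
triple = ("ex:Martin", "ex:teaches", "ex:Databases")
encoded = tuple(d.encode(t) for t in triple)
decoded = tuple(d.decode(i) for i in encoded)

# ids reflect insertion order, so sorting by id is not sorting by string:
id_katja = d.encode("ex:Katja")          # inserted last, gets the largest id
order_by_id = [d.decode(i) for i in sorted([id_katja, encoded[0]])]
```

Here `order_by_id` lists ex:Martin before ex:Katja, the opposite of the lexicographic order, which is why range and some FILTER conditions have to go back to the strings.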

39 Indexes for Commonly Used Triple Patterns
Patterns with a single variable are frequent. Example: Albert_Einstein invented ?x
Build clustered index over (s,p,o); it can also be used for patterns like Albert_Einstein ?p ?x
Build similar clustered indexes for all six permutations (3 x 2 x 1 = 6):
– SPO, POS, OSP to cover all possible triple patterns
– SOP, OPS, PSO to have all sort orders for patterns with two variables
All triples in (s,p,o) order, in a B+ tree for easy access: (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) …
1. Look up ids for constants: Albert_Einstein → 16, invented → 24
2. Look up known prefix in index: (16,24,0)
3. Read results while prefix matches: (16,24,567), (16,24,876); they come already sorted!
Triple table no longer needed: all triples are in each index
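The three lookup steps can be imitated with a sorted list and binary search standing in for the clustered B+ tree. This is a sketch under that assumption; `prefix_scan` is a hypothetical helper name, and the ids are the ones used on the slide.

```python
from bisect import bisect_left
from typing import List, Tuple

# All triples in (s, p, o) order; a sorted list stands in for the B+ tree.
SPO: List[Tuple[int, int, int]] = sorted([
    (16, 19, 5356), (16, 24, 567), (16, 24, 876),
    (27, 19, 643), (27, 48, 10486), (50, 10, 10456),
])


def prefix_scan(index: List[Tuple[int, ...]], prefix: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Return all index entries that start with `prefix`, in index order."""
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so bisect_left lands on the first matching entry (if there is one).
    start = bisect_left(index, prefix)
    results = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break                      # left the matching range, stop reading
        results.append(entry)
    return results


# Albert_Einstein -> 16, invented -> 24 (ids as on the slide)
inventions = prefix_scan(SPO, (16, 24))     # pattern: Albert_Einstein invented ?x
everything = prefix_scan(SPO, (16,))        # pattern: Albert_Einstein ?p ?x
```

A pattern with a constant object but variable subject, such as ?x invented 567, cannot use this index efficiently, which is the motivation for the other permutations.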

40 Why Sort Order Matters for Joins
When inputs are sorted by the join attribute, use Merge Join (MJ):
– sequentially scan both inputs
– immediately join matching triples
– skip over parts without matches
– allows pipelining
MJ example inputs: (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) and (16,33,46578) (16,56,1345) (24,16,1353) (27,18,133) (47,37,20495) (50,134,1056)
When inputs are unsorted or sorted by the wrong attribute, use Hash Join (HJ):
– build hash table from one input
– scan other input, probe hash table
– needs to touch every input triple
– breaks pipelining
HJ example inputs: (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) and (27,18,133) (50,134,1056) (16,56,1345) (24,16,1353) (47,37,20495) (16,33,46578)
In general, Merge Joins are preferable: small memory footprint, pipelining
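Both join strategies can be sketched on the slide's example inputs, joining on the first component (the subject). The function names are hypothetical, and the code returns full result lists for easy comparison, whereas a pipelined merge join would yield results one at a time.

```python
from collections import defaultdict
from typing import List, Tuple

Triple = Tuple[int, int, int]

LEFT: List[Triple] = [(16, 19, 5356), (16, 24, 567), (16, 24, 876),
                      (27, 19, 643), (27, 48, 10486), (50, 10, 10456)]
RIGHT_SORTED: List[Triple] = [(16, 33, 46578), (16, 56, 1345), (24, 16, 1353),
                              (27, 18, 133), (47, 37, 20495), (50, 134, 1056)]
RIGHT_UNSORTED: List[Triple] = [(27, 18, 133), (50, 134, 1056), (16, 56, 1345),
                                (24, 16, 1353), (47, 37, 20495), (16, 33, 46578)]


def merge_join(left: List[Triple], right: List[Triple]) -> List[Tuple[Triple, Triple]]:
    """Join two inputs that are both sorted by subject."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1                                  # skip left part without match
        elif left[i][0] > right[j][0]:
            j += 1                                  # skip right part without match
        else:
            key = left[i][0]
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            out.extend((l, r) for l in left[i:i_end] for r in right[j:j_end])
            i, j = i_end, j_end
    return out


def hash_join(build: List[Triple], probe: List[Triple]) -> List[Tuple[Triple, Triple]]:
    """Join inputs in any order: hash one side, probe with the other."""
    table = defaultdict(list)
    for t in build:                                 # must read the whole build side first
        table[t[0]].append(t)
    return [(b, p) for p in probe for b in table.get(p[0], [])]


mj = merge_join(LEFT, RIGHT_SORTED)
hj = hash_join(LEFT, RIGHT_UNSORTED)
```

Both produce the same nine pairs (6 for subject 16, 2 for 27, 1 for 50); only the merge join delivers them in subject order and without a hash table in memory.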

41 RDF-3X: Even More Indexes!
SPARQL 1.0 considers duplicates (unless removed with DISTINCT) but does not (yet) support aggregates/counting
Queries often produce many duplicates, like SELECT ?x WHERE { ?x ?y Germany. } to retrieve entities related to Germany (but counts may be important in the application!); this materializes many identical intermediate results
Solution: even more redundancy!
– Pre-compute aggregated indexes SP, SO, PO, PS, OP, OS, S, P, O. Example: SO contains, for each pair (s,o), the number of triples with subject s and object o
– Do not materialize identical bindings, but keep counts. Example: ?x=Albert_Einstein:4 ; ?x=Angela_Merkel:10
15 indexes overall (all SPO permutations + their unique subsets)
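An aggregated SO index boils down to a count per (subject, object) pair. The sketch below uses `collections.Counter` and a handful of toy triples invented for illustration (they are not taken from any dataset, and the counts differ from the slide's example numbers).

```python
from collections import Counter
from typing import Dict, List, Tuple

# Toy triples, invented for this illustration only.
TRIPLES: List[Tuple[str, str, str]] = [
    ("Albert_Einstein", "bornIn", "Germany"),
    ("Albert_Einstein", "citizenOf", "Germany"),
    ("Angela_Merkel", "citizenOf", "Germany"),
    ("Angela_Merkel", "livesIn", "Germany"),
    ("Angela_Merkel", "leaderOf", "Germany"),
    ("Max_Planck", "bornIn", "Kiel"),
]

# Aggregated SO index: (subject, object) -> number of triples connecting them.
SO: Counter = Counter((s, o) for s, _p, o in TRIPLES)


def related_to(obj: str) -> Dict[str, int]:
    """Answer `SELECT ?x WHERE { ?x ?y obj }` as bindings with counts."""
    return {s: n for (s, o), n in SO.items() if o == obj}


germany = related_to("Germany")
```

Instead of five identical-looking result rows, the query returns two bindings with counts, which is what "do not materialize identical bindings, but keep counts" amounts to.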

42 RDF-3X: Compression Scheme for Triples
Compress sequences of triples in lexicographic order (v1;v2;v3); for SPO: v1=S, v2=P, v3=O
Step 1: compute per-attribute deltas
Step 2: variable-byte encoding for each delta triple (1-13 bytes)
Triples: (16,19,5356) (16,24,567) (16,24,676) (27,19,643) (27,48,10486) (50,10,10456)
Deltas: (16,19,5356) (0,5,-4789) (0,0,109) (11,-5,-33) (0,29,9843) (23,-38,-30)
Layout: gap bit | header (7 bits) | delta of value 1 (0-4 bytes) | delta of value 2 (0-4 bytes) | delta of value 3 (0-4 bytes)
When gap=1, the delta of value 3 is included in the header and all other deltas are 0
Otherwise, the header contains the length of the encoding for each of the three deltas (5*5*5=125 combinations stored in 7 bits)
Many variants exist; this one is designed for triples…
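Step 1 and the byte-length part of step 2 can be checked with a short sketch. It mirrors only the slide's plain per-attribute deltas; how the actual on-disk format treats negative deltas is not covered by the slide and is not modeled here. All function names are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[int, int, int]

TRIPLES: List[Triple] = [(16, 19, 5356), (16, 24, 567), (16, 24, 676),
                         (27, 19, 643), (27, 48, 10486), (50, 10, 10456)]


def to_deltas(triples: List[Triple]) -> List[Triple]:
    """Replace every triple by its component-wise difference to the previous one."""
    prev, out = (0, 0, 0), []
    for t in triples:
        out.append(tuple(c - p for c, p in zip(t, prev)))
        prev = t
    return out


def from_deltas(deltas: List[Triple]) -> List[Triple]:
    """Inverse of to_deltas: running component-wise sums."""
    prev, out = (0, 0, 0), []
    for d in deltas:
        prev = tuple(p + c for p, c in zip(prev, d))
        out.append(prev)
    return out


def byte_length(value: int) -> int:
    """Number of bytes needed for the magnitude of `value` (0 needs no byte)."""
    return (abs(value).bit_length() + 7) // 8


DELTAS = to_deltas(TRIPLES)
# Payload size per delta triple, not counting the one-byte gap/header.
PAYLOAD_BYTES = [sum(byte_length(v) for v in d) for d in DELTAS]
```

The third triple illustrates the gap case: its delta (0,0,109) needs a single payload byte here, and on the slide's layout such a small delta of value 3 is folded into the header itself.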

43 Compression Effectiveness vs. Efficiency Byte-level encoding almost as effective as bit-level encoding techniques (Gamma, Golomb, Rice, etc.) Much faster (10x) for decompressing Example for Barton dataset [Neumann & Weikum: VLDB10]: –Raw data 51 million triples, 7GB uncompressed (as N-Triples) –All 6 main indexes: 1.1GB size, 3.2s decompression with byte-level encoding Optionally: additional compression with LZ77 2x more compact, but much slower to decompress –Compression always on page level

44 Back to the Example Query
SELECT ?a ?b ?t WHERE { ?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. OPTIONAL {?a teaches ?t} FILTER (regex(?u, "Saar")) }
Plan 1 (figure): index scans POS(works_for,?u,?a), POS(phd_from,?u,?a), PSO(works_for,?u,?b), POS(teaches,?a,?t); joins on ?u,?a, on ?u and on ?a (MJ); Filter regex(?u,Saar); Projection
Plan 2 (figure): index scans POS(works_for,?u,?a), POS(works_for,?u,?b), PSO(phd_from,?a,?u), POS(teaches,?a,?t); joins on ?u, on ?a,?u and on ?a (MJ, HJ); Filter regex(?u,Saar); Projection
Which of the two plans is better? How many intermediate results?
Core ingredients of a good query optimizer are selectivity estimators for triple patterns (index scans) and joins

45 RDF-3X: Selectivity Estimation
How many results will a triple pattern have?
Standard databases: too simplistic and inexact
– Per-attribute histograms
– Assume independence of attributes
RDF-3X:
– Use aggregated indexes for exact count
– Additional join statistics for triple blocks (pages): … (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) …
Assume independence between triple patterns; additionally precompute exact statistics for frequent paths in the data

46 Handling Updates What should we do when our data changes? (SPARQL 1.1 has updates!) Assumptions: Queries far more frequent than updates Updates mostly insertions, hardly any deletions Different applications may update concurrently Solution: Differential Indexing

47 RDF-3x: Differential Updates Workspace A: Triples inserted by application A Workspace B: Triples inserted by application B on-demand indexes at query time kept in main memory Staging architecture for updates in RDF-3X completion of A completion of B Deletions: Insert the same tuple again with deleted flag Modify scan/join operators: merge differential indexes with main index
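The modified scan operator can be pictured as a merge of the sorted main index with a small sorted differential index whose entries carry a deleted flag. The sketch below is a simplification under stated assumptions: a single differential index, deletions always win over insertions of the same triple, and `merged_scan` is a hypothetical name.

```python
import heapq
from typing import Iterator, List, Tuple

Triple = Tuple[int, int, int]


def merged_scan(main: List[Triple],
                diff: List[Tuple[Triple, bool]]) -> Iterator[Triple]:
    """Scan main index and differential index together, in sorted order.

    `main` is sorted; `diff` is a sorted list of (triple, deleted_flag).
    """
    deleted = {t for t, is_deleted in diff if is_deleted}
    inserted = [t for t, is_deleted in diff if not is_deleted]
    previous = None
    for t in heapq.merge(main, inserted):      # both inputs are already sorted
        if t in deleted or t == previous:      # drop deleted and duplicate triples
            continue
        previous = t
        yield t


MAIN = [(16, 19, 5356), (16, 24, 567), (27, 19, 643)]
DIFF = [((16, 24, 567), True),     # deletion: same tuple with deleted flag
        ((16, 24, 876), False)]    # insertion
visible = list(merged_scan(MAIN, DIFF))
```

Because the output stays sorted, operators above the scan, merge joins in particular, work unchanged on the merged stream.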

48 Outline for Part I Part I.1: Foundations –Introduction to RDF –A short overview of SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

49 Principles Observations and assumptions: Not too many different predicates Triple patterns usually have fixed predicate Need to access all triples with one predicate Design consequence: Use one two-attribute table for each predicate

50 Example: Columnstores
RDF data:
ex:Katja ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:TU_Ilmenau.
ex:Martin ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:Saarland_University.
ex:Ralf ex:teaches ex:Information_Retrieval; ex:PhD_from ex:Saarland_University; ex:works_for ex:Saarland_University, ex:MPI_Informatics.
Table PhD_from:
subject | object
ex:Katja | ex:TU_Ilmenau
ex:Martin | ex:Saarland_University
ex:Ralf | ex:Saarland_University
Table works_for:
subject | object
ex:Katja | ex:MPI_Informatics
ex:Martin | ex:MPI_Informatics
ex:Ralf | ex:Saarland_University
ex:Ralf | ex:MPI_Informatics
Table teaches:
subject | object
ex:Katja | ex:Databases
ex:Martin | ex:Databases
ex:Ralf | ex:Information_Retrieval
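Splitting a triple table into one two-attribute table per predicate is a simple grouping step. The following sketch does it in memory on the example data and then separates one table into its two columns; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TRIPLES: List[Tuple[str, str, str]] = [
    ("ex:Katja", "ex:teaches", "ex:Databases"),
    ("ex:Katja", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Katja", "ex:PhD_from", "ex:TU_Ilmenau"),
    ("ex:Martin", "ex:teaches", "ex:Databases"),
    ("ex:Martin", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Martin", "ex:PhD_from", "ex:Saarland_University"),
    ("ex:Ralf", "ex:teaches", "ex:Information_Retrieval"),
    ("ex:Ralf", "ex:PhD_from", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:MPI_Informatics"),
]


def partition_by_predicate(triples) -> Dict[str, List[Tuple[str, str]]]:
    """One (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def to_columns(rows: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Store the subject and object columns of one table separately."""
    subjects, objects = zip(*rows)
    return list(subjects), list(objects)


TABLES = partition_by_predicate(TRIPLES)
phd_subjects, phd_objects = to_columns(TABLES["ex:PhD_from"])
```

A pattern with a fixed predicate now touches exactly one table, while a pattern with a variable predicate has to visit all of them.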

51 Simplified Example: Query Conversion
SPARQL: SELECT ?a ?b WHERE {?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. }
SQL: SELECT W1.subject as A, W2.subject as B FROM works_for W1, works_for W2, phd_from P3 WHERE W1.object=W2.object AND W1.subject=P3.subject AND W1.object=P3.object
So far, this is yet another relational representation of RDF. So, what is a columnstore?

52 Columnstores and RDF Columnstores store all columns of a table separately. subject object ex:Katjaex:TU_Ilmenau ex:Martin ex:Saarland_University ex:Ralf ex:Saarland_University PhD_from PhD_from:subject ex:Katja ex:Martin ex:Ralf PhD_from:object ex:TU_Ilmenau ex:Saarland_University Advantages: Fast if only subject or object are accessed, not both Allows for a very compact representation Problems: Need to recombine columns if subject and object are accessed Inefficient for triple patterns with predicate variable

53 Compression in Columnstores General ideas: Store subject only once Use same order of subjects for all columns, including NULL values when necessary Additional compression to get rid of NULL values subject ex:Katja ex:Martin ex:Ralf PhD_from ex:TU_Ilmenau ex:Saarland_University NULL works_for ex:MPI_Informatics ex:Saarland_University ex:MPI_Informatics teaches ex:Databases ex:Databases ex:Information_Retrieval NULL PhD_from: bit[1110] ex:TU_Ilmenau ex:Saarland_University Teaches: range[1-3] ex:Databases ex:Databases ex:Information_Retrieval

54 Outline for Part I Part I.1: Foundations –Introduction to RDF –A short overview of SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

55 Property Tables
Group entities with similar predicates into a relational table (for example using RDF types or a clustering algorithm).
RDF data:
ex:Katja ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:TU_Ilmenau.
ex:Martin ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:Saarland_University.
ex:Ralf ex:teaches ex:Information_Retrieval; ex:PhD_from ex:Saarland_University; ex:works_for ex:Saarland_University, ex:MPI_Informatics.
Property table:
subject | teaches | PhD_from
ex:Katja | ex:Databases | ex:TU_Ilmenau
ex:Martin | ex:Databases | ex:Saarland_University
ex:Ralf | ex:IR | ex:Saarland_University
ex:Axel | NULL | ex:TU_Vienna
Leftover triples:
subject | predicate | object
ex:Katja | ex:works_for | ex:MPI_Informatics
ex:Martin | ex:works_for | ex:MPI_Informatics
ex:Ralf | ex:works_for | ex:Saarland_University
ex:Ralf | ex:works_for | ex:MPI_Informatics

56 Property Tables: Pros and Cons Advantages: More in the spirit of existing relational systems Saves many self-joins over triple tables etc. Disadvantages: Potentially many NULL values Multi-value attributes problematic Query mapping depends on schema Schema changes very expensive

57 Even More Systems… Store RDF data as sparse matrix with bit-vector compression [BitMat, Hendler et al.: ISWC09] Convert RDF into XML and use XML methods (XPath, XQuery, …) Store RDF data in graph databases and perform bi-simulation [Fletcher et al.: ESWC12] or employ specialized graph index structures [gStore, Zou et al.: PVLDB11] And many more … See our list of readings.

58 Which Technique is Best? Performance depends a lot on precomputation, optimization, implementation, fine-tuning … Comparative results on BTC 2008 (from [Neumann & Weikum, 2009]); chart legend: RDF-3X, RDF-3X (2008), COLSTORE, ROWSTORE

59 Challenges and Opportunities SPARQL with different entailment regimes New SPARQL 1.1 features (grouping, aggregation, updates) User-oriented ranking of query results –Efficient top-k operators –Effective scoring methods for structured queries What are the limits of a centralized RDF engine? Dealing with uncertain RDF data – what is the most likely query answer? –Triples with probabilities probabilistic databases

60 Outline of this Tutorial Part I –RDF in Centralized Relational Databases Part II –RDF in Distributed Settings Part III –Managing Uncertain RDF Data

61 Outline for Part II Part II.1: Search Engines for the Semantic Web Part II.2: Mediator-based and Federated Architectures

62 Semantic Web Search Engines
Querying RDF data collections started by adapting existing search engines to RDF data.
–Crawling for .rdf files and HTML documents with embedded RDF content (see: RDFa, microformat).
–Indexing & search based on keywords extracted from entity and property names.
–Usually generate a virtual document for an entity (string literals and human-readable names).
Swoogle [Ding et al., CIKM04] (University of Maryland)
Falcons [Cheng et al., WWW08] (Nanjing University)

65 Outline for Part II Part II.1: Search Engines for the Semantic Web Part II.2: Mediator-based and Federated Architectures

66 Classification of Distributed Approaches
Approaches for querying distributed and potentially heterogeneous (RDF) data sources:
– Materialization-based approaches (data-warehousing)
– Virtually materialized approaches
Categories shown in the figure: Peer-2-Peer; Federated systems; MapReduce/Hadoop; Shared-memory architectures (Message-Passing, RMI, etc.); Shared-nothing architectures; Mediator-based systems
Systems shown in the figure: Shard, Jena-HBase, [Abadi et al. PVLDB11], Trinity (MSR), DARQ, FedX, YARS2, Gridvine, RDFPeers, Partout, 4Store, Eagre

67 How to Integrate Data Sources? Ship and integrate data from different sources to the client. Three common approaches: – Query-driven (single mediator) – Database federations (exported schemas) – Warehousing (fully integrated & centrally managed)

68 Query-Driven Approach [Figure: queries sent to the sources and results returned] List of SPARQL endpoints: DBpedia: YAGO: https://d5gate.ag5.mpi-sb.mpg.de/webyagospo/Browser

69 Advantages of Query-Driven Integration No need to copy data –no or little own storage costs –no need to purchase data Potentially more up-to-date data Mediator holds catalog (statistics, etc.) and may optimize queries Only generic query interface needed at sources (SPARQL endpoints) May be less draining on sources Sources often even unaware of participation

70 Federation-based Approach [Figure: queries and results exchanged with Source 1, Source 2, …, Source n]

71 Advantages of Federation-Based Integration Very similar to query-driven integration, except –that the sources know that they are part of a federation; –and they export their local schemas into a federated schema. Intermediate step toward full integration of the data in a single warehouse.

72 Warehousing Architecture [Figure: SPARQL clients, metadata, integrated LOD index]

73 Advantages of Warehousing Perform Extract-Transform-Load (ETL) processes with periodic updates over the source High query performance Local processing at sources unaffected Can operate even when sources are offline Can query data that is no longer stored at sources More detailed statistics and metadata available at warehouse –Modify, summarize (store aggregates), analyse –Add historical information, provenance, timestamps, etc.

74 Classification of Distributed Approaches
Approaches for querying distributed and potentially heterogeneous (RDF) data sources:
– Materialization-based approaches (data-warehousing)
– Virtually materialized approaches
Categories shown in the figure: Peer-2-Peer; Federated systems; MapReduce/Hadoop; Shared-memory architectures (Message-Passing, RMI, etc.); Shared-nothing architectures; Mediator-based systems
Systems shown in the figure: Shard, Jena-HBase, [Abadi et al. PVLDB11], Trinity (MSR), DARQ, FedX, YARS2, Gridvine, RDFPeers, Partout, 4Store, Eagre

75 DARQ [Leser et al., Humboldt University Berlin, ISWC08] Classical mediator-based architecture connecting a given SPARQL endpoint to other endpoints via a combination of wrappers and service descriptions. Service descriptions –RDF data descriptions –Statistical information –Binding constraints Query optimizer based on rewriting rules and cost estimations for physical join operators.

76 FedX [fluid Ops & MPI-INF: ISWC11]
Online query optimization over federations of SPARQL endpoints.
Cost estimates based on result sizes of SPARQL ASK queries.
Bound nested-loop joins by grouping sets of variable bindings into SPARQL UNION queries (instead of using FILTER conditions).
Example query: SELECT ?drug ?title WHERE { ?drug drugbank:drugCategory drugbank-category:micronutrient. ?drug drugbank:casRegistryNumber ?id. ?keggDrug rdf:type kegg:Drug. ?keggDrug bio2rdf:xRef ?id. ?keggDrug purl:title ?title. }
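The idea of grouping several variable bindings into one UNION query can be sketched as plain string construction. This is an illustration of the principle only, not the system's actual implementation: `union_query` is a hypothetical helper, the per-branch variable renaming is an assumption of this sketch, and the values "id-1" and "id-2" are placeholders.

```python
from typing import List


def union_query(subject_var: str, predicate: str, values: List[str]) -> str:
    """Build one SPARQL query that probes an endpoint for several bindings.

    Each binding gets its own UNION branch and its own renamed variable,
    so every result row can be traced back to the binding it belongs to.
    """
    variables, branches = [], []
    for i, value in enumerate(values):
        var = f"?{subject_var}_{i}"
        variables.append(var)
        branches.append(f'{{ {var} {predicate} "{value}" }}')
    return f"SELECT {' '.join(variables)} WHERE {{ {' UNION '.join(branches)} }}"


# Two placeholder bindings for ?id from the left side of the join.
query = union_query("keggDrug", "bio2rdf:xRef", ["id-1", "id-2"])
```

One request now carries a whole block of bindings, so the number of remote calls drops from one per binding to one per block.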

77 Partout [Galárraga, Hose, Schenkel: PVLDB13] Materialization-based, distributed & workload-aware SPARQL engine. Distribution helps to scale out query processing via parallel join executions. Triple fragments are distributed over hosts H1…Hn by –(1) maximizing query locality, and –(2) balancing the hosts' workload. H1…Hn run local RDF-3X instances. –(1) local S,P,O statistics by RDF-3X, –(2) global (cached) statistics. [Figure: global query workload (a.k.a. query log) and global query graph]

78 Partout Example Query Plan H1, H2, H3 hold triples for ?s rdf:type db:city. H1 has triples for ?s db:located db:Germany. H2 has triples for ?s db:name ?name.

79 More Distributed RDF Engines
Shard TripleStore (Hadoop + Hash-Partitioning)
RDFPeers (P2P/Chord architecture) [Cai et al., WWW04]
Gridvine (P2P/Chord architecture) [Aberer et al., VLDB07]
YARS2 (federated architecture) [Decker et al., ISWC07]
Jena-HBase (Hadoop & HBase) [Khadilkar et al., ISWC12]
SW-Store (Hadoop/RDF-3X) [Abadi et al., PVLDB11]
4Store (materialized, shared-nothing) [Harris et al., SSWS09]
Eagre (materialized, shared-nothing) [HKUST & HP Labs, ICDE13]
Trinity (materialized, shared-memory, message passing) [MSR, SIGMOD13]
more in Zois tutorial in the afternoon…

80 Outline of this Tutorial Part I –RDF in Centralized Relational Databases Part II –RDF in Distributed Settings Part III –Managing Uncertain RDF Data

81 Outline for Part III Part III.1: Motivation –What is uncertain data, and where does it come from? Part III.2: Possible Worlds & Beyond Part III.3: Probabilistic Database Engines –Stanford Trio Project –MYSTIQ (U Washington) Part III.4: Managing Uncertain RDF Data –URDF (Max Planck Institute)

82 What is Uncertain Data?
Certain Data | Uncertain Data
Temperature is F | Sensor reported 75 ± 0.5 F
Bob works for Yahoo | Bob works for either Yahoo or Microsoft
Mary sighted a Finch | Mary sighted either a Finch (60%) or a Sparrow (40%)
It always rains in Galway | There is an 89% chance of rain in Galway tomorrow
Yahoo stocks will be at 100 in a month | Yahoo stock will be between 60 and 120 in a month
John's age is 23 | John's age is in [20,30]

83 … And Why Does It Arise? (Same Certain Data / Uncertain Data table as on slide 82.) Reasons: –Precision of devices –Lack of exact information (alternatives and missing values) –Uncertainty about future events –Anonymization

84 Applications: Deduplication Name John Doe J. Doe ? 80% match

85 Applications: Information Integration name, hPhone, oPhone, hAddr, oAddr name, phone, address Combined View at the schema level: schema integration at the instance level: record linkage

86 Applications: Information Extraction (I) RestaurantZip Hard Rock Cafe

87 Applications: Information Extraction (II) What is Uncertain Data and Why Does It Arise?

88 Applications: Information Extraction (III) YAGO/DBpedia et al.: bornOn(Jeff, 09/22/42) gradFrom(Jeff, Columbia) hasAdvisor(Jeff, Arthur) hasAdvisor(Surajit, Jeff) knownFor(Jeff, Theory); >120 M facts for YAGO2 (mostly from Wikipedia infoboxes). New fact candidates: type(Jeff, Author) [0.9] author(Jeff, Drag_Book) [0.8] author(Jeff, Cind_Book) [0.6] worksAt(Jeff, Bell_Labs) [0.7] type(Jeff, CEO) [0.4]; 100s M additional facts from Wikipedia text.

89 How do current database management systems (DBMS) handle uncertainty? They don't.

90 What Do (Most) Applications Do? Clean: turn into data that DBMSs can handle.
Observer | Bird-1 (uncertain) | Bird-1 (cleaned)
Mary | Finch: 80%, Sparrow: 20% | Finch
Susan | Dove: 70%, Sparrow: 30% | Dove
Jane | Hummingbird: 65%, Sparrow: 35% | Hummingbird
(1) Loss of information (2) Errors compound and propagate insidiously

91 Outline for Part III Part III.1: Motivation –What is uncertain data, and where does it come from? Part III.2: Possible Worlds & Beyond Part III.3: Probabilistic Database Engines –Stanford Trio Project –MYSTIQ (U Washington) Part III.4: Managing Uncertain RDF Data –URDF (Max Planck Institute) Reading: Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases (Synthesis Lectures on Data Management), Morgan & Claypool Publishers, 2012

92 Databases Today are Deterministic An item either is in the database or it is not. A tuple either is in the query answer or it is not. This applies to all variety of data models: – Relational, E/R, hierarchical, XML, …

93 What is a Probabilistic Database? "A tuple belongs to the database" is a probabilistic event. "A tuple is an answer to the query" is a probabilistic event. Can be extended to all possible kinds of data models; we consider only probabilistic relational data.

94 Sample Spaces & Venn Diagrams Sample space $\Omega$: all possible events that can be observed; $\Pr(\Omega) = 1$. Random variable $\chi_t$ assigns a probability to an event s.t. $0 \le \Pr(\chi_t) \le 1$. As a convention, we will use tuple identifiers in the place of random variables to denote probabilistic events. [Venn diagram with the events "Tuple $t_1$ is in the database." and "Tuple $t_2$ is an answer to a query."]

95 Possible Worlds Semantics
Relational schema: Employee(ID:int, name:varchar(55), dob:datetime, salary:int)
Attribute domains: int, varchar(55), datetime
# values: $2^{32}$, $2^{440}$, $2^{64}$
# of possible tuples: $2^{32} \times 2^{440} \times 2^{64} \times 2^{32}$
# of possible relation instances: $2^{\,2^{32} \times 2^{440} \times 2^{64} \times 2^{32}}$
Database schema: Employee(...), Projects(...), Groups(...), WorksFor(...)
# of possible database instances: N (= big but finite)
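The counting argument on this slide can be checked with exact integer arithmetic. The sketch below assumes that varchar(55) means 55 one-byte characters (which gives the $2^{440}$ on the slide); the variable names are illustrative only.

```python
# Bits per attribute of Employee(ID, name, dob, salary).
# Assumption: varchar(55) is 55 characters of 8 bits each.
ATTRIBUTE_BITS = {"ID": 32, "name": 55 * 8, "dob": 64, "salary": 32}

# A tuple is one value per attribute, so the bit widths add up.
tuple_bits = sum(ATTRIBUTE_BITS.values())
num_possible_tuples = 2 ** tuple_bits

# A relation instance is any subset of the possible tuples, i.e. there are
# 2 ** num_possible_tuples of them. That number is far too large to
# materialize, so we only keep its base-2 logarithm.
log2_relation_instances = num_possible_tuples

print("bits per tuple:", tuple_bits)
print("decimal digits in #tuples:", len(str(num_possible_tuples)))
```

The count is astronomically large but finite, which is all the possible-worlds definition needs.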

96 The Definition Given a finite set of all possible database instances: $\mathrm{INST} = \{I_1, I_2, I_3, \ldots, I_N\}$. Definition: A probabilistic database $I^p$ is a probability distribution $\Pr : \mathrm{INST} \to [0,1]$ on INST s.t. $\sum_{i=1,\ldots,N} \Pr(I_i) = 1$. Definition: A possible world is $I \in \mathrm{INST}$ s.t. $\Pr(I) > 0$.

97 Example $I^p$ = the following instances over the schema (Customer, Address, Product):
$I_1$, $\Pr(I_1) = 1/3$: (John, Seattle, Gizmo), (John, Seattle, Camera), (Sue, Denver, Gizmo)
$I_2$, $\Pr(I_2) = 1/12$: (John, Boston, Gadget), (Sue, Denver, Gizmo)
$I_3$, $\Pr(I_3) = 1/2$: (John, Seattle, Gizmo), (John, Seattle, Camera), (Sue, Seattle, Camera)
$I_4$, $\Pr(I_4) = 1/12$: (John, Boston, Gadget), (Sue, Seattle, Camera)
Possible worlds = $\{I_1, I_2, I_3, I_4\}$

98 Tuples as Events One tuple $t$: event $t \in I$. Marginal probability of $t$: $\Pr(t) = \sum_{I:\, t \in I} \Pr(I)$. Two tuples $t_1, t_2$: event $t_1 \in I \wedge t_2 \in I$. Marginal probability of $t_1 \wedge t_2$: $\Pr(t_1 \wedge t_2) = \sum_{I:\, t_1 \in I \wedge t_2 \in I} \Pr(I)$.
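As a small worked example of these sums, the sketch below computes marginal probabilities over the four possible worlds of the example on slide 97. The dictionary layout and the helper name `marginal` are illustrative choices, not part of the tutorial.

```python
from fractions import Fraction

# The four possible worlds of slide 97: (probability, set of tuples).
WORLDS = {
    "I1": (Fraction(1, 3), {("John", "Seattle", "Gizmo"),
                            ("John", "Seattle", "Camera"),
                            ("Sue", "Denver", "Gizmo")}),
    "I2": (Fraction(1, 12), {("John", "Boston", "Gadget"),
                             ("Sue", "Denver", "Gizmo")}),
    "I3": (Fraction(1, 2), {("John", "Seattle", "Gizmo"),
                            ("John", "Seattle", "Camera"),
                            ("Sue", "Seattle", "Camera")}),
    "I4": (Fraction(1, 12), {("John", "Boston", "Gadget"),
                             ("Sue", "Seattle", "Camera")}),
}


def marginal(*tuples):
    """Sum Pr(I) over all worlds I that contain every given tuple."""
    return sum((p for p, inst in WORLDS.values()
                if all(t in inst for t in tuples)), Fraction(0))


john_gizmo = ("John", "Seattle", "Gizmo")
john_camera = ("John", "Seattle", "Camera")
john_gadget = ("John", "Boston", "Gadget")

print(marginal(john_gizmo))               # single-tuple marginal
print(marginal(john_gizmo, john_camera))  # joint marginal of two tuples
print(marginal(john_gizmo, john_gadget))  # two tuples that never co-occur
```

Comparing a joint marginal with the product of the single-tuple marginals shows whether two tuples are independent, correlated, or disjoint.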

99 Tuple Correlations
Disjoint (D), Disjoint-AND (D∧): $\Pr(t_1 \wedge t_2) = 0$
Negatively correlated (N): $\Pr(t_1 \wedge t_2) < \Pr(t_1)\Pr(t_2)$
Positively correlated (P): $\Pr(t_1 \wedge t_2) > \Pr(t_1)\Pr(t_2)$
Identical (=): $\Pr(t_1 \wedge t_2) = \Pr(t_1) = \Pr(t_2)$
Independent-AND (I∧): $\Pr(t_1 \wedge t_2) = \Pr(t_1)\Pr(t_2)$
Independent-OR (I∨): $\Pr(t_1 \vee t_2) = 1-(1-\Pr(t_1))(1-\Pr(t_2))$
Disjoint-OR (D∨): $\Pr(t_1 \vee t_2) = \Pr(t_1)+\Pr(t_2)$
NOT: $\Pr(\neg t_1) = 1 - \Pr(t_1)$

100 Example with Correlations [Figure: the four possible worlds $I_1, \ldots, I_4$ of the example on slide 97, with $\Pr(I_1) = 1/3$, $\Pr(I_2) = 1/12$, $\Pr(I_3) = 1/2$, $\Pr(I_4) = 1/12$; pairs of tuples are annotated as identical (=), negatively correlated (N), positively correlated (P), or disjoint (D).]

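101 Tuple-independent probabilistic database (Special Case!) $\mathrm{TUP} = \{t_1, t_2, \ldots, t_M\}$ = all tuples. $pr : \mathrm{TUP} \to (0,1]$ (no restrictions w.r.t. other tuples). $\mathrm{INST} = \mathcal{P}(\mathrm{TUP})$, $N = 2^M$. $\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$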

102 … back to the Venn Diagram (I) Sample Space If t 1 and t 2 are independent (per assumption!) : 4 possible worlds = 4 subsets of events Tuple t 1 is in the database. Tuple t 2 is in the database. Pr(Tuple t 1 is in the database and tuple t 2 is in the database) := Pr(t 1 ) x Pr(t 2 ) = pr(t 1 ) x pr(t 2 )

103 … back to the Venn Diagram (II) Sample Space If t 1 and t 2 are disjoint (per assumption!) : 3 possible worlds = 3 subsets of events Tuple t 1 is in the database. Tuple t 2 is in the database. Pr(Tuple t 1 is in the database and tuple t 2 is in the database) := 0

104 Tuple Prob. ⇒ Possible Worlds Assumption: Tuples are independent!
J =
Name | City | pr
John | Seattle | $p_1 = 0.8$
Sue | Boston | $p_2 = 0.6$
Fred | Boston | $p_3 = 0.9$
$I^p$ = eight possible worlds whose probabilities sum to 1:
$I_1 = \emptyset$: $(1-p_1)(1-p_2)(1-p_3)$
$I_2$ = {(John, Seattle)}: $p_1(1-p_2)(1-p_3)$
$I_3$ = {(Sue, Boston)}: $(1-p_1)p_2(1-p_3)$
$I_4$ = {(Fred, Boston)}: $(1-p_1)(1-p_2)p_3$
$I_5$ = {(John, Seattle), (Sue, Boston)}: $p_1p_2(1-p_3)$
$I_6$ = {(John, Seattle), (Fred, Boston)}: $p_1(1-p_2)p_3$
$I_7$ = {(Sue, Boston), (Fred, Boston)}: $(1-p_1)p_2p_3$
$I_8$ = {(John, Seattle), (Sue, Boston), (Fred, Boston)}: $p_1p_2p_3$
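The expansion of a tuple-independent table into its possible worlds can be sketched in a few lines. The probabilities are the ones on this slide; the function name `possible_worlds` is an illustrative choice.

```python
from itertools import product

# Tuple-independent table: each tuple carries its own probability.
PR = {("John", "Seattle"): 0.8, ("Sue", "Boston"): 0.6, ("Fred", "Boston"): 0.9}


def possible_worlds(pr):
    """Yield (world, probability) for every subset of the tuples."""
    tuples = list(pr)
    for mask in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, mask) if keep)
        prob = 1.0
        for t, keep in zip(tuples, mask):
            # Present tuples contribute pr(t), absent ones 1 - pr(t).
            prob *= pr[t] if keep else 1.0 - pr[t]
        yield world, prob


worlds = dict(possible_worlds(PR))
total = sum(worlds.values())
print(len(worlds), "worlds, total probability", round(total, 10))
```

With M tuples this enumerates 2^M worlds, which is why later slides look for ways to compute marginals without expanding the worlds.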

105 Tuple Prob. ⇒ Query Evaluation
Person^p:
Name | City | pr
John | Seattle | $p_1$
Sue | Boston | $p_2$
Fred | Boston | $p_3$
Purchase^p:
Customer | Product | Date | pr
John | Gizmo | ... | $q_1$
John | Gadget | ... | $q_2$
John | Gadget | ... | $q_3$
Sue | Camera | ... | $q_4$
Sue | Gadget | ... | $q_5$
Sue | Gadget | ... | $q_6$
Fred | Gadget | ... | $q_7$
SELECT DISTINCT x.city FROM Person^p x, Purchase^p y WHERE x.Name = y.Customer and y.Product = 'Gadget'
Marginals:
Tuple | Probability
Seattle | $p_1\,(1-(1-q_2)(1-q_3))$
Boston | $1 - \bigl(1 - p_2\,(1-(1-q_5)(1-q_6))\bigr) \times (1 - p_3\,q_7)$
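The two marginals on this slide can be cross-checked against a brute-force enumeration of all possible worlds. In the sketch below, $p_1, p_2, p_3$ are taken from slide 104, while the values of $q_1, \ldots, q_7$ are made up for illustration; the function names are also illustrative.

```python
from itertools import product

PERSON = [("John", "Seattle", 0.8), ("Sue", "Boston", 0.6), ("Fred", "Boston", 0.9)]
# (customer, product, probability); the q values are hypothetical.
PURCHASE = [("John", "Gizmo", 0.5), ("John", "Gadget", 0.4), ("John", "Gadget", 0.3),
            ("Sue", "Camera", 0.7), ("Sue", "Gadget", 0.2), ("Sue", "Gadget", 0.1),
            ("Fred", "Gadget", 0.6)]


def closed_form():
    """Marginals written exactly as on the slide."""
    p1, p2, p3 = (row[2] for row in PERSON)
    q = [None] + [row[2] for row in PURCHASE]  # q[1] .. q[7]
    seattle = p1 * (1 - (1 - q[2]) * (1 - q[3]))
    boston = 1 - (1 - p2 * (1 - (1 - q[5]) * (1 - q[6]))) * (1 - p3 * q[7])
    return {"Seattle": seattle, "Boston": boston}


def brute_force():
    """Sum the probabilities of all worlds in which a city is an answer."""
    probs = [row[2] for row in PERSON] + [row[2] for row in PURCHASE]
    result = {"Seattle": 0.0, "Boston": 0.0}
    for mask in product([False, True], repeat=len(probs)):
        weight = 1.0
        for present, pr in zip(mask, probs):
            weight *= pr if present else 1.0 - pr
        people = {name: city for (name, city, _), keep
                  in zip(PERSON, mask[:len(PERSON)]) if keep}
        cities = {people[cust] for (cust, prod, _), keep
                  in zip(PURCHASE, mask[len(PERSON):])
                  if keep and prod == "Gadget" and cust in people}
        for city in cities:
            result[city] += weight
    return result


formula, exact = closed_form(), brute_force()
print(formula)
print(exact)
```

The enumeration visits 2^10 worlds here, so it only serves as a check; the closed form is what an extensional plan would compute.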

106 Summary of Data Model Possible Worlds Semantics Very powerful model: –Can capture any tuple correlations. Needs separate representation formalism: (just tables are generally not enough) Boolean event expressions to capture complex tuple-dependencies: provenance, lineage, views, etc. But: query evaluation may be very expensive. –Need to find good cases, otherwise must approximate.

107 Outline for Part III Part III.1: Motivation –What is uncertain data, and where does it come from? Part III.2: Possible Worlds & Beyond Part III.3: Probabilistic Database Engines –Stanford Trio Project –MYSTIQ (U Washington) Part III.4: Managing Uncertain RDF Data –URDF (Max Planck Institute)

108 Trio's Data Model 1. Alternatives 2. ? (Maybe) Annotations 3. Confidence values 4. Lineage Uncertainty-Lineage Databases (ULDBs) [Widom et al.: 2008]

109 Trio's Data Model 1. Alternatives: uncertainty about value
Saw (witness, color, car): Amy | (red, Honda) || (red, Toyota) || (orange, Mazda)
Three possible instances

110 Trio's Data Model 1. Alternatives 2. ? (Maybe): uncertainty about presence
Saw (witness, color, car): Amy | (red, Honda) || (red, Toyota) || (orange, Mazda); Betty | (blue, Acura) ?
Six possible instances

111 Trio's Data Model 1. Alternatives 2. ? (Maybe) Annotations 3. Confidences: weighted uncertainty
Saw (witness, color, car): Amy | (red, Honda) 0.5 || (red, Toyota) 0.3 || (orange, Mazda) 0.2; Betty | (blue, Acura) 0.6 ?
Six possible instances, each with a probability
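The six instances and their probabilities can be enumerated directly. The sketch assumes that the two x-tuples (Amy's and Betty's) are independent and that Betty's tuple is absent with the remaining probability 1 - 0.6 = 0.4; the data layout is an illustrative choice, not Trio's storage format.

```python
from itertools import product

# Amy's sighting: exactly one of three alternatives holds.
AMY = [(("Amy", "red", "Honda"), 0.5),
       (("Amy", "red", "Toyota"), 0.3),
       (("Amy", "orange", "Mazda"), 0.2)]
# Betty's sighting is a "maybe" tuple: present with 0.6, otherwise absent.
BETTY = [(("Betty", "blue", "Acura"), 0.6), (None, 0.4)]

instances = []
for (amy_tuple, pa), (betty_tuple, pb) in product(AMY, BETTY):
    relation = [amy_tuple] + ([betty_tuple] if betty_tuple is not None else [])
    instances.append((relation, pa * pb))

for relation, prob in instances:
    print(round(prob, 2), relation)
```

Three alternatives times two presence options give the six instances mentioned on the slide.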

112 So Far: Model is Not Closed
Saw (witness, car): Cathy | Honda || Mazda
Drives (person, car): (Jimmy, Toyota) || (Jimmy, Mazda); (Billy, Honda) || (Frank, Honda); (Hank, Honda)
Suspects = π_person (Saw ⋈ Drives): Jimmy ?; Billy || Frank ?; Hank ?
CANNOT: does not correctly capture possible instances in the result

113 Example with Lineage
Saw (witness, car): ID 11: Cathy | Honda || Mazda
Drives (person, car): ID 21: (Jimmy, Toyota) || (Jimmy, Mazda); ID 22: (Billy, Honda) || (Frank, Honda); ID 23: (Hank, Honda)
Suspects = π_person (Saw ⋈ Drives): ID 31: Jimmy ?; ID 32: Billy || Frank ?; ID 33: Hank ?
Lineage: λ(31) = (11,2) ∧ (21,2); λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2); λ(33) = (11,1) ∧ 23

114 Example with Lineage
Saw (witness, car): ID 11: Cathy | Honda || Mazda
Drives (person, car): ID 21: (Jimmy, Toyota) || (Jimmy, Mazda); ID 22: (Billy, Honda) || (Frank, Honda); ID 23: (Hank, Honda)
Suspects = π_person (Saw ⋈ Drives): ID 31: Jimmy ?; ID 32: Billy || Frank ?; ID 33: Hank ?
Lineage: λ(31) = (11,2) ∧ (21,2); λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2); λ(33) = (11,1) ∧ 23 (7)

115 Operational Semantics [Diagram: a representation D stands for the possible instances I_1, I_2, …, I_n; applying Q on each instance yields J_1, J_2, …, J_m; the direct implementation maps D to a representation of these result instances.] Closure: up-arrow always exists. Completeness: any (finite) set of possible instances can be represented.

116 Summary on Trio's Data Model 1. Alternatives 2. ? (Maybe) Annotations 3. Confidence values 4. Lineage Uncertainty-Lineage Databases (ULDBs) Theorem: ULDBs are closed and complete. Formally studied properties like minimization, equivalence, approximation and membership based on lineage. [Benjelloun, Widom, et al.: VLDB J. 08]

117 MYSTIQ: Query Complexity Data complexity of a query Q: Compute Q(I^p), for probabilistic database J – Extensional query evaluation: Works for safe query plans with PTIME data complexity – Intensional query evaluation: Works for any plan but has #P-complete data complexity in the general case. Assume independent tuples in J. Compute marginal probabilities for tuples in Q. Boolean event expressions for intensional query evaluation.

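118 Extensional Query Evaluation [Fuhr&Roellke:1997, Dalvi&Suciu:2004] Relational ops compute probabilities:
Selection: $(v, p) \mapsto (v, p)$
Join (×): $(v_1, p_1), (v_2, p_2) \mapsto (v_1, v_2, p_1 p_2)$
Projection: $(v, p_1), (v, p_2), \ldots \mapsto (v, 1-(1-p_1)(1-p_2)\cdots)$, or: $p_1 + p_2 + \ldots$
Difference (−): $(v, p_1), (v, p_2) \mapsto (v, p_1(1-p_2))$
Data complexity: PTIME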

119 Safe Plans [Dalvi&Suciu:2004]
SELECT DISTINCT x.City FROM Person^p x, Purchase^p y WHERE x.Name = y.Customer and y.Product = 'Gadget'
Person^p: (Jon, Sea, $p_1$). Purchase^p: (Jon, $q_1$), (Jon, $q_2$), (Jon, $q_3$).
Plan 1 (join, then project): the join yields (Jon, Sea, $p_1q_1$), (Jon, Sea, $p_1q_2$), (Jon, Sea, $p_1q_3$); the projection gives $1-(1-p_1q_1)(1-p_1q_2)(1-p_1q_3)$. Wrong!
Plan 2 (project Purchase, then join): the projection yields (Jon, $1-(1-q_1)(1-q_2)(1-q_3)$); the join with (Jon, Sea, $p_1$) gives (Jon, Sea, $p_1(1-(1-q_1)(1-q_2)(1-q_3))$). Correct!
Depends on plan !!!
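To see numerically why the result depends on the plan, the sketch below evaluates both plans with the extensional operators and compares them with an exact enumeration of possible worlds. The probability values (all 0.5) and the function names are assumptions chosen for illustration.

```python
from itertools import product


def ext_join(p, q):
    """Extensional join: multiply the probabilities of the joined tuples."""
    return p * q


def ext_project(probs):
    """Extensional projection: 1 - prod(1 - p_i), valid only if the
    merged tuples are independent events."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss


def exact(p1, qs):
    """Ground truth: Jon is in the database AND at least one purchase is."""
    total = 0.0
    for mask in product([False, True], repeat=len(qs) + 1):
        weight = 1.0
        for present, pr in zip(mask, [p1] + qs):
            weight *= pr if present else 1.0 - pr
        if mask[0] and any(mask[1:]):
            total += weight
    return total


p1, qs = 0.5, [0.5, 0.5, 0.5]  # hypothetical values

# Plan 1: join first, then project. The three joined tuples all share the
# Person tuple, so they are NOT independent and the projection is wrong.
plan_join_first = ext_project([ext_join(p1, q) for q in qs])

# Plan 2: project Purchase first (independent tuples), then join.
plan_project_first = ext_join(p1, ext_project(qs))

print(plan_join_first, plan_project_first, exact(p1, qs))
```

Only the plan that projects before joining agrees with the enumeration, because it never merges tuples that share a common base tuple.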

120 Query Complexity Sometimes there exists a correct extensional plan, but consider the following: $Q_{bad}$ :- R(x), S(x,y), T(y). Data complexity is #P-complete [Dalvi&Suciu:2004]. NP = class of problems of the form "is there a witness?" #P = class of problems of the form "how many witnesses?" (will be coming back to this…)

121 Intensional Database [Fuhr&Roellke:1997] Atomic event ids: $e_1, e_2, e_3, \ldots$ Probabilities: $p_1, p_2, p_3, \ldots \in [0,1]$ Event expressions: ∧, ∨, ¬, e.g. $e_3 \wedge (e_5 \vee e_2)$ Intensional probabilistic database J: each tuple t has an event attribute t.E

122 Probability of Boolean Expressions (needed for query evaluation!) $E = X_1X_3 \vee X_1X_4 \vee X_2X_5 \vee X_2X_6$ Sampling: Randomly make each variable true with the following probabilities: $\Pr(X_1) = p_1, \Pr(X_2) = p_2, \ldots, \Pr(X_6) = p_6$. What is $\Pr(E)$? Answer: re-group cleverly into a read-once formula: $E = X_1(X_3 \vee X_4) \vee X_2(X_5 \vee X_6)$, so $\Pr(E) = 1 - \bigl(1 - p_1(1-(1-p_3)(1-p_4))\bigr)\,\bigl(1 - p_2(1-(1-p_5)(1-p_6))\bigr)$
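The read-once regrouping can be verified against a brute-force sum over all 64 truth assignments. The probabilities below are made-up values, and the function names are illustrative.

```python
from itertools import product

# Hypothetical probabilities for X1 .. X6.
P = {1: 0.5, 2: 0.4, 3: 0.3, 4: 0.2, 5: 0.6, 6: 0.7}


def pr_read_once(p):
    """Pr(E) via E = X1 (X3 v X4) v X2 (X5 v X6); every variable occurs
    once, so independent-AND and independent-OR can be applied bottom-up."""
    left = p[1] * (1 - (1 - p[3]) * (1 - p[4]))
    right = p[2] * (1 - (1 - p[5]) * (1 - p[6]))
    return 1 - (1 - left) * (1 - right)


def pr_brute_force(p):
    """Pr(E) by summing the weights of all satisfying assignments of
    E = X1 X3 v X1 X4 v X2 X5 v X2 X6."""
    total = 0.0
    for bits in product([False, True], repeat=6):
        x = dict(zip(range(1, 7), bits))
        if (x[1] and x[3]) or (x[1] and x[4]) or (x[2] and x[5]) or (x[2] and x[6]):
            weight = 1.0
            for i, value in x.items():
                weight *= p[i] if value else 1.0 - p[i]
            total += weight
    return total


print(pr_read_once(P), pr_brute_force(P))
```

The read-once form needs a constant number of arithmetic operations per variable, while the brute-force sum grows exponentially with the number of variables.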

123 Complexity Issues Theorem [Valiant:1979]: For a Boolean expression E, computing Pr(E) is #P-complete. NP = class of problems of the form "is there a witness?" (SAT). #P = class of problems of the form "how many witnesses?" (#SAT). The decision problem for 2CNF is in PTIME. The counting problem for 2CNF is #P-complete.

124 MYSTIQ: Probabilistic Query Evaluation on Top of a Deterministic Database Engine [Re, Suciu: VLDB04] [Figure: SQL Query, Probabilistic Query Engine, Deterministic Database, (Top-k) Answers] 1. Sampling 2. Extensional joins 3. Indexes

125 Outline for Part III Part III.1: Motivation –What is uncertain data, and where does it come from? Part III.2: Possible Worlds & Beyond Part III.3: Probabilistic Database Engines –Stanford Trio Project –MYSTIQ (U Washington) Part III.4: Uncertain RDF Data –URDF (Max Planck Institute)

126 Uncertain RDF (URDF) Data Model Extensional Layer (information extraction & integration) –High-confidence facts: existing knowledge base (ground truth) –New fact candidates: extracted facts with confidence values –Integration of different knowledge sources: Ontology merging or explicit Linked Data (owl:sameAs, owl:equivProp.) Large Probabilistic Database of RDF facts Intensional Layer (query-time inference) –Soft rules: deductive grounding & lineage (Datalog/SLD resolution) –Hard rules: consistency constraints (more general FOL rules) –Propositional & probabilistic consistency reasoning

127 Soft Rules vs. Hard Rules
(Soft) Deduction Rules: People may live in more than one place
livesIn(x,y) ← marriedTo(x,z) ∧ livesIn(z,y) [0.8]
livesIn(x,y) ← hasChild(x,z) ∧ livesIn(z,y) [0.5]
(Hard) Consistency Constraints: People are not born in different places/on different dates
bornIn(x,y) ∧ bornIn(x,z) → y=z
bornOn(x,y) ∧ bornOn(x,z) → y=z
People are not married to more than one person (at the same time, in most countries?)
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z → disjoint(t1,t2)

128 Soft Rules vs. Hard Rules
(Soft) Deduction Rules: People may live in more than one place
livesIn(x,y) ← marriedTo(x,z) ∧ livesIn(z,y) [0.8]
livesIn(x,y) ← hasChild(x,z) ∧ livesIn(z,y) [0.5]
Deductive database: Datalog, core of SQL & relational algebra, RDF/S, OWL2-RL, etc.
(Hard) Consistency Constraints: People are not born in different places/on different dates
bornIn(x,y) ∧ bornIn(x,z) → y=z
bornOn(x,y) ∧ bornOn(x,z) → y=z
People are not married to more than one person (at the same time, in most countries?)
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z → disjoint(t1,t2)
More general FOL constraints: Datalog with constraints, X-Tuples in Prob.-DBs, owl:FunctionalProperty, etc.

129 URDF: Running Example
Rules:
hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z
KB: Base Facts [Graph over Jeff, Surajit, David, Stanford, Princeton, University and Computer Scientist, with edges type [1.0], worksAt [0.9], graduatedFrom [0.6], graduatedFrom [0.7], graduatedFrom [0.9], hasAdvisor [0.8], hasAdvisor [0.7]]
Derived Facts: gradFr(Surajit, Stanford), gradFr(David, Stanford), shown as graduatedFrom [?]

130 Basic Types of Inference MAP Inference –Find the most likely assignment to query variables y under a given evidence x. –Compute: arg max y P( y | x) (NP-hard for MaxSAT) Marginal/Success Probabilities –Probability that query y is true in a random world under a given evidence x. –Compute: y P( y | x ) (#P-hard already for conjunctive queries)

131 General Route: Grounding & MaxSAT Solving
Query: graduatedFrom(x, y)
1) Grounding – Consider only facts (and rules) which are relevant for answering the query
2) Propositional formula in CNF, consisting of – Grounded hard & soft rules – Weighted base facts
CNF:
(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))
∧ (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))
∧ (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))
∧ worksAt(Jeff, Stanford) ∧ hasAdvisor(Surajit, Jeff) ∧ hasAdvisor(David, Jeff)
∧ graduatedFrom(Surajit, Princeton) ∧ graduatedFrom(Surajit, Stanford) ∧ graduatedFrom(David, Princeton)
3) Propositional Reasoning – Find truth assignment to facts such that the total weight of the satisfied clauses is maximized
MAP inference: compute most likely possible world
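For a grounded formula this small, the weighted MaxSAT objective can be solved by exhaustive search. The sketch below is a brute-force illustration, not the URDF algorithm. It assumes the fact weights from the base facts of the running example and weight 0.4 for the grounded soft rule, treats the two mutual-exclusion clauses as hard constraints, and scales all weights by 10 to keep the arithmetic exact.

```python
from itertools import product

# Ground facts (propositional variables) of the running example.
W, HS, HD = "worksAt(Jeff,Stanford)", "hasAdvisor(Surajit,Jeff)", "hasAdvisor(David,Jeff)"
SP, SS = "graduatedFrom(Surajit,Princeton)", "graduatedFrom(Surajit,Stanford)"
DP, DS = "graduatedFrom(David,Princeton)", "graduatedFrom(David,Stanford)"
FACTS = [W, HS, HD, SP, SS, DP, DS]

# A clause is a list of (fact, wanted_value) literals and holds if any
# literal matches. Soft clauses carry a weight (confidence x 10).
SOFT = [
    (9, [(W, True)]), (8, [(HS, True)]), (7, [(HD, True)]),
    (7, [(SP, True)]), (6, [(SS, True)]), (9, [(DP, True)]),
    (4, [(HS, False), (W, False), (SS, True)]),  # grounded soft rule, Surajit
    (4, [(HD, False), (W, False), (DS, True)]),  # grounded soft rule, David
]
# Hard clauses: nobody graduated from both Stanford and Princeton.
HARD = [
    [(SS, False), (SP, False)],
    [(DS, False), (DP, False)],
]


def holds(clause, world):
    return any(world[fact] == wanted for fact, wanted in clause)


def best_world():
    """Return (weight, world) maximizing the satisfied soft weight among
    all truth assignments that satisfy every hard clause."""
    best = None
    for values in product([False, True], repeat=len(FACTS)):
        world = dict(zip(FACTS, values))
        if not all(holds(c, world) for c in HARD):
            continue
        weight = sum(w for w, c in SOFT if holds(c, world))
        if best is None or weight > best[0]:
            best = (weight, world)
    return best


weight, world = best_world()
print(weight / 10, sorted(f for f, v in world.items() if v))
```

Exhaustive search is exponential in the number of ground facts, which is why approximation algorithms are used on real knowledge bases.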

132 URDF: MaxSAT Solving with Soft & Hard Rules [Theobald,Sozio,Suchanek,Nakashole: VLDS12]
Find: $\arg\max_y P(y \mid x)$. Resolves to a variant of MaxSAT for propositional formulas.
Special case: Horn clauses as soft rules & mutex-constraints as hard rules
S: Mutex-const.
{ graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
{ graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }
C: Weighted Horn clauses (CNF)
(¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))
∧ worksAt(Jeff, Stanford) ∧ hasAdvisor(Surajit, Jeff) ∧ hasAdvisor(David, Jeff)
∧ graduatedFrom(Surajit, Princeton) ∧ graduatedFrom(Surajit, Stanford) ∧ graduatedFrom(David, Princeton)
MaxSAT Alg.
Compute $W_0 = \sum_{\text{clauses } C} w(C)\, P(C \text{ is satisfied})$;
For each hard constraint $S_t$ {
  For each fact f in $S_t$ { Compute $W^{f+}_t = \sum_{\text{clauses } C} w(C)\, P(C \text{ is sat.} \mid f = \text{true})$; }
  Compute $W^{S-}_t = \sum_{\text{clauses } C} w(C)\, P(C \text{ is sat.} \mid S_t = \text{false})$;
  Choose truth assignment to f in $S_t$ that maximizes $W^{f+}_t$, $W^{S-}_t$;
  Remove satisfied clauses C; t++;
}
Runtime: O(|S||C|). Approximation guarantee of 1/2.

133 Deductive Grounding with Lineage (SLD Resolution/Datalog)
Query: graduatedFrom(Surajit, y)
Rules:
hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z
Base Facts:
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]
Answers: Q1 = graduatedFrom(Surajit, Princeton), Q2 = graduatedFrom(Surajit, Stanford)
Lineage [tree of ∨ and ∧ nodes] over A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]: ¬A ∧ (B ∨ (C ∧ D))

134 Lineage & Possible Worlds [Das Sarma,Theobald,Widom: ICDE08; Dylla,Miliaraki,Theobald: CIKM11]
1) Deductive Grounding –Dependency graph of the query –Trace lineage of individual query answers
2) Lineage DAG (not in CNF), consisting of –Grounded hard & soft rules –Weighted base facts. Plus: entire derivation history!
3) Probabilistic Inference: compute marginals. P(Q): aggregate probabilities of all possible worlds where the lineage of the query evaluates to true. P(Q|H): drop impossible worlds.
Query: graduatedFrom(Surajit, y)
Lineage over A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]:
C ∧ D: 0.8 × 0.9 = 0.72
B ∨ (C ∧ D): 1 − (1 − 0.72) × (1 − 0.6) = 0.888
Q1 = graduatedFrom(Surajit, Princeton): 0.7 × (1 − 0.888) = 0.078
Q2 = graduatedFrom(Surajit, Stanford): (1 − 0.7) × 0.888 = 0.2664

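135 Possible Worlds Semantics
A: 0.7, B: 0.6, C: 0.8, D: 0.9
Q2: ¬A ∧ (B ∨ (C ∧ D)). Hard rule H: ¬(A ∧ (B ∨ (C ∧ D)))
[Truth table over the 16 possible worlds W of A, B, C, D with P(W), e.g. 0.7×0.6×0.8×0.9 = 0.3024, 0.7×0.6×0.8×0.1 = 0.0336, …, 0.3×0.6×0.8×0.9 = 0.1296, 0.3×0.6×0.8×0.1 = 0.0144, 0.3×0.6×0.2×0.9 = 0.0324, 0.3×0.6×0.2×0.1 = 0.0036, 0.3×0.4×0.8×0.9 = 0.0864, …]
P(Q2) = 0.2664, P(Q2|H) = 0.2664 / 0.3784 = 0.7040
P(Q1) = 0.0784, P(Q1|H) = 0.0784 / 0.3784 = 0.2072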
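The marginal and conditional probabilities on this slide can be recomputed by enumerating the 16 possible worlds. The connectives were lost in the slide text, so the sketch assumes the reading Q1 = A ∧ ¬(B ∨ (C ∧ D)), Q2 = ¬A ∧ (B ∨ (C ∧ D)) and H = ¬(A ∧ (B ∨ (C ∧ D))), which matches the value P(Q1) = 0.0784 given on the slide; the function names are illustrative.

```python
from itertools import product

# Base-fact probabilities as labeled on the slide.
PR = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}


def stanford_derivable(w):
    """B v (C ^ D): Stanford graduation is a base fact or follows by rule."""
    return w["B"] or (w["C"] and w["D"])


def q1(w):  # answer graduatedFrom(Surajit, Princeton); assumed reading
    return w["A"] and not stanford_derivable(w)


def q2(w):  # answer graduatedFrom(Surajit, Stanford); assumed reading
    return (not w["A"]) and stanford_derivable(w)


def hard_rule(w):  # mutual exclusion of the two graduation facts
    return not (w["A"] and stanford_derivable(w))


def prob(event, given=lambda w: True):
    """P(event | given), summing world weights over all 16 worlds."""
    numerator = denominator = 0.0
    for bits in product([False, True], repeat=len(PR)):
        world = dict(zip(PR, bits))
        weight = 1.0
        for name, value in world.items():
            weight *= PR[name] if value else 1.0 - PR[name]
        if given(world):
            denominator += weight
            if event(world):
                numerator += weight
    return numerator / denominator


print(round(prob(q1), 4), round(prob(q2), 4), round(prob(hard_rule), 4))
print(round(prob(q1, hard_rule), 4), round(prob(q2, hard_rule), 4))
```

Conditioning on H simply drops the worlds that violate the hard rule and renormalizes the remaining probability mass.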

136 More Probabilistic Approaches Propositional –Stochastic MaxSat solvers: MaxWalkSat (MAP-Inference) –URDF: constrained weighted MaxSat solver for soft & hard rules Lineage & Possible Worlds (tuple-independent database) –Exact probabilistic inference: junction trees, variable elimination –Approximate inference: decision diagrams/Shannon expansions, sampling Combining First-Order Logic & Probabilistic Graphical Models –Markov Logic Networks* [Richardson & Domingos: Machine Learning 2006] –Factor Graphs [FactorIE, McCallum et al.: NIPS 2008] –Variety of MCMC sampling techniques for probabilistic inference (e.g., Gibbs sampling, MC-SAT, etc.) *Alchemy – Open-Source AI:

137 Experiments URDF: SLD grounding & MaxSat solving |C| - # literals in soft rules |S| - # literals in hard rules URDF MaxSat vs. Markov Logic (MAP inference & MC-SAT) YAGO Knowledge Base: 2 million entities, 20 million facts Basic query answering: SLD grounding & MaxSat solving of 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …) Asymptotic runtime checks: runtime comparisons for synthetic rule expansions

138 UViz: URDF Visualization Frontend [Meiser, Dylla, Theobald: CIKM11 Demo] System components: –Flash Player client –Tomcat server (JRE) –Relational backend (JDBC) –Remote Method Invocation & Object Serialization (BlazeDS)

139 UViz: URDF Visualization Frontend Demo! [Meiser, Dylla, Theobald: CIKM11 Demo]

140 Recommended Readings
PART I
SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008
SPARQL 1.1 Query Language, W3C Recommendation, 21 March 2013
SPARQL 1.1 Federated Query, W3C Recommendation, 21 March 2013
Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW, 2007
Krisztian Balog, Edgar Meij, Maarten de Rijke: Entity Search: Building Bridges between Two Worlds. WWW, 2010
Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: searching entities directly and holistically. VLDB, 2007
Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-model-based ranking for queries on RDF-graphs. CIKM, 2009
Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-based keyword search in databases. ACM Transactions on Database Systems 33(1), 2008
Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
Thomas Neumann, Gerhard Weikum: Scalable join processing on very large RDF graphs. SIGMOD Conference, 2009
Thomas Neumann, Gerhard Weikum: The RDF-3X engine for scalable management of RDF data. VLDB Journal 19(1), 2010
François Picalausa, Yongming Luo, George H. L. Fletcher, Jan Hidders, Stijn Vansummeren: A Structural Approach to Indexing Triples. ESWC, 2012
Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active knowledge: dynamically enriching RDF knowledge bases by Web Services. SIGMOD Conference, 2010
Cheng Xiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao: gStore: Answering SPARQL Queries via Subgraph Matching. PVLDB 4(8), 2011
PART II
Min Cai, Martin R. Frank: RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network. WWW, 2004
Gong Cheng, Weiyi Ge, Yuzhong Qu: Falcons: searching and browsing entities on the semantic web. WWW, 2008
Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal C Doshi, Joel Sachs: Swoogle: A Search and Metadata Engine for the Semantic Web. CIKM, 2004
Luis Galárraga, Katja Hose, Ralf Schenkel: Partout: A Distributed Engine for Efficient RDF Processing. To appear in PVLDB, 2013
Steve Harris, Nick Lamb, Nigel Shadbolt: 4store: The Design and Implementation of a Clustered RDF Store. SSWS, 2009
Jiewen Huang, Daniel J. Abadi, Kun Ren: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 2011
Vaibhav Khadilkar, Murat Kantarcioglu, Bhavani Thuraisingham, Paolo Castagna: Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. ISWC, 2012
Bastian Quilitz, Ulf Leser: Querying Distributed RDF Data Sources with SPARQL. ISWC, 2008
Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt: FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. ESWC, 2011
Bin Shao, Haixun Wang, Yatao Li: Trinity: A Distributed Graph Engine on a Memory Cloud. To appear in SIGMOD, 2013
Xiaofei Zhang, Lei Chen, Yongxin Tong, Min Wang: EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud. ICDE, 2013
PART III
Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with uncertainty and lineage. VLDB J. 17(2), 2008
Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, Christopher Ré, Dan Suciu: MYSTIQ: a system for finding more answers by using probabilities. SIGMOD Conference, 2005
Nilesh N. Dalvi, Dan Suciu: Efficient Query Evaluation on Probabilistic Databases. VLDB, 2004
Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE, 2013
Norbert Fuhr, Thomas Rölleke: A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Trans. Inf. Syst. 15(1), 1997
Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive reasoning in uncertain RDF knowledge bases. CIKM, 2011
Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS, 2012
Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases (Synthesis Lectures on Data Management). Morgan & Claypool Publishers, 2012

