Scalable RDF Data Management & SPARQL Query Processing

1 Scalable RDF Data Management & SPARQL Query Processing
Martin Theobald (University of Antwerp, Belgium), Katja Hose (Aalborg University, Denmark), Ralf Schenkel (University of Passau, Germany)

2 Outline of this Tutorial
Part I RDF in Centralized Relational Databases Part II RDF in Distributed Settings Part III Managing Uncertain RDF Data

3 Outline for Part I
Part I.1: Foundations (introduction to RDF and Linked Open Data; a short overview of SPARQL)
Part I.2: Rowstore Solutions
Part I.3: Columnstore Solutions
Part I.4: Other Solutions and Outlook

4 Information Extraction
YAGO/DBpedia et al. bornOn(Jeff, 09/22/42) gradFrom(Jeff, Columbia) hasAdvisor(Jeff, Arthur) hasAdvisor(Surajit, Jeff) knownFor(Jeff, Theory) >120 M facts for YAGO2 (mostly from Wikipedia infoboxes & categories)

5 YAGO2 Knowledge Base 3 M entities, 120 M facts
100 relations, 200k classes; entity accuracy ≈ 95%
[Excerpt of the YAGO2 knowledge graph: a subclass hierarchy (Organization, Person, Location, Country, Scientist, Politician, City, ...), instanceOf edges for entities such as Max_Planck, Erwin_Planck, Angela_Merkel, Kiel, Schleswig-Holstein, and Germany, relations like bornIn, bornOn, diedOn, locatedIn, citizenOf, hasWon, and fatherOf, and "means" edges linking surface names such as "Max Karl Ernst Ludwig Planck" and "Angela Dorothea Merkel" to their entities.]

6 Why care about scalability?
Sources: linkeddata.org wikipedia.org Rapid growth of available semantic data

7 Why care about scalability?
Rapid growth of available semantic data:
More than 30 billion triples in more than 200 sources across the LOD cloud
DBpedia: 3.4 million entities, 1 billion triples
As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase
Sources: linkeddata.org, wikipedia.org

8 … and still growing
Billion Triple Challenge 2008: 1B triples
War stories: BigOWLIM: 12B triples in Jun 2009; Garlik 4Store: 15B triples in Oct 2009; OpenLink Virtuoso: 15.4B+ triples; AllegroGraph: 1+ trillion triples

9 Queries can be complex, too
SELECT DISTINCT ?a ?b ?lat ?long WHERE { ?a dbpedia:spouse ?b. ?a dbpedia:wikilink dbpediares:actor. ?b dbpedia:wikilink dbpediares:actor. ?a dbpedia:placeOfBirth ?c. ?b dbpedia:placeOfBirth ?c. ?c owl:sameAs ?c2. ?c2 pos:lat ?lat. ?c2 pos:long ?long. } Q7 on BTC2008 in [Neumann & Weikum, 2009]

10 What effects does the financial crisis have on migration rates in the US?

11 Is there a significant increase of serious weather conditions in Europe over the past 20 years?

12 Which glutamic-acid proteases are inhibitors of HIV?

13 Question Answering (QA) Systems
KB from Wikipedia and user edits 600 million facts, 25 million entities KB of curated, structured data 10 trillion (!) facts, 50k algorithms

14 IBM Watson: Deep Question Answering
Example Jeopardy! clues: "William Wilkinson's 'An Account of the Principalities of Wallachia and Moldavia' inspired this author's most famous novel"; "This town is known as 'Sin City' & its downtown is 'Glitter Gulch'"; "As of 2010, this is the only former Yugoslav republic in the EU"; "99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain"
[DeepQA pipeline: question classification & decomposition over knowledge back-ends such as YAGO.]
D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.

15 SPARQL 1.0 / 1.1
Query language for RDF, recommended by the W3C.
Three ways to interpret RDF data: instances of logical predicates ("facts"); graphs (subjects/objects as nodes, predicates as directed, labeled edges); relations (either multiple binary relations or a single, large ternary relation).
SPARQL's main building block: select-project-join combinations of relational triple patterns, equivalent to subgraph pattern-matching queries over a potentially very large RDF graph.

16 SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
[Example knowledge graph with isA, bornIn, and locatedIn edges: Jim_Carrey and Mike_Myers are actors born in Newmarket and Scarborough (Ontario, Canada); Albert_Einstein is a physicist and vegetarian born in Ulm and Otto_Hahn a chemist born in Frankfurt (Germany, Europe).]

17 SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
SELECT ?person WHERE { ?person isA actor . ?person bornIn ?loc . ?loc locatedIn Ontario . }
Find subgraphs of this form (variables: ?person, ?loc; constants: actor, Ontario): ?person isA actor; ?person bornIn ?loc; ?loc locatedIn Ontario
[Matching subgraphs in the example graph: Jim_Carrey bornIn Newmarket locatedIn Ontario and Mike_Myers bornIn Scarborough locatedIn Ontario; Albert_Einstein and Otto_Hahn do not match.]
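To make the matching semantics concrete, here is a minimal Python sketch (not part of the tutorial) that evaluates this basic graph pattern by brute force over a hand-coded excerpt of the example graph; the triple list and the match() helper are illustrative assumptions.

# Minimal sketch: evaluating a basic graph pattern over an in-memory triple list.
# Terms starting with '?' are variables.
triples = [
    ("Jim_Carrey", "isA", "actor"), ("Mike_Myers", "isA", "actor"),
    ("Jim_Carrey", "bornIn", "Newmarket"), ("Mike_Myers", "bornIn", "Scarborough"),
    ("Newmarket", "locatedIn", "Ontario"), ("Scarborough", "locatedIn", "Ontario"),
    ("Albert_Einstein", "isA", "physicist"), ("Albert_Einstein", "bornIn", "Ulm"),
    ("Ulm", "locatedIn", "Germany"),
]

pattern = [("?person", "isA", "actor"),
           ("?person", "bornIn", "?loc"),
           ("?loc", "locatedIn", "Ontario")]

def match(pattern, triples, binding=None):
    # Enumerate all variable bindings for which every triple pattern matches.
    binding = binding or {}
    if not pattern:
        yield binding
        return
    s, p, o = pattern[0]
    for ts, tp, to in triples:
        new = dict(binding)
        ok = True
        for term, value in ((s, ts), (p, tp), (o, to)):
            if term.startswith("?"):
                if new.setdefault(term, value) != value:
                    ok = False
            elif term != value:
                ok = False
        if ok:
            yield from match(pattern[1:], triples, new)

print(sorted({b["?person"] for b in match(pattern, triples)}))
# ['Jim_Carrey', 'Mike_Myers']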

18 SPARQL 1.0 – More Features
Eliminate duplicates in results:
SELECT DISTINCT ?c WHERE { ?person isA actor . ?person bornIn ?loc . ?loc locatedIn ?c }
Return results in some order, with an optional LIMIT n clause:
SELECT ?person WHERE { ?person isA actor . ?person bornIn ?loc . ?loc locatedIn Ontario } ORDER BY DESC(?person)
Optional matches and filters on bound variables:
SELECT ?person WHERE { ?person isA actor OPTIONAL { ?person bornIn ?loc } FILTER (!BOUND(?loc)) }
More operators: ASK, DESCRIBE, CONSTRUCT (see the W3C SPARQL recommendation)

19 SPARQL 1.1 Extensions of the W3C
W3C SPARQL 1.1: Aggregations (COUNT, AVG, …) and grouping Subqueries in the WHERE clause Safe negation: FILTER NOT EXISTS {?x …} Syntactic sugar for OPTIONAL {?x … } FILTER(!BOUND(?x)) Expressions in the SELECT clause: SELECT (?a+?b) AS ?sum Label constraints on paths: ?x foaf:knows/foaf:knows/foaf:name ?name More functions and operators …

20 RDF+SPARQL: Centralized Engines
BigOWLIM (now ontotext.com), OpenLink Virtuoso, OntoBroker (now semafora-systems.com), Apache Jena (different main-memory/relational backends), Sesame (now openRDF.org), SW-Store, Hexastore, 3store, RDF-3X (no reasoning)
System deployments with >10^11 triples

21 SPARQL: Extensions from Research (1)
More complex graph patterns: Transitive paths [Anyanwu et al., WWW’07] SELECT ?p, ?c WHERE { ?p isA scientist . ?p ??r ?c. ?c isA Country . ?c locatedIn Europe . PathFilter(cost(??r) < 5) . PathFilter(containsAny(??r,?t ). ?t isA City . } Regular expressions [Kasneci et al., ICDE’08] SELECT ?p, ?c WHERE { ?p isA ?s. ?s isA scientist. ?p (bornIn | livesIn | citizenOf) locatedIn* Europe.} Meanwhile mostly covered by the SPARQL 1.1 query proposal.

22 SPARQL: Extensions from Research (2)
Queries over federated RDF sources: Determine distribution of triple patterns as part of query (for example in Jena ARQ) Automatically route triple predicates to useful sources

23 SPARQL: Extensions from Research (2)
Queries over federated RDF sources: Determine distribution of triple patterns as part of query (for example in Jena ARQ) Automatically route triple predicates to useful sources Potentially requires mapping of identifiers from different sources SPARQL 1.1 explicitly supports federation of sources

24 Ranking is Essential! Queries often have a huge number of results:
“scientists from Canada” “publications in databases” “actors from the U.S.” Queries may have no matches at all: “Laboratoire d'informatique de Paris 6” “most beautiful railway stations” Ranking is an integral part of search Huge number of app-specific ranking methods: paper/citation count, impact, salary, … Need for generic ranking of 1) entities and 2) facts

25 Extending Entities with Keywords
Remember: entities occur in facts & in documents. Associate entities with terms in those documents, keywords in URIs, literals, … (the context of the entity).
[Example: the entities Guido Westerwelle and Nicolas Sarkozy with context terms such as chancellor, Germany, scientist, election, Stuttgart21, France.]

26 Extensions: Keywords Problem: not everything is triplified!
Consider witnesses/sources (provenance meta-facts) Allow text predicates with each triple pattern (à la XQ-FT) Semantics: triples match struct. pred. witnesses match text pred. European composers who have won the Oscar, whose music appeared in dramatic western scenes, and who also wrote classical pieces ? Select ?p Where { ?p instanceOf Composer . ?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe . ?p hasWon ?a .?a Name AcademyAward . ?p contributedTo ?movie [western, gunfight, duel, sunset] . ?p composed ?music [classical, orchestra, cantata, opera] . }

27 Extensions: Keywords Problem: not everything is triplified!
Consider witnesses/sources (provenance meta-facts) Allow text predicates with each triple pattern (à la XQ-FT) Proximity of keywords or phrases boosts expressiveness French politicians married to Italian singers? Select ?p1, ?p2 Where { ?p1 instanceOf ?c1 [France, politics] . ?p2 instanceOf ?c2 [Italy, singer] . ?p1 marriedTo ?p2 . } CS researchers whose advisors worked on the Manhattan project? Select ?r, ?a Where { ?r instOf researcher [“computer science“] . ?a workedOn ?x [“Manhattan project“] . ?r hasAdvisor ?a . } Select ?r, ?a Where { ?r ?p1 ?o1 [“computer science“] . ?a ?p2 ?o2 [“Manhattan project“] . ?r ?p3 ?a . }

28 Extensions: Keywords CLEF/INEX 2012-13 Linked Data Track
Problem: not everything is triplified! CLEF/INEX Linked Data Track <topic id=" " category="Politics"> <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue> <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title> <sparql_ft> SELECT ?s ?s1 WHERE { ?s rdf:type < . ?s1 < ?s . FILTER FTContains (?s, "stepped down early") . } </sparql_ft> </topic>

29 Extensions: Keywords / Multiple Languages
Problem: not everything is triplified! Multilingual Question Answering over Linked Data (QALD-3), CLEF <question id="4" answertype="resource" aggregation="false" onlydbo="true"> <string lang="en">Which river does the Brooklyn Bridge cross?</string> <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string> <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string> <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string> <string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string> <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string> <keywords lang="en">river, cross, Brooklyn Bridge</keywords> <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords> <keywords lang="es">río, cruza, Brooklyn Bridge</keywords> <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords> <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords> <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords> <query> PREFIX dbo: < PREFIX res: < SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri . } </query> </question>

30 What Makes a Fact “Good”?
Confidence: prefer results that are likely correct (accuracy of information extraction, trust in sources: authenticity, authority). Example: bornIn(Jim Gray, San Francisco) from "Jim Gray was born in San Francisco" (en.wikipedia.org) vs. livesIn(Michael Jackson, Tibet) from "Fans believe Jacko hides in Tibet".
Informativeness: prefer results with salient facts; statistical estimation from frequency in answer, frequency on the Web, frequency in the query log. Example: for q: Einstein isa ?, prefer Einstein isa scientist over Einstein isa vegetarian; for q: ?x isa vegetarian, an answer like Whocares isa vegetarian is uninformative.
Diversity: prefer a variety of facts (E won …, E discovered …, E played …) over repetitions (E won …, E won …, E won …).
Conciseness: prefer results that are tightly connected (size of the answer graph, weight of the Steiner tree). Example: Einstein won NobelPrize, Bohr won NobelPrize vs. loosely related facts such as Einstein isa vegetarian, Cruise isa vegetarian.

31 How Can We Implement This?
Confidence (prefer results that are likely correct: accuracy of info extraction, trust in sources): empirical accuracy of information extraction and a PageRank-style estimate of trust, combined into max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }.
Informativeness (prefer results with salient facts: frequency in answer, on the Web, in the query log): PageRank-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …]; IR models such as tf*idf [K. Chang et al., …]; statistical language models [de Rijke et al.].
Diversity (prefer a variety of facts): statistical language models [Zhai et al., Elbassuoni et al.].
Conciseness (prefer results that are tightly connected: size of answer graph, weight of Steiner tree): graph algorithms (BANKS, STAR, …) [S. Chakrabarti et al., G. Kasneci et al., …].

32 Outline for Part I Part I.1: Foundations Part I.2: Rowstore Solutions
Introduction to RDF A short overview of SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

33 RDF in Rowstores Strings often mapped to unique integer IDs
Rowstore: general relational database, storing relations (incl. facts) as complete rows (MySQL, PostgreSQL, Oracle, DB2, SQLServer, …) General principles: store triples in one giant three-attribute table (subject, predicate, object) convert SPARQL to equivalent SQL The database will do the rest Strings often mapped to unique integer IDs Used by many TripleStores, including 3Store, Jena, HexaStore, RDF-3X, … Simple extension to quadruples (with graphid): (graph,subject,predicate,object) We consider only triples for simplicity!

34 Example: Single Triple Table
ex:Katja ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:TU_Ilmenau.
ex:Martin ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:Saarland_University.
ex:Ralf ex:teaches ex:Information_Retrieval; ex:PhD_from ex:Saarland_University; ex:works_for ex:Saarland_University, ex:MPI_Informatics.

subject | predicate | object
ex:Katja | ex:teaches | ex:Databases
ex:Katja | ex:works_for | ex:MPI_Informatics
ex:Katja | ex:PhD_from | ex:TU_Ilmenau
ex:Martin | ex:teaches | ex:Databases
ex:Martin | ex:works_for | ex:MPI_Informatics
ex:Martin | ex:PhD_from | ex:Saarland_University
ex:Ralf | ex:teaches | ex:Information_Retrieval
ex:Ralf | ex:PhD_from | ex:Saarland_University
ex:Ralf | ex:works_for | ex:Saarland_University
ex:Ralf | ex:works_for | ex:MPI_Informatics

35 Conversion of SPARQL to SQL
General approach to translate SPARQL into SQL:
(1) Each triple pattern is translated into a (self-) join over the triple table
(2) Shared variables create JOIN conditions
(3) Constants create WHERE conditions
(4) FILTER conditions create WHERE conditions
(5) OPTIONAL clauses create OUTER JOINs
(6) UNION clauses create UNION expressions
A minimal generator for rules (1)-(3) is sketched below.
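As a concrete illustration of rules (1)-(3), the following Python sketch (an assumption, not the tutorial's code) generates the SQL self-join for a basic graph pattern over a table Triples(subject, predicate, object); bgp_to_sql() is a hypothetical helper.

# Sketch: translate a basic graph pattern into one SQL self-join over Triples.
def bgp_to_sql(patterns, projection):
    aliases = [f"P{i+1}" for i in range(len(patterns))]
    where, seen, select = [], {}, []      # seen: variable -> first column it was bound to
    for alias, (s, p, o) in zip(aliases, patterns):
        for col, term in (("subject", s), ("predicate", p), ("object", o)):
            if term.startswith("?"):                 # shared variables -> join conditions
                if term in seen:
                    where.append(f"{alias}.{col} = {seen[term]}")
                else:
                    seen[term] = f"{alias}.{col}"
            else:                                    # constants -> WHERE conditions
                where.append(f"{alias}.{col} = '{term}'")
    for var in projection:
        select.append(f"{seen[var]} AS {var.lstrip('?')}")
    return (f"SELECT {', '.join(select)}\n"
            f"FROM {', '.join(f'Triples {a}' for a in aliases)}\n"
            f"WHERE {' AND '.join(where)}")

print(bgp_to_sql([("?a", "works_for", "?u"),
                  ("?b", "works_for", "?u"),
                  ("?a", "phd_from",  "?u")], ["?a", "?b"]))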

36 Example: Conversion to SQL Query
SELECT ?a ?b ?t WHERE { ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u . OPTIONAL { ?a teaches ?t } FILTER (regex(?u, "Saar")) }
Query plan: triple patterns P1-P3 are joined on ?a and ?u, the filter regex(?u, "Saar") is applied, pattern P4 (teaches) is attached via an outer join, and the projection returns ?a, ?b, ?t.
SELECT R1.A, R1.B, R2.T
FROM ( SELECT P1.subject AS A, P2.subject AS B
       FROM Triples P1, Triples P2, Triples P3
       WHERE P1.predicate='works_for' AND P2.predicate='works_for' AND P3.predicate='phd_from'
         AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object
         AND REGEXP_LIKE(P1.object, 'Saar') ) R1
LEFT OUTER JOIN
     ( SELECT P4.subject AS A, P4.object AS T
       FROM Triples P4
       WHERE P4.predicate='teaches' ) R2
ON (R1.A = R2.A)

37 Is that all? Well, no. Which indexes should be built? (to support efficient evaluation of triple patterns) How can we reduce storage space? How can we find the best execution plan? Existing databases need modifications: flexible, extensible, generic storage not needed here cannot deal with multiple self-joins of a single table often generate bad execution plans

38 Dictionary for Strings
Map all strings to unique integers (e.g., via hashing)
Fixed size (4-8 bytes), much easier to handle
Dictionary usually small, can be kept in main memory
Caveats: this may break the original lexicographic sort order, so range conditions become difficult and FILTER conditions may be more expensive (they require dictionary lookups)
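A minimal sketch of such a dictionary in Python, assuming counter-based IDs rather than hashing; the Dictionary class is illustrative and not RDF-3X's actual data structure.

class Dictionary:
    def __init__(self):
        self.str2id, self.id2str = {}, []

    def encode(self, s):
        # Return the ID of s, assigning the next free ID on first sight.
        if s not in self.str2id:
            self.str2id[s] = len(self.id2str)
            self.id2str.append(s)
        return self.str2id[s]

    def decode(self, i):
        return self.id2str[i]

d = Dictionary()
t = ("Albert_Einstein", "invented", "Relativity_Theory")
encoded = tuple(d.encode(x) for x in t)                  # e.g. (0, 1, 2)
print(encoded, tuple(d.decode(i) for i in encoded))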

39 Indexes for Commonly Used Triple Patterns
Patterns with a single variable are frequent. Example: Albert_Einstein invented ?x
Build a clustered B+-tree index over (s,p,o); it can also be used for patterns like Albert_Einstein ?p ?x
Build similar clustered indexes for all six permutations (3 x 2 x 1 = 6): SPO, POS, OSP cover all possible triple patterns; SOP, OPS, PSO provide all sort orders for patterns with two variables
Example lookup in the SPO index (all triples in (s,p,o) order): look up the IDs for the constants (Albert_Einstein → 16, invented → 24), look up the known prefix (16,24,0) in the index, then read results while the prefix matches: (16,24,567), (16,24,876) come already sorted!
The triple table is no longer needed: all triples are contained in each index
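The prefix lookup can be sketched in Python as a binary search over one sorted permutation; this is an illustrative simplification (an in-memory sorted list rather than a B+ tree).

import bisect

spo = sorted([(16, 19, 5356), (16, 24, 567), (16, 24, 876),
              (27, 19, 643), (27, 48, 10486), (50, 10, 10456)])

def prefix_scan(index, *prefix):
    # Yield all triples whose first components equal the given prefix.
    lo = bisect.bisect_left(index, prefix + (-1,) * (3 - len(prefix))) 
    for triple in index[lo:]:
        if triple[:len(prefix)] != prefix:
            break              # sorted order: once the prefix changes, we are done
        yield triple

print(list(prefix_scan(spo, 16, 24)))   # [(16, 24, 567), (16, 24, 876)]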

40 Why Sort Order Matters for Joins
When both inputs are sorted by the join attribute, use a Merge Join: sequentially scan both inputs, immediately join matching triples, skip over parts without matches; allows pipelining.
When inputs are unsorted or sorted by the wrong attribute, use a Hash Join: build a hash table from one input, then scan the other input and probe the hash table; needs to touch every input triple and breaks pipelining.
In general, Merge Joins are preferable: small memory footprint, pipelining. (A sketch of the merge-join idea follows below.)
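A compact Python sketch of the merge-join idea on two subject-sorted triple lists; the helper and the example inputs are assumptions for illustration.

def merge_join(left, right, key=lambda t: t[0]):
    # Join two key-sorted lists, yielding all pairs with equal join key.
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1                        # skip over parts without matches
        elif kl > kr:
            j += 1
        else:                             # equal keys: emit the cross product of the groups
            i2 = i
            while i2 < len(left) and key(left[i2]) == kl:
                j2 = j
                while j2 < len(right) and key(right[j2]) == kl:
                    yield left[i2], right[j2]
                    j2 += 1
                i2 += 1
            i, j = i2, j2

R = [(16, 19, 5356), (16, 24, 567), (27, 19, 643), (50, 10, 10456)]
S = [(16, 33, 46578), (16, 56, 1345), (27, 18, 133), (47, 37, 20495)]
print(list(merge_join(R, S)))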

41 RDF-3x: Even More Indexes!
SPARQL 1.0 considers duplicates (unless removed with DISTINCT) but does not (yet) support aggregates/counting often queries with many duplicates like SELECT ?x WHERE ?x ?y Germany. to retrieve entities related to Germany (but counts may be important in the application!) this materializes many identical intermediate results Solution: even more redundancy! Pre-compute aggregated indexes SP,SO,PO,PS,OP,OS,S,P,O Example: SO contains, for each pair (s,o), the number of triples with subject s and object o Do not materialize identical bindings, but keep counts Example: ?x=Albert_Einstein:4; ?x=Angela_Merkel:10 15 indexes overall (all SPO permutations + their unique subsets)

42 RDF-3X: Compression Scheme for Triples
Compress sequences of triples in lexicographic order (v1, v2, v3); for SPO: v1=S, v2=P, v3=O.
Step 1: compute per-attribute deltas, e.g. (16,19,5356), (16,24,567), (16,24,676), (27,19,643), (27,48,10486), (50,10,10456) becomes (16,19,5356), (0,5,-4789), (0,0,109), (11,-5,-33), (0,29,9843), (23,-38,-30).
Step 2: variable-byte encoding for each delta triple, using 1-13 bytes: a header byte with a gap bit and 7 header bits, followed by 0-4 bytes for each of the three deltas. When gap=1, the delta of value 3 is included in the header and all other deltas are 0; otherwise the header encodes the length of each of the three deltas (5*5*5=125 combinations stored in 7 bits).
Many variants exist; this one is designed for triples. (A simplified sketch follows below.)
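The two steps can be sketched in Python as follows; this is a simplification that uses plain per-delta variable-byte encoding with zig-zag for negative deltas instead of RDF-3X's 7-bit gap header.

def deltas(triples):
    # Step 1: per-attribute deltas against the previous triple in sort order.
    prev = (0, 0, 0)
    for t in triples:
        yield tuple(c - p for c, p in zip(t, prev))
        prev = t

def varbyte(n):
    # Step 2: variable-byte encoding, 7 data bits per byte, high bit = "more bytes follow".
    n = n * 2 if n >= 0 else -n * 2 - 1       # zig-zag so negative deltas stay small
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if n == 0:
            return bytes(out)

block = [(16, 19, 5356), (16, 24, 567), (16, 24, 676), (27, 19, 643)]
encoded = b"".join(varbyte(d) for t in deltas(block) for d in t)
print(list(deltas(block)), len(encoded), "bytes")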

43 Compression Effectiveness vs. Efficiency
Byte-level encoding almost as effective as bit-level encoding techniques (Gamma, Golomb, Rice, etc.) Much faster (10x) for decompressing Example for Barton dataset [Neumann & Weikum: VLDB’10]: Raw data 51 million triples, 7GB uncompressed (as N-Triples) All 6 main indexes: 1.1GB size, 3.2s decompression with byte-level encoding Optionally: additional compression with LZ77 2x more compact, but much slower to decompress Compression always on page level

44 Back to the Example Query
SELECT ?a ?b ?t WHERE { ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u . OPTIONAL { ?a teaches ?t } FILTER (regex(?u, "Saar")) }
[Two candidate physical plans over the permuted indexes: one joins POS(works_for,?u,?a), POS(works_for,?u,?b), and PSO(phd_from,?a,?u) using a merge join and a hash join; the other starts from POS(works_for,?u,?a) and POS(phd_from,?u,?a) using merge joins; both apply the regex filter and attach POS(teaches,?a,?t) before the projection, and are annotated with the estimated cardinalities of their intermediate results.]
Which of the two plans is better? How many intermediate results?
Core ingredients of a good query optimizer are selectivity estimators for triple patterns (index scans) and joins.

45 RDF-3x: Selectivity Estimation
How many results will a triple pattern have?
Standard databases use per-attribute histograms and assume independence of attributes: too simplistic and inexact for RDF.
RDF-3X instead uses the aggregated indexes for exact per-pattern counts, plus additional join statistics over triple blocks (pages).
For joins, assume independence between triple patterns; additionally, precompute exact statistics for frequent paths in the data. (A toy sketch of the independence assumption follows below.)
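A toy Python sketch of the independence assumption; the helpers and numbers are illustrative and do not reflect RDF-3X's actual statistics.

def pattern_cardinality(triples, s=None, p=None, o=None):
    # Exact count of triples matching a pattern where None marks a variable
    # (what an aggregated index would deliver directly).
    return sum((s is None or t[0] == s) and
               (p is None or t[1] == p) and
               (o is None or t[2] == o) for t in triples)

def estimate_join_size(card1, card2, distinct_join_values):
    # |P1 JOIN P2| estimated as |P1| * |P2| / #distinct values of the join attribute.
    return card1 * card2 / max(distinct_join_values, 1)

T = [(1, 10, 7), (1, 11, 8), (2, 10, 7), (3, 10, 9), (3, 12, 7)]
c1 = pattern_cardinality(T, p=10)                 # pattern: ?x 10 ?y
c2 = pattern_cardinality(T, p=11)                 # pattern: ?x 11 ?z
print(c1, c2, estimate_join_size(c1, c2, len({t[0] for t in T})))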

46 Solution: Differential Indexing
Handling Updates What should we do when our data changes? (SPARQL 1.1 has updates!) Assumptions: Queries far more frequent than updates Updates mostly insertions, hardly any deletions Different applications may update concurrently Solution: Differential Indexing

47 RDF-3x: Differential Updates
Staging architecture for updates in RDF-3X Workspace A: Triples inserted by application A completion of A Workspace B: Triples inserted by application B completion of B on-demand indexes at query time kept in main memory Deletions: Insert the same tuple again with “deleted” flag Modify scan/join operators: merge differential indexes with main index

48 Outline for Part I Part I.1: Foundations Part I.2: Rowstore Solutions
Introduction to RDF A short overview of SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

49 Principles Observations and assumptions:
Not too many different predicates Triple patterns usually have fixed predicate Need to access all triples with one predicate Design consequence: Use one two-attribute table for each predicate

50 Example: Columnstores
ex:Katja ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:TU_Ilmenau.
ex:Martin ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:Saarland_University.
ex:Ralf ex:teaches ex:Information_Retrieval; ex:PhD_from ex:Saarland_University; ex:works_for ex:Saarland_University, ex:MPI_Informatics.

works_for (subject | object): ex:Katja | ex:MPI_Informatics; ex:Martin | ex:MPI_Informatics; ex:Ralf | ex:Saarland_University; ex:Ralf | ex:MPI_Informatics
teaches (subject | object): ex:Katja | ex:Databases; ex:Martin | ex:Databases; ex:Ralf | ex:Information_Retrieval
PhD_from (subject | object): ex:Katja | ex:TU_Ilmenau; ex:Martin | ex:Saarland_University; ex:Ralf | ex:Saarland_University

51 Simplified Example: Query Conversion
SELECT ?a ?b ?t WHERE {?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. } SELECT W1.subject as A, W2.subject as B FROM works_for W1, works_for W2, phd_from P3 WHERE W1.object=W2.object AND W1.subject=P3.subject AND W1.object=P3.object So far, this is yet another relational representation of RDF. So, what is a columnstore?

52 Columnstores and RDF Columnstores store all columns of a table separately. PhD_from:subject ex:Katja ex:Martin ex:Ralf subject object ex:Katja ex:TU_Ilmenau ex:Martin ex:Saarland_University ex:Ralf ex:Saarland_University PhD_from PhD_from:object ex:TU_Ilmenau ex:Saarland_University Advantages: Fast if only subject or object are accessed, not both Allows for a very compact representation Problems: Need to recombine columns if subject and object are accessed Inefficient for triple patterns with predicate variable

53 Compression in Columnstores
General ideas: store each subject only once; use the same order of subjects for all columns, including NULL values when necessary; apply additional compression to get rid of the NULL values.
[Example: columns in a fixed subject order (ex:Katja, ex:Martin, ex:Ralf, …) for PhD_from, teaches, and works_for, with NULLs for missing values; compressed as PhD_from: bit[1110] plus the non-NULL values (ex:TU_Ilmenau, ex:Saarland_University, …) and teaches: range[1-3] plus (ex:Databases, ex:Databases, ex:Information_Retrieval).] (A sketch of the bit-vector idea follows below.)
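A small Python sketch of the bit-vector idea for suppressing NULLs, assuming a fixed global subject order; the CompressedColumn class is illustrative, not an actual columnstore format.

subjects = ["ex:Katja", "ex:Martin", "ex:Ralf", "ex:Axel"]

class CompressedColumn:
    def __init__(self, values_by_subject):
        # present[i] == 1 iff subjects[i] has a value for this predicate.
        self.present = [1 if s in values_by_subject else 0 for s in subjects]
        self.values = [values_by_subject[s] for s in subjects if s in values_by_subject]

    def get(self, subject):
        i = subjects.index(subject)
        if not self.present[i]:
            return None                              # NULL, never materialized
        return self.values[sum(self.present[:i])]    # rank of i among present rows

phd_from = CompressedColumn({"ex:Katja": "ex:TU_Ilmenau",
                             "ex:Martin": "ex:Saarland_University",
                             "ex:Ralf": "ex:Saarland_University"})
print(phd_from.present, phd_from.get("ex:Axel"), phd_from.get("ex:Ralf"))
# [1, 1, 1, 0] None ex:Saarland_University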

54 Outline for Part I Part I.1: Foundations Part I.2: Rowstore Solutions
Introduction to RDF A short overview of SPARQL Part I.2: Rowstore Solutions Part I.3: Columnstore Solutions Part I.4: Other Solutions and Outlook

55 Property Tables
Group entities with similar predicates into a relational table (for example using RDF types or a clustering algorithm).
ex:Katja ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:TU_Ilmenau.
ex:Martin ex:teaches ex:Databases; ex:works_for ex:MPI_Informatics; ex:PhD_from ex:Saarland_University.
ex:Ralf ex:teaches ex:Information_Retrieval; ex:PhD_from ex:Saarland_University; ex:works_for ex:Saarland_University, ex:MPI_Informatics.

Property table (subject | teaches | PhD_from): ex:Katja | ex:Databases | ex:TU_Ilmenau; ex:Martin | ex:Databases | ex:Saarland_University; ex:Ralf | ex:IR | ex:Saarland_University; ex:Axel | NULL | ex:TU_Vienna
"Leftover triples" (subject | predicate | object): ex:Katja | ex:works_for | ex:MPI_Informatics; ex:Martin | ex:works_for | ex:MPI_Informatics; ex:Ralf | ex:works_for | ex:Saarland_University; ex:Ralf | ex:works_for | ex:MPI_Informatics

56 Property Tables: Pros and Cons
Advantages: More in the spirit of existing relational systems Saves many self-joins over triple tables etc. Disadvantages: Potentially many NULL values Multi-value attributes problematic Query mapping depends on schema Schema changes very expensive

57 Even More Systems…
Store RDF data as a sparse matrix with bit-vector compression [BitMat, Hendler et al.: ISWC'09]
Convert RDF into XML and use XML methods (XPath, XQuery, …)
Store RDF data in graph databases and perform bi-simulation [Fletcher et al.: ESWC'12] or employ specialized graph index structures [gStore, Zou et al.: PVLDB'11]
And many more … → see our list of readings.

58 Which Technique is Best?
Performance depends a lot on precomputation, optimization, implementation, fine-tuning …
[Chart: comparative results on BTC 2008 (from [Neumann & Weikum, 2009]) for RDF-3X, RDF-3X (2008), a columnstore, and a rowstore.]

59 Challenges and Opportunities
SPARQL with different entailment regimes
New SPARQL 1.1 features (grouping, aggregation, updates)
User-oriented ranking of query results: efficient top-k operators, effective scoring methods for structured queries
What are the limits of a centralized RDF engine?
Dealing with uncertain RDF data: what is the most likely query answer? Triples with probabilities → probabilistic databases

60 Outline of this Tutorial
Part I RDF in Centralized Relational Databases Part II RDF in Distributed Settings Part III Managing Uncertain RDF Data

61 Outline for Part II Part II.1: Search Engines for the Semantic Web
Part II.2: Mediator-based and Federated Architectures

62 Semantic Web Search Engines
Querying RDF data collections started by adapting existing search engines to RDF data.
Crawling for .rdf files and for HTML documents with embedded RDF content (see RDFa and microformats).
Indexing & search based on keywords extracted from entity and property names; usually a virtual document is generated for each entity (string literals and human-readable names).
Swoogle [Ding et al., CIKM'04] (University of Maryland); Falcons [Cheng et al., WWW'08] (Nanjing University)

63

64

65 Outline for Part II Part II.1: Search Engines for the Semantic Web
Part II.2: Mediator-based and Federated Architectures

66 Classification of Distributed Approaches
Approaches for querying distributed and potentially heterogeneous (RDF) data sources:
Virtually materialized approaches: mediator-based systems (DARQ, FedX), federated systems (YARS2), Peer-2-Peer (Gridvine, RDFPeers)
Materialization-based approaches (data warehousing): MapReduce/Hadoop (Shard, Jena-HBase, [Abadi et al. PVLDB'11]), shared-nothing architectures (Partout, 4store, Eagre), shared-memory architectures with message passing, RMI, etc. (Trinity (MSR))

67 How to Integrate Data Sources?
Ship and integrate data from different sources to the client. Three common approaches:
Query-driven (single mediator)
Database federations (exported schemas)
Warehousing (fully integrated & centrally managed)

68 Query-Driven Approach
[Architecture: SPARQL clients send queries to a mediator, which forwards them through wrappers to the individual RDF sources and integrates the returned results.]
Lists of public SPARQL endpoints exist, e.g. for DBpedia and YAGO.

69 Advantages of Query-Driven Integration
No need to copy data no or little own storage costs no need to purchase data Potentially more up-to-date data Mediator holds catalog (statistics, etc.) and may optimize queries Only generic query interface needed at sources (SPARQL endpoints) May be less draining on sources Sources often even unaware of participation

70 Federation-based Approach
[Architecture: SPARQL clients query a federated schema; each of the sources 1…n exports its local schema as an exported schema that is integrated into the federated schema, and queries and results flow through these layers to the RDF sources.]

71 Advantages of Federation-Based Integration
Very similar to query-driven integration, except that the sources know that they are part of a federation; and they export their local schemas into a federated schema. Intermediate step toward full integration of the data in a single “warehouse”.

72 Warehousing Architecture
[Architecture: SPARQL clients run query & analysis against a central warehouse with metadata; an integration layer loads the data from the RDF sources into the warehouse.]
Integrated LOD indexes are available online.

73 Advantages of Warehousing
Perform Extract-Transform-Load (ETL) processes with periodic updates over the source High query performance Local processing at sources unaffected Can operate even when sources are offline Can query data that is no longer stored at sources More detailed statistics and metadata available at warehouse Modify, summarize (store aggregates), analyse Add historical information, provenance, timestamps, etc.

74 Classification of Distributed Approaches
Approaches for querying distributed and potentially heterogeneous (RDF) data sources:
Virtually materialized approaches: mediator-based systems (DARQ, FedX), federated systems (YARS2), Peer-2-Peer (Gridvine, RDFPeers)
Materialization-based approaches (data warehousing): MapReduce/Hadoop (Shard, Jena-HBase, [Abadi et al. PVLDB'11]), shared-nothing architectures (Partout, 4store, Eagre), shared-memory architectures with message passing, RMI, etc. (Trinity (MSR))

75 DARQ [Quilitz & Leser, Humboldt University Berlin, ISWC'08]
Classical mediator-based architecture connecting a given SPARQL endpoint to other endpoints via a combination of wrappers and service descriptions.
Service descriptions: RDF data descriptions, statistical information, binding constraints.
Query optimizer based on rewriting rules and cost estimations for physical join operators.

76 FedEx [fluid Op’s & MPI-INF: ISWC’11]
Online query optimization over federations of SPARQL endpoints. Cost estimates based on result sizes of SPARQL ASK queries. “Bound nested-loop joins” by grouping sets of variable bindings into SPARQL UNION queries (instead of using FILTER conditions): SELECT ?drug ?title WHERE { ?drug drugbank:drugCategory drugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrug rdf:type kegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrug purl:title ?title . }

77 Partout [Galárraga, Hose, Schenkel: PVLDB'13]
Materialization-based, distributed & workload-aware SPARQL engine; distribution helps to scale out query processing via parallel join execution.
A global query workload (aka "query log") is turned into a global query graph; triple fragments are distributed over hosts H1…Hn by (1) maximizing query locality and (2) balancing the hosts' workload.
H1…Hn run local RDF-3X instances, with (1) local S,P,O statistics from RDF-3X and (2) global (cached) statistics.

78 Partout Example Query Plan
H1, H2, H3 hold triplets for ?s rdf:type db:city H1 has triplets for ?s db:located db:Germany H2 has triplets for ?s db:name ?name

79 More Distributed RDF Engines
Shard TripleStore (Hadoop + hash partitioning)
RDFPeers (P2P/Chord architecture) [Cai et al., WWW'04]
Gridvine (P2P/Chord architecture) [Aberer et al., VLDB'07]
YARS2 (federated architecture) [Decker et al., ISWC'07]
Jena-HBase (Hadoop & HBase) [Khadilkar et al., ISWC'12]
SW-Store (Hadoop/RDF-3X) [Abadi et al., PVLDB'11]
4store (materialized, shared-nothing) [Harris et al., SSWS'09]
Eagre (materialized, shared-nothing) [HKUST & HP Labs, ICDE'13]
Trinity (materialized, shared-memory, message passing) [MSR, SIGMOD'13]
→ more in Zoi's tutorial in the afternoon…

80 Outline of this Tutorial
Part I RDF in Centralized Relational Databases Part II RDF in Distributed Settings Part III Managing Uncertain RDF Data

81 Outline for Part III Part III.1: Motivation
What is uncertain data, and where does it come from? Part III.2: Possible Worlds & Beyond Part III.3: Probabilistic Database Engines Stanford Trio Project MystiQ @ U Washington Part III.4: Managing Uncertain RDF Data Max Planck Institute

82 What is “Uncertain” Data?
Temperature is 75 F vs. Sensor reported 75 ± 0.5 F
Bob works for Yahoo vs. Bob works for either Yahoo or Microsoft
Mary sighted a Finch vs. Mary sighted either a Finch (60%) or a Sparrow (40%)
It always rains in Galway vs. There is an 89% chance of rain in Galway tomorrow
Yahoo stocks will be at 100 in a month vs. Yahoo stock will be between 60 and 120 in a month
John's age is 23 vs. John's age is in [20,30]

83 … And Why Does It Arise? "Certain" Data vs. "Uncertain" Data
Precision of devices: Temperature is 75 F vs. Sensor reported 75 ± 0.5 F
Lack of exact information (alternatives and missing values): Bob works for Yahoo vs. Bob works for either Yahoo or Microsoft; Mary sighted a Finch vs. Mary sighted either a Finch (60%) or a Sparrow (40%)
Uncertainty about future events: It always rains in Galway vs. There is an 89% chance of rain in Galway tomorrow; Yahoo stocks will be at 100 in a month vs. Yahoo stock will be between 60 and 120 in a month
Anonymization: John's age is 23 vs. John's age is in [20,30]

84 Applications: Deduplication
Do the records with Name = "John Doe" and Name = "J. Doe" refer to the same person? 80% match.

85 Applications: Information Integration
One source provides (name, hPhone, oPhone, hAddr, oAddr), another provides (name, phone, address).
At the schema level: "schema integration"; at the instance level: "record linkage"; the result is a combined view.

86 Applications: Information Extraction (I)
Extracted tuple: Restaurant = Hard Rock Cafe, Zip = 94109

87 Applications: Information Extraction (II)
Extracted triples about Galway: (Galway, type, City), (Galway, locatedIn, Ireland), (Galway, hasPopulation, 75,414), (Galway, areaCode, 091), (Galway, namedAfter, Gaillimh_River)

88 Applications: Information Extraction (III)
YAGO/DBpedia et al. bornOn(Jeff, 09/22/42) gradFrom(Jeff, Columbia) hasAdvisor(Jeff, Arthur) hasAdvisor(Surajit, Jeff) knownFor(Jeff, Theory) >120 M facts for YAGO2 (mostly from Wikipedia infoboxes) New fact candidates type(Jeff, Author)[0.9] author(Jeff, Drag_Book)[0.8] author(Jeff,Cind_Book)[0.6] worksAt(Jeff, Bell_Labs)[0.7] type(Jeff, CEO)[0.4] 100’s M additional facts from Wikipedia text

89 How do current database management systems (DBMS) handle uncertainty?
They don’t 

90 What Do (Most) Applications Do?
Clean: turn into data that DBMSs can handle.
Observed (uncertain) data: Mary saw Bird-1 as a Finch (80%) or Sparrow (20%); Susan as a Dove (70%) or Sparrow (30%); Jane as a Hummingbird (65%) or Sparrow (35%).
After cleaning, only the most likely option remains: Finch, Dove, Hummingbird.
Consequences: loss of information; errors compound and propagate insidiously.

91 Outline for Part III Part III.1: Motivation
What is uncertain data, and where does it come from? Part III.2: Possible Worlds & Beyond Part III.3: Probabilistic Database Engines Stanford Trio Project MystiQ @ U Washington Part III.4: Managing Uncertain RDF Data Max Planck Institute Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases (Synthesis Lectures on Data Management) Morgan & Claypool Publishers, 2012

92 Databases Today are Deterministic
An item either is in the database or it is not. A tuple either is in the query answer or it is not. This applies to all varieties of data models: relational, E/R, hierarchical, XML, …

93 What is a Probabilistic Database ?
"A tuple belongs to the database" is a probabilistic event. "A tuple is an answer to the query" is a probabilistic event. This can be extended to all possible kinds of data models; here we consider only probabilistic relational data.

94 Sample Spaces & Venn Diagrams
Events: "Tuple t1 is in the database.", "Tuple t2 is an answer to a query."
Sample space Ω: all possible events that can be observed; Pr(Ω) = 1.
A random variable χt assigns a probability to an event such that 0 ≤ Pr(χt) ≤ 1.
As a convention, we will use tuple identifiers in place of random variables to denote probabilistic events.

95 Possible Worlds Semantics
Attribute domains int, varchar(55), datetime have 2^32, 2^440, and 2^64 possible values, respectively.
Relational schema Employee(ID:int, name:varchar(55), dob:datetime, salary:int): # of possible tuples = 2^32 x 2^440 x 2^64 x 2^32.
# of possible relation instances = 2^(2^32 x 2^440 x 2^64 x 2^32).
Database schema Employee(…), Projects(…), Groups(…), WorksFor(…): # of possible database instances = N (big but finite).

96 The Definition Given a finite set of all possible database instances:
INST = {I1, I2, I3, …, IN}
Definition: A probabilistic database Ip is a probability distribution Pr : INST → [0,1] such that ∑_{i=1..N} Pr(Ii) = 1.
Definition: A possible world is an instance I ∈ INST with Pr(I) > 0.

97 Example
Ip has possible worlds {I1, I2, I3, I4} over a relation (Customer, Address, Product):
[Four example instances that differ in John's address (Seattle or Boston), the purchased products (Gizmo, Camera, Gadget), and Sue's address (Denver or Seattle), with Pr(I1) = 1/3, Pr(I2) = 1/12, Pr(I3) = 1/2, Pr(I4) = 1/12.]

98 Tuples as Events
One tuple t → event "t ∈ I": Pr(t) = ∑_{I: t ∈ I} Pr(I) (the marginal probability of t).
Two tuples t1, t2 → event "t1 ∈ I ∧ t2 ∈ I": Pr(t1 ∧ t2) = ∑_{I: t1 ∈ I ∧ t2 ∈ I} Pr(I) (the marginal probability of t1 ∧ t2).

99 Tuple Correlations
NOT: Pr(¬t1) = 1 - Pr(t1)
Independent-AND: Pr(t1 ∧ t2) = Pr(t1) · Pr(t2)
Independent-OR: Pr(t1 ∨ t2) = 1 - (1 - Pr(t1))(1 - Pr(t2))
Disjoint-AND: Pr(t1 ∧ t2) = 0
Disjoint-OR: Pr(t1 ∨ t2) = Pr(t1) + Pr(t2)
Negatively correlated: Pr(t1 ∧ t2) < Pr(t1) · Pr(t2)
Positively correlated: Pr(t1 ∧ t2) > Pr(t1) · Pr(t2)
Identical: Pr(t1 ∧ t2) = Pr(t1) = Pr(t2)

100 Example with Correlations
[The same four possible worlds I1-I4 (Pr = 1/3, 1/12, 1/2, 1/12) as in the previous example, annotated with tuple correlations: disjoint alternatives (D) for John's address/product choices, positively (P) and negatively (N) correlated tuples across worlds, and identical (=) tuples that always appear together.]

101 Special Case! Tuple-independent probabilistic database
TUP = {t1, t2, …, tM} = all tuples; INST = P(TUP); N = 2^M
pr : TUP → (0,1], with no restrictions w.r.t. other tuples (tuples are independent)
Pr(I) = ∏_{t ∈ I} pr(t) x ∏_{t ∉ I} (1 - pr(t))

102 … back to the Venn Diagram (I)
Sample space Ω; events "Tuple t1 is in the database" and "Tuple t2 is in the database".
If t1 and t2 are independent (per assumption!): Pr("t1 is in the database and t2 is in the database") := Pr(t1) x Pr(t2) = pr(t1) x pr(t2).
4 possible worlds = 4 subsets of events.

103 … back to the Venn Diagram (II)
Sample space Ω; events "Tuple t1 is in the database" and "Tuple t2 is in the database".
If t1 and t2 are disjoint (per assumption!): Pr("t1 is in the database and t2 is in the database") := 0.
3 possible worlds = 3 subsets of events.

104 Tuple Probabilities → Possible Worlds
Assumption: tuples are independent!
J = (Name, City, pr): (John, Seattle, p1 = 0.8), (Sue, Boston, p2 = 0.6), (Fred, Boston, p3 = 0.9)
Ip consists of the 8 possible worlds I1 … I8, i.e. all subsets of these three tuples, with Pr(I1) = (1-p1)(1-p2)(1-p3), Pr(I2) = p1(1-p2)(1-p3), Pr(I3) = (1-p1)p2(1-p3), Pr(I4) = (1-p1)(1-p2)p3, Pr(I5) = p1p2(1-p3), Pr(I6) = p1(1-p2)p3, Pr(I7) = (1-p1)p2p3, Pr(I8) = p1p2p3; the probabilities sum to 1.
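The same possible worlds can be enumerated with a few lines of Python (an illustrative sketch; Fred's city is taken to be Boston as in the example):

from itertools import product

tuples = [("John", "Seattle", 0.8), ("Sue", "Boston", 0.6), ("Fred", "Boston", 0.9)]

worlds = []
for present in product([False, True], repeat=len(tuples)):
    prob = 1.0
    for (name, city, p), in_world in zip(tuples, present):
        prob *= p if in_world else (1 - p)
    worlds.append(([t[:2] for t, keep in zip(tuples, present) if keep], prob))

for instance, prob in worlds:
    print(instance, round(prob, 4))
print("total:", round(sum(p for _, p in worlds), 4))   # 1.0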

105 Tuple Probabilities → Query Evaluation
Person^p (Name, City, pr): (John, Seattle, p1), (Sue, Boston, p2), (Fred, Boston, p3)
Purchase^p (Customer, Product, Date, pr): John's purchases with probabilities q1, q2, q3; Sue's with q4, q5, q6; Fred's with q7 (the 'Gadget' purchases carry q2, q3 for John, q5, q6 for Sue, and q7 for Fred)
SELECT DISTINCT x.City FROM Person^p x, Purchase^p y WHERE x.Name = y.Customer AND y.Product = 'Gadget'
Marginal result probabilities:
Seattle: p1 · (1 - (1-q2)(1-q3))
Boston: 1 - (1 - p2 · (1 - (1-q5)(1-q6))) · (1 - p3 · q7)

106 Summary of Data Model Possible Worlds Semantics Very powerful model:
Can capture any tuple correlations.
Needs a separate representation formalism ("just tables" are generally not enough): Boolean event expressions to capture complex tuple dependencies ("provenance", "lineage", "views", etc.).
But: query evaluation may be very expensive; need to find good cases, otherwise must approximate.

107 Outline for Part III Part III.1: Motivation
What is uncertain data, and where does it come from? Part III.2: Possible Worlds & Beyond Part III.3: Probabilistic Database Engines Stanford Trio Project MystiQ @ U Washington Part III.4: Managing Uncertain RDF Data Max Planck Institute

108 Uncertainty-Lineage Databases (ULDBs)
Trio’s Data Model [Widom et al.: 2008] Uncertainty-Lineage Databases (ULDBs) Alternatives ‘?’ (Maybe) Annotations Confidence values Lineage

109 Trio's Data Model
1. Alternatives: uncertainty about value.
Saw(witness, color, car): Amy saw (red, Honda) ∥ (red, Toyota) ∥ (orange, Mazda).
Three possible instances.

110 Trio's Data Model
1. Alternatives; 2. '?' (Maybe): uncertainty about presence.
Saw(witness, color, car): Amy saw (red, Honda) ∥ (red, Toyota) ∥ (orange, Mazda); Betty saw (blue, Acura) ?
Six possible instances.

111 Trio's Data Model
1. Alternatives; 2. '?' (Maybe) annotations; 3. Confidences: weighted uncertainty.
Saw(witness, color, car): Amy saw (red, Honda) 0.5 ∥ (red, Toyota) 0.3 ∥ (orange, Mazda) 0.2; Betty saw (blue, Acura) 0.6 ?
Six possible instances, each with a probability.

112 So Far: Model is Not Closed
Saw(witness, car): Cathy saw Honda ∥ Mazda
Drives(person, car): (Jimmy, Toyota) ∥ (Jimmy, Mazda); (Billy, Honda) ∥ (Frank, Honda); (Hank, Honda)
Suspects = π_person(Saw ⋈ Drives): Jimmy ?; Billy ∥ Frank ?; Hank ?
This CANNOT correctly capture the possible instances in the result.

113 Example with Lineage
Saw(witness, car): ID 11: Cathy saw Honda ∥ Mazda
Drives(person, car): ID 21: (Jimmy, Toyota) ∥ (Jimmy, Mazda); ID 22: (Billy, Honda) ∥ (Frank, Honda); ID 23: (Hank, Honda)
Suspects = π_person(Saw ⋈ Drives): ID 31: Jimmy ?; ID 32: Billy ∥ Frank ?; ID 33: Hank ?
Lineage: λ(31) = (11,2) ∧ (21,2); λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2); λ(33) = (11,1) ∧ 23

114 Example with Lineage
Saw(witness, car): ID 11: Cathy saw Honda ∥ Mazda
Drives(person, car): ID 21: (Jimmy, Toyota) ∥ (Jimmy, Mazda); ID 22: (Billy, Honda) ∥ (Frank, Honda); ID 23: (Hank, Honda)
Suspects = π_person(Saw ⋈ Drives): ID 31: Jimmy ?; ID 32: Billy ∥ Frank ?; ID 33: Hank ?
Lineage: λ(31) = (11,2) ∧ (21,2); λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2); λ(33) = (11,1) ∧ 23
With lineage, the representation correctly captures the possible instances in the result (7).

115 Operational Semantics
[Diagram: a representation D has possible instances I1, I2, …, In; evaluating Q on each instance yields J1, J2, …, Jm, which are exactly the possible instances of a representation D′ obtained by the direct implementation of Q on D.]
Closure: such a D′ always exists (direct implementation).
Completeness: any (finite) set of possible instances can be represented.

116 Summary on Trio’s Data Model
Uncertainty-Lineage Databases (ULDBs) Alternatives ‘?’ (Maybe) Annotations Confidence values Lineage Theorem: ULDBs are closed and complete. Formally studied properties like minimization, equivalence, approximation and membership based on lineage. [Benjelloun, Widom, et al.: VLDB J. 08]

117 MYSTIQ: Query Complexity
Data complexity of a query Q: compute Q(Ip) for a probabilistic database Ip, assuming independent tuples; i.e., compute the marginal probabilities of the tuples in Q's answer.
Extensional query evaluation: works for "safe" query plans, with PTIME data complexity.
Intensional query evaluation: works for any plan (using Boolean event expressions), but has #P-complete data complexity in the general case.

118 Extensional Query Evaluation
[Fuhr & Roellke: 1997, Dalvi & Suciu: 2004]
Relational operators compute probabilities directly:
Join (⋈): p1 · p2
Independent projection with duplicate elimination (Π): 1 - (1 - p1)(1 - p2) …
Disjoint projection (over exclusive alternatives): p1 + p2 + …
Selection (σ): p
Difference (-): p1 · (1 - p2)
Data complexity: PTIME. (A sketch of these combinations follows below.)
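A small Python sketch of how these per-operator combinations compose (illustrative, with assumed probabilities); the last two lines already hint at the plan dependence discussed on the next slide.

def p_join(p1, p2):
    # A joined tuple built from two independent input tuples.
    return p1 * p2

def p_project_independent(ps):
    # Duplicate elimination over independently derived duplicates.
    prob = 1.0
    for p in ps:
        prob *= 1 - p
    return 1 - prob

def p_project_disjoint(ps):
    # Duplicate elimination over disjoint (exclusive) alternatives.
    return sum(ps)

def p_difference(p1, p2):
    # Tuple in the left input and not in the right input.
    return p1 * (1 - p2)

p1, q2, q3 = 0.8, 0.5, 0.4
safe = p_join(p1, p_project_independent([q2, q3]))                # project purchases first
unsafe = p_project_independent([p_join(p1, q2), p_join(p1, q3)])  # join first, then project
print(safe, unsafe)   # the two orders disagree; see the "safe plans" discussion next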

119 "Safe Plans" [Dalvi & Suciu: 2004]
SELECT DISTINCT x.City FROM Person^p x, Purchase^p y WHERE x.Name = y.Customer AND y.Product = 'Gadget'
Plan A (correct): project Purchase on Customer first, then join: (John, Seattle) gets p1 · (1 - (1-q1)(1-q2)(1-q3)).
Plan B (wrong): join first, then project: (Seattle) gets 1 - (1 - p1q1)(1 - p1q2)(1 - p1q3), which incorrectly treats the correlated tuples p1q1, p1q2, p1q3 as independent.
Correctness depends on the plan!

120 Query Complexity [Dalvi & Suciu: 2004]
Sometimes a correct extensional plan exists, but consider: Qbad :- R(x), S(x,y), T(y). Its data complexity is #P-complete.
NP = class of problems of the form "is there a witness?"; #P = class of problems of the form "how many witnesses?" (we will come back to this…)

121 Intensional Database [Fuhr & Roellke: 1997]
Atomic event IDs e1, e2, e3, … with probabilities p1, p2, p3, … ∈ [0,1]
Event expressions built from ∧, ∨, ¬, e.g. e3 ∧ (e5 ∨ ¬e2)
Intensional probabilistic database J: each tuple t has an event attribute t.E

122 Probability of Boolean Expressions
Needed for query evaluation! E = X1X3 ∨ X1X4 ∨ X2X5 ∨ X2X6, where each variable Xi is independently true with probability Pr(Xi) = pi. What is Pr(E)?
One option is sampling: randomly make each variable true with its probability and count.
Better answer here: re-group cleverly into a "read-once" formula E = X1(X3 ∨ X4) ∨ X2(X5 ∨ X6), so that
Pr(E) = 1 - (1 - p1(1 - (1-p3)(1-p4))) · (1 - p2(1 - (1-p5)(1-p6)))
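A short Python sketch that evaluates the read-once form and cross-checks it against brute-force enumeration over all 2^6 assignments; the probabilities p1..p6 are assumed example values.

from itertools import product
from math import prod

p = {i: 0.1 * i for i in range(1, 7)}          # assumed example probabilities p1..p6

def pr_or(*qs):  return 1 - prod(1 - q for q in qs)
def pr_and(*qs): return prod(qs)

# E regrouped as X1(X3 v X4) v X2(X5 v X6)
read_once = pr_or(pr_and(p[1], pr_or(p[3], p[4])),
                  pr_and(p[2], pr_or(p[5], p[6])))

# Brute force over all truth assignments of the original DNF
brute = 0.0
for bits in product([0, 1], repeat=6):
    x = dict(zip(range(1, 7), bits))
    if (x[1] and x[3]) or (x[1] and x[4]) or (x[2] and x[5]) or (x[2] and x[6]):
        brute += prod(p[i] if x[i] else 1 - p[i] for i in range(1, 7))

print(round(read_once, 10), round(brute, 10))  # the two values coincide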

123 Complexity Issues
Theorem [Valiant: 1979]: For a Boolean expression E, computing Pr(E) is #P-complete.
NP = class of problems of the form "is there a witness?" → SAT; #P = class of problems of the form "how many witnesses?" → #SAT.
The decision problem for 2CNF is in PTIME, but the counting problem for 2CNF is #P-complete.
124 Probabilistic Query Engine
MYSTIQ: [Re, Suciu: VLDB’04] Probabilistic Query Evaluation on Top of a Deterministic Database Engine (Top-k) Answers 1. Sampling Probabilistic Query Engine 2. Extensional joins SQL Query Deterministic Database 3. Indexes

125 Outline for Part III Part III.1: Motivation
What is uncertain data, and where does it come from? Part III.2: Possible Worlds & Beyond Part III.3: Probabilistic Database Engines Stanford Trio Project MystiQ @ U Washington Part III.4: Uncertain RDF Data URDF Max Planck Institute

126 Uncertain RDF (URDF) Data Model
Extensional Layer (information extraction & integration) High-confidence facts: existing knowledge base (“ground truth”) New fact candidates: extracted facts with confidence values Integration of different knowledge sources: Ontology merging or explicit Linked Data (owl:sameAs, owl:equivProp.)  Large “Probabilistic Database” of RDF facts Intensional Layer (query-time inference) Soft rules: deductive grounding & lineage (Datalog/SLD resolution) Hard rules: consistency constraints (more general FOL rules) Propositional & probabilistic consistency reasoning

127 Soft Rules vs. Hard Rules
(Soft) deduction rules vs. (hard) consistency constraints
Soft (people may live in more than one place):
livesIn(x,y) ∧ marriedTo(x,z) ⇒ livesIn(z,y) [0.8]
livesIn(x,y) ∧ hasChild(x,z) ⇒ livesIn(z,y) [0.5]
Hard (people are not born in different places/on different dates):
bornIn(x,y) ∧ bornIn(x,z) ⇒ y = z
bornOn(x,y) ∧ bornOn(x,z) ⇒ y = z
Hard (people are not married to more than one person at the same time, in most countries):
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y ≠ z ⇒ disjoint(t1,t2)

128 Soft Rules vs. Hard Rules
(Soft) deduction rules vs. (hard) consistency constraints
Soft rules form a deductive database: Datalog, the core of SQL & relational algebra, RDF/S, OWL2-RL, etc.
livesIn(x,y) ∧ marriedTo(x,z) ⇒ livesIn(z,y) [0.8]
livesIn(x,y) ∧ hasChild(x,z) ⇒ livesIn(z,y) [0.5]
Hard rules are more general FOL constraints: Datalog with constraints, X-tuples in probabilistic databases, owl:FunctionalProperty, etc.
bornIn(x,y) ∧ bornIn(x,z) ⇒ y = z; bornOn(x,y) ∧ bornOn(x,z) ⇒ y = z
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y ≠ z ⇒ disjoint(t1,t2)

129 URDF: Running Example
Rules:
hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y = z
KB base facts: type(Jeff, Computer_Scientist)[1.0], type(Surajit, Computer_Scientist)[1.0], type(David, Computer_Scientist)[1.0], type(Stanford, University)[1.0], type(Princeton, University)[1.0], hasAdvisor(Surajit, Jeff)[0.8], hasAdvisor(David, Jeff)[0.7], worksAt(Jeff, Stanford)[0.9], graduatedFrom(Surajit, Princeton)[0.7], graduatedFrom(Surajit, Stanford)[0.6], graduatedFrom(David, Princeton)[0.9]
Derived facts (confidences to be inferred): gradFr(Surajit, Stanford)[?], gradFr(David, Stanford)[?]

130 Basic Types of Inference
MAP Inference Find the most likely assignment to query variables y under a given evidence x. Compute: arg max y P( y | x) (NP-hard for MaxSAT) Marginal/Success Probabilities Probability that query y is true in a random world Compute: ∑y P( y | x) (#P-hard already for conjunctive queries)

131 General Route: Grounding & MaxSAT Solving
Query: graduatedFrom(x, y)
1) Grounding: consider only the facts (and rules) that are relevant for answering the query.
2) Build a propositional formula in CNF, consisting of the grounded hard & soft rules and the weighted base facts:
(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton)) ∧ (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton)) [hard, weight 1000]
∧ (hasAdvisor(Surajit, Jeff) ∧ worksAt(Jeff, Stanford) ⇒ graduatedFrom(Surajit, Stanford)) ∧ (hasAdvisor(David, Jeff) ⇒ graduatedFrom(David, Stanford)) [soft, weight 0.4]
∧ worksAt(Jeff, Stanford) [0.9] ∧ hasAdvisor(Surajit, Jeff) [0.8] ∧ hasAdvisor(David, Jeff) [0.7] ∧ graduatedFrom(Surajit, Princeton) [0.7] ∧ graduatedFrom(Surajit, Stanford) [0.6] ∧ graduatedFrom(David, Princeton) [0.9]
3) Propositional reasoning: find a truth assignment to the facts such that the total weight of the satisfied clauses is maximized → MAP inference computes the "most likely" possible world.

132 URDF: MaxSAT Solving with Soft & Hard Rules
Special case: Horn clauses as soft rules & mutex constraints as hard rules [Theobald, Sozio, Suchanek, Nakashole: VLDS'12]
S: mutex constraints, e.g. { graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) } and { graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }
C: weighted Horn clauses (CNF), e.g. (hasAdvisor(Surajit, Jeff) ∧ worksAt(Jeff, Stanford) ⇒ graduatedFrom(Surajit, Stanford)) [0.4] and (hasAdvisor(David, Jeff) ⇒ graduatedFrom(David, Stanford)) [0.4], plus the weighted base facts worksAt(Jeff, Stanford) [0.9], hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7], graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]
Goal: find argmax_y P(y | x); this resolves to a variant of MaxSAT for propositional formulas.
MaxSAT algorithm (sketch):
Compute W0 = ∑_clauses C w(C) · P(C is satisfied);
For each hard constraint S_t:
  for each fact f in S_t: W_f,t = ∑_clauses C w(C) · P(C is sat. | f = true);
  W_S,t = ∑_clauses C w(C) · P(C is sat. | all facts in S_t false);
  choose the truth assignment for S_t (one f true, or all false) that maximizes W_f,t resp. W_S,t;
  remove satisfied clauses C; t++;
Runtime: O(|S| · |C|); approximation guarantee of 1/2. (A much-simplified sketch of the greedy idea follows below.)
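A much-simplified, deterministic Python sketch of the greedy per-constraint idea (not the URDF implementation): it picks, for each mutex group, the alternative that maximizes the weight of clauses that are already guaranteed to be satisfied; fact names are abbreviated and the clause weights follow the example above.

clauses = [   # (weight, positive literals, negative literals): Horn clauses in CNF
    (0.4, {"grad(S,Stanford)"}, {"adv(S,Jeff)", "works(Jeff,Stanford)"}),
    (0.9, {"works(Jeff,Stanford)"}, set()),
    (0.8, {"adv(S,Jeff)"}, set()),
    (0.7, {"grad(S,Princeton)"}, set()),
    (0.6, {"grad(S,Stanford)"}, set()),
]
mutex_groups = [["grad(S,Stanford)", "grad(S,Princeton)"]]
assignment = {"adv(S,Jeff)": True, "works(Jeff,Stanford)": True}   # unconstrained facts

def satisfied_weight(assign):
    # Total weight of clauses already guaranteed true under the (partial) assignment.
    return sum(w for w, pos, neg in clauses
               if any(assign.get(l) is True for l in pos)
               or any(assign.get(l) is False for l in neg))

for group in mutex_groups:
    options = [dict.fromkeys(group, False)]                           # "all false" option
    options += [{**dict.fromkeys(group, False), f: True} for f in group]
    best = max(options, key=lambda opt: satisfied_weight({**assignment, **opt}))
    assignment.update(best)

print(assignment, satisfied_weight(assignment))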

133 Deductive Grounding with Lineage (SLD Resolution/Datalog)
Query: graduatedFrom(Surajit, y)
Rules: hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]; graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y = z
Base facts: graduatedFrom(Surajit, Princeton) [0.7] (= A), graduatedFrom(Surajit, Stanford) [0.6] (= B), hasAdvisor(Surajit, Jeff) [0.8] (= C), worksAt(Jeff, Stanford) [0.9] (= D), graduatedFrom(David, Princeton) [0.9], hasAdvisor(David, Jeff) [0.7], type(Princeton, University) [1.0], type(Stanford, University) [1.0], type(Jeff, Computer_Scientist) [1.0], type(Surajit, Computer_Scientist) [1.0], type(David, Computer_Scientist) [1.0]
Answers and their lineage:
Q1 = graduatedFrom(Surajit, Princeton): A ∧ ¬(B ∨ (C ∧ D))
Q2 = graduatedFrom(Surajit, Stanford): ¬A ∧ (B ∨ (C ∧ D))

134 Lineage & Possible Worlds
[Das Sarma, Theobald, Widom: ICDE'08; Dylla, Miliaraki, Theobald: CIKM'11]
Query: graduatedFrom(Surajit, y)
1) Deductive grounding: build the dependency graph of the query; trace the lineage of the individual query answers.
2) Lineage DAG (not in CNF), consisting of the grounded hard & soft rules and the weighted base facts, plus the entire derivation history.
3) Probabilistic inference, computing marginals: P(Q) aggregates the probabilities of all possible worlds in which the lineage of the query evaluates to "true"; P(Q|H) additionally drops the "impossible" worlds.
Example, with A = graduatedFrom(Surajit, Princeton)[0.7], B = graduatedFrom(Surajit, Stanford)[0.6], C = hasAdvisor(Surajit, Jeff)[0.8], D = worksAt(Jeff, Stanford)[0.9]:
P(C ∧ D) = 0.8 x 0.9 = 0.72; P(B ∨ (C ∧ D)) = 1 - (1 - 0.72)(1 - 0.6) = 0.888
P(Q1) = P(A ∧ ¬(B ∨ (C ∧ D))) = 0.7 x (1 - 0.888) = 0.0784
P(Q2) = P(¬A ∧ (B ∨ (C ∧ D))) = (1 - 0.7) x 0.888 = 0.2664

135 Possible Worlds Semantics
Base facts: A: 0.7, B: 0.6, C: 0.8, D: 0.9; lineage Q1: A ∧ ¬(B ∨ (C ∧ D)), Q2: ¬A ∧ (B ∨ (C ∧ D)); hard rule H: ¬A ∨ ¬(B ∨ (C ∧ D)).
[Table of all 16 possible worlds over A, B, C, D with their probabilities, e.g. P(A, B, C, D all true) = 0.7 x 0.6 x 0.8 x 0.9; the world probabilities sum to 1.]
P(Q1) = 0.0784, P(Q2) = 0.2664; conditioning on the hard rule H gives P(Q1|H) = 0.0784 / 0.412 and P(Q2|H) = 0.2664 / 0.412.
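The marginals above can be cross-checked by brute-force enumeration of the 16 possible worlds in Python; A, B, C, D abbreviate the four base facts, and the lineage formulas are the ones from the previous slides.

from itertools import product

prob = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

def world_prob(w):
    p = 1.0
    for f, pf in prob.items():
        p *= pf if w[f] else 1 - pf
    return p

pQ1 = pQ2 = 0.0
for bits in product([False, True], repeat=4):
    w = dict(zip("ABCD", bits))
    derived = w["B"] or (w["C"] and w["D"])           # B v (C ^ D)
    if w["A"] and not derived:                        # lineage of Q1
        pQ1 += world_prob(w)
    if not w["A"] and derived:                        # lineage of Q2
        pQ2 += world_prob(w)

print(round(pQ1, 4), round(pQ2, 4))   # 0.0784 0.2664, as on the slide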

136 More Probabilistic Approaches
Propositional Stochastic MaxSat solvers: MaxWalkSat (MAP-Inference) URDF: constrained weighted MaxSat solver for soft & hard rules Lineage & Possible Worlds (tuple-independent database) Exact probabilistic inference: junction trees, variable elimination Approximate inference: decision diagrams/Shannon expansions, sampling Combining First-Order Logic & Probabilistic Graphical Models Markov Logic Networks* [Richardson & Domingos: Machine Learning 2006] Factor Graphs [FactorIE, McCallum et al.: NIPS 2008] Variety of MCMC sampling techniques for probabilistic inference (e.g., Gibbs sampling, MC-SAT, etc.) *Alchemy – Open-Source AI:

137 Experiments YAGO Knowledge Base: 2 Mio entities, 20 Mio facts
Basic query answering: SLD grounding & MaxSat solving of 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …) Asymptotic runtime checks: runtime comparisons for synthetic rule expansions URDF: SLD grounding & MaxSat solving URDF MaxSat vs. Markov Logic (MAP inference & MC-SAT) |C| - # literals in soft rules |S| - # literals in hard rules

138 UViz: URDF Visualization Frontend
[Meiser, Dylla, Theobald: CIKM’11 Demo] System components: Flash Player client Tomcat server (JRE) Relational backend (JDBC) Remote Method Invocation & Object Serialization (BlazeDS)

139 UViz: URDF Visualization Frontend
[Meiser, Dylla, Theobald: CIKM’11 Demo] Demo!

140 Recommended Readings
PART I
SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008
SPARQL 1.1 Query Language, W3C Working Draft, 21 March 2013
SPARQL 1.1 Federated Query, W3C Working Draft, 21 March 2013
Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: Towards Support for Subgraph Extraction Queries in RDF Databases. WWW, 2007
Krisztian Balog, Edgar Meij, Maarten de Rijke: Entity Search: Building Bridges between Two Worlds. WWW, 2010
Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: Searching Entities Directly and Holistically. VLDB, 2007
Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-Model-Based Ranking for Queries on RDF-Graphs. CIKM, 2009
Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-Based Keyword Search in Databases. ACM Transactions on Database Systems 33(1), 2008
Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
Thomas Neumann, Gerhard Weikum: Scalable Join Processing on Very Large RDF Graphs. SIGMOD, 2009
Thomas Neumann, Gerhard Weikum: The RDF-3X Engine for Scalable Management of RDF Data. VLDB Journal 19(1), 2010
François Picalausa, Yongming Luo, George H. L. Fletcher, Jan Hidders, Stijn Vansummeren: A Structural Approach to Indexing Triples. ESWC, 2012
Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active Knowledge: Dynamically Enriching RDF Knowledge Bases by Web Services. SIGMOD, 2010
ChengXiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao: gStore: Answering SPARQL Queries via Subgraph Matching. PVLDB 4(8), 2011
PART II
Min Cai, Martin R. Frank: RDFPeers: A Scalable Distributed RDF Repository Based on a Structured Peer-to-Peer Network. WWW, 2004
Gong Cheng, Weiyi Ge, Yuzhong Qu: Falcons: Searching and Browsing Entities on the Semantic Web. WWW, 2008
Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal C. Doshi, Joel Sachs: Swoogle: A Search and Metadata Engine for the Semantic Web. CIKM, 2004
Luis Galárraga, Katja Hose, Ralf Schenkel: Partout: A Distributed Engine for Efficient RDF Processing. To appear in PVLDB, 2013
Steve Harris, Nick Lamb, Nigel Shadbolt: 4store: The Design and Implementation of a Clustered RDF Store. SSWS, 2009
Jiewen Huang, Daniel J. Abadi, Kun Ren: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 2011
Vaibhav Khadilkar, Murat Kantarcioglu, Bhavani Thuraisingham, Paolo Castagna: Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. ISWC, 2012
Bastian Quilitz, Ulf Leser: Querying Distributed RDF Data Sources with SPARQL. ISWC, 2008
Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt: FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. ESWC, 2011
Bin Shao, Haixun Wang, Yatao Li: Trinity: A Distributed Graph Engine on a Memory Cloud. To appear in SIGMOD, 2013
Xiaofei Zhang, Lei Chen, Yongxin Tong, Min Wang: EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud. ICDE, 2013
PART III
Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with Uncertainty and Lineage. VLDB Journal 17(2), 2008
Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, Christopher Ré, Dan Suciu: MYSTIQ: A System for Finding More Answers by Using Probabilities. SIGMOD, 2005
Nilesh N. Dalvi, Dan Suciu: Efficient Query Evaluation on Probabilistic Databases. VLDB, 2004
Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE, 2013
Norbert Fuhr, Thomas Rölleke: A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Trans. Inf. Syst. 15(1), 1997
Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive Reasoning in Uncertain RDF Knowledge Bases. CIKM, 2011
Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS, 2012
Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases (Synthesis Lectures on Data Management). Morgan & Claypool Publishers, 2012

