Database Techniques for Linked Data Management SIGMOD 2012 Andreas Harth 1, Katja Hose 2, Ralf Schenkel 2,3 1 Karlsruhe Institute of Technology 2 Max.

Presentation transcript:

Database Techniques for Linked Data Management
SIGMOD 2012
Andreas Harth (1), Katja Hose (2), Ralf Schenkel (2,3)
(1) Karlsruhe Institute of Technology
(2) Max Planck Institute for Informatics, Saarbrücken
(3) Saarland University, Saarbrücken

Outline for Part II Part II.1: Foundations –A short overview of SPARQL Part II.2: Rowstore Solutions Part II.3: Columnstore Solutions Part II.4: Other Solutions and Outlook

SPARQL
Query language for RDF from the W3C
Main component:
–select-project-join combination of triple patterns, i.e., graph pattern queries on the knowledge base

SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
[Figure: knowledge graph with the nodes vegetarian, physicist, chemist, scientist, actor, Albert_Einstein, Otto_Hahn, Jim_Carrey, Mike_Myers, Ulm, Frankfurt, Germany, Europe, Newmarket, Scarborough, Ontario, Canada, connected by isA, bornIn and locatedIn edges]
Find subgraphs of this form (?person and ?loc are variables; actor and Ontario are constants):
  ?person isA actor
  ?person bornIn ?loc
  ?loc locatedIn Ontario
SELECT ?person WHERE { ?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario. }

SPARQL – More Features
Eliminate duplicates in results:
  SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn ?c}
Return results in some order, with optional LIMIT n clause:
  SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario} ORDER BY DESC(?person)
Optional matches and filters on bound variables:
  SELECT ?person WHERE {?person isA actor. OPTIONAL{?person bornIn ?loc}. FILTER (!BOUND(?loc))}
More operators: ASK, DESCRIBE, CONSTRUCT

SPARQL: Extensions from W3C
W3C SPARQL 1.1 draft:
–Aggregations (COUNT, AVG, …)
–Subqueries
–Negation: syntactic sugar for OPTIONAL {?x … } FILTER(!BOUND(?x))
–Regular path expressions
–Updates

Why care about scalability? Rapid growth of available semantic data > 31 billion triples in the LOD cloud, 325 sources DBPedia: 3.6 million entities, 1.2 billion triples

… and growing Billion triple challenge 2008: 1B triples Billion triple challenge 2010: 3B triples Billion triple challenge 2011: 2B triples War stories from –BigOWLIM: 12B triples in Jun 2009 –Garlik 4store: 15B triples in Oct 2009 –OpenLink Virtuoso: 15.4B+ triples –AllegroGraph: 1+ Trillion triples

Queries can be complex, too SELECT DISTINCT ?a ?b ?lat ?long WHERE { ?a dbpedia:spouse ?b. ?a dbpedia:wikilink dbpediares:actor. ?b dbpedia:wikilink dbpediares:actor. ?a dbpedia:placeOfBirth ?c. ?b dbpedia:placeOfBirth ?c. ?c owl:sameAs ?c2. ?c2 pos:lat ?lat. ?c2 pos:long ?long. } Q7 on BTC2008 in [Neumann & Weikum, 2009]

Outline for Part II Part II.1: Foundations –A short overview of SPARQL Part II.2: Rowstore Solutions Part II.3: Columnstore Solutions Part II.4: Other Solutions and Outlook

RDF in Row Stores Rowstore: general relational database storing relations as complete rows (MySQL, PostgreSQL, Oracle, DB2, SQLServer, …) General principles: –store triples in one giant three-attribute table (subject, predicate, object) –convert SPARQL to equivalent SQL –The database will do the rest Strings often mapped to unique integer IDs Used by many TripleStores, including 3Store, Jena, HexaStore, RDF-3X, … Simple extension to quadruples (with graphid): (graph,subject,predicate,object) We consider only triples for simplicity!

Example: Triple Table
ex:Katja   ex:teaches ex:Databases;
           ex:works_for ex:MPI_Informatics;
           ex:PhD_from ex:TU_Ilmenau.
ex:Andreas ex:teaches ex:Databases;
           ex:works_for ex:KIT;
           ex:PhD_from ex:DERI.
ex:Ralf    ex:teaches ex:Information_Retrieval;
           ex:PhD_from ex:Saarland_University;
           ex:works_for ex:Saarland_University, ex:MPI_Informatics.

subject      predicate      object
ex:Katja     ex:teaches     ex:Databases
ex:Katja     ex:works_for   ex:MPI_Informatics
ex:Katja     ex:PhD_from    ex:TU_Ilmenau
ex:Andreas   ex:teaches     ex:Databases
ex:Andreas   ex:works_for   ex:KIT
ex:Andreas   ex:PhD_from    ex:DERI
ex:Ralf      ex:teaches     ex:Information_Retrieval
ex:Ralf      ex:PhD_from    ex:Saarland_University
ex:Ralf      ex:works_for   ex:Saarland_University
ex:Ralf      ex:works_for   ex:MPI_Informatics

Conversion of SPARQL to SQL
General approach:
–One copy of the triple table for each triple pattern
–Constants in patterns create constraints
–Common variables across patterns create joins
–FILTER conditions create constraints
–OPTIONAL clauses create outer joins
–UNION clauses create union expressions

Example: Conversion to SQL Query
SPARQL query:
  SELECT ?a ?b ?t WHERE { ?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. OPTIONAL {?a teaches ?t} FILTER (regex(?u,"Saar")) }
The SQL query is built up step by step:
1. One copy of the triple table per triple pattern: FROM Triples P1, Triples P2, Triples P3
2. Constants create constraints: WHERE P1.predicate='works_for' AND P2.predicate='works_for' AND P3.predicate='phd_from'
3. Common variables create joins: AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object
4. The FILTER creates a constraint: AND REGEXP_LIKE(P1.object,'Saar')
5. Projection: SELECT P1.subject as A, P2.subject as B
6. The OPTIONAL clause creates a left outer join with a fourth copy P4 of the triple table
Resulting SQL query:
  SELECT R1.A, R1.B, R2.T
  FROM ( SELECT P1.subject as A, P2.subject as B
         FROM Triples P1, Triples P2, Triples P3
         WHERE P1.predicate='works_for' AND P2.predicate='works_for' AND P3.predicate='phd_from'
           AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object
           AND REGEXP_LIKE(P1.object,'Saar') ) R1
  LEFT OUTER JOIN
       ( SELECT P4.subject as A, P4.object as T
         FROM Triples P4
         WHERE P4.predicate='teaches' ) R2
  ON (R1.A=R2.A)
[Figure: operator tree over P1, P2, P3 and P4 with three joins (on ?u, on ?a,?u, and on ?a), a filter and a projection]
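
The translated query can be tried out on the example triple table. The following is a minimal sketch, assuming SQLite as the row store (the slides do not name a specific system here). Because SQLite has no built-in REGEXP_LIKE, the filter is approximated with LIKE '%Saar%', and the ex: prefixes are dropped for brevity.

```python
import sqlite3

# In-memory triple table holding the example data from the slides (prefixes dropped).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO Triples VALUES (?, ?, ?)",
    [
        ("Katja", "teaches", "Databases"),
        ("Katja", "works_for", "MPI_Informatics"),
        ("Katja", "phd_from", "TU_Ilmenau"),
        ("Andreas", "teaches", "Databases"),
        ("Andreas", "works_for", "KIT"),
        ("Andreas", "phd_from", "DERI"),
        ("Ralf", "teaches", "Information_Retrieval"),
        ("Ralf", "phd_from", "Saarland_University"),
        ("Ralf", "works_for", "Saarland_University"),
        ("Ralf", "works_for", "MPI_Informatics"),
    ],
)

# One alias of Triples per triple pattern; OPTIONAL becomes a LEFT OUTER JOIN.
# Assumption: LIKE '%Saar%' stands in for the regex filter of the slide.
SQL = """
SELECT R1.A, R1.B, R2.T
FROM (
    SELECT P1.subject AS A, P2.subject AS B
    FROM Triples P1, Triples P2, Triples P3
    WHERE P1.predicate = 'works_for'
      AND P2.predicate = 'works_for'
      AND P3.predicate = 'phd_from'
      AND P1.object = P2.object
      AND P1.subject = P3.subject
      AND P1.object = P3.object
      AND P1.object LIKE '%Saar%'
) R1
LEFT OUTER JOIN (
    SELECT P4.subject AS A, P4.object AS T
    FROM Triples P4
    WHERE P4.predicate = 'teaches'
) R2 ON R1.A = R2.A
"""
result = conn.execute(SQL).fetchall()
print(result)
```

On this data only Ralf both works for and holds a PhD from an institution whose name contains "Saar", so the query returns a single row.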

Is that all? No.
Which indexes should be built? (to support evaluation of triple patterns)
How can we reduce storage space?
How can we find the best execution plan?
Existing databases need modifications:
–flexible, extensible, generic storage not needed here
–cannot deal with multiple self-joins of a single table
–often generate bad execution plans

Dictionary for Strings
Map all strings to unique integers (e.g., hashing)
–Regular size, much easier to handle & compress
–Map small, can be kept in main memory
This breaks natural sorting order ⇒ FILTER conditions may be more expensive!
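
A minimal sketch of such a dictionary, for illustration only. It assigns sequential ids in order of first appearance rather than hashing; the class name and methods are hypothetical and not taken from any of the systems mentioned.

```python
class Dictionary:
    """Bidirectional string <-> integer id mapping (illustrative sketch)."""

    def __init__(self):
        self._id_of = {}      # string -> id
        self._string_of = []  # id -> string

    def encode(self, s):
        # Assign the next free id the first time a string is seen.
        if s not in self._id_of:
            self._id_of[s] = len(self._string_of)
            self._string_of.append(s)
        return self._id_of[s]

    def decode(self, i):
        return self._string_of[i]


d = Dictionary()
triples = [
    ("ex:Ralf", "ex:teaches", "ex:Information_Retrieval"),
    ("ex:Andreas", "ex:teaches", "ex:Databases"),
]
encoded = [tuple(d.encode(term) for term in triple) for triple in triples]
print(encoded)

# The ids do not preserve the lexicographic order of the strings:
print(d.encode("ex:Ralf") < d.encode("ex:Andreas"), "ex:Ralf" < "ex:Andreas")
```

The last line shows why FILTER conditions that compare strings can become more expensive: the comparison has to decode the ids first.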

Indexes for commonly used triple patterns
Patterns with a single variable are frequent. Example: Albert_Einstein invented ?x
⇒ Build a clustered index over (s,p,o); it can also be used for patterns like Albert_Einstein ?p ?x
Build similar clustered indexes for all six combinations:
–SPO, POS, OSP to cover all possible patterns
–SOP, OPS, PSO to have all sort orders for patterns with two variables
All triples in (s,p,o) order, stored in a B+ tree for easy access:
  (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) …
Lookup:
1. Look up the ids for the constants: Albert_Einstein=16, invented=24
2. Look up the known prefix in the index: (16,24,0)
3. Read results while the prefix matches: (16,24,567), (16,24,876); they come already sorted!
The triple table is no longer needed, since each index contains all triples
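
The three lookup steps can be sketched with a sorted Python list standing in for the B+ tree. This is an illustration under that assumption, not the index implementation of any particular system; scan_prefix is a hypothetical helper and ids are assumed to be non-negative.

```python
from bisect import bisect_left

# All triples in (s, p, o) order; a sorted list stands in for the B+ tree leaves.
SPO = sorted([
    (16, 19, 5356), (16, 24, 567), (16, 24, 876),
    (27, 19, 643), (27, 48, 10486), (50, 10, 10456),
])


def scan_prefix(index, prefix):
    """Yield all triples of a sorted index that start with `prefix`."""
    # Pad the prefix with zeros, e.g. (16, 24) -> (16, 24, 0), and seek to it.
    start = bisect_left(index, prefix + (0,) * (3 - len(prefix)))
    for triple in index[start:]:
        if triple[:len(prefix)] != prefix:
            break  # prefix no longer matches: stop reading
        yield triple


# Albert_Einstein invented ?x  ->  prefix (16, 24)
matches = list(scan_prefix(SPO, (16, 24)))
print(matches)

# Albert_Einstein ?p ?x  ->  prefix (16,), served by the same index
by_subject = list(scan_prefix(SPO, (16,)))
print(by_subject)
```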

Why sort order matters for joins
When the inputs are sorted by the join attribute, use a Merge Join (MJ):
–sequentially scan both inputs
–immediately join matching triples
–skip over parts without matches
–allows pipelining
Example inputs, both sorted by the join attribute:
  (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456)
  (16,33,46578) (16,56,1345) (24,16,1353) (27,18,133) (47,37,20495) (50,134,1056)
When the inputs are unsorted or sorted by the wrong attribute, use a Hash Join (HJ):
–build hash table from one input
–scan other input, probe hash table
–needs to touch every input triple
–breaks pipelining
Example inputs, the second one unsorted:
  (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456)
  (27,18,133) (50,134,1056) (16,56,1345) (24,16,1353) (47,37,20495) (16,33,46578)
In general, Merge Joins are preferable: small memory footprint, pipelining
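
Both join strategies can be sketched in a few lines. The code below is an illustration, not the operator implementation of any system discussed here; it joins on the first component of each triple and materializes the result in a list for simplicity, whereas a real merge join would stream its output.

```python
from collections import defaultdict


def merge_join(left, right, key=lambda t: t[0]):
    """Join two inputs that are both sorted by `key`, in one sequential pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1  # skip left triples without a match
        elif kl > kr:
            j += 1  # skip right triples without a match
        else:
            # Find the run of right triples sharing this key ...
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            # ... and pair it with every left triple sharing the key.
            while i < len(left) and key(left[i]) == kl:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


def hash_join(build, probe, key=lambda t: t[0]):
    """Join unsorted inputs: build a hash table on one, probe with the other."""
    table = defaultdict(list)
    for b in build:
        table[key(b)].append(b)
    return [(b, p) for p in probe for b in table.get(key(p), [])]


left = [(16, 19, 5356), (16, 24, 567), (16, 24, 876),
        (27, 19, 643), (27, 48, 10486), (50, 10, 10456)]
right_sorted = [(16, 33, 46578), (16, 56, 1345), (24, 16, 1353),
                (27, 18, 133), (47, 37, 20495), (50, 134, 1056)]
right_unsorted = [(27, 18, 133), (50, 134, 1056), (16, 56, 1345),
                  (24, 16, 1353), (47, 37, 20495), (16, 33, 46578)]

mj = merge_join(left, right_sorted)
hj = hash_join(left, right_unsorted)
print(len(mj), len(hj))
```

Both calls produce the same pairs; the merge join needs no hash table, while the hash join must finish reading its build input before it can emit anything.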

Even More Indexes
SPARQL considers duplicates (unless removed with DISTINCT) and does not (yet) support aggregates/counting
⇒ queries often return many duplicates, e.g. SELECT ?x WHERE { ?x ?y Germany. } to retrieve entities related to Germany (but counts may be important in the application!)
⇒ this materializes many identical intermediate results
Solution: Precompute aggregated indexes SP, SO, PO, PS, OP, OS, S, P, O
–Ex: SO contains, for each pair (s,o), the number of triples with subject s and object o
–Do not materialize identical bindings, but keep counts. Ex: ?x=Albert_Einstein:4; ?x=Angela_Merkel:10
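
A rough sketch of an aggregated index, using a Counter over projected triples. The toy triples and the helper aggregated_index are made up for illustration and do not come from the slides or from a real dataset.

```python
from collections import Counter

# Hypothetical toy triples (subject, predicate, object), for illustration only.
triples = [
    ("Albert_Einstein", "bornIn", "Germany"),
    ("Albert_Einstein", "livedIn", "Germany"),
    ("Angela_Merkel", "bornIn", "Germany"),
    ("Angela_Merkel", "citizenOf", "Germany"),
    ("Angela_Merkel", "leaderOf", "Germany"),
    ("Jim_Carrey", "bornIn", "Canada"),
]


def aggregated_index(triples, positions):
    """Count triples per combination of the given positions (0=s, 1=p, 2=o)."""
    return Counter(tuple(t[p] for p in positions) for t in triples)


SO = aggregated_index(triples, (0, 2))  # (subject, object) -> number of triples
O = aggregated_index(triples, (2,))     # (object,) -> number of triples

# SELECT ?x WHERE { ?x ?y Germany }: one entry per binding of ?x plus its count,
# instead of one materialized row per matching triple.
bindings = {s: n for (s, o), n in SO.items() if o == "Germany"}
print(bindings)
print(O[("Germany",)])
```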

Compression to Reduce Storage Space
Compress sequences of triples in lexicographic order (v1;v2;v3); for SPO, v1=S, v2=P, v3=O
Step 1: compute per-attribute deltas
Step 2: encode each delta triple separately in 1-13 bytes
Triples: (16,19,5356) (16,24,567) (16,24,676) (27,19,643) (27,48,10486) (50,10,10456)
Deltas:  (16,19,5356) (0,5,-4789) (0,0,109) (11,-5,-33) (0,29,9843) (23,-38,-30)
Layout of an encoded delta triple: gap bit (1 bit) | header (7 bits) | delta of value 1 (0-4 bytes) | delta of value 2 (0-4 bytes) | delta of value 3 (0-4 bytes)
When gap=1, the delta of value 3 is included in the header and all other deltas are 0
Otherwise, the header contains the length of the encoding for each of the three deltas (5*5*5=125 combinations)
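
Step 1 can be sketched as follows. The sketch covers only the per-attribute delta computation and its inverse; the byte-level encoding of Step 2 is not implemented, and the function names are hypothetical.

```python
def delta_encode(triples):
    """Replace each triple by its per-attribute difference to the previous one."""
    prev = (0, 0, 0)  # the first triple is stored as a delta against zeros
    deltas = []
    for t in triples:
        deltas.append(tuple(cur - old for cur, old in zip(t, prev)))
        prev = t
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the deltas."""
    prev = (0, 0, 0)
    triples = []
    for d in deltas:
        prev = tuple(old + diff for old, diff in zip(prev, d))
        triples.append(prev)
    return triples


triples = [(16, 19, 5356), (16, 24, 567), (16, 24, 676),
           (27, 19, 643), (27, 48, 10486), (50, 10, 10456)]
deltas = delta_encode(triples)
print(deltas)
```

Since the triples are sorted, most deltas of the leading attributes are zero or small, which is what makes the variable-length encoding of Step 2 effective.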

Compression Effectiveness and Efficiency
Byte-level encoding is almost as effective as bit-level encoding techniques (Gamma, Delta, Golomb) and much faster (10x) to decompress
Example for the Barton dataset (Neumann & Weikum 2010):
–Raw data: 51 million triples, 7GB uncompressed (as N-Triples)
–All 6 main indexes: 1.1GB size, 3.2s decompression with byte-level encoding; 1.06GB size, 42.5s decompression with Delta encoding
Additional compression with LZ77 is 2x more compact, but much slower to decompress
Compression is always done at page level

Back to the Example Query
SELECT ?a ?b ?t WHERE { ?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. OPTIONAL {?a teaches ?t} FILTER (regex(?u,"Saar")) }
[Figure: two alternative operator trees for this query, each ending in a filter and a projection.
 One plan scans POS(works_for,?u,?a), POS(works_for,?u,?b), PSO(phd_from,?a,?u) and POS(teaches,?a,?t), with joins on ?u, on ?a,?u and on ?a, using a merge join (MJ) and a hash join (HJ).
 The other plan scans POS(works_for,?u,?a), POS(phd_from,?u,?a), PSO(works_for,?u,?b) and POS(teaches,?a,?t), with joins on ?u,?a, on ?u and on ?a, using merge joins (MJ).]
Which of the two plans is better? How many intermediate results?
Core ingredients of a good query optimizer are selectivity estimators for triple patterns and joins

Selectivity Estimation for Triple Patterns
How many results will a triple pattern have?
Standard databases: per-attribute histograms, assuming independence of attributes ⇒ too simplistic and inexact
Use aggregated indexes for exact counts
Additional join statistics for blocks of triples: … (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) …
Assume independence between triple patterns; additionally precompute exact statistics for frequent paths in the data

Outline for Part II Part II.1: Foundations –A short overview of SPARQL Part II.2: Rowstore Solutions Part II.3: Columnstore Solutions Part II.4: Other Solutions and Outlook

Principles
Observations and Assumptions:
–Not too many different predicates
–Triple patterns usually have fixed predicate
–Need to access all triples with one predicate
Design consequence:
–Use one two-attribute table for each predicate
Example Systems: SWStore, MonetDB

Example: Column Stores
ex:Katja   ex:teaches ex:Databases;
           ex:works_for ex:MPI_Informatics;
           ex:PhD_from ex:TU_Ilmenau.
ex:Andreas ex:teaches ex:Databases;
           ex:works_for ex:KIT;
           ex:PhD_from ex:DERI.
ex:Ralf    ex:teaches ex:Information_Retrieval;
           ex:PhD_from ex:Saarland_University;
           ex:works_for ex:Saarland_University, ex:MPI_Informatics.

PhD_from
subject      object
ex:Katja     ex:TU_Ilmenau
ex:Andreas   ex:DERI
ex:Ralf      ex:Saarland_University

works_for
subject      object
ex:Katja     ex:MPI_Informatics
ex:Andreas   ex:KIT
ex:Ralf      ex:Saarland_University
ex:Ralf      ex:MPI_Informatics

teaches
subject      object
ex:Katja     ex:Databases
ex:Andreas   ex:Databases
ex:Ralf      ex:Information_Retrieval

Simplified Example: Query Conversion
SPARQL query:
  SELECT ?a ?b WHERE {?a works_for ?u. ?b works_for ?u. ?a phd_from ?u. }
SQL query over the per-predicate tables:
  SELECT W1.subject as A, W2.subject as B
  FROM works_for W1, works_for W2, phd_from P3
  WHERE W1.object=W2.object AND W1.subject=P3.subject AND W1.object=P3.object
So far, this is yet another relational representation of RDF. Now what are Column-Stores?
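
The converted query can be run against the per-predicate tables of the example. This is a sketch assuming SQLite, which is a row store and is used here only to check the query logic, not to demonstrate column-store behaviour; the ex: prefixes are dropped for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One two-attribute table per predicate (vertical partitioning).
conn.execute("CREATE TABLE works_for (subject TEXT, object TEXT)")
conn.execute("CREATE TABLE phd_from (subject TEXT, object TEXT)")
conn.executemany("INSERT INTO works_for VALUES (?, ?)", [
    ("Katja", "MPI_Informatics"),
    ("Andreas", "KIT"),
    ("Ralf", "Saarland_University"),
    ("Ralf", "MPI_Informatics"),
])
conn.executemany("INSERT INTO phd_from VALUES (?, ?)", [
    ("Katja", "TU_Ilmenau"),
    ("Andreas", "DERI"),
    ("Ralf", "Saarland_University"),
])

# No predicate constraints are needed: the table name already fixes the predicate.
SQL = """
SELECT W1.subject AS A, W2.subject AS B
FROM works_for W1, works_for W2, phd_from P3
WHERE W1.object = W2.object
  AND W1.subject = P3.subject
  AND W1.object = P3.object
"""
result = conn.execute(SQL).fetchall()
print(result)
```

Compared with the triple-table version, the three predicate constraints disappear, but a triple pattern with a variable predicate would have to be evaluated over every predicate table.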

Column-Stores and RDF
Columnstores store columns of a table separately.

PhD_from
subject      object
ex:Katja     ex:TU_Ilmenau
ex:Andreas   ex:DERI
ex:Ralf      ex:Saarland_University

is stored as two columns:
PhD_from:subject: ex:Katja, ex:Andreas, ex:Ralf
PhD_from:object: ex:TU_Ilmenau, ex:DERI, ex:Saarland_University

Advantages:
–Fast if only subject or object are accessed, not both
–Allows for a very compact representation
Problems:
–Need to recombine columns if subject and object are accessed
–Inefficient for triple patterns with predicate variable

Compression in Column-Stores
General ideas:
–Store subject only once
–Use same order of subjects for all columns, including NULL values when necessary
–Additional compression to get rid of NULL values
Columns before compression:
  subject: ex:Katja, ex:Andreas, ex:Ralf
  PhD_from: ex:TU_Ilmenau, ex:DERI, ex:Saarland_University, NULL
  works_for: ex:MPI_Informatics, ex:KIT, ex:Saarland_University, ex:MPI_Informatics
  teaches: ex:Databases, ex:Databases, ex:Information_Retrieval, NULL
Columns after compression:
  PhD_from: bit[1110] ex:TU_Ilmenau, ex:DERI, ex:Saarland_University
  Teaches: range[1-3] ex:Databases, ex:Databases, ex:Information_Retrieval
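
The bit-vector variant of NULL suppression can be sketched as below. This is an illustration of the idea only; the function names are hypothetical, the bitmap is kept as a string for readability, and the range[...] variant is not shown.

```python
def compress_column(values):
    """Split a column into a presence bitmap and the dense list of non-NULL values."""
    bitmap = "".join("0" if v is None else "1" for v in values)
    dense = [v for v in values if v is not None]
    return bitmap, dense


def decompress_column(bitmap, dense):
    """Rebuild the original column, re-inserting None where the bitmap has a 0."""
    it = iter(dense)
    return [next(it) if bit == "1" else None for bit in bitmap]


phd_from = ["ex:TU_Ilmenau", "ex:DERI", "ex:Saarland_University", None]
bitmap, dense = compress_column(phd_from)
print(bitmap, dense)
```

Positions in the bitmap correspond to positions in the shared subject order, so the column can be recombined with the other columns after decompression.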

Outline for Part II Part II.1: Foundations –A short overview of SPARQL Part II.2: Rowstore Solutions Part II.3: Columnstore Solutions Part II.4: Other Solutions and Outlook

Property Tables
Group entities with similar predicates in a relational table (for example using types or a clustering algorithm)
ex:Katja   ex:teaches ex:Databases;
           ex:works_for ex:MPI_Informatics;
           ex:PhD_from ex:TU_Ilmenau.
ex:Andreas ex:teaches ex:Databases;
           ex:works_for ex:KIT;
           ex:PhD_from ex:DERI.
ex:Ralf    ex:teaches ex:Information_Retrieval;
           ex:PhD_from ex:Saarland_University;
           ex:works_for ex:Saarland_University, ex:MPI_Informatics.

Property table:
subject      teaches        PhD_from
ex:Katja     ex:Databases   ex:TU_Ilmenau
ex:Andreas   ex:Databases   ex:DERI
ex:Ralf      ex:IR          ex:Saarland_University
ex:Axel      NULL           ex:TU_Vienna

"Leftover triples":
subject      predicate      object
ex:Katja     ex:works_for   ex:MPI_Informatics
ex:Andreas   ex:works_for   ex:KIT
ex:Ralf      ex:works_for   ex:Saarland_University
ex:Ralf      ex:works_for   ex:MPI_Informatics

Property Tables: Pros and Cons
Advantages:
–More in the spirit of existing relational systems
–Saves many self-joins over triple tables etc.
Disadvantages:
–Potentially many NULL values
–Multi-value attributes problematic
–Query mapping depends on schema
–Schema changes very expensive

Even More Systems… Store RDF data as matrix with bit-vector compression Convert RDF into XML and use XML methods (XPath, XQuery, …) Store RDF data in graph databases … See proceedings for pointers See also our tutorial at Reasoning Web 2011

Which technique is best?
Performance depends a lot on precomputation, optimization, implementation
Comparative results on BTC 2008 (from [Neumann & Weikum, 2009]):
[Figure: two charts comparing RDF-3X, RDF-3X (2008), COLSTORE and ROWSTORE]

Challenges and Opportunities
SPARQL with different entailment regimes ("query-time inference")
Upcoming SPARQL 1.1 features (grouping, aggregation, updates)
Ranking of results
–Efficient top-k operators
–Effective scoring methods for structured queries
Dealing with uncertain information: what is the most likely answer?
–triples with probabilities
Where is the limit for a centralized RDF store?

Backup Slides

Handling Updates
What should we do when our data changes? (SPARQL 1.1 will have updates!)
Assumptions:
–Queries far more frequent than updates
–Updates mostly insertions, hardly any deletions
–Different applications may update concurrently
Solution: Differential Indexing

Differential Updates
Staging architecture for updates in RDF-3X:
–Workspace A: triples inserted by application A
–Workspace B: triples inserted by application B
–on-demand indexes at query time, kept in main memory
[Figure: staging architecture; labels: Query by A, completion of A, completion of B]
Deletions:
–Insert the same tuple again with "deleted" flag
–Modify scan/join operators