RDF-3X: a RISC-style Engine for RDF
Thomas Neumann, Gerhard Weikum (Max-Planck-Institut für Informatik)
PVLDB '08

Presented by Somin Kim, May 25, 2011

Outline  Introduction  Background and State of the Art  Storage and Indexing  Query Processing and Optimization  Selectivity Estimates  Evaluation  Conclusion 2/30

Introduction (1/3): Motivation and Problem
• RDF (Resource Description Framework)
  – A flexible representation of schema-free information for the semantic web
  – Data is stored as triples: (subject, predicate, object) or (subject, property, value)
  – All RDF triples together can be viewed as a large graph
  – The notion of RDF triples fits well with the "pay as you go" philosophy

Introduction (2/3): Motivation and Problem
• Technical challenges for managing large-scale RDF data
  – Physical database design
  – Prediction of join attributes
  – Suitable granularity of statistics gathering
  – RDF triples form a graph rather than a collection of trees

Introduction (3/3): Contribution and Outline
• RDF-3X (RDF Triple eXpress)
  – A novel architecture for RDF indexing and querying that eliminates the need for physical database design
• Key principles of RDF-3X
  – Physical design is workload-independent
    • Achieved by creating appropriate indexes over a single, giant "triples table"
  – The query processor is RISC-style
    • It relies mostly on merge joins over sorted index lists
  – The query optimizer employs dynamic programming for plan enumeration

Outline  Introduction  Background and State of the Art –SPARQL –Related Work  Storage and Indexing  Query Processing and Optimization  Selectivity Estimates  Evaluation  Conclusion 6/30

Background and State of the Art (1/4): SPARQL
• The W3C standard query language for RDF data
• A query is a set of triple patterns; each pattern consists of S, P and O, and each of these is either a variable or a literal
    SELECT ?var1 ?var2 … WHERE { pattern1. pattern2. … }
• Two query modifiers of SPARQL
  – distinct keyword: duplicates must be eliminated
  – reduced keyword: duplicates may, but need not, be eliminated
• Example
    SELECT ?title WHERE { ?m ?title; ?c. ?c ?a. ?a "Johnny Depp" }

Background and State of the Art (2/4): Related Work
• Triple table
  – All triples are stored in a single table
    SELECT ?title WHERE { ?book ?title. ?book. ?book }
Based on JS Myoung's presentation slide

Background and State of the Art (3/4): Related Work
• Property table
  – Triples are grouped by their predicate name
[Figure: table with columns subject, property, object]

Background and State of the Art (4/4): Related Work
• Cluster-property table
  – Triples are clustered by properties that tend to be defined together

Outline  Introduction  Background and State of the Art  Storage and Indexing –Triple Store and Dictionary –Compressed Indexes –Aggregated Indexes  Query Processing and Optimization  Selectivity Estimates  Evaluation  Conclusion 11/30

Storage and Indexing (1/7): Triple Store and Dictionary
• RDF-3X is based on a single, giant "triples table"
• Mapping dictionary
  – All literals are replaced by ids using a mapping dictionary
  – This compresses the triple store, which then contains only id triples

    S          P          O
    object214  hasColor   blue
    object214  belongsTo  object352
    …          …          …

    S  P  O
    0  1  2
    0  3  4
    …  …  …

    ID  Value
    0   object214
    1   hasColor
    …   …
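
The dictionary mapping on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions (ids handed out in order of first appearance, everything in memory); the class and function names are hypothetical and are not taken from RDF-3X.

```python
from typing import Dict, List, Tuple


class MappingDictionary:
    """Bidirectional string <-> integer id mapping (illustrative only)."""

    def __init__(self) -> None:
        self.id_of: Dict[str, int] = {}
        self.value_of: List[str] = []

    def encode(self, value: str) -> int:
        # Assign the next free id the first time a string is seen.
        if value not in self.id_of:
            self.id_of[value] = len(self.value_of)
            self.value_of.append(value)
        return self.id_of[value]

    def decode(self, ident: int) -> str:
        return self.value_of[ident]


def encode_triples(triples, dictionary) -> List[Tuple[int, int, int]]:
    """Replace every string in every triple by its dictionary id."""
    return [tuple(dictionary.encode(v) for v in t) for t in triples]


raw = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
]
d = MappingDictionary()
ids = encode_triples(raw, d)
print(ids)  # [(0, 1, 2), (0, 3, 4)], the id triples shown on the slide
```

The same dictionary is used in the other direction at the end of query processing, when result ids are turned back into strings.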

Storage and Indexing (2/7): Triple Store and Dictionary
• Store all triples in a clustered B+-tree
  – Triples are sorted lexicographically
  – This allows SPARQL patterns to be converted into range scans, e.g. (literal1, literal2, ?x)
• The separate triples table is actually not needed: the clustered B+-tree itself stores the triples
[Figure: B+-tree over the id triples, next to the dictionary (ID, Value) and the id triples table (S, P, O)]
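
To illustrate why lexicographic order turns a pattern such as (literal1, literal2, ?x) into a range scan, here is a minimal sketch that uses a sorted Python list in place of the B+-tree. The function name `range_scan` is hypothetical; a real B+-tree would locate the start of the range by descending from the root rather than by binary search over an array.

```python
from bisect import bisect_left
from typing import List, Tuple

Triple = Tuple[int, int, int]


def range_scan(sorted_triples: List[Triple], prefix: Tuple[int, ...]) -> List[Triple]:
    """Return all triples whose leading components equal `prefix`."""
    # A shorter tuple sorts before all of its extensions, so bisect_left
    # lands on the first triple that could match the prefix.
    start = bisect_left(sorted_triples, prefix)
    result = []
    for triple in sorted_triples[start:]:
        if triple[:len(prefix)] != prefix:
            break  # matching triples are contiguous, so we can stop here
        result.append(triple)
    return result


spo = sorted([(0, 1, 2), (0, 3, 4), (5, 1, 2), (0, 1, 7)])
print(range_scan(spo, (0, 1)))  # pattern (0, 1, ?x)
print(range_scan(spo, (0,)))    # pattern (0, ?p, ?o)
```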

Storage and Indexing (3/7): Compressed Indexes
• A range scan relies on the variables forming a suffix of the pattern
  – e.g. (-, -, ?var) or (-, ?var1, ?var2)
• To guarantee that every possible pattern, with variables in any position such as (?var, -, -), can be answered by a single index scan, all six permutations of S, P and O are maintained in six separate indexes
  – (SPO, SOP, OSP, OPS, PSO, POS)
  – This level of redundancy is affordable
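
The idea of picking a permutation in which the constants of a pattern form a prefix can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, `None` stands for a variable, and the choice ignores cost, whereas a real optimizer would also consider which ordering is useful for later merge joins.

```python
from itertools import permutations

POSITION = {"S": 0, "P": 1, "O": 2}


def build_indexes(triples):
    """Materialise all six orderings of the (S, P, O) triples as sorted lists."""
    return {
        "".join(perm): sorted(tuple(t[POSITION[c]] for c in perm) for t in triples)
        for perm in permutations("SPO")
    }


def choose_index(s, p, o):
    """Pick an ordering whose leading columns are exactly the bound components."""
    pattern = {"S": s, "P": p, "O": o}
    bound = [c for c in "SPO" if pattern[c] is not None]
    free = [c for c in "SPO" if pattern[c] is None]
    order = "".join(bound + free)
    prefix = tuple(pattern[c] for c in bound)
    return order, prefix


indexes = build_indexes([(0, 1, 2), (0, 3, 4)])
order, prefix = choose_index(None, 1, 2)  # pattern (?var, 1, 2)
print(order, prefix, indexes[order])
```

Whatever combination of positions is bound, the returned ordering is one of the six maintained indexes, so the pattern reduces to a prefix lookup on that index.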

Storage and Indexing (4/7): Compressed Indexes
• Instead of storing full triples, only the changes between neighboring triples are stored
  – The collation order causes neighboring triples to be very similar
• A byte-level compression scheme is used
  – The algorithm computes the delta to the previous tuple
  – If the delta is small, it is encoded directly in the header byte
  – Otherwise, it writes a header byte carrying the size information, followed by the non-zero tail of the delta
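
A rough sketch of such a byte-level delta scheme is given below. The exact byte layout is an assumption made for illustration and is not claimed to match the RDF-3X on-disk format: a header below 128 holds a small gap in the third value directly, otherwise the header encodes how many bytes (0 to 4) each of the three deltas occupies. Ids are assumed to fit in 32 bits and the input is assumed to be sorted.

```python
from typing import List, Tuple

Triple = Tuple[int, int, int]


def _nbytes(x: int) -> int:
    """Number of bytes needed for x; zero needs no bytes at all."""
    return (x.bit_length() + 7) // 8


def compress(sorted_triples: List[Triple]) -> bytes:
    out = bytearray()
    prev = (0, 0, 0)
    for cur in sorted_triples:
        same1 = cur[0] == prev[0]
        same12 = same1 and cur[1] == prev[1]
        d1 = cur[0] - prev[0]
        d2 = cur[1] - prev[1] if same1 else cur[1]   # full value after a change
        d3 = cur[2] - prev[2] if same12 else cur[2]
        if same12 and d3 < 128:
            out.append(d3)  # small gap: stored in the header byte itself
        else:
            sizes = [_nbytes(d) for d in (d1, d2, d3)]
            # High bit marks the long form; 5*5*5 = 125 size combinations fit in 7 bits.
            out.append(0x80 | (sizes[0] * 25 + sizes[1] * 5 + sizes[2]))
            for d, n in zip((d1, d2, d3), sizes):
                out += d.to_bytes(n, "big")  # only the non-zero tail is written
        prev = cur
    return bytes(out)


def decompress(blob: bytes) -> List[Triple]:
    out, prev, i = [], (0, 0, 0), 0
    while i < len(blob):
        header = blob[i]
        i += 1
        if header < 0x80:
            cur = (prev[0], prev[1], prev[2] + header)
        else:
            n1, rest = divmod(header & 0x7F, 25)
            n2, n3 = divmod(rest, 5)
            deltas = []
            for n in (n1, n2, n3):
                deltas.append(int.from_bytes(blob[i:i + n], "big"))
                i += n
            d1, d2, d3 = deltas
            v1 = prev[0] + d1
            v2 = prev[1] + d2 if d1 == 0 else d2
            v3 = prev[2] + d3 if (d1 == 0 and v2 == prev[1]) else d3
            cur = (v1, v2, v3)
        out.append(cur)
        prev = cur
    return out


triples = sorted([(0, 1, 2), (0, 1, 7), (0, 3, 4), (5, 1, 2), (5, 1, 300)])
blob = compress(triples)
print(len(blob), "bytes instead of", 12 * len(triples))
```

Because each triple is encoded relative to its predecessor, decoding has to start from a known state, which is why compressing each leaf page individually matters.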

Storage and Indexing (5/7): Compressed Indexes
[Table: comparison of byte-wise compression vs. bit-wise compression for the Barton dataset]
• Each leaf page is compressed individually
  – This makes it possible to seek to any leaf page and directly start reading triples
  – The compressed index behaves just like a normal B+-tree

Storage and Indexing (6/7): Aggregated Indexes
• For many SPARQL patterns, indexing partial triples rather than full triples is sufficient
    select ?a ?c where { ?a ?b ?c }
• Aggregated indexes
  – Each aggregated index stores only two of the three columns of a triple, plus a count
    • (value1, value2, count)
    • This is done for (SP, PS, SO, OS, PO, OP)
  – In addition, there are three one-value indexes
    • (value1, count)
    • This is done for (S, P, O)
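
As a small illustration of what an aggregated index contains, the sketch below derives the (value1, value2, count) and (value1, count) entries from a list of id triples. The function name and the in-memory representation are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple


def aggregated_index(triples: List[Tuple[int, int, int]], cols: Tuple[int, ...]):
    """Project triples onto `cols` (0=S, 1=P, 2=O) and count duplicates.

    Returns sorted entries of the form (value1, [value2,] count).
    """
    counts = Counter(tuple(t[c] for c in cols) for t in triples)
    return sorted(key + (n,) for key, n in counts.items())


triples = [(0, 1, 2), (0, 3, 2), (0, 1, 7), (5, 1, 2)]
so_index = aggregated_index(triples, (0, 2))  # SO: (subject, object, count)
p_index = aggregated_index(triples, (1,))     # P:  (predicate, count)
print(so_index)
print(p_index)
```

For a query like `select ?a ?c where { ?a ?b ?c }`, the SO entries already contain the answer, and the count says how many times each (?a, ?c) pair would appear if duplicates are kept.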

Storage and Indexing (7/7)
• Triple indexes: SPO, SOP, PSO, POS, OSP, OPS
• Aggregated indexes (each with a count)
  – Two-value: SP, SO, PS, PO, OP, OS
  – One-value: S, P, O
Based on KS Kim's presentation slide

Outline  Introduction  Background and State of the Art  Storage and Indexing  Query Processing and Optimization –Translating SPARQL Queries –Optimizing Join Ordering  Selectivity Estimates  Evaluation  Conclusion 19/30

Query Processing and Optimization (1/2): Translating SPARQL Queries
• Each query is parsed and expanded into a set of triple patterns
• The parser performs dictionary lookups, so the literals are mapped to ids
• When a query consists of
  – a single triple pattern
    • The index structures answer the query with a single range scan
  – multiple triple patterns
    • The results of the individual patterns are joined
• When a query includes the distinct option, duplicates are eliminated from the result
• Finally, a dictionary lookup operator converts the resulting ids back into strings
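
Joining the results of individual patterns works well with sorted index scans because two inputs sorted on the join variable can be combined in one pass. The following is a generic textbook merge join over (key, payload) pairs, given as a sketch; it is not the RDF-3X operator itself, and the function name is illustrative.

```python
from typing import Any, List, Tuple

Row = Tuple[int, Any]


def merge_join(left: List[Row], right: List[Row]) -> List[Tuple[int, Any, Any]]:
    """Join two lists of (key, payload) pairs that are both sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, a in left[i:i_end]:
                for _, b in right[j:j_end]:
                    out.append((lk, a, b))
            i, j = i_end, j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
print(merge_join(left, right))
```

Neither input is re-sorted or hashed, which is what makes the sort orders delivered by the six triple indexes valuable.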

Query Processing and Optimization (2/2): Optimizing Join Ordering
• Required properties
  – Bushy join trees (rather than only left-deep or right-deep trees)
  – Fast plan enumeration and cost estimation
  – Extensive use of merge joins
• DP (dynamic programming) framework
  – To find the best plan, consider the possible plans for every subset of the patterns
  – Recursively compute the cost of joining subsets to obtain the cost of each plan
  – Once the best plan for a subset has been computed, store it and reuse it
  – Larger plans are created by joining optimal solutions of smaller problems
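
The DP idea can be sketched with a toy optimizer. Everything about the cost model here is an assumption made for illustration: the cost of a plan is the sum of its intermediate result sizes, a join's size is the product of the input sizes and the selectivities of the connecting join edges, and cross products are skipped. The relation names and numbers are made up, and the function is not the RDF-3X optimizer.

```python
from itertools import combinations


def best_plan(cards, sel):
    """Bottom-up DP over subsets of relations, allowing bushy join trees.

    cards: {relation: cardinality}
    sel:   {frozenset({a, b}): selectivity of the join between a and b}
    Returns (cost, cardinality, plan) for the full set, or None.
    """
    rels = sorted(cards)
    # best[subset] = (cost, cardinality, plan); single relations cost nothing.
    best = {frozenset([r]): (0.0, float(cards[r]), r) for r in rels}
    for size in range(2, len(rels) + 1):
        for combo in combinations(rels, size):
            subset = frozenset(combo)
            for k in range(1, size):
                for left_combo in combinations(combo, k):
                    left = frozenset(left_combo)
                    right = subset - left
                    if left not in best or right not in best:
                        continue
                    # Join edges that connect the two sides.
                    edges = [s for pair, s in sel.items()
                             if len(pair & left) == 1 and len(pair & right) == 1]
                    if not edges:
                        continue  # skip cross products
                    l_cost, l_card, l_plan = best[left]
                    r_cost, r_card, r_plan = best[right]
                    card = l_card * r_card
                    for s in edges:
                        card *= s
                    cost = l_cost + r_cost + card
                    # Store the cheapest plan per subset and reuse it later.
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, card, (l_plan, r_plan))
    return best.get(frozenset(rels))


cards = {"A": 1000, "B": 10, "C": 100}
sel = {frozenset("AB"): 0.01, frozenset("BC"): 0.5}
cost, card, plan = best_plan(cards, sel)
print(plan, cost, card)
```

In this example joining A with B first yields the smaller intermediate result, so the plan that joins C with (A, B) wins over the one that starts with (B, C). With four or more patterns the same enumeration also considers bushy trees, because both sides of a split may themselves be joins.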

Outline  Introduction  Background and State of the Art  Storage and Indexing  Query Processing and Optimization  Selectivity Estimates –Selectivity Histograms  Evaluation  Conclusion 22/30

Selectivity Estimates  Estimated cardinalities and selectivities have a huge impact on plan generation  Selectivity Histograms –The cardinality of a single triple pattern  Using aggregated indexes –The numbers of the join partners  Frequent join path 23/30

Outline  Introduction  Background and State of the Art  Storage and Indexing  Query Processing and Optimization  Selectivity Estimates  Evaluation –General Setup –Query Run-times  Conclusion 24/30

Evaluation (1/3): General Setup
• Setup
  – 2 GHz dual core, 2 GB RAM, 30 MB/s disk, Linux
• Competitors
  – MonetDB
    • Column-store-based approach
    • Presented at VLDB '07 by Abadi et al.
  – PostgreSQL
    • Triple store with SPO, POS, PSO indexes, similar to Sesame
  – Other approaches performed much worse
    • Jena2, Yars2 (DERI)

Evaluation (2/3): General Setup
• Datasets
  – Barton: library data, 51M triples (4.1 GB)
  – Yago: Wikipedia-based ontology, 40M triples (3.1 GB)
  – LibraryThing (partial crawl): tags that users have assigned to books, 30M triples (1.8 GB)
[Table: DB load time and DB size]

Evaluation (3/3): Query Run-times
• Average run-times for cold caches (sec)

                Barton   Yago    LibraryThing
    RDF-3X      5.9      0.7     0.89
    MonetDB     26.4     78.2    8.16
    PostgreSQL  167.8    10.6    93.90

• Average run-times for warm caches (sec)

                Barton   Yago    LibraryThing
    RDF-3X      0.4      0.04    0.13
    MonetDB     4.8      54.60   4.39
    PostgreSQL  64.3     0.56    30.40

Outline  Introduction  Background and State of the Art  Storage and Indexing  Query Processing and Optimization  Selectivity Estimates  Evaluation  Conclusion 28/30

Conclusion  RDF-3X is a fast and flexible RDF/SPARQL engine –Exhaustive but very space-efficient triple indexes –Avoids physical design tuning, generic storage –Fast runtime system, query optimization has a huge impact 29/30
