RDF-3X: a RISC-style Engine for RDF. Presented by Thomas Neumann, Gerhard Weikum, Max-Planck-Institut für Informatik, Saarbrücken, Germany. Session 19: System Centric Optimization, VLDB, 2008.


1 RDF-3X: a RISC-style Engine for RDF
Presented by Thomas Neumann, Gerhard Weikum
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Session 19: System Centric Optimization, VLDB, 2008
2009-02-05
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering
Seoul National University, Seoul, Korea

2 Overview
- Goal
  - Building a new type of triple store => RDF-3X
  - Comparing RDF-3X with traditional ones
- In this presentation
  - Focusing on the physical storage design, which affected the entire implementation of the system
Copyright © 2009 by CEBT, Center for E-Business Technology

3 Introduction
- RDF: Resource Description Framework
  - Conceptually a labeled graph
  - In RDF, all data items are represented in the form (subject, predicate, object), also known as (subject, property, value)
  - RDF data can be seen as a (potentially huge) set of triples
  S    P    O
  S1   P1   O1
  S1   P2   O2
  ...  ...  ...
2009 IDS Lab. Winter Seminar

4 Introduction
- SPARQL: SPARQL Protocol and RDF Query Language
  - The official standard for querying RDF stores
  - Example: retrieve the titles of all movies with Johnny Depp
  - SPARQL queries are pattern-matching queries over the triples stored in the RDF store
  - Each pattern consists of S, P, O, and each of these is either a variable or a literal
  S    P    O
  S1   P1   O1
  S1   P2   O2
  ...  ...  ...
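
The pattern-matching view of SPARQL described on this slide can be illustrated with a minimal sketch. It assumes that variables are strings starting with `?` and that everything else is a constant. The sample data, the predicate names `hasTitle` and `hasActor`, and the helpers `match_pattern` and `evaluate` are hypothetical and are not taken from the slides or from RDF-3X. The nested-loop evaluation is the naive approach, not what an optimized engine would do.

```python
def match_pattern(pattern, triple, bindings):
    """Try to match one (S, P, O) pattern against one triple.

    Returns the extended variable bindings, or None on a mismatch.
    """
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it against an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must match the triple exactly.
            return None
    return result


def evaluate(patterns, triples):
    """Evaluate a conjunction of triple patterns by naive nested matching."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match_pattern(pattern, triple, bindings)) is not None
        ]
    return solutions


# Hypothetical data: two movies, only one of them with Johnny Depp.
triples = [
    ("movie1", "hasTitle", "Movie One"),
    ("movie1", "hasActor", "Johnny Depp"),
    ("movie2", "hasTitle", "Movie Two"),
    ("movie2", "hasActor", "Someone Else"),
]
query = [("?m", "hasTitle", "?title"), ("?m", "hasActor", "Johnny Depp")]
titles = [s["?title"] for s in evaluate(query, triples)]
print(titles)  # ['Movie One']
```

The shared variable `?m` is what turns two patterns into a join: both patterns must bind it to the same subject.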

5 Physical Designs for RDF Storage (1/4)
- Giant Triples Table
  SELECT ?title WHERE { ?book ?title. ?book. ?book }
- Join! Join!
- Entire Table Scan!
- Redundancy!

6 Physical Designs for RDF Storage (2/4)
- Clustered Property Table
  - Contains clusters of properties that tend to be defined together

7 Physical Designs for RDF Storage (3/4)
- Property-Class Table
  - Exploits the type property of subjects to cluster similar sets of subjects together in the same table
  - Unlike the clustered property table, a property may exist in multiple property-class tables
  - Figure callout: values of the type property

8 Physical Designs for RDF Storage (4/4)
- Vertically Partitioned Table
  - The giant table is rewritten into n two-column tables, where n is the number of unique properties in the data
  - We don't have to
    - Maintain null values
    - Run a clustering algorithm
  - Figure labels: subject, property, object
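
The rewrite into n two-column tables can be sketched as follows. This is a minimal in-memory illustration; the function name `vertically_partition` and the sample data are hypothetical. Sorting each table by subject is an assumption added here for convenience and is not stated on the slide.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Split a triples table into one (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    # Assumption: keep each two-column table sorted by subject.
    return {p: sorted(rows) for p, rows in tables.items()}


# Hypothetical data: book2 has no author.
triples = [
    ("book1", "title", "Title A"),
    ("book1", "author", "Author X"),
    ("book2", "title", "Title B"),
]
tables = vertically_partition(triples)
print(len(tables))       # 2, one table per unique property
print(tables["author"])  # [('book1', 'Author X')]
```

Because `book2` has no author, it simply does not appear in the `author` table, so no null values need to be stored.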

9 RDF-3X
- Technical Challenges
  - The diversity of predicate names poses a major problem for physical database design
    - Join, Redundancy, ...
- RDF-3X (RDF Triple eXpress)
  - A novel architecture for RDF indexing and querying, eliminating the need for physical database design

10 Mapping Dictionary
- Replacing all literals by unique IDs using a mapping dictionary
  - RDF-3X is based on a single "giant triples table", but the mapping dictionary compresses the triple store
    - Reduced redundancy, saving a lot of physical space
  Triples (strings):
  S          P          O
  object214  hasColor   blue
  object214  belongsTo  object352
  ...        ...        ...
  Triples (IDs):
  S    P    O
  0    1    2
  0    3    4
  ...  ...  ...
  Mapping dictionary:
  ID   Value
  0    object214
  1    hasColor
  ...  ...
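
The ID replacement on this slide can be sketched in a few lines. This is a minimal in-memory illustration that assigns IDs in order of first appearance, which reproduces the IDs in the slide's example. The function name `encode_triples` is hypothetical, and the slides do not describe how RDF-3X actually stores its dictionary.

```python
def encode_triples(triples):
    """Replace every string in the triples by an integer ID."""
    ids = {}  # value -> ID, assigned in order of first appearance
    encoded = []
    for triple in triples:
        encoded.append(tuple(ids.setdefault(v, len(ids)) for v in triple))
    values = list(ids)  # ID -> value, valid because dicts keep insertion order
    return encoded, values


triples = [
    ("object214", "hasColor", "blue"),
    ("object214", "belongsTo", "object352"),
]
encoded, values = encode_triples(triples)
print(encoded)    # [(0, 1, 2), (0, 3, 4)]
print(values[1])  # hasColor

# Decoding reverses the mapping.
decoded = [tuple(values[i] for i in t) for t in encoded]
```

The repeated string `object214` is stored once in the dictionary and appears in the triples only as the integer 0, which is where the space saving comes from.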

11 Clustered B+-Tree
- Store everything in a clustered B+-tree
  - Triples are sorted in lexicographical order
    - Allowing the conversion of SPARQL patterns into range scans
  - We don't have to scan the entire table
  Figure (B+-tree node labels): 002 ...; 000, 001, 002, 003
  S    P    O
  0    1    2
  0    3    4
  ...  ...  ...
  Callout: "Actually, we don't need this table!"
  ID   Value
  0    object214
  1    hasColor
  ...  ...
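
How a pattern with a constant prefix turns into a range scan can be sketched with a sorted list and binary search. The sorted Python list is only a stand-in for the leaf level of a clustered B+-tree, the helper `range_scan` is hypothetical, and the sketch assumes integer IDs as produced by a mapping dictionary.

```python
import bisect


def range_scan(sorted_triples, prefix):
    """Return all triples that start with `prefix`.

    `sorted_triples` must be sorted lexicographically; `prefix` is a tuple
    of one to three integer IDs (an empty prefix returns everything).
    """
    if not prefix:
        return list(sorted_triples)
    # First triple that is >= the prefix.
    lo = bisect.bisect_left(sorted_triples, prefix)
    # First triple that is >= the next prefix in lexicographical order.
    upper = prefix[:-1] + (prefix[-1] + 1,)
    hi = bisect.bisect_left(sorted_triples, upper)
    return sorted_triples[lo:hi]


spo = sorted([(0, 1, 2), (0, 3, 4), (5, 1, 2), (0, 1, 6)])
print(range_scan(spo, (0, 1)))  # [(0, 1, 2), (0, 1, 6)]
print(range_scan(spo, (0,)))    # [(0, 1, 2), (0, 1, 6), (0, 3, 4)]
```

Only the matching slice is touched, which is why no full table scan is needed as long as the constants form a prefix of the sort order.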

12 Exhaustive Indexing
- So far we relied on the fact that the variables form a suffix of the pattern
  - (-, -, ?var), (-, ?var1, ?var2)
  - But what about (?var, -, -)?
- To guarantee that every possible pattern, with variables in any position of the pattern triple, can be answered by a single index scan, we maintain all six possible permutations of S, P, and O in six separate indexes
  - (SPO, SOP, OSP, OPS, PSO, POS)
  - We can afford this level of redundancy
  - On all experimental datasets, the total size of all indexes together is less than the original data
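
The idea of picking, for any pattern, a permutation in which the constants form a prefix can be sketched as follows. Sorted lists again stand in for the six clustered B+-trees, and the names `build_indexes`, `choose_order`, and `lookup` are hypothetical. When several permutations would work, this sketch simply picks one; a real optimizer may prefer a particular order for later joins.

```python
import bisect
from itertools import permutations

ORDERS = ["".join(p) for p in permutations("SPO")]  # SPO, SOP, PSO, POS, OSP, OPS
POS = {"S": 0, "P": 1, "O": 2}


def build_indexes(triples):
    """Sort the (S, P, O) ID triples once per permutation."""
    return {
        order: sorted(tuple(t[POS[c]] for c in order) for t in triples)
        for order in ORDERS
    }


def choose_order(pattern):
    """Pick a permutation in which the bound positions come first.

    `pattern` maps "S", "P", "O" to an integer ID, or to None for a variable.
    """
    bound = [c for c in "SPO" if pattern[c] is not None]
    free = [c for c in "SPO" if pattern[c] is None]
    return "".join(bound + free)


def lookup(indexes, pattern):
    """Answer one triple pattern with a single range scan."""
    order = choose_order(pattern)
    index = indexes[order]
    prefix = tuple(pattern[c] for c in order if pattern[c] is not None)
    if not prefix:
        return order, list(index)
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_left(index, prefix[:-1] + (prefix[-1] + 1,))
    return order, index[lo:hi]  # rows are in the chosen column order


indexes = build_indexes([(0, 1, 2), (0, 3, 4), (5, 1, 2)])
# Pattern (?var, 1, 2): the variable is not a suffix in SPO order.
order, rows = lookup(indexes, {"S": None, "P": 1, "O": 2})
print(order, rows)  # POS [(1, 2, 0), (1, 2, 5)]
```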

13 Moreover, ...
- Aggregated Indexes
  - Sometimes we don't need the full triple
    - Is there a connection between obj4 and obj13?
    - How many authors does object14 have?
  - Therefore, maintain aggregated indexes with (value1, value2, count)
    - (value1, value2) => (SP, PS, SO, OS, PO, OP)
    - We can use clustered B+-trees
- Other Features
  - Join ordering
  - Selectivity estimation
  - ...
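
An aggregated index of (value1, value2, count) entries can be sketched as a projection onto two positions followed by counting. The helper `aggregated_index` and the ID assignments in the sample data (14 for object14, 1 for an author predicate, and so on) are hypothetical and chosen only to mirror the two questions on the slide.

```python
from collections import Counter

POS = {"S": 0, "P": 1, "O": 2}


def aggregated_index(triples, first, second):
    """Build a sorted (value1, value2, count) index over two of S, P, O."""
    counts = Counter((t[POS[first]], t[POS[second]]) for t in triples)
    return sorted((a, b, n) for (a, b), n in counts.items())


# Hypothetical ID triples: subject 14 has two values for predicate 1.
triples = [(14, 1, 100), (14, 1, 101), (14, 2, 7), (4, 3, 13)]

sp = aggregated_index(triples, "S", "P")
print(sp)  # [(4, 3, 1), (14, 1, 2), (14, 2, 1)]

# "How many authors does object14 have?" reads one entry of the SP index.
author_count = next(n for s, p, n in sp if (s, p) == (14, 1))

# "Is there a connection between obj4 and obj13?" reads the SO index.
so = aggregated_index(triples, "S", "O")
connected = any((s, o) == (4, 13) for s, o, _ in so)
```

Both questions are answered without looking at the third component of any triple, which is the point of keeping these smaller indexes.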

14 An Experimental Setup
- Setup
  - 2 GHz dual core, 2 GB RAM, 30 MB/s disk, Linux
- Competitors
  - MonetDB
    - Column-store-based (vertically partitioned) approach
    - Presented at VLDB 2007 by Abadi et al.
  - PostgreSQL
    - Triple store with SPO, POS, PSO indexes, similar to Sesame
  - Other approaches performed much worse
    - Jena2, Yars2 (DERI), ...
- Datasets
  - Barton, library data, 51 million triples (4.1 GB)
  - Yago, Wikipedia-based ontology, 40 million triples (3.1 GB)
  - LibraryThing (partial crawl), users tag books, 30 million triples (1.8 GB)
- Benchmark queries (7 or 8 per dataset): see appendix

15 DB Load Time & DB Size
DB Load Time (min.)
              Barton  Yago  LibThing
  RDF-3X      13      25    20
  MonetDB     11      21    4
  PostgreSQL  30      25    20
DB Size (GB)
              Barton  Yago  LibThing
  RDF-3X      2.8     2.7   1.6
  MonetDB     1.6     1.1   0.7
  PostgreSQL  8.7     7.5   5.7
Slide annotations: "Good", "Bad!", and "After running the benchmark: 2.0, 2.4, 6.9"

16 Query Run-times
Average run-times for warm (cold) cache (sec.)
              Barton        Yago         LibThing
  RDF-3X      0.4 (5.9)     0.04 (0.7)   0.13 (0.89)
  MonetDB     4.8 (26.4)    54.6 (78.2)  4.39 (8.16)
  PostgreSQL  64.3 (167.8)  0.56 (10.6)  30.4 (93.9)

17 Conclusion
- RDF-3X (RDF Triple eXpress) is a fast and flexible RDF/SPARQL engine
  - Exhaustive but very space-efficient triple indexes
  - Avoids physical design tuning, generic storage
  - Fast runtime system, query optimization has a huge impact
- RDF-3X is freely available
  - http://www.mpi-inf.mpg.de/~neumann/rdf3x

18 Paper Evaluation
- Pros
  - Good idea
  - Introduces and solves optimization issues
  - Implementation
- My Comments
  - Real examples about optimization issues
  - RISC-style?
    - Most operators merely process integer-encoded IDs, consume and produce streams of ID tuples, compare IDs, etc... ??
  - Insert & Update & Delete?
  - Namespace

