Comparing path-based and vertically-partitioned RDF databases Preetha Lakshmi & Chris Mueller 12/10/2007 CSCI 8715 Shashi Shekhar.

Motivation
- Need for efficient storage of structured data
  - Semantic Web
  - Libraries, scientific databases, industry
  - Social networks

RDF Schema (figure: schema and instance)

RDF Schema (figure: RDF triples)

Related Work
- Triple store
- Property tables
- Class property tables
- Dynamic table model
- Vertically partitioned tables (Abadi et al., 2007)
- Path-based approach (Matono et al., 2005)

Vertical Partitioning
A table is created for each property.

First
Subject   Object
'r1'      'Picasso'
'r4'      'August'

Last
Subject   Object
'r1'      'Picasso'
'r4'      'Rodin'

Paints
Subject   Object
'r1'      'r2'
'r1'      'r3'

... etc.
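The one-table-per-property layout can be sketched as follows. This is a minimal illustration, not the authors' setup: it uses Python's built-in sqlite3 module as a stand-in for Oracle 10g, the function name build_vertical_partition is hypothetical, and the rows are the ones listed on the slide.

```python
import sqlite3
from collections import defaultdict

# (subject, property, object) triples, copied from the slide's example tables.
triples = [
    ("r1", "first", "Picasso"),
    ("r4", "first", "August"),
    ("r1", "last", "Picasso"),
    ("r4", "last", "Rodin"),
    ("r1", "paints", "r2"),
    ("r1", "paints", "r3"),
]


def build_vertical_partition(conn, triples):
    """Create one (subject, object) table per property and load the triples.

    Hypothetical helper for illustration; returns the sorted table names.
    """
    by_property = defaultdict(list)
    for subject, prop, obj in triples:
        by_property[prop].append((subject, obj))
    for prop, rows in by_property.items():
        # Table names cannot be bound as parameters, so only accept
        # plain identifiers before interpolating them into the SQL.
        if not prop.isidentifier():
            raise ValueError(f"unsafe property name: {prop!r}")
        conn.execute(f'CREATE TABLE "{prop}" (subject TEXT, object TEXT)')
        conn.executemany(f'INSERT INTO "{prop}" VALUES (?, ?)', rows)
    return sorted(by_property)


conn = sqlite3.connect(":memory:")
tables = build_vertical_partition(conn, triples)
paints_rows = conn.execute(
    'SELECT subject, object FROM "paints" ORDER BY object'
).fetchall()
```

The number of tables grows with the number of distinct properties, which is one of the validation parameters the comparison tracks.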

Path-based Model
Path signatures relate to instance data.

Path
pathid   pathexp
1        ''
2        '#first'
3        '#last'
4        '#paints'
5        '#title<#paints'
6        '#sculpts'
7        '#title<#sculpts'

Resource
name        pathid   root
'r1'        1        'r1'
'r2'        4        'r1'
'r3'        4        'r1'
'r4'        1        'r4'
'Picasso'   2        'r1'
'Pablo'     3        'r1'
'August'    2        'r4'
'Rodin'     3        'r4'
...

Our enhancement
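One way to derive tables of this shape from triples is a breadth-first walk from each key node, recording the path signature of every resource reached. The sketch below rests on several assumptions: the function name build_path_tables and the max_depth bound are hypothetical, the 'Guernica' title row is invented to exercise a two-step path (the slide elides those rows), and the signature is written newest-property-first with '<' to match the pathexp values shown above.

```python
from collections import defaultdict, deque

triples = [
    ("r1", "first", "Picasso"),   # values as listed in the slide's Resource table
    ("r1", "last", "Pablo"),
    ("r1", "paints", "r2"),
    ("r1", "paints", "r3"),
    ("r2", "title", "Guernica"),  # hypothetical row, not on the slide
    ("r4", "first", "August"),
    ("r4", "last", "Rodin"),
]


def build_path_tables(triples, roots, max_depth=3):
    """Return (path_ids, resources) for the given key nodes.

    path_ids maps pathexp -> pathid; resources is a list of
    (name, pathid, root) rows. max_depth bounds the walk so that
    cyclic data cannot produce unbounded path expressions.
    """
    edges = defaultdict(list)
    for subject, prop, obj in triples:
        edges[subject].append((prop, obj))

    path_ids = {}
    resources = []
    for root in roots:
        queue = deque([(root, "", 0)])
        while queue:
            node, expr, depth = queue.popleft()
            # Assign the next free id the first time an expression is seen.
            pathid = path_ids.setdefault(expr, len(path_ids) + 1)
            resources.append((node, pathid, root))
            if depth >= max_depth:
                continue
            for prop, obj in edges.get(node, ()):
                child = f"#{prop}" if expr == "" else f"#{prop}<{expr}"
                queue.append((obj, child, depth + 1))
    return path_ids, resources


paths, resources = build_path_tables(triples, roots=["r1", "r4"])
```

Storing the root alongside each resource is what lets a later query restrict a path to one starting node without joining back through intermediate resources.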

Problem Statement
Given:
- A set of RDF triples
- Vertical partitioning storage model
- Path-based storage model
Find: Query plans for the various categories of queries under these two storage schemes.
Objective: To determine which query types perform comparatively better or worse in the two storage models.
Why is the problem hard? Different application domains use RDF, so a generic storage scheme should support a diverse workload.

Contributions
- Identification of benchmark queries: schema, instance, path, and aggregate queries
- Enhancement to the path-based schema that addresses different types of workloads
- Comparison of the path-based model and vertical partitioning
- Analysis of cyclic queries

Query Types
Schema queries
- find all types of artists
- list all property names
- list nodes with 2 or more descendants
- find the transitive sub-classes of the class 'sculpture'
- list properties with 2 or more descendants
Instance queries
- find the titles of all paintings by Picasso
- select all nodes within one edge-length of r4
- list all the properties of node r4
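As an illustration of how an instance query maps onto vertically partitioned tables, the sketch below answers "find the titles of all paintings by Picasso" with one join per property step. It is an assumption-laden example rather than the authors' query plan: SQLite stands in for Oracle 10g, the title table and its 'Guernica' row are hypothetical, and the artist is matched on the Last table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "last" (subject TEXT, object TEXT);
CREATE TABLE paints (subject TEXT, object TEXT);
CREATE TABLE title  (subject TEXT, object TEXT);
INSERT INTO "last" VALUES ('r1', 'Picasso'), ('r4', 'Rodin');
INSERT INTO paints VALUES ('r1', 'r2'), ('r1', 'r3');
INSERT INTO title  VALUES ('r2', 'Guernica');  -- hypothetical row
""")

# Two joins: last -> paints (same subject), then paints -> title
# (the painted object is the subject of the title triple).
query = """
SELECT t.object
FROM "last" AS l
JOIN paints AS p ON p.subject = l.subject
JOIN title  AS t ON t.subject = p.object
WHERE l.object = ?
ORDER BY t.object
"""
titles = [row[0] for row in conn.execute(query, ("Picasso",))]
```

Each additional property on the path adds another join, which is why the number of joins is a natural measure for comparing the two schemes.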

Query Types
Path queries
- find the title of any painting painted by anyone
- display all the titles of work done by artists
- find the names of all the sculptors
...with constraint on intermediate node
- find an artist's name where the artifact is a painting
...with terminal node constraints
- display all the titles of work done by Picasso
Connection queries
- list all the properties of node r4
- is there a connection between 'Picasso' and 'Guernica'?
Diameter queries
- select all nodes in the graph within one edge-length of r4
Non-simple path queries
- detect loops in the dataset starting at 'Picasso'
- detect loops in the whole dataset
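For comparison, a path query against the path-based tables can be sketched as a lookup on the path expression, optionally narrowed by the root column. This is a minimal sketch under stated assumptions: SQLite replaces Oracle 10g, the helper names_on_path is hypothetical, and the 'Guernica' row is invented because the slide elides the title rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE path (pathid INTEGER PRIMARY KEY, pathexp TEXT);
CREATE TABLE resource (name TEXT, pathid INTEGER, root TEXT);
INSERT INTO path VALUES
  (1, ''), (2, '#first'), (3, '#last'), (4, '#paints'), (5, '#title<#paints');
INSERT INTO resource VALUES
  ('r1', 1, 'r1'), ('r2', 4, 'r1'), ('r3', 4, 'r1'), ('r4', 1, 'r4'),
  ('Picasso', 2, 'r1'), ('Pablo', 3, 'r1'),
  ('August', 2, 'r4'), ('Rodin', 3, 'r4'),
  ('Guernica', 5, 'r1');  -- hypothetical row
""")


def names_on_path(conn, pathexp, root=None):
    """Return resource names reached by pathexp, optionally from one root."""
    sql = (
        "SELECT r.name FROM resource AS r "
        "JOIN path AS p ON p.pathid = r.pathid "
        "WHERE p.pathexp = ?"
    )
    params = [pathexp]
    if root is not None:
        sql += " AND r.root = ?"
        params.append(root)
    sql += " ORDER BY r.name"
    return [row[0] for row in conn.execute(sql, params)]


# "find the title of any painting painted by anyone"
all_titles = names_on_path(conn, "#title<#paints")
# Titles reachable from key node r1 only, using the root column.
r1_titles = names_on_path(conn, "#title<#paints", root="r1")
r4_titles = names_on_path(conn, "#title<#paints", root="r4")
```

The query touches two tables regardless of path length, since the whole path is encoded in pathexp.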

Query Types
Aggregate queries
- find all nodes with 2 or more properties
- list all subjects that have two instances of a single property
Relationship queries
- find any relationship between r1 and r4
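The two aggregate queries can be sketched against vertically partitioned tables as follows. The example is illustrative only: SQLite stands in for Oracle 10g, the property_tables list is a hypothetical stand-in for a catalog lookup, and "two instances" is read here as "at least two".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "first" (subject TEXT, object TEXT);
CREATE TABLE "last"  (subject TEXT, object TEXT);
CREATE TABLE paints  (subject TEXT, object TEXT);
INSERT INTO "first" VALUES ('r1', 'Picasso'), ('r4', 'August');
INSERT INTO "last"  VALUES ('r1', 'Picasso'), ('r4', 'Rodin');
INSERT INTO paints  VALUES ('r1', 'r2'), ('r1', 'r3');
""")
property_tables = ["first", "last", "paints"]  # hypothetical fixed list

# "find all nodes with 2 or more properties": every property table
# contributes one UNION ALL branch before grouping by subject.
union_sql = " UNION ALL ".join(
    f"SELECT subject, '{name}' AS prop FROM \"{name}\""
    for name in property_tables
)
multi_property = [
    row[0]
    for row in conn.execute(
        f"SELECT subject FROM ({union_sql}) GROUP BY subject "
        "HAVING COUNT(DISTINCT prop) >= 2 ORDER BY subject"
    )
]

# "list all subjects that have two instances of a single property":
# each property table can be checked on its own.
repeated = []
for name in property_tables:
    rows = conn.execute(
        f'SELECT subject FROM "{name}" GROUP BY subject '
        "HAVING COUNT(*) >= 2 ORDER BY subject"
    )
    repeated.extend((subject, name) for (subject,) in rows)
```

The first query must visit every property table, while the second stays within one table at a time.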

Assumptions
- A small dataset is used, on the assumption that the number of joins and the efficiency of the queries will not change significantly with larger datasets
- No explicit storage of the RDF schema in the vertically-partitioned scheme
- INSERT, UPDATE, and DELETE are insignificant compared to SELECT
- Key nodes in the path-based model are well-defined
  - In practice, key nodes would be generated dynamically after user load analysis

Experimental Process
- Validation parameters
  - Nodes
  - Edges
  - Number of joins
  - Number of tables
  - CPU cost
  - Storage bytes
- Set up both schemes in Oracle 10g for the RDF graph shown earlier
- Materialized path lengths in the path-based scheme
- Generated query plans
- Analyzed queries based on the validation parameters
- Cycle queries: joins are not supported

Conclusions & Observations
- Vertical partitioning
  - Performs well for short path lengths and terminal node constraints
  - Offers storage benefits for instance queries without path expressions
- Enhanced path-based model
  - Performs well for schema queries, path queries, and cycle queries
- Queries that the original path-based model could not address and the enhanced model can answer:
  - Connection queries and diameter queries
  - Path queries with intermediate node constraints

Conclusion (Cont'd)
- Both schemes show the same performance on instance queries without path expressions
- Neither scheme addresses relationship queries
- Interesting results for cycle queries
  - Specifying the start node performs worse than leaving it unspecified
  - Specifying the start node uses Oracle Filter

Future Work
- Test large and diverse datasets
- Test vertical partitioning with a column-oriented database like MonetDB
- Pruning strategies for cycle queries
- Impose join indexes
- Find approaches to answer relationship queries
- Storage classification based on the application domain

Thank You
Questions?
Please see http://www.cs.umn.edu/~cmueller/cs8715 for a copy of the report that accompanies this presentation, including a full bibliography.
