GRIN: A Graph Based RDF Index Octavian Udrea 1 Andrea Pugliese 2 V. S. Subrahmanian 1 1 University of Maryland College Park 2 Università di Calabria.

GRIN: A Graph Based RDF Index Octavian Udrea 1 Andrea Pugliese 2 V. S. Subrahmanian 1 1 University of Maryland College Park 2 Università di Calabria

2 Motivation Plenty of large RDF datasets:  TAP, GovTrack, ChefMoz, CIA World Factbook  Many many more (see rdfdata.org) Query languages: RDQL, RQL, SPARQL DB systems: Jena, Sesame, RDFBroker Indexing?  Based on relational database indexes  Has to be rooted in the characteristics of the query language

Contributions Lightweight mechanism for indexing large RDF datasets  GRIN: Graph-based RDF INdex Query answer algorithms for SPARQL-like queries Evaluation on two real-world datasets: TAP (Stanford) and ChefMoz (chefmoz.org) 3

Outline RDF data and queries The GRIN Index structure Answering queries Experimental evaluation 4

RDF graph example (ChefMoz) 5

RDF query example 6

Query example in SPARQL 7 X SELECT ?v1 ?v2 ?v3 WHERE { {(?v1 attire ?v3). (?v1 cuisine Italian)} {(?v2 attire ?v3). (?v2 cuisine Italian). (?v2 location Norfolk)} {(Norfolk locatedIn NE/USA)} } FROM ChefMoz

Native RDF systems: Jena2 Stores RDF as (subject, property, value) in a relational table Indexes on each of the three attributes Translates SPARQL/RDQL into SQL 8 X 6 self-joins

Native RDF systems: Sesame Broekstra et al., ISWC 2002 The Sesame SAIL API improves on Jena:  Supports RDF Schema inference  Separates RDFS from the triple table  Supports database schema generation based on the underlying RDF schema of a dataset The problem of too many joins remains 9

Native RDF systems: RDFBroker Sintek et al., ESWC 2006 The database schema is built based on signatures – the set of properties used on a resource Reduces the number of joins between tables 10

The human perspective 11

GRIN intuition Resources “closer” in the RDF graph are more likely to be part of the same answer  Hence they should appear on the same page GRIN will group resources in circles around selected center resources Query evaluation:  Find the smallest circle that contains the answer  Evaluate query only on resources in that circle 17

The GRIN Index structure GRIN is a binary tree in which:  Leaf nodes are sets of resources (and the associated triples)  Inner nodes are circles consisting of a center resource and a radius  Each node is fully contained in its parent Distance metric: shortest path distance in the undirected graph 18

Building the index: clustering 19

Building the index: clustering Standard k-medoids clustering (Kaufman & Rousseeuw, 1987) How many clusters?  R is the set of resources  M is the maximum number of resources per page Average link gives the best performance for the inter-cluster distance 23

Building the index: the tree 24

Queries to constraints Extract constraints from the query:  d(?v1, Italian) ≤ 1  d(?v2, Norfolk) ≤ 1  d(?v3, Italian) ≤ 2  …and so on 28

Query evaluation 29 Goal: identify the smallest circle that is guaranteed to contain an answer to the query 1. Perform a depth-first traversal 2. For each index node, evaluate the constraints 3. If the constraints guarantee an answer, perform subgraph matching

Query evaluation 30

Evaluating constraints Constraints:  d(?v1, Italian) ≤ 1, d(?v2, Norfolk) ≤ 1, d(?v3, Italian) ≤ 2 Question: is ?v1 in the circle (Grivanti, 3)?  d(Grivanti,?v1) ≤ d(Grivanti, Italian) + d(?v1, Italian) ≤ 1 + 1 = 2  ?v1 must be in the circle (Grivanti, 3) 31

Evaluating constraints Question: is ?v3 in (Grivanti, 3)?  d(Grivanti, ?v3) ≤ d(Grivanti, Italian) + d(Italian, ?v3) ≤ 1 + 2 = 3  ?v3 must be in (Grivanti, 3)  Similarly, ?v2 is in the same circle 32

Subgraph matching Perform subgraph matching on the resources in the circles guaranteed to contain an answer  Algorithm by Cordella et. al, IEEE PAMI 26(10), 2006 Worst-time complexity of O(N!)  Where N is the maximum number of nodes in either graph  In practice, GRIN makes N very small 33

Experimental framework Comparison between GRIN, Sesame, Jena2 and RDFBroker (in-memory)  Index build time  Memory consumption at query time  Query time Two real-world datasets:  TAP (Stanford): datasets between 1.5MB and 300MB  ChefMoz (chefmoz.org): 220 MB 35

Index build time 36

Memory consumption 37

Query time 38

Average degree of a query node 39

Conclusions Method for indexing large RDF graphs adapted to the characteristics of RDF queries Avoids expensive join operations Gives better query times than Jena2, Sesame and RDFBroker Current and future work:  Disk-based index  Analysis of overlap and coverage 40

GRIN: A Graph Based RDF Index Octavian Udrea 1 Andrea Pugliese 2 V. S. Subrahmanian 1 1 University of Maryland College Park 2 Università di Calabria.

Similar presentations

Presentation on theme: "GRIN: A Graph Based RDF Index Octavian Udrea 1 Andrea Pugliese 2 V. S. Subrahmanian 1 1 University of Maryland College Park 2 Università di Calabria."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

GRIN: A Graph Based RDF Index Octavian Udrea 1 Andrea Pugliese 2 V. S. Subrahmanian 1 1 University of Maryland College Park 2 Università di Calabria.

Similar presentations

Presentation on theme: "GRIN: A Graph Based RDF Index Octavian Udrea 1 Andrea Pugliese 2 V. S. Subrahmanian 1 1 University of Maryland College Park 2 Università di Calabria."— Presentation transcript:

Similar presentations

About project

Feedback