GRIN – A Graph Based RDF Index Octavian Udrea Andrea Pugliese V. S. Subrahmanian Presented by Tulika Thakur.

Presentation transcript:

- Indexing mechanism for graph-based queries.
- GRIN: a tree data structure.
- Large RDF datasets used: TAP, ChefMoz.
- Comparison with DB systems: Jena, Sesame, RDFBroker.
- Measured parameters: 1) size of the index, 2) time taken to answer graph queries, 3) time taken to build the index.

Outline:
- RDF graph queries
- The GRIN index structure
- Query answering
- Experimental evaluation

RDF Graph Example (extracted from the ChefMoz dataset)

RDF Graph Representation
- U denotes a set whose elements are called URI references.
- L denotes a set whose elements are called literals.
- U_p ⊆ U denotes the set of properties.
- R = U ∪ L denotes the set of resources.
- An RDF triple has the form (s, p, v), where s ∈ U, p ∈ U_p, v ∈ R.

Introduction to P-path
Given an RDF graph D and a set P ⊆ U_p, a P-path in D is a set {e_1, ..., e_q}, with e_j = (s_j, p_j, v_j), such that:
- ∀ j ∈ [1, q]: e_j ∈ D;
- ∀ j ∈ [1, q − 1]: v_j = s_{j+1};
- ∀ j ∈ [1, q]: p_j ∈ P.
Intuitively, a P-path is a path in the RDF graph whose edge labels are all drawn from the set P.
For example, let P = {location, locatedIn}. The triples (ColdStone, location, Lincoln) and (Lincoln, locatedIn, NE/USA) constitute a P-path of length two in the graph.

Introduction to P-path
P = {location, locatedIn}
d(ColdStone, NE/USA) = 2
Triples: (ColdStone, location, Lincoln) and (Lincoln, locatedIn, NE/USA)
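The P-path definition can be illustrated with a small breadth-first search. This is only a sketch: the function name `shortest_p_path` is hypothetical, the third triple is made up to show an edge whose label lies outside P, and the search follows edge direction exactly as the definition above states (v_j = s_{j+1}).

```python
from collections import deque

def shortest_p_path(triples, allowed, source, target):
    """Length of the shortest directed P-path from source to target, or None."""
    # Keep only edges whose property label is in the allowed set P.
    adjacency = {}
    for s, p, v in triples:
        if p in allowed:
            adjacency.setdefault(s, []).append(v)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

triples = [
    ("ColdStone", "location", "Lincoln"),
    ("Lincoln", "locatedIn", "NE/USA"),
    ("ColdStone", "cuisine", "IceCream"),  # hypothetical triple, label not in P
]
P = {"location", "locatedIn"}
print(shortest_p_path(triples, P, "ColdStone", "NE/USA"))  # 2
```

Restricting P to {location} alone removes the second hop, so the same call would return None.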

RDF Graph Query
An RDF graphical query is a 4-tuple (N, V, E, λ_n) where:
- N is a set of vertices;
- V is a set of variables;
- E = E_s ∪ E_d is a set of edges, where E_s ⊆ N × N × (V ∪ U_p) and E_d ⊆ N × N × 2^{U_p} × ℕ. We call E_s the set of single edges and E_d the set of double edges;
- λ_n : N → R ∪ V is a vertex labeling function.

RDF Graph Query
The query can be expressed in SPARQL as:
SELECT ?v1 ?v2 ?v3 WHERE {
  {(?v1 attire ?v3). (?v1 cuisine Italian)}
  {(?v2 attire ?v3). (?v2 cuisine Italian). (?v2 location Norfolk)}
  {(Norfolk locatedIn NE/USA)}
}
[Figure label: P-path]

GRIN Index
- Resources that are "closer" in the RDF graph are more likely to be part of the same answer, so they should appear on the same page.
- GRIN groups resources in circles around selected center resources.
- Query evaluation: find the smallest circle that contains the answer.
- Evaluate the query only on the resources in that circle.

Building a GRIN Index
A GRIN index is a balanced binary tree such that:
- Each leaf node l contains a set N_l ⊆ R of nodes such that for all leaf nodes l ≠ l', N_l ∩ N_l' = ∅.
- Each non-leaf node t contains a pair (c, r), with c ∈ R and r ∈ ℕ. This is a very succinct representation of the set of resources in the graph at distance at most r from the resource c. We write this set as N_t = {c' ∈ R | d(c, c') ≤ r}.
- For any nodes x, y in the tree such that x is a parent of y, N_x ⊇ N_y.
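The three conditions above can be written down as a small data structure with a checker. This is a minimal sketch under stated assumptions: the class `GrinNode` and the helpers `node_set`, `leaves` and `check_invariants` are hypothetical names, the distance function is passed in by the caller, and the balance condition is not checked.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass
class GrinNode:
    # A leaf stores its resource set N_l; an inner node stores (center, radius).
    members: Optional[FrozenSet] = None
    center: Optional[object] = None
    radius: Optional[int] = None
    left: Optional["GrinNode"] = None
    right: Optional["GrinNode"] = None

    def is_leaf(self):
        return self.members is not None

def node_set(node, resources, dist):
    """N_l for a leaf, or N_t = {c' in R | d(c, c') <= r} for an inner node."""
    if node.is_leaf():
        return set(node.members)
    return {x for x in resources if dist(node.center, x) <= node.radius}

def leaves(node):
    if node.is_leaf():
        return [node]
    return leaves(node.left) + leaves(node.right)

def check_invariants(root, resources, dist):
    # Condition 1: leaf sets are pairwise disjoint.
    leaf_sets = [set(leaf.members) for leaf in leaves(root)]
    for i in range(len(leaf_sets)):
        for j in range(i + 1, len(leaf_sets)):
            if leaf_sets[i] & leaf_sets[j]:
                return False
    # Condition 3: every parent set contains the sets of its children.
    stack = [root]
    while stack:
        node = stack.pop()
        if node.is_leaf():
            continue
        parent_set = node_set(node, resources, dist)
        for child in (node.left, node.right):
            if not node_set(child, resources, dist) <= parent_set:
                return False
            stack.append(child)
    return True

# Toy graph: four resources on a line, distance = difference of positions.
resources = [0, 1, 2, 3]
dist = lambda a, b: abs(a - b)
good = GrinNode(center=1, radius=2,
                left=GrinNode(members=frozenset({0, 1})),
                right=GrinNode(members=frozenset({2, 3})))
print(check_invariants(good, resources, dist))  # True
```

A root with too small a radius, such as (0, 1) in the toy graph, fails the containment check because its circle no longer covers the right leaf.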

Building a GRIN Index
- M = maximum number of RDF graph vertices per page.
- C = number of leaf nodes, with |R| / C ≤ M.
- d_c = inter-cluster distance function, defined in one of three ways:
  (i) Single link: d_c(S, S') = min d_c(x, y), where x ∈ S, y ∈ S'.
  (ii) Complete link: d_c(S, S') = max d_c(x, y), where x ∈ S, y ∈ S'.
  (iii) Average link: d_c(S, S') = (Σ d_c(x, y)) / (|S| × |S'|), where x ∈ S, y ∈ S'.
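The three linkage definitions translate directly into code. The sketch below assumes a caller-supplied pairwise distance `d`; the function names and the helper `min_leaf_count`, which just solves |R| / C ≤ M for the smallest integer C, are hypothetical.

```python
import math
from itertools import product

def single_link(S, T, d):
    # Smallest pairwise distance between the two clusters.
    return min(d(x, y) for x, y in product(S, T))

def complete_link(S, T, d):
    # Largest pairwise distance between the two clusters.
    return max(d(x, y) for x, y in product(S, T))

def average_link(S, T, d):
    # Sum of all pairwise distances divided by |S| * |T|.
    return sum(d(x, y) for x, y in product(S, T)) / (len(S) * len(T))

def min_leaf_count(num_resources, page_capacity):
    # Smallest C satisfying |R| / C <= M.
    return math.ceil(num_resources / page_capacity)

d = lambda x, y: abs(x - y)
S, T = {0, 1}, {4, 6}
print(single_link(S, T, d), complete_link(S, T, d), average_link(S, T, d))  # 3 6 4.5
print(min_leaf_count(10, 4))  # 3
```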

Building a GRIN Index
- Cluster the vertices into C disjoint sets using the PAM clustering algorithm, repeating until equilibrium is reached.
- For each intermediate level, GRINBuild chooses a random node u, computes its closest node v, and assigns them a parent node (c, r), where c is selected from N_u ∪ N_v.
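The level-by-level pairing step can be sketched as follows. Several details are assumptions, not taken from the slides: the PAM step is skipped and the leaf clusters are given as input, "closest" is measured with single link, the center c is the element of N_u ∪ N_v that needs the smallest radius, r is that radius, and an unpaired node is promoted to the next level unchanged. The function name `build_upper_levels` and the dictionary layout are hypothetical.

```python
import random

def build_upper_levels(leaf_clusters, resources, dist, seed=0):
    rng = random.Random(seed)
    # Each entry pairs a subtree with the resource set it covers.
    level = [({"leaf": frozenset(c)}, set(c)) for c in leaf_clusters]
    while len(level) > 1:
        pending, next_level = list(level), []
        while pending:
            # Choose a random node u at this level.
            u_tree, u_set = pending.pop(rng.randrange(len(pending)))
            if not pending:
                next_level.append((u_tree, u_set))  # odd node out
                break
            # Find its closest remaining node v (single link, an assumption).
            j = min(range(len(pending)),
                    key=lambda i: min(dist(x, y)
                                      for x in u_set for y in pending[i][1]))
            v_tree, v_set = pending.pop(j)
            union = u_set | v_set
            # Pick the center from N_u ∪ N_v that minimises the radius.
            center = min(sorted(union),
                         key=lambda c: max(dist(c, x) for x in union))
            radius = max(dist(center, x) for x in union)
            covered = {x for x in resources if dist(center, x) <= radius}
            next_level.append(({"center": center, "radius": radius,
                                "children": (u_tree, v_tree)}, covered))
        level = next_level
    return level[0][0]

resources = list(range(8))
dist = lambda a, b: abs(a - b)
root = build_upper_levels([{0, 1}, {2, 3}, {4, 5}, {6, 7}], resources, dist)
print(root["center"], root["radius"])  # 3 4
```

Because each parent circle is computed to cover both children's sets, the containment condition N_x ⊇ N_y holds by construction in this sketch.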

[Figures: Building the index: the tree]

Query Answering
- Derive constraints from the query.
- Evaluate the constraints against the nodes of the GRIN index.
Constraints for the example query:
d(?v1, NE/USA) ≤ 2, d(?v2, NE/USA) ≤ 2, d(?v2, Norfolk) ≤ 1, d(?v1, Norfolk) ≤ 3, d(?v1, Italian) ≤ 1, d(?v2, Italian) ≤ 1, d(?v3, NE/USA) ≤ 3, d(?v3, Norfolk) ≤ 2, d(?v3, Italian) ≤ 2.
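One way to obtain such constraints is to take shortest-path lengths in the query graph, ignoring edge direction. The sketch below makes that assumption and uses Floyd-Warshall. The edge list is reconstructed from the SPARQL form of the example query, except for the double edge between ?v1 and NE/USA with bound 2, which is an assumption: the query figure is not part of the transcript, and this edge is what reproduces the nine bounds listed above. The function name `derive_constraints` is hypothetical.

```python
from itertools import product

def derive_constraints(edges, variables, constants):
    """Upper bounds d(var, const) <= k from undirected shortest paths."""
    nodes = sorted({n for a, b, _ in edges for n in (a, b)})
    inf = float("inf")
    d = {(a, b): (0 if a == b else inf) for a, b in product(nodes, nodes)}
    for a, b, w in edges:
        d[a, b] = min(d[a, b], w)
        d[b, a] = min(d[b, a], w)
    # Floyd-Warshall: the intermediate node k varies slowest.
    for k, i, j in product(nodes, nodes, nodes):
        if d[i, k] + d[k, j] < d[i, j]:
            d[i, j] = d[i, k] + d[k, j]
    return {(v, c): d[v, c] for v in variables for c in constants}

edges = [
    ("?v1", "?v3", 1),         # ?v1 attire ?v3
    ("?v1", "Italian", 1),     # ?v1 cuisine Italian
    ("?v2", "?v3", 1),         # ?v2 attire ?v3
    ("?v2", "Italian", 1),     # ?v2 cuisine Italian
    ("?v2", "Norfolk", 1),     # ?v2 location Norfolk
    ("Norfolk", "NE/USA", 1),  # Norfolk locatedIn NE/USA
    ("?v1", "NE/USA", 2),      # assumed double edge: P-path of length <= 2
]
bounds = derive_constraints(edges, ["?v1", "?v2", "?v3"],
                            ["NE/USA", "Norfolk", "Italian"])
print(bounds["?v1", "Norfolk"], bounds["?v3", "NE/USA"])  # 3 3
```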

Query Answering
For any given node, REJECT or ACCEPT it:
1: Reject circle(c, r) if any constant in the query is outside the circle.
2: Reject circle(c, r) if we cannot guarantee that every variable is inside the circle.
Is ?v1 in circle(Grivanti, 2)? By the triangle inequality and the constraint d(?v1, Italian) ≤ 1:
d(Grivanti, ?v1) ≤ d(Grivanti, Italian) + d(?v1, Italian) ≤ 2
So the condition for ?v1 is satisfied.
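The two rejection rules can be sketched as a single predicate. The names `variable_inside` and `accept_circle` are hypothetical, and so are the distances from Grivanti to Norfolk and to NE/USA used in the example; only d(Grivanti, Italian) = 1 is implied by the slide's bound of 2.

```python
def variable_inside(center, radius, var, bounds, dist):
    """True if some constant k shows d(center, var) <= radius, using
    d(center, var) <= d(center, k) + d(var, k) (triangle inequality)."""
    return any(dist(center, k) + bound <= radius
               for (v, k), bound in bounds.items() if v == var)

def accept_circle(center, radius, constants, bounds, dist):
    # Rule 1: every constant of the query must lie inside the circle.
    if any(dist(center, k) > radius for k in constants):
        return False
    # Rule 2: every variable must be guaranteed to lie inside the circle.
    variables = {v for v, _ in bounds}
    return all(variable_inside(center, radius, v, bounds, dist)
               for v in variables)

bounds = {
    ("?v1", "NE/USA"): 2, ("?v2", "NE/USA"): 2, ("?v2", "Norfolk"): 1,
    ("?v1", "Norfolk"): 3, ("?v1", "Italian"): 1, ("?v2", "Italian"): 1,
    ("?v3", "NE/USA"): 3, ("?v3", "Norfolk"): 2, ("?v3", "Italian"): 2,
}
# Distances from Grivanti; the Norfolk and NE/USA values are hypothetical.
from_grivanti = {"Italian": 1, "Norfolk": 1, "NE/USA": 2}
dist = lambda center, k: from_grivanti[k]
constants = list(from_grivanti)

print(variable_inside("Grivanti", 2, "?v1", bounds, dist))  # True
print(accept_circle("Grivanti", 2, constants, bounds, dist))  # False
print(accept_circle("Grivanti", 3, constants, bounds, dist))  # True
```

With these made-up distances the circle of radius 2 passes the check for ?v1, as on the slide, but is still rejected overall because ?v3 can only be bounded by 3.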

RDF System: GRIN
- Does not store the data in the index, but points to it. The data is stored in a hash table.
- Only one computationally intensive operation: clustering the leaf nodes.
- For 300 MB of data, the index is stored in 75 MB and 320 MB is used for the hash table.

RDF System: Jena
- Stores RDF as (subject, property, value) in a relational table.
- Indexes on each of the three attributes.
- Translates SPARQL/RDQL into SQL.
- Too many self-joins.
- Used 403 MB for indexing on 300 MB of data.

RDF System: Sesame
- Supports RDF Schema inference.
- Separates RDFS from the triple table.
- Supports database schema generation based on the underlying RDF schema of a dataset.
- The problem of too many joins still remains.
- Used 825 MB for indexing on 300 MB of data.

RDF System: RDFBroker
- The database schema is built based on signatures: the set of properties used on a resource.
- Reduces the number of joins between tables.
- Used 950 MB for indexing on 300 MB of data.

Discussion
- Vertices in GRIN = resources in the underlying RDF graph.
- There are at most |R| resources, so the number of leaf nodes is O(|R|).
- GRIN is a balanced binary tree, so the height of the tree is O(log_2 |R|).
- Worst-case complexity of index building: O(|R|^4 · log_2 |R|).
- Good for small data sets only.

Discussion
- Time complexity of query answering: best case O(N), worst case O(N!), where N is the total number of vertices in the graphs to be matched.
- "Our experimental results show that GRINAnswer is often faster than Jena, Sesame and RDFBroker for certain types of graph-based queries."

Discussion
The query can be expressed in SPARQL as:
SELECT ?v1 ?v2 ?v3 WHERE {
  {(?v1 attire ?v3). (?v1 cuisine Italian)}
  {(?v2 attire ?v3). (?v2 cuisine Italian). (?v2 location Norfolk)}
  {(Norfolk locatedIn NE/USA)}
}
There is no way to represent the P-path in SPARQL.
[Figure label: P-path]

Thank you!