gStore: Answering SPARQL Queries via Subgraph Matching. Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao


1 gStore: Answering SPARQL Queries via Subgraph Matching. Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao. {zoulei,mojinghui,zdy}@icst.pku.edu.cn, leichen@cse.ust.hk, tamer.ozsu@uwaterloo.ca

2 Agenda – Introduction – Preliminaries – Overview of gStore – Storage Scheme and Encoding Technique – Indexing Structure and Query Algorithm – Optimized methods – Experiments and their results – Conclusions

3 Introduction - 1/4 What is RDF? – Building block of the Semantic Web – Represented as a collection of triples: (Subject, Property, Object)
Prefix: y=http://en.wikipedia.org/wiki/
Subject | Property | Object
y:Abraham_Lincoln | hasName | "Abraham Lincoln"
y:Abraham_Lincoln | BornOnDate | "1809-02-12"
y:Abraham_Lincoln | DiedOnDate | "1865-04-15"
y:Abraham_Lincoln | DiedIn | y:Washington_D.C
y:Washington_D.C | hasName | "Washington D.C"
y:Washington_D.C | FoundYear | "1790"
y:Washington_D.C | rdf:type | y:city
y:United_States | hasName | "United States"
y:United_States | hasCapital | y:Washington_D.C
y:United_States | rdf:type | Country
y:Reese_Witherspoon | rdf:type | y:Actor
y:Reese_Witherspoon | BornOnDate | "1976-03-22"
y:Reese_Witherspoon | BornIn | y:New_Orleans_Louisiana
y:Reese_Witherspoon | hasName | "Reese Witherspoon"
y:New_Orleans_Louisiana | FoundYear | "1718"
y:New_Orleans_Louisiana | rdf:type | y:city
y:New_Orleans_Louisiana | locatedIn | y:United_States
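As a rough illustration of the triple model on this slide, the following Python sketch stores a few of the triples as tuples and groups them by subject, which is one simple way to view the data as a graph. The function name `build_graph` and the underscore spelling of the identifiers are assumptions made for the example, not part of the slides.

```python
from collections import defaultdict

# A subset of the triples from the slide; literals are kept as plain strings.
TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Abraham_Lincoln", "DiedIn", "y:Washington_D.C"),
    ("y:Washington_D.C", "hasName", "Washington D.C"),
    ("y:United_States", "hasCapital", "y:Washington_D.C"),
]


def build_graph(triples):
    """Map each subject to its outgoing (property, object) edges."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))
    return graph


graph = build_graph(TRIPLES)
for prop, obj in graph["y:Abraham_Lincoln"]:
    print(prop, "->", obj)
```

Each subject becomes a vertex and each (property, object) pair an outgoing labelled edge, which is the view the RDF graph slide relies on.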

4 Introduction - 2/4: RDF Graph

5 Introduction - 3/4 What is SPARQL?
Sample query: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". }
Query with wildcards: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> ?bd. ?m <DiedOnDate> ?dd. FILTER (regex(str(?bd), "02-12") && regex(str(?dd), "04-15")) }
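To make the wildcard query concrete, here is a naive Python sketch that evaluates it by scanning triples, assuming the three properties involved are hasName, BornOnDate and DiedOnDate as in the triple table. The helper `names_matching` is hypothetical, and this is a brute-force baseline for illustration, not how gStore answers the query.

```python
import re

TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Reese_Witherspoon", "hasName", "Reese Witherspoon"),
    ("y:Reese_Witherspoon", "BornOnDate", "1976-03-22"),
]


def names_matching(triples, born_pattern, died_pattern):
    """Return names of subjects whose birth and death dates match the patterns."""
    by_subject = {}
    for subject, prop, obj in triples:
        by_subject.setdefault(subject, {}).setdefault(prop, []).append(obj)

    names = []
    for props in by_subject.values():
        born_ok = any(re.search(born_pattern, v) for v in props.get("BornOnDate", []))
        died_ok = any(re.search(died_pattern, v) for v in props.get("DiedOnDate", []))
        if born_ok and died_ok:
            names.extend(props.get("hasName", []))
    return names


print(names_matching(TRIPLES, "02-12", "04-15"))
```

The scan touches every triple for every query, which is the kind of cost the signature-based filtering in the later slides is meant to avoid.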

6 Introduction - 4/4 Problems with existing solutions: – they cannot answer SPARQL queries with wildcards in a scalable manner – they cannot handle frequent updates in RDF repositories Answering with subgraph matching – Modeling RDF data and Query as two graphs – Cannot use regular graph pattern matching – Answering SPARQL query ≈ subgraph matching

7 Preliminaries An RDF graph G is denoted as $G = (V, L_V, E, L_E)$. A query graph Q is denoted as $Q = (V, L_V, E, L_E)$.

8 Preliminaries Cont'd $G(u_1, u_2, \ldots, u_n)$ is a match of $Q(v_1, v_2, \ldots, v_n)$ if: – if $v_i$ is a literal vertex, $v_i$ and $u_i$ have the same literal value – if $v_i$ is a class/entity vertex, $v_i$ and $u_i$ have the same URI – if $v_i$ is a parameter vertex, there is no constraint over $u_i$ – if $v_i$ is a wildcard vertex, $v_i$ is a substring of $u_i$ and $u_i$ is a literal value – if there is an edge from $v_i$ to $v_j$ in Q with the property p, there is also an edge from $u_i$ to $u_j$ in G with the same property p
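The match conditions above can be sketched as a small Python check. The representation is an assumption made for the example: query vertices are (kind, value) pairs, data edges are (subject, property, object) tuples, and `literals` is the set of literal values in the data graph. The function names are hypothetical.

```python
def vertex_ok(query_vertex, data_vertex, literals):
    """Check one query vertex against the data vertex it is mapped to."""
    kind, value = query_vertex
    if kind == "literal":
        return data_vertex in literals and data_vertex == value
    if kind == "entity":  # class or entity vertex: same URI
        return data_vertex not in literals and data_vertex == value
    if kind == "parameter":  # no constraint
        return True
    if kind == "wildcard":  # query text must be a substring of a literal
        return data_vertex in literals and value in data_vertex
    raise ValueError(f"unknown vertex kind: {kind}")


def is_match(q_vertices, q_edges, assignment, data_edges, literals):
    """Return True if `assignment` (query vertex -> data vertex) is a match."""
    for name, query_vertex in q_vertices.items():
        if not vertex_ok(query_vertex, assignment[name], literals):
            return False
    # Every query edge needs a data edge with the same property.
    return all((assignment[a], p, assignment[b]) in data_edges for a, p, b in q_edges)


data_edges = {
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
}
literals = {"Abraham Lincoln", "1809-02-12"}
q_vertices = {"?m": ("parameter", None), "?name": ("parameter", None), "?bd": ("wildcard", "02-12")}
q_edges = [("?m", "hasName", "?name"), ("?m", "BornOnDate", "?bd")]

good = {"?m": "y:Abraham_Lincoln", "?name": "Abraham Lincoln", "?bd": "1809-02-12"}
bad = {"?m": "y:Abraham_Lincoln", "?name": "Abraham Lincoln", "?bd": "Abraham Lincoln"}
print(is_match(q_vertices, q_edges, good, data_edges, literals))
print(is_match(q_vertices, q_edges, bad, data_edges, literals))
```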

9 Overview of gStore – Works directly on the RDF graph and the SPARQL query graph – Uses a signature-based encoding of each entity and class vertex to speed up matching – Filter and evaluate: a filtering step that may admit false positives prunes nodes and yields a set of candidates; each candidate is then verified – Uses an index (VS*-tree) over the data signature graph, with a light maintenance load, for efficient pruning

10 Storage Scheme & Encoding Technique Storage Scheme

11 Storage Scheme & Encoding Technique Encoding technique (hasName, “Abraham Lincoln”) 0100 0000 0000

12 Storage Scheme & Encoding Technique Encoding technique (hasName, “Abraham Lincoln”) 0100 0000 0000 “Abr” “bra” “rah”

13 Storage Scheme & Encoding Technique Encoding technique (hasName, "Abraham Lincoln")
Property hasName: 0100 0000 0000
3-grams of the value: "Abr", "bra", "rah"
Hashed 3-grams: 0000 0100 0000 0000; 1000 0000 0000 0000; 0000 0000 0100 0000

14 Storage Scheme & Encoding Technique Encoding technique (hasName, "Abraham Lincoln")
Property hasName: 0100 0000 0000
3-grams of the value: "Abr", "bra", "rah"
Hashed 3-grams: 0000 0100 0000 0000; 1000 0000 0000 0000; 0000 0000 0100 0000
OR of the hashed 3-grams: 1000 0100 0100 0000

15 Storage Scheme & Encoding Technique Encoding technique (hasName, “Abraham Lincoln”) 0100 0000 0000 1000 0100 0100 0000

16 Storage Scheme & Encoding Technique Encoding technique
(hasName, "Abraham Lincoln"): 0010 0000 0000 1000 0100 0100 0000
(BornOnDate, "1809-02-12"): 0100 0000 0000 0100 0010 0100 1000
(DiedOnDate, "1865-04-15"): 0000 1000 0000 0000 0010 0100 0000
(DiedIn, y:Washington_D.C): 0000 0010 0000 1000 0010 0100 0001
OR of the four edge signatures: 0110 1010 0000 1100 0110 0100 1001
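The encoding on these slides can be approximated with the Python sketch below. It assumes a 12-bit property part and a 16-bit value part, as the bit strings on the slides suggest, and uses an MD5-based hash to pick one bit per property and per 3-gram. That hash is a stand-in chosen for the example, so the bit patterns it produces will not reproduce the ones shown on the slides.

```python
import hashlib

PROP_BITS = 12   # width of the property part (assumed from the slides)
VALUE_BITS = 16  # width of the value part (assumed from the slides)
N = 3            # n-gram length


def _bit(text, width):
    """Pick one bit position in `width` bits from a stable hash of `text`."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:4], "big") % width)


def ngrams(text, n=N):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def value_signature(value):
    """OR together the hashed n-grams of a value."""
    sig = 0
    for gram in ngrams(value):
        sig |= _bit(gram, VALUE_BITS)
    return sig


def edge_signature(prop, value):
    """Concatenate the property bits and the value bits."""
    return (_bit(prop, PROP_BITS) << VALUE_BITS) | value_signature(value)


def vertex_signature(edges):
    """OR together the signatures of all outgoing (property, value) edges."""
    sig = 0
    for prop, value in edges:
        sig |= edge_signature(prop, value)
    return sig


lincoln = vertex_signature([
    ("hasName", "Abraham Lincoln"),
    ("BornOnDate", "1809-02-12"),
    ("DiedOnDate", "1865-04-15"),
    ("DiedIn", "y:Washington_D.C"),
])
query = edge_signature("hasName", "Abraham")
print(format(lincoln, "028b"))
print(query & lincoln == query)
```

Because every 3-gram of "Abraham" is also a 3-gram of "Abraham Lincoln", the query signature's bits are a subset of the vertex signature's bits, which is what makes the later `&` test a safe filter.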

17 Indexing Structure and Query Algorithm

18 Data Signature Graph G*

19 Converting Q to Q*

20 Filter and Evaluate – Filter: find matches of Q* over G* to obtain the candidate list (CL) – Evaluate: verify each candidate against the RDF graph G to obtain the result set (RS)
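A minimal Python sketch of the two phases, assuming signatures are plain integers and that the exact check is supplied as a callback. The names `filter_candidates`, `filter_and_evaluate`, `verify` and the sample signatures are hypothetical.

```python
def filter_candidates(query_sig, data_sigs):
    """Keep vertices whose signature contains every bit set in the query signature."""
    return [v for v, sig in data_sigs.items() if query_sig & sig == query_sig]


def filter_and_evaluate(query_sig, data_sigs, verify):
    """Return (CL, RS): the candidate list and the verified result set."""
    candidates = filter_candidates(query_sig, data_sigs)
    results = [v for v in candidates if verify(v)]
    return candidates, results


# Hypothetical 4-bit signatures; 003 passes the filter but is a false positive.
data_sigs = {"001": 0b1010, "002": 0b0110, "003": 0b1110}
true_matches = {"001"}

cl, rs = filter_and_evaluate(0b1010, data_sigs, lambda v: v in true_matches)
print(cl, rs)
```

The filter can only over-approximate, so the verification step is what removes candidates such as 003.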

21 Generating Candidate List (CL) Two-step process: – for each vertex $v_i \in V(Q^*)$, find a list $R_i = \{u_{i1}, u_{i2}, \ldots, u_{in}\}$, where $u_{ij} \in V(G^*)$ and $v_i \,\&\, u_{ij} = v_i$ for every $u_{ij} \in R_i$ – do a multi-way join over the lists to get the candidate list S-tree – Height-balanced tree over signatures – Does not support the second step, so the join is expensive VS-tree and VS*-tree – Multi-resolution summary graph based on the S-tree – Support both steps efficiently
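The two steps can be sketched in Python as follows, assuming vertex and edge signatures are integers and that G* edges are stored in a dictionary keyed by (source, target). All names and sample signatures are hypothetical, and the join is a brute-force product, which is the expensive part the VS-tree is meant to avoid.

```python
from itertools import product


def candidate_lists(q_sigs, g_sigs):
    """Step 1: for each query vertex v, list data vertices u with v & u == v."""
    return {
        v: [u for u, sig in g_sigs.items() if q_sig & sig == q_sig]
        for v, q_sig in q_sigs.items()
    }


def edge_ok(u, w, label, g_edges):
    """A query edge matches if the data edge exists and its label contains `label`."""
    data_label = g_edges.get((u, w))
    return data_label is not None and data_label & label == label


def multiway_join(q_sigs, q_edges, g_sigs, g_edges):
    """Step 2: combine the per-vertex lists, keeping edge-consistent combinations."""
    lists = candidate_lists(q_sigs, g_sigs)
    names = list(q_sigs)
    matches = []
    for combo in product(*(lists[name] for name in names)):
        assignment = dict(zip(names, combo))
        if all(edge_ok(assignment[a], assignment[b], label, g_edges)
               for a, b, label in q_edges):
            matches.append(assignment)
    return matches


q_sigs = {"v1": 0b00001000, "v2": 0b10000000}
q_edges = [("v1", "v2", 0b10000)]
g_sigs = {"001": 0b00101000, "002": 0b10000100, "003": 0b10000001, "004": 0b00011000}
g_edges = {("001", "002"): 0b10010, ("004", "003"): 0b01000}

print(candidate_lists(q_sigs, g_sigs))
print(multiway_join(q_sigs, q_edges, g_sigs, g_edges))
```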

22 S-tree Solution [Figure: S-tree over the signatures of data vertices 001 to 008, shown with the query Q* (vertex signatures 0000 1000 and 1000 0000, edge signature 10000); node bit strings omitted]

23 S-tree Solution [Figure as on slide 22; highlighted vertices: 001, 004, 006]

24 S-tree Solution [Figure as on slide 22; highlighted vertex lists: 001, 004, 006 and 002, 003, 006]

25 S-tree Solution [Figure as on slide 24]

26 S-tree Solution [Figure as on slide 22; the lists 001, 004, 006 and 002, 003, 006 are combined (&)]
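The lookup that the S-tree slides animate can be sketched in Python as below. The sketch assumes each internal node stores the OR of its children's signatures, so a subtree can be skipped as soon as its signature lacks a bit the query needs. The class, helper names and sample signatures are hypothetical, and the tree is built by hand rather than by a balanced insertion procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    signature: int
    children: List["Node"] = field(default_factory=list)
    vertex_id: Optional[str] = None  # set only on leaves


def leaf(vertex_id, signature):
    return Node(signature=signature, vertex_id=vertex_id)


def internal(children):
    """An internal node's signature is the OR of its children's signatures."""
    sig = 0
    for child in children:
        sig |= child.signature
    return Node(signature=sig, children=list(children))


def search(node, query_sig):
    """Return leaf vertex ids whose signatures contain all bits of query_sig."""
    if (node.signature & query_sig) != query_sig:
        return []  # no descendant can contain the query bits
    if node.vertex_id is not None:
        return [node.vertex_id]
    hits = []
    for child in node.children:
        hits.extend(search(child, query_sig))
    return hits


root = internal([
    internal([leaf("001", 0b00101000), leaf("002", 0b10000100)]),
    internal([leaf("003", 0b10000001), leaf("004", 0b00011000)]),
])
print(search(root, 0b00001000))
print(search(root, 0b10000000))
```

Each query vertex gets its own candidate list this way, but the tree says nothing about edges between candidates, so the lists still have to be joined afterwards.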

27 VS-tree Solution [Figure: VS-tree over data vertices 001 to 008, with tree nodes $d_1^1$; $d_1^2$, $d_2^2$; $d_1^3$, $d_2^3$, $d_3^3$, $d_4^3$, and edges between nodes at each level labelled with 5-bit signatures; node and edge bit strings omitted]

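28 VS-tree Solution Query Q*: vertex signatures 0000 1000 and 1000 0000, edge signature 10000

29 VS-tree Solution Query Q* as on slide 28. Summary match at level 1: $d_1^1 \times d_1^1$

30 VS-tree Solution Query Q* as on slide 28. Summary match at level 2: $d_1^2 \times d_1^2$

31 VS-tree Solution Query Q* as on slide 28. Summary match at level 3: $d_1^3 \times d_2^3$

32 VS-tree Solution Query Q* as on slide 28. Match at the leaf level: 001 × 002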
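The level-by-level descent on these slides can be sketched in Python for a query with a single edge. The sketch assumes that every tree level carries labelled edges between its nodes (stored here in one dictionary keyed by node names), that a parent's signature is the OR of its children's, and that the root pair is accepted without a check. The names `VSNode`, `parent`, `top_down_matches` and all sample signatures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VSNode:
    name: str
    signature: int
    children: List["VSNode"] = field(default_factory=list)


def parent(name, children):
    sig = 0
    for child in children:
        sig |= child.signature
    return VSNode(name, sig, list(children))


def contains(data_sig, query_sig):
    return data_sig & query_sig == query_sig


def top_down_matches(root, edges, q1, q2, q_label):
    """Refine summary matches for one query edge v1 -> v2, one level at a time."""
    frontier = [(root, root)]  # the root pair is assumed to be a summary match
    leaf_matches = []
    while frontier:
        next_frontier = []
        for a, b in frontier:
            if not a.children and not b.children:
                leaf_matches.append((a.name, b.name))
                continue
            for ca in (a.children or [a]):
                for cb in (b.children or [b]):
                    label = edges.get((ca.name, cb.name))
                    if (label is not None and contains(ca.signature, q1)
                            and contains(cb.signature, q2) and contains(label, q_label)):
                        next_frontier.append((ca, cb))
        frontier = next_frontier
    return leaf_matches


d13 = parent("d13", [VSNode("001", 0b00101000), VSNode("003", 0b00000001)])
d23 = parent("d23", [VSNode("002", 0b10000100), VSNode("004", 0b00010000)])
d12 = parent("d12", [d13, d23])
d22 = parent("d22", [parent("d33", [VSNode("005", 0b00001000)])])
d11 = parent("d11", [d12, d22])

# Edge labels per level; an upper-level label is the OR of the labels below it.
edges = {
    ("001", "002"): 0b10010, ("003", "004"): 0b01000,
    ("d13", "d23"): 0b11010, ("d12", "d12"): 0b11010,
}
print(top_down_matches(d11, edges, 0b00001000, 0b10000000, 0b10000))
```

With this hand-built data the search passes through the pairs (d12, d12) and (d13, d23) before reaching the leaf pair, mirroring the sequence of summary matches on slides 29 to 32.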

33 VS-tree Solution: limitations Query Q*: vertex signatures 0000 1000 and 1000 0000, edge signature 10000 – The search processes the tree level by level – If a level is dense, it produces many summary matches, which enlarges the search space

34 Possible Optimization Methods – "Magically" know which level to begin with, to minimize the number of summary matches – Use DFS (depth-first search) to find the valid child nodes – While inserting vertices, consider not only the Hamming distance but also the number of super edges introduced
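The last point can be illustrated with a short Python sketch. It assumes one possible way to combine the two criteria: pick the node with the smallest Hamming distance to the new vertex's signature and break ties by the number of super edges the insertion would add. The slides do not say how the two are weighted, so this ordering, the function names and the sample data are all hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def new_super_edges(node_links, neighbor_nodes):
    """Super edges that would be added: neighbor nodes not already linked."""
    return len(set(neighbor_nodes) - set(node_links))


def choose_node(nodes, vertex_sig, neighbor_nodes):
    """Pick the node minimizing (Hamming distance, new super edges)."""
    def cost(name):
        sig, links = nodes[name]
        return (hamming(sig, vertex_sig), new_super_edges(links, neighbor_nodes))
    return min(nodes, key=cost)


# node name -> (signature, names of nodes it already has super edges to)
nodes = {
    "d1": (0b1100, {"d2"}),
    "d2": (0b1010, {"d1", "d3"}),
    "d3": (0b0011, set()),
}
# The new vertex has signature 1110 and one neighbor stored under d3.
print(choose_node(nodes, 0b1110, {"d3"}))
```

Here d1 and d2 are equally close in Hamming distance, and d2 wins because it already has a super edge to d3.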

35 Optimization example

36 Experimental results: exact queries Dataset: Yago network (20 million triples, 3.1 GB) [Chart: per-query comparison of gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM and GRIN]

37 Experimental results: wildcard queries [Chart: per-query comparison of gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM and GRIN]

38 Conclusion This approach: – Uses two novel indexes, VS-tree and VS*-tree, to speed up query processing – Was also able to solve the two problems with existing solutions: it answers SPARQL queries with wildcards in a scalable manner and handles frequent, online updates in RDF repositories

39 Questions?

