National Institute of Advanced Industrial Science and Technology Query Processing for Distributed RDF Databases Using a Three-dimensional Hash Index Akiyoshi.


1 National Institute of Advanced Industrial Science and Technology Query Processing for Distributed RDF Databases Using a Three-dimensional Hash Index Akiyoshi MATONO a.matono@aist.go.jp Grid Technology Research Center, AIST

2 Agenda. Motivation & Aims. Background: Distributed Hash Table (DHT). Our approach. Performance evaluation. Summary.

3 Motivation. It is essential to describe resources using RDF to support semantic tasks (e.g., resource discovery). Today, RDF data is widely used in many fields (e.g., bioinformatics and grid). As a result, RDF data is scattered everywhere and the total data size is rapidly increasing. We proposed a P2P-based RDF query processing method. Providing efficient and scalable RDF query processing in a distributed environment is an important issue.

4 Aims. RDF data is scattered everywhere, so we aim to provide an efficient join operation in a distributed environment. The amount of data is rapidly increasing, so we aim to reduce the amount of data transferred among nodes and to achieve scalability, availability, and reliability.

5 Distributed Hash Table (DHT). A structured P2P network. Achieves scalability, availability, and reliability. Supports only exact-match lookups for key-value pairs: put(key, value) and get(key). Routing is performed in O(log n). Some protocols: Chord, Tapestry, Pastry, CAN, Kademlia.

6 Chord [Stoica01]. [Figure: a Chord ring with nodes N2, N6, N11, N17, N27, N33, N42, N48, N50, N56, N63, and the finger tables of three nodes, listed as distance, successor (keys). N42: +0 N42 (42), +1 N48 (43), +2 N48 (44-45), +4 N48 (46-49), +8 N50 (50-57), +16 N63 (58-9), +32 N11 (10-41). N11: +0 N11 (11), +1 N17 (12), +2 N17 (13-14), +4 N17 (15-18), +8 N27 (19-26), +16 N27 (27-42), +32 N48 (43-10). N27: +0 N27 (27), +1 N33 (28), +2 N33 (29-30), +4 N33 (31-34), +8 N42 (35-42), +16 N48 (43-58), +32 N63 (59-26). The figure also marks an area of the ring whose data is stored on N27.] Example: put(28, A) is issued on N42. The finger table of N42 gives N11 as the node responsible for the range containing key 28 (10-41), the finger table of N11 gives N27 (27-42), and the finger table of N27 gives N33 (28), so N33 is the target node. The distance to the nodes in a finger table increases exponentially.
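The routing example on this slide can be replayed with a small simulation. This is a minimal sketch, not the Chord implementation itself: it assumes a 6-bit identifier ring (0..63, inferred from key ranges such as 58-9), uses the node ids shown in the figure, and computes every finger table from a global node list instead of maintaining it per node. The names `successor`, `finger_table`, `between`, and `lookup` are illustrative.

```python
from bisect import bisect_left

M = 6                       # identifier bits (assumed): ids live in 0..63
RING = 2 ** M
# Node ids read off the ring in the figure.
NODES = sorted([2, 6, 11, 17, 27, 33, 42, 48, 50, 56, 63])


def successor(key):
    """Return the first node whose id is >= key, wrapping around the ring."""
    i = bisect_left(NODES, key % RING)
    return NODES[i % len(NODES)]


def finger_table(node):
    """Fingers are the successors of node + 1, 2, 4, ..., 2**(M-1)."""
    return [successor(node + 2 ** i) for i in range(M)]


def between(x, a, b, inclusive_right=False):
    """True if x lies strictly between a and b, going clockwise on the ring."""
    x, a, b = x % RING, a % RING, b % RING
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b


def lookup(start, key):
    """Return the list of nodes visited while routing key from start."""
    path, node = [start], start
    while True:
        nxt = successor(node + 1)
        if between(key, node, nxt, inclusive_right=True):
            path.append(nxt)      # nxt is responsible for the key
            return path
        # Otherwise jump to the farthest finger that still precedes the key.
        for finger in reversed(finger_table(node)):
            if between(finger, node, key):
                nxt = finger
                break
        path.append(nxt)
        node = nxt


print(lookup(42, 28))   # [42, 11, 27, 33]
```

The printed path matches the hops in the figure (N42, N11, N27, N33). Because each hop uses the farthest finger that does not overshoot the key, the remaining distance roughly halves per hop, which is where the O(log n) routing cost comes from.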

7 Our Approach. A three-dimensional hash space called “RDFCube”: each axis represents the hash space for one of subject, predicate, and object, and the space consists of a set of cubes of the same size called “cells”. Bit information of RDFCube called the “existence flag”: each cell contains a bit that indicates the presence or absence of triples mapped into the cell. Runs on top of two DHTs: the RDFPeers DHT is used to store triples, and the RDFCube DHT is used to store bit information.

8 RDFCube: three-dimensional hash space. Each axis represents the hash space for one of a triple’s elements (subject, predicate, and object). RDFCube is composed of a set of cubes of the same size called “cells”. A triple is mapped into RDFCube based on the hash values of its elements. Example: a triple (s, p, o) whose elements hash to (13, 54, 39) is mapped to the point (13, 54, 39), which is contained in the cell [0, 3, 2]. [Figure: the cube with subject, predicate, and object axes, showing the point (13, 54, 39) inside the cell [0, 3, 2].]
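The mapping from a triple to a cell can be sketched as follows. The axis size (64) and the number of divisions per axis (4) are assumptions chosen because they reproduce the example on this slide; the use of SHA-1 in `hash_term` is likewise an arbitrary stand-in for whatever hash function the system uses.

```python
import hashlib

HASH_SPACE = 64    # assumed size of each axis
DIVISIONS = 4      # assumed number of cells per axis
CELL_WIDTH = HASH_SPACE // DIVISIONS


def hash_term(term):
    """Hash an RDF term (a string) onto one axis; SHA-1 is an arbitrary choice."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % HASH_SPACE


def point_of(triple):
    """Map a (subject, predicate, object) triple to a point in the cube."""
    return tuple(hash_term(term) for term in triple)


def cell_of(point):
    """Return the index of the cell that contains the point."""
    return tuple(value // CELL_WIDTH for value in point)


print(cell_of((13, 54, 39)))   # (0, 3, 2)
print(cell_of((21, 45, 17)))   # (1, 2, 1)
```

The second call uses the point from the "Storing Triples" slide, which lands in the cell [1, 2, 1] under the same assumed cell width of 16.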

9 Existence Flag. Each cell contains a bit that indicates the presence or absence of triples mapped into the cell. [Figure: the cube with subject, predicate, and object axes at three granularities. The cell [0, 3, 2] has the existence flag 1. The cell sequence [0, 1, *] has the bit sequence 0110. The cell matrix [0, *, *] has the bit matrix 0000 0110 0000 0100.]

10 Two DHTs: RDFCube & RDFPeers. The RDFPeers DHT is used to store RDF triples. RDFPeers is an RDF repository utilizing a DHT, proposed by [Min Cai and Martin Frank, 2004]. The RDFCube DHT is used to store bit information and serves as an index for RDFPeers. Storing triples: 1. Store the triples in the RDFPeers DHT. 2. Store the bit information of the triples in the RDFCube DHT. Query processing with a join operation: 1. Get the bit information from the RDFCube DHT. 2. Perform AND operations on the bits. 3. Get triples from the RDFPeers DHT based on the bit information.

11 RDFPeers [Cai04]. An RDF repository utilizing a DHT. We call the DHT used by RDFPeers the RDFPeers DHT. Key: each of subject, predicate, and object. Value: the triple. To store a triple in the RDFPeers DHT, the triple is stored 3 times on 3 nodes by 3 lookups, using each of the triple’s elements as a key: put(s, (s, p, o)), put(p, (s, p, o)), and put(o, (s, p, o)). [Figure: the RDFPeers DHT ring with nodes N4, N8, N21, N25, N41, N55, N63; copies of the triple are stored by subject, by predicate, and by object.]

12 RDFPeers [Cai04]. An RDF repository utilizing a DHT. We call the DHT used by RDFPeers the RDFPeers DHT. Key: each of subject, predicate, and object. Value: the triple. Given a query triple, perform a lookup using one of the constants as a key. Example: for a query triple with the constants s and p, issue get(s) or get(p). [Figure: the RDFPeers DHT ring with nodes N4, N8, N21, N25, N41, N55, N63; the two possible lookups reach N55 and N21.]
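The store and lookup behaviour described on the two RDFPeers slides can be sketched with a dictionary standing in for the DHT. This is only an illustration under stated assumptions: `RDFPeersDHT`, `store_triple`, and `candidates` are hypothetical names, routing to remote nodes is omitted, and `None` is used here to mark a variable in a query triple.

```python
from collections import defaultdict


class RDFPeersDHT:
    """Dictionary stand-in for the RDFPeers DHT (no routing, single process)."""

    def __init__(self):
        self._store = defaultdict(set)

    def put(self, key, triple):
        self._store[key].add(triple)

    def get(self, key):
        return set(self._store.get(key, ()))


def store_triple(dht, triple):
    """Store the triple three times, keyed by subject, predicate, and object."""
    for element in triple:
        dht.put(element, triple)


def candidates(dht, pattern):
    """Look up by one constant of the pattern, then match the other positions."""
    constants = [term for term in pattern if term is not None]
    if not constants:
        raise ValueError("the pattern needs at least one constant")
    found = dht.get(constants[0])
    return {
        triple for triple in found
        if all(q is None or q == v for q, v in zip(pattern, triple))
    }


dht = RDFPeersDHT()
store_triple(dht, ("s1", "p1", "A1"))
store_triple(dht, ("s2", "p1", "A2"))
print(candidates(dht, (None, "p1", "A1")))   # {('s1', 'p1', 'A1')}
```

A single `get` by one constant may return many triples that do not match the rest of the pattern; in a real deployment these would be shipped across the network, which is the cost the RDFCube index is meant to reduce.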

13 Two DHTs: RDFCube & RDFPeers. The RDFPeers DHT is used to store RDF triples. RDFPeers is an RDF repository utilizing a DHT, proposed by [Min Cai and Martin Frank, 2004]. The RDFCube DHT is used to store bit information and serves as an index for RDFPeers. Storing triples: 1. Store the triples in the RDFPeers DHT. 2. Store the bit information of the triples in the RDFCube DHT. Query processing with a join operation: 1. Get the bit information from the RDFCube DHT. 2. Perform AND operations on the bits. 3. Get triples from the RDFPeers DHT based on the bit information.

14 RDFCube DHT. Key: ID of a cell matrix. Value: bit matrix. To set the bit of a cell to 1 in the RDFCube DHT, perform 3 lookups using the 3 cell matrices containing the cell as keys. Example: for the cell [1, 2, 1], put(key, value) is issued with the keys [1, *, *], [*, 2, *], and [*, *, 1]. [Figure: the RDFCube DHT ring with nodes N1, N15, N28, N36, N51, N57 holds key [1, *, *] with value 0000 0010 0000 0000, key [*, 2, *] with value 0000 0000 0010 0000, and key [*, *, 1] with value 0000 0100 0000 0000.]

15 RDFCube DHT. Key: ID of a cell matrix. Value: bit matrix. To get the bit matrix of a cell matrix, perform a lookup using the cell matrix ID as a key. Example: get([1, *, *]) reaches node N57 and returns the bit matrix 0000 0010 0000 0000. [Figure: the RDFCube DHT ring with nodes N1, N15, N28, N36, N51, N57 holding the bit matrices for the keys [1, *, *], [*, 2, *], and [*, *, 1].]
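The put and get operations on the RDFCube DHT can be sketched in the same style. This is a minimal illustration, assuming 4 divisions per axis and a dictionary in place of the DHT; `RDFCubeDHT`, `matrix_entries`, and `set_bit` are hypothetical names, and the row and column order inside each bit matrix is an assumption that need not match the layout drawn in the figure.

```python
DIVISIONS = 4   # assumed number of cells per axis


def matrix_entries(cell):
    """Yield (cell-matrix key, row, column) for the 3 matrices containing cell."""
    s, p, o = cell
    yield (s, "*", "*"), p, o
    yield ("*", p, "*"), s, o
    yield ("*", "*", o), s, p


class RDFCubeDHT:
    """Dictionary stand-in for the RDFCube DHT (no routing, single process)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        """Return the bit matrix stored under key, or an all-zero matrix."""
        empty = [[0] * DIVISIONS for _ in range(DIVISIONS)]
        return self._store.get(key, empty)

    def set_bit(self, cell):
        """Set the existence flag of cell in each of its 3 cell matrices."""
        for key, row, col in matrix_entries(cell):
            matrix = self._store.setdefault(
                key, [[0] * DIVISIONS for _ in range(DIVISIONS)])
            matrix[row][col] = 1


cube = RDFCubeDHT()
cube.set_bit((1, 2, 1))
for row in cube.get((1, "*", "*")):
    print("".join(str(bit) for bit in row))
```

Storing each bit under three keys mirrors the three copies of a triple in RDFPeers: whichever element of a query triple is constant, the matching slice of the cube can be fetched with one lookup.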

16 Two DHTs: RDFCube & RDFPeers. The RDFPeers DHT is used to store RDF triples. RDFPeers is an RDF repository utilizing a DHT, proposed by [Min Cai and Martin Frank, 2004]. The RDFCube DHT is used to store bit information and serves as an index for RDFPeers. Storing triples: 1. Store the triples in the RDFPeers DHT. 2. Store the bit information of the triples in the RDFCube DHT. Query processing with a join operation: 1. Get the bit information from the RDFCube DHT. 2. Perform AND operations on the bits. 3. Get triples from the RDFPeers DHT based on the bit information.

17 Storing Triples. Given the triple (s, p, o). Update the RDFPeers DHT: store the triple in the RDFPeers DHT by 3 lookups, using s, p, and o as keys. Update the RDFCube DHT: get the cell into which the triple is mapped, then set the corresponding bit in each of the 3 bit matrices to 1 by 3 lookups. Example: the triple hashes to (21, 45, 17), which lies in the cell [1, 2, 1], so bits are set under the keys [1, *, *] (0000 0010 0000 0000), [*, 2, *] (0000 0000 0010 0000), and [*, *, 1] (0000 0100 0000 0000). [Figure: the RDFPeers DHT ring with nodes N4, N8, N21, N25, N41, N55, N63 and the RDFCube DHT ring with nodes N1, N15, N28, N36, N51, N57.]

18 Two DHTs: RDFCube & RDFPeers. The RDFPeers DHT is used to store RDF triples. RDFPeers is an RDF repository utilizing a DHT, proposed by [Min Cai and Martin Frank, 2004]. The RDFCube DHT is used to store bit information and serves as an index for RDFPeers. Storing triples: 1. Store the triples in the RDFPeers DHT. 2. Store the bit information of the triples in the RDFCube DHT. Query processing with a join operation: 1. Get the bit information from the RDFCube DHT. 2. Perform AND operations on the bits. 3. Get triples from the RDFPeers DHT based on the bit information.

19 Query Processing (1/2). Given a query with two triples that share the variable ?x, (?x, p1, o1) and (?x, p2, o2): 1. Get the bit information of the cells into which the query triples are mapped. 2. Perform an AND operation between the bits. Example: the query triples (?x, p1, A1) and (?x, p2, B2) are mapped to the cell sequences [*, 3, 2] and [*, 1, 1]; the two bit sequences are 1110 and 0011, and the AND operation yields 0010.
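The AND step is a plain bitwise conjunction over the bit sequences fetched for each query triple. A minimal sketch, with `and_bits` as an illustrative name and bit sequences represented as strings for readability:

```python
def and_bits(*sequences):
    """Bitwise AND of equally long bit strings such as '1110' and '0011'."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("bit sequences must have the same length")
    return "".join(
        "1" if all(seq[i] == "1" for seq in sequences) else "0"
        for i in range(length)
    )


print(and_bits("1110", "0011"))   # 0010
```

A position survives only if every query triple has at least one stored triple in the corresponding cell along the shared variable's axis, so a 0 rules out every subject hashed into that cell.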

20 Query Processing (2/2). 3. Get triples from the RDFPeers DHT based on the bit information: 1. Access the remote node where the candidate answer triples are stored. 2. For each triple, check whether the bit of the cell into which the triple is mapped is equal to 1. Example: for the query triple (?x, p1, A1), the candidate answers stored under the key p1 (by predicate) are (s0, p1, A0), (s1, p1, A1), (s2, p1, A1), and (s3, p1, A2). [Figure: the RDFPeers DHT ring with nodes N4, N8, N21, N25, N41, N55, N63.]

21 Query Processing (2/2). 3. Get triples from the RDFPeers DHT based on the bit information: 1. Access the remote node where the candidate answer triples are stored. 2. For each triple, check whether the bit of the cell into which the triple is mapped is equal to 1. 3. Return from the remote node only the candidate answer triples that satisfy the condition. Filtering based on the bit information narrows down the number of candidate answers. Example: for the query triple (?x, p1, A1), the candidate answers (s0, p1, A0), (s1, p1, A1), (s2, p1, A1), and (s3, p1, A2) are mapped to the cells [0, 3, 2], [1, 3, 2], [2, 3, 2], and [3, 3, 2], whose bits are 0, 0, 1, and 0; the triples whose bit is 0 are not answers.
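The filtering step at the remote node can be sketched as below. The subject hash values in `SUBJECT_HASH` are hypothetical, chosen only so that s0 to s3 fall into cells 0 to 3 as in the example; the cell width of 16 is the same assumption used for the earlier cell-mapping example, and `filter_candidates` is an illustrative name.

```python
CELL_WIDTH = 16   # assumed: a 64-value axis split into 4 cells

# Hypothetical subject hash values, picked so that s0..s3 fall into cells 0..3.
SUBJECT_HASH = {"s0": 5, "s1": 20, "s2": 40, "s3": 60}


def filter_candidates(candidates, joined_bits):
    """Keep the triples whose subject cell has its bit set in joined_bits."""
    kept = []
    for s, p, o in candidates:
        cell_index = SUBJECT_HASH[s] // CELL_WIDTH
        if joined_bits[cell_index] == "1":
            kept.append((s, p, o))
    return kept


candidates = [("s0", "p1", "A0"), ("s1", "p1", "A1"),
              ("s2", "p1", "A1"), ("s3", "p1", "A2")]
print(filter_candidates(candidates, "0010"))   # [('s2', 'p1', 'A1')]
```

Only the surviving triples are sent back over the network. Since one bit covers a whole cell, different subjects that hash into the same cell cannot be told apart by the bit alone, so the survivors remain candidates for the final join.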

22 Performance Evaluation. Compare RDFPeers (labeled PEERS) with RDFPeers+RDFCube (labeled CUBE). Data set: transform XML documents of DBLP into RDF data; create 4 RDF data sets with different numbers of triples (12500, 25000, 50000, 100000). Environment: emulate a 100-node Chord network; the number of divisions of RDFCube is 32-256. Queries: Query 1 asks for ?x with type Article, author “Jim Gray”, year “1998”, and journal “CoRR”. [Figure: query graphs. Query 2 uses the variables ?x and ?y, the predicates title and series, and the literal “LNCS”. Query 3 uses the variables ?x, ?y, and ?z, the predicates title, crossref, and title, and the literal “VLDB2004”.]

23 Storing Performance. PEERS is the network cost for storing triples. CUBE is the network cost for storing triples plus index construction. The results are shown as the ratio of CUBE to PEERS: if the ratio is 2, the cost of index construction equals the cost of storing triples; if the ratio is 1, index construction costs nothing. The ratio of #hops is smaller than 2, so the cost of index construction is smaller than that of storing triples. The ratio of transfer data size is very close to 1, so the amount of data transferred for index construction is very small.

24 Retrieval Performance. PEERS is the network cost to get triples from the RDFPeers DHT. CUBE is the network cost to get bits and triples from the two DHTs. #hops on CUBE is twice that on PEERS, because #hops to get triples is equal to #hops to get bit information. The transfer data size is reduced to at most 1/50 in query 1. Our approach makes it possible to reduce the transfer size, in particular when the same variable appears many times in the query.

25 Scalability. The ratio of CUBE to PEERS stays constant in all queries. Our approach achieves scalability with respect to the number of triples.

26 Summary. What we have achieved: scalability with respect to #triples, and a reduced amount of data transferred among nodes. Our major current challenge: providing efficient RDF query processing with join operations in a distributed environment. What we will achieve in the near future: eliminating redistribution of triples, utilizing the schema information, and a dynamic division mechanism for RDFCube.

27 Thank You

