Presentation is loading. Please wait.

Presentation is loading. Please wait.

Store RDF Triples In A Scalable Way 2012-03-23 Liu Long & Liu Chunqiu.

Similar presentations


Presentation on theme: "Store RDF Triples In A Scalable Way 2012-03-23 Liu Long & Liu Chunqiu."— Presentation transcript:

1 Store RDF Triples In A Scalable Way 2012-03-23 Liu Long & Liu Chunqiu

2 Scalable RDF Store Based on HBase and MapReduce 2012-03-23 Liu Chunqiu

3 OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion

4 RDF Schema Example In the example above, the resource "horse" is a subclass of the class "animal". ex:horse rdfs:subClassOf ex:animal

5 OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion

6 Physical Designs for RDF Storage (1/4) Giant Triples Table SELECT ?title WHERE { ?book ?title. ?book. ?book }  Entire Table Scan!  Redundancy!

7 Physical Designs for RDF Storage (2/4) Clustered Property Table Contains clusters of properties that tend to be defined together

8 Physical Designs for RDF Storage (3/4) Property-Class Table Exploits the type property of subjects to cluster similar sets of subjects together in the same table Unlike clustered property table, a property may exist in multiple property-class tables Values of the type property

9 Physical Designs for RDF Storage (4/4) Vertically Partitioned Table The giant table is rewritten into n two column tables where n is the number of unique properties in the data We don’t have to Maintain null values Have a certain clustering algorithm subject property object

10 OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion

11 HBase A data row = a sortable rowkey + an arbitrary number of columns Columns are grouped into column families A data cell can hold multiple versions which are distinguished by timestamp HBase is like a multi-dimensional sorted map : (Row Key, Family : Column,Timestamp )-----> Value Row Key: Key1 { Column Family A { Column X: T1 Value1 T2 Value2 Column Y: T3 Value3 } Column Family B { Column Z: T4 Value4 1.Data stored in the same column family are stored together on file system,while might be distributed 2.Hbase provides a B+ tree-like index on row key by default 3.Secondary Indexes on columns can also be created with extra efforts

12 MapReduce

13 OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion

14 Scalable RDF Store Based on HBase and MapReduce Present a RDF storage schema on HBase. The storage schema considers characteristics of both RDF data model and HBase structure Present a MapReduce algorithm for SPARQL Basic Graph Pattern Processing Evaluation

15 The Storage Schema on HBase build up six tables to store RDF triples : S_PO,P_SO, O_SP, PS_O, SO_P and PO_S Data are stored in row keys and column names These six tables have different row key and column name combinations Row Key: Subject URI I { Column Family: VALUE { Column: (predicate1, object1) Column: (predicate2, object2) Column: (predicate3, object3) } S_PO data row Column value is null !

16 The Storage Schema on HBase P1: (s, p, o) P2: (?s, p, o) P3: (s,? p, o) P4: (s, p,? o) P5: (?s,? p, o) P6: (s, ?p,? o) P7: (?s, p,? o) P8: (?s,? p,? o) S_PO P_SO O_SP SP_O SO_P PO_S P1 and P8 can be handled by any table

17 The Storage Schema on HBase Advantages Support for multi-valued properties Support for parallel processing Reduced I/O cost Suitable for HBase Disadvantages Requires more storage space The update operation will be more complicated

18 Query Processing SELECT ?x ?y ?z WHERE { ?x rdf:type ub:GraduateStudent. ?y rdf:type ub:University. ?z rdf:type ub:Department. ?x ub:memberOf ?z. ?z ub:subOrganizationOf ?y. ?x ub:undergraduateDegreeFrom ?y. } LUBM Q2 ( x, y, z ) x y (x, y)(x, y, z) (y, z) (x, z) z Join on x Join on y Join on (x, y) Join on z A MapReduce job

19 Process of MapReduce Assign unique ids to all triple patterns Retrieve results for all triple patterns and persist results to file system Calculate expressions for triple patterns and final result Create and submit the MapReduce job Mappers receive input (k, v) pairs Reducer retrieve (k, v) pairs Input of mapper : (1, Department0) (2, (Student0, Department0)) (3, (Department0, University0)) Output Of mapper: (Department0, (1, )) (Department0, (2, Student0)) (Department0, (3, University0)) (a) Output of reducer: (Student0,University0, Department0) (b) ? x ub: memberOf ?z (x,z) ?z rdf:type ub:Department. ?x ub:memberOf ?z. ?z ub:subOrganizationOf ?y.

20 Process of MapReduce Notice: If the query doesn't need to join, then there will not be any MapReduce job LUBM Query6 SELECT ?X WHERE { ?X rdf:type ub:Student } NO need to join!!!! The result of the only triple pattern will be directly used as final output

21 Evaluation

22

23

24

25

26 UniversitiesQ1(sec)Q2(sec)Q6(sec)Q7(sec)Q9(sec) 2017.043.38.650.247.8 4019.155.411.267.756.9 6023.366.017.179.070.0 8029.377.227.398.893.3 10038.2100.435.1131.5136.4 Q6 is the most efficient query

27 Evaluation UniversitiesQ1(sec)Q2(sec)Q6(sec)Q7(sec)Q9(sec) 2017.043.38.650.247.8 4019.155.411.267.756.9 6023.366.017.179.070.0 8029.377.227.398.893.3 10038.2100.435.1131.5136.4 Q6 is the most efficient query Q1 is the fastest among rest queries

28 Evaluation UniversitiesQ1(sec)Q2(sec)Q6(sec)Q7(sec)Q9(sec) 2017.043.38.650.247.8 4019.155.411.267.756.9 6023.366.017.179.070.0 8029.377.227.398.893.3 10038.2100.435.1131.5136.4 Q6 is the most efficient query Q1 is the fastest among rest queries Performance result against LUBM(20, 0) is not quite efficient Compare results of LUBM(20,0)and LUBM(100,0),MapReduce shows great scalability when data size grows

29 OUTLINE RDF Schema Example Conventional RDF Storage Structures Briefly Introduce HBase and MapReduce Scalable RDF Store Based on HBase and MapReduce Conclusion

30 Conclude and Think This approach works against large RDF dataset Waste a lot of storage space MapReduce algorithm need to be more efficient Enrich storage schema Use data compression Reduce execution steps during query processing Integrate with other MapReduce algorithm SO

31


Download ppt "Store RDF Triples In A Scalable Way 2012-03-23 Liu Long & Liu Chunqiu."

Similar presentations


Ads by Google