SPARQL Basic Graph Pattern Processing with Iterative MapReduce


1 SPARQL Basic Graph Pattern Processing with Iterative MapReduce
Presented by Jaeseok Myung
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea

2 MapReduce
MapReduce is easily accessible
  The Hadoop project provides an open-source MR implementation
  MapReduce gives users a simple abstraction for utilizing parallel and distributed systems
Programming Model
  Map(k, v) -> list(k', v')
  Reduce(k', list(v')) -> list(v'')
Useful for Massive Data Processing
Center for E-Business Technology
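The programming model above can be illustrated with a minimal in-memory sketch. This is not Hadoop code: `run_mapreduce` and the predicate-counting functions are hypothetical names chosen for illustration, and the reduce function here yields (key, value) pairs rather than bare values so the output stays readable.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce round in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):  # sorted only to make the output deterministic
        results.extend(reduce_fn(out_key, groups[out_key]))
    return results


def count_predicates_map(_key, triple):
    # Map(k, v) -> list(k', v'): emit (predicate, 1) for each triple (s, p, o).
    _s, p, _o = triple
    yield p, 1


def count_predicates_reduce(predicate, counts):
    # Reduce(k', list(v')) -> list(v''): sum the counts of one predicate.
    yield predicate, sum(counts)


triples = [
    (0, ("Prof0", "rdf:type", "ub:Professor")),
    (1, ("Prof0", "ub:name", "Professor0")),
    (2, ("Dept0", "rdf:type", "ub:Department")),
]
predicate_counts = run_mapreduce(triples, count_predicates_map, count_predicates_reduce)
```

On a real cluster the grouping step is the shuffle performed by the framework; the sketch only mimics its effect on a single machine.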

3 MR & Cloud Computing
MapReduce is a kind of platform
  MapReduce utilizes a number of commodity machines
  There can be a number of applications using MapReduce
[Figure: several applications (App.) running on top of MapReduce]

4 RDF Data Warehouse using MapReduce
With extensive studies, it has become known that MR is specialized for large-scale fault-tolerant data analyses
  Hive, CloudBase: data warehousing solutions built on top of Hadoop
  Advantages: Scalability, Extensibility, Fault-tolerance
My Research Interest
  RDF Data Warehouse using MapReduce

5 Why RDF Data Warehouse?
Flexible Data Model
  The underlying structure of any expression in RDF is a collection of triples (s, p, o)
Data Integration
  RDB-to-RDF (intra)
  Linked Open Data (inter)
  Incremental Integration
Inference
  We can discover some knowledge from what we already know
  A goal of data analyses

6 Approaches & Advantages
[Figure: Building a Data Warehouse. Labels, in the order they appear on the slide:]
  Support Tools, Simple, Fast, Performance Optimization
  Conventional DW Solutions, RDF Data Warehouse
  Centralized, Distributed & Parallel
  Before the Cloud, (MR) Cloud Computing
  Flexibility, Integration, Inference, Complexity
  Large-scale data analyses, Scalability, Extensibility, Fault-tolerance

7 SPARQL BGP Processing with MapReduce
Both RDF and MapReduce can benefit a data warehouse
  RDF is a data model: Flexibility, Integration, Inference
  MapReduce is a programming model: Scalability, Extensibility, Fault-tolerance
It has been difficult to create synergy because there have been only a few algorithms that connect the data model and the framework
  We should focus on an MR algorithm that manipulates RDF datasets
A MapReduce Algorithm for SPARQL Basic Graph Pattern Processing

8 SPARQL Basic Graph Pattern
SPARQL is a query language for RDF datasets
Basic Graph Pattern (BGP) is a set of triple patterns
  Triple patterns are similar to RDF triples (s, p, o) except that each of the subject, predicate and object can be a variable
BGP processing is important
  Most SPARQL queries have one or more BGPs
  BGPs require expensive join operations among triple patterns

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # TP#1
  ?x ub:worksFor <Department0> .    # TP#2
  ?x ub:name ?y1 .                  # TP#3
  ?x ub:emailAddress ?y2 .          # TP#4
  ?x ub:telephone ?y3               # TP#5
}
(TP#1 to TP#5 together form the BGP)
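To make the notion of a triple pattern concrete, the following sketch checks whether one RDF triple satisfies one triple pattern and returns the variable bindings. The function name `match` and the tuple representation of triples are assumptions made for illustration; they do not come from the slides.

```python
def is_variable(term):
    # SPARQL variables are written with a leading "?".
    return term.startswith("?")


def match(pattern, triple):
    """Return the variable bindings if `triple` satisfies `pattern`, else None."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            if bindings.get(p_term, t_term) != t_term:
                return None  # the same variable would need two different values
            bindings[p_term] = t_term
        elif p_term != t_term:
            return None  # a constant in the pattern does not match the triple
    return bindings


tp1 = ("?x", "rdf:type", "ub:Professor")
tp3 = ("?x", "ub:name", "?y1")
professor_bindings = match(tp1, ("Prof0", "rdf:type", "ub:Professor"))
name_bindings = match(tp3, ("Prof0", "ub:name", "Professor0"))
department_bindings = match(tp1, ("Dept0", "rdf:type", "ub:Department"))
```

A BGP result is then a set of bindings that is consistent across all of its triple patterns, which is why the joins among patterns dominate the cost.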

9 SPARQL BGP Processing with MapReduce
Two Operations
  MR-Selection: extracts RDF triples which satisfy at least one triple pattern
  MR-Join: merges selected triples

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # 1
  ?x ub:worksFor <Department0> .    # 2
  ?x ub:name ?y1 .                  # 3
  ?x ub:emailAddress ?y2 .          # 4
  ?x ub:telephone ?y3               # 5
}

Input triples
  <Prof0> rdf:type ub:Professor
  <Prof0> ub:worksFor <Dept0>
  <Prof0> ub:name “Professor0”
  <Prof0> ub:emailAddress “ ”
  <Prof0> ub:telephone “ ”
  <Dept0> rdf:type ub:Department
After MR-Selection and MR-Join
  <Prof0> rdf:type ub:Professor
          ub:worksFor <Dept0>
          ub:name “Professor0”
          ub:emailAddress “ ”
          ub:telephone “ ”

10 MR-Selection

public void map() {
  read a triple (s, p, o)
  // example: s = Prof0, p = rdf:type, o = ub:Professor
  for each (triple pattern in a given query) {
    if (input triple satisfies the triple pattern) {
      make a key and a value
      // key = [x]Prof0 (variable name, value)
      // value = 1 (# of the satisfied triple pattern)
      output (key, value)
    }
  }
}

public void reduce() {
  read input from the map function
  // input format: (key, list(satisfied tp_numbers))
  for each (value in a list of tp_numbers) {
    make a key and a value
    // key = <1>x, value = [x]Prof0
    output (key, value)
  }
}

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # 1
  ?x ub:worksFor <Department0> .    # 2
  ?x ub:name ?y1 .                  # 3
  ?x ub:emailAddress ?y2 .          # 4
  ?x ub:telephone ?y3               # 5
}
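The MR-Selection pseudocode can be sketched as runnable Python under a few assumptions: the BGP is shortened to three patterns, keys and values are plain tuples instead of the `[x]Prof0` and `<1>x` string encodings, and the shuffle is simulated in memory. All function names are hypothetical.

```python
from collections import defaultdict

# Shortened BGP: triple pattern number -> (s, p, o); "?" marks a variable.
BGP = {
    1: ("?x", "rdf:type", "ub:Professor"),
    2: ("?x", "ub:worksFor", "Dept0"),
    3: ("?x", "ub:name", "?y1"),
}


def match(pattern, triple):
    """Return sorted (variable, value) pairs if the triple satisfies the pattern."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            if bindings.get(p_term, t_term) != t_term:
                return None
            bindings[p_term] = t_term
        elif p_term != t_term:
            return None
    return tuple(sorted(bindings.items()))


def selection_map(triple):
    # key = variable bindings, value = number of the satisfied triple pattern
    for tp_number, pattern in BGP.items():
        bindings = match(pattern, triple)
        if bindings is not None:
            yield bindings, tp_number


def selection_reduce(bindings, tp_numbers):
    # key = (tp number, variable names), value = variable bindings
    variables = tuple(var for var, _ in bindings)
    for tp_number in tp_numbers:
        yield (tp_number, variables), bindings


def mr_selection(triples):
    groups = defaultdict(list)  # simulated shuffle: group map output by key
    for triple in triples:
        for key, value in selection_map(triple):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(selection_reduce(key, groups[key]))
    return output


selected = mr_selection([
    ("Prof0", "rdf:type", "ub:Professor"),
    ("Prof0", "ub:worksFor", "Dept0"),
    ("Prof0", "ub:name", "Professor0"),
    ("Dept0", "rdf:type", "ub:Department"),
])
```

The `Dept0` triple satisfies no pattern, so it produces no map output and is dropped, which is the filtering effect the slide describes.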

11 MR-Selection
Conceptually, the MR-Selection algorithm produces temporary tables which satisfy each triple pattern
  A result table has variable names as a relational table has attribute names
  It also has values for the variable names, as does the relational table
  The result table will be used for the next MR-Join operation if necessary
Result tables (variables per triple pattern)
  tp1: x
  tp2: x
  tp3: x, y1
  tp4: x, y2
  tp5: x, y3

12 MR-Join: Map
BGP Analyzer
  BGP Analyzer examines a given query before execution and provides join-keys to the map function
  Join-key (shared variable): ?x

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # 1
  ?x ub:worksFor <Department0> .    # 2
  ?x ub:name ?y1 .                  # 3
  ?x ub:emailAddress ?y2 .          # 4
  ?x ub:telephone ?y3               # 5
}

Mapper output, grouped by the values of the join-key variable
  <Prof0>
    <Prof0> rdf:type ub:Professor
    <Prof0> ub:worksFor <Dept0>
    <Prof0> ub:name “Professor0”
    <Prof0> ub:emailAddress “ ”
    <Prof0> ub:telephone “ ”
  <Prof1>
    <Prof1> ub:emailAddress “ ”
    <Prof1> ub:telephone “ ”

13 MR-Join: Map

public void map() {
  read input from MR-Selection
  // example input: (<1>x, [x]Prof0)
  // example input: (<3>x|y1, [x]Prof0|[y1]Professor0)
  get join-key variables and corresponding tp_numbers to be joined from the BGP Analyzer
  // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
  for each (join-key determined by BGP Analyzer) {
    if (input is related to the join-key) {
      make a key and a value
      // key = [x]Prof0 (variable name, value)
      // value = <tp>1</tp>[x]Prof0 (# of the satisfied triple pattern, variable name, value)
      // value = <tp>3</tp>[x]Prof0|[y1]Professor0
      output (key, value)
    }
  }
}

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # 1
  ?x ub:worksFor <Department0> .    # 2
  ?x ub:name ?y1 .                  # 3
  ?x ub:emailAddress ?y2 .          # 4
  ?x ub:telephone ?y3               # 5
}
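A possible Python rendering of the MR-Join map step is sketched below. It assumes that MR-Selection records are tuples of the form ((tp number, variable names), bindings) and that the BGP Analyzer output is a dictionary from join-key variable to the triple pattern numbers to be joined; both representations, and the name `join_map`, are illustrative assumptions.

```python
def join_map(record, join_keys):
    """Re-key one MR-Selection record by the value of each related join-key."""
    (tp_number, _variables), bindings = record
    bound = dict(bindings)
    for variable, tp_numbers in join_keys.items():
        # The input is related to the join-key if its pattern takes part in
        # the join and it actually binds the join-key variable.
        if tp_number in tp_numbers and variable in bound:
            # key = (variable name, value); value = (tp number, bindings)
            yield (variable, bound[variable]), (tp_number, bindings)


# Hypothetical BGP Analyzer output: join on ?x across patterns 1 to 5.
JOIN_KEYS = {"?x": (1, 2, 3, 4, 5)}

records = [
    ((1, ("?x",)), (("?x", "Prof0"),)),
    ((3, ("?x", "?y1")), (("?x", "Prof0"), ("?y1", "Professor0"))),
]
mapped = [pair for record in records for pair in join_map(record, JOIN_KEYS)]
```

Both records receive the same key, `("?x", "Prof0")`, so the framework delivers them to the same reducer call.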

14 MR-Join: Reduce
BGP Analyzer
  BGP Analyzer can provide triple pattern numbers related to the join-key variable by examining a given query
  Triple pattern numbers related to the join-key variable: <x> 1, 2, 3, 4, 5
  These numbers act as constraints for the join-key variable x

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # 1
  ?x ub:worksFor <Department0> .    # 2
  ?x ub:name ?y1 .                  # 3
  ?x ub:emailAddress ?y2 .          # 4
  ?x ub:telephone ?y3               # 5
}

Reducer input
  <Prof0> rdf:type ub:Professor
  <Prof0> ub:worksFor <Dept0>
  <Prof0> ub:name “Professor0”
  <Prof0> ub:emailAddress “ ”
  <Prof0> ub:telephone “ ”
  <Prof1> ub:emailAddress “ ”
  <Prof1> ub:telephone “ ”
Reducer output
  <Prof0> rdf:type ub:Professor
          ub:worksFor <Dept0>
          ub:name “Professor0”
          ub:emailAddress “ ”
          ub:telephone “ ”

15 MR-Join: Reduce

public void reduce() {
  read input from the Map function
  // example input: ([x]Prof0, [<tp>1</tp>[x]Prof0, <tp>3</tp>[x]Prof0|[y1]Professor0])
  get join-key variables and corresponding tp_numbers to be joined from the BGP Analyzer
  // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
  create a temporary hashtable H
  for each (value in values) {
    add an element to H
    // key = <1>x, value = [x]Prof0
    // key = <3>x|y1, value = [x]Prof0|[y1]Professor0
  }
  // H will be used for checking whether the input satisfies all related tps.
  if (keys in H cover all tp_numbers to be joined) {
    make a Cartesian product among values in H
    // (a1, b1), (a1, c1) => (a1, b1, c1)
    make a key and a value
    // key = <1|3>x|y1
    // value = [x]Prof0|[y1]Professor0
    output (key, value)
  }
}
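The reduce step can be sketched in Python as follows, assuming the map output format (tp number, bindings) and a dictionary in the role of the hashtable H. The consistency check on non-join variables is an addition of this sketch, not something the slide states, and `join_reduce` is a hypothetical name.

```python
from collections import defaultdict
from itertools import product


def join_reduce(join_key, values, tp_numbers):
    """Join all bindings that share one join-key value (one reducer call)."""
    table = defaultdict(list)  # hashtable H: tp number -> list of bindings
    for tp_number, bindings in values:
        table[tp_number].append(dict(bindings))
    # Emit nothing unless every triple pattern to be joined is represented.
    if not all(tp in table for tp in tp_numbers):
        return []
    results = []
    # Cartesian product: (a1, b1), (a1, c1) => (a1, b1, c1)
    for combination in product(*(table[tp] for tp in tp_numbers)):
        merged = {}
        consistent = True
        for bindings in combination:
            for variable, value in bindings.items():
                if merged.setdefault(variable, value) != value:
                    consistent = False  # assumption: drop conflicting bindings
        if consistent:
            key = (tuple(tp_numbers), tuple(sorted(merged)))
            results.append((key, merged))
    return results


values = [
    (1, (("?x", "Prof0"),)),
    (3, (("?x", "Prof0"), ("?y1", "Professor0"))),
]
joined = join_reduce(("?x", "Prof0"), values, (1, 3))
incomplete = join_reduce(("?x", "Prof0"), values, (1, 2, 3))
```

The second call returns an empty list because no binding for pattern 2 arrived, which mirrors how `<Prof1>` is filtered out on the previous slide.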

16 Join-key Selection Strategies
BGP Analyzer provides join-key variables by analyzing a query
How to select join-key variables?
  If a BGP has a shared variable
    We can easily select the variable
  If a BGP has two or more shared variables
    We applied two heuristics to select join-key variables
Greedy Selection
  Select a join-key according to the number of related triple patterns
Multiple Selection
  Select join-keys until every triple pattern participates in an MR-Join operation
  Utilize the distributed and parallel system architecture
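One possible reading of the two heuristics is sketched below. The slides do not give the exact procedure, so the tie-breaking rule, the choice to report only newly covered patterns per key, and all function names are assumptions. The example BGP is the six-pattern query used later in the deck.

```python
def shared_variables(bgp):
    """Map each variable occurring in two or more patterns to their numbers."""
    occurrences = {}
    for tp_number, pattern in bgp.items():
        for term in pattern:
            if term.startswith("?"):
                occurrences.setdefault(term, set()).add(tp_number)
    return {var: tps for var, tps in occurrences.items() if len(tps) > 1}


def greedy_selection(bgp):
    """Pick the shared variable related to the most triple patterns."""
    shared = shared_variables(bgp)
    if not shared:
        return None
    # Iterating in sorted order breaks ties by variable name.
    return max(sorted(shared), key=lambda var: len(shared[var]))


def multiple_selection(bgp):
    """Pick join-keys until every joinable pattern takes part in some join."""
    shared = shared_variables(bgp)
    uncovered = set().union(*shared.values())
    plan = []
    while uncovered:
        var = max(sorted(shared), key=lambda v: len(shared[v] & uncovered))
        plan.append((var, sorted(shared[var] & uncovered)))
        uncovered -= shared[var]
    return plan


BGP6 = {
    1: ("?x", "rdf:type", "ub:Professor"),
    2: ("?x", "ub:worksFor", "Department0"),
    3: ("?x", "ub:name", "?y1"),
    4: ("?x", "ub:emailAddress", "?y2"),
    5: ("?x", "ub:telephone", "?y3"),
    6: ("?y2", "ub:alias", "?y4"),
}
greedy_key = greedy_selection(BGP6)
join_plan = multiple_selection(BGP6)
```

Here `?x` is chosen first because it relates to five patterns, and `?y2` is added only to bring pattern 6 into a join.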

17 SPARQL BGP Processing with MR
Advantages
  MapReduce can benefit from the multi-way join technique
  If triple patterns share a variable, MR can join them all at once
  It is not unusual that a BGP has several triple patterns sharing the same variable because RDF has a fixed simple data model

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # 1
  ?x ub:worksFor <Department0> .    # 2
  ?x ub:name ?y1 .                  # 3
  ?x ub:emailAddress ?y2 .          # 4
  ?x ub:telephone ?y3               # 5
}

Input tables for both figures: tp1 (x), tp2 (x), tp3 (x, y1), tp4 (x, y2), tp5 (x, y3)
(a) Successive 2-way joins: (x) -> (x, y1) -> (x, y1, y2) -> (x, y1, y2, y3)
(b) One multi-way join: (x, y1, y2, y3)
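The difference between figures (a) and (b) can be expressed as a small sketch that lists the intermediate result schema after each 2-way join. It assumes the patterns are joined in their numbered order and that each 2-way join corresponds to one MR-Join job; `two_way_schemas` is a hypothetical helper.

```python
def two_way_schemas(pattern_variables):
    """Return the result schema after each successive 2-way join."""
    schemas = []
    current = list(pattern_variables[0])
    for variables in pattern_variables[1:]:
        # Joining adds the variables that are not in the schema yet.
        current = current + [v for v in variables if v not in current]
        schemas.append(tuple(current))
    return schemas


# Variables of tp1 to tp5 in the example query.
PATTERN_VARIABLES = [("x",), ("x",), ("x", "y1"), ("x", "y2"), ("x", "y3")]

cascade = two_way_schemas(PATTERN_VARIABLES)
two_way_jobs = len(cascade)  # one job per 2-way join
multi_way_jobs = 1           # all five patterns share ?x, so one job suffices
final_schema = cascade[-1]
```

Under these assumptions the five-pattern query needs four jobs with 2-way joins and a single job with a multi-way join, while both reach the same final schema.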

18 SPARQL BGP Processing with MR
Disadvantages
  If we have two or more shared variables, we need expensive MR iterations
    The triple patterns in a query cannot all be covered by one variable
  If we have two shared variables, MR iterations cannot be avoided
  To reduce unnecessary MR iterations, join-key selection strategies should be applied

SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # 1
  ?x ub:worksFor <Department0> .    # 2
  ?x ub:name ?y1 .                  # 3
  ?x ub:emailAddress ?y2 .          # 4
  ?x ub:telephone ?y3 .             # 5
  ?y2 ub:alias ?y4                  # 6
}

Input tables: tp1 (x), tp2 (x), tp3 (x, y1), tp4 (x, y2), tp5 (x, y3), tp6 (y2, y4)
Result of joining tp1 to tp5 on x: (x, y1, y2, y3)

19 Experiment
Environment
  LUBM Dataset
  Amazon EC2, Cloudera’s Hadoop Distribution, Amazon EBS
The effect of multi-way join
  The multi-way join technique reduces the execution time by joining several triple patterns at once
  Some queries do not show a significant difference because they are too simple to take advantage of multi-way join
Queries: Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14
2-way: 69.773 75.533 44.198 68.834 66.834 73.369 47.092
Multi-way: 86.423 67.214 74.163 44.526 73.337 63.557 86.117 72.825 42.156
Diff.: 36.968 77.548 2.559 1.37 -0.328 70.589 92.137 -4.503 3.277 26.685 0.544 4.936

20 Experiment Scalability
As the number of machines increases, the average execution time decreases
  The MR algorithm creates a sufficient number of reducers, so we can utilize a number of machines
As the data size increases, the algorithm shows scalable execution time

21 Issues & Future Work – Indexing
Execution Time of MR-Selection and each MR-Join Iteration
  MR-Selection can be a bottleneck because it takes about 40 seconds
The underlying storage structure is important
  N-triple format -> HBase, Partitioning
  Building an index needs a significant amount of loading time

22 Issues & Future Work – Pipelining
Hadoop’s MR implementation materializes intermediate results into the file system
  This takes a long time because of disk I/O
Pipelining
  Allows data to be sent and received between tasks and between jobs without disk I/O
Some implementations have become available
  Hadoop Online Prototype (http://code.google.com/p/hop/)
  CGL-MapReduce (eScience 2008)

23 Conclusion
This work is still in progress
Conclusion
  RDF Data Warehouse using MapReduce
    RDF: Flexibility, Integration, Inference
    MapReduce: Scalability, Extensibility, Fault-tolerance
  SPARQL Processing with MapReduce
    Synergy effects between RDF and MapReduce
Issues
  There still remain many issues
  System Architecture
  Loading (Indexing), Pipelining, Encoding, …

