SPARQL Basic Graph Pattern Processing with Iterative MapReduce 2010-04-26 Presented by Jaeseok Myung Intelligent Database Systems Lab School of Computer.


1 SPARQL Basic Graph Pattern Processing with Iterative MapReduce Presented by Jaeseok Myung Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea

2 MapReduce
MapReduce is easily accessible
– The Hadoop project provides an open-source MR implementation
MapReduce gives users a simple abstraction for utilizing parallel and distributed systems
Programming model
– Map(k, v) -> list(k, v)
– Reduce(k, list(v)) -> list(v)
Useful for massive data processing
Copyright 2010 by CEBT, Center for E-Business Technology, MDAC 2010 – 2/23
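
The programming model on this slide can be illustrated with a small in-memory simulation. This is only a sketch for intuition, not Hadoop code: `run_mapreduce`, `count_map` and `count_reduce` are hypothetical names, the word-count job is an assumed example, and the reduce step returns (key, value) pairs rather than bare values so the output stays readable.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce job in memory: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):  # sorted only to make the output deterministic
        output.extend(reduce_fn(key, groups[key]))
    return output


def count_map(_line_no, line):
    # Map(k, v) -> list(k, v): emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def count_reduce(word, counts):
    # Reduce(k, list(v)) -> list(v): here a single (word, total) pair.
    return [(word, sum(counts))]


lines = [(0, "rdf sparql rdf"), (1, "sparql mapreduce")]
word_counts = run_mapreduce(lines, count_map, count_reduce)
print(word_counts)
```

A real framework distributes the map and reduce calls over many machines; the grouping step in the middle is what the later MR-Join slides rely on.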

3 MR & Cloud Computing
MapReduce is a kind of platform
MapReduce utilizes a number of commodity machines
There can be a number of applications using MapReduce
[Figure: MapReduce App.]

4 RDF Data Warehouse using MapReduce
Data warehouse using MapReduce
Extensive studies have shown that MR is specialized for large-scale, fault-tolerant data analyses
Hive, CloudBase
– Data warehousing solutions built on top of Hadoop
Advantages
– Scalability
– Extensibility
– Fault-tolerance
My research interest
RDF data warehouse using MapReduce

5 Why RDF Data Warehouse?
Flexible data model
– The underlying structure of any expression in RDF is a collection of triples (s, p, o)
Data integration
– RDB-to-RDF (intra)
– Linked Open Data (inter)
– Incremental integration
Inference
– We can discover some knowledge from what we already know
– A goal of data analyses

6 Approaches & Advantages
[Figure: approaches to building a data warehouse. Labels: Conventional DW Solutions, RDF Data Warehouse; Centralized, Distributed & Parallel; Before the Cloud, (MR) Cloud Computing; Flexibility, Integration, Inference, Complexity; Large-scale data analyses, Scalability, Extensibility, Fault-tolerance; Support Tools, Simple, Fast, Performance Optimization]

7 SPARQL BGP Processing with MapReduce
Both RDF and MapReduce can benefit a data warehouse
RDF is a data model
– Flexibility, integration, inference
MapReduce is a programming model
– Scalability, extensibility, fault-tolerance
It has been difficult to create synergy because only a few algorithms connect the data model and the framework
We should focus on an MR algorithm that manipulates RDF datasets
A MapReduce algorithm for SPARQL Basic Graph Pattern processing

8 SPARQL Basic Graph Pattern
SPARQL is a query language for RDF datasets
A Basic Graph Pattern (BGP) is a set of triple patterns
Triple patterns are similar to RDF triples (s, p, o), except that each of the subject, predicate and object can be a variable
BGP processing is important
– Most SPARQL queries have one or more BGPs
– BGPs require expensive join operations among triple patterns
Example query (one BGP with five triple patterns, TP#1 to TP#5):
SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor.    (TP#1)
  ?x ub:worksFor.              (TP#2)
  ?x ub:name ?y1.              (TP#3)
  ?x ub: Address ?y2.          (TP#4)
  ?x ub:telephone ?y3          (TP#5)
}
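
What it means for a triple to satisfy a triple pattern can be sketched in a few lines. This is a minimal illustration, not the presenter's code: `match` is a hypothetical helper, terms are plain strings, and a term is treated as a variable when it starts with `?`.

```python
def is_variable(term):
    return term.startswith("?")


def match(pattern, triple):
    """Return the variable bindings if the triple satisfies the pattern, else None."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            # A variable used twice in one pattern must bind to the same value.
            if bindings.get(p_term, t_term) != t_term:
                return None
            bindings[p_term] = t_term
        elif p_term != t_term:
            return None  # a constant term must match exactly
    return bindings


tp1 = ("?x", "rdf:type", "ub:Professor")
tp3 = ("?x", "ub:name", "?y1")
triple_a = ("Prof0", "rdf:type", "ub:Professor")
triple_b = ("Prof0", "ub:name", "Professor0")

print(match(tp1, triple_a))
print(match(tp3, triple_b))
print(match(tp1, triple_b))
```

Joining the bindings produced for different triple patterns on their shared variables (here `?x`) is the expensive part that the rest of the talk addresses.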

9 SPARQL BGP Processing with MapReduce
Two operations
MR-Selection
– Extracts RDF triples which satisfy at least one triple pattern
MR-Join
– Merges selected triples
[Figure: the example query from slide 8 processed in two steps, MR-Selection followed by MR-Join, over triples with predicates such as rdf:type, ub:worksFor, ub:name and ub:telephone]

10 MR-Selection
public void map() {
    read a triple (s, p, o)
    // example: s = Prof0, p = rdf:type, o = ub:Professor
    for each (triple pattern in the given query) {
        if (input triple satisfies the triple pattern) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = 1 (number of the triple pattern that was satisfied)
            output (key, value)
        }
    }
}
public void reduce() {
    read input from the map function
    // input format: (key, list(satisfied tp_numbers))
    for each (value in the list of tp_numbers) {
        make a key and a value
        // key = x, value = [x]Prof0
        output (key, value)
    }
}
Example query: as on slide 8
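
The MR-Selection pseudocode can be sketched in runnable form as follows. This is an illustration under stated assumptions, not the presenter's implementation: the job runs in memory instead of on Hadoop, `PATTERNS` is a hypothetical two-pattern query, all function names are made up, and each output row keeps its triple pattern number (an assumption, so that the result tables can be told apart).

```python
from collections import defaultdict

# Hypothetical query: (tp_number, (s, p, o)); terms starting with "?" are variables.
PATTERNS = [
    (1, ("?x", "rdf:type", "ub:Professor")),
    (3, ("?x", "ub:name", "?y1")),
]


def bind(pattern, triple):
    """Return {variable: value} if the triple satisfies the pattern, else None."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            if bindings.setdefault(p_term[1:], t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return bindings


def encode(bindings):
    """Encode bindings as on the slides, e.g. '[x]Prof0|[y1]Professor0'.
    Simplification: values are assumed not to contain '|' or ']'."""
    return "|".join(f"[{var}]{val}" for var, val in sorted(bindings.items()))


def selection_map(triple):
    # Emit (encoded bindings, tp_number) for every pattern the triple satisfies.
    for tp_number, pattern in PATTERNS:
        bindings = bind(pattern, triple)
        if bindings is not None:
            yield encode(bindings), tp_number


def selection_reduce(key, tp_numbers):
    # Emit one row per satisfied pattern: (tp_number, variable names, bindings).
    variables = "|".join(part[1:part.index("]")] for part in key.split("|"))
    for tp_number in tp_numbers:
        yield tp_number, variables, key


def mr_selection(triples):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for triple in triples:
        for key, value in selection_map(triple):
            groups[key].append(value)
    return sorted(row for key in groups for row in selection_reduce(key, groups[key]))


triples = [
    ("Prof0", "rdf:type", "ub:Professor"),
    ("Prof0", "ub:name", "Professor0"),
    ("Dept0", "rdf:type", "ub:Department"),
]
tables = mr_selection(triples)
print(tables)
```

The `Dept0` triple satisfies no pattern and is dropped, which is the filtering effect MR-Selection is meant to have before any join runs.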

11 MR-Selection
Conceptually, the MR-Selection algorithm produces temporary tables, one for each triple pattern, holding the bindings that satisfy it
– A result table has variable names, just as a relational table has attribute names
– It also has values for the variable names, as a relational table does
– The result table will be used for the next MR-Join operation if necessary
[Figure: result tables for tp1 to tp5 with columns (x), (x), (x, y1), (x, y2), (x, y3)]

12 MR-Join: Map
BGP Analyzer
– Examines a given query before execution and provides join-keys to the map function
Join-key (shared variable): ?x
[Figure: the mapper groups the selected triples of the example query (slide 8) by the values of the join-key variable]

13 MR-Join: Map
public void map() {
    read input from MR-Selection
    // example input: (x, [x]Prof0)
    // example input: (x|y1, [x]Prof0|[y1]Professor0)
    get join-key variables and the corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
    for each (join-key determined by the BGP Analyzer) {
        if (input is related to the join-key) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = 1 [x]Prof0 (triple pattern number, variable name, value)
            // value = 3 [x]Prof0|[y1]Professor0
            output (key, value)
        }
    }
}
Example query: as on slide 8
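
A runnable sketch of this map step is shown below. It is only an illustration: `JOIN_KEYS` stands in for the output of the BGP Analyzer (its shape is an assumption), the input rows are assumed to carry their triple pattern number, and `parse` and `join_map` are hypothetical names.

```python
# Assumed BGP Analyzer output: join-key variable -> triple pattern numbers to join.
JOIN_KEYS = {"x": (1, 2, 3, 4, 5)}


def parse(encoded):
    """Decode '[x]Prof0|[y1]Professor0' into {'x': 'Prof0', 'y1': 'Professor0'}."""
    bindings = {}
    for part in encoded.split("|"):
        var, value = part[1:].split("]", 1)
        bindings[var] = value
    return bindings


def join_map(tp_number, encoded):
    """Emit (join-key value, (tp_number, bindings)) for each related join-key."""
    bindings = parse(encoded)
    for join_var, tp_numbers in JOIN_KEYS.items():
        if join_var in bindings and tp_number in tp_numbers:
            key = f"[{join_var}]{bindings[join_var]}"
            yield key, (tp_number, encoded)


out1 = list(join_map(1, "[x]Prof0"))
out3 = list(join_map(3, "[x]Prof0|[y1]Professor0"))
unrelated = list(join_map(6, "[y2]a|[y4]b"))
print(out1, out3, unrelated)
```

Because both rows are emitted under the same key `[x]Prof0`, the shuffle phase delivers them to the same reducer, which is what makes the join possible.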

14 MR-Join: Reduce
BGP Analyzer
– Can provide the triple pattern numbers related to the join-key variable by examining a given query
Triple pattern numbers related to the join-key variable: 1, 2, 3, 4, 5
[Figure: the reducer applies the constraints for join-key variable x to the grouped triples of the example query (slide 8)]

15 MR-Join: Reduce
public void reduce() {
    read input from the map function
    // example input: ([x]Prof0, [1 [x]Prof0, 3 [x]Prof0|[y1]Professor0])
    get join-key variables and the corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
    create a temporary hashtable H
    for each (value in values) {
        add an element to H
        // key = x, value = [x]Prof0
        // key = x|y1, value = [x]Prof0|[y1]Professor0
    }
    // H is used to check whether the input satisfies all related triple patterns
    if (keys in H cover all tp_numbers to be joined) {
        make a Cartesian product among the values in H
        // (a1, b1), (a1, c1) => (a1, b1, c1)
        make a key and a value
        // key = x|y1
        // value = [x]Prof0|[y1]Professor0
        output (key, value)
    }
}
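
The reduce step can be sketched as follows. This is a hedged illustration rather than the presenter's code: the hashtable H is keyed by triple pattern number here (an assumption that makes the coverage check direct), bindings are plain dictionaries, the telephone values are invented sample data, and `join_reduce` is a hypothetical name.

```python
from itertools import product


def join_reduce(values, required_tps):
    """values: list of (tp_number, bindings dict) that share one join-key value."""
    table = {}  # the temporary hashtable H: tp_number -> list of bindings
    for tp_number, bindings in values:
        table.setdefault(tp_number, []).append(bindings)
    if not set(required_tps) <= set(table):
        return []  # some triple pattern has no match for this join-key value
    joined = []
    # Cartesian product: one bindings dict per required triple pattern.
    for combo in product(*(table[tp] for tp in required_tps)):
        row = {}
        for bindings in combo:
            # Assumes the patterns share only the join-key variable; other
            # shared variables would need an explicit consistency check.
            row.update(bindings)
        joined.append(row)
    return joined


values = [
    (1, {"x": "Prof0"}),
    (3, {"x": "Prof0", "y1": "Professor0"}),
    (5, {"x": "Prof0", "y3": "111-1111"}),
    (5, {"x": "Prof0", "y3": "222-2222"}),
]
rows = join_reduce(values, required_tps=(1, 3, 5))
missing = join_reduce(values, required_tps=(1, 2, 3))
print(rows)
print(missing)
```

The second call returns nothing because no value was received for triple pattern 2, which mirrors the "cover all tp_numbers" check in the pseudocode.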

16 Join-key Selection Strategies
The BGP Analyzer provides join-key variables by analyzing a query
How to select join-key variables?
If a BGP has one shared variable
– We can simply select that variable
If a BGP has two or more shared variables
– We apply two heuristics to select join-key variables
– Greedy selection: select a join-key according to the number of related triple patterns
– Multiple selection: select join-keys until every triple pattern participates in an MR-Join operation; this utilizes the distributed and parallel system architecture
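
One possible reading of the two heuristics, for a single MR-Join iteration, is sketched below. The slides do not give the exact procedure, so everything here is an assumption: `select_join_keys` is a hypothetical function, ties are broken by variable name, and the constants `Dept0` and `ub:emailAddress` in the sample query are placeholders chosen for illustration.

```python
def variables(pattern):
    return {term for term in pattern if term.startswith("?")}


def select_join_keys(patterns):
    """patterns: {tp_number: (s, p, o)}. Returns [(variable, [tp_numbers])]."""
    remaining = dict(patterns)
    selected = []
    while remaining:
        related = {}
        for tp_number, pattern in remaining.items():
            for var in variables(pattern):
                related.setdefault(var, []).append(tp_number)
        # Greedy selection: the variable related to the most remaining patterns
        # (ties broken by variable name so the result is deterministic).
        best_var, best_tps = max(
            related.items(),
            key=lambda item: (len(item[1]), item[0]),
            default=(None, []),
        )
        if len(best_tps) < 2:
            break  # no variable is shared any more in this iteration
        # Multiple selection: keep choosing join-keys for the patterns left over.
        selected.append((best_var, sorted(best_tps)))
        for tp_number in best_tps:
            del remaining[tp_number]
    return selected


query = {
    1: ("?x", "rdf:type", "ub:Professor"),
    2: ("?x", "ub:worksFor", "Dept0"),        # hypothetical constant
    3: ("?x", "ub:name", "?y1"),
    4: ("?x", "ub:emailAddress", "?y2"),      # assumed predicate name
    5: ("?x", "ub:telephone", "?y3"),
    6: ("?y2", "ub:alias", "?y4"),
}
print(select_join_keys(query))
```

For this query the sketch selects `?x` for patterns 1 to 5 and leaves pattern 6 unjoined, so a further MR-Join iteration on `?y2` would be needed.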

17 SPARQL BGP Processing with MR
Advantages
MapReduce can benefit from the multi-way join technique
– If triple patterns share a variable, MR can join them all at once
– It is not unusual for a BGP to have several triple patterns sharing the same variable, because RDF has a fixed, simple data model
[Figure: for the example query on slide 8, (a) joining tp1 to tp5 one at a time, with intermediate results (x), (x, y1), (x, y1, y2), (x, y1, y2, y3); (b) joining tp1 to tp5 at once]

18 SPARQL BGP Processing with MR
Disadvantages
If we have two or more shared variables, we need expensive MR iterations
– The triple patterns in a query cannot all be covered by a single variable
– If we have two shared variables, MR iterations cannot be avoided
– To reduce unnecessary MR iterations, join-key selection strategies should be applied
SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor.
  ?x ub:worksFor.
  ?x ub:name ?y1.
  ?x ub: Address ?y2.
  ?x ub:telephone ?y3.
  ?y2 ub:alias ?y4
}
[Figure: the joined result (x, y1, y2, y3) of tp1 to tp5 is joined with the table (y2, y4) of tp6 in a further iteration]

19 Experiment
Environment
– LUBM dataset
– Amazon EC2, Cloudera's Hadoop Distribution, Amazon EBS
The effect of multi-way join
– The multi-way join technique reduces the execution time by joining several triple patterns at once
– Some queries do not show a significant difference because they are too simple to take advantage of the multi-way join
[Table: queries Q1 to Q14 with rows 2-way, Multi-way and Diff; values not preserved in the transcript]

20 Experiment
Scalability
As the number of machines increases, the average execution time decreases
– The MR algorithm creates a sufficient number of reducers, so we can utilize a number of machines
As the data size increases, the algorithm shows scalable execution time

21 Issues & Future Work – Indexing
Execution time of MR-Selection and each MR-Join iteration
– MR-Selection can be a bottleneck because it takes about 40 seconds
The underlying storage structure is important
– N-triple format -> HBase, partitioning
– Building an index needs a significant amount of loading time

22 Issues & Future Work – Pipelining
Hadoop's MR implementation materializes intermediate results into the file system
– This takes a long time because of disk I/O
Pipelining
– Allows data to be sent and received between tasks and between jobs without disk I/O
– Some implementations have become available: Hadoop Online Prototype (http://code.google.com/p/hop/), CGL-MapReduce (eScience 2008)

23 Conclusion
There still remain many issues
– This work is still in progress
Conclusion
RDF Data Warehouse using MapReduce
– RDF: flexibility, integration, inference
– MapReduce: scalability, extensibility, fault-tolerance
SPARQL processing with MapReduce
– Synergy effects between RDF and MapReduce
– Issues: system architecture, loading (indexing), pipelining, encoding, …

