
1 Kurt Rohloff & Richard E. Schantz Presented by Nerya Or as part of seminar 67877 at the Hebrew University of Jerusalem, 2011

2  RDF data and the SPARQL query language  Introduction to MapReduce and Hadoop  Rohloff and Schantz’s article

3  This lecture has a lot of background material!  Therefore, we’ll only touch on the basics of that background.

4  RDF – Resource Description Framework.  A standard for modeling information about resources.  Information is represented using statements about resources.  Each resource is an entity about which multiple statements can be made.

5  A very common way to present RDF data is in XML files (RDF/XML). (Example snippet: an RDF/XML description of Damian Steer, 137 Cranbook Road, Bristol.) Example source: http://www.xml.com/pub/a/2003/02/05/brownsauce.html

6  The data can be seen as a graph.  Each directed edge goes from a subject node to an object node, with the edge label being a predicate. Example source: http://www.xml.com/pub/a/2003/02/05/brownsauce.html

7  Information in RDF can be seen as a set of triples.  A triple: subject-predicate-object.  For example: Barak wears glasses ◦ “Barak” is the subject. ◦ “wears” is the predicate. ◦ “glasses” is the object.  A triples data store keeps all of the data as a set of such triples. Image source: http://www.w3.org/TR/rdf-concepts/
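To make the triples-store idea concrete, here is a minimal sketch in Python (illustrative only; the names and data are made up, not taken from any RDF library):

```python
# A triples data store: just a set of (subject, predicate, object) tuples.
store = {
    ("Barak", "wears", "glasses"),
    ("Barak", "owns", "car0"),
    ("car0", "a", "car"),
}

def match(store, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in store
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

print(match(store, p="a", o="car"))   # [('car0', 'a', 'car')]
```

Pattern matching with wildcards like this is the basic lookup primitive that query languages over triples build on.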

8  So, what do we need to know about RDF for the lecture today?  Understand the concept of a triples data store and how it can be used to model a knowledge base.  There’s much more to RDF – beyond the scope of today’s lecture.

9  SPARQL: ◦ SPARQL Protocol and RDF Query Language (a recursive acronym)  A query language for RDF data.  Somewhat similar to SQL, but aimed at RDF data.  Note to members of the audience who took course 67782 “Data on the Web” last year – ◦ The syntax is very similar to what was used in the course project.

10  The ‘?’ prefix denotes a variable.  In a query result set, all possible ways to substitute all variables while satisfying the query are returned.

SELECT ?person WHERE { ?person :owns ?car. ?car :a :car. ?car :madeIn :Detroit. }

11  The same query can also be seen as a graph pattern to match against the data graph: SELECT ?person WHERE { ?person :owns ?car. ?car :a :car. ?car :madeIn :Detroit. }

12  The goal: return all possible ways to align nodes in the database with variables in the query.
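This goal can be expressed as a naive, single-machine matcher (a sketch over assumed data; the entities and helper names are invented for illustration):

```python
# Strings starting with '?' are query variables; everything else is a constant.
store = {
    ("Alice", "owns", "car1"), ("car1", "a", "car"), ("car1", "madeIn", "Detroit"),
    ("Bob", "owns", "car2"), ("car2", "a", "car"), ("car2", "madeIn", "Tokyo"),
}
clauses = [("?person", "owns", "?car"),
           ("?car", "a", "car"),
           ("?car", "madeIn", "Detroit")]

def extend(binding, clause, triple):
    """Try to extend `binding` so that `clause` matches `triple`; None on failure."""
    new = dict(binding)
    for q, t in zip(clause, triple):
        if q.startswith("?"):
            if new.get(q, t) != t:       # conflicting assignment
                return None
            new[q] = t
        elif q != t:                     # constant mismatch
            return None
    return new

bindings = [{}]
for clause in clauses:                   # join clause by clause
    bindings = [b2 for b in bindings for t in store
                if (b2 := extend(b, clause, t)) is not None]

print([b["?person"] for b in bindings])  # ['Alice']
```

This brute force is fine for six triples; the rest of the lecture is about doing the same join clause-by-clause at cluster scale.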

13  How big can a data graph get? Huge.  For example, we can model a social network using a triples data store. ◦ A triple to represent (directed) friendship: “Alice follows Bob”  How much processing power is required to run a query on the entire Twitter network, for example?

14  One way to tackle processing of huge amounts of data is to parallelize tasks among multiple machines.  We can use lots of cheap hardware to easily increase performance.  This approach requires specialized algorithms for distributed computing.

15  Today we’ll be focusing on an algorithm which runs under the Hadoop system.  Hadoop is a framework for distributed applications, meant for work on massive amounts of data.  It’s open source and available for free.

16  Hadoop is responsible for managing jobs on a distributed cluster of computers. ◦ Communication between nodes, reliability against node failures, etc. are all handled by the Hadoop system.  Provides an abstraction for running distributed algorithms, freeing the developer from handling technical aspects of the cluster.  Developers implement MapReduce algorithms, and Hadoop handles the rest.

17  In addition to providing a distributed computing platform with MapReduce, Hadoop also provides a distributed filesystem called HDFS.  Since today’s lecture focuses specifically on a MapReduce algorithm, we will not go into details about HDFS.

18  Introduced by Google in 2004. ◦ Paper: “MapReduce: Simplified Data Processing on Large Clusters” by J. Dean and S. Ghemawat.  A way to perform distributed processing on a large number of machines (“nodes”), collectively referred to as a “cluster”.

19  Input data is split into independent chunks, and each of those is processed in parallel with the others.  Processing is also divided into two steps: Map and Reduce.  This results in many small independent problems which can be run in parallel on the cluster. ◦ Note, the Reduce stage is dependent on the Map stage.  A system (e.g. Hadoop) is in place to manage these jobs across the cluster.

20  We shall now go over the basic MapReduce algorithm, from a developer’s point of view.  At minimum, a developer must specify a Map function and a Reduce function. The rest is taken care of by the framework.

21 (Figure: MapReduce execution overview.) Source: http://labs.google.com/papers/mapreduce.html

22  Our initial input is made of many (key, value) pairs. ◦ These are each assigned to a single Map function call.  Input for the Map function: a single (key, value) pair.  Output of the Map function: a list of (key, value) pairs. ◦ Possibly of a different type than that of the input. Map(k_in, v_in) → list(k, v)

23  In the Partition stage: ◦ Each Map output is sent to a reducer using some function (e.g. hash the key and take it modulo the number of reducers).  In the Comparison stage: ◦ Input for Reduce is grouped by key and sorted. ◦ Each Reduce call will be given a single key and its list of values.
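The partition rule described above can be sketched as follows (illustrative names only, not Hadoop's actual API):

```python
# The usual hash-partitioning rule: hash the key, take it modulo the
# number of reducers.
NUM_REDUCERS = 4

def partition(key, num_reducers=NUM_REDUCERS):
    # Python's % always yields a non-negative index here.
    return hash(key) % num_reducers

# Every occurrence of the same key is routed to the same reducer, so that
# reducer sees the key's complete list of values.
assert partition("Barak") == partition("Barak")
assert 0 <= partition("Barak") < NUM_REDUCERS
```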

24  Input for the Reduce function: a single key and a list of values for that key.  Output from the Reduce function: a list of values. ◦ The type of these values can be different than that of the input. Reduce(k, list(v)) → list(v_out)

25  So overall, what do we have?  Our initial input was a list of (key, value) pairs.  The final output of the MapReduce operation is the union of all of the lists of elements produced by the Reduce function calls. MapReduce(list(k_in, v_in)) → list(v_out)
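Putting the Map, Partition/Comparison and Reduce stages together, the whole operation can be modeled on a single machine (a toy sketch of the data flow, not how Hadoop actually distributes work):

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy single-machine model of one MapReduce job.

    inputs:    list of (key, value) pairs
    map_fn:    (k, v) -> list of (k2, v2) pairs
    reduce_fn: (k2, list of v2) -> list of output values
    """
    # Map phase: every input pair is processed independently.
    intermediate = [pair for (k, v) in inputs for pair in map_fn(k, v)]
    # Shuffle (partition + comparison): group intermediate values by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: one call per key; the output lists are unified.
    return [out for k2 in sorted(groups) for out in reduce_fn(k2, groups[k2])]

out = map_reduce([("doc1", "a b a")],
                 lambda _k, text: [(w, 1) for w in text.split()],
                 lambda word, ones: [(word, sum(ones))])
print(out)   # [('a', 2), ('b', 1)]
```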

26 (Figure: MapReduce data flow.) Source: http://labs.google.com/papers/mapreduce.html

27  With this framework, developers can perform wonders by simply implementing a Java interface for the Map and Reduce functions. Check out the Hadoop interface! http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Mapper.html http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reducer.html

28  A classic example of MapReduce operations is a word counting algorithm over a set of documents.  Input: ◦ A set of (documentName, documentContents) pairs.  Output: ◦ A set of (word, numberOfAppearances) pairs.
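A sketch of this word-count example in Python (the function names are ours; a real Hadoop job would implement the Mapper/Reducer Java interfaces instead):

```python
from collections import defaultdict

def wc_map(document_name, document_contents):
    # Emit (word, 1) for every word occurrence in the document.
    return [(word, 1) for word in document_contents.split()]

def wc_reduce(word, counts):
    # All counts for one word arrive together; sum them up.
    return [(word, sum(counts))]

# Simulate the framework's shuffle over two tiny "documents".
docs = [("doc1", "to be or not to be"), ("doc2", "to do")]
groups = defaultdict(list)
for name, text in docs:
    for word, one in wc_map(name, text):
        groups[word].append(one)

result = dict(pair for w in groups for pair in wc_reduce(w, groups[w]))
print(result)   # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```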

29 (Figure: word-count in MapReduce.) Source: http://en.wikipedia.org/wiki/MapReduce

30  This example may seem silly, but given enough machines in a cluster we can process petabytes of data quickly! That’s quite impressive.  Next in this lecture, we will learn of a more interesting use for MapReduce.

31  The article under the spotlight today is called “Clause-Iteration with MapReduce to Scalably Query Data Graphs in the SHARD Graph-Store”.  The “SHARD graph-store” is the name of a system built by the authors over Hadoop, using the algorithm from the article. ◦ Since it is only the implementation’s name, we will not mention it further in the lecture.

32  The article (presented in DIDC 2011) gives a MapReduce solution to the problem of answering SPARQL queries over a triples data store.

33 SELECT ?person WHERE { ?person :owns ?car. ?car :a :car. ?car :madeIn :Detroit. }

 The query above contains 3 clauses.  Two variables are mentioned: “person” and “car”. ◦ The variable “car” could potentially be a “:dog” if it weren’t for the “?car :a :car” clause!  Our mission: find all entities in the database that can satisfy all restrictions in the query.

34  The algorithm makes an important assumption: ◦ The i’th clause in the query shares at least one variable with at least one of the clauses 1,…,(i-1).  This is without loss of generality – ◦ if a query can’t be rearranged to satisfy this assumption, then the query can be split into independent sub-queries that can be processed separately and joined later.
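One way to sketch this rearrangement (our own helper names; the paper does not give code for this step):

```python
def variables(clause):
    # Query variables are the '?'-prefixed terms of a clause.
    return {term for term in clause if term.startswith("?")}

def reorder(clauses):
    """Greedily reorder clauses so each shares a variable with the clauses
    before it; any leftover clauses form an independent sub-query."""
    remaining = list(clauses)
    ordered, seen = [remaining.pop(0)], variables(clauses[0])
    while remaining:
        for i, c in enumerate(remaining):
            if variables(c) & seen:          # connects to what we have
                seen |= variables(c)
                ordered.append(remaining.pop(i))
                break
        else:
            break   # nothing connects: the rest is an independent sub-query
    return ordered, remaining

clauses = [("?person", "owns", "?car"),
           ("?dog", "a", "dog"),
           ("?car", "madeIn", "Detroit")]
ordered, rest = reorder(clauses)
print(ordered)  # the two ?car-connected clauses, in order
print(rest)     # [('?dog', 'a', 'dog')] -- an independent sub-query
```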

35  The database we will be working with consists of lines of text, each being a “triple line”.  A triple line consists of a subject entity, followed by its predicates and the objects of those predicates. ◦ “Barak; wears,glasses; owns,car; likes,BarbieDolls; …”

36  Throughout the algorithm, variables from the query will be bound to entities in the data store.  To represent such a binding of variables to entities, a string of text is used.  Each bound variable’s name appears, prefixed by a ‘?’ and followed by the assigned entity in this binding. ◦ “?person Barak ?car car0 …”
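Both text formats are easy to parse; a small illustrative sketch (the helper names are ours, not the authors'):

```python
def parse_triple_line(line):
    """'Barak; wears,glasses; owns,car' -> [('Barak', 'wears', 'glasses'), ...]"""
    subject, *pairs = [part.strip() for part in line.split(";") if part.strip()]
    return [(subject,) + tuple(pair.split(",")) for pair in pairs]

def parse_binding(line):
    """'?person Barak ?car car0' -> {'?person': 'Barak', '?car': 'car0'}"""
    tokens = line.split()
    return dict(zip(tokens[0::2], tokens[1::2]))

def format_binding(binding):
    # Inverse of parse_binding: back to the flat '?var entity ...' string.
    return " ".join(f"{var} {entity}" for var, entity in binding.items())

print(parse_triple_line("Barak; wears,glasses; owns,car"))
# [('Barak', 'wears', 'glasses'), ('Barak', 'owns', 'car')]
print(parse_binding("?person Barak ?car car0"))
# {'?person': 'Barak', '?car': 'car0'}
```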

37  Given a query, this algorithm performs a MapReduce operation for each of the clauses in the query. ◦ Thus for a query with N clauses we will perform N MapReduce operations.  We incrementally bind query variables to data nodes this way.

38 (Figure: overview of the clause-iteration algorithm.)

39  An initial MapReduce operation finds all legal assignments of variables for the first clause in the query.  We then iterate on the remaining clauses, each time joining the new assignment options with what we already had, using a MapReduce operation.  A final MapReduce operation keeps only the variables in the SELECT, and removes duplicate results.

40  When performing the intermediate MapReduce operations, we always pass the union of the current variable bindings (from the previous iteration) and the initial triples database to the map operations.  Inside the intermediate MapReduce, there is a condition to handle triple input and binding input differently.

41  We shall now look at each of the three different MapReduce operation types that are performed in the algorithm.  These are “firstClauseMapReduce”, “intermediateClauseMapReduce”, and “selectMapReduce”.

42 (Figure: pseudocode for firstClauseMapReduce.)

43  Input: ◦ Each map job gets a triples line: “Barak; wears,glasses; owns,car; likes,BarbieDolls; …” ◦ Also note that the first clause in the query is passed.  Output: ◦ A set of (binding, null) pairs. The bindings are all the possible ways to satisfy clause 0 in the query.  A binding: “?person Barak ?car car0 …”

44  Input: ◦ Since we only add null as the value in the map job, each reduce job gets a single binding as its key and a set of nulls as its values.  Output: ◦ The binding itself, i.e. the reduce job simply removes any duplicates.
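A sketch of firstClauseMapReduce under the text formats above (all names and helpers are ours, not the authors' code):

```python
def parse_triple_line(line):
    # 'Barak; owns,car0; ...' -> [('Barak', 'owns', 'car0'), ...]
    subject, *pairs = [p.strip() for p in line.split(";") if p.strip()]
    return [(subject,) + tuple(p.split(",")) for p in pairs]

def match_clause(clause, triple):
    """Bind the clause's '?'-prefixed variables against one triple, or None."""
    binding = {}
    for q, t in zip(clause, triple):
        if q.startswith("?"):
            if binding.get(q, t) != t:
                return None
            binding[q] = t
        elif q != t:
            return None
    return binding

def first_clause_map(_key, triple_line, clause):
    # Emit (bindingString, null) for each way this line satisfies clause 0.
    return [(" ".join(f"{v} {e}" for v, e in b.items()), None)
            for t in parse_triple_line(triple_line)
            if (b := match_clause(clause, t)) is not None]

def first_clause_reduce(binding, _nulls):
    # Equal bindings meet in a single reduce call, so duplicates collapse.
    return [binding]

out = first_clause_map(None, "Barak; owns,car0; wears,glasses",
                       ("?person", "owns", "?car"))
print(out)   # [('?person Barak ?car car0', None)]
```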

45  Once the initial MapReduce operation is finished, we have all the possible variable bindings for the first clause in the query.  We then start iterating on the rest of the clauses.

46 (Figure: pseudocode for intermediateClauseMapReduce.)

47  Input: ◦ Each map job gets either a triples line or a binding from the previous iteration. ◦ commVars is supplied, which is the list of variables that are both currently bound (from past iterations) and appear in the current clause.  Always non-empty by the algorithm’s assumption! ◦ The current clause in the query is also supplied.

48  Output: ◦ If the input was a triples line, the output is a set of bindings satisfying the current clause, each emitted as a pair whose key is the binding of the commVars variables and whose value is the binding of the remaining clause variables. ◦ If the input was a binding line, simply output a re-organized version of the binding: the commVars part as the key, and the rest of the bound variables as the value.

49 (Figure: pseudocode for the intermediate Map; annotations mark which output variables are in curVars and which are in boundVars.)

50  Input: ◦ Each reduce job receives as input a single commVars binding as its key, and a list of partial bindings as its values.

51  A note: ◦ We have 2 categories of bound variables at this stage (the two are not necessarily disjoint):  Variables that are present in the current clause (curVars)  Variables which are already bound from previous clauses (boundVars).

52  A note (continued): ◦ Note that because of the map operation, we always have the variables in the intersection of the two categories as the key in the reducer input. ◦ The value set therefore consists of many bindings, each covering variables exclusively from curVars or exclusively from boundVars.

53  Output for each reduce job is produced with the following logic: ◦ Mark the value bindings that came from the current clause as “newBindings”, and those that came from previous iterations as “oldBindings”. ◦ We output all possible variations of: key + newBinding + oldBinding.  This Cartesian product is implemented with a simple nested loop in the reducer.
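The reduce logic above can be sketched as follows (the explicit “cur”/“old” tags are one plausible way for the reducer to tell the two sides apart; the variable names, e.g. “?madeBy”, are invented):

```python
def intermediate_reduce(key, tagged_values):
    """key: the commVars binding; values: partial bindings tagged by origin."""
    new_bindings = [v for tag, v in tagged_values if tag == "cur"]  # current clause
    old_bindings = [v for tag, v in tagged_values if tag == "old"]  # past iterations
    # Cartesian product via a simple nested loop, as described on the slide.
    out = []
    for new in new_bindings:
        for old in old_bindings:
            out.append(f"{key} {new} {old}")
    return out

# If either side is empty, nothing is emitted: that is exactly how
# contradictory or previously-unseen bindings get filtered out.
print(intermediate_reduce("?car car0",
                          [("cur", "?madeBy Ford"), ("old", "?person Barak")]))
# ['?car car0 ?madeBy Ford ?person Barak']
```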

54  Each key in the ‘reduce’ is a binding, over the variables in commVars.  If the binding in the key contains any new assignments (different from the assignments until now on these variables), then the “oldBindings” set will be empty when the algorithm runs, and no results with this assignment will be outputted.

55  Else, all bindings inside the key have appeared in previous iterations as they are.  The only case here in which we would want to filter out a result is when this past binding contradicts the current clause’s restriction.  Indeed, in this case the “newBindings” set will be empty and thus no result will be outputted.

56  The only case that’s left to look at is when the bindings inside the key have appeared as they are in the past, and in addition they don’t contradict the current clause.  In this case both “newBindings” and “oldBindings” will be non-empty, and we will output the key prefixed to their Cartesian product.

57  Therefore, we can always be sure that the following holds –  After the i’th iteration, all outputted tuples from the Reduce runs satisfy the query up to and including the i’th clause. ◦ And those are all the tuples that satisfy this property (i.e. none are missing).

58  After iterating over all clauses, we run one last MapReduce job to keep only the variables in the query’s SELECT.  We also remove any duplicate output in this final MapReduce.

59 (Figure: pseudocode for selectMapReduce.)

60  Input: ◦ Every Map job gets as input a binding produced by the last iteration. ◦ The list of variables in the query’s SELECT is also passed.  Output: ◦ Only the variables in the input binding which also appear in the query’s SELECT.

61  Input: ◦ Every Reduce’s input is a projected binding as its key, with a set of nulls as its values.  Output: ◦ A list of all keys in the input (duplicates removed).
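A sketch of selectMapReduce, including a simulated shuffle (illustrative names only, not the authors' code):

```python
from collections import defaultdict

def select_map(binding_line, select_vars):
    # Project the binding onto the SELECT variables; emit (projection, null).
    tokens = binding_line.split()
    binding = dict(zip(tokens[0::2], tokens[1::2]))
    key = " ".join(f"{v} {binding[v]}" for v in select_vars if v in binding)
    return [(key, None)]

def select_reduce(key, _nulls):
    # Equal projections share one reduce call, so duplicates disappear.
    return [key]

# Two bindings that differ only on ?car project to the same ?person.
pairs = (select_map("?person Barak ?car car0", ["?person"])
         + select_map("?person Barak ?car car1", ["?person"]))
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
results = [r for k in groups for r in select_reduce(k, groups[k])]
print(results)   # ['?person Barak']
```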

62  As we’ve seen, the algorithm is correct. ◦ Its correctness is not trivial to see…  Since it runs on Hadoop, the process is parallelized and scalable to extremely large data sets.  We now present a sample run of the algorithm. ◦ Source for demo: http://www.cse.buffalo.edu/faculty/tkosar/dadc08/slides/didc-paper5.pdf

63 SELECT ?person WHERE { ?person :owns ?car. ?car :a :car. }

64 SELECT ?person WHERE { ?person :owns ?car. ?car :a :car. }

65 SELECT ?person WHERE { ?person :owns ?car. ?car :a :car. }  The final MapReduce will remove the “?car” bindings.

66  The authors have implemented the algorithm over Cloudera Hadoop.  The code was run on an Amazon EC2 cloud of 20 XL compute nodes.  Benchmark used: LUBM ◦ Input data is auto-generated. ◦ Data revolves around students, courses and lecturers in universities, with relationships from that domain.

67  Queries 1, 9 and 14 of the LUBM benchmark were run. ◦ Query 1 asks for the students that take a particular course and returns a very small set of responses. ◦ Query 9 asks for all teachers, students and courses such that the teacher is the adviser of the student, who takes a course taught by the teacher. ◦ Query 14 is relatively simple as it asks for all undergraduate students (but the response is very large).

68  Data: approx. 800 million edges – several GB ◦ Query 1: 404 sec. (approx. 0.1 hr.) ◦ Query 9: 740 sec. (approx. 0.2 hr.) ◦ Query 14: 118 sec. (approx. 0.03 hr.)  Results get better as more hardware is added.

69  The results were compared against an industry-standard triple store DB: DAMLDB. ◦ DAMLDB runs on a single server. ◦ Other triple store DBs timed out when loading the input data.  A comparison of the time taken (in hours): ◦ The results are great!

Query     SHARD   DAMLDB
1         0.1     –
9         0.2     1
14        0.03    1

70  Comprehensive information about RDF: http://www.w3.org/standards/techs/rdf
 SPARQL Query Language for RDF (official specification): http://www.w3.org/TR/rdf-sparql-query/
 “MapReduce: Simplified Data Processing on Large Clusters”, by J. Dean and S. Ghemawat: http://labs.google.com/papers/mapreduce.html
 Apache™ Hadoop™: http://hadoop.apache.org/
 “Clause-Iteration with MapReduce to Scalably Query Data Graphs in the SHARD Graph-Store” by K. Rohloff and R. Schantz: http://www.dist-systems.bbn.com/people/krohloff/papers/2011/Rohloff_Schantz_DIDC_2011.pdf

