Slide 1: Cluster Computing and Datalog
- Recursion via Map-Reduce
- Seminaïve Evaluation
- Re-engineering Map-Reduce for Recursion

Slide 2: Acknowledgements
- Joint work with Foto Afrati.
- Alkis Polyzotis and Vinayak Borkar contributed to the architecture discussions.

Slide 3: Implementing Datalog via Map-Reduce
- Joins are straightforward to implement as a round of map-reduce.
- Likewise, union/duplicate-elimination is a round of map-reduce.
- But implementation of a recursion can thus take many rounds of map-reduce.
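
Below is a minimal Python sketch of the first point: one map-reduce "round" that joins r(W,X) with s(X,Y) on the shared variable X. The relation contents and function names are illustrative assumptions, not the talk's code.

```python
# One map-reduce round joining r(W,X) and s(X,Y) on X (illustrative sketch).
from collections import defaultdict

def map_phase(r, s):
    """Emit (join-key, tagged value) pairs, as the mappers would."""
    for (w, x) in r:
        yield x, ('r', w)
    for (x, y) in s:
        yield x, ('s', y)

def reduce_phase(pairs):
    """Group by key and pair up the r- and s-values, as the reducers would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    for x, values in groups.items():
        ws = [v for tag, v in values if tag == 'r']
        ys = [v for tag, v in values if tag == 's']
        for w in ws:
            for y in ys:
                yield (w, y)            # a tuple of the join r ⋈ s

r = [(1, 2), (3, 2)]
s = [(2, 5)]
print(sorted(reduce_phase(map_phase(r, s))))    # [(1, 5), (3, 5)]
```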

Slide 4: Seminaïve Evaluation
- A specific combination of joins and unions.
- Example: chain rule q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z).
- Let r, s, t = "old" relations; r', s', t' = incremental relations.
- Simplification: assume |r'| = a|r|, etc.

Slide 5: A 3-Way Join Using Map-Reduce
q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
- Use k compute nodes.
- Give X and Y "shares" that determine the reduce task that gets each tuple.
- The optimum strategy replicates r and t, not s, using communication |s| + 2√(k|r||t|).
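
For concreteness, here is a small Python sketch (with assumed relation sizes and hash functions, not from the slides) of the share-based strategy: r is replicated across the Y-share and t across the X-share, while each s-tuple goes to exactly one reduce task, for total communication |s| + c|r| + b|t|, minimized at |s| + 2√(k|r||t|) when b·c = k.

```python
# Illustrative sketch: choose shares b (for X) and c (for Y) with b*c = k,
# and route tuples of r(W,X), s(X,Y), t(Y,Z) to the k reduce tasks.
import math

def optimal_shares(k, size_r, size_t):
    """Real-valued optimum, rounded; a real implementation handles rounding carefully."""
    b = math.sqrt(k * size_r / size_t)   # share of X
    c = k / b                            # share of Y
    return int(round(b)), int(round(c))

def route(tup, rel, b, c):
    """Return the set of (X-bucket, Y-bucket) reduce tasks that receive tup."""
    if rel == 's':                       # s(X,Y): exactly one task
        x, y = tup
        return {(hash(x) % b, hash(y) % c)}
    if rel == 'r':                       # r(W,X): replicate over all Y-buckets
        _, x = tup
        return {(hash(x) % b, j) for j in range(c)}
    if rel == 't':                       # t(Y,Z): replicate over all X-buckets
        y, _ = tup
        return {(i, hash(y) % c) for i in range(b)}

b, c = optimal_shares(k=100, size_r=1_000_000, size_t=4_000_000)
print(b, c)                              # 5 20
print(len(route((7, 8), 'r', b, c)))     # each r-tuple is sent to c = 20 tasks
```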

Slide 6: Seminaïve Evaluation – (2)
- Need to compute the sum (union) of seven terms (joins):
  rst' + rs't + r'st + rs't' + r'st' + r's't + r's't'
- Obvious method for computing a round of seminaïve evaluation:
  - Replicate r and r'; replicate t and t'; do not replicate s or s'.
  - Communication = (1+a)(|s| + 2√(k|r||t|)).
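
The seven terms are exactly the eight combinations of old and incremental versions of r, s, and t, minus the all-old term rst, which was computed in earlier rounds. A few lines of Python (purely illustrative) enumerate them:

```python
# Enumerate the seminaïve terms: every choice of old vs. incremental relation,
# except r s t, which contains no incremental relation.
from itertools import product

terms = [combo for combo in product(["r", "r'"], ["s", "s'"], ["t", "t'"])
         if combo != ("r", "s", "t")]
print(len(terms))                    # 7
print([" ".join(c) for c in terms])
```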

Slide 7: Seminaïve Evaluation – (3)
- There are many other ways we might use k nodes to do the same task.
- Example: one group of nodes does (r+r')s'(t+t'); a second group does r's(t+t'); the third group does rst'.
- Theorem: no grouping does better than the obvious method for this example.

Slide 8: Networks of Processes for Recursions
- Is it possible to do a recursion without multiple rounds of map-reduce and their associated communication cost?
- Note: tasks do not have to be map or reduce tasks; they can have other behaviors.

Slide 9: Example: Very Simple Recursion
p(X,Y) :- e(X,Z) & p(Z,Y)
p(X,Y) :- p0(X,Y)
- Use k compute nodes.
- Hash Y-values to one of k buckets, h(Y).
- Each node gets a complete copy of e.
- p0 is distributed among the k nodes, with p0(x,y) going to node h(y).

Slide 10: Example – Continued
p(X,Y) :- e(X,Z) & p(Z,Y)
- Each node applies the recursive rule and generates new tuples p(x,y).
- Key point: a new tuple inherits its Y-value from the p-tuple it was derived from, so it hashes to the same node and no communication is necessary.
- Duplicates are eliminated locally.
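
A minimal sketch of what one compute node does in this scheme, with hypothetical names and in-memory sets (not the talk's implementation): it holds a full copy of e plus the p-tuples whose Y-values hash to its bucket, and iterates to a local fixpoint with no communication.

```python
# One node's work for p(X,Y) :- e(X,Z) & p(Z,Y); illustrative sketch.
def local_fixpoint(e, p0_local):
    """e: set of (x, z) facts; p0_local: the p0(x, y) facts with h(y) = this node."""
    p = set(p0_local)                # every tuple here has this node's Y-bucket
    delta = set(p0_local)
    while delta:
        new = {(x, y)
               for (x, z) in e
               for (z2, y) in delta
               if z2 == z} - p       # seminaïve: join e only with new tuples
        p |= new
        delta = new
    return p

e = {(1, 2), (2, 3)}
print(sorted(local_fixpoint(e, {(3, 9)})))   # [(1, 9), (2, 9), (3, 9)]
```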

Slide 11: Harder Case of Recursion
- Consider a recursive rule p(X,Y) :- p(X,Z) & p(Z,Y).
- Responsibility is divided among compute nodes by hashing Z-values.
- Node n gets tuple p(a,b) if either h(a) = n or h(b) = n.

Slide 12: Compute Node for h(Z) = n
[Diagram: the node for h(Z) = n receives tuples p(a,b) with h(a) = n or h(b) = n, remembers all received tuples (to eliminate duplicates), searches for matches among them, and sends each produced tuple p(c,d) to the nodes for h(c) and h(d).]
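
A sketch of such a node in Python, with assumed names (k buckets, a send callback) rather than the actual task interface: it stores every tuple it has received, joins each genuinely new tuple against that store on the Z-values it owns, and routes each result p(c,d) to the nodes for h(c) and h(d).

```python
# Illustrative sketch of the node for bucket n in p(X,Y) :- p(X,Z) & p(Z,Y).
class JoinNode:
    def __init__(self, n, k, send):
        self.n, self.k = n, k        # this node's bucket and the bucket count
        self.store = set()           # all p-tuples received so far
        self.send = send             # send(bucket, tup): deliver to another node

    def receive(self, a, b):
        """Handle incoming p(a,b), sent here because h(a) = n or h(b) = n."""
        if (a, b) in self.store:
            return                   # duplicate: nothing new to do
        produced = set()
        if hash(b) % self.k == self.n:
            # Join on Z = b: the new tuple acts as p(X,Z), stored ones as p(Z,Y).
            produced |= {(a, y) for (x, y) in self.store if x == b}
        if hash(a) % self.k == self.n:
            # Join on Z = a: the new tuple acts as p(Z,Y), stored ones as p(X,Z).
            produced |= {(x, b) for (x, y) in self.store if y == a}
        self.store.add((a, b))
        for (c, d) in produced:
            for bucket in {hash(c) % self.k, hash(d) % self.k}:
                self.send(bucket, (c, d))
```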

Slide 13: Comparison with Iteration
- Advantage: lets us avoid some communication of data that would be needed in iterated map-reduce rounds.
- Disadvantage: tasks run longer, so they are more likely to fail.

Slide 14: Node Failures
- To cope with failures, map-reduce implementations rely on each task getting its input at the beginning, and on its output not being consumed elsewhere until the task completes.
- But recursions can't work that way.
- What happens if a node fails after some of its output has been consumed?

Slide 15: Node Failures – (2)
- Actually, there is no problem!
- We restart the tasks of the failed node at another node.
- The replacement task will send some data that the failed task also sent.
- But each node remembers the tuples it has received in order to eliminate duplicates anyway, so the re-sent data is harmless.

Slide 16: Node Failures – (3)
- But the "no problem" conclusion is highly dependent on the Datalog assumption that we are computing sets.
- The argument would fail if we were computing bags or aggregations of the tuples produced.
- Similar problems arise for other recursions, e.g., PDEs.

Slide 17: Extension of the Map-Reduce Architecture for Recursion
- Necessarily, all tasks need to operate in rounds.
- The master controller learns of all input files that are part of the round-i input to task T and records that T has received these files.

Slide 18: Extension – (2)
- Suppose some task S fails and never supplies the round-(i+1) input to T.
- A replacement S' for S is started at some other node.
- The master knows that T has received up to round i from S, so it ignores the first i output files from S'.

Slide 19: Extension – (3)
- The master knows where all the inputs ever received by S came from, so it can provide those inputs to S'.
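
A rough sketch of the bookkeeping this implies for the master, with hypothetical data structures (the slides do not specify them): it tracks how many rounds each consumer has received from each producer, and every input file each task has ever received, so a replacement's duplicate rounds can be skipped and its inputs replayed.

```python
# Hypothetical master-controller bookkeeping for restarting a failed task.
from collections import defaultdict

class Master:
    def __init__(self):
        self.rounds_delivered = defaultdict(int)   # (producer, consumer) -> i
        self.inputs_of = defaultdict(list)         # task -> input files received

    def record_delivery(self, producer, consumer, round_file):
        self.rounds_delivered[(producer, consumer)] += 1
        self.inputs_of[consumer].append(round_file)

    def plan_restart(self, failed_task):
        """Return what to replay to the replacement and which rounds to skip."""
        replay = list(self.inputs_of[failed_task])
        skip = {consumer: i
                for (producer, consumer), i in self.rounds_delivered.items()
                if producer == failed_task}        # ignore first i files from S'
        return replay, skip
```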

Slide 20: Checkpointing and State
- Another approach is to design tasks so that they can periodically write a state file, which is replicated elsewhere.
- Tasks take input + state.
  - Initially, the state is empty.
- The master can restart a task from some state and feed it only the inputs received after that state was written.
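
A minimal sketch of such a task, with assumed interfaces (pickled state, a write_checkpoint callback): its state is the set of tuples seen so far, written out periodically; on restart it is constructed from the last checkpoint and fed only the inputs that arrived after that checkpoint was written.

```python
# Illustrative checkpointed task: state = tuples seen so far.
import pickle

class CheckpointedTask:
    def __init__(self, state_bytes=None, checkpoint_every=1000):
        self.state = pickle.loads(state_bytes) if state_bytes else set()
        self.unsaved = 0
        self.checkpoint_every = checkpoint_every

    def consume(self, tuples, write_checkpoint):
        """Process an input file; return only the truly new tuples."""
        new = set(tuples) - self.state
        self.state |= new
        self.unsaved += len(new)
        if self.unsaved >= self.checkpoint_every:
            write_checkpoint(pickle.dumps(self.state))  # replicated elsewhere
            self.unsaved = 0
        return new
```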

Slide 21: Example: Checkpointing
p(X,Y) :- p(X,Z) & p(Z,Y)
- Two groups of tasks:
  1. Join tasks: hash on Z, using h(Z).
     - Like the tasks from the previous example.
  2. Eliminate-duplicates tasks: hash on X and Y, using h'(X,Y).
     - Each receives tuples from the join tasks.
     - Each distributes truly new tuples to the join tasks.

Slide 22: Example – (2)
[Diagram: join tasks, whose state holds p(x,y) if h(x) or h(y) is the task's bucket, send each tuple p(a,b) they produce to the dup-elim task for h'(a,b); dup-elim tasks, whose state holds p(x,y) if h'(x,y) is the task's bucket, send p(a,b) back to the join tasks for h(a) and h(b) only if it is new.]
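
A sketch of the second rank, again with assumed names: a dup-elim task owns the tuples p(x,y) with h'(x,y) equal to its bucket, and forwards a tuple to the join tasks for h(x) and h(y) only the first time it sees it.

```python
# Illustrative dup-elim task for the two-rank scheme of Slide 22.
class DupElimTask:
    def __init__(self, k, send_to_join):
        self.k = k                         # number of join-task buckets for h
        self.seen = set()                  # this task's state
        self.send_to_join = send_to_join   # send_to_join(bucket, tup)

    def receive(self, x, y):
        if (x, y) in self.seen:
            return                         # duplicate: drop it
        self.seen.add((x, y))
        for bucket in {hash(x) % self.k, hash(y) % self.k}:
            self.send_to_join(bucket, (x, y))
```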

Slide 23: Example – Details
- Each task writes "buffer" files locally, one for each of the tasks in the other rank.
- The two ranks of tasks are run on different racks of nodes, to minimize the probability that tasks in both ranks will fail at the same time.

Slide 24: Example – Details – (2)
- Periodically, each task writes its state (the tuples received so far) incrementally and lets the master controller replicate it.
- Problem: the master can't be too eager to pass output files along as input to the other rank, or the files become tiny.

Slide 25: Future Research
- There is work to be done on optimization, using map-reduce or similar facilities, for restricted SQL and for Datalog, Datalog with negation, and Datalog with aggregation.
  - Check out Hive and Pig, as well as work on multiway-join optimization.

Slide 26: Future Research – (2)
- Almost everything is open about recursive Datalog implementation under map-reduce or similar systems.
  - Seminaïve evaluation in the general case.
  - Architectures for managing failures. Clustera and Hyrax are interesting examples of (nonrecursive) extensions of map-reduce.
  - When can we avoid communication, as with p(X,Y) :- e(X,Z) & p(Z,Y)?

