Presentation transcript: "War of the Worlds -- Shared-memory vs. Distributed-memory"

1 War of the Worlds -- Shared-memory vs. Distributed-memory
In the distributed world, we have heavyweight processes (nodes) rather than threads
Nodes communicate by exchanging messages
–We do not have shared memory
Communication is much more expensive
–Sending a message takes much more time than sending data through a channel
–Possibly non-uniform communication
We only have 1-to-1 communication (no many-to-many channels)

2 Initial Distributed-memory Settings
We consider settings where there is no multithreading within a single MPI node
We consider systems where communication latency between different nodes is
–Low
–Uniform

3 Good Shared Memory Orbit Version
[Slide diagram: worker threads take chunks of points [x_1, …, x_m] from a shared task pool and apply the generators f_1–f_5 to them; hash server threads 1–3 each hold one part O_1, O_2, O_3 of the orbit hash table and feed newly discovered points back into the shared task pool.]

4 Why is this version hard to port to MPI?
Single task pool!
–Requires a shared structure to which all of the hash servers write data, and from which all of the workers read data
Not easy to implement using MPI, where we only have 1-to-1 communication
We could have a dedicated node that holds the task queue (see the sketch below)
–Workers send messages to it to request work
–Hash servers send messages to it to push work
–This would make that node a potential bottleneck, and would involve a lot of communication
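
To make the bottleneck concrete, here is a minimal sketch (not part of the original code) of what such a dedicated task-queue node would have to do, written against the OrbSendMessage/OrbGetMessage helpers shown on the later slides. The "push"/"getwork"/"nowork" message formats are assumptions made only for illustration.

TaskQueueNode := function()
  local pool, msg;
  pool := [];
  while true do
    msg := OrbGetMessage(true);              # blocking receive from any node
    if msg[1] = "push" then                  # a hash server pushes a chunk of tasks
      Add(pool, msg[2]);
    elif msg[1] = "getwork" then             # a worker (msg[2] = its id) asks for work
      if Length(pool) > 0 then
        OrbSendMessage(Remove(pool), msg[2]);
      else
        OrbSendMessage(["nowork"], msg[2]);
      fi;
    fi;
  od;
end;

Every task in the computation has to pass through this one loop twice (once when pushed, once when handed out), which is exactly the contention and communication volume warned about above.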

5 MPI Version 1
Maybe merge workers and hash servers?
Each MPI node acts both as a hash server and as a worker
Each node has its own task pool
If the task pool of a node is empty, the node tries to steal work from some other node (see the sketch below)
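
A rough sketch of what one node's main loop could look like under this design, using the OrbSendMessage/OrbGetMessage helpers defined on the later slides. This is only an illustration: the tagged message formats ("points", "steal"), the RandomOtherNode helper, and the assumption that f maps a point directly to the rank of the node owning its hash slice are all made up for the sketch.

NodeLoop := function(gens, op, f)
  local pool, table, msg, x, g, y, k;
  pool := [];  table := [];
  while true do
    msg := OrbGetMessage(false);                 # non-blocking poll
    if msg <> fail then
      if msg[1] = "points" then                  # points hashed to this node
        for x in msg[2] do
          if not x in table then
            AddSet(table, x);                    # record in this node's hash slice
            Add(pool, x);                        # and schedule it as work
          fi;
        od;
      elif msg[1] = "steal" then                 # another node ran out of work
        k := QuoInt(Length(pool), 2);
        OrbSendMessage(["points", pool{[1..k]}], msg[2]);
        pool := pool{[k+1..Length(pool)]};
      fi;
    fi;
    if Length(pool) > 0 then
      x := Remove(pool);
      for g in gens do                           # this node also acts as a worker
        y := op(x, g);
        OrbSendMessage(["points", [y]], f(y));   # route y to the node owning its hash
      od;
    else
      # idle: try to steal (a real implementation would back off and wait for a reply)
      OrbSendMessage(["steal", processId], RandomOtherNode());
    fi;
  od;
end;

Because the same loop has to interleave hash lookups, generator applications, and steal requests, each activity delays the others, which is the contention described on the next slide.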

6 MPI Version 1
[Slide diagram: three MPI nodes, each holding a part of the orbit hash table, its own task pool of points [x_i1, …, x_im], and its own copies of the generators f_1–f_5, so every node both hashes points and applies generators.]

7 MPI Version 1 is Bad!
Bad performance, especially for a smaller number of nodes
The same process does hash table lookups and applies generator functions to points
–It cannot do both at the same time => something has to wait
–This creates contention

8 MPI Version 2
Separate hash servers and workers, after all
Hash server nodes
–Keep parts of the hash table
–Also keep parts of the task pool
Worker nodes just apply generators to points
Workers obtain work from hash server nodes using work stealing (see the sketch below)
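
A minimal sketch of a hash server node's loop under this design (an illustration, not the actual ~400-line MPIGAP code shown later). It assumes the same message shapes as the GetWork function on a later slide: workers send either a plain list of points or a ["getwork", processId] request; forwarding unsatisfied requests to other hash servers (the work stealing) and the termination test are left out.

HashServer := function()
  local table, pool, msg, new, x;
  table := [];  pool := [];
  while true do
    msg := OrbGetMessage(true);                  # blocking receive
    if IsString(msg[1]) and msg[1] = "getwork" then
      # a worker (msg[2] = its processId) asks for a chunk of work
      if Length(pool) > 0 then
        OrbSendMessage(Remove(pool), msg[2]);
      else
        # here the request would be forwarded to another hash server,
        # or answered with "finish" once termination has been detected
        OrbSendMessage(["finish"], msg[2]);
      fi;
    else
      # otherwise msg is a list of points a worker hashed to this server:
      # keep only the points not seen before and queue them as one task chunk
      new := Filtered(msg, x -> not x in table);
      for x in new do AddSet(table, x); od;
      if Length(new) > 0 then Add(pool, new); fi;
    fi;
  od;
end;

The important property of the design is visible even in the sketch: this node only ever does hash-table lookups and bookkeeping, while the application of generators stays entirely on the worker nodes.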

9 MPI Version 2
[Slide diagram: worker nodes apply the generators f_1–f_5 to chunks of points [x_i1, …, x_im]; hash server nodes each hold one part O_1, O_2, O_3 of the orbit hash table together with a task pool T_1, T_2, T_3 from which the workers obtain work.]

10 MPI Version 2
Much better performance than MPI Version 1 (on low-latency systems)
The key is separating hash table lookups and the application of generators to points into different nodes

11 Big Issue with MPI Versions 1 and 2 -- Detecting Termination!
We need to detect the situation where all of the hash server nodes have empty task pools, and where no new work will be produced by the hash servers!
–Even detecting that all task pools are empty and all hash servers and all workers are idle is not enough, as there may be messages still in flight that will create more work!
–Woe unto me! What are we to do?
Good ol' Dijkstra comes to the rescue - we use a variant of the Dijkstra-Scholten termination detection algorithm

12 Termination Detection Algorithm
Each hash server keeps two counters
–Number of points sent (my_nr_points_sent)
–Number of points received (my_nr_points_rcvd)
We enumerate the hash servers H_0 … H_n
Hash server H_0, when idle, sends a token to hash server H_1
–It attaches a token count (my_nr_points_sent, my_nr_points_rcvd) to the token
When a hash server H_i receives the token
–If it is active (has tasks in its task pool), it sends the token back to H_0
–If it is idle, it adds its own counters to the count attached to the token and sends the token to H_{i+1}
–If the received token count was (pts_sent, pts_rcvd), the new token count is (my_nr_points_sent + pts_sent, my_nr_points_rcvd + pts_rcvd)
If H_0 receives the token back, and the token count (pts_sent, pts_rcvd) satisfies pts_rcvd = num_gens * pts_sent, then termination is detected (see the sketch below)
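
As a concrete illustration, here is a minimal sketch of that token round in GAP. The SendTokenTo helper, the state record (with fields sent, rcvd and active for this server's counters and status), and the use of fail to mark an aborted round are assumptions made for the sketch; servers are indexed 0 .. nrHashes-1 as in the code on the following slides.

StartRound := function(state)               # run by H_0 whenever it becomes idle
  SendTokenTo(1, [state.sent, state.rcvd]); # token count starts with H_0's counters
end;

HandleToken := function(i, count, state)    # run by H_i when the token arrives
  if state.active then
    SendTokenTo(0, fail);                   # active: abort this round
  else
    # idle: add own counters to the token count and pass it on;
    # the last server (i = nrHashes-1) sends the completed token back to H_0
    SendTokenTo((i + 1) mod nrHashes,
                [count[1] + state.sent, count[2] + state.rcvd]);
  fi;
end;

TerminationDetected := function(count)      # run by H_0 on the returned token
  # every point handed out must have come back once per generator
  return count <> fail and count[2] = num_gens * count[1];
end;

If the round was aborted, or the counts do not yet match, H_0 simply starts another round the next time it is idle.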

13 MPIGAP Code for MPI Version 2
Not trivial (~400 lines of GAP code)
Explicit message passing using low-level MPI bindings
–This version is hard to implement using the task abstraction

14 MPIGAP Code for MPI Version 2
Worker := function(gens,op,f)
  local g,j,n,m,res,t,x;
  n := nrHashes;
  while true do
    # ask a hash server for a chunk of points; stop on "finish"
    t := GetWork();
    if IsIdenticalObj(t, fail) then return; fi;
    # one result buffer per hash server, preallocated
    m := QuoInt(Length(t)*Length(gens)*2, n);
    res := List([1..n], x -> EmptyPlist(m));
    # apply every generator to every point and bucket the images
    # by the hash server that owns them
    for j in [1..Length(t)] do
      for g in gens do
        x := op(t[j], g);
        Add(res[f(x)], x);
      od;
    od;
    # send each non-empty bucket to its hash server
    for j in [1..n] do
      if Length(res[j]) > 0 then
        OrbSendMessage(res[j], minHashId+j-1);
      fi;
    od;
  od;
end;

15 MPIGAP Code for MPI Version 2
GetWork := function()
  local msg, tid;
  # request work from a hash server (here always the first one)
  tid := minHashId;
  OrbSendMessage(["getwork", processId], tid);
  # block until the server answers with either a chunk of points
  # or the "finish" message produced by termination detection
  msg := OrbGetMessage(true);
  if msg[1] <> "finish" then
    return msg;
  else
    return fail;
  fi;
end;

16 MPIGAP Code for MPI Version 2
OrbGetMessage := function(blocking)
  local test, msg, tmp;
  # check for an incoming message, blocking or not
  if blocking then
    test := MPI_Probe();
  else
    test := MPI_Iprobe();
  fi;
  if test then
    # allocate a buffer of the right size, receive and deserialize
    msg := UNIX_MakeString(MPI_Get_count());
    MPI_Recv(msg);
    tmp := DeserializeNativeString(msg);
    return tmp;
  else
    return fail;
  fi;
end;

OrbSendMessage := function(raw,dest)
  local msg;
  # serialize the GAP object and send it to node dest
  msg := SerializeToNativeString(raw);
  MPI_Binsend(msg, dest, Length(msg));
end;

17 Work in Progress - Extending MPI Version 2 To Systems With Non-Uniform Latency
Communication latencies between nodes might be different
Where to place the hash server nodes? And how many?
How to do work distribution?
–Is work stealing still a good idea in a setting where the communication distance between a worker and different hash servers is not uniform?
We can look at the shared memory + MPI world as a special case of this
–Multithreading within MPI nodes
–Threads on the same node can communicate fast
–Nodes communicate much more slowly

