Tuesday, October 03, 2006
"If I have seen further, it is by standing on the shoulders of giants." - Isaac Newton
Addition example: the value of the sum is to be transmitted to all nodes.
Consider replicating computation as an option.
Collective Communication
§ Global interaction operations
§ Building blocks of many parallel algorithms
§ Proper implementation is necessary for performance
§ Algorithms for rings can be extended to meshes.
§ Parallel algorithms using regular data structures map naturally to meshes.
§ Many algorithms with recursive interactions map naturally onto the hypercube topology.
§ This model is practical for the interconnection networks used in modern computers: the time to transfer data between two nodes can be treated as independent of their relative location in the network, because of the routing techniques employed.
§ ts : startup time
– Time to prepare the message: adding headers, trailers, error-correction information, etc.
§ tw : per-word transfer time
– tw = 1/r, where r is the bandwidth in words per second.
§ Transferring m words between any pair of nodes in the interconnection network incurs a cost of ts + m·tw.
§ Assumptions:
– Links are bidirectional.
– A node can send a message on only one of its links at a time.
– A node can receive a message on only one of its links at a time.
– A node may send to and receive from the same or different links.
– The effect of congestion is not included in the total transfer time.
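As a toy illustration of this cost model, the transfer time can be computed directly; the constants below are made-up assumptions for illustration, not measurements from the course.

    #include <stdio.h>

    /* Toy illustration of the ts + m*tw cost model.
       The constants are assumed values, for illustration only. */
    int main(void) {
        double ts = 50e-6;   /* startup time: 50 microseconds (assumed) */
        double tw = 1e-8;    /* per-word time: 10 ns per word (assumed) */
        long   m  = 100000;  /* message size in words */

        double t = ts + m * tw;   /* cost of one point-to-point transfer */
        printf("transfer time = %g s\n", t);   /* prints 0.00105 s */
        return 0;
    }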
One-to-all broadcast
§ Dual: all-to-one reduction.
§ Each node has a buffer M containing m words. The data from all nodes are combined through an operator and accumulated at a single destination process into one buffer of size m.
One-to-all broadcast
Inefficient way:
§ Send p - 1 messages from the source to the other p - 1 nodes.
§ The source becomes a bottleneck.
§ Only the connection between a single pair of nodes is used at a time.
– Under-utilization of the communication network.
One-to-all broadcast: recursive doubling
§ In each step, every node that already holds the data sends it to one node that does not, so the number of nodes holding the data doubles with each step.
§ The broadcast completes in log p steps.
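A minimal sketch of recursive doubling with explicit point-to-point messages, assuming the source is node 0 and p is a power of two. In practice one would simply call MPI_Bcast; the function below only illustrates the pattern.

    #include <mpi.h>

    /* One-to-all broadcast by recursive doubling (sketch).
       Assumes root 0 and a power-of-two number of processes. */
    void broadcast_recursive_doubling(void *buf, int m, MPI_Datatype type,
                                      MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        /* In the step with mask = 2^i, nodes 0..mask-1 already hold the
           data and each sends it to the partner mask positions away. */
        for (int mask = 1; mask < p; mask <<= 1) {
            if (rank < mask)
                MPI_Send(buf, m, type, rank + mask, 0, comm);
            else if (rank < 2 * mask)
                MPI_Recv(buf, m, type, rank - mask, 0, comm,
                         MPI_STATUS_IGNORE);
        }
    }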
One-to-all broadcast: what if 0 had sent to 1, and then 0 and 1 had attempted to send to 2 and 3? (On a linear array, both messages would need the same link, so the two transfers could not proceed in parallel.)
All-to-one reduction
§ Reverse the direction and the sequence of communication of the one-to-all broadcast.
Matrix-vector multiplication
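One way these primitives combine in practice is a rowwise 1-D partitioned matrix-vector multiply. The sketch below is illustrative (the function name, the partitioning, and the use of MPI_Gather are assumptions, not the slides' exact program): broadcast the vector, compute the local rows, gather the result.

    #include <stdlib.h>
    #include <mpi.h>

    /* Sketch: y = A*x with A partitioned rowwise over p processes.
       Assumes n is divisible by p; A_local holds this process's n/p rows. */
    void matvec(const double *A_local, double *x, double *y,
                int n, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int rows = n / p;                       /* rows per process */
        double *y_local = malloc(rows * sizeof *y_local);

        /* one-to-all broadcast of the vector x from process 0 */
        MPI_Bcast(x, n, MPI_DOUBLE, 0, comm);

        /* local computation: each process multiplies its own rows */
        for (int i = 0; i < rows; i++) {
            y_local[i] = 0.0;
            for (int j = 0; j < n; j++)
                y_local[i] += A_local[i * n + j] * x[j];
        }

        /* collect the partial results into y on process 0 */
        MPI_Gather(y_local, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, comm);
        free(y_local);
    }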
One-to-all broadcast on a mesh
§ Regard each row and each column as a linear array.
§ First broadcast along the source's row; then every node in that row broadcasts along its own column.
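A sketch of this two-phase scheme on a q × q mesh using MPI subcommunicators; the use of MPI_Comm_split is an illustrative choice, since the slides describe only the pattern.

    #include <mpi.h>

    /* Sketch: one-to-all broadcast on a q x q mesh of p = q*q processes,
       with the source at mesh position (src_row, src_col).
       Phase 1: broadcast along the source's row.
       Phase 2: every column broadcasts from the node in the source's row. */
    void mesh_bcast(double *buf, int m, int q, int src_row, int src_col,
                    MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int my_row = rank / q, my_col = rank % q;

        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(comm, my_row, my_col, &row_comm);
        MPI_Comm_split(comm, my_col, my_row, &col_comm);

        if (my_row == src_row)                      /* phase 1: the row */
            MPI_Bcast(buf, m, MPI_DOUBLE, src_col, row_comm);
        MPI_Bcast(buf, m, MPI_DOUBLE, src_row, col_comm);  /* phase 2 */

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
    }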
One-to-all broadcast on a hypercube
§ A hypercube with 2^d nodes is a d-dimensional mesh with two nodes in each dimension.
§ The broadcast proceeds along one dimension at a time: d steps, i.e. log p steps.
All-to-all communication
§ All-to-all broadcast: each node broadcasts its m-word message to every other node. Could it be done as p separate one-to-all broadcasts?
All-to-all broadcast
§ The p broadcasts are pipelined: on a ring, each node first sends its own block to a neighbor, and in every subsequent step forwards the block it has just received.
§ All p blocks circulate at once, so the operation finishes in p - 1 steps (see the sketch below).
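A minimal sketch of the pipelined ring algorithm; the names are illustrative, and MPI provides this operation directly as MPI_Allgather.

    #include <string.h>
    #include <mpi.h>

    /* Sketch: all-to-all broadcast on a ring of p processes.
       Each process contributes m doubles in `mine`; after p-1 steps,
       `all` (of length p*m) holds every process's block in rank order. */
    void ring_allgather(const double *mine, double *all, int m,
                        MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int left  = (rank - 1 + p) % p;
        int right = (rank + 1) % p;

        memcpy(&all[rank * m], mine, m * sizeof *all);  /* own block */
        int cur = rank;                  /* block most recently obtained */
        for (int step = 1; step < p; step++) {
            int nxt = (cur - 1 + p) % p; /* block arriving from the left */
            /* forward the newest block right, receive from the left */
            MPI_Sendrecv(&all[cur * m], m, MPI_DOUBLE, right, 0,
                         &all[nxt * m], m, MPI_DOUBLE, left, 0,
                         comm, MPI_STATUS_IGNORE);
            cur = nxt;
        }
    }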
All-to-all reduction
All-to-all broadcast on a mesh
§ Two phases: first an all-to-all broadcast within each row, using the linear-array algorithm with messages of size m.
§ Then an all-to-all broadcast within each column, with each node forwarding the √p blocks collected from its row (messages of size √p · m).
All-to-all broadcast on a hypercube
§ In step i, every node exchanges all of the data it has accumulated so far with its neighbor along dimension i, so the message size doubles at each step.
§ For p = 8: after step 1, pairs hold {0,1}, {2,3}, {4,5}, {6,7}; after step 2, {0,1,2,3} and {4,5,6,7}; after step 3, every node holds {0,1,...,7}.
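A sketch of the hypercube exchange, assuming p is a power of two. The offset bookkeeping keeps blocks in rank order; MPI_Allgather is the library equivalent.

    #include <string.h>
    #include <mpi.h>

    /* Sketch: all-to-all broadcast on a hypercube of p = 2^d processes.
       Each process contributes m doubles in `mine`; on return, `all`
       (of length p*m) holds every process's block in rank order. */
    void hypercube_allgather(const double *mine, double *all, int m,
                             MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        memcpy(&all[rank * m], mine, m * sizeof *all);  /* own block */

        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;                 /* neighbor, dim i */
            int my_off  = (rank    & ~(mask - 1)) * m; /* blocks held */
            int pa_off  = (partner & ~(mask - 1)) * m; /* partner's blocks */
            /* exchange the accumulated chunks: message size doubles */
            MPI_Sendrecv(&all[my_off], mask * m, MPI_DOUBLE, partner, 0,
                         &all[pa_off], mask * m, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }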
All-reduce
§ Each node has a buffer of size m.
§ The final result is an identical buffer on each node, formed by combining the original p buffers.
– Equivalent to an all-to-one reduction followed by a one-to-all broadcast.
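In MPI this operation is available directly as MPI_Allreduce; a minimal usage example (the data values are illustrative):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank;   /* each node's contribution */
        double global;
        /* combine all contributions; result lands on every node */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        printf("rank %d: sum = %g\n", rank, global);
        MPI_Finalize();
        return 0;
    }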
Gather and Scatter
Scatter
§ One-to-all personalized communication: the source sends a distinct message of size m to every node.
§ Gather is the dual operation: a single node collects a distinct message from each node.
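A minimal MPI usage example for scatter; the block size and data values are illustrative assumptions.

    #include <stdlib.h>
    #include <stdio.h>
    #include <mpi.h>

    /* Scatter m words from process 0 to every process (sketch). */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int m = 4;                  /* words per process (assumed) */
        double *sendbuf = NULL;
        if (rank == 0) {                  /* only the root owns all data */
            sendbuf = malloc(p * m * sizeof *sendbuf);
            for (int i = 0; i < p * m; i++) sendbuf[i] = i;
        }

        double recvbuf[4];                /* each process gets one block */
        MPI_Scatter(sendbuf, m, MPI_DOUBLE, recvbuf, m, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);
        printf("rank %d got block starting at %g\n", rank, recvbuf[0]);

        if (rank == 0) free(sendbuf);
        MPI_Finalize();
        return 0;
    }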
Speedup
§ The ratio between execution time on a single processor and execution time on multiple processors.
§ Given a computational job that is distributed among N processors, does this result in an N-fold speedup? Only in a perfect world!
§ Every algorithm has a sequential component that must be done by a single processor; it is not diminished when the parallel part is split up.
§ There are also communication costs, idle time, replicated computation, etc.
Amdahl's Law
Let T(N) be the time required to complete the task on N processors. The speedup S(N) is the ratio

    S(N) = T(1) / T(N)

With a serial portion Ts and a parallelizable portion Tp, T(1) = Ts + Tp and T(N) ≥ Ts + Tp/N, so

    S(N) ≤ (Ts + Tp) / (Ts + Tp/N)

This is an optimistic estimate: it ignores the overhead incurred in parallelizing the code.
Amdahl's Law

    S(N) ≤ (Ts + Tp) / (Ts + Tp/N)

[Figure: speedup S(N) plotted against N for different values of the serial fraction]
Amdahl's Law
§ If the sequential component is 5 percent, the maximum speedup that can be achieved is (Ts + Tp)/Ts = 1/0.05 = 20, no matter how many processors are used.
§ Useful when sequential programs are parallelized incrementally.
– A sequential program can be profiled to identify computationally demanding components (hotspots).
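A quick numeric check of the bound, using the slide's 5 percent serial fraction and normalizing T(1) = 1:

    #include <stdio.h>

    /* Evaluate the Amdahl bound S(N) <= (Ts + Tp) / (Ts + Tp/N)
       for a 5% serial fraction: Ts = 0.05, Tp = 0.95. */
    int main(void) {
        double Ts = 0.05, Tp = 0.95;
        int Ns[] = {1, 10, 100, 1000, 1000000};
        for (int i = 0; i < 5; i++) {
            int N = Ns[i];
            double S = (Ts + Tp) / (Ts + Tp / N);
            printf("N = %7d  S(N) <= %.3f\n", N, S);  /* tends to 1/Ts = 20 */
        }
        return 0;
    }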