
1 Tuesday, October 03, 2006. "If I have seen further, it is by standing on the shoulders of giants." -Isaac Newton

2 Addition example: the value of the sum is to be transmitted to all nodes.

4 Consider replicating computation as an option.

7 Collective Communication §Global interaction operations §Building blocks of many parallel algorithms §Proper implementation is necessary.

8 §Algorithms for rings can be extended to meshes.

9 Parallel algorithms using regular data structures map naturally to meshes.

10 §Many algorithms with recursive interactions map naturally onto the hypercube topology. §Hypercube algorithms are practical for the interconnection networks used in modern computers: thanks to the routing techniques employed, the time to transfer data between two nodes is considered independent of the nodes' relative location in the network.

11 §ts: startup time. Time to prepare the message: adding headers, trailers, error-correction information, etc. §tw: per-word transfer time. tw = 1/r, where r is the bandwidth in words per second. §Transferring m words between any pair of nodes in the interconnection network incurs a cost of ts + m tw.
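
This cost model can be sketched as a small helper; the numeric parameters in the example are hypothetical, not taken from the slides:

```python
def transfer_time(m, t_s, t_w):
    """Time to transfer m words between any pair of nodes under the
    model on this slide: startup time t_s plus m words at t_w each."""
    return t_s + m * t_w

# Hypothetical parameters: 50 microsecond startup, bandwidth
# r = 10**6 words/s, so t_w = 1/r = 1 microsecond per word.
t = transfer_time(1000, t_s=50e-6, t_w=1e-6)
```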

12 §Assumptions: links are bidirectional; a node can send a message on only one of its links at a time; a node can receive a message on only one link at a time; a node can send to and receive from the same or different links; the effect of congestion is not shown in the total transfer time.

13 One-to-all broadcast. Dual: all-to-one reduction. Each node has a buffer M containing m words; data from all nodes are combined through an operator and accumulated at a single destination process into one buffer of size m.

14 One-to-all broadcast. One way: §Send p-1 messages from the source to the other p-1 nodes.

15 One-to-all broadcast. Inefficient way: §Send p-1 messages from the source to the other p-1 nodes. §The source becomes the bottleneck. §Only the connection between a single pair of nodes is used at a time, under-utilizing the communication network.
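
Under the ts + m tw cost model, the cost of this naive scheme can be compared with the recursive-doubling scheme on the following slides; a minimal sketch, assuming one message transfer per round:

```python
import math

def naive_broadcast_time(p, m, t_s, t_w):
    # The source sends p - 1 separate messages, one after another.
    return (p - 1) * (t_s + m * t_w)

def doubling_broadcast_time(p, m, t_s, t_w):
    # Recursive doubling: log2(p) rounds, one message transfer per round.
    return math.log2(p) * (t_s + m * t_w)
```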

16 One-to-all broadcast: recursive doubling. (diagram: ring of 8 nodes, 0-7; step 1)

17 One-to-all broadcast: recursive doubling. (diagram: step 2)

18 One-to-all broadcast: recursive doubling. (diagram: step 3)

19 One-to-all broadcast: recursive doubling. After step 3 every node has the message: log p steps in total.
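
The doubling pattern in these steps can be simulated directly; a sketch assuming p is a power of two and node 0 is the source by default:

```python
def recursive_doubling_broadcast(p, src=0):
    """Simulate one-to-all broadcast by recursive doubling on p nodes.
    In each step every node that already holds the message sends it to
    the node whose id differs in one bit, so the set of informed nodes
    doubles; the broadcast finishes in log2(p) steps."""
    have = {src}          # nodes that currently hold the message
    steps = 0
    offset = p // 2       # distance halves each step: p/2, p/4, ..., 1
    while offset >= 1:
        have |= {node ^ offset for node in have}
        steps += 1
        offset //= 2
    return have, steps
```

With p = 8 and source 0, the informed sets are {0}, then {0,4}, then {0,2,4,6}, and finally all eight nodes, matching the three steps shown above.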

20 One-to-all broadcast. What if 0 had sent to 1, and then 0 and 1 had attempted to send to 2 and 3? (diagram: ring of 8 nodes, 0-7)

21 All-to-one reduction §Reverse the direction and sequence of the communication used in one-to-all broadcast.

22 Matrix-vector multiplication

23 One-to-all broadcast on a mesh: regard each row and each column as a linear array.

24 One-to-all broadcast on a mesh

25 One-to-all broadcast on a hypercube: a hypercube is a d-dimensional mesh with two nodes per dimension, and the broadcast takes d steps.

26 One-to-all broadcast on a hypercube. (diagram: 3-D hypercube, nodes 0-7)

27 One-to-all broadcast on a hypercube. (diagram continued)

28 One-to-all broadcast on a hypercube. (diagram continued)

29 One-to-all broadcast. What if 0 had sent to 1, and then 0 and 1 had attempted to send to 2 and 3?

30 One-to-all broadcast

31 All-to-all communication

32 All-to-all communication. All-to-all broadcast: should it be implemented as p one-to-all broadcasts?

33 All-to-all broadcast. The broadcasts are pipelined.
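
One way to picture the pipelining on a ring (an illustrative simulation, not the slides' own code): every node forwards the message it received in the previous step to its right neighbor, so all p broadcasts are in flight at once and finish together after p - 1 steps.

```python
def all_to_all_broadcast_ring(p):
    """Simulate all-to-all broadcast on a p-node ring with pipelining.
    Each node starts with only its own message; in each of p - 1 steps
    every node sends the message it most recently received (initially
    its own) to its right neighbor, using all links simultaneously."""
    received = {i: {i} for i in range(p)}
    outgoing = {i: i for i in range(p)}   # message each node sends next
    for _ in range(p - 1):
        incoming = {(i + 1) % p: outgoing[i] for i in range(p)}
        for node, msg in incoming.items():
            received[node].add(msg)
        outgoing = incoming
    return received
```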

34 All-to-all broadcast

35 All-to-all reduction

36 All-to-all broadcast on a mesh

37 All-to-all broadcast on a mesh

38 All-to-all broadcast on a hypercube. (diagram: 3-D hypercube, nodes 0-7; each node starts with only its own message)

39 All-to-all broadcast on a hypercube. (after the exchange along the first dimension, nodes hold the pairs {0,1}, {2,3}, {4,5}, {6,7})

40 All-to-all broadcast on a hypercube. (after the second exchange, nodes hold {0,1,2,3} or {4,5,6,7})

41 All-to-all broadcast on a hypercube. (after the third exchange, every node holds {0,1,2,3,4,5,6,7})

42 All-reduce §Each node has a buffer of size m. §The final result is an identical buffer on each node, formed by combining the original p buffers. §Equivalent to an all-to-one reduction followed by a one-to-all broadcast.
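
The semantics of this slide can be sketched as follows (combining element-wise with + by default; this models the result of an all-to-one reduction followed by a one-to-all broadcast, not the communication itself):

```python
import operator
from functools import reduce

def all_reduce(buffers, op=operator.add):
    """Combine p equal-length buffers element-wise with op and leave an
    identical copy of the combined buffer on every node, as in an
    all-to-one reduction followed by a one-to-all broadcast."""
    combined = [reduce(op, words) for words in zip(*buffers)]
    return [list(combined) for _ in buffers]
```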

43 Gather and scatter

44 Scatter

45 §Speedup: the ratio between the execution time on a single processor and the execution time on multiple processors. §Given a computational job that is distributed among N processors: does it result in an N-fold speedup?

46 §Speedup: the ratio between the execution time on a single processor and the execution time on multiple processors. §Given a computational job that is distributed among N processors: does it result in an N-fold speedup? In a perfect world!

47 §Every algorithm has a sequential component that must be done by a single processor. This component is not diminished when the parallel part is split up. §There are also communication costs, idle time, replicated computation, etc.

48 Amdahl's Law §Let T(N) be the time required to complete the task on N processors. The speedup S(N) is the ratio S(N) = T(1) / T(N). §Split the work into a serial portion Ts and a parallelizable portion Tp.

49 Amdahl's Law §Let T(N) be the time required to complete the task on N processors. The speedup S(N) is the ratio S(N) = T(1) / T(N). §Split the work into a serial portion Ts and a parallelizable portion Tp. §Then S(N) ≤ (Ts + Tp) / (Ts + Tp/N): an optimistic estimate.

50 Amdahl's Law §Let T(N) be the time required to complete the task on N processors. The speedup S(N) is the ratio S(N) = T(1) / T(N). §Split the work into a serial portion Ts and a parallelizable portion Tp. §Then S(N) ≤ (Ts + Tp) / (Ts + Tp/N): an optimistic estimate. §It ignores the overhead incurred due to parallelizing the code.

51 Amdahl's Law: S(N) ≤ (Ts + Tp) / (Ts + Tp/N), with Ts + Tp = 1.

    N         Tp = 0.5    Tp = 0.9
    10        1.8         5.26
    100       1.98        9.17
    1000      1.99        9.91
    10000     1.99        9.91

52 Amdahl's Law §If the sequential component is 5 percent, then what is the maximum speedup that can be achieved?

53 Amdahl's Law §If the sequential component is 5 percent, then the maximum speedup that can be achieved is 20. §Useful when sequential programs are parallelized incrementally. §A sequential program can be profiled to identify its computationally demanding components (hotspots).
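
The law and the numbers in the table can be reproduced with a one-line function (normalizing Ts + Tp = 1, so the parallelizable fraction f plays the role of Tp):

```python
def amdahl_speedup(n, f):
    """Amdahl's law with Ts + Tp = 1: S(N) <= 1 / ((1 - f) + f / N),
    where f is the parallelizable fraction of the work."""
    return 1.0 / ((1.0 - f) + f / n)
```

With a 5 percent sequential component (f = 0.95), the speedup approaches 1/0.05 = 20 as N grows, the limit quoted on this slide.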

