Tuesday, October 03, 2006
"If I have seen further, it is by standing on the shoulders of giants." - Isaac Newton
Addition example: the value of the sum is to be transmitted to all nodes.
Consider replicating computation as an option.
Collective Communication
§ Global interaction operations
§ Building blocks of many parallel algorithms
§ Proper implementation is necessary for performance
§ Algorithms for rings can be extended to meshes.
§ Parallel algorithms using regular data structures map naturally to meshes.
§ Many algorithms with recursive interactions map naturally onto the hypercube topology.
§ This model is practical for the interconnection networks used in modern computers: the time to transfer data between two nodes can be treated as independent of their relative location in the network, because of the routing techniques employed.
§ ts : startup time
– Time to prepare the message: adding headers, trailers, error-correction information, etc.
§ tw : per-word transfer time
– tw = 1/r, where r is the bandwidth in words per second.
§ Transferring m words between any pair of nodes in the interconnection network incurs a cost of ts + m·tw.
§ Assumptions:
– Links are bidirectional.
– A node can send a message on only one of its links at a time.
– A node can receive a message on only one of its links at a time.
– A node may send to and receive from the same or different links.
– The effect of congestion is not included in the total transfer time.
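As a toy illustration of this cost model, the transfer time can be computed directly; the constants below are made-up assumptions for illustration, not measurements from the course.

    #include <stdio.h>

    /* Toy illustration of the ts + m*tw cost model.
       The constants are assumed values, for illustration only. */
    int main(void) {
        double ts = 50e-6;   /* startup time: 50 microseconds (assumed) */
        double tw = 1e-8;    /* per-word time: 10 ns per word (assumed) */
        long   m  = 100000;  /* message size in words */

        double t = ts + m * tw;   /* cost of one point-to-point transfer */
        printf("transfer time = %g s\n", t);   /* prints 0.00105 s */
        return 0;
    }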
One-to-all broadcast
§ Dual: all-to-one reduction.
§ Each node has a buffer M containing m words. The data from all nodes are combined through an operator and accumulated at a single destination process into one buffer of size m.
One-to-all broadcast
Inefficient way:
§ Send p - 1 messages from the source to the other p - 1 nodes.
§ The source becomes a bottleneck.
§ Only the connection between a single pair of nodes is used at a time.
– Under-utilization of the communication network.
One-to-all broadcast: recursive doubling
§ In each step, every node that already holds the data sends it to one node that does not, so the number of nodes holding the data doubles with each step.
§ The broadcast completes in log p steps.
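A minimal sketch of recursive doubling with explicit point-to-point messages, assuming the source is node 0 and p is a power of two. In practice one would simply call MPI_Bcast; the function below only illustrates the pattern.

    #include <mpi.h>

    /* One-to-all broadcast by recursive doubling (sketch).
       Assumes root 0 and a power-of-two number of processes. */
    void broadcast_recursive_doubling(void *buf, int m, MPI_Datatype type,
                                      MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        /* In the step with mask = 2^i, nodes 0..mask-1 already hold the
           data and each sends it to the partner mask positions away. */
        for (int mask = 1; mask < p; mask <<= 1) {
            if (rank < mask)
                MPI_Send(buf, m, type, rank + mask, 0, comm);
            else if (rank < 2 * mask)
                MPI_Recv(buf, m, type, rank - mask, 0, comm,
                         MPI_STATUS_IGNORE);
        }
    }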
One-to-all broadcast: what if 0 had sent to 1, and then 0 and 1 had attempted to send to 2 and 3? (On a linear array, both messages would need the same link, so the two transfers could not proceed in parallel.)
All-to-one reduction
§ Reverse the direction and the sequence of communication of the one-to-all broadcast.
Matrix-vector multiplication
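One way these primitives combine in practice is a rowwise 1-D partitioned matrix-vector multiply. The sketch below is illustrative (the function name, the partitioning, and the use of MPI_Gather are assumptions, not the slides' exact program): broadcast the vector, compute the local rows, gather the result.

    #include <stdlib.h>
    #include <mpi.h>

    /* Sketch: y = A*x with A partitioned rowwise over p processes.
       Assumes n is divisible by p; A_local holds this process's n/p rows. */
    void matvec(const double *A_local, double *x, double *y,
                int n, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int rows = n / p;                       /* rows per process */
        double *y_local = malloc(rows * sizeof *y_local);

        /* one-to-all broadcast of the vector x from process 0 */
        MPI_Bcast(x, n, MPI_DOUBLE, 0, comm);

        /* local computation: each process multiplies its own rows */
        for (int i = 0; i < rows; i++) {
            y_local[i] = 0.0;
            for (int j = 0; j < n; j++)
                y_local[i] += A_local[i * n + j] * x[j];
        }

        /* collect the partial results into y on process 0 */
        MPI_Gather(y_local, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, comm);
        free(y_local);
    }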
One-to-all broadcast on a mesh
§ Regard each row and each column as a linear array.
§ First broadcast along the source's row; then every node in that row broadcasts along its own column.
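A sketch of this two-phase scheme on a q × q mesh using MPI subcommunicators; the use of MPI_Comm_split is an illustrative choice, since the slides describe only the pattern.

    #include <mpi.h>

    /* Sketch: one-to-all broadcast on a q x q mesh of p = q*q processes,
       with the source at mesh position (src_row, src_col).
       Phase 1: broadcast along the source's row.
       Phase 2: every column broadcasts from the node in the source's row. */
    void mesh_bcast(double *buf, int m, int q, int src_row, int src_col,
                    MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int my_row = rank / q, my_col = rank % q;

        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(comm, my_row, my_col, &row_comm);
        MPI_Comm_split(comm, my_col, my_row, &col_comm);

        if (my_row == src_row)                      /* phase 1: the row */
            MPI_Bcast(buf, m, MPI_DOUBLE, src_col, row_comm);
        MPI_Bcast(buf, m, MPI_DOUBLE, src_row, col_comm);  /* phase 2 */

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
    }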
One-to-all broadcast on a hypercube
§ A hypercube with 2^d nodes is a d-dimensional mesh with two nodes in each dimension.
§ The broadcast proceeds along one dimension at a time: d steps, i.e. log p steps.
All-to-all communication
§ All-to-all broadcast: each node broadcasts its m-word message to every other node. Could it be done as p separate one-to-all broadcasts?
All-to-all broadcast
§ The p broadcasts are pipelined: on a ring, each node first sends its own block to a neighbor, and in every subsequent step forwards the block it has just received.
§ All p blocks circulate at once, so the operation finishes in p - 1 steps (see the sketch below).
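A minimal sketch of the pipelined ring algorithm; the names are illustrative, and MPI provides this operation directly as MPI_Allgather.

    #include <string.h>
    #include <mpi.h>

    /* Sketch: all-to-all broadcast on a ring of p processes.
       Each process contributes m doubles in `mine`; after p-1 steps,
       `all` (of length p*m) holds every process's block in rank order. */
    void ring_allgather(const double *mine, double *all, int m,
                        MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int left  = (rank - 1 + p) % p;
        int right = (rank + 1) % p;

        memcpy(&all[rank * m], mine, m * sizeof *all);  /* own block */
        int cur = rank;                  /* block most recently obtained */
        for (int step = 1; step < p; step++) {
            int nxt = (cur - 1 + p) % p; /* block arriving from the left */
            /* forward the newest block right, receive from the left */
            MPI_Sendrecv(&all[cur * m], m, MPI_DOUBLE, right, 0,
                         &all[nxt * m], m, MPI_DOUBLE, left, 0,
                         comm, MPI_STATUS_IGNORE);
            cur = nxt;
        }
    }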
All-to-all reduction
All-to-all broadcast on a mesh
§ Two phases: first an all-to-all broadcast within each row, using the linear-array algorithm with messages of size m.
§ Then an all-to-all broadcast within each column, with each node forwarding the √p blocks collected from its row (messages of size √p · m).
All-to-all broadcast on a hypercube
§ In step i, every node exchanges all of the data it has accumulated so far with its neighbor along dimension i, so the message size doubles at each step.
§ For p = 8: after step 1, pairs hold {0,1}, {2,3}, {4,5}, {6,7}; after step 2, {0,1,2,3} and {4,5,6,7}; after step 3, every node holds {0,1,...,7}.
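A sketch of the hypercube exchange, assuming p is a power of two. The offset bookkeeping keeps blocks in rank order; MPI_Allgather is the library equivalent.

    #include <string.h>
    #include <mpi.h>

    /* Sketch: all-to-all broadcast on a hypercube of p = 2^d processes.
       Each process contributes m doubles in `mine`; on return, `all`
       (of length p*m) holds every process's block in rank order. */
    void hypercube_allgather(const double *mine, double *all, int m,
                             MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        memcpy(&all[rank * m], mine, m * sizeof *all);  /* own block */

        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;                 /* neighbor, dim i */
            int my_off  = (rank    & ~(mask - 1)) * m; /* blocks held */
            int pa_off  = (partner & ~(mask - 1)) * m; /* partner's blocks */
            /* exchange the accumulated chunks: message size doubles */
            MPI_Sendrecv(&all[my_off], mask * m, MPI_DOUBLE, partner, 0,
                         &all[pa_off], mask * m, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }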
All-reduce
§ Each node has a buffer of size m.
§ The final result is an identical buffer on each node, formed by combining the original p buffers.
– Equivalent to an all-to-one reduction followed by a one-to-all broadcast.
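In MPI this operation is available directly as MPI_Allreduce; a minimal usage example (the data values are illustrative):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank;   /* each node's contribution */
        double global;
        /* combine all contributions; result lands on every node */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        printf("rank %d: sum = %g\n", rank, global);
        MPI_Finalize();
        return 0;
    }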
Gather and Scatter
Scatter
§ One-to-all personalized communication: the source sends a distinct message of size m to every node.
§ Gather is the dual operation: a single node collects a distinct message from each node.
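A minimal MPI usage example for scatter; the block size and data values are illustrative assumptions.

    #include <stdlib.h>
    #include <stdio.h>
    #include <mpi.h>

    /* Scatter m words from process 0 to every process (sketch). */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int m = 4;                  /* words per process (assumed) */
        double *sendbuf = NULL;
        if (rank == 0) {                  /* only the root owns all data */
            sendbuf = malloc(p * m * sizeof *sendbuf);
            for (int i = 0; i < p * m; i++) sendbuf[i] = i;
        }

        double recvbuf[4];                /* each process gets one block */
        MPI_Scatter(sendbuf, m, MPI_DOUBLE, recvbuf, m, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);
        printf("rank %d got block starting at %g\n", rank, recvbuf[0]);

        if (rank == 0) free(sendbuf);
        MPI_Finalize();
        return 0;
    }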
Speedup
§ The ratio between execution time on a single processor and execution time on multiple processors.
§ Given a computational job that is distributed among N processors, does this result in an N-fold speedup? Only in a perfect world!
§ Every algorithm has a sequential component that must be done by a single processor; it is not diminished when the parallel part is split up.
§ There are also communication costs, idle time, replicated computation, etc.
Amdahl's Law
Let T(N) be the time required to complete the task on N processors. The speedup S(N) is the ratio

    S(N) = T(1) / T(N)

With a serial portion Ts and a parallelizable portion Tp, T(1) = Ts + Tp and T(N) ≥ Ts + Tp/N, so

    S(N) ≤ (Ts + Tp) / (Ts + Tp/N)

This is an optimistic estimate: it ignores the overhead incurred in parallelizing the code.
Amdahl's Law

    S(N) ≤ (Ts + Tp) / (Ts + Tp/N)

[Figure: speedup S(N) plotted against N for different values of the serial fraction]
Amdahl's Law
§ If the sequential component is 5 percent, the maximum speedup that can be achieved is (Ts + Tp)/Ts = 1/0.05 = 20, no matter how many processors are used.
§ Useful when sequential programs are parallelized incrementally.
– A sequential program can be profiled to identify computationally demanding components (hotspots).
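A quick numeric check of the bound, using the slide's 5 percent serial fraction and normalizing T(1) = 1:

    #include <stdio.h>

    /* Evaluate the Amdahl bound S(N) <= (Ts + Tp) / (Ts + Tp/N)
       for a 5% serial fraction: Ts = 0.05, Tp = 0.95. */
    int main(void) {
        double Ts = 0.05, Tp = 0.95;
        int Ns[] = {1, 10, 100, 1000, 1000000};
        for (int i = 0; i < 5; i++) {
            int N = Ns[i];
            double S = (Ts + Tp) / (Ts + Tp / N);
            printf("N = %7d  S(N) <= %.3f\n", N, S);  /* tends to 1/Ts = 20 */
        }
        return 0;
    }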