CS 684: Basic Communication Operations

Algorithm Analysis Assumptions. We consider ring, mesh, and hypercube topologies. Each process can either send or receive a single message at a time. There is no special communication hardware. When discussing a mesh architecture we assume a square toroidal (wraparound) mesh. The startup latency is t_s and the per-word transfer time is t_w.
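For reference, a minimal restatement of the cost model implied by these assumptions: sending a message of m words to a neighboring node takes

T = t_s + t_w m

and an algorithm that relays the message over h successive links costs h(t_s + t_w m). The formulas on the following slides are all instances of this model.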

Basic Algorithms
Broadcast algorithms: one-to-all (scatter), all-to-one (gather), all-to-all.
Reduction: all-to-one.

Broadcast (ring). Distribute a message of size m from a source node to all nodes.

Broadcast (ring). Distribute a message of size m from a source node to all nodes. Start the message in both directions around the ring; the step numbers in the figure show the message reaching the farthest nodes after p/2 steps. T = (t_s + t_w m)(p/2)

Broadcast (mesh)

Broadcast (mesh) Broadcast to source row using ring algorithm

Broadcast (mesh) Broadcast to source row using ring algorithm Broadcast to the rest using ring algorithm from the source row

Broadcast (mesh). Broadcast to the source row using the ring algorithm. Broadcast to the rest using the ring algorithm from the source row. T = 2(t_s + t_w m)(√p / 2)
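Spelling out the count (a short derivation consistent with the formula above, assuming a √p x √p wraparound mesh): each phase is a ring broadcast among √p nodes, so

T = (t_s + t_w m)(√p / 2) + (t_s + t_w m)(√p / 2) = 2(t_s + t_w m)(√p / 2) = (t_s + t_w m)√p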

Broadcast (hypercube)

Broadcast (hypercube). A message is sent along each dimension of the hypercube in turn (the numbers in the figure label the step in which each link is used). Parallelism grows as a binary tree.

Broadcast (hypercube). A message is sent along each dimension of the hypercube in turn. Parallelism grows as a binary tree. T = (t_s + t_w m) log_2 p
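A minimal sketch of this dimension-by-dimension broadcast in C with MPI (not from the slides: it assumes p = 2^d processes whose ranks serve as hypercube labels, a source of rank 0, and the hypothetical function name hypercube_bcast; in practice MPI_Bcast performs this operation):

#include <mpi.h>

/* One-to-all broadcast of msg (m doubles) from rank 0 on a p = 2^d process
 * hypercube; every rank calls this with the same d.                        */
void hypercube_bcast(double *msg, int m, int d, MPI_Comm comm)
{
    int my_id;
    MPI_Comm_rank(comm, &my_id);
    int mask = (1 << d) - 1;                  /* all d bits set             */
    for (int i = d - 1; i >= 0; i--) {        /* most significant bit first */
        mask ^= (1 << i);                     /* clear bit i                */
        if ((my_id & mask) == 0) {            /* lower i bits are zero      */
            int partner = my_id ^ (1 << i);   /* flip bit i                 */
            if ((my_id & (1 << i)) == 0)
                MPI_Send(msg, m, MPI_DOUBLE, partner, 0, comm);
            else
                MPI_Recv(msg, m, MPI_DOUBLE, partner, 0, comm,
                         MPI_STATUS_IGNORE);
        }
    }
}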

Broadcast. The mesh algorithm was based on embedding rings in the mesh. Can we do better on the mesh? Can we embed a tree in a mesh? Exercise for the reader. (-: hint, hint ;-)

Other Broadcasts. Many algorithms for all-to-one and all-to-all communication are simply reversals and duals of the one-to-all broadcast. Examples: all-to-one (reverse the algorithm and concatenate); all-to-all (butterfly and concatenate).

Scatter Operation. Often called one-to-all personalized communication. Send a different message to each node. (In the figure, the source starts with messages 1 through 8; at each hypercube step every node that holds data passes half of it to a neighbor, e.g. 5,6,7,8, then 3,4 and 7,8, until every node holds exactly one message.)
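In MPI this is exactly what MPI_Scatter provides; a minimal usage sketch (the buffer names and the use of rank 0 as the source are illustrative):

#include <mpi.h>

/* Rank 0 holds p*m doubles in sendbuf; afterwards every rank, including 0,
 * holds its own m-element block in recvbuf.                                */
void scatter_blocks(double *sendbuf, double *recvbuf, int m, MPI_Comm comm)
{
    MPI_Scatter(sendbuf, m, MPI_DOUBLE, recvbuf, m, MPI_DOUBLE, 0, comm);
}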

Reduction Algorithms. Reduce or combine a set of values on each processor into a single set, e.g. summation or max/min. Many reduction algorithms simply use the all-to-one broadcast (reversed one-to-all) pattern, with the combining operation performed at each intermediate node.

Reduction. If the goal is to have only one processor with the answer, use the (reversed) broadcast algorithms. If all processors must know the result, use a butterfly; this reduces the algorithm from 2 log p steps (reduce, then broadcast the result) to log p steps.
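In MPI terms (a hedged mapping, not stated on the slide): MPI_Reduce is the all-to-one case, and MPI_Allreduce is the all-reduce that a butterfly or similar recursive-doubling scheme provides.

#include <mpi.h>

/* local holds m partial values on every rank; global receives the result. */
void reduce_examples(double *local, double *global, int m, MPI_Comm comm)
{
    /* All-to-one: only rank 0 ends up with the m element-wise sums.        */
    MPI_Reduce(local, global, m, MPI_DOUBLE, MPI_SUM, 0, comm);

    /* All-reduce: every rank ends up with the m element-wise sums.         */
    MPI_Allreduce(local, global, m, MPI_DOUBLE, MPI_SUM, comm);
}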

How'd they do that? Broadcast and reduction algorithms are based on the Gray code numbering of nodes. Consider a hypercube with nodes labeled 000 through 111 (0 through 7). Neighboring nodes differ in only one bit position.

How'd they do that? Start with the most significant bit: flip that bit and send to the resulting processor. Proceed with the next most significant bit, and continue until all bits have been used. For example, broadcasting from node 000 in an 8-node hypercube: step 1 sends 000 to 100; step 2 sends 000 to 010 and 100 to 110; step 3 reaches the remaining nodes 001, 011, 101, and 111.

Procedure SingleNodeAccum(d, my_id, m, X, sum)
  /* d = hypercube dimension, m = message length, X = local values, sum = accumulated result */
  for j = 0 to m-1 do sum[j] = X[j]
  mask = 0
  for i = 0 to d-1 do                 /* work up from the least significant bit */
    if ((my_id AND mask) == 0) then   /* only nodes whose lower i bits are 0 still participate */
      if ((my_id AND 2^i) != 0) then
        msg_dest = my_id XOR 2^i
        send(sum, msg_dest)
      else
        msg_src = my_id XOR 2^i
        recv(X, msg_src)              /* receive the partner's partial result ... */
        for j = 0 to m-1 do
          sum[j] = sum[j] + X[j]      /* ... and combine it into sum */
      endif
    endif
    mask = mask XOR 2^i               /* retire bit i */
  endfor
end

All-to-all Personalized Communication. What about when everybody needs to communicate something different to everybody else? Example: matrix transpose with row-wise partitioning. Issues: should everybody just scatter? Are there bottlenecks?

All-to-All Personalized Communication

All-to-All Personalized Communication: Example Consider the problem of transposing a matrix. Each processor contains one full row of the matrix. The transpose operation in this case is identical to an all-to-all personalized communication operation.
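As a concrete, hedged illustration of the correspondence: if n = p and process i holds row i of an n x n matrix, a single MPI_Alltoall with one element per destination leaves process i holding column i, i.e. row i of the transpose. Variable names here are illustrative.

#include <mpi.h>

/* row: the n doubles of this process's matrix row (n == number of processes).
 * col: on return, the n doubles of this process's column of the original
 *      matrix, i.e. one row of the transpose.                              */
void transpose_rowwise(double *row, double *col, MPI_Comm comm)
{
    /* Element j of my row belongs to process j after the transpose, and I
     * receive from each process j its element i, which lands in col[j].    */
    MPI_Alltoall(row, 1, MPI_DOUBLE, col, 1, MPI_DOUBLE, comm);
}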

All-to-All Personalized Communication: Example All-to-all personalized communication in transposing a 4 x 4 matrix using four processes.

All-to-All Personalized Communication on a Ring Each node sends all pieces of data as one consolidated message of size m(p – 1) to one of its neighbors. Each node extracts the information meant for it from the data received, and forwards the remaining (p – 2) pieces of size m each to the next node. The algorithm terminates in p – 1 steps. The size of the message reduces by m at each step.
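A hedged C/MPI sketch of this forwarding scheme (the process ranks form the logical ring; the buffer layout, bundle ordering, and function name are illustrative):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* sendbuf: p blocks of m doubles, block j destined for process j.
 * recvbuf: on return, block j holds the data sent to us by process j.      */
void alltoall_ring(double *sendbuf, double *recvbuf, int m, MPI_Comm comm)
{
    int id, p;
    MPI_Comm_rank(comm, &id);
    MPI_Comm_size(comm, &p);
    int right = (id + 1) % p, left = (id + p - 1) % p;

    /* Our own block needs no communication. */
    memcpy(&recvbuf[id * m], &sendbuf[id * m], m * sizeof(double));

    /* Outgoing bundle ordered by ring distance: slot k holds the block for
     * node (id + k + 1) mod p, so the nearest destination's data is first. */
    double *out = malloc((size_t)(p - 1) * m * sizeof(double));
    double *in  = malloc((size_t)(p - 1) * m * sizeof(double));
    for (int k = 0; k < p - 1; k++)
        memcpy(&out[k * m], &sendbuf[((id + k + 1) % p) * m],
               m * sizeof(double));

    for (int s = 1; s <= p - 1; s++) {
        int nblocks = p - s;              /* message shrinks by m each step */
        MPI_Sendrecv(out, nblocks * m, MPI_DOUBLE, right, 0,
                     in,  nblocks * m, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
        /* The first incoming block is addressed to us; it originated at
         * node (id - s) mod p.                                             */
        memcpy(&recvbuf[((id - s + p) % p) * m], in, m * sizeof(double));
        if (nblocks > 1)                  /* forward the rest next step     */
            memcpy(out, &in[m], (size_t)(nblocks - 1) * m * sizeof(double));
    }
    free(out);
    free(in);
}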

All-to-All Personalized Communication on a Ring. All-to-all personalized communication on a six-node ring. The label of each message is of the form {x,y}, where x is the label of the node that originally owned the message and y is the label of the node that is its final destination. A label ({x1,y1}, {x2,y2}, …, {xn,yn}) indicates a message formed by concatenating n individual messages.

All-to-All Personalized Communication on a Ring: Cost. We have p – 1 steps in all. In step i, the message size is m(p – i). The total time is given by T = (t_s + t_w m p/2)(p – 1). The t_w term in this equation can be reduced by a factor of 2 by communicating messages in both directions.
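Summing the per-step costs gives the closed form stated above:

T = Σ_{i=1..p-1} (t_s + t_w m(p - i)) = t_s(p - 1) + t_w m p(p - 1)/2 = (t_s + t_w m p/2)(p - 1)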

All-to-All Personalized Communication on a Mesh Each node first groups its p messages according to the columns of their destination nodes. All-to-all personalized communication is performed independently in each row with clustered messages of size m√p. Messages in each node are sorted again, this time according to the rows of their destination nodes. All-to-all personalized communication is performed independently in each column with clustered messages of size m√p.

All-to-All Personalized Communication on a Mesh The distribution of messages at the beginning of each phase of all-to-all personalized communication on a 3 x 3 mesh. At the end of the second phase, node i has messages ({0,i},…,{8,i}), where 0 ≤ i ≤ 8. The groups of nodes communicating together in each phase are enclosed in dotted boundaries.

All-to-All Personalized Communication on a Mesh: Cost. Time for the first phase is identical to that of a ring with √p processors, i.e., (t_s + t_w m p/2)(√p – 1). Time in the second phase is identical to the first. Therefore, the total time is twice this, i.e., T = (2t_s + t_w m p)(√p – 1). It can be shown that the time for the local rearrangement of messages is much less than this communication time.

All-to-All Personalized Communication on a Hypercube Generalize the mesh algorithm to log p steps. At any stage in all-to-all personalized communication, every node holds p packets of size m each. While communicating in a particular dimension, every node sends p/2 of these packets (consolidated as one message). A node must rearrange its messages locally before each of the log p communication steps.

All-to-All Personalized Communication on a Hypercube An all-to-all personalized communication algorithm on a three-dimensional hypercube.

All-to-All Personalized Communication on a Hypercube. We have log p iterations, and mp/2 words are communicated in each iteration. Therefore, the cost is T = (t_s + t_w m p/2) log p. This is not optimal!
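Why it is not optimal (a brief argument using the single-port assumption stated at the start of these slides, not shown on the slide itself): every node must receive m(p - 1) distinct words, so any algorithm needs at least t_w m(p - 1) for the bandwidth term, while the algorithm above spends t_w (m p/2) log p on it, roughly a factor of (log p)/2 more.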

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm Each node performs p – 1 communication steps, exchanging m words of data with a different node in every step. A node must choose its communication partner in each step so that the hypercube links do not suffer congestion. In the jth communication step, node i exchanges data with node (i XOR j). In this schedule, all paths in every communication step are congestion-free, and none of the bidirectional links carry more than one message in the same direction.
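For completeness (this is the standard cost of the schedule just described, not shown on the slide): p - 1 exchanges of m words each give

T = (t_s + t_w m)(p - 1)

which matches the t_w m(p - 1) bandwidth lower bound, at the price of a larger startup (t_s) term than the log p-step algorithm.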

Seven steps in all-to-all personalized communication on an eight-node hypercube.

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm. A procedure to perform all-to-all personalized communication on a d-dimensional hypercube. The message M_{i,j} initially resides on node i and is destined for node j.
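The procedure itself is not reproduced in this transcript; the following is a hedged reconstruction in C with MPI of the schedule just described (rank i exchanges with rank i XOR j in step j; the buffer layout and function name are illustrative, and p is assumed to be a power of two):

#include <mpi.h>
#include <string.h>

/* sendbuf: p blocks of m doubles; block j is M[my_id, j], the data this
 * process wants delivered to process j.
 * recvbuf: on return, block j holds M[j, my_id], the data sent to us by j. */
void alltoall_hypercube(double *sendbuf, double *recvbuf, int m, MPI_Comm comm)
{
    int my_id, p;
    MPI_Comm_rank(comm, &my_id);
    MPI_Comm_size(comm, &p);

    /* Our own block stays put. */
    memcpy(&recvbuf[my_id * m], &sendbuf[my_id * m], m * sizeof(double));

    for (int j = 1; j < p; j++) {       /* p - 1 congestion-free steps      */
        int partner = my_id ^ j;        /* exchange partner in step j       */
        MPI_Sendrecv(&sendbuf[partner * m], m, MPI_DOUBLE, partner, 0,
                     &recvbuf[partner * m], m, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}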

All-to-all personalized hypercube. (Figure: a step-by-step table for nodes 000 (0) through 111 (7), showing each node's exchange partner in each step of the i XOR j schedule.)

Basic Communication Algorithms. There are many different ways to implement these operations. The goal is to perform the communication in the least time, and the best choice depends on the architecture. Hypercube algorithms are often provably optimal, even on a fully connected architecture.