Computer Science 320 Broadcasting

Floyd’s Algorithm on SMP

for i = 0 to n - 1
    parallel for r = 0 to n - 1
        for c = 0 to n - 1
            d[r][c] = min(d[r][c], d[r][i] + d[i][c])
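As a concrete illustration, here is a minimal sketch of the SMP version using Parallel Java’s ParallelTeam/ParallelRegion/IntegerForLoop constructs. The class name FloydSmp and the static fields are inventions for this sketch; only the loop structure comes from the slide.

    import edu.rit.pj.IntegerForLoop;
    import edu.rit.pj.ParallelRegion;
    import edu.rit.pj.ParallelTeam;

    public class FloydSmp {
        // Shared distance matrix: all threads read row i, each updates its own rows.
        static double[][] d;
        static int n;

        static void floyd() throws Exception {
            new ParallelTeam().execute(new ParallelRegion() {
                public void run() throws Exception {
                    // Outer loop is sequential; every team thread steps through it together.
                    for (int ii = 0; ii < n; ++ii) {
                        final int i = ii;
                        // Work-sharing parallel loop over rows, with an implicit
                        // barrier when all threads finish their chunks.
                        execute(0, n - 1, new IntegerForLoop() {
                            public void run(int first, int last) {
                                double[] d_i = d[i];
                                for (int r = first; r <= last; ++r) {
                                    double[] d_r = d[r];
                                    for (int c = 0; c < n; ++c)
                                        d_r[c] = Math.min(d_r[c], d_r[i] + d_i[c]);
                                }
                            }
                        });
                    }
                }
            });
        }
    }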

Floyd’s Algorithm on a Cluster
- The root node reads the distance matrix from the input file and scatters row slices to the other nodes
- The other nodes compute distances and update their slices
- The slices are gathered back to the root node for output
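In Parallel Java that pattern might be sketched as below. The slice-buffer factory names (rowSliceBuffers, rowSliceBuffer) and the exact scatter/gather signatures are assumptions from memory of the PJ library, and the file I/O is omitted.

    import edu.rit.mp.DoubleBuf;
    import edu.rit.pj.Comm;
    import edu.rit.util.Range;

    // Assumed already set up: world, rank, size, n, and (on the root) the full matrix d.
    Range[] ranges = new Range(0, n - 1).subranges(size);

    // One row-slice buffer per process over the matrix storage.
    DoubleBuf[] slices = DoubleBuf.rowSliceBuffers(d, ranges);      // assumed factory
    DoubleBuf myslice  = DoubleBuf.rowSliceBuffer(d, ranges[rank]); // assumed factory

    world.scatter(0, slices, myslice);  // root distributes row slices
    // ... each process runs Floyd's inner loops on its slice ...
    world.gather(0, myslice, slices);   // root collects the updated slices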

Parallel I/O File Pattern
- Eliminate the gather of data by having each node write its slice to a separate output file
- Eliminate the scatter of data by having each node read its slice directly from the input file
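For example, each process can derive a private output file name from its rank. The name matrix_out_<rank>.dat is invented for this sketch, and d, n, mylb, myub, and rank are assumed from the surrounding program.

    import java.io.*;

    // Each process writes only its own rows, so no gather is needed.
    try (DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(
                new FileOutputStream("matrix_out_" + rank + ".dat")))) {
        for (int r = mylb; r <= myub; ++r)
            for (int c = 0; c < n; ++c)
                out.writeDouble(d[r][c]);
    }

Reading works the same way in reverse: each process skips to its slice’s offset in the shared input file and reads only its own rows.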

Execution Timeline

Sharing Data in Computation
On each pass through the outer loop, row i must be available to all of the processes, since they all execute the same line of code in the inner loop. On an SMP this works because the threads share the entire matrix. It does not work on a cluster, because the processes do not share memory.

for i = 0 to n - 1
    parallel for r = 0 to n - 1
        for c = 0 to n - 1
            d[r][c] = min(d[r][c], d[r][i] + d[i][c])

Share Row i via a Broadcast Message
On each pass through the outer loop, the process that owns row i broadcasts it before the parallel loop runs. The owner acts as the root of the broadcast and sets up the source buffer; the other processes set up a destination buffer. The broadcast also enforces synchronization: all processes wait for it to complete.

for i = 0 to n - 1
    broadcast row i of d
    parallel for r = 0 to n - 1
        for c = 0 to n - 1
            d[r][c] = min(d[r][c], d[r][i] + d[i][c])

// Allocate storage for a row broadcast from another process.
double[] row_i = new double[n];
DoubleBuf row_i_buf = DoubleBuf.buffer(row_i);

int i_root = 0;
for (int i = 0; i < n; ++i) {
    double[] d_i = d[i];

    // Determine which process owns row i.
    if (!ranges[i_root].contains(i)) ++i_root;

    // Broadcast row i from the owner process to all processes.
    if (rank == i_root) {
        world.broadcast(i_root, DoubleBuf.buffer(d_i));
    } else {
        world.broadcast(i_root, row_i_buf);
        d_i = row_i;
    }

    // Inner loops over the rows in my slice and over all columns.
    for (int r = mylb; r <= myub; ++r) {
        double[] d_r = d[r];
        for (int c = 0; c < n; ++c)
            d_r[c] = Math.min(d_r[c], d_r[i] + d_i[c]);
    }
}
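The snippet assumes several variables established earlier in the program. A plausible setup, offered as an assumption rather than the original code, would be:

    import edu.rit.pj.Comm;
    import edu.rit.util.Range;

    // Initialize PJ's message-passing layer and find this process's place in it.
    Comm.init(args);
    Comm world = Comm.world();
    int size = world.size();
    int rank = world.rank();

    // Partition the n rows into one contiguous slice per process.
    Range[] ranges = new Range(0, n - 1).subranges(size);
    int mylb = ranges[rank].lb();  // first row of my slice
    int myub = ranges[rank].ub();  // last row of my slice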

Problem: Too Many Messages
The time spent on communication is high relative to the time spent on computation: the program performs one broadcast per outer-loop iteration, n broadcasts in all.
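A rough, illustrative cost model shows why (the latency figure is an assumption, not from the slides). Each of the n broadcasts carries only one row of n doubles, i.e. 8n bytes. For n = 1000 that is 1000 messages of just 8 KB each; at, say, 50 microseconds of per-message latency, latency alone contributes about 50 ms, and on a cluster each broadcast itself takes about log2 p message hops to complete. A common remedy, not shown in this excerpt, is to combine several rows into fewer, larger messages so the fixed per-message cost is amortized.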