
L17: Introduction to “Irregular” Algorithms and MPI, cont. November 8, 2011

Administrative
-Class cancelled, Tuesday, November 15
-Guest lecture, Thursday, November 17: Ganesh Gopalakrishnan
-CUDA Project 4, due November 21
 -Available on CADE Linux machines (lab1 and lab3) and Windows machines (lab5 and lab6)
 -You can also use your own Nvidia GPUs

Outline
Introduction to irregular parallel computation
-Sparse matrix operations and graph algorithms
Finish MPI discussion
-Review blocking and non-blocking communication
-One-sided communication
Sources for this lecture:
-Kathy Yelick/Jim Demmel (UC Berkeley): CS 267, Spring 2007
-"Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors," Bell and Garland (Nvidia), SC09, November 2009

Motivation: Dense Array-Based Computation
Dense arrays and loop-based data-parallel computations have been the focus of this class so far.
Review: what have you learned about parallelizing such computations?
-Good source of data parallelism and balanced load
-Top500 is measured with dense linear algebra
 -"How fast is your computer?" = "How fast can you solve dense Ax=b?"
-Many domains of applicability, not just scientific computing
 -Graphics and games, knowledge discovery, social networks, biomedical imaging, signal processing
What about "irregular" computations?
-On sparse matrices (i.e., many elements are zero)?
-On graphs?
-Start with representations and some key concepts

Sparse Matrix or Graph Applications
Telephone network design
-Original application; algorithm due to Kernighan
Load balancing while minimizing communication
Sparse matrix times vector multiplication
-Solving PDEs: N = {1,…,n}, (j,k) in E if A(j,k) nonzero
-WN(j) = #nonzeros in row j, WE(j,k) = 1
VLSI layout
-N = {units on chip}, E = {wires}, WE(j,k) = wire length
Data mining and clustering
Analysis of social networks
Physical mapping of DNA

Dense Linear Algebra vs. Sparse Linear Algebra
Matrix-vector multiply:
  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      a[i] += c[j][i] * b[j];
What if n is very large, and some large percentage (say 90%) of c is zeros?
Should you represent all those zeros?
If not, how do you represent "c"?

Sparse Linear Algebra
Suppose you are applying matrix-vector multiply and the matrix has lots of zero elements.
-Computation cost? Space requirements?
General sparse matrix representation concepts:
-Primarily represent only the nonzero data values
-Auxiliary data structures describe the placement of the nonzeros in the "dense matrix"

Some common representations
DIA: Store elements along a set of diagonals, with an "offsets" array (e.g., offsets = [-2 0 1]) giving each diagonal's position relative to the main diagonal.
Compressed Sparse Row (CSR): Store only nonzero elements, with "ptr" pointing to the beginning of each row and "indices" giving each element's column.
ELL: Store a set of K elements per row and pad as needed. Best suited when the number of nonzeros is roughly consistent across rows.
COO: Store nonzero elements and their corresponding row/column "coordinates".
(The original slide shows a small example matrix A together with its data, indices, ptr, row, and offsets arrays in each format.)
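To make the formats concrete, here is a small illustrative example of our own (not the slide's original figure), showing one 4x4 matrix in CSR and COO form:

  /* Illustrative example matrix (our choice, not from the slides):
   *     [ 1 7 0 0 ]
   * A = [ 0 2 8 0 ]
   *     [ 5 0 3 9 ]
   *     [ 0 6 0 4 ]
   */
  /* CSR: ptr[i] .. ptr[i+1]-1 index the nonzeros of row i */
  int    csr_ptr[]     = { 0, 2, 4, 7, 9 };
  int    csr_indices[] = { 0, 1,  1, 2,  0, 2, 3,  1, 3 };
  double csr_data[]    = { 1, 7,  2, 8,  5, 3, 9,  6, 4 };

  /* COO: one (row, column, value) triple per nonzero */
  int    coo_row[]     = { 0, 0,  1, 1,  2, 2, 2,  3, 3 };
  int    coo_col[]     = { 0, 1,  1, 2,  0, 2, 3,  1, 3 };
  double coo_data[]    = { 1, 7,  2, 8,  5, 3, 9,  6, 4 };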

Connect to dense linear algebra
Equivalent CSR matvec (assuming ptr has nr+1 entries, with ptr[i] the index of row i's first nonzero):
  for (i=0; i<nr; i++) {
    for (j=ptr[i]; j<ptr[i+1]; j++)
      t[i] += data[j] * b[indices[j]];
  }
Dense matvec from L15:
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      a[i] += c[j][i] * b[j];
    }
  }
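A minimal, self-contained sketch of CSR sparse matrix-vector multiply in C, using the illustrative arrays above (variable names and the test harness are ours, not the course's code):

  #include <stdio.h>

  /* CSR sparse matrix-vector multiply: t = A * b */
  void csr_matvec(int nr, const int *ptr, const int *indices,
                  const double *data, const double *b, double *t)
  {
      for (int i = 0; i < nr; i++) {
          t[i] = 0.0;
          /* Walk only the nonzeros of row i */
          for (int j = ptr[i]; j < ptr[i + 1]; j++)
              t[i] += data[j] * b[indices[j]];
      }
  }

  int main(void)
  {
      /* Same 4x4 example matrix as above */
      int    ptr[]     = { 0, 2, 4, 7, 9 };
      int    indices[] = { 0, 1, 1, 2, 0, 2, 3, 1, 3 };
      double data[]    = { 1, 7, 2, 8, 5, 3, 9, 6, 4 };
      double b[]       = { 1, 1, 1, 1 };
      double t[4];

      csr_matvec(4, ptr, indices, data, b, t);
      for (int i = 0; i < 4; i++)
          printf("t[%d] = %g\n", i, t[i]);   /* expect 8, 10, 17, 10 */
      return 0;
  }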

Other Representation Examples
Blocked CSR
-Represent nonzeros as a set of blocks, usually of fixed size
-Within each block, treat as dense and pad the block with zeros
-Each block looks like a standard matvec, so it performs well for blocks of decent size
Hybrid ELL and COO
-Find a "K" value that works for most of the matrix
-Use COO for rows with more nonzeros (or even significantly fewer)
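As a rough sketch of the ELL portion, here is an ELL-format matvec, assuming K padded entries per row and a padding marker of our own choosing (names and conventions are ours, not the course's):

  /* ELL matvec sketch: each row stores exactly K (column, value) pairs,
   * padded with column index -1 where the row has fewer than K nonzeros.
   * Arrays are stored row-major here; GPU implementations typically use
   * column-major storage so adjacent threads touch adjacent elements. */
  void ell_matvec(int nr, int K, const int *indices, const double *data,
                  const double *b, double *t)
  {
      for (int i = 0; i < nr; i++) {
          double sum = 0.0;
          for (int k = 0; k < K; k++) {
              int col = indices[i * K + k];
              if (col >= 0)                 /* skip padding entries */
                  sum += data[i * K + k] * b[col];
          }
          t[i] = sum;
      }
  }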

Today’s MPI Focus – Communication Primitives
Collective communication
-Reductions, Broadcast, Scatter, Gather
Blocking communication
-Overhead
-Deadlock?
Non-blocking communication
One-sided communication

Quick MPI Review
Six most common MPI commands (aka "Six-Command MPI"):
-MPI_Init
-MPI_Finalize
-MPI_Comm_size
-MPI_Comm_rank
-MPI_Send
-MPI_Recv
Send and Receive refer to "point-to-point" communication.
Last time we also showed Broadcast and Reduce (collective) communication.
A minimal example using the six commands is sketched below.
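A minimal sketch using only the six commands above plus one point-to-point exchange (this is our illustration, not code from the lecture):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0 && size > 1) {
          value = 42;
          /* Point-to-point: rank 0 sends one int to rank 1 with tag 0 */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }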

More difficult p2p example: 2D relaxation
Replaces each interior value by the average of its four nearest neighbors.
Sequential code:
  for (i=1; i<n-1; i++)
    for (j=1; j<n-1; j++)
      b[i][j] = (a[i-1][j] + a[i][j-1] +
                 a[i+1][j] + a[i][j+1]) / 4.0;
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

MPI code, main loop of 2D SOR computation Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

MPI code, main loop of 2D SOR computation, cont. Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

MPI code, main loop of 2D SOR computation, cont. Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
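The code on these three slides appears as images in the original (from Lin and Snyder's Principles of Parallel Programming, Pearson). As a substitute, here is a rough sketch of one relaxation sweep with ghost-row exchange, assuming a 1D row-block decomposition; the names, decomposition, and boundary handling are our assumptions, not the textbook's code:

  #include <mpi.h>
  #define N 1024   /* global problem width (assumed) */

  /* Each process owns rows 1..local_n of a (local_n+2) x N array, with
   * ghost rows 0 and local_n+1 holding copies of neighbors' boundary rows. */
  void exchange_and_relax(double a[][N], double b[][N], int local_n,
                          int rank, int size)
  {
      MPI_Status status;
      int above = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
      int below = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

      /* Send my top interior row up; receive my bottom ghost row from below */
      MPI_Sendrecv(a[1],           N, MPI_DOUBLE, above, 0,
                   a[local_n + 1], N, MPI_DOUBLE, below, 0,
                   MPI_COMM_WORLD, &status);

      /* Send my bottom interior row down; receive my top ghost row from above */
      MPI_Sendrecv(a[local_n],     N, MPI_DOUBLE, below, 1,
                   a[0],           N, MPI_DOUBLE, above, 1,
                   MPI_COMM_WORLD, &status);

      /* Local relaxation on interior points */
      for (int i = 1; i <= local_n; i++)
          for (int j = 1; j < N - 1; j++)
              b[i][j] = (a[i-1][j] + a[i][j-1] + a[i+1][j] + a[i][j+1]) / 4.0;
  }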

Broadcast: Collective communication within a group Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
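The slide's figure illustrates a broadcast within a group; a minimal sketch of the corresponding call (our illustration, not code from the slides) is:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, data[4] = { 0, 0, 0, 0 };
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {                       /* the root fills the buffer */
          for (int i = 0; i < 4; i++) data[i] = i + 1;
      }
      /* Every rank in the communicator ("group") ends up with the root's copy */
      MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);
      printf("rank %d: data[3] = %d\n", rank, data[3]);

      MPI_Finalize();
      return 0;
  }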

MPI_Scatter() Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

Distribute Data from input using a scatter operation Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
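These two slides show scatter as a figure; a hedged sketch of distributing an input array with MPI_Scatter (our illustration, including the chunk size of two doubles per rank):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      double *input = NULL, local[2];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {                       /* only the root holds the full input */
          input = malloc(size * 2 * sizeof(double));
          for (int i = 0; i < size * 2; i++) input[i] = (double)i;
      }
      /* Each rank (including the root) receives its own contiguous chunk */
      MPI_Scatter(input, 2, MPI_DOUBLE, local, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      printf("rank %d got %.1f %.1f\n", rank, local[0], local[1]);

      if (rank == 0) free(input);
      MPI_Finalize();
      return 0;
  }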

Other Basic Features of MPI
MPI_Gather
-Analogous to MPI_Scatter, but collects data onto one process
Scans and reductions (reduction shown last time)
Groups, communicators, tags
-Mechanisms for identifying which processes participate in a communication
MPI_Bcast
-Broadcast to all other processes in a "group"
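Hedged sketches of MPI_Gather and MPI_Reduce, complementary to the scatter example above (our illustration, not lecture code):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      double local = (double)rank;           /* each rank contributes one value */
      double *all = NULL, sum = 0.0;
      if (rank == 0)
          all = malloc(size * sizeof(double));

      /* Gather: inverse of scatter; rank 0 receives one value from every rank */
      MPI_Gather(&local, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      /* Reduce: combine one value per rank with MPI_SUM; result lands on rank 0 */
      MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0) {
          printf("gathered all[%d] = %.1f, sum = %.1f\n", size - 1, all[size - 1], sum);
          free(all);
      }
      MPI_Finalize();
      return 0;
  }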

The Path of a Message
A blocking send visits 4 address spaces (the sending and receiving processes' user and system buffers).
Besides being time-consuming, it locks the two processors together quite tightly.
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
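The outline mentions non-blocking communication as one way to relax this tight coupling; a hedged sketch of the MPI_Isend/MPI_Irecv pattern (our illustration, not from the slides):

  #include <mpi.h>

  /* Non-blocking exchange sketch: post the receive and send, overlap local
   * work, then wait. Assumes neighbor, buffers, and count are set up by the
   * caller; the buffers must not be touched until MPI_Waitall returns. */
  void exchange_nonblocking(int neighbor, double *sendbuf,
                            double *recvbuf, int count)
  {
      MPI_Request reqs[2];

      MPI_Irecv(recvbuf, count, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Isend(sendbuf, count, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

      /* ... do computation that does not touch sendbuf or recvbuf ... */

      /* Both requests must complete before the buffers are reused */
      MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  }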