CS267 L12 Sources of Parallelism(3).1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 12: Sources of Parallelism and Locality (Part 3)

Slides:

Advertisements

Similar presentations

Lecture 19: Parallel Algorithms

Advertisements

CSCI-455/552 Introduction to High Performance Computing Lecture 25.

Lecture 3: Parallel Algorithm Design

03/23/07CS267 Lecture 201 CS 267: Multigrid on Structured Grids Kathy Yelick

SOLVING THE DISCRETE POISSON EQUATION USING MULTIGRID ROY SROR ELIRAN COHEN.

Numerical Algorithms ITCS 4/5145 Parallel Computing UNC-Charlotte, B. Wilkinson, 2009.

Parallelizing stencil computations Based on slides from David Culler, Jim Demmel, Bob Lucas, Horst Simon, Kathy Yelick, et al., UCB CS267.

1 02/11/2009CS267 Lecture 7+ CS 267 Tricks with Trees James Demmel

1 Parallel Algorithms II Topics: matrix and graph algorithms.

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.

1.5 Elementary Matrices and a Method for Finding

Numerical Algorithms Matrix multiplication

CS 240A: Parallel Prefix Algorithms or Tricks with Trees

Numerical Algorithms • Matrix multiplication

Parallel Prefix Computation Advanced Algorithms & Data Structures Lecture Theme 14 Prof. Dr. Th. Ottmann Summer Semester 2006.

CSE621/JKim Lec4.1 9/20/99 CSE621 Parallel Algorithms Lecture 4 Matrix Operation September 20, 1999.

CS267 L24 Solving PDEs.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 24: Solving Linear Systems arising from PDEs - I James Demmel.

Lecture 21: Parallel Algorithms

CS267 Poisson 2.1 Demmel Fall 2002 CS 267 Applications of Parallel Computers Solving Linear Systems arising from PDEs - II James Demmel

02/09/2005CS267 Lecture 71 CS 267 Sources of Parallelism and Locality in Simulation – Part 2 James Demmel

VLSI Arithmetic. Multiplication A = a n-1 a n-2 … a 1 a 0 B = b n-1 b n-2 … b 1 b 0  eg)  Shift.

Avoiding Communication in Sparse Iterative Solvers Erin Carson Nick Knight CS294, Fall 2011.

CS267 L12 Sources of Parallelism(3).1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 12: Sources of Parallelism and Locality (Part 3)

03/09/06CS267 Lecture 16 CS 267: Applications of Parallel Computers Solving Linear Systems arising from PDEs - II James Demmel

1 CS 267 Tricks with Trees James Demmel

CS267 L24 Solving PDEs.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 24: Solving Linear Systems arising from PDEs - I James Demmel.

1 02/09/05CS267 Lecture 7 CS 267 Tricks with Trees James Demmel

CS267 L11 Sources of Parallelism(2).1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 11: Sources of Parallelism and Locality (Part 2)

Advanced Topics in Algorithms and Data Structures Page 1 An overview of lecture 3 A simple parallel algorithm for computing parallel prefix. A parallel.

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.

CS267 L25 Solving PDEs II.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 25: Solving Linear Systems arising from PDEs - II James Demmel.

CS240A: Conjugate Gradients and the Model Problem.

CS267 L24 Solving PDEs.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 24: Solving Linear Systems arising from PDEs - I James Demmel.

1 Sources of Parallelism and Locality in Simulation.

Introduction to CMOS VLSI Design Lecture 11: Adders David Harris Harvey Mudd College Spring 2004.

Conjugate gradients, sparse matrix-vector multiplication, graphs, and meshes Thanks to Aydin Buluc, Umit Catalyurek, Alan Edelman, and Kathy Yelick for.

Exercise problems for students taking the Programming Parallel Computers course. Janusz Kowalik Piotr Arlukowicz Tadeusz Puzniakowski Informatics Institute.

Systems of Linear Equation and Matrices

02/14/2007CS267 Lecture 91 CS 267 Sources of Parallelism and Locality in Simulation – Part 2 Kathy Yelick

CS 240A: Parallel Prefix Algorithms or Tricks with Trees Some slides from Jim Demmel, Kathy Yelick, Alan Edelman, and a cast of thousands …

Computation on meshes, sparse matrices, and graphs Some slides are from David Culler, Jim Demmel, Bob Lucas, Horst Simon, Kathy Yelick, et al., UCB CS267.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M

Great Theoretical Ideas in Computer Science.

02/07/2006CS267 Lecture 71 CS 267 Sources of Parallelism and Locality in Simulation – Part 2 James Demmel

CS 267 Sources of Parallelism and Locality in Simulation – Part 2

CS240A: Conjugate Gradients and the Model Problem.

ECE 530 – Analysis Techniques for Large-Scale Electrical Systems Prof. Hao Zhu Dept. of Electrical and Computer Engineering University of Illinois at Urbana-Champaign.

Winter 2014Parallel Processing, Fundamental ConceptsSlide 1 2 A Taste of Parallel Algorithms Learn about the nature of parallel algorithms and complexity:

Fall 2008Simple Parallel Algorithms1. Fall 2008Simple Parallel Algorithms2 Scalar Product of Two Vectors Let a = (a 1, a 2, …, a n ); b = (b 1, b 2, …,

3/12/2013Computer Engg, IIT(BHU)1 PRAM ALGORITHMS-3.

Grade School Again: A Parallel Perspective CS Lecture 7.

Optimizing 3D Multigrid to Be Comparable to the FFT Michael Maire and Kaushik Datta Note: Several diagrams were taken from Kathy Yelick’s CS267 lectures.

CHAPTER 7 Determinant s. Outline - Permutation - Definition of the Determinant - Properties of Determinants - Evaluation of Determinants by Elementary.

Numerical Algorithms Chapter 11.

CS 267 Tricks with Trees James Demmel Why talk about this now?

Lecture 3: Parallel Algorithm Design

Propositional Equivalence

Model Problem: Solving Poisson’s equation for temperature

Chapter 6 Transform-and-Conquer

CS 267 Tricks with Trees James Demmel Why talk about this now?

Computation on meshes, sparse matrices, and graphs

Computational meshes, matrices, conjugate gradients, and mesh partitioning Some slides are from David Culler, Jim Demmel, Bob Lucas, Horst Simon, Kathy.

CS 267 Tricks with Trees Why talk about this now?

James Demmel CS 267 Tricks with Trees James Demmel 01/31/2012 CS267 Lecture.

CS 267 Sources of Parallelism and Locality in Simulation – Part 2

James Demmel and Kathy Yelick

Systems of Linear Equations and Matrices

James Demmel CS 267 Applications of Parallel Computers Lecture 12: Sources of Parallelism and Locality.

James Demmel and Horst Simon

James Demmel CS 267 Tricks with Trees James Demmel 02/11/2009 CS267 Lecture.

Presentation transcript:

CS267 L12 Sources of Parallelism(3).1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 12: Sources of Parallelism and Locality (Part 3) Tricks with Trees James Demmel

CS267 L12 Sources of Parallelism(3).2 Demmel Sp 1999 Recap of last lecture °ODEs Sparse Matrix-vector multiplication Graph partitioning to balance load and minimize communication °PDEs Heat Equation and Poisson Equation Solving a certain special linear system T Many algorithms, ranging from -Dense Gaussian elimination, slow but very general, to -Multigrid, fast but only works on matrices like T

CS267 L12 Sources of Parallelism(3).3 Demmel Sp 1999 Outline °Continuation of PDEs What do realistic meshes look like? °Tricks with Trees

CS267 L12 Sources of Parallelism(3).4 Demmel Sp 1999 Partial Differential Equations PDEs

CS267 L12 Sources of Parallelism(3).5 Demmel Sp 1999 Poisson’s equation in 1D °Solve Tx=b where T = 2 Graph and “stencil”

CS267 L12 Sources of Parallelism(3).6 Demmel Sp 1999 Poisson’s equation in 2D °Solve Tx=b where °3D is analogous T = 4 Graph and “stencil”

CS267 L12 Sources of Parallelism(3).7 Demmel Sp 1999 Algorithms for 2D Poisson Equation with N unknowns AlgorithmSerialPRAMMemory #Procs °Dense LUN 3 NN 2 N 2 °Band LUN 2 NN 3/2 N °JacobiN 2 NNN °Explicit Inv.N log NNN °Conj.Grad.N 3/2 N 1/2 *log NNN °RB SORN 3/2 N 1/2 NN °Sparse LUN 3/2 N 1/2 N*log NN °FFTN*log Nlog NNN °MultigridNlog 2 NNN °Lower boundNlog NN PRAM is an idealized parallel model with zero cost communication (see next slide for explanation) 222

CS267 L12 Sources of Parallelism(3).8 Demmel Sp 1999 Relation of Poisson’s equation to Gravity, Electrostatics °Force on particle at (x,y,z) due to particle at 0 is -(x,y,z)/r^3, where r = sqrt(x +y +z ) °Force is also gradient of potential V = -1/r = -(d/dx V, d/dy V, d/dz V) = -grad V °V satisfies Poisson’s equation (try it!) 222

CS267 L12 Sources of Parallelism(3).9 Demmel Sp 1999 Comments on practical meshes °Regular 1D, 2D, 3D meshes Important as building blocks for more complicated meshes °Practical meshes are often irregular Composite meshes, consisting of multiple “bent” regular meshes joined at edges Unstructured meshes, with arbitrary mesh points and connectivities Adaptive meshes, which change resolution during solution process to put computational effort where needed

CS267 L12 Sources of Parallelism(3).10 Demmel Sp 1999 Composite mesh from a mechanical structure

CS267 L12 Sources of Parallelism(3).11 Demmel Sp 1999 Converting the mesh to a matrix

CS267 L12 Sources of Parallelism(3).12 Demmel Sp 1999 Effects of Ordering Rows and Columns on Gaussian Elimination

CS267 L12 Sources of Parallelism(3).13 Demmel Sp 1999 Irregular mesh: NASA Airfoil in 2D (direct solution)

CS267 L12 Sources of Parallelism(3).14 Demmel Sp 1999 Irregular mesh: Tapered Tube (multigrid)

CS267 L12 Sources of Parallelism(3).15 Demmel Sp 1999 Adaptive Mesh Refinement (AMR) °Adaptive mesh around an explosion °John Bell and Phil Colella at LBL (see class web page for URL) °Goal of Titanium is to make these algorithms easier to implement in parallel

CS267 L12 Sources of Parallelism(3).16 Demmel Sp 1999 Challenges of irregular meshes (and a few solutions) °How to generate them in the first place Triangle, a 2D mesh partitioner by Jonathan Shewchuk 3D harder! °How to partition them ParMetis, a parallel graph partitioner °How to design iterative solvers PETSc, a Portable Extensible Toolkit for Scientific Computing Prometheus, a multigrid solver for finite element problems on irregular meshes Titanium, a language to implement Adaptive Mesh Refinement °How to design direct solvers SuperLU, parallel sparse Gaussian elimination °These are challenges to do sequentially, the more so in parallel

CS267 L12 Sources of Parallelism(3).17 Demmel Sp 1999 Tricks with Trees

CS267 L12 Sources of Parallelism(3).18 Demmel Sp 1999 Outline °A log n lower bound to compute any function in parallel °Reduction and broadcast in O(log n) time °Parallel prefix (scan) in O(log n) time °Adding two n-bit integers in O(log n) time °Multiplying n-by-n matrices in O(log n) time °Inverting n-by-n triangular matrices in O(log n) time °Inverting n-by-n dense matrices in O(log n) time °Evaluating arbitrary expressions in O(log n) time °Evaluating recurrences in O(log n) time °Solving n-by-n tridiagonal matrices in O(log n) time °Traversing linked lists °Computing minimal spanning trees °Computing convex hulls of point sets 2 2

CS267 L12 Sources of Parallelism(3).19 Demmel Sp 1999 A log n lower bound to compute any function of n variables °Assume we can only use binary operations, one per time unit °After 1 time unit, an output can only depend on two inputs °Use induction to show that after k time units, an output can only depend on 2 k inputs °A binary tree performs such a computation

CS267 L12 Sources of Parallelism(3).20 Demmel Sp 1999 Broadcasts and Reductions on Trees

CS267 L12 Sources of Parallelism(3).21 Demmel Sp 1999 Parallel Prefix, or Scan °If “+” is an associative operator, and x[0],…,x[p-1] are input data then parallel prefix operation computes °Notation: j:k mean x[j]+x[j+1]+…+x[k], blue is final value y[j] = x[0] + x[1] + … + x[j] for j=0,1,…,p-1

CS267 L12 Sources of Parallelism(3).22 Demmel Sp 1999 Mapping Parallel Prefix onto a Tree - Details °Up-the-tree phase (from leaves to root ) °Down the tree phase (from root to leaves) °By induction, S = sum of all leaves to left of subtree rooted at the parent 1) Get values L and R from left and right children 2) Save L in a local register M 3) Pass sum S = L+R to parent 1) Get value S from parent (the root gets 0) 2) Send S to the left child 3) Send S + M to the right child

CS267 L12 Sources of Parallelism(3).23 Demmel Sp 1999 Adding two n-bit integers in O(log n) time °Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers °We want their sum s = a+b = s[n]s[n-1]…s[0] °Challenge: compute all c[i] in O(log n) time via parallel prefix °Used in all computers to implement addition - Carry look-ahead c[-1] = 0 … rightmost carry bit for i = 0 to n-1 c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )... next carry bit s[i] = a[i] xor b[i] xor c[i-1] for all (0 <= i <= n-1) p[i] = a[i] xor b[i] … propagate bit for all (0 <= i <= n-1) g[i] = a[i] and b[i] … generate bit c[i] = ( p[i] and c[i-1] ) or g[i] = p[i] g[i] * c[i-1] = C[i] * c[i-1] … 2-by-2 Boolean matrix multiplication (associative) = C[i] * C[i-1] * … C[0] * 0 1 … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix

CS267 L12 Sources of Parallelism(3).24 Demmel Sp 1999 Multiplying n-by-n matrices in O(log n) time °For all (1 <= i,j,k <= n) P(i,j,k) = A(i,k) * B(k,j) cost = 1 time unit, using n^3 processors °For all (1 <= I,j <= n) C(i,j) =  P(i,j,k) cost = O(log n) time, using a tree with n^3 / 2 processors k =1 n

CS267 L12 Sources of Parallelism(3).25 Demmel Sp 1999 Inverting triangular n-by-n matrices in O(log 2 n) time °Fact: °Function TriInv(T) … assume n = dim(T) = 2 m for simplicity °time(TriInv(n)) = time(TriInv(n/2)) + O(log(n)) Change variable to m = log n to get time(TriInv(n)) = O(log 2 n) A 0 C B = A 0 -B CA B If T is 1-by-1 return 1/T else … Write T = A 0 C B In parallel do { invA = TriInv(A) invB = TriInv(B) } … implicitly uses a tree newC = -invB * C * invA Return invA 0 newC invB

CS267 L12 Sources of Parallelism(3).26 Demmel Sp 1999 Inverting Dense n-by-n matrices in O(log n) time °Lemma 1: Cayley-Hamilton Theorem expression for A -1 via characteristic polynomial in A °Lemma 2: Newton’s Identities Triangular system of equations for coefficients of characteristic polynomial °Lemma 3: trace(A k ) =  A k [i,i] =  [ i (A)] k °Csanky’s Algorithm (1976) °Completely numerically unstable 2 i=1 n n 1) Compute the powers A 2, A 3, …,A n-1 by parallel prefix cost = O(log 2 n) 2) Compute the traces s k = trace(A k ) cost = O(log n) 3) Solve Newton identities for coefficients of characteristic polynomial cost = O(log 2 n) 4) Evaluate A -1 using Cayley-Hamilton Theorem cost = O(log n)

CS267 L12 Sources of Parallelism(3).27 Demmel Sp 1999 Evaluating arbitrary expressions °Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately °Can think of E as arbitrary expression tree with n leaves (the variables) and internal nodes labelled by +, -, * and / °Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using laws of commutativity, associativity and distributivity °Sketch of (modern) proof: evaluate expression tree E greedily by collapsing all leaves into their parents at each time step evaluating all “chains” in E with parallel prefix

CS267 L12 Sources of Parallelism(3).28 Demmel Sp 1999 Evaluating recurrences °Let x i = f i (x i-1 ), f i a rational function, x 0 given °How fast can we compute x n ? °Theorem (Kung): Suppose degree(f i ) = d for all I If d=1, x n can be evaluated in O(log n) using parallel prefix If d>1, evaluating x n takes  (n) time, i.e. no speedup is possible °Sketch of proof when d=1 °Sketch of proof when d>1 degree(x i ) as a function of x 0 is d i After k parallel steps, degree(anything) <= 2 k Computing x i take  (i) steps x i = f i (x i-1 ) = (a i * x i-1 + b i )/( c i * x i-1 + d i ) can be written as x i = num i / den i = (a i * num i-1 + b i * den i-1 )/(c i * num i-1 + d i * den i-1 ) or num i = a i b i * num i-1 = M i * num i-1 = M i * M i-1 * … * M 1 * num 0 dem i c i d i den i-1 den i-1 den 0 Can use parallel prefix with 2-by-2 matrix multiplication

CS267 L12 Sources of Parallelism(3).29 Demmel Sp 1999 Summary of tree algorithms °Lots of problems can be done quickly - in theory - using trees °Some algorithms are widely used broadcasts, reductions, parallel prefix carry look ahead addition °Some are of theoretical interest only Csanky’s method for matrix inversion Solving general tridiagonals (without pivoting) Both numerically unstable Csanky needs too many processors °Embedded in various systems CM-5 hardware control network MPI, Split-C, Titanium, NESL, other languages