CS 240A: Parallel Prefix Algorithms or Tricks with Trees Some slides from Jim Demmel, Kathy Yelick, Alan Edelman, and a cast of thousands …



PRAM model of parallel computation
A very simple theoretical model, used in the 1970s and 1980s for lots of “paper designs” of parallel algorithms.
Processors have unit-time access to any location in shared memory.
The number of processors is allowed to grow with the problem size.
The goal is (usually) an algorithm with span O(log n) or O(log^2 n).
E.g.: Can you sort n numbers with T_1 = O(n log n) and T_n = O(log n)? This was a big open question until Cole solved it in 1988.
A very unrealistic model, but sometimes useful for thinking about a problem.
[Figure: processors P1, P2, ..., Pn sharing one Memory -- the Parallel Random Access Machine.]

Parallel Vector Operations
Vector add: z = x + y. Embarrassingly parallel if the vectors are aligned; span = 1.
DAXPY: v = α*v + β*w (vectors v, w; scalars α, β). Broadcast α and β, then pointwise vector add; span = log n.
DDOT: α = vᵀ*w (vectors v, w; scalar α). Pointwise vector multiply, then sum reduction; span = log n.

Broadcast and reduction
Broadcast of 1 value to p processors with log p span.
Reduction of p values to 1 with log p span.
Uses associativity of +, *, min, max, etc.
[Figure: a broadcast tree fanning α out to all processors, and an add-reduction tree combining values back to one.]

A theoretical secret for turning serial into parallel Surprising parallel algorithms: If “there is no way to parallelize this algorithm!” … … it’s probably a variation on parallel prefix! Parallel Prefix Algorithms

Example of a prefix (also called a scan)
Sum Prefix
Input: x = (x_1, x_2, ..., x_n)
Output: y = (y_1, y_2, ..., y_n), where y_i = Σ_{j=1:i} x_j
Example: x = ( 1, 2, 3, 4, 5, 6, 7, 8 ), y = ( 1, 3, 6, 10, 15, 21, 28, 36 )
Prefix functions -- outputs depend upon an initial string

What do you think? Can we really parallelize this? It looks like this kind of code:

y(0) = 0;
for i = 1:n
  y(i) = y(i-1) + x(i);
end

The i-th iteration of the loop depends completely on the (i-1)-st iteration.
Work = n, span = n, parallelism = 1. Impossible to parallelize, right?

A clue?
x = ( 1, 2, 3, 4, 5, 6, 7, 8 )
y = ( 1, 3, 6, 10, 15, 21, 28, 36 )
Is there any value in adding up just part of the sequence? If we separately have 1+2+3, what can we do?
Suppose we added 1+2, 3+4, etc. pairwise -- what could we do?

Prefix sum in parallel
Algorithm:
1. Pairwise sum of adjacent elements
2. Recursive prefix (recursively compute the prefix sums of the n/2 pairwise sums)
3. Pairwise sum again, to fill in the remaining ("odd") positions
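The three steps can be sketched as a serial Python simulation (the name `prefix_sum` is mine, not from the slides; assumes n is a power of two; each list comprehension corresponds to one parallel step):

```python
def prefix_sum(x):
    """Recursive parallel prefix, simulated serially.
    Assumes len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # 1. Pairwise sums of adjacent elements -- one parallel step
    pairs = [x[2 * i] + x[2 * i + 1] for i in range(n // 2)]
    # 2. Recursive prefix on the half-length array
    psums = prefix_sum(pairs)
    # 3. Update: odd positions (0-indexed) come straight from the
    #    recursion; even positions add x[j] to the previous pair sum.
    y = [0] * n
    for j in range(n):
        if j % 2 == 1:
            y[j] = psums[j // 2]
        else:
            y[j] = (psums[j // 2 - 1] if j > 0 else 0) + x[j]
    return y

print(prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
# -> [1, 3, 6, 10, 15, 21, 28, 36]
```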

Parallel prefix cost: Work and Span
What's the total work?
Pairwise sums: n/2
Recursive prefix: T_1(n/2)
Update "odds": n/2
T_1(n) = n/2 + n/2 + T_1(n/2) = n + T_1(n/2) = 2n - 1
T_∞(n) = 2 log n
Parallelism, at the cost of twice the work!

Non-recursive view of parallel prefix scan
Tree summation: two phases.
Up sweep:
  get values L and R from left and right child
  save L in local variable Mine
  compute Tmp = L + R and pass to parent
Down sweep:
  get value Tmp from parent (the root receives 0)
  send Tmp to left child
  send Tmp + Mine to right child
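The two sweeps can be sketched in-place over an implicit binary tree stored in an array (a hypothetical helper of my naming, assuming the length is a power of two; because the root receives 0, the result is the exclusive scan -- add x[i] back to each position to recover the inclusive scan from the earlier slides):

```python
def exclusive_scan(x):
    """Two-sweep tree scan. Up sweep: each internal node gets the sum
    of its subtree (its left-subtree sum, "Mine", stays behind).
    Down sweep: the root gets 0; a left child receives Tmp, a right
    child receives Tmp + Mine. Result: exclusive prefix sums."""
    n = len(x)
    a = list(x)
    d = 1
    while d < n:                          # up sweep
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]              # parent (stored at i) = L + R
        d *= 2
    a[n - 1] = 0                          # root receives 0
    d = n // 2
    while d >= 1:                         # down sweep
        for i in range(2 * d - 1, n, 2 * d):
            mine = a[i - d]               # left-subtree sum saved on the way up
            a[i - d] = a[i]               # left child <- Tmp
            a[i] += mine                  # right child <- Tmp + Mine
        d //= 2
    return a

print(exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# -> [0, 1, 3, 6, 10, 15, 21, 28]
```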

Any associative operation works. Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
Input: Bits (Booleans) -- All (and), Any (or)
Input: Reals -- Sum (+), Product (*), Max, Min
Input: Matrices -- MatMul (not commutative!)

Scan (Parallel Prefix) Operations
Definition: the parallel prefix operation takes a binary associative operator ⊕ and an array of n elements [a_0, a_1, a_2, ..., a_{n-1}], and produces the array [a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-1})].
Example: add scan of [1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14].
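A minimal serial reference for this definition, with the operator ⊕ passed in as a function (illustrative only; the tree algorithms above compute the same result with O(log n) span):

```python
def scan(op, a):
    """Inclusive scan (parallel prefix) of a under associative op.
    Serial reference implementation of the definition."""
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out

print(scan(lambda u, v: u + v, [1, 2, 0, 4, 2, 1, 1, 3]))
# -> [1, 3, 3, 7, 9, 10, 11, 14]   (the add-scan example above)
print(scan(max, [3, 1, 4, 1, 5, 9, 2, 6]))
# -> [3, 3, 4, 4, 5, 9, 9, 9]      (max works too: it is associative)
```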

Applications of scans
Many applications, some more obvious than others:
- lexically compare strings of characters
- add multi-precision numbers
- add binary numbers fast in hardware
- graph algorithms
- evaluate polynomials
- implement bucket sort, radix sort, and even quicksort
- solve tridiagonal linear systems
- solve recurrence relations
- dynamically allocate processors
- search for regular expressions (grep)
- image processing primitives

Using Scans for Array Compression
Given an array of n elements [a_0, a_1, a_2, ..., a_{n-1}] and an array of flags [1,0,1,1,0,0,1,...], compress the flagged elements into [a_0, a_2, a_3, a_6, ...].
Compute an add scan of [0, flags]: [0,1,1,2,3,3,4,...]
This gives the index of each element in the compressed array.
If the flag for an element is 1, write it into the result array at the given position.

Array compression: keep only positives
Matlab code:

% Start with a vector of n random numbers,
% normally distributed around 0.
A = randn(1,n);
flag = (A > 0);
addscan = cumsum(flag);
parfor i = 1:n
  if flag(i)
    B(addscan(i)) = A(i);
  end
end

Fibonacci via Matrix Multiply Prefix
F_{n+1} = F_n + F_{n-1}, i.e.

[ F_{n+1} ]   [ 1  1 ] [ F_n     ]
[ F_n     ] = [ 1  0 ] [ F_{n-1} ]

Can compute all F_n by matmul_prefix on [M, M, M, ..., M], where M = [1 1; 1 0], then select the upper-left entry.
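A Python sketch of the idea (serial loop here; a matmul-prefix scan would produce all the partial products in O(log n) span; the names are mine):

```python
def matmul2(A, B):
    """2x2 integer matrix product."""
    return [[A[0][0] * B[0][0] + A[0][1] * B[1][0],
             A[0][0] * B[0][1] + A[0][1] * B[1][1]],
            [A[1][0] * B[0][0] + A[1][1] * B[1][0],
             A[1][0] * B[0][1] + A[1][1] * B[1][1]]]

M = [[1, 1], [1, 0]]                 # the Fibonacci step matrix
prods, P = [], [[1, 0], [0, 1]]      # prefix products of [M, M, ..., M]
for _ in range(8):
    P = matmul2(P, M)                # a scan would do all 8 in O(log n) span
    prods.append(P)
fibs = [Q[0][0] for Q in prods]      # upper-left entry of M^k is F_{k+1}
print(fibs)
# -> [1, 2, 3, 5, 8, 13, 21, 34]
```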

Carry-Lookahead Addition (Babbage, 1800s)
Goal: add two n-bit integers.
Notation (example, n = 4):
Carry:       c_2 c_1 c_0
First int:   a_3 a_2 a_1 a_0
Second int:  b_3 b_2 b_1 b_0
Sum:         s_3 s_2 s_1 s_0

c_{-1} = 0
for i = 0 : n-1
  s_i = a_i + b_i + c_{i-1}          (addition mod 2)
  c_i = a_i b_i + c_{i-1} (a_i + b_i)
end
s_n = c_{n-1}

In matrix form:

[ c_i ]   [ a_i + b_i   a_i b_i ] [ c_{i-1} ]
[ 1   ] = [ 0           1       ] [ 1       ]

1. Compute all c_i by binary matmul prefix.
2. Compute s_i = a_i + b_i + c_{i-1} (mod 2) in parallel.

Adding two n-bit integers in O(log n) time
Let a = a[n-1]a[n-2]...a[0] and b = b[n-1]b[n-2]...b[0] be two n-bit binary numbers.
We want their sum s = a + b = s[n]s[n-1]...s[0].

c[-1] = 0                                                      ... rightmost carry bit
for i = 0 to n-1
  c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )   ... next carry bit
  s[i] = a[i] xor b[i] xor c[i-1]

Challenge: compute all c[i] in O(log n) time via parallel prefix.
for all 0 <= i <= n-1:  p[i] = a[i] xor b[i]                   ... propagate bit
for all 0 <= i <= n-1:  g[i] = a[i] and b[i]                   ... generate bit

c[i] = ( p[i] and c[i-1] ) or g[i], i.e.

[ c[i] ]   [ p[i]  g[i] ] [ c[i-1] ]
[ 1    ] = [ 0     1    ] [ 1      ]  =  M[i] * [ c[i-1]; 1 ]

2-by-2 Boolean matrix multiplication is associative, so
[ c[i]; 1 ] = M[i] * M[i-1] * ... * M[0] * [ 0; 1 ]
Evaluate each product M[i] * M[i-1] * ... * M[0] by parallel prefix.
Used in all computers to implement addition -- carry lookahead.
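A Python sketch of the matrix formulation (a serial prefix loop standing in for the parallel scan; `add_bits` and the little-endian bit-list convention are my assumptions, not from the slides):

```python
def bool_matmul(A, B):
    """2x2 matrix product over the Boolean semiring (or = +, and = *)."""
    return [[(A[0][0] and B[0][0]) or (A[0][1] and B[1][0]),
             (A[0][0] and B[0][1]) or (A[0][1] and B[1][1])],
            [(A[1][0] and B[0][0]) or (A[1][1] and B[1][0]),
             (A[1][0] and B[0][1]) or (A[1][1] and B[1][1])]]

def add_bits(a, b):
    """Add two equal-length little-endian bit lists (a[0] = LSB)."""
    n = len(a)
    # M[i] = [[p_i, g_i], [0, 1]] with propagate p_i, generate g_i
    Ms = [[[a[i] ^ b[i], a[i] & b[i]], [0, 1]] for i in range(n)]
    carries, P = [], [[1, 0], [0, 1]]
    for Mi in Ms:                    # prefix products M[i]*M[i-1]*...*M[0]
        P = bool_matmul(Mi, P)       # (done by parallel prefix in the real thing)
        carries.append(P[0][1])      # [c_i; 1] = P * [0; 1]  ->  c_i = P[0][1]
    c = [0] + carries                # shift so c[i] is the carry INTO bit i
    s = [a[i] ^ b[i] ^ c[i] for i in range(n)] + [carries[-1]]
    return s

print(add_bits([1, 0, 1, 1], [1, 1, 0, 1]))   # 13 + 11
# -> [0, 0, 0, 1, 1]  (little-endian 24)
```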

Segmented Operations
Inputs = ordered pairs (operand, boolean), e.g. (x, T) or (x, F).
A change of segment is indicated by switching T/F.

The segmented operator ⊕_2:
            (y, T)          (y, F)
(x, T)      (x ⊕ y, T)      (y, F)
(x, F)      (y, T)          (x ⊕ y, F)

e.g. with ⊕ = + on the values 1 ... 8:
flags:  T T | F F F  | T | F | T
result: 1 3 | 3 7 12 | 6 | 7 | 8

Any Prefix Operation May Be Segmented!
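One way to sketch a segmented scan in Python, lifting an associative op to (value, flag) pairs with the slides' convention that a T/F switch between neighbours starts a new segment (serial reference; the names are mine):

```python
def seg_op(op):
    """Lift associative op to (value, flag) pairs: combine when the
    flags agree (same segment); otherwise the right pair starts a
    new segment and wins."""
    def f(u, v):
        (x, fx), (y, fy) = u, v
        return (op(x, y), fy) if fx == fy else (y, fy)
    return f

def segmented_scan(op, pairs):
    """Inclusive scan of (value, flag) pairs under the lifted op."""
    f = seg_op(op)
    out = [pairs[0]]
    for p in pairs[1:]:
        out.append(f(out[-1], p))
    return out

data  = [1, 2, 3, 4, 5, 6, 7, 8]
flags = [True, True, False, False, False, True, False, True]  # T T F F F T F T
print([v for v, _ in segmented_scan(lambda a, b: a + b, list(zip(data, flags)))])
# -> [1, 3, 3, 7, 12, 6, 7, 8]   (sums restart at every flag switch)
```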

Graph algorithms by segmented scans
[Figure: the nbr and firstnbr arrays of a graph in CSR form, with segment flags (T T F F F T F) marking where each vertex's neighbor list begins.]
The usual CSR data structure, plus segment flags!

Multiplying n-by-n matrices in O(log n) span
For all 1 <= i,j,k <= n:  P(i,j,k) = A(i,k) * B(k,j)    -- span = 1, work = n^3
For all 1 <= i,j <= n:    C(i,j) = Σ_k P(i,j,k)         -- span = O(log n), work = n^3, using a tree
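A sketch of the two phases in Python (serial simulation with names of my choosing; each level of the pairwise reduction is one parallel step, which is where the O(log n) span comes from):

```python
def tree_sum(vals):
    """Sum a list by pairwise combination, one level at a time.
    Each level is one parallel step, so span = O(log n)."""
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)] \
               + ([vals[-1]] if len(vals) % 2 else [])
    return vals[0]

def matmul_tree(A, B):
    """C(i,j) = sum_k A(i,k)*B(k,j). All n^3 products P(i,j,k) are
    independent (span 1); each k-sum is then reduced by a tree."""
    n = len(A)
    return [[tree_sum([A[i][k] * B[k][j] for k in range(n)])
             for j in range(n)] for i in range(n)]

print(matmul_tree([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# -> [[19, 22], [43, 50]]
```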

Inverting dense n-by-n matrices in O(log^2 n) span
Lemma 1 (Cayley-Hamilton Theorem): expression for A^{-1} via the characteristic polynomial in A.
Lemma 2 (Newton's Identities): triangular system of equations for the coefficients of the characteristic polynomial.
Lemma 3: trace(A^k) = Σ_{i=1:n} A^k[i,i] = Σ_{i=1:n} [λ_i(A)]^k
Csanky's Algorithm (1976):
1) Compute the powers A^2, A^3, ..., A^{n-1} by parallel prefix -- span = O(log^2 n)
2) Compute the traces s_k = trace(A^k) -- span = O(log n)
3) Solve Newton's identities for the coefficients of the characteristic polynomial -- span = O(log^2 n)
4) Evaluate A^{-1} using the Cayley-Hamilton Theorem -- span = O(log n)
Completely numerically unstable!
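A serial Python sketch of Csanky's four steps with plain-list matrices (illustration only: the real algorithm computes the powers by parallel prefix, and as the slide says the method is completely numerically unstable for sizable n; the function name is mine):

```python
def csanky_inverse(A):
    """Csanky's algorithm, serial sketch. Steps match the slide:
    1) powers A^1..A^n   2) traces s_k = trace(A^k)
    3) Newton's identities for char-poly coefficients e_k
    4) A^{-1} from the Cayley-Hamilton theorem."""
    n = len(A)
    I = [[float(i == j) for j in range(n)] for i in range(n)]

    def mul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]

    powers = [I]                      # A^0, A^1, ..., A^n
    for _ in range(n):
        powers.append(mul(powers[-1], A))
    s = [sum(P[i][i] for i in range(n)) for P in powers]   # s_k = tr(A^k)

    e = [1.0] + [0.0] * n             # e_0 = 1
    for k in range(1, n + 1):         # Newton: k*e_k = sum_{i=1..k} (-1)^(i-1) e_{k-i} s_i
        e[k] = sum((-1) ** (i - 1) * e[k - i] * s[i]
                   for i in range(1, k + 1)) / k

    # Cayley-Hamilton: sum_{j=0..n} (-1)^j e_j A^(n-j) = 0, hence
    # A^{-1} = (-1)^(n+1)/e_n * sum_{j=0..n-1} (-1)^j e_j A^(n-1-j)
    coef = (-1) ** (n + 1) / e[n]
    return [[coef * sum((-1) ** j * e[j] * powers[n - 1 - j][r][c]
                        for j in range(n))
             for c in range(n)] for r in range(n)]
```

For A = [[2,1],[1,3]] this yields [[0.6,-0.2],[-0.2,0.4]], which is 1/5 * [[3,-1],[-1,2]], the exact inverse.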

Evaluating arbitrary expressions
Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately.
Can think of E as an arbitrary expression tree with n leaves (the variables) and internal nodes labelled by +, -, * and /.
Theorem (Brent): E can be evaluated with O(log n) span, if we reorganize it using the laws of commutativity, associativity and distributivity.
Sketch of (modern) proof: evaluate the expression tree E greedily by, at each time step, collapsing all leaves into their parents and evaluating all "chains" in E with parallel prefix.

The myth of log n
The log_2 n span is not the main reason for the usefulness of parallel prefix.
Say n = p × (summands per processor).
Cost = (n/p local adds) + (log_2 p message passings)
The local adds are fast and embarrassingly parallel (though serial within each processor, of course); only the p partial sums go through the log-depth combining tree.

Summary of tree algorithms
Lots of problems can be done quickly -- in theory -- using trees.
Some algorithms are widely used:
- broadcasts, reductions, parallel prefix
- carry-lookahead addition
Some are of theoretical interest only:
- Csanky's method for matrix inversion
- solving tridiagonal linear systems (without pivoting)
(Both are numerically unstable, and Csanky does too much work.)
Embedded in various systems:
- CM-5 hardware control network
- MPI, UPC, Titanium, NESL, other languages