Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS 565 - Fall 2013

Announcements
- Project 1: due Thursday 09/19
- Reminders: commit often; make a great README.md
- Philly Transit Hackathon this weekend

Review: SP, SM; kernel, thread, warp, block, grid

Agenda: Parallel Algorithms
- Parallel Reduction
- Scan
- Stream Compaction
- Summed Area Tables
- Radix Sort

Parallel Reduction Given an array of numbers, design a parallel algorithm to find the sum. Consider the arithmetic intensity: the ratio of compute to memory accesses.

Parallel Reduction Given an array of numbers, design a parallel algorithm to find:
- The sum
- The maximum value
- The product of values
- The average value
How different are these algorithms?

Parallel Reduction Reduction: an operation that computes a single result from a set of data. Examples: minimum/maximum value; average, sum, product, etc. Parallel Reduction: do it in parallel. Obviously.

Parallel Reduction Example: find the sum. Each pass adds pairs of elements in parallel, halving the number of partial sums until one remains.

Parallel Reduction Similar to the brackets of a basketball tournament: log(n) passes for n elements.
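To ground the picture, here is a minimal CUDA sketch of one reduction pass; the kernel name, BLOCK_SIZE, and launch configuration are illustrative, not the course's reference code, and the block size is assumed to be a power of two. Each block loads its elements into shared memory and halves the number of active threads each step, writing one partial sum per block.

```cuda
#define BLOCK_SIZE 256

// Minimal sketch: each block reduces BLOCK_SIZE elements to one partial sum.
// A second launch (or a host loop) reduces the per-block sums the same way.
__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float s[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[tid] = (i < n) ? in[i] : 0.0f;      // 0 is the identity for sum
    __syncthreads();

    // Tournament-bracket tree: stride halves each pass,
    // so a block finishes in log2(BLOCK_SIZE) passes.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0]; // one partial sum per block
}
```

Swapping += for max, a product, etc. changes which reduction this computes; only the operator and its identity differ, which is why the sum/max/product/average variants above are nearly the same algorithm.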

All-Prefix-Sums
- Input: an array of n elements [a0, a1, ..., a(n-1)], a binary associative operator ⊕, and an identity element I
- Output: the array [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ a(n-2))]

All-Prefix-Sums Example: if ⊕ is addition, the array [3 1 7 0 4 1 6 3] is transformed to [0 3 4 11 11 15 16 22]. Seems sequential, but there is an efficient parallel solution.

Scan Scan: the all-prefix-sums operation on an array of data.
Exclusive Scan (prescan): element j of the result does not include element j of the input. In: [3 1 7 0 4 1 6 3] Out: [0 3 4 11 11 15 16 22]
Inclusive Scan: all elements up to and including j are summed. In: [3 1 7 0 4 1 6 3] Out: [3 4 11 11 15 16 22 25]

Scan How do you generate an exclusive scan from an inclusive scan? Input: [3 1 7 0 4 1 6 3] Inclusive: [3 4 11 11 15 16 22 25] Exclusive: [0 3 4 11 11 15 16 22] // Shift right, insert identity. How do you go in the opposite direction?
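The shift-right answer, as a sequential sketch (the function name is illustrative):

```cuda
// Exclusive from inclusive: shift right by one and insert the identity.
void inclusiveToExclusive(const int* inc, int* exc, int n) {
    exc[0] = 0;                  // identity for addition
    for (int j = 1; j < n; ++j)
        exc[j] = inc[j - 1];
}
```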

Scan Use cases:
- Stream compaction
- Summed-area tables for variable-width image processing
- Radix sort
- ...

Scan Used to convert certain sequential computations into equivalent parallel computations.

Scan Design a parallel algorithm for exclusive scan. In: [3 1 7 0 4 1 6 3] Out: [0 3 4 11 11 15 16 22]. Consider the total number of additions.

Scan Sequential Scan: single thread, trivial. n adds for an array of length n. How many adds will our parallel version have?
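For reference, the sequential exclusive scan is one loop over the array (a minimal sketch; the function name is illustrative):

```cuda
// Single-threaded exclusive scan: one add per element.
void exclusiveScan(const int* in, int* out, int n) {
    int sum = 0;                 // identity for addition
    for (int j = 0; j < n; ++j) {
        out[j] = sum;            // element j excludes in[j]
        sum += in[j];
    }
}
```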

Scan Naive Parallel Scan. Each thread writes one sum and reads two values. Is this exclusive or inclusive?

```
for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k - 2^(d-1)] + x[k]
```

Scan Naive Parallel Scan: d = 1, 2^(d-1) = 1 — each element with k >= 1 adds in its left neighbor. Recall, it runs in parallel!

Scan Naive Parallel Scan: d = 2, 2^(d-1) = 2 — consider only k = 7: it adds in the partial sum from two positions away.

Scan Naive Parallel Scan: d = 3, 2^(d-1) = 4 — consider only k = 7: after this pass, x[7] holds the sum of x[0] through x[7].

Scan Naive Parallel Scan: Final — after log2(n) passes the array holds the inclusive scan of the input.
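A sketch of one pass of the naive scan as a CUDA kernel (the kernel name and launch code are illustrative). Note that the in-place pseudocode above hides a read-after-write hazard between threads, so this version double-buffers:

```cuda
// One naive-scan pass: out[k] = in[k - offset] + in[k] for k >= offset.
__global__ void naiveScanPass(const float* in, float* out, int n, int offset) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    out[k] = (k >= offset) ? in[k - offset] + in[k] : in[k];
}

// Host side: one launch per d = 1..log2(n), swapping buffers between passes.
// for (int offset = 1; offset < n; offset *= 2) {
//     naiveScanPass<<<blocks, threadsPerBlock>>>(bufA, bufB, n, offset);
//     std::swap(bufA, bufB);   // each pass's output feeds the next pass
// }
```

Counting the work: pass d performs n - 2^(d-1) additions, so the total is O(n log n) adds — more than the sequential scan's n, which is the trade-off the earlier slide asks about.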

Stream Compaction Given an array of elements, create a new array containing only the elements that meet a certain criterion (e.g., non-null), preserving their order. Input: [a b f c d e g h] → Output: [a c d g]

Stream Compaction Used in path tracing, collision detection, sparse matrix compression, etc. Compaction can reduce the bandwidth needed to copy results from GPU to CPU.

Stream Compaction Step 1: Compute a temporary array containing 1 if the corresponding element meets the criterion, 0 if it does not. Input: [a b f c d e g h]

Stream Compaction Step 1 (walkthrough): for input [a b f c d e g h], keeping a, c, d, and g, the temporary array is [1 0 0 1 1 0 1 0]. It runs in parallel!

Stream Compaction Step 2: Run an exclusive scan on the temporary array. Scan result: [0 1 1 1 2 3 3 4]. Scan runs in parallel. What can we do with the results?

Stream Compaction Step 3: Scatter. The result of the scan is the index into the final array. Only write an element if the temporary array has a 1.

Stream Compaction Step 3 (walkthrough): a is written to index 0, c to index 1, d to index 2, and g to index 3, giving the final array [a c d g]. Scatter runs in parallel!
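Steps 1 and 3 as CUDA kernels, with the exclusive scan of step 2 assumed to be provided; the kernel names and the non-zero criterion are illustrative stand-ins, not the lecture's reference code:

```cuda
// Step 1: flag each element that meets the criterion (here: non-zero).
__global__ void computeFlags(const float* in, int* flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] != 0.0f) ? 1 : 0;
}

// Step 3: scatter. 'indices' is the exclusive scan of 'flags' (step 2).
__global__ void scatter(const float* in, const int* flags,
                        const int* indices, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[indices[i]] = in[i]; // compacted position
}
```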

Summed Area Table Summed Area Table (SAT): a 2D table where each element stores the sum of all elements in an input image between the lower-left corner and the entry's location.

Summed Area Table Example: an SAT entry is the sum of all input-image entries at or below and to the left of it; in the example, the three covered input entries sum to the SAT value 9.

Summed Area Table Benefit: perform a filter of a different width at every pixel of the image in constant time per pixel — just sample four values of the SAT: the sum over a rectangle is SAT(upper-right corner) minus the SAT values just outside the upper-left and lower-right corners, plus the SAT value just outside the lower-left corner.
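The four samples in code, as a sketch: a helper sat(x, y) is assumed to return the SAT entry, and 0 for x < 0 or y < 0 (the helper and the coordinate convention are assumptions, not the slide's API):

```cuda
float sat(int x, int y);   // assumed helper: SAT lookup, 0 outside the image

// Sum of the input image over the rectangle [x0, x1] x [y0, y1]
// using four SAT lookups; divide by the area for a box-filter average.
float boxSum(int x0, int y0, int x1, int y1) {
    return sat(x1, y1) - sat(x0 - 1, y1)
         - sat(x1, y0 - 1) + sat(x0 - 1, y0 - 1);
}
```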

Summed Area Table Uses: approximate depth of field; glossy environment reflections and refractions.

Summed Area Table (walkthrough): the table is built up entry by entry — each SAT entry is the running sum of the input-image entries at or below and to the left of it.

Summed Area Table How would you implement this on the GPU?

Summed Area Table How would you compute a SAT on the GPU using inclusive scan?

Summed Area Table Step 1 of 2: one inclusive scan for each row (input image → partial SAT).

Summed Area Table Step 2 of 2: one inclusive scan for each column, bottom to top (partial SAT → final SAT).
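The two passes, shown sequentially for clarity (a sketch under the assumption of a row-major image; on the GPU each row and column would be one parallel inclusive scan):

```cuda
// In-place SAT construction over a row-major w x h image.
void buildSAT(float* img, int w, int h) {
    for (int y = 0; y < h; ++y)      // step 1: inclusive scan of each row
        for (int x = 1; x < w; ++x)
            img[y * w + x] += img[y * w + x - 1];
    for (int x = 0; x < w; ++x)      // step 2: inclusive scan of each column
        for (int y = 1; y < h; ++y)
            img[y * w + x] += img[(y - 1) * w + x];
}
```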

Radix Sort Efficient for small sort keys: k-bit keys require k passes.

Radix Sort Each radix sort pass partitions its input based on one bit. The first pass uses the least significant bit (LSB); subsequent passes move toward the most significant bit (MSB). For example, in the 3-bit key 010, the rightmost bit is the LSB and the leftmost bit is the MSB.
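Extracting the bit a pass examines is a shift and a mask (a minimal sketch; the function name is illustrative):

```cuda
// Bit examined in pass p of a k-bit radix sort: p = 0 is the LSB,
// p = k - 1 is the MSB.
__device__ int bitOfPass(unsigned int key, int p) {
    return (key >> p) & 1;
}
```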

Radix Sort Example input: an array of 3-bit keys, e.g. [100 111 010 110 011 101 001 000].

Radix Sort First pass: partition based on the LSB. LSB == 0: [100 010 110 000]; LSB == 1: [111 011 101 001] → [100 010 110 000 111 011 101 001]

Radix Sort Second pass: partition based on the middle bit. bit == 0: [100 000 101 001]; bit == 1: [010 110 111 011] → [100 000 101 001 010 110 111 011]

Radix Sort Final pass: partition based on the MSB. MSB == 0: [000 001 010 011]; MSB == 1: [100 101 110 111]

Radix Sort Completed: [000 001 010 011 100 101 110 111] — the keys are in sorted order.

Parallel Radix Sort Where is the parallelism?

Parallel Radix Sort
1. Break the input array into tiles; each tile fits into shared memory for an SM
2. Sort the tiles in parallel with radix sort
3. Merge pairs of tiles using a parallel bitonic merge until all tiles are merged
Our focus is on step 2.

Parallel Radix Sort Where is the parallelism? Each tile is sorted in parallel. Where is the parallelism within a tile?

Parallel Radix Sort Where is the parallelism? Each tile is sorted in parallel. Within a tile, each pass runs in sequence after the previous pass — no parallelism there. Can we parallelize an individual pass? How? (The merge also has parallelism.)

Parallel Radix Sort Implement split. Given: the array i of keys at pass n, and an array b that is true/false for bit n of each key, output an array with all false keys before all true keys.

Parallel Radix Sort Step 1: Compute the e array: e[i] = 1 where b[i] is false, 0 where it is true. E.g., for b = [0 1 0 1 1 0 1 0], e = [1 0 1 0 0 1 0 1].

Parallel Radix Sort Step 2: Exclusive scan of e gives the f array: f = [0 1 1 2 2 2 3 3].

Parallel Radix Sort Step 3: Compute totalFalses = e[n - 1] + f[n - 1]. Here totalFalses = 1 + 3 = 4.

Parallel Radix Sort Step 4: Compute the t array: t[i] = i - f[i] + totalFalses. With totalFalses = 4: t[0] = 0 - 0 + 4 = 4, t[1] = 1 - 1 + 4 = 4, t[2] = 2 - 1 + 4 = 5, and so on, giving t = [4 4 5 5 6 7 7 8]. Each t[i] is computed in parallel.

Parallel Radix Sort Step 5: Scatter based on address d: d[i] = b[i] ? t[i] : f[i]. Here d = [0 4 1 5 6 2 7 3].

Parallel Radix Sort Step 5 (continued): each input element i is written to output position d[i], producing the split output — all false keys first, then all true keys, with relative order preserved (the split is stable).
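Steps 4 and 5 as one CUDA kernel, with b, f, and totalFalses (steps 1-3) assumed to be computed already by a flag kernel and an exclusive scan; all names are illustrative:

```cuda
// Scatter phase of split: false keys (b[i] == 0) go to address f[i],
// true keys go to t[i] = i - f[i] + totalFalses.
__global__ void splitScatter(const unsigned int* in, const int* b,
                             const int* f, int totalFalses,
                             unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int t = i - f[i] + totalFalses;
    int d = b[i] ? t : f[i];      // destination address
    out[d] = in[i];
}
```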

Parallel Radix Sort Given k-bit keys, how do we sort using our new split function? Once each tile is sorted, how do we merge tiles to produce the final sorted array?
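One possible answer to the first question, sketched as a host loop (splitOnBit stands in for the split routine developed above; names are illustrative): call split once per bit, LSB first — the stability of each split is what makes the passes compose into a full sort.

```cuda
#include <utility>  // std::swap

// k passes of split sort k-bit keys; buffers ping-pong between passes.
void radixSortTile(unsigned int*& keys, unsigned int*& tmp, int n, int kBits) {
    for (int p = 0; p < kBits; ++p) {
        splitOnBit(keys, tmp, n, p);  // stable partition on bit p (assumed)
        std::swap(keys, tmp);         // this pass's output feeds the next
    }                                 // sorted result ends up in 'keys'
}
```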

Summary Parallel reduction, scan, and sort are building blocks for many algorithms. An understanding of parallel programming and GPU architecture yields efficient GPU implementations.