Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2012

Announcements Presentation topics due 02/07 Homework 2 due 02/13

Agenda Finish atomic functions from Monday  Parallel Algorithms:  Parallel Reduction  Scan  Stream Compaction  Summed Area Tables

Parallel Reduction Given an array of numbers, design a parallel algorithm to find the sum  Consider:  Arithmetic intensity: the compute-to-memory-access ratio

Parallel Reduction Given an array of numbers, design a parallel algorithm to find:  The sum  The maximum value  The product of values  The average value How different are these algorithms?

Parallel Reduction Reduction: An operation that computes a single result from a set of data Examples:  Minimum/maximum value  Average, sum, product, etc. Parallel Reduction: Do it in parallel. Obviously

Parallel Reduction Example: find the sum of an array (the figures stepping through the pairwise-addition passes are omitted)

Parallel Reduction Similar to the brackets of a basketball tournament  log(n) passes for n elements
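A minimal CUDA sketch of this tree-style reduction is below; it is illustrative rather than the course's reference code, and the kernel name, launch configuration, and power-of-two block size are assumptions. Each block reduces its chunk in shared memory over log2(blockDim.x) passes and writes one partial sum, which a second launch (or the host) then reduces the same way.

    // Hedged sketch: per-block sum reduction in shared memory.
    // Assumes blockDim.x is a power of two; out-of-range elements contribute 0.
    __global__ void blockSumReduce(const float* in, float* blockSums, int n)
    {
        extern __shared__ float s[];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        s[tid] = (i < n) ? in[i] : 0.0f;   // load one element per thread
        __syncthreads();

        // Halve the number of active threads each pass, like tournament brackets.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                s[tid] += s[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            blockSums[blockIdx.x] = s[0];  // one partial sum per block
    }

A launch such as blockSumReduce<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_partial, n) produces one partial sum per block; the same kernel is then applied to the partial sums until one value remains.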

All-Prefix-Sums  Input: an array of n elements [a0, a1, …, an−1], a binary associative operator ⊕, and an identity element I  Output: the array of prefix sums, whose element j is a0 ⊕ a1 ⊕ … ⊕ aj

All-Prefix-Sums Example  If ⊕ is addition, the array is transformed into its running sums  Seems sequential, but there is an efficient parallel solution

Scan Scan: the all-prefix-sums operation on an array of data  Exclusive scan (prescan): element j of the result does not include element j of the input  Inclusive scan: all elements up to and including j are summed

Scan How do you generate an exclusive scan from an inclusive scan?  Shift the result right by one element and insert the identity  How do you go in the opposite direction?
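A host-side sketch of the conversion (function names are mine): shift the inclusive result right by one and insert the identity. Going the other way needs the original input as well, since inclusive[j] = exclusive[j] + in[j].

    // Hedged sketch: derive an exclusive scan from an inclusive scan of the same input.
    void inclusiveToExclusive(const int* inclusive, int* exclusive, int n)
    {
        exclusive[0] = 0;                       // identity for addition
        for (int j = 1; j < n; ++j)
            exclusive[j] = inclusive[j - 1];    // shift right by one element
    }

    // Opposite direction: requires the original input array.
    void exclusiveToInclusive(const int* exclusive, const int* in, int* inclusive, int n)
    {
        for (int j = 0; j < n; ++j)
            inclusive[j] = exclusive[j] + in[j];
    }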

Scan Use cases  Stream compaction  Summed-area tables for variable width image processing  Radix sort  …

Scan Used to convert certain sequential computations into equivalent parallel computations

Scan Design a parallel algorithm for exclusive scan  Consider: the total number of additions

Scan Sequential scan: a single thread, trivial  n adds for an array of length n  Work complexity: O(n)  How many adds will our parallel version have?
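For reference, a single-threaded exclusive scan (a sketch, names mine) performs one addition per element, so its work is O(n):

    // Sequential exclusive scan: out[j] = in[0] + ... + in[j-1], with out[0] = identity.
    void sequentialExclusiveScan(const int* in, int* out, int n)
    {
        int sum = 0;
        for (int j = 0; j < n; ++j) {
            out[j] = sum;   // result excludes element j
            sum += in[j];
        }
    }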

Scan Naive Parallel Scan  Is this exclusive or inclusive?  Each thread writes one sum and reads two values

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k - 2^(d-1)] + x[k];
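One way to realize a single pass of this loop as a CUDA kernel is sketched below (the names and the double-buffering scheme are assumptions, not the slides' code). The host launches the kernel log2(n) times with offset = 2^(d-1) = 1, 2, 4, …, swapping the input and output buffers between passes; the double buffering avoids the read-after-write hazard that the in-place pseudocode glosses over.

    // Hedged sketch: one pass (one value of d) of the naive scan.
    __global__ void naiveScanPass(const float* in, float* out, int n, int offset)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k >= n) return;

        if (k >= offset)
            out[k] = in[k - offset] + in[k];   // x[k] = x[k - 2^(d-1)] + x[k]
        else
            out[k] = in[k];                    // elements below the offset are copied
    }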

Scan Naive Parallel Scan: worked example (figures omitted)  The example steps an eight-element array (k = 0 … 7) through the passes d = 1, 2, 3, i.e. 2^(d-1) = 1, 2, 4, applying the loop above; every update within a pass runs in parallel (k = 7 is followed as a representative thread), and after the d = 3 pass the scan is complete

Scan Naive Parallel Scan  What is naive about this algorithm? What was the work complexity for sequential scan? What is the work complexity for this?

Stream Compaction  Given an array of elements, create a new array containing only the elements that meet a certain criterion (e.g. non-null), preserving their order  Example input: a b f c d e g h  Compacted result: a c d g

Stream Compaction  Used in collision detection, sparse matrix compression, etc.  Can reduce bandwidth from GPU to CPU

Stream Compaction  Step 1: Compute a temporary array containing 1 if the corresponding element meets the criterion and 0 if it does not  (The omitted figures fill in this flag array for the example input; every element's flag is computed in parallel)
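A sketch of Step 1 as a CUDA kernel (names mine; the predicate here is "non-zero" as a stand-in for whatever criterion the application uses):

    // Hedged sketch: write 1 where the element passes the predicate, else 0.
    __global__ void computeFlags(const float* in, int* flags, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            flags[i] = (in[i] != 0.0f) ? 1 : 0;
    }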

Stream Compaction  Step 2: Run an exclusive scan on the temporary flag array  The scan runs in parallel  What can we do with the results?

Stream Compaction  Step 3: Scatter  The scan result gives each element its index in the final array  Only write an element if its entry in the temporary array is 1
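A sketch of the scatter step (names mine): indices[] is the exclusive scan of flags[], so a flagged element i lands at slot indices[i] of the compacted output; the compacted length is indices[n-1] + flags[n-1].

    // Hedged sketch: scatter surviving elements to their compacted positions.
    __global__ void scatterCompact(const float* in, const int* flags,
                                   const int* indices, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flags[i])
            out[indices[i]] = in[i];
    }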

Stream Compaction  Step 3 (continued): The omitted figures scatter the surviving elements a, c, d, g into the final array one at a time; the scatter, too, runs in parallel
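In practice the flag/scan/scatter pipeline is also available as a single library call; below is a hedged example using Thrust's copy_if (the predicate and function names are placeholders, not the course's code), which performs the same steps internally.

    #include <thrust/device_vector.h>
    #include <thrust/copy.h>

    struct IsNonZero {
        __host__ __device__ bool operator()(float x) const { return x != 0.0f; }
    };

    // Compact d_in into d_out, keeping only non-zero elements, order preserved.
    void compact(const thrust::device_vector<float>& d_in,
                 thrust::device_vector<float>& d_out)
    {
        d_out.resize(d_in.size());
        thrust::device_vector<float>::iterator end =
            thrust::copy_if(d_in.begin(), d_in.end(), d_out.begin(), IsNonZero());
        d_out.resize(end - d_out.begin());   // shrink to the compacted size
    }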

Summed Area Table Summed Area Table (SAT): 2D table where each element stores the sum of all elements in an input image between the lower left corner and the entry location.
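Written as a formula (zero-based indices with the origin at the lower-left corner; the notation is mine, not the slides'):

    \mathrm{SAT}(x, y) \;=\; \sum_{i \le x}\,\sum_{j \le y} \mathrm{in}(i, j)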

Summed Area Table Example: an input image and its SAT (figure omitted); each SAT entry is the sum of all input entries between the lower-left corner and that entry

Summed Area Table Benefit  Filters of a different width can be applied at every pixel in the image in constant time per pixel  Just sample four values from the SAT
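The constant-time filter follows from the standard four-corner identity (again in my notation, for the rectangle with lower-left corner exclusive at (x0, y0) and upper-right corner inclusive at (x1, y1)):

    \sum_{x_0 < i \le x_1}\;\sum_{y_0 < j \le y_1} \mathrm{in}(i, j)
      \;=\; \mathrm{SAT}(x_1, y_1) - \mathrm{SAT}(x_0, y_1) - \mathrm{SAT}(x_1, y_0) + \mathrm{SAT}(x_0, y_0)

Changing the filter width only changes which four entries are sampled, so the cost per pixel stays constant.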

Summed Area Table Uses  Glossy environment reflections and refractions  Approximate depth of field

Summed Area Table  Worked example (figures omitted): the SAT is filled in for a small input image entry by entry, each entry accumulating the sum of all input entries between the lower-left corner and that location

Summed Area Table How would you implement this on the GPU?

Summed Area Table How would you compute a SAT on the GPU using inclusive scan?

Summed Area Table Step 1 of 2: One inclusive scan for each row (input image → partial SAT)

Summed Area Table Step 2 of 2: One inclusive scan for each column, bottom to top (partial SAT → final SAT)
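A CPU reference sketch of this two-pass construction (names mine): an inclusive scan along every row, then an inclusive scan along every column of the partial result. On the GPU each row scan and each column scan is an independent parallel inclusive scan; this sketch anchors the SAT at whichever corner index (0, 0) denotes.

    // Hedged sketch: build a SAT with two passes of inclusive scans.
    void buildSAT(const float* in, float* sat, int width, int height)
    {
        // Pass 1: inclusive scan of every row of the input image.
        for (int y = 0; y < height; ++y) {
            float running = 0.0f;
            for (int x = 0; x < width; ++x) {
                running += in[y * width + x];
                sat[y * width + x] = running;
            }
        }
        // Pass 2: inclusive scan of every column of the partial SAT.
        for (int x = 0; x < width; ++x) {
            float running = 0.0f;
            for (int y = 0; y < height; ++y) {
                running += sat[y * width + x];
                sat[y * width + x] = running;
            }
        }
    }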

Summary Parallel reductions and scan are building blocks for many algorithms An understanding of parallel programming and GPU architecture yields efficient GPU implementations