GPGPU: Parallel Reduction and Scan Joseph Kider University of Pennsylvania CIS 565 - Fall 2011 Credit: Patrick Cozzi, Mark Harris, Suresh Venkatasubramanian.

Agenda  Parallel Reduction  All-Prefix-Sums (Scan)  Stream Compaction

Activity Given a set of numbers, design a GPU algorithm for:  Team 1: Sum of values  Team 2: Maximum value  Team 3: Product of values  Team 4: Average value Consider:  Bottlenecks  Arithmetic intensity: compute to memory access ratio  Optimizations  Limitations

Parallel Reduction Reduction: An operation that computes a single result from a set of data Examples:  Minimum/maximum value (for tone mapping)  Average, sum, product, etc. Parallel Reduction: Do it in parallel. Obviously

Parallel Reduction Store data in 2D texture Render viewport-aligned quad ¼ of texture size  Texture coordinates address every other texel  Fragment shader computes operation with four texture reads for surrounding data Use output as input to the next pass Repeat until done

Parallel Reduction [figure illustrating the successive reduction passes]

Parallel Reduction

uniform sampler2D u_Data;

in vec2 fs_Texcoords;
out float out_MaxValue;

void main(void)
{
    float v0 = texture(u_Data, fs_Texcoords).r;
    float v1 = textureOffset(u_Data, fs_Texcoords, ivec2(0, 1)).r;
    float v2 = textureOffset(u_Data, fs_Texcoords, ivec2(1, 0)).r;
    float v3 = textureOffset(u_Data, fs_Texcoords, ivec2(1, 1)).r;
    out_MaxValue = max(max(v0, v1), max(v2, v3));
}

Parallel Reduction Notes on the shader above:  Reads four values in a 2x2 area of the texture  Only the red channel is used (how efficient is this?)  Outputs the maximum of the four values

Parallel Reduction Reduces n² elements in log(n) time  1024x1024 is only 10 passes

Parallel Reduction Bottlenecks  Read back to CPU, recall glReadPixels  Each pass depends on the previous How does this affect pipeline utilization?  Low arithmetic intensity

Parallel Reduction Optimizations  Use just red channel or rgba ?  Read 2x2 areas or nxn? What is the trade off?  When do you read back? 1x1?  How many textures are needed?

Parallel Reduction Ping Ponging  Use two textures: X and Y  First pass: X is input, Y is output  Additional passes swap input and output  Implement with FBOs
    Pass:   1  2  3  4
    Input:  X  Y  X  Y
    Output: Y  X  Y  X
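The same ping-pong idea can be sketched in CUDA terms by swapping two device buffers between passes; this is only an illustrative analogue (all names are invented here), not the GL texture/FBO implementation the slide describes:

// Hypothetical kernel: each thread reduces four elements of 'in' into one element of 'out'.
__global__ void reduceMax4(const float* in, float* out, int nOut)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nOut)
    {
        float v0 = in[4 * i],     v1 = in[4 * i + 1];
        float v2 = in[4 * i + 2], v3 = in[4 * i + 3];
        out[i] = fmaxf(fmaxf(v0, v1), fmaxf(v2, v3));
    }
}

// Host-side ping-pong loop: X is input, Y is output, then they swap (n assumed a power of 4).
float* reduceMax(float* d_X, float* d_Y, int n)
{
    while (n > 1)
    {
        int nOut  = n / 4;
        int block = 256;
        int grid  = (nOut + block - 1) / block;
        reduceMax4<<<grid, block>>>(d_X, d_Y, nOut);
        float* tmp = d_X; d_X = d_Y; d_Y = tmp;   // swap input and output
        n = nOut;
    }
    return d_X;   // buffer holding the single reduced value
}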

Parallel Reduction Limitations  Maximum texture size  Requires a power of two in each dimension How do you work around these?

All-Prefix-Sums  Input: an array of n elements [a0, a1, ..., an-1], a binary associative operator ⊕, and its identity I  Output: the array [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)]

All-Prefix-Sums Example  If ⊕ is addition, the array [3 1 7 0 4 1 6 3] is transformed to [0 3 4 11 11 15 16 22]  Seems sequential, but there is an efficient parallel solution

Scan Scan: all-prefix-sums operation on an array of data  Exclusive Scan (also called prescan): element j of the result does not include element j of the input: In: [3 1 7 0 4 1 6 3] Out: [0 3 4 11 11 15 16 22]  Inclusive Scan: all elements up to and including j are summed: In: [3 1 7 0 4 1 6 3] Out: [3 4 11 11 15 16 22 25]

Scan How do you generate an exclusive scan from an inclusive scan? In: [3 1 7 0 4 1 6 3] Inclusive: [3 4 11 11 15 16 22 25] Exclusive: [0 3 4 11 11 15 16 22]  // Shift right, insert identity How do you go in the opposite direction?
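A minimal CUDA sketch of the shift-right idea (identifiers are illustrative, not from the slides):

// Convert an inclusive scan result into an exclusive one: shift every element one
// slot to the right and insert the identity (0 for addition) at index 0.
__global__ void inclusiveToExclusive(const int* inclusive, int* exclusive, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        exclusive[i] = (i == 0) ? 0 : inclusive[i - 1];
}

Going the other direction is just as easy: inclusive[j] = exclusive[j] + input[j], or equivalently shift left and append the total sum.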

Scan Use cases  Summed-area tables for variable width image processing  Stream compaction  Radix sort  …

Scan: Stream Compaction Stream Compaction  Given an array of elements, create a new array containing only the elements that meet a certain criterion (e.g. non-null), preserving their order
    Input:  a  b  f  c  d  e  g  h
    Output: a  c  d  g

Scan: Stream Compaction Stream Compaction  Used in collision detection, sparse matrix compression, etc.  Can reduce bandwidth from GPU to CPU
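For comparison, the whole compaction pipeline (Steps 1-3 below) is also available as a single library call; a hedged Thrust sketch, assuming float data and a non-zero "meets the criteria" test:

#include <thrust/device_vector.h>
#include <thrust/copy.h>

struct IsNonNull
{
    __host__ __device__ bool operator()(float x) const { return x != 0.0f; }
};

// Copies the surviving elements of d_in into d_out (which must be large enough)
// and returns how many elements were kept.
int compact(const thrust::device_vector<float>& d_in, thrust::device_vector<float>& d_out)
{
    return static_cast<int>(thrust::copy_if(d_in.begin(), d_in.end(),
                                            d_out.begin(), IsNonNull()) - d_out.begin());
}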

Scan: Stream Compaction Stream Compaction  Step 1: Compute a temporary array containing 1 if the corresponding element meets the criteria, 0 if it does not  It runs in parallel!
    Input:     a  b  f  c  d  e  g  h
    Temporary: 1  0  0  1  1  0  1  0
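Step 1 maps naturally onto one thread per element; a minimal CUDA sketch, assuming the criterion is simply "non-null" and all names are illustrative:

// One thread per element: write 1 if the element meets the criteria, 0 otherwise.
__global__ void computePredicate(const float* in, int* predicate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        predicate[i] = (in[i] != 0.0f) ? 1 : 0;   // example criterion: non-zero / non-null
}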

Scan: Stream Compaction Stream Compaction  Step 2: Run exclusive scan on the temporary array  Scan runs in parallel  What can we do with the results?
    Input:       a  b  f  c  d  e  g  h
    Temporary:   1  0  0  1  1  0  1  0
    Scan result: 0  1  1  1  2  3  3  4
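In practice the exclusive scan in Step 2 can come from a library; a sketch using Thrust (assuming the 0/1 predicate array from Step 1 already lives in device memory):

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// d_indices must already be sized like d_predicate; after the call it holds, for every
// element, the output slot that element would occupy if it survives compaction.
void scanPredicate(const thrust::device_vector<int>& d_predicate,
                   thrust::device_vector<int>& d_indices)
{
    thrust::exclusive_scan(d_predicate.begin(), d_predicate.end(), d_indices.begin());
}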

Scan: Stream Compaction Stream Compaction  Step 3: Scatter Result of scan is index into final array Only write an element if temporary array has a 1

Scan: Stream Compaction Stream Compaction  Step 3: Scatter  Scatter runs in parallel!
    Input:       a  b  f  c  d  e  g  h
    Temporary:   1  0  0  1  1  0  1  0
    Scan result: 0  1  1  1  2  3  3  4
    Final array: a  c  d  g
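Step 3 is again one thread per input element; a hedged CUDA sketch using the predicate and scan arrays from the previous steps (names are illustrative):

// Scatter: the scan result gives each surviving element its slot in the compacted array.
__global__ void scatter(const float* in, const int* predicate, const int* scanResult,
                        float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && predicate[i] == 1)
        out[scanResult[i]] = in[i];   // only elements flagged with 1 are written
}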

Scan Used to convert certain sequential computations into equivalent parallel computations

Activity Design a parallel algorithm for exclusive scan  In: [ ]  Out: [ ] Consider:  Total number of additions  Ignore GLSL constraints

Scan Sequential Scan: single thread, trivial  n adds for an array of length n  Work complexity: O(n)  How many adds will our parallel version have?
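For reference, the sequential version the slide describes is a single loop; a sketch of an exclusive scan as plain host code (valid in a CUDA source file):

// Sequential exclusive scan: O(n) work on a single thread.
void exclusiveScan(const int* in, int* out, int n)
{
    int sum = 0;              // identity for addition
    for (int i = 0; i < n; ++i)
    {
        out[i] = sum;         // element i receives the sum of everything before it
        sum += in[i];
    }
}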

Scan Naive Parallel Scan  Is this exclusive or inclusive?  Each thread writes one sum and reads two values

for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k - 2^(d-1)] + x[k]
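A sketch of this algorithm as a single-block CUDA kernel; note that a literal in-place update would race (threads read values other threads are overwriting), so this version double-buffers in shared memory. It computes an inclusive scan, assumes one element per thread with n a power of two no larger than the block size, and all names are illustrative:

// Naive inclusive scan for a single block. Launch as:
//   naiveScan<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out, n);
__global__ void naiveScan(const float* in, float* out, int n)
{
    extern __shared__ float temp[];        // holds two buffers of n floats each
    int k = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * n + k] = in[k];            // load input into shared memory
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)     // offset = 2^(d-1)
    {
        pout = 1 - pout;                   // swap the roles of the two buffers
        pin  = 1 - pout;
        if (k >= offset)
            temp[pout * n + k] = temp[pin * n + k - offset] + temp[pin * n + k];
        else
            temp[pout * n + k] = temp[pin * n + k];
        __syncthreads();
    }

    out[k] = temp[pout * n + k];           // write the scanned value
}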

Scan Naive Parallel Scan: Input  An array of n values x[0..n-1] (the walkthrough below uses n = 8)

Scan Naive Parallel Scan: d = 1, 2^(d-1) = 1  Every element k >= 1 becomes x[k-1] + x[k], so each element now holds the sum of itself and its left neighbor  Recall, it runs in parallel!

Scan Naive Parallel Scan: d = 2, 2^(d-1) = 2  Every element k >= 2 adds the value two positions to its left  Consider only k = 7: x[7] = x[5] + x[7]

Scan Naive Parallel Scan: d = 3, 2^(d-1) = 4  Every element k >= 4 adds the value four positions to its left  Consider only k = 7: x[7] = x[3] + x[7], which now holds the sum of all eight inputs

Scan Naive Parallel Scan: Final  After log2(n) passes, every x[k] holds the sum of inputs 0 through k (an inclusive scan)

Scan Naive Parallel Scan  What is naive about this algorithm? What was the work complexity for sequential scan? What is the work complexity for this?
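One way to count the additions (a quick derivation, not taken from the slides): pass d performs n - 2^(d-1) adds, so over all log2(n) passes the total is

    sum over d = 1 .. log2(n) of (n - 2^(d-1)) = n*log2(n) - (n - 1) = O(n log n)

which is asymptotically more work than the O(n) of the sequential scan; that extra work is what makes this version naive.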

Summary Parallel reductions and scan are building blocks for many algorithms An understanding of parallel programming and GPU architecture yields efficient GPU implementations