Parallel primitives – Scan operation
CDP 236370 – Written by Uri Verner
GPU Algorithm Design

Scan primitive

Definition: Scan(⊕, I, [a₀, a₁, a₂, …, aₙ₋₁]) = [I, a₀, a₀ ⊕ a₁, a₀ ⊕ a₁ ⊕ a₂, …, a₀ ⊕ … ⊕ aₙ₋₂], where ⊕ is a binary associative operator with identity element I.
Example: Scan(+, 0, [3 1 7 0 4 1 6 3]) = [0 3 4 11 11 15 16 22]
Specification: The Scan defined above is an exclusive scan. An inclusive Scan returns [a₀, a₀ ⊕ a₁, a₀ ⊕ a₁ ⊕ a₂, …, a₀ ⊕ … ⊕ aₙ₋₁].
Implementation: Coming next, for Scan(+, 0, …), also called Prefix-Sum.

Sequential implementation

void scan(int* in, int* out, int n)
{
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}

Complexity: O(n)

Parallel implementation – naïve

for d = 1 to log₂(n)
    for all k in parallel
        if k ≥ 2^(d−1)
            x[out][k] = x[in][k − 2^(d−1)] + x[in][k]
        else
            x[out][k] = x[in][k]
    swap(in, out)

x is double-buffered: reads come from x[in], writes go to x[out], then the buffers are swapped.

"Complexity": O(n·log₂(n)) additions in total.
Q: Why double-buffer?
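For concreteness, here is a minimal single-block CUDA sketch of the naïve approach above, written in the spirit of Harris's white paper; the kernel name and launch assumptions (n a power of two, one thread per element, 2·n floats of dynamic shared memory) are mine, not taken from the slides:

__global__ void scan_naive(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[];  // double buffer: 2*n floats
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Exclusive scan: shift the input right by one and insert the identity.
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid - 1] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;  // swap the double-buffer halves
        pin  = 1 - pin;
        if (thid >= offset)
            temp[pout*n + thid] = temp[pin*n + thid] + temp[pin*n + thid - offset];
        else
            temp[pout*n + thid] = temp[pin*n + thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout*n + thid];  // write the scanned value
}

A matching launch would be scan_naive<<<1, n, 2*n*sizeof(float)>>>(d_out, d_in, n); here d_out and d_in are hypothetical device pointers.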

Parallel implementation

The naïve implementation performs more total work than the sequential one. We want O(n) operations in total.
Solution (Blelloch '90): balanced trees. Build a balanced binary tree on the input data and sweep it to and from the root to compute the prefix sum. Complexity: O(n).
The algorithm consists of two phases:
- Reduce (also known as the up-sweep)
- Down-sweep

Up-Sweep

Traverse the tree from the leaves to the root, computing partial sums at the internal nodes. The root ends up holding the sum of all elements. Which collective operation that we already know returns this value?
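As an illustration, the up-sweep on the running example [3 1 7 0 4 1 6 3] (my assumed input from the earlier slide) proceeds in place, with the offset doubling at each step:

offset 1: [3 4 7  7 4 5 6  9]
offset 2: [3 4 7 11 4 5 6 14]
offset 4: [3 4 7 11 4 5 6 25]

The last element now holds the total sum, 25.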

Down-Sweep

Traverse back down the tree from the root, using the partial sums computed in the up-sweep. First, replace the last element (the total sum) with 0. Do we need that value?
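Continuing the same assumed example: after clearing the last element, the down-sweep proceeds with the offset halving at each step. At each node, the left child's value is saved, the left child receives the parent's value, and the right child receives their sum:

start:    [3 4 7 11 4  5 6  0]
offset 4: [3 4 7  0 4  5 6 11]
offset 2: [3 0 7  4 4 11 6 16]
offset 1: [0 3 4 11 11 15 16 22]

The final array is exactly the exclusive scan from the earlier example.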

CUDA GPU Execution model

SIMD execution of threads in a warp (currently 32 threads).
A group of warps executing a common task is a thread-block. The size of a thread-block is limited (e.g., to 1024 threads).
Threads in the same thread-block are:
- Executed on the same Multi-Processor (GPU core)
- Enumerated, both locally (in the thread-block) and globally
- Able to share data and synchronize
Threads in different thread-blocks cannot cooperate.
The number of thread-blocks is (practically) unlimited.
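As a small illustration of local versus global enumeration (a sketch; the kernel and variable names are mine):

__global__ void whoami(int *g_out)
{
    int local  = threadIdx.x;                      // index within the thread-block
    int global = blockIdx.x * blockDim.x + local;  // index across the whole grid
    g_out[global] = local;                         // each thread records its local id
}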

CUDA Implementation (1/2)

__global__ void prescan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[];  // points to shared memory
    int thid = threadIdx.x;
    int offset = 1;

    temp[2*thid]   = g_idata[2*thid];    // load input into shared memory
    temp[2*thid+1] = g_idata[2*thid+1];

    for (int d = n/2; d > 0; d /= 2)     // build sum in place up the tree
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    // continues on next slide

CUDA Implementation (2/2)

    if (thid == 0) { temp[n - 1] = 0; }  // clear the last element
    for (int d = 1; d < n; d *= 2)       // traverse down tree & build scan
    {
        offset /= 2;
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    g_odata[2*thid]   = temp[2*thid];    // write results to device memory
    g_odata[2*thid+1] = temp[2*thid+1];
}
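A matching launch, as a hedged sketch: assuming n is a power of two small enough for a single block, each thread handles a pair of elements, so the kernel needs n/2 threads and n floats of dynamic shared memory (d_out and d_in are hypothetical device pointers):

prescan<<<1, n/2, n * sizeof(float)>>>(d_out, d_in, n);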

Analysis

__global__ void prescan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[];  // points to shared memory
    int thid = threadIdx.x;
    int offset = 1;
    temp[2*thid] = g_idata[2*thid];  // load input into shared memory
    ...

Restriction: limited to a single thread-block.
threadIdx.x is an identifier inside the thread-block only.
Shared memory is defined in the scope of a thread-block only.
We need to find a solution that supports arrays of arbitrary size.

Scan – arbitrary array size

[Figure: an array of arbitrary values is divided into blocks; each block is scanned independently (Scan block 0 … Scan block 3); each block's total sum Σ0 … Σ3 is stored in an auxiliary array; the auxiliary array is itself scanned (recursively); and the resulting increments Σ0 … Σ3 are added to all elements of their respective blocks, yielding the final array of scanned values.]

Scan – arbitrary array size

1. Divide the large array into blocks that can each be scanned by a single thread-block.
2. Scan the blocks, and write the total sum of each block to another array of block sums.
3. Scan the block sums, generating an array of block increments that are added to all elements in their respective blocks. (A host-side sketch follows.)
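A minimal host-side sketch of these steps, assuming two hypothetical kernels: scan_blocks (a prescan variant that also writes each block's total sum) and add_increments, plus a fixed block size B and n a power-of-two multiple of B (padding handles the general case). This illustrates the structure, not the tuned implementation from Harris's paper:

#define B 1024  // elements scanned per thread-block (assumption)

// Hypothetical kernel: per-block exclusive scan that also stores block sums.
__global__ void scan_blocks(float *g_odata, const float *g_idata, float *g_sums);
// Hypothetical kernel: adds g_incr[blockIdx.x] to every element of its block.
__global__ void add_increments(float *g_data, const float *g_incr);

void large_scan(float *d_out, const float *d_in, int n)
{
    if (n <= B) {
        // Base case: a single thread-block suffices.
        prescan<<<1, n/2, n * sizeof(float)>>>(d_out, (float *)d_in, n);
        return;
    }
    int numBlocks = n / B;
    float *d_sums, *d_incr;
    cudaMalloc(&d_sums, numBlocks * sizeof(float));
    cudaMalloc(&d_incr, numBlocks * sizeof(float));

    // Steps 1-2: scan each block; each block's total sum goes to d_sums.
    scan_blocks<<<numBlocks, B/2, B * sizeof(float)>>>(d_out, d_in, d_sums);
    // Step 3a: scan the block sums (recursing if they span many blocks).
    large_scan(d_incr, d_sums, numBlocks);
    // Step 3b: add block i's increment to all elements of block i.
    add_increments<<<numBlocks, B>>>(d_out, d_incr);

    cudaFree(d_sums);
    cudaFree(d_incr);
}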

Performance

[Performance table not preserved in the transcript.]
Table source: Parallel Prefix Sum (Scan) with CUDA (2007), Mark Harris.

References

Parallel Prefix Sum (Scan) with CUDA (2007), Mark Harris.
Parallel Prefix Sum (Scan) with CUDA, GPU Gems 3, Chapter 39.