List Ranking and Parallel Prefix

List Ranking and Parallel Prefix
Sathish Vadhiyar

List Ranking on GPUs
Linked list prefix computations: computing prefix sums over the values stored in a linked list.
The linked list is represented as an array, but the memory accesses are irregular: the successor of a node can be located anywhere in the array.
List ranking is the special case of list prefix computation in which every node's value is 1, so each node's result is its distance from the head of the list.

List Ranking
L is a singly linked list. Each node contains two fields: a data field and a pointer to its successor.
Prefix sums: each node's data field is updated with the sum of the values of its predecessors and itself.
L is represented by an array X with fields X[i].prefix and X[i].succ.
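As a concrete illustration, a minimal sketch of this array representation in C (the type and field names are assumptions used by the examples that follow, not from the original slides):

/* One node of the list L, stored at index i of the array X.
   succ holds the array index of the node's successor;
   by assumption, succ = -1 marks the tail of the list. */
typedef struct {
    float prefix;  /* node value; overwritten by the prefix sum */
    int   succ;    /* index of the successor node in X */
} Node;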

Sequential Algorithm
Simple and effective: two passes over the list.
Pass 1: identify the head node.
Pass 2: starting from the head, follow the successor pointers, accumulating the prefix sums in traversal order.
Works well in practice.
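A minimal C sketch of the two passes, assuming the Node type sketched above with succ = -1 marking the tail:

#include <stdlib.h>

/* Pass 1: the head is the one index that never appears as a successor. */
int find_head(const Node *X, int n) {
    char *is_succ = calloc(n, 1);
    int i, head = -1;
    for (i = 0; i < n; i++)
        if (X[i].succ != -1) is_succ[X[i].succ] = 1;
    for (i = 0; i < n; i++)
        if (!is_succ[i]) { head = i; break; }
    free(is_succ);
    return head;
}

/* Pass 2: walk the list, accumulating prefix sums in traversal order. */
void sequential_list_ranking(Node *X, int n) {
    float sum = 0.0f;
    for (int i = find_head(X, n); i != -1; i = X[i].succ) {
        sum += X[i].prefix;
        X[i].prefix = sum;
    }
}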

Parallel Algorithm: Prefix Computations on Arrays
A divide-and-conquer strategy:
The array X is partitioned into subarrays.
Local prefix sums of each subarray are computed in parallel.
The prefix sum of the last element of each subarray is written to a separate array Y.
Prefix sums of the elements in Y are computed.
The prefix sum of Y over blocks 1..i-1 is then added to every element of block i of X.

Example
Input: 1 2 3 4 5 6 7 8 9
Divide: (1 2 3) (4 5 6) (7 8 9)
Local prefix sums: (1 3 6) (4 9 15) (7 15 24)
Last elements passed to one processor: Y = (6, 15, 24)
Prefix sums of Y computed on that processor: (6, 21, 45)
Adding the global offsets (0, 6, 21) to the local prefix sums in each block gives:
1 3 6 10 15 21 28 36 45
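A compact C sketch of the three phases; each phase is written here as a sequential loop over blocks for clarity, while in the parallel version phases 1 and 3 run with one processor per block (the function name and signature are assumptions):

#include <stdlib.h>

/* Blocked prefix sum over x[0..n-1] with nb equal blocks (nb divides n). */
void blocked_prefix(float *x, int n, int nb) {
    int bs = n / nb;                       /* block size */
    float *y = malloc(nb * sizeof *y);
    /* Phase 1: local prefix sums within each block (parallel per block). */
    for (int b = 0; b < nb; b++)
        for (int i = b*bs + 1; i < (b+1)*bs; i++)
            x[i] += x[i-1];
    /* Phase 2: prefix sums over the blocks' last elements (array Y). */
    for (int b = 0; b < nb; b++)
        y[b] = x[(b+1)*bs - 1] + (b ? y[b-1] : 0.0f);
    /* Phase 3: add the preceding blocks' total to each later block. */
    for (int b = 1; b < nb; b++)
        for (int i = b*bs; i < (b+1)*bs; i++)
            x[i] += y[b-1];
    free(y);
}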

Prefix Computation on a List
The previous strategy cannot be applied directly here.
Dividing the array X that represents the list yields subarrays, each of which can contain many disjoint sublist fragments.
A head node would have to be identified for each fragment.

Parallel List Ranking (Wyllie's Algorithm)
Involves repeated pointer jumping, with a process or thread assigned to each element of the list.
The successor pointer of each element is repeatedly updated so that it jumps over its successor, until it reaches the end of the list.
As each thread updates its element's successor, it also updates the element's rank, adding in the successor's rank before the jump.
Every pointer reaches the end of the list after O(log n) rounds, for O(n log n) total work.

Parallel List Ranking (Wyllie's Algorithm)
Leads to heavy synchronization among threads: every thread must finish one round of jumps before the next round begins.
In CUDA this means many kernel invocations, one per pointer-jumping round.
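A minimal CUDA sketch of one pointer-jumping round (all names are assumptions; succ = -1 marks the end of the list). The arrays are double-buffered so a round reads only the previous round's values, avoiding races between threads; the host swaps the in/out pointers and relaunches the kernel ceil(log2 n) times:

// One round of Wyllie's pointer jumping: every live element adds its
// successor's rank and then jumps over that successor.
__global__ void wyllie_round(const int *succ_in, int *succ_out,
                             const float *rank_in, float *rank_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int s = succ_in[i];
    if (s != -1) {
        rank_out[i] = rank_in[i] + rank_in[s];  // accumulate over the jump
        succ_out[i] = succ_in[s];               // jump over the successor
    } else {
        rank_out[i] = rank_in[i];               // already at the end
        succ_out[i] = -1;
    }
}

With rank initialized to each node's value, rank[i] converges to the sum of the values from node i to the end of the list (the classic Wyllie formulation).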

Parallel List Ranking (Helman and JaJa)
Randomly select s nodes as splitters; the head node is also a splitter.
This forms s sublists: in each sublist, start from a splitter as the head node and traverse until another splitter is reached. Compute prefix sums within each sublist.
Form another list, L', consisting of only the splitters, in the order in which they are traversed. The value of each entry of L' is the prefix sum computed over the corresponding sublist.
Compute prefix sums over L'.
Add these sums to the values in the corresponding sublists.

Parallel List Ranking on GPUs: Steps
Step 1: Compute the location of the head of the list.
Each index between 0 and n-1, except the head node, occurs exactly once among the successor values. Hence head = n(n-1)/2 - SUM_SUCC, where SUM_SUCC is the sum of the successor values (the tail's terminator must contribute zero to this sum for the identity to hold).
The sum can be computed on the GPU using a parallel reduction.
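One simple way to compute SUM_SUCC on the GPU; this sketch uses an atomic accumulator rather than a full tree reduction, and all names are assumptions:

// Accumulate the sum of all successor indices into *sum_succ,
// which the host initializes to 0 before the launch.
__global__ void sum_successors(const int *succ, int n,
                               unsigned long long *sum_succ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && succ[i] > 0)  // skip the -1 terminator; index 0 adds nothing
        atomicAdd(sum_succ, (unsigned long long)succ[i]);
}

After copying the accumulator back, the host computes head = n*(n-1)/2 - SUM_SUCC.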

Parallel List Ranking on GPUs: Steps
Step 2: Select s random nodes (splitters) to split the list into s random sublists.
For every subarray of X of size n/s, select a random location within it as the splitter.
Highly data parallel: the selections are independent of one another.
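A sketch of the splitter selection, with one thread per chunk (the random offsets are assumed to be generated on the host beforehand, which avoids device-side RNG state; forcing the head node to be a splitter is handled separately):

// Pick one random node in each chunk of size n/s as a splitter.
// rand_off[b] is a host-generated random offset in [0, n/s).
__global__ void select_splitters(const int *rand_off, int chunk, int s,
                                 int *splitter_idx) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b < s)
        splitter_idx[b] = b * chunk + rand_off[b];
}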

Parallel List Ranking on GPUs: Steps
Step 3: Using the standard sequential algorithm, compute the prefix sums of each sublist separately.
This is the most computationally demanding step.
The s sublists are allocated equally among the CUDA blocks, and then equally among the threads within a block.
Each thread computes the prefix sums of each of its sublists, and copies the prefix value of the last element of sublist i to Sublist[i].
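A sketch of the per-sublist traversal with one thread per sublist (hypothetical names; is_splitter marks splitter nodes, and sublists are disjoint, so there are no write conflicts):

// Each thread ranks one sublist: it starts at its splitter and follows
// successor pointers until it meets another splitter or the tail.
__global__ void rank_sublists(const int *succ, float *prefix,
                              const int *splitter_idx, const char *is_splitter,
                              float *sublist_sum, int s) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= s) return;
    int i = splitter_idx[t];
    float sum = prefix[i];  // the splitter keeps its own value
    for (int j = succ[i]; j != -1 && !is_splitter[j]; j = succ[j]) {
        sum += prefix[j];
        prefix[j] = sum;    // local (within-sublist) prefix sum
    }
    sublist_sum[t] = sum;   // total of sublist t, consumed in step 4
}

A full implementation would also record which splitter terminated each walk, so that step 4 can order the splitter list L', and which sublist each node belongs to, for the update in step 5.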

Parallel List Ranking on GPUs: Steps
Step 4: Compute the prefix sums of the splitters, where the successor of a splitter is the next splitter encountered when traversing the list.
This list is small, so the computation can be done on the CPU.
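A host-side sketch of step 4, assuming next_splitter[t] (recorded in step 3) gives the sublist whose splitter terminated sublist t's walk, with -1 for the last sublist:

/* Sequential prefix sums over the splitter list L' in traversal order.
   splitter_prefix[t] = total of all sublists that precede sublist t. */
void prefix_over_splitters(const int *next_splitter, const float *sublist_sum,
                           float *splitter_prefix, int head_sublist) {
    float sum = 0.0f;
    for (int t = head_sublist; t != -1; t = next_splitter[t]) {
        splitter_prefix[t] = sum;   /* exclusive offset for sublist t */
        sum += sublist_sum[t];
    }
}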

Parallel List Ranking on GPUs: Steps
Step 5: Update the prefix sums computed in step 3 using the splitter prefix sums from step 4.
This can be done with coalesced memory accesses: consecutive threads access contiguous locations of the array.
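A sketch of the final update, assuming an auxiliary array sublist_of[i] (filled in step 3) that records which sublist each node belongs to; thread i touches element i, so the accesses to prefix and sublist_of are coalesced:

// Add the total of all preceding sublists to every node's local prefix sum.
__global__ void add_splitter_prefix(float *prefix, const int *sublist_of,
                                    const float *splitter_prefix, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        prefix[i] += splitter_prefix[sublist_of[i]];
}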

Choosing s
Large values of s increase the chance that threads deal with roughly equal numbers of nodes, improving load balance.
However, values that are too large incur excessive overhead for sublist creation and aggregation.

Parallel Prefix on GPUs
Uses a conceptual binary tree over the array, in two phases:
An upward reduction phase (reduce or up-sweep phase): traverse the tree from the leaves to the root, forming partial sums at the internal nodes.
A down-sweep phase: traverse from the root to the leaves, using the partial sums computed in the reduction phase to produce the prefix sums.

Up Sweep
(Figure omitted in the transcript: pairs of partial sums are combined in place, level by level, from the leaves up to the root.)

Down Sweep
(Figure omitted in the transcript: the root is cleared to zero; at each level, each node passes its own value to its left child and the sum of its value and the left child's former value to its right child, producing an exclusive scan.)

Host Code

int main()
{
    const unsigned int num_elements = 512;               // example size (assumed): a power of two, at most twice the block size
    const unsigned int num_threads  = num_elements / 2;  // each thread handles two elements
    const unsigned int mem_size     = sizeof(float) * num_elements;

    /* allocate and initialize h_data; cudaMalloc d_idata and d_odata */

    cudaMemcpy(d_idata, h_data, mem_size, cudaMemcpyHostToDevice);

    dim3 grid(1, 1, 1);               // the kernel indexes only threadIdx.x, so a single block
    dim3 threads(num_threads, 1, 1);
    // third launch parameter: dynamically allocated shared memory for temp[]
    scan_workefficient<<<grid, threads, mem_size>>>(d_odata, d_idata, num_elements);

    cudaMemcpy(h_data, d_odata, mem_size, cudaMemcpyDeviceToHost);

    /* cudaFree d_idata and d_odata */
}

Device Code

__global__ void scan_workefficient(float *g_odata, float *g_idata, int n)
{
    // Dynamically allocated shared memory for scan kernels
    extern __shared__ float temp[];

    int thid = threadIdx.x;
    int offset = 1;

    // Cache the computational window in shared memory
    temp[2*thid]   = g_idata[2*thid];
    temp[2*thid+1] = g_idata[2*thid+1];

    // build the sum in place up the tree
    for (int d = n>>1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

Device Code (continued)

    // scan back down the tree

    // clear the last element
    if (thid == 0)
        temp[n - 1] = 0;

    // traverse down the tree building the scan in place
    for (int d = 1; d < n; d *= 2)
    {
        offset >>= 1;
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    // write results to global memory
    g_odata[2*thid]   = temp[2*thid];
    g_odata[2*thid+1] = temp[2*thid+1];
}

References
M. S. Rehman, K. Kothapalli, and P. J. Narayanan. Fast and Scalable List Ranking on the GPU. ICS 2009.
Z. Wei and J. JaJa. Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA. IPDPS 2010.