CS 193G Lecture 7: Parallel Patterns II

Overview
- Segmented Scan
- Sort
- MapReduce
- Kernel Fusion

SEGMENTED SCAN

Segmented Scan
What it is:
- Scan + barriers/flags associated with certain positions in the input arrays
- Operations don't propagate beyond barriers
- Lets you do many scans at once, no matter their size (see the reference sketch below)
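To pin down the semantics, here is a hedged sequential reference for an inclusive segmented scan with sum (the name segscan_ref is ours): a set flag starts a new segment, so accumulation never crosses it.

void segscan_ref(const int * data, const int * flags, int * out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = (i == 0 || flags[i]) ? data[i]               // flag: restart segment
                                      : out[i - 1] + data[i]; // otherwise accumulate
}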

Image taken from “Efficient parallel scan algorithms for GPUs” by S. Sengupta, M. Harris, and M. Garland

Segmented Scan

__global__ void segscan(int * data, int * flags)
{
    __shared__ int s_data[BL_SIZE];
    __shared__ int s_flags[BL_SIZE];

    // for brevity, this sketch assumes a single block,
    // so idx doubles as the index into shared memory
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    // copy block of data into shared memory
    s_data[idx] = …;
    s_flags[idx] = …;
    __syncthreads();

Segmented Scan

    …
    // choose whether to propagate (shown for stride 1)
    s_data[idx] = s_flags[idx] ? s_data[idx]
                               : s_data[idx - 1] + s_data[idx];

    // create merged flag
    s_flags[idx] = s_flags[idx - 1] | s_flags[idx];

    // repeat for doubled strides (full sketch below)
}
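Putting the fragments together, a minimal single-block sketch of the whole kernel in Hillis-Steele style might look like this (the names are ours; assumes gridDim.x == 1 and blockDim.x == BL_SIZE):

#define BL_SIZE 256   // any power of two, == blockDim.x

__global__ void segscan_block(int * data, int * flags)
{
    __shared__ int s_data[BL_SIZE];
    __shared__ int s_flags[BL_SIZE];

    int tid = threadIdx.x;
    s_data[tid]  = data[tid];
    s_flags[tid] = flags[tid];
    __syncthreads();

    for (int stride = 1; stride < BL_SIZE; stride *= 2) {
        // read old values before anyone overwrites them
        int d = s_data[tid];
        int f = s_flags[tid];
        if (tid >= stride && !f) {
            d += s_data[tid - stride];   // propagate only within a segment
            f |= s_flags[tid - stride];  // merge flags so segment heads block later strides
        }
        __syncthreads();
        s_data[tid]  = d;
        s_flags[tid] = f;
        __syncthreads();
    }

    data[tid] = s_data[tid];
}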

Segmented Scan
The most common use is doing many reductions of unpredictable size at the same time: think of computing sums, maxes, counts, or any-style predicates over arbitrary sub-domains of your data.

Segmented Scan
Common usage scenario:
- Determine which region/tree/group/object class an element belongs to and assign that as its new ID
- Sort based on that ID
- Operate on all of the regions/trees/groups/objects in parallel, no matter their size or number

Segmented Scan
Also useful for implementing divide-and-conquer algorithms such as quicksort.

SORT

Sort
- Useful for almost everything
- Optimized versions for the GPU already exist (see the Thrust sketch below)
- Sorted lists can be processed by segmented scan
- Sort data to restore memory and execution coherence
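For instance, Thrust, which ships with the CUDA toolkit, provides optimized GPU sorts; a minimal usage sketch:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    int h_keys[4] = {3, 1, 4, 1};
    thrust::device_vector<int> keys(h_keys, h_keys + 4);

    // sorts on the GPU; dispatches to a radix sort for primitive key types
    thrust::sort(keys.begin(), keys.end());
    // keys is now {1, 1, 3, 4}
    return 0;
}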

Sort
- Binning and sorting can often be used interchangeably
- Sort is standard, but can be suboptimal
- Binning is usually custom: it has to be optimized by hand, but can be faster (sketch below)
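To illustrate the contrast, a hand-rolled binning kernel skips the full sort and scatters each element into its bin with an atomic counter (a hedged sketch; all names are ours). Elements within a bin land in no particular order, which is exactly what makes binning cheaper than a sort:

__global__ void bin_elements(const int * bin_id, int * bin_count,
                             int * bin_items, int max_per_bin, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        int b = bin_id[i];
        // claim the next free slot in bin b
        int slot = atomicAdd(&bin_count[b], 1);
        if (slot < max_per_bin)
            bin_items[b * max_per_bin + slot] = i;
    }
}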

Sort
- Radix sort is faster than comparison-based sorts
- If you can generate a fixed-size key for the attribute you want to sort on, you get better performance (see the sketch below)
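A classic instance of such a fixed-size key: reinterpret an IEEE-754 float so that unsigned integer order matches float order, then radix-sort the integer keys (a hedged sketch; float_to_key is our name):

__device__ unsigned int float_to_key(float f)
{
    unsigned int u = __float_as_uint(f);
    // negative floats: flip all bits to reverse their order;
    // non-negative floats: set the sign bit to place them above the negatives
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}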

MAPREDUCE

MapReduce
- Old concept from functional programming
- Repopularized by Google as a parallel computing pattern
- Combination of sort and reduction (scan)

MapReduce
Image taken from Jeff Dean's MapReduce presentation (slides/index-auto-0007.html)

MapReduce: Map Phase
- Map a function over a domain (a kernel sketch follows below)
- The function is provided by the user
- The function can be anything that produces a (key, value) pair
- The value can just be a pointer to an arbitrary data structure
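The map phase itself is trivially parallel: one thread per input element, each emitting one pair. A hedged sketch (the key function here is made up):

__global__ void map_phase(const int * input, int * keys, int * vals, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        keys[i] = input[i] % 16;   // user-provided key function (assumed)
        vals[i] = input[i];        // or a pointer to a larger structure
    }
}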

MapReduce: Sort Phase
- All the (key, value) pairs are sorted based on their keys
- Happens implicitly
- Creates runs of (k, v) pairs with the same key
- The user usually has no control over the sort function

MapReduce: Reduce Phase
- The reduce function is provided by the user
- Can be a simple plus, max, …
- The library makes sure that values from one key don't propagate to another (segscan)
- The final result is a list of keys and final values (or arbitrary data structures); see the Thrust sketch below
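On the GPU, the whole sort-then-reduce pipeline can be sketched with Thrust, whose reduce_by_key performs exactly the per-key reduction described above (the keys and values here are made up):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main()
{
    // map phase output: four (key, value) pairs with three distinct keys
    int h_keys[4] = {2, 0, 2, 1};
    int h_vals[4] = {10, 20, 30, 40};
    thrust::device_vector<int> keys(h_keys, h_keys + 4);
    thrust::device_vector<int> vals(h_vals, h_vals + 4);

    // sort phase: group equal keys into runs
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // reduce phase: one sum per run of equal keys
    thrust::device_vector<int> out_keys(4), out_vals(4);
    thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                          out_keys.begin(), out_vals.begin());
    // out_keys = {0, 1, 2}, out_vals = {20, 40, 40}
    return 0;
}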

KERNEL FUSION

Kernel Fusion
- Combine kernels with simple producer->consumer dataflow
- Combine a generic data-movement kernel with a specific operator function
- Save memory bandwidth by not writing intermediate results out to global memory

Separate Kernels

__global__ void is_even(int * in, int * out)
{
    int i = …
    out[i] = ((in[i] % 2) == 0) ? 1 : 0;
}

__global__ void scan(…)
{
    …
}

Fused Kernel

__global__ void fused_even_scan(int * in, int * out, …)
{
    int i = …
    int flag = ((in[i] % 2) == 0) ? 1 : 0;
    // your scan code here, using the flag directly
}

Kernel Fusion
Best when the pattern looks like

    output[i] = g(f(input[i]));

Any simple one-to-one mapping will work (see the sketch below).
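A minimal sketch of that one-to-one case, with f and g standing in for arbitrary device functions (both made up here); fusing keeps f's result in a register instead of a round trip through global memory:

__device__ float f(float x) { return x * x; }
__device__ float g(float x) { return x + 1.0f; }

__global__ void fused_map(const float * in, float * out, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        out[i] = g(f(in[i]));   // intermediate value never touches global memory
}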

Fused Kernel

template <typename F>
__global__ void opt_stencil(float * in, float * out, F f)
{
    // your 2D stencil code here
    for (i, j) {   // loop over the stencil window (pseudocode)
        partial = f(partial, in[…], i, j);
    }
    float result = partial;
}

Fused Kernel

class boxfilter
{
private:
    float table[3][3];

public:
    boxfilter(float input[3][3]);

    float operator()(float a, float b, int i, int j)
    {
        return a + b * table[i][j];
    }
};

Fused Kernel

class maxfilter
{
public:
    float operator()(float a, float b, int i, int j)
    {
        return max(a, b);
    }
};
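Assuming the opt_stencil template from two slides back is completed, either functor can then be plugged in at launch time, for example:

// hypothetical launch; grid, block, and the buffers come from surrounding code
opt_stencil<<<grid, block>>>(d_in, d_out, maxfilter());
opt_stencil<<<grid, block>>>(d_in, d_out, boxfilter(weights));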

Questions?

Backup Slides

Example Segmented Scan

int data[10]  = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
int flags[10] = {0, 0, 0, 1, 0, 1, 1, 0, 0, 0};

int step1[10] = {1, 2, 1, 1, 1, 1, 1, 2, 1, 2};
int flg1[10]  = {0, 0, 0, 1, 0, 1, 1, 1, 0, 0};

int step2[10] = {1, 2, 1, 1, 1, 1, 1, 2, 1, 2};
int flg2[10]  = {0, 0, 0, 1, 0, 1, 1, 1, 0, 0};
…

Example Segmented Scan

int step2[10] = {1, 2, 1, 1, 1, 1, 1, 2, 1, 2};
int flg2[10]  = {0, 0, 0, 1, 0, 1, 1, 1, 0, 0};
…
int result[10] = {1, 2, 3, 1, 2, 1, 1, 2, 3, 4};