© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois
CS/EE 217 GPU Architecture and Parallel Programming
Lecture 15: Atomic Operations and Histogramming – Part 2

Objective
To learn practical histogram programming techniques:
– Basic histogram algorithm using atomic operations
– Privatization

Review: A Histogram Example
In the phrase “Programming Massively Parallel Processors”, build a histogram of the frequency of each letter: A(4), C(1), E(1), G(1), …
How do you do this in parallel?
– Have each thread take a section of the input
– For each input letter, use atomic operations to build the histogram

Iteration #1 – 1st letter in each section
[Slide figure: the input letters divided into four sections, Thread 0 through Thread 3, with each thread's current letter highlighted]
Histogram so far: E:1, M:2, P:1

Iteration #2 – 2nd letter in each section
[Slide figure: same layout, one more letter processed per thread]
Histogram so far: A:1, E:1, L:1, M:3, P:1, R:1

Iteration #3
[Slide figure: same layout, one more letter processed per thread]
Histogram so far: A:1, E:1, I:1, L:1, M:3, O:1, P:1, R:1, S:1

Iteration #4
[Slide figure: same layout, one more letter processed per thread]
Histogram so far: A:1, E:1, G:1, I:1, L:1, M:3, N:1, O:1, P:1, R:1, S:2

Iteration #5
[Slide figure: same layout, one more letter processed per thread]
Histogram so far: A:1, E:1, G:1, I:1, L:1, M:3, N:1, O:1, P:2, R:2, S:2

What is wrong with the algorithm?
Reads from the input array are not coalesced.
– Fix: assign inputs to each thread in a strided pattern
– Adjacent threads then process adjacent input letters
[Slide figure: iteration 1 of the strided pattern – Threads 0–3 process the first four letters of the input]
Histogram so far: G:1, O:1, P:1, R:1

Iteration 2
All threads move to the next section of input.
[Slide figure: Threads 0–3 process the next four letters]
Histogram so far: A:1, G:1, M:2, O:1, P:1, R:2

A Histogram Kernel
The kernel receives a pointer to the input buffer. Each thread processes the input in a strided pattern.

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads in the grid
    int stride = blockDim.x * gridDim.x;

More on the Histogram Kernel

    // Together, all threads cover blockDim.x * gridDim.x
    // consecutive elements per pass of the loop
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}
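
For completeness, host-side code to drive this kernel might look like the following. This launch sketch is not from the slides; the grid size, bin count, and wrapper name are illustrative assumptions.

// Hypothetical host wrapper for histo_kernel (not from the slides).
// Assumes a 256-bin histogram, one bin per unsigned char value.
void run_histogram(unsigned char *h_buffer, long size, unsigned int *h_histo)
{
    unsigned char *d_buffer;
    unsigned int *d_histo;
    cudaMalloc(&d_buffer, size);
    cudaMalloc(&d_histo, 256 * sizeof(unsigned int));
    cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 256 * sizeof(unsigned int));   // bins start at zero

    // Launch fewer threads than input elements; the stride loop covers the rest.
    histo_kernel<<<64, 256>>>(d_buffer, size, d_histo);   // sizes are illustrative

    cudaMemcpy(h_histo, d_histo, 256 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_buffer);
    cudaFree(d_histo);
}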

Atomic Operations on DRAM
An atomic operation starts with a read, with a latency of a few hundred cycles.
The atomic operation ends with a write, with a latency of a few hundred cycles.
During this whole time, no one else can access the location.

Atomic Operations on DRAM
Each load-modify-store pays two full memory access delays.
– All atomic operations on the same variable (DRAM location) are serialized
[Timeline figure: atomic operation N, then atomic operation N+1, each incurring internal routing, DRAM delay, and transfer delay]

Latency determines throughput of atomic operations
The throughput of an atomic operation is the rate at which the application can execute atomic operations on a particular location.
The rate is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations.
This means that if many threads attempt atomic operations on the same location (contention), the effective memory bandwidth drops to less than 1/1000 of peak!
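
To see the contention effect directly, one can compare atomics that all hit one location against atomics spread over many locations. The following micro-benchmark kernels are an illustration, not from the slides:

// Hypothetical micro-benchmark kernels (not from the slides).
__global__ void contended(unsigned int *counter, int iters)
{
    for (int k = 0; k < iters; ++k)
        atomicAdd(counter, 1);              // every thread serializes on one location
}

__global__ void spread(unsigned int *bins, int nbins, int iters)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int k = 0; k < iters; ++k)
        atomicAdd(&bins[tid % nbins], 1);   // contention divided across nbins locations
}

Back of the envelope: at roughly 1000 cycles per read-modify-write and a clock on the order of 1 GHz, a single location can absorb only on the order of a million atomic updates per second, no matter how many threads are launched.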

You may have had a similar experience at a supermarket checkout.
Some customers realize they missed an item after they have started to check out.
They run to the aisle and get the item while the line waits.
– The rate of checkout is reduced due to the long latency of running to the aisle and back.
Now imagine a store where every customer starts checking out before fetching any of their items.
– The rate of checkout becomes 1 / (entire shopping time of each customer).

Hardware Improvements
Atomic operations on the Fermi L2 cache
– Medium latency, but still serialized
– Global to all blocks
– A “free” improvement over global memory atomics
[Timeline figure: atomic operations N and N+1, each paying internal routing and data transfer, but no DRAM delay]

Hardware Improvements (cont.)
Atomic operations on shared memory
– Very short latency, but still serialized
– Private to each thread block
– Needs algorithm work by the programmer (more later)
[Timeline figure: atomic operations N and N+1 with only internal routing and data transfer delays]

Atomics in Shared Memory Require Privatization
Create a private copy of the histo[] array for each thread block:

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[256];

    if (threadIdx.x < 256)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i]]), 1);
        i += stride;
    }

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 256)
        atomicAdd(&(histo[threadIdx.x]),
                  histo_private[threadIdx.x]);
}
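
A usage note (an assumption, not stated on the slide): because the zeroing and merge steps are guarded by threadIdx.x < 256, the kernel is simplest to launch with exactly 256 threads per block, so each thread owns one bin:

// Illustrative launch: 256 threads per block, one per private bin.
histo_kernel<<<blocks, 256>>>(d_buffer, size, d_histo);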

More on Privatization
Privatization is a powerful and frequently used technique for parallelizing applications.
The operation needs to be associative and commutative
– The histogram add operation is associative and commutative
The histogram size needs to be small
– It must fit into shared memory
What if the histogram is too large to privatize?
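
The slide leaves the question open. One common approach, sketched here as an assumption rather than taken from this deck, is to keep each block's private copy in global memory and merge the copies in a second step:

// Hypothetical sketch: per-block private histograms in GLOBAL memory,
// for bin counts too large to fit in shared memory. histo_all holds
// gridDim.x copies of nbins bins, one copy per block.
__global__ void histo_large(unsigned int *data, long size,
                            unsigned int *histo_all, int nbins)
{
    unsigned int *my_histo = histo_all + blockIdx.x * nbins;  // this block's copy
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&my_histo[data[i]], 1);   // contention only among this block's threads
        i += stride;
    }
}
// A second kernel (or the host) then sums the gridDim.x copies into the final histogram.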

Other Atomic Operations
atomicCAS(int *p, int cmp, int val)
– CAS = compare and swap; atomically performs the following:

    int old = *p;
    if (cmp == old)
        *p = val;
    return old;

atomicExch – the unconditional version of CAS:

    int old = *p;
    *p = val;
    return old;

What are these used for?
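
One classic use of atomicCAS is to synthesize an atomic operation the hardware does not provide. A minimal sketch, assuming we want an atomic maximum on float (this helper is illustrative, not from the slides):

// Hypothetical atomic max for float, built from atomicCAS on the raw bits.
__device__ float atomicMaxFloat(float *addr, float value)
{
    int *addr_as_int = (int *)addr;
    int old = *addr_as_int;
    while (__int_as_float(old) < value) {
        int assumed = old;
        // Publish 'value' only if nobody changed *addr since we read it.
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
        if (old == assumed)
            break;   // our swap went through
        // otherwise 'old' holds the newer value; re-check the loop condition
    }
    return __int_as_float(old);
}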

Locking causes control divergence in GPUs
[Slide figure: threads of one warp spinning on a lock while one thread holds it]
Divergence can deadlock if the thread holding the lock idles while its warp-mates spin.
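
A sketch of the hazardous pattern (shown only to illustrate the pitfall; on pre-Volta GPUs, where a warp executes in lockstep, this can hang):

// DANGEROUS: naive spinlock in a kernel.
__device__ void locked_increment(int *lock, int *value)
{
    // If one thread in a warp wins the lock, its warp-mates spin here.
    // Under lockstep execution the winner may never be scheduled again
    // to release the lock -- divergence deadlock.
    while (atomicCAS(lock, 0, 1) != 0) { /* spin */ }
    (*value)++;                 // critical section
    atomicExch(lock, 0);        // release the lock
}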

Alternatives to locking?
Lock-free algorithms/data structures
– Update a private copy
– Try to atomically publish it to the global data structure using compare-and-swap or similar
– Retry if the update failed
– Needs data structures that support this kind of operation (see the sketch below)
Wait-free algorithms/data structures
– Similar to histogramming – threads never wait, they just issue atomic updates
– But this applies only to some algorithms
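
As a concrete instance of the lock-free retry pattern, and a preview of the linked list example below, a lock-free push onto the head of a list might be sketched as follows (the Node type and the 64-bit pointer CAS cast are illustrative assumptions, not the deck's code):

struct Node {
    int data;
    Node *next;
};

// Lock-free push: prepare the node privately, then try to publish it by
// swinging the head pointer with CAS; retry if another thread won the race.
__device__ void push(Node **head, Node *n)
{
    Node *old_head = *head;
    do {
        n->next = old_head;     // private update
        old_head = (Node *)atomicCAS((unsigned long long *)head,
                                     (unsigned long long)n->next,
                                     (unsigned long long)n);
    } while (old_head != n->next);  // CAS failed: head changed; retry
}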

Lock free vs. locking
[Slide figure: performance comparison of lock-free and locking implementations; example from an NVIDIA presentation at GTC 2013]

Parallel Linked List Example
[The slides that follow in the original deck are figure-only]

Any more questions?