GPU Programming Lecture 7 Atomic Operations and Histogramming


GPU Programming Lecture 7 Atomic Operations and Histogramming © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Objective
- To understand atomic operations
  - Read-modify-write in parallel computation
  - Use of atomic operations in CUDA
  - Why atomic operations reduce memory system throughput
  - How to avoid atomic operations in some parallel algorithms
- Histogramming as an example application of atomic operations
  - Basic histogram algorithm
  - Privatization

A Common Collaboration Pattern
- Multiple bank tellers count the total amount of cash in the safe
- Each grabs a pile and counts
- Have a central display of the running total
- Whenever someone finishes counting a pile, add the subtotal of the pile to the running total

A Common Parallel Coordination Pattern
- Multiple customer service agents serving customers
- Each customer gets a number
- A central display shows the number of the next customer to be served
- When an agent becomes available, he/she calls the next number and adds 1 to the display

A Common Arbitration Pattern
- Multiple customers booking air tickets
- Each customer:
  - Brings up a flight seat map
  - Decides on a seat
  - Updates the seat map, marking the seat as taken
- A bad outcome: multiple passengers end up booking the same seat

Atomic Operations
thread1:                      thread2:
  Old ← Mem[x]                  Old ← Mem[x]
  New ← Old + 1                 New ← Old + 1
  Mem[x] ← New                  Mem[x] ← New
- If Mem[x] was initially 0, what is the value of Mem[x] after threads 1 and 2 have completed?
- What does each thread get in its Old variable?
- The answer may vary due to data races. To avoid data races, use atomic operations.
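
A minimal CUDA sketch of the same race (the kernel name and launch configuration are illustrative, not from the slides): each thread performs the non-atomic read-modify-write above on the same location, so an increment can be lost.

    // Non-atomic read-modify-write: a sketch of the race described above.
    // Launched as racy_increment<<<1, 2>>>(d_x) with *d_x initialized to 0,
    // the final value of *d_x may be 1 or 2 depending on how the threads interleave.
    __global__ void racy_increment(int *x)
    {
        int old     = *x;       // Old <- Mem[x]
        int new_val = old + 1;  // New <- Old + 1
        *x = new_val;           // Mem[x] <- New
    }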

Timing Scenario #1

Time   Thread 1                Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3      (1) Mem[x] ← New
4                              (1) Old ← Mem[x]
5                              (2) New ← Old + 1
6                              (2) Mem[x] ← New

Thread 1 Old = 0, Thread 2 Old = 1, Mem[x] = 2 after the sequence

Timing Scenario #2

Time   Thread 1                Thread 2
1                              (0) Old ← Mem[x]
2                              (1) New ← Old + 1
3                              (1) Mem[x] ← New
4      (1) Old ← Mem[x]
5      (2) New ← Old + 1
6      (2) Mem[x] ← New

Thread 1 Old = 1, Thread 2 Old = 0, Mem[x] = 2 after the sequence

Timing Scenario #3

Time   Thread 1                Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3                              (0) Old ← Mem[x]
4      (1) Mem[x] ← New
5                              (1) New ← Old + 1
6                              (1) Mem[x] ← New

Thread 1 Old = 0, Thread 2 Old = 0, Mem[x] = 1 after the sequence

Timing Scenario #4

Time   Thread 1                Thread 2
1                              (0) Old ← Mem[x]
2                              (1) New ← Old + 1
3      (0) Old ← Mem[x]
4                              (1) Mem[x] ← New
5      (1) New ← Old + 1
6      (1) Mem[x] ← New

Thread 1 Old = 0, Thread 2 Old = 0, Mem[x] = 1 after the sequence

Atomic Operations – To Ensure Good Outcomes
An atomic operation forces the two read-modify-write sequences to execute one after the other, in either order:

thread1: Old ← Mem[x]; New ← Old + 1; Mem[x] ← New
then
thread2: Old ← Mem[x]; New ← Old + 1; Mem[x] ← New

Or

thread2: Old ← Mem[x]; New ← Old + 1; Mem[x] ← New
then
thread1: Old ← Mem[x]; New ← Old + 1; Mem[x] ← New

Without Atomic Operations
Mem[x] initialized to 0

thread1: Old ← Mem[x]; New ← Old + 1; Mem[x] ← New
thread2: Old ← Mem[x]; New ← Old + 1; Mem[x] ← New

If the two sequences interleave, both threads can receive 0 in Old and Mem[x] becomes 1, so one increment is lost.

Atomic Operations in General
- Performed by a single ISA instruction on a memory location (address)
  - Read the old value, calculate a new value, and write the new value to the location
- The hardware ensures that no other thread can access the location until the atomic operation is complete
  - Any other thread that accesses the location is typically held in a queue until its turn
  - All threads perform their atomic operations on the location serially

Atomic Operations in CUDA
- Function calls that are translated into single instructions (a.k.a. intrinsics)
  - Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare-and-swap)
  - Read the CUDA C Programming Guide 5.0 for details
- Atomic add
  int atomicAdd(int* address, int val);
  reads the 32-bit word old pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
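
A usage sketch (the kernel name, parameters, and the compaction idea are illustrative, not from the slides): each thread that finds a matching element bumps a global counter with atomicAdd; because atomicAdd returns the old value, each matching thread also gets a unique output slot. The counter *count is assumed to start at 0 and positions[] is assumed large enough to hold all matches.

    // Sketch: count elements equal to target and record their positions.
    // atomicAdd returns the value *count held before this thread's add,
    // so each matching thread receives a unique slot index.
    __global__ void count_matches(const int *data, int n, int target,
                                  int *count, int *positions)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n && data[i] == target) {
            int slot = atomicAdd(count, 1);  // old value of *count
            positions[slot] = i;             // compact the indices of the matches
        }
    }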

More Atomic Adds in CUDA
- Unsigned 32-bit integer atomic add
  unsigned int atomicAdd(unsigned int* address, unsigned int val);
- Unsigned 64-bit integer atomic add
  unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
- Single-precision floating-point atomic add (compute capability 2.0 and above)
  float atomicAdd(float* address, float val);
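
A hedged sketch of the single-precision overload (kernel name is illustrative; assumes a device of compute capability 2.0 or later and a *total accumulator initialized to 0): every thread folds its element into one float in global memory. All the adds serialize on *total, so this is not a fast reduction, it just illustrates the intrinsic.

    // Sketch: sum an array into a single float with the float atomicAdd overload.
    __global__ void atomic_sum(const float *in, int n, float *total)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            atomicAdd(total, in[i]);  // float overload, compute capability >= 2.0
    }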

Histogramming
- A method for extracting notable features and patterns from large data sets
  - Feature extraction for object recognition in images
  - Fraud detection in credit card transactions
  - Correlating celestial object movements in astrophysics
  - …
- Basic histograms: for each element in the data set, use its value to identify a "bin" to increment
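
For reference, a minimal sequential version of the basic histogram (plain host code; the function name is illustrative): the value of each input byte directly selects the bin to increment. It assumes the 256 bins start at zero.

    // Sequential reference histogram: one bin per possible byte value.
    void histogram_cpu(const unsigned char *buffer, long size,
                       unsigned int histo[256])
    {
        for (long i = 0; i < size; ++i)
            histo[buffer[i]]++;  // the element's value is the bin index
    }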

A Histogram Example
- For the phrase "Programming Massively Parallel Processors", build a histogram of the frequency of each letter
  - A(4), C(1), E(1), G(1), …
- How do you do this in parallel?
  - Have each thread take a section of the input
  - For each input letter, use atomic operations to build the histogram

Iteration #1 – 1st letter in each section
[Figure: the input "P R O G R A M M I N G  M A S S I V E L Y  P ..." is split into four sections, one per thread (Thread 0 to Thread 3); each thread increments the A–V histogram bin for the first letter of its section.]

Iteration #2 – 2nd letter in each section
[Figure: each thread increments the bin for the second letter of its section; the counts accumulate.]

Iteration #3
[Figure: each thread processes the third letter of its section.]

Iteration #4
[Figure: each thread processes the fourth letter of its section.]

Iteration #5
[Figure: each thread processes the fifth letter of its section.]

What is wrong with the algorithm?

What is wrong with the algorithm?
- Reads from the input array are not coalesced
- Fix: assign inputs to each thread in a strided (interleaved) pattern, so that adjacent threads process adjacent input letters
[Figure: Thread 0 to Thread 3 now read the adjacent letters "P R O G" in the same iteration, each incrementing its letter's bin in the A–V histogram.]
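
A small sketch contrasting the two assignments (function names and the perThread parameter are illustrative): with sectioned input, neighboring threads read addresses far apart; with the interleaved (strided) pattern, neighboring threads read neighboring bytes, so the accesses coalesce.

    // Sketch: the element index thread t reads in iteration k under each assignment.
    __device__ long sectioned_index(long k, long perThread)
    {
        long t = threadIdx.x + (long)blockIdx.x * blockDim.x;
        return t * perThread + k;       // neighboring threads are perThread bytes apart
    }

    __device__ long interleaved_index(long k)
    {
        long t            = threadIdx.x + (long)blockIdx.x * blockDim.x;
        long totalThreads = (long)blockDim.x * gridDim.x;
        return k * totalThreads + t;    // neighboring threads read adjacent bytes
    }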

Iteration 2
- All threads move to the next section of input
[Figure: Thread 0 to Thread 3 read the next four adjacent letters and update the A–V histogram.]

A Histogram Kernel
- The kernel receives a pointer to the input buffer
- Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size,
                             unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;

More on the Histogram Kernel

    // All threads in the grid collectively handle
    // blockDim.x * gridDim.x consecutive elements in each iteration
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}
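
A host-side launch sketch to go with the kernel (the wrapper name, block count, and omission of error checking are my assumptions, not part of the lecture): the 256 bins must be zeroed before the kernel runs.

    #include <cuda_runtime.h>

    // Hypothetical host wrapper for histo_kernel above (a sketch).
    void run_histogram(const unsigned char *h_buffer, long size,
                       unsigned int h_histo[256])
    {
        unsigned char *d_buffer;
        unsigned int  *d_histo;
        cudaMalloc((void **)&d_buffer, size);
        cudaMalloc((void **)&d_histo, 256 * sizeof(unsigned int));

        cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
        cudaMemset(d_histo, 0, 256 * sizeof(unsigned int));  // bins start at zero

        // Enough blocks to keep the device busy; threads stride over the input.
        histo_kernel<<<64, 256>>>(d_buffer, size, d_histo);

        cudaMemcpy(h_histo, d_histo, 256 * sizeof(unsigned int),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_buffer);
        cudaFree(d_histo);
    }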

Atomic Operations on DRAM
- An atomic operation starts with a read, with a latency of a few hundred cycles

Atomic Operations on DRAM
- An atomic operation starts with a read, with a latency of a few hundred cycles
- The atomic operation ends with a write, with a latency of a few hundred cycles
- During this whole time, no one else can access the location

Atomic Operations on DRAM
- Each Load-Modify-Store has two full memory access delays
- All atomic operations on the same variable (DRAM location) are serialized
[Figure: timeline of atomic operation N followed by atomic operation N+1, each incurring internal routing, DRAM delay, and transfer delay before the next can begin.]

Latency determines throughput of atomic operations
- The throughput of an atomic operation is the rate at which the application can execute atomic operations on a particular location.
- The rate is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations.
- This means that if many threads attempt atomic operations on the same location (contention), the achieved memory throughput drops to less than 1/1000 of the peak bandwidth!
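
To put rough numbers on this (illustrative figures, not from the slides): at a clock of about 1 GHz, a 1000-cycle read-modify-write takes roughly 1 microsecond, so a single contended location can absorb only on the order of one million atomic updates per second, no matter how many threads are trying.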

You may have had a similar experience at a supermarket checkout
- Some customers realize that they missed an item after they have started to check out
  - They run to the aisle and get the item while the line waits
  - The checkout rate is reduced due to the long latency of running to the aisle and back
- Imagine a store where every customer starts the checkout before fetching any of the items
  - The checkout rate becomes 1 / (entire shopping time of each customer)

Hardware Improvements
- Atomic operations on the Fermi L2 cache
  - Medium latency, but still serialized
  - Global to all blocks
  - A "free" improvement for global memory atomics
[Figure: timeline of atomic operations N and N+1, each incurring only internal routing and data transfer delays, with no DRAM access.]

Hardware Improvements (cont.)
- Atomic operations on shared memory
  - Very short latency, but still serialized
  - Private to each thread block
  - Requires algorithm work by the programmer (more later)
[Figure: timeline of atomic operations N and N+1, each incurring only internal routing and a single data transfer delay.]

Atomics in Shared Memory Require Privatization
- Create a private copy of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size,
                             unsigned int *histo)
{
    __shared__ unsigned int histo_private[256];

    if (threadIdx.x < 256)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i]]), 1);
        i += stride;
    }

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 256)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}
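
Putting the three fragments together, the complete privatized kernel looks like the sketch below (the kernel name is mine; it assumes 256 bins and a block of at least 256 threads, as the slides do).

    // The privatized histogram kernel assembled from the fragments above (a sketch).
    __global__ void histo_privatized_kernel(unsigned char *buffer, long size,
                                            unsigned int *histo)
    {
        // Per-block private copy of the 256-bin histogram in shared memory.
        __shared__ unsigned int histo_private[256];
        if (threadIdx.x < 256)
            histo_private[threadIdx.x] = 0;
        __syncthreads();

        // Each thread strides over the input; the atomics go to fast shared memory.
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            atomicAdd(&(histo_private[buffer[i]]), 1);
            i += stride;
        }

        // Merge the block's private counts into the global histogram:
        // one global atomic per bin per block instead of one per input element.
        __syncthreads();
        if (threadIdx.x < 256)
            atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }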

More on Privatization
- Privatization is a powerful and frequently used technique for parallelizing applications
- The operation needs to be associative and commutative
  - The histogram add operation is associative and commutative
- The histogram size needs to be small
  - So it fits into shared memory
- What if the histogram is too large to privatize?

Any More Questions?