CS 179: Lecture 4 Lab Review 2

Groups of Threads (Hierarchy) (largest to smallest)
 “Grid”:
    All of the threads
    Size: (number of threads per block) * (number of blocks)
 “Block”:
    Size: user-specified
    Should be at least a multiple of 32 (often, higher is better)
    Upper limit given by hardware (512 threads per block on Tesla, 1024 on Fermi)
    Features:
       Shared memory
       Synchronization
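For concreteness, a minimal host-side launch sketch (the kernel name someKernel, the element count N, and the device pointers d_in/d_out are hypothetical; only the sizing logic reflects the rules above):

// Hypothetical launch over N elements.
const unsigned int threadsPerBlock = 256;  // user-specified, a multiple of 32
const unsigned int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover N
someKernel<<<numBlocks, threadsPerBlock>>>(d_in, d_out, N);
// Grid size = threadsPerBlock * numBlocks threads in total.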

Groups of Threads
 “Warp”:
    Group of 32 threads
    Execute in lockstep (same instructions)
    Susceptible to divergence!
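Purely as an illustration (not from the slides), a thread can compute which warp, and which lane within that warp, it occupies from its index in the block:

// Inside a kernel: warpSize is a built-in CUDA variable (32 on current hardware).
int warp_id = threadIdx.x / warpSize;  // which warp of the block this thread belongs to
int lane_id = threadIdx.x % warpSize;  // position within that warp (0-31)
// All threads sharing warp_id execute the same instruction at the same time.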

Divergence “Two roads diverged in a wood… …and I took both”

Divergence
 What happens:
    Executes normally until the if-statement
    Branches to calculate Branch A (blue threads)
    Goes back (!) and branches to calculate Branch B (red threads)
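A minimal sketch of the kind of branch that triggers this; the condition and the arithmetic here are invented for illustration. Even and odd lanes of each warp want different instructions, so the hardware runs Branch A with the odd lanes masked off, then Branch B with the even lanes masked off:

__global__ void divergent_example(float *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // (bounds check omitted for brevity)
    if (threadIdx.x % 2 == 0) {
        data[idx] *= 2.0f;   // Branch A: even lanes ("blue" threads)
    } else {
        data[idx] += 1.0f;   // Branch B: odd lanes ("red" threads)
    }
    // The warp executes both branches one after the other; each thread keeps only its own result.
}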

“Divergent tree”
[Diagram: reduction tree over a block of 512 threads ("Assume 512 threads in block…"); successive levels accumulate at doubling strides, e.g. indices … 506, 508, 510, then … 500, 504, 508, then … 488, 496, 504, then … 464, 480, 496.]

“Divergent tree”
// Let our shared memory block be partial_outputs[]
...
synchronize threads before starting
...
set offset to 1
while ((offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
Assumes block size is power of 2…
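A sketch of that pseudocode as an actual kernel body. The names partial_outputs and output come from the pseudocode; everything else (the float types, how partial_outputs was filled) is assumed:

// Assumes earlier in the kernel: __shared__ float partial_outputs[BLOCK_SIZE];
// and that each thread has written its partial result to partial_outputs[threadIdx.x].
// Divergent reduction: assumes blockDim.x is a power of 2.
__syncthreads();                              // make sure partial_outputs[] is fully written
for (unsigned int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
    if (threadIdx.x % (offset * 2) == 0) {
        partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
    }
    __syncthreads();
}
if (threadIdx.x == 0) {
    atomicAdd(output, partial_outputs[0]);    // one atomic per block into the global result
}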

“Non-divergent tree”
[Diagram of the non-divergent reduction tree. Example purposes only! Real blocks are way bigger!]

“Non-divergent tree”
// Let our shared memory block be partial_outputs[]
...
set offset to highest power of 2 that’s less than the block dimension
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
Assumes block size is power of 2…
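The non-divergent version as a kernel-body sketch, under the same assumptions (shared float partial_outputs[], global float *output). When the block size is a power of 2, the "highest power of 2 that's less than the block dimension" is simply blockDim.x / 2:

// Assumes earlier in the kernel: __shared__ float partial_outputs[BLOCK_SIZE];
// Non-divergent reduction: assumes blockDim.x is a power of 2.
__syncthreads();                              // make sure partial_outputs[] is fully written
for (unsigned int offset = blockDim.x / 2; offset >= 1; offset /= 2) {
    if (threadIdx.x < offset) {
        partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
    }
    __syncthreads();
}
if (threadIdx.x == 0) {
    atomicAdd(output, partial_outputs[0]);
}

Because the active threads at each step are the lowest-numbered ones, each warp is either fully active or fully idle as long as offset is at least 32, which is exactly what the analysis below counts.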

“Divergent tree”
 Where is the divergence?
 Two branches:
    Accumulate
    Do nothing
 If the second branch does nothing, then where is the performance loss?

“Divergent tree” – Analysis
 First iteration (reduce 512 -> 256):
    Warp of threads 0-31 (after calculating the polynomial):
       Thread 0: Accumulate
       Thread 1: Do nothing
       Thread 2: Accumulate
       Thread 3: Do nothing
       …
    Warp of threads 32-63:
       (same thing!)
    …
    (up to) Warp of threads 480-511
 Number of executing warps: 512 / 32 = 16

“Divergent tree” – Analysis
 Second iteration (reduce 256 -> 128):
    Warp of threads 0-31 (after calculating the polynomial):
       Thread 0: Accumulate
       Threads 1-3: Do nothing
       Thread 4: Accumulate
       Threads 5-7: Do nothing
       …
    Warp of threads 32-63:
       (same thing!)
    …
    (up to) Warp of threads 480-511
 Number of executing warps: 16 (again!)

“Divergent tree” – Analysis  (Process continues, until offset is large enough to separate warps)

“Non-divergent tree” – Analysis
 First iteration (reduce 512 -> 256), part 1:
    Warp of threads 0-31: Accumulate
    Warp of threads 32-63: Accumulate
    …
    (up to) Warp of threads 224-255: Accumulate
 Then what?

“Non-divergent tree” – Analysis
 First iteration (reduce 512 -> 256), part 2:
    Warp of threads 256-287: Do nothing!
    …
    (up to) Warp of threads 480-511: Do nothing!
 Number of executing warps: 256 / 32 = 8 (was 16 in the divergent version!)

“Non-divergent tree” – Analysis
 Second iteration (reduce 256 -> 128):
    Warps of threads 0-31, …, 96-127: Accumulate
    Warps of threads 128-159, …, 480-511: Do nothing!
 Number of executing warps: 128 / 32 = 4 (was 16 in the divergent version!)

What happened?  “Implicit divergence”

Why did we do this?  Performance improvements  Reveals GPU internals!

Final Puzzle  What happens when the polynomial order increases?  All these threads that we think are competing… are they?

The Real World

In medicine…  More sensitive devices -> more data!  More intensive algorithms  Real-time imaging and analysis  Most are parallelizable problems!

MRI  “k-space” – Inverse FFT  Real-time and high-resolution imaging

CT, PET  Low-dose techniques  Safety!  4D CT imaging  X-ray CT vs. PET CT  Texture memory!

Radiation Therapy  Goal: Give sufficient dose to cancerous cells, minimize dose to healthy cells  More accurate algorithms possible!  Accuracy = safety!  40 minutes -> 10 seconds

Notes
 Office hours:
    Kevin: Monday 8-10 PM
    Ben: Tuesday 7-9 PM
    Connor: Tuesday 8-10 PM
 Lab 2: Due Wednesday (4/16), 5 PM