1
Modern GPU Architecture and SIMT Programming
Jack Wadden
2
Vanity Card
Williams College, 2011
UVA, 2011-2012 (Kevin Skadron): architectures supporting speculative, trace-based parallelization (this was a bad idea)
AMD Research, 2013 (Sudhanva Gurumurthi): GPU reliability, Exascale Compute Project
UVA (Kevin Skadron): Center for Automata Processing; AP, FPGA, GPU, CPU, Xeon Phi
3
What are the problems we’re trying to solve?
What is wrong with CPUs today? Why aren’t they sufficient? What are the bottlenecks to make them faster?
4
“von Neumann” Architecture
The "von Neumann Bottleneck." Before I talk about GPUs, I really want to hammer home the motivation for these architectures. GPUs themselves aren't that complex when compared to a CPU. In fact, they are much *less* complex in many ways. This reduction in complexity allowed us to realize massive power efficiency and performance gains for some applications, and to appreciate this, we need to start from the bottom. von Neumann architectures have instructions and data. But memory accesses are SLOW! How do we solve this?
5
Memory Hierarchy Registers L1 L2 L3 DRAM NVM
6
(Figure: measurements for L1, L2, L3, and DRAM.) "To measure is to know!" – Lord Kelvin
7
Flynn’s Taxonomy
8
Why GPUs? Moore’s Law? Moore’s Law!
9
Why GPUs? Dennard Scaling Transistor size Transistor power
10
Why GPUs? Dennard Scaling Transistor size Transistor power
11
Breakdown in Dennard Scaling
Why GPUs? Breakdown in Dennard Scaling Transistor size Transistor power
12
Why GPUs? What makes your code execute faster?
What parts of the architecture increase IPC? Pipelining (multiple stages active at once, higher freq) Branch prediction (reduces stalls) Register renaming (reduces stalls) OoO execution (reduces stalls) I/D Caches (reduces latency of stalls) Superscalar execution (parallel execution) Hyperthreading (reduces penalty of stalls)
13
CPU Floor Plan: an older Intel core
14
Jaguar Core Floor Plan Jaguar x86 core
15
Kabini Floor Plan Jaguar x86 core
16
INTUITION: We don’t need low-latency if we can hide high-latency with parallelism
Latency oriented architecture -> throughput oriented architecture
17
“The Tradeoff” We give up LATENCY We (try to) gain THROUGHPUT
18
Why GPUs? What makes your code execute faster?
What parts of the architecture increase IPC? Pipelining (multiple stages active at once, higher freq) Branch prediction (reduces stalls) Register renaming (reduces stalls) OoO execution (reduces stalls) I/D Caches (reduces latency of stalls) Superscalar execution (parallel execution) Hyperthreading (reduces penalty of stalls)
19
Why not MIMD? Xeon Phi (Intel), Parallella (Adapteva), SPARC T5
Xeon Phi: Knights Corner has 72 in-order P5 cores; Knights Landing has 72 (76) OoO Atom cores with 16-wide SIMD and a 2D mesh interconnect. Parallella (Adapteva): 1,024 in-order RISC cores, 2D mesh interconnect. SPARC T5: 16 cores, 128 threads, OoO barrel processor (8-way SMT). MIMD is fine. MIMD is great, even. It's just that we could do better if we know our application is data parallel.
20
SIMD: Taking parallelism to 11
What if our application is data parallel? Let’s use the same IF/ID hardware to control multiple ALUs! Fetch ONE instruction for 4/16/32/64 pairs of data
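This is exactly what CPU vector extensions such as SSE/AVX expose. A minimal sketch, assuming an AVX-capable x86 CPU and the standard immintrin.h intrinsics (this example is mine, not from the slides): a single vaddps instruction is fetched and decoded once, yet produces eight single-precision results.

#include <immintrin.h>   // AVX intrinsics

// ONE fetched/decoded instruction drives 8 SIMD lanes.
void add8(const float *a, const float *b, float *out) {
    __m256 va = _mm256_loadu_ps(a);      // load 8 floats
    __m256 vb = _mm256_loadu_ps(b);      // load 8 floats
    __m256 vc = _mm256_add_ps(va, vb);   // one add instruction, 8 results
    _mm256_storeu_ps(out, vc);           // store 8 floats
}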
22
Lane Parallelism 1 SIMD “lane” 4 SIMD “lanes”
23
GPU vs. CPU
24
Intermission: Questions?
NEXT UP: Modern GPU Architecture
25
Ryan Smith’s GCN Anandtech Article
AMD’s Evergreen (5000) is VLIW5 AMD’s Northern Islands (6000) is VLIW4 AMD’s Southern Islands (7000+) is Barrel SIMD Questions we will discuss: What happened to VLIW? Why is SIMD a better option for compute? What clever features make GCN Jack’s favorite architecture ever?
26
VLIW in 30 seconds Instead of disambiguating dependencies on the fly, let the compiler do the heavy lifting! Burden of parallelization is slowly shifting towards the programmer Pros? Less complex hardware (barely any decode!) Lower power Increased throughput Cons? Compilers are famously sucky at automatic parallelization
27
Fixed instruction width; specialized for graphics (pixel math)
“AMD initially used a VLIW design in those early parts because it allowed them to process a 4 component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) at the same time, which was by far the most common graphics operation.”
28
VLIW instructions are hard to fill
29
VLIW instructions are hard to fill
“AMD’s own internal research at the time of the Cayman launch [showed] the average shader program was utilizing only 3.4 out of 5 Radeon cores” “VLIW lives and dies by the compiler” – Lord Kelvin
30
Example of VLIW vs SIMD: a > b ? (a-b)*a : (b-a)*b
Harder to debug, optimize, gain intuition about optimization opportunities
31
Example of VLIW vs SIMD: a > b ? (a-b)*a : (b-a)*b
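One way to see the SIMD side of this comparison: both arms of the ternary are computed for every lane, and a comparison mask selects the result per lane. A hedged sketch using AVX intrinsics (an illustration of the idea, not the actual GCN instruction sequence):

#include <immintrin.h>

// Evaluate both sides for all 8 lanes, then pick per lane with a mask.
__m256 select_example(__m256 a, __m256 b) {
    __m256 mask  = _mm256_cmp_ps(a, b, _CMP_GT_OQ);        // lanes where a > b
    __m256 taken = _mm256_mul_ps(_mm256_sub_ps(a, b), a);  // (a-b)*a
    __m256 other = _mm256_mul_ps(_mm256_sub_ps(b, a), b);  // (b-a)*b
    return _mm256_blendv_ps(other, taken, mask);           // per-lane select
}

GPUs do the analogous thing with execution masks (predication), which is exactly the warp-divergence behavior discussed later.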
32
QUIZ: How many people are in space right now?
33
5 (*yesterday there were 8)
3 onboard the International Space Station: Andrei Borisenko (Roscosmos), Sergey Ryzhikov (Roscosmos), Shane Kimbrough (NASA), *Kate Rubins (NASA), *Anatoli Ivanishin (Roscosmos), *Takuya Onishi (JAXA)
2 onboard the Tiangong-2 Space Laboratory: Jing Haipeng (CNSA), Chen Dong (CNSA)
36
GPU computing motivated a massive redesign of AMD’s core microarchitecture
What replaces VLIW? parallel vector processing
37
AMD GCN 1.0 (Tahiti); 28nm; 2012
38
Tahiti Die Photo
39
Tahiti Die Photo
Tahiti: 32 CUs, 4 x 16-wide SIMDs per CU, 2,048 "cores", 81,920 "threads"
Xeon Phi: 64 P5 cores, 4 threads/core, 4-wide SIMD units per core, 1,024 "threads"
40
AMD GCN 1.2 (Fiji); 28nm; 2015 Major changes? 64 CUs, stacked HBM DRAM
42
NVIDIA Kepler SMX; 28nm; 2012. 12 x 16-wide SIMD units per SMX?
43
Kepler GK104 Die Photo GK104, 8 SMX, 1536 “CUDA cores”
44
Maxwell SMM; 28nm; 2015
45
Were there too many SIMD units per control unit?
46
WHAT ARE THE PROBLEMS WE ARE TRYING TO SOLVE?
SIMD helps increase utilization, but how do GPUs hide memory access latency?
Breakdown in Dennard Scaling: solved by more efficient SIMD parallel processing.
Von Neumann Bottleneck: ????????
47
What if loads from SIMD instructions in all 4 Vector Units miss in the cache?
How do CPUs solve this problem?.....
48
UltraSPARC T1: fast context switching to the rescue
Stalls can be hidden by simultaneous multithreading (aka hyperthreading)! The UltraSPARC takes this idea to the extreme. Someone dubbed this a "barrel" processor: threads pool instructions in a "barrel"; reach into the barrel and grab an instruction with no dependencies; execute the instruction; rotate the barrel. UPS analogy. Coffee analogy. The UltraSPARC T1 can pick among the instructions of any of its 4 threads per core (4-way multithreading).
49
What are the benefits of barrel processors?
50
What are the benefits of barrel processors?
We can tolerate long-latency instructions. Reduce the size of the cache? Optimize the cache for power instead of latency/size? Remove levels of cache entirely? But single-threaded performance will be terrible! Is that OK?
51
Modern GPUs are Frankenstein SIMD/VECTOR/BARREL processors
16-wide SIMD units + 64-wide waves = 4 cycles to compute one wave.
4 SIMD units x 10 waves x 64 threads/wave = 2,560 threads in flight per compute unit!
52
Intermission: Questions?
NEXT UP: How the heck do we program these things?
53
Review of Monday GPUs solve two different problems
Breakdown in Dennard scaling (power density); Von Neumann bottleneck (memory latency).
Barrel processor features: allow GPU threads to tolerate latency via fast context switching; reduce or remove expensive latency-oriented structures like branch predictors, caches, OoO logic, etc.
SIMD/vector architecture features: do massive amounts of processing in parallel; reduce the overhead and power associated with IF/ID.
54
Questions: How will GPUs evolve in the near future?
Deep neural network training can be greatly accelerated by GPU computation NVIDIA has adjusted their microarchitecture to support 16-bit precision floating point operations at 2x the rate of 32-bit FP Bio-informatics might also drive some architectural changes (4-bit comparisons), but the market isn’t as big yet “Increased floating point performance particularly benefits classification and convolution — two key activities in deep learning — while achieving needed accuracy.” – NVIDIA Marketing Paper
56
Questions: How are GPUs used in supercomputers?
The Titan supercomputer at Oak Ridge National Labs has 18,688 NVIDIA K20X GPUs, a 1-to-1 CPU chip to GPU chip ratio. Linpack performance as a fraction of peak: 64.9%, vs. 74.2% for Sunway TaihuLight. Are GPUs losing ground to easier-to-program CPU systems?
59
Today: How do we program a SIMD vector computer? What is SIMT/CUDA?
CUDA programming model How do we handle branching? Code examples Assignment overview and walkthrough
60
Programming CPU SIMD: VMIPS
61
VMIPS: DAXPY “Double precision A * X Plus Y”
int i;
int bound = 64;
for (i = 0; i < bound; i++) {
    Y[i] = A * X[i] + Y[i];
}
62
VMIPS: Code Example (VMIPS assembly listing for DAXPY)
63
Why isn’t SIMD ubiquitous?
64
Why isn’t SIMD ubiquitous?
Not all applications have friendly data parallelism.
It's hard! Explicit vectorization is hard; automatic vectorization does OK.
Hardware is expensive: vector processors used to be expensive, so programming language support lagged.
SIMD extensions (MMX, AVX, SSE): not very wide, and no control flow within SIMD.
65
Auto vectorizable code
int a[256], b[256], c[256];

void foo() {
    int i;
    for (i = 0; i < 256; i++) {
        a[i] = b[i] + c[i];
    }
}
66
NON auto vectorizable code
void foo(int *a, int *b, int *c, unsigned int N) {
    int i;
    for (i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
}
67
pragma vectorizable code?
void foo(int *a, int *b, int *c, unsigned int N) {
    int i;
    #pragma loop_unroll 16
    for (i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
}
68
Auto vectorizable code
void foo(int *restrict a, int *restrict b, int *restrict c, unsigned int N) {
    int i;
    for (i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
}
69
Solution: GPU SIMT Programming
Single Instruction Multiple THREAD: a fancy NVIDIA marketing term meant to distinguish explicit VMIPS-style SIMD from parallel threads that use SIMD hardware. CUDA and OpenCL are both SIMT programming models. CUDA: a hierarchical thread programming model that combines MIMD/SIMD ideas and maps well to GPU architectures.
70
CUDA: programming model
Compute Unified Device Architecture: a virtual machine model for SIMT programming. (Diagram: the HOST runs the host program, the DEVICE runs kernels, and DATA TRANSFER moves data between them.)
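A minimal host-side sketch of that picture, assuming the vectorAdd kernel shown a few slides later; the names N, threadsPerBlock, and the h_/d_ prefixes are illustrative, but cudaMalloc, cudaMemcpy, and the <<< >>> launch are the real CUDA runtime API.

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void add(int *a, int *b, int *c);   // kernel defined on a later slide

int main(void) {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(int);

    // HOST allocations
    int *h_a = (int*)malloc(bytes), *h_b = (int*)malloc(bytes), *h_c = (int*)malloc(bytes);
    // ... fill h_a and h_b ...

    // DEVICE allocations
    int *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    // DATA TRANSFER: host -> device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // KERNEL launch
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c);

    // DATA TRANSFER: device -> host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}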
71
CUDA: thread model Single Instruction Multiple Thread (SIMT)
Map a grid of threads over some domain. Each thread executes a kernel function. Threads are grouped into blocks (<1024 threads); blocks are mapped to SMXs. Blocks are further divided into vector groups, or warps (32 threads); warps are executed on SIMD units. Threads distinguish their computation by querying their grid/block location (think thread IDs).
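Inside a kernel, a thread recovers its place in this hierarchy from built-in variables. A small sketch (the kernel name whereAmI is made up; threadIdx, blockIdx, blockDim, and warpSize are the real built-ins, with warpSize = 32 on current NVIDIA hardware):

__global__ void whereAmI(void) {
    int lid  = threadIdx.x;                            // position within the block
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;  // position within the grid
    int warp = threadIdx.x / warpSize;                 // which warp of the block
    int lane = threadIdx.x % warpSize;                 // which SIMD lane of that warp
    (void)lid; (void)gid; (void)warp; (void)lane;      // silence unused warnings
}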
72
CUDA: thread model Warp (32)
73
History aside….. Kevin Skadron was on sabbatical at NVIDIA while they came up with the SIMT acronym He helped develop the CUDA programming model and co-wrote the original CUDA paper His take on the SIMT acronym….
74
We were trying to figure out how to describe the SM execution model, which is SIMD but not vector. However, many people today don't appreciate that SIMD != vector (vector is a subset of SIMD, and people even argue about that if there is only a single lane and the vector architecture is only helping by virtue of deeper pipelining). A further complication was that there are two levels of parallelism, the warp-level SIMD parallelism and the thread-block-level parallelism of interleaving execution among multiple warps, which is kind of like classic forms of pipelining (except SMT) but on warps instead of single instructions. There was some discussion of warp and weft parallelism (ala weaving), but I think it was John Nickolls, one of the original architects of the NVIDIA GPGPU execution model, who came up with SIMT. I've actually never been thrilled with the term, because it's not a very precise description of what's really going on, but I was never able to come up with a better term that is sufficiently crisp. (John sadly died from melanoma shortly after that ACM Queue article was published.)
76
CUDA: thread model Warp (32)
77
CUDA: memory model
All threads share Global Memory.
Threads within a block share Shared Memory.
Threads have their own private Registers.
Memory consistency: very relaxed. Cannot guarantee ordering of instructions across blocks/wavefronts! Can insert explicit memory fences/barriers to order threads within a block.
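A tiny sketch of where values live, assuming a single block of TILE threads (TILE and the kernel name are illustrative, not from the slides):

#define TILE 64

__global__ void spaces(int *g) {           // g points into Global Memory
    __shared__ int tile[TILE];             // Shared Memory: visible to the whole block
    int mine = g[threadIdx.x];             // local variable lives in a private Register
    tile[threadIdx.x] = mine;              // stage the value in shared memory
    __syncthreads();                       // barrier: order the block's shared accesses
    g[threadIdx.x] = tile[(threadIdx.x + 1) % TILE];   // read a neighbor's staged value
}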
78
CUDA: memory model (Diagram: Global Memory shared by all blocks; per-block Shared Memory for Block 0 and Block 1; per-thread Registers.)
79
CUDA code examples
81
CUDA: vectorAdd ( c = a + b)
__global__ void add(int *a, int *b, int *c) {
    int b_off = blockIdx.x * blockDim.x;   // block offset
    int lid   = threadIdx.x;               // local id
    int tid   = b_off + lid;               // global id
    // each thread adds one location in the vectors
    c[tid] = a[tid] + b[tid];
}
82
CUDA: control flow How the heck do we handle this?!
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2)   // only odd thread ids do the add
        c[tid] = a[tid] + b[tid];
}
How the heck do we handle this?! Predication: mask execution of threads that don't take the branch.
83
CUDA: SIMD control flow
84
CUDA: SIMD warp divergence
50% performance hit!
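The hit comes from masked-off lanes, not from the branch instruction itself. A hedged sketch of the usual remedy: arrange the work so the condition is uniform across each 32-thread warp. (This kernel touches different elements than the tid % 2 version above; it only illustrates the warp-uniform idea.)

__global__ void addEvenWarps(int *a, int *b, int *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / warpSize) % 2)      // every thread of a warp takes the same path
        c[tid] = a[tid] + b[tid];  // no lanes are masked, so no divergence penalty
}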
85
CUDA: synchronization model
NO synchronization is guaranteed between two blocks.
__threadfence(): stalls the current thread until all of its writes to shared and global memory are visible to all other threads.
__syncthreads(): must be reached by all threads of the block (e.g., no divergent if statements) and ensures that the code preceding the barrier is executed before the code following it, for all threads in the block.
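A short sketch of the __syncthreads() rule above (the kernel name and the constant 16 are illustrative): a barrier placed inside a divergent branch can hang the block, so branch first and put the barrier where every thread reaches it.

__global__ void barrierExample(int *data) {
    // WRONG (commented out): only some threads would reach the barrier.
    // if (threadIdx.x < 16) {
    //     data[threadIdx.x] *= 2;
    //     __syncthreads();        // threads >= 16 never arrive -> deadlock
    // }

    // RIGHT: do the divergent work, then barrier outside the branch.
    if (threadIdx.x < 16)
        data[threadIdx.x] *= 2;
    __syncthreads();               // reached by all threads of the block
}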
86
CUDA: Using Shared Memory
What if we wanted to sum all of the values in an array? How can SIMD do this?
int sum = 0;
for (unsigned int i = 0; i < N; i++) {
    sum += a[i];
}
87
CUDA: Reduction (Diagram: private memory, Shared Memory, Global Memory.)
Cost(N) = N * expensive (global memory) operations
88
CUDA: Reduction (Diagram: private memory, Shared Memory, Global Memory, with a __threadfence()/barrier between the shared-memory stage and the final global write.)
Cost(N) = 1 * expensive (global memory) + N * cheap (shared memory) operations
89
CUDA: Reduction
__global__ void sumReduce(int *in, int *out) {
    __shared__ int fast[64];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // load into shared memory
    fast[threadIdx.x] = in[tid];

    // wait for all loads and stores
    __syncthreads();   // __threadfence()?

    // have the lead thread do a local sum over 64 elements
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < 64; i++)
            sum += fast[i];

        // store the block's result to its output location
        out[blockIdx.x] = sum;
    }
}
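For completeness, a hypothetical host-side driver for this kernel (the names sumOnGPU, d_in, d_partial, and h_partial are made up, and N is assumed to be a multiple of 64): each block writes one partial sum, and the partials are combined on the host.

// Compiled with nvcc alongside sumReduce; buffers are assumed already allocated.
int sumOnGPU(int *d_in, int *d_partial, int *h_partial, int N) {
    int blocks = N / 64;
    sumReduce<<<blocks, 64>>>(d_in, d_partial);       // one partial sum per block
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    int total = 0;
    for (int i = 0; i < blocks; i++)                  // combine the partials on the host
        total += h_partial[i];
    return total;
}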
90
QUIZ: How many functional robots are on mars?
91
2 (of 14 on the surface)
Opportunity (MER-B), NASA/JPL: 13 years, 3 months, 26 days
Curiosity (MSL), NASA/JPL: 3 years, 11 months, 10 days
Schiaparelli EDM lander, ESA/Roscosmos: crashed on impact, October 19th
92
QUIZ: How many functional satellites orbit mars?
93
6 (14 including inactive)
Mars Odyssey (2001), NASA
Mars Express (2003), ESA
MRO (2005), NASA
MOM (2013), India (ISRO)
MAVEN (2013), NASA
ExoMars Trace Gas Orbiter (2016), ESA/Roscosmos
94
Homework 3 http://www.cs.virginia.edu/~jpw8bd/teaching
PART 1 (~1 week): implement and optimize a kernel that calculates the max value in a large vector. PART 2 (~3 weeks): project: implement a parallel algorithm in CUDA and compare it to a CPU version.
95
BACKUP
96
CUDA: localAvg
__global__ void add(int *in, int *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid - 1] + in[tid] + in[tid + 1];
}
97
CUDA: localAvg (Diagram: threads x and x+1 read overlapping elements A[tid - 1], A[tid], A[tid + 1].)
98
99
100
CUDA: localAvg
__global__ void add(int *a) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // in-place update: a neighbor may read our freshly written value instead of the original
    a[tid] = a[tid - 1] + a[tid] + a[tid + 1];
}
101
CUDA: localAvg
__global__ void add(int *a) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int temp = a[tid - 1] + a[tid] + a[tid + 1];   // read neighbors first
    __syncthreads();   // all reads in this block finish before any write below
    a[tid] = temp;     // note: only orders threads within a block (see the sync model above)
}