Pervasive Massively Multithreaded GPU Processors Michael C. Shebanow Sr. Arch Mgr, GPUs.

ACM International Conference on Computing Frontiers 2009: Pervasive Massively Multithreaded GPU Processors

The “Real” Title: this talk is about SIMT processors. The past, the present, and a glimpse of the future.

The Past

Brief Chronology of GPUs at NVIDIA [Timeline figure; game labels: Quake 3, Giants, Halo, Far Cry, UE3, Half-Life]

Early NVIDIA GPUs (Precambrian Eon). NV1 (1995): forward texturing (traverse in texel space and generate pixels, vs. conventional “reverse” texturing, where pixel locations are sampled in texture space); quadratic patches (different from the DirectX polygon rendering approach); integrated audio.

Precambrian (cont’d). NV3 - Riva 128 (Aug 1997): 1st 128-bit memory bus (“Wider is better”); DirectX 3 support; 1 pix/clk @ 100 MHz; unified memory for frame buffer and texture; 16b Z / 16b color; integrated VGA from Weitek.

Shades of Programmability (Phanerozoic Eon). NV4 - Riva TNT (Summer 1998): 2 pix/clk @ 90 MHz; DirectX 5; dual texturing @ 1 pix/clk; register combiners.

Rudimentary Shader Processors. Early programmable shading: Google “register combiners”, http://developer.nvidia.com/object/registercombiners.html. “Fixed function but programmable”.

General Combiner Flow

The Birth of Modern GPUs (Cenozoic Eon). NV20 - GeForce3 (Feb 2001): 4 pix/clk @ 240 MHz, 2 bilinear tex/pix; DirectX 8; shaders! Programmable vertex shaders; “configurable” pixel shaders.

    ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R0.xyzx, R0.xyzx;
    RSQR R0.w, R0.w;
    MULR R0.xyz, R0.w, R0.xyzx;
    ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R1.xyzx, R1.xyzx;
    RSQR R0.w, R0.w;
    MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
    MULR R1.xyz, R0.w, R1.xyzx;
    DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
    MAXR R0.w, R0.w, {0}.x;

Shaders: Before and After [Screenshots: Halo, © Bungie; Elder Scrolls 3: Morrowind, © Bethesda]

Fully Programmable Shader Engines (Cretaceous Period). NV30, NV31 - GeForce FX (Jan 2003): 4 pix/clk @ 500 MHz (Ultra); 8 pix/clk for Z-only; 128 pin DDR DRAM interface; superset of DirectX 9; FP32 programmable pixel shader; mainstream derivative: NV31. Not a stellar market success.

GeForce FX Shader Program Examples

Programmability Improved (Paleocene Epoch). NV40 - GeForce 6800 (April 2004): 16 pix/clk @ 500 MHz; DX9 Shader Model 3.0; 256 pin DRAM interface; transition from AGP to PCI-E; evolved NV3x shader, with a focused perf/area effort; SLI re-born.

The Present

Modern Shader Processors – The “SM” (Pliocene Epoch). G80 - GeForce 8800 (Nov 2006): 24 pix/clk @ 575 MHz; 384-bit local memory interface; virtual memory remapping for system and frame buffer; DirectX 10; unified shader for vertex, geometry, and pixel programs; compute!

G8x

NVIDIA Tesla: Scalable High Density Computing. Massively multi-threaded parallel computing.

Unified Design

Streaming Multiprocessor (SM). 8 Streaming Processors (SP): 8 SP FMA, 1 shared DP FMA. 2 Super Function Units (SFU). Multi-threaded instruction dispatch: 1 to 768 threads active; SIMD instruction per 16/32 threads. Hot clock 1.5 GHz, tepid 750 MHz, 24 GFLOPS. 32 KB local register file (RFn). 16 KB global register file (GRF), aka Shared Memory.
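
The 24 GFLOPS figure is consistent with counting only the 8 single-precision FMA units at the 1.5 GHz hot clock. The slide does not state its counting convention, so the following is an assumption: each FMA counts as two floating-point operations, and SFU and DP throughput are excluded.

```latex
\[
8\ \text{SP} \times 2\ \frac{\text{FLOP}}{\text{FMA}} \times 1.5\ \text{GHz} = 24\ \text{GFLOPS per SM}
\]
```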

SM Conceptual Block Diagram

The Future

The CMOS “Canvas”

The Ideal Processor?

The Processor We Live With? Performance = Total Area × Computational Area Efficiency × Achieved Dynamic Efficiency

What is SIMT? [Figure panels: SIMD, MIMD, SIMT]

SIMD versus MIMD versus SIMT? SIMD: “Synchronous Internally Parallel”. MIMD: “Asynchronous Externally Parallel”. SIMT: “Quasi-Synchronous Externally Parallel”. SIMT = “Near” MIMD programming model w/ SIMD implementation efficiencies.

SIMT Multithreaded Execution. SIMT (Single-Instruction Multi-Thread) executes one instruction across many independent threads. Warp: a set of 32 parallel threads that execute a SIMT instruction. SIMT provides easy single-thread scalar programming with SIMD efficiency. Hardware implements zero-overhead warp and thread scheduling. SIMT threads can execute independently. A SIMT warp diverges and converges when threads branch independently. Best efficiency and performance when threads of a warp execute together. [Figure: SIMT instruction scheduler timeline; labels: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96]

A Few Open SIMT Problems: control divergence; data divergence; data representation; coherence; diversity.

Control Divergence: control flow divergence can happen at control flow operations.

Why is Control Divergence Bad? Loss of efficiency in SIMD execution if threads on different execution paths are executed together. Unequal path execution delays imply the “wait or stay diverged” dilemma.
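
The efficiency loss can be illustrated with a small host-side model. This is a sketch under stated assumptions, not a description of any real scheduler: it assumes a warp handles an if/else by issuing each needed path once under a lane mask, and the function names (`issue_slots`, `simd_efficiency`) are hypothetical.

```c
#include <stdint.h>

#define WARP_SIZE 32  /* warp width from the talk: 32 threads */

/* Count set bits in a 32-bit lane mask (portable, no compiler builtins). */
int popcount32(uint32_t m) {
    int n = 0;
    while (m) { m &= m - 1; n++; }
    return n;
}

/* Issue slots a warp spends on an if/else when each path is issued
 * separately under a lane mask (assumed model). Bit i of `taken` is set if
 * lane i takes the "then" path; then_len and else_len are the instruction
 * counts of the two paths. */
int issue_slots(uint32_t active, uint32_t taken, int then_len, int else_len) {
    uint32_t then_mask = active & taken;
    uint32_t else_mask = active & ~taken;
    int slots = 0;
    if (then_mask) slots += then_len;  /* a path is issued only if some lane needs it */
    if (else_mask) slots += else_len;
    return slots;
}

/* Fraction of lane-slots that do useful work across the if/else. */
double simd_efficiency(uint32_t active, uint32_t taken,
                       int then_len, int else_len) {
    int slots = issue_slots(active, taken, then_len, else_len);
    if (slots == 0) return 1.0;  /* nothing issued, nothing wasted */
    long useful = (long)popcount32(active & taken) * then_len
                + (long)popcount32(active & ~taken) * else_len;
    return (double)useful / ((double)slots * WARP_SIZE);
}
```

With all 32 lanes on one path the warp issues only that path and every lane-slot is useful. With the lanes split evenly across two equal-length paths, the warp issues both paths and half of the lane-slots are idle.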

Data Access via Pointers in Parallel Programs. Pointers represent a major problem in parallel programs: the location that a pointer references cannot be resolved until runtime.

    struct { int x; int y; } *p;
    int z = p->y;
    LD R1,R0[4] // R0 = p

Data Divergence. SIMT magnifies the pointer problem. Non-converged memory accesses = data divergence. Classic scatter/gather problem.
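
One way to quantify data divergence is to count how many aligned memory segments a single warp-wide load touches. The sketch below is illustrative only: the 128-byte segment size and the name `segments_touched` are assumptions, not figures or APIs from the talk.

```c
#include <stdint.h>

#define WARP_SIZE 32
#define SEGMENT_BYTES 128u  /* hypothetical transaction size, not from the talk */

/* Number of distinct SEGMENT_BYTES-aligned segments touched by one warp-wide
 * load, given each lane's byte address. A result of 1 means the accesses are
 * fully converged; WARP_SIZE means every lane needs its own segment
 * (worst-case gather). */
int segments_touched(const uint64_t addr[WARP_SIZE]) {
    uint64_t seen[WARP_SIZE];
    int count = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        uint64_t seg = addr[lane] / SEGMENT_BYTES;
        int found = 0;
        for (int j = 0; j < count; j++) {
            if (seen[j] == seg) { found = 1; break; }
        }
        if (!found) seen[count++] = seg;  /* first lane to touch this segment */
    }
    return count;
}
```

Under these assumptions, 32 lanes reading consecutive 4-byte words from an aligned base fall in a single segment, while 32 lanes striding by 128 bytes touch 32 separate segments.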

Data Representation: The AOS versus SOA Dilemma.

AOS (array of structures):

    #define NNN nnn
    struct { type1 field1; type2 field2;... } data[NNN];

SOA (structure of arrays):

    #define NNN nnn
    struct { type1 field1[NNN]; type2 field2[NNN];... } data;

AOS versus SOA in Memory [Figure: memory layouts for AOS and SOA]

AOS versus SOA: How to Choose? Programmer: pick AOS. It is the natural way to think about data (group related fields), and in some cases it gives better memory access efficiency (sparse access to records). SIMT: pick SOA. Threads executing the same code want to access the same data element at the same time, which is very convenient for HW. How to reconcile?
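
The hardware preference for SOA shows up in the address stride between adjacent threads. The sketch below makes that concrete; the field types, the element count, and the names `aos_elem`, `soa_data`, `aos_field1_offset`, and `soa_field1_offset` are hypothetical stand-ins for the placeholders used in the slides.

```c
#include <stddef.h>

#define NNN 1024  /* hypothetical element count (the slides leave it as nnn) */

/* AOS: one record per element. Field types are hypothetical stand-ins for
 * the slides' type1 and type2. */
struct aos_elem { float field1; double field2; };

/* SOA: one array per field. */
struct soa_data { float field1[NNN]; double field2[NNN]; };

/* Byte offset of field1 for the element that thread i touches, AOS layout. */
size_t aos_field1_offset(size_t i) {
    return i * sizeof(struct aos_elem) + offsetof(struct aos_elem, field1);
}

/* Byte offset of field1 for the element that thread i touches, SOA layout. */
size_t soa_field1_offset(size_t i) {
    return offsetof(struct soa_data, field1) + i * sizeof(float);
}
```

Adjacent threads reading field1 are `sizeof(struct aos_elem)` bytes apart under AOS but only `sizeof(float)` bytes apart under SOA, so a warp's loads of field1 form one contiguous run in the SOA layout and a strided pattern in the AOS layout.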

Descriptors. AKA “capabilities”; for example, Plessey 250, Cambridge CAP, Intel 432. D3D employs a form of descriptor: “resource descriptors” are capabilities. Major language issue for parallel programming?

Diversity: CPU-GPU Détente? Really SISD vs. SIMT. Sequential applications on SIMT hardware? Conversely, thread parallel applications on multi-core scalar machines? Room for both?

Coherent Caches? Some small planes have built-in parachutes. Really a good idea? Fact: existing GPUs don’t support cache coherency. Bad? Should coherent caches be added?

The Future Revisited. So what is the future in high performance computing? 1. SIMT 2. Lots of cores 3. Clouds

The Demise of ILP. Uniprocessor performance improvements are crawling to a halt. Very hard to architecturally extract more ILP from single threads. [Chart labels: 52%/year; 19%/year; ps/gate 19%; gates/clock 9%; clocks/inst 18%]
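
One plausible reading of the chart labels, offered as an assumption because the chart itself does not survive in this transcript: the three component rates (ps/gate, gates/clock, clocks/inst) compound to roughly the 52%/year figure, and the 19%/year figure matches the ps/gate term alone.

```latex
\[
1.19 \times 1.09 \times 1.18 \approx 1.53
\]
```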

Parallel Processing. Conjecture: most problems worth solving can be solved via a parallel program. SIMT is fundamentally a better model than either SIMD or MIMD.

Scaling. Can a single GPU do it all? Systems have to scale to multiple boxes. Programming systems have to scale with them.

Final Thoughts. The future is bright for parallel programming. Future supercomputers = networked SIMT-based processing systems. Thanks! mshebanow@nvidia.com
