
1 EE382N (20): Computer Architecture – Parallelism and Locality, Spring 2015
Lecture 09 – GPUs (II)
Mattan Erez, The University of Texas at Austin

2 Recap
Streaming model:
- Use many “slimmed down cores” to run in parallel
- Pack cores full of ALUs (by sharing an instruction stream across groups of fragments)
  - Option 1: explicit SIMD vector instructions
  - Option 2: implicit sharing managed by hardware
- Avoid latency stalls by interleaving execution of many groups of fragments
  - When one group stalls, work on another group
- Drive these ALUs using explicit SIMD instructions, or implicitly via hardware-determined sharing
Kayvon Fatahalian, 2008

3 Make the Compute Core The Focus of the Architecture
The future of GPUs is programmable processing, so build the architecture around the processor
- Processors execute computing threads
- Alternative operating mode specifically for computing
- The Thread Execution Manager manages thread blocks; it used to support only one kernel at a time
[Block diagram: Host, Input Assembler, Setup / Rstr / ZCull, Vtx / Geom / Pixel thread issue, Thread Execution Manager, thread processors (SPs) with L1 and texture fetch (TF), Parallel Data Cache, load/store path to L2, FB, and Global Memory]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

4 Graphics Processing Units (GPUs)
- General-purpose many-core accelerators
- Use simple in-order shader cores, in the tens to hundreds
- Shader core == Streaming Multiprocessor (SM) in NVIDIA GPUs
- Scalar frontend (fetch & decode) + parallel backend
- Amortizes the cost of the frontend and control

5 CUDA exposes a hierarchy of data-parallel threads
- SPMD model: a single kernel is executed by all scalar threads
- Kernel / Thread-block
- Multiple thread blocks (cooperative thread arrays, CTAs) compose a kernel, as in the sketch below
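As a concrete (and hedged) illustration of the kernel / thread-block level of this hierarchy, the minimal CUDA program below launches one kernel as a grid of many thread blocks; the kernel name `scale`, the block size, and the array size are invented for the example.

#include <cuda_runtime.h>

// Each scalar thread runs the same kernel (SPMD) on one element.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] = a * x[i];
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // The kernel (grid) is composed of multiple thread blocks (CTAs):
    // here, 4096 blocks of 256 threads each.
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}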

6 CUDA exposes a hierarchy of data-parallel threads
- SPMD model: a single kernel is executed by all scalar threads
- Kernel / Thread-block / Warp / Thread
- Multiple warps compose a thread block
- Multiple threads (32) compose a warp
- A warp is scheduled as a batch of threads
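A device-side sketch of how the warp level shows up in code: each thread can derive which warp and lane it occupies within its block. The kernel and output array are invented for the example; launch it like the kernel in the previous sketch. warpSize is a CUDA built-in, 32 on current hardware.

// Each thread records its warp and lane within the block.
__global__ void warp_ids(int *warp_encoding) {
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // which warp inside this block
    int lane = threadIdx.x % warpSize;   // position within the warp (0..31)
    // All threads of one warp are scheduled together as a single batch.
    warp_encoding[gid] = warp * 100 + lane;  // arbitrary encoding for inspection
}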

7 SIMT for balanced programmability and hardware efficiency
- SIMT: Single-Instruction, Multiple-Thread
- The programmer writes code to be executed by scalar threads
- The underlying hardware uses vector SIMD pipelines (lanes)
- HW/SW groups scalar threads to execute in vector lanes
- Enhanced programmability using SIMT:
  - Hardware/software supports conditional branches; each thread can follow its own control flow
  - Per-thread load/store instructions are also supported; each thread can reference arbitrary address regions
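A hedged sketch of what SIMT permits at the source level: threads in the same warp may take different branch directions and load from arbitrary per-thread addresses, and the hardware handles the divergence and scattered accesses. The kernel name `gather_scale` and the index array are invented for illustration; launch it like the earlier examples.

// Each thread follows its own control flow and loads from its own address.
__global__ void gather_scale(const float *in, const int *idx, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Per-thread conditional branch: threads of one warp may diverge here;
    // the SIMT hardware executes both paths, masking off inactive lanes.
    if (idx[i] % 2 == 0)
        out[i] = 2.0f * in[idx[i]];   // per-thread load from an arbitrary address
    else
        out[i] = 0.5f * in[idx[i]];
}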

8 An older GPU (Fermi), but the same picture at a high level
Expand the performance sweet spot of the GPU:
- Caching
- Concurrent kernels
- FP64
- 512 cores
- GDDR5 memory
Bring more users and more applications to the GPU:
- C++
- Visual Studio integration
- ECC
[Chip diagram: host interface, GigaThread scheduler, L2, DRAM interfaces]

9 Streaming Multiprocessor (SM)
Objective: optimize for GPU computing
- New ISA
- Revamped issue / control flow
- New CUDA core architecture
16 SMs per Fermi chip; 32 cores per SM (512 total)
64 KB of configurable L1 cache / shared memory (a sketch of the runtime API for this follows)
[SM diagram: instruction cache, scheduler / dispatch, register file, cores, load/store units x 16, special function units x 4, interconnect network, 64 KB configurable cache / shared memory, uniform cache]
Throughput per SM (ops/clk): FP32 32, FP64 16, INT 32, SFU 4, LD/ST 16
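On Fermi-class parts the 64 KB split between L1 and shared memory is selectable per kernel through the CUDA runtime; a minimal sketch (the kernel name `my_kernel` is a placeholder, and on later architectures the call is only a preference hint):

#include <cuda_runtime.h>

__global__ void my_kernel() { /* placeholder kernel body */ }

int main() {
    // Request the 48 KB shared memory / 16 KB L1 split for this kernel;
    // cudaFuncCachePreferL1 would request the opposite split.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

    my_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}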

10 SM Microarchitecture
[SM diagram, as on the previous slide: instruction cache, scheduler / dispatch, register file, cores, load/store units x 16, special function units x 4, interconnect network, 64 KB configurable cache / shared memory, uniform cache]
- New IEEE arithmetic standard: Fused Multiply-Add (FMA) for SP & DP
- New integer ALU optimized for 64-bit and extended-precision ops
[CUDA core diagram: dispatch port, operand collector, FP unit and INT unit, result queue]
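The fused multiply-add mentioned above is directly reachable from CUDA C; a device-side sketch (the kernel name is invented, and the compiler will usually also generate FMA on its own for expressions of the form a * b + c):

// Single-rounding fused multiply-add: c = a * b + c with one rounding step.
__global__ void fma_demo(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = fmaf(a[i], b[i], c[i]);  // maps to the hardware FMA instruction
}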

11 Memory Hierarchy
True cache hierarchy + on-chip shared RAM
- On-chip shared memory: a good fit for regular memory access (dense linear algebra, image processing, ...); see the tiling sketch below
- Caches: a good fit for irregular or unpredictable memory access (ray tracing, sparse matrix multiply, physics, ...)
- Separate L1 cache for each SM (16/48 KB): improves bandwidth and reduces latency
- Unified L2 cache for all SMs (768 KB): fast, coherent data sharing across all cores in the GPU
[Chip diagram: host interface, GigaThread scheduler, L2, DRAM interfaces]

Each Fermi SM has 64 KB of high-speed on-chip RAM, configurable as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. When configured with 48 KB of shared memory, programs that make extensive use of shared memory (such as electrodynamic simulations) can run up to three times faster. For programs whose memory accesses are not known beforehand, the 48 KB L1 configuration offers greatly improved performance over direct access to DRAM. Fermi also has a 768 KB unified L2 cache that services all load, store, and texture requests. The L2 is coherent across all SMs and provides efficient, high-speed data sharing across the GPU. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, benefit especially from the cache hierarchy.
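A hedged sketch of the "regular access" case that shared memory is designed for: a 1D averaging stencil in which each thread block first stages its tile of the input (plus a halo) into __shared__ storage, then every thread reads its neighborhood from the fast on-chip tile instead of DRAM. RADIUS, BLOCK, and stencil_1d are names invented for the example; the kernel is launched with BLOCK-sized thread blocks.

#define RADIUS 3
#define BLOCK  256

__global__ void stencil_1d(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];   // on-chip, shared by the block

    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + RADIUS;                    // index into the shared tile

    // Stage this block's elements (clamped at the array ends) into shared memory.
    tile[l] = in[min(g, n - 1)];
    if (threadIdx.x < RADIUS) {
        tile[l - RADIUS] = in[max(g - RADIUS, 0)];
        tile[l + BLOCK]  = in[min(g + BLOCK, n - 1)];
    }
    __syncthreads();   // all stores to the tile must finish before any reads

    if (g < n) {
        float sum = 0.0f;
        for (int d = -RADIUS; d <= RADIUS; ++d)
            sum += tile[l + d];
        out[g] = sum / (2 * RADIUS + 1);
    }
}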

12 GigaThread™ Hardware Thread Scheduler
- Hierarchically manages tens of thousands of simultaneously active threads
- 10x faster context switching on Fermi
- Overlapping kernel execution (see the streams sketch below)

One of the most important technologies of the Fermi architecture is its two-level, distributed thread scheduler. At the chip level, a global work distribution engine schedules thread blocks to the SMs, while at the SM level each warp scheduler distributes warps of 32 threads to its execution units. The first-generation GigaThread engine introduced in G80 managed up to 12,288 threads in real time. The Fermi architecture improves on this foundation by providing not only greater thread throughput (24,576 threads) but also dramatically faster context switching, concurrent kernel execution, and improved thread block scheduling.
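Concurrent kernel execution is exposed to the programmer through CUDA streams; a minimal sketch (kernel names and sizes are placeholders) in which two independent kernels become eligible to run on the GPU at the same time:

#include <cuda_runtime.h>

__global__ void kernel_a(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernel_b(float *y) { y[threadIdx.x] *= 2.0f; }

int main() {
    float *d_x, *d_y;
    cudaMalloc(&d_x, 256 * sizeof(float));
    cudaMalloc(&d_y, 256 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernels launched into different streams may execute concurrently
    // (Fermi and later); kernels in the same stream run in launch order.
    kernel_a<<<1, 256, 0, s0>>>(d_x);
    kernel_b<<<1, 256, 0, s1>>>(d_y);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}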

13 GigaThread Streaming Data Transfer Engine
- SDT: dual DMA engines
- Simultaneous CPU→GPU and GPU→CPU data transfer
- Fully overlapped with CPU and GPU processing
[Timeline diagram: for each of Kernel 0 through Kernel 3, the CPU, the two transfer engines (SDT0, SDT1), and the GPU are all active at the same time]
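From CUDA, the two copy engines are driven with pinned host memory and cudaMemcpyAsync in separate streams, so an upload and a download can be in flight at once and overlap with kernel execution; a hedged sketch with invented buffer names:

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;
    float *h_in, *h_out, *d_in, *d_out;

    // Pinned (page-locked) host buffers are required for truly asynchronous copies.
    cudaMallocHost((void **)&h_in,  bytes);
    cudaMallocHost((void **)&h_out, bytes);
    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // With dual DMA engines, the host-to-device copy in `up` and the
    // device-to-host copy in `down` can proceed simultaneously, overlapped
    // with any kernels running on the SMs.
    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down);
    cudaDeviceSynchronize();

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFreeHost(h_in);  cudaFreeHost(h_out);
    cudaFree(d_in);      cudaFree(d_out);
    return 0;
}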

14 Thread Life Cycle in HW
- A kernel is launched on the SPA (streaming processor array)
- Kernels are launched as grids of thread blocks
- Thread blocks are serially distributed to all the SMs
  - Potentially more than one thread block per SM
  - At least 96 threads per block
- Each SM launches warps of threads
  - Two levels of parallelism
- The SM schedules and executes warps that are ready to run
- As warps and thread blocks complete, resources are freed
  - The SPA can then distribute more thread blocks (see the sketch below)
[Diagram: the Host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the Device; each grid is a 2D array of thread blocks, and each block, e.g. Block (1, 1), is a 2D array of threads]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
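A sketch of the launch side of this life cycle: the host enqueues a grid with many more blocks than there are SMs, and the hardware distributes blocks across the SMs and backfills as earlier blocks retire. The device query just makes the SM count visible; the kernel and variable names are illustrative.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = sqrtf((float)i);   // placeholder per-thread work
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs on device 0: %d\n", prop.multiProcessorCount);

    const int threads = 128;    // comfortably above the 96-thread guideline
    const int blocks  = 4096;   // far more thread blocks than SMs
    float *d_x;
    cudaMalloc(&d_x, (size_t)blocks * threads * sizeof(float));

    // The hardware scheduler distributes these blocks across the SMs,
    // possibly several per SM, and issues new ones as earlier blocks finish.
    busy<<<blocks, threads>>>(d_x);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}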

