EE382N (20): Computer Architecture - Parallelism and Locality, Spring 2015
Lecture 09 – GPUs (II)
Mattan Erez, The University of Texas at Austin

Recap: the streaming model
- Use many “slimmed down cores” to run in parallel.
- Pack cores full of ALUs by sharing an instruction stream across groups of fragments:
  - Option 1: explicit SIMD vector instructions.
  - Option 2: implicit sharing managed by hardware.
- Avoid latency stalls by interleaving execution of many groups of fragments: when one group stalls, work on another group.
- Drive these ALUs either with explicit SIMD instructions or implicitly via hardware-determined sharing.
Kayvon Fatahalian, 2008

Make the Compute Core the Focus of the Architecture
- The future of GPUs is programmable processing, so build the architecture around the processor.
- Processors execute computing threads; an alternative operating mode exists specifically for computing.
- [GPU block diagram: Host and Input Assembler feed vertex/geometry/pixel thread issue (with Setup / Rstr / ZCull) and a Thread Execution Manager; arrays of SPs (thread processors) with L1, texture (TF), and parallel data caches; load/store path to L2, the frame buffer (FB), and global memory.]
- The Thread Execution Manager manages thread blocks; originally only one kernel ran at a time.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Graphics Processing Units (GPUs)
- General-purpose many-core accelerators.
- Use simple in-order shader cores, in the tens to hundreds.
- Shader core == Streaming Multiprocessor (SM) in NVIDIA GPUs.
- Scalar frontend (fetch & decode) + parallel backend.
- Amortizes the cost of the frontend and control.

CUDA exposes a hierarchy of data-parallel threads
- SPMD model: a single kernel is executed by all scalar threads.
- Kernel / thread-block: multiple thread-blocks (cooperative thread arrays, CTAs) compose a kernel.
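As a concrete illustration of the kernel / thread-block level of the hierarchy, here is a minimal CUDA sketch (the kernel name, sizes, and data are invented for illustration): one scalar kernel is launched over a grid of thread-blocks, and every scalar thread computes its own element.

```cuda
#include <cuda_runtime.h>

// One scalar kernel, executed by every thread in the grid (SPMD).
__global__ void scale(float *x, float a, int n) {
    // Each thread derives a unique global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    // The grid (kernel) is composed of thread-blocks (CTAs); here 256 threads per block.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(x, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```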

CUDA exposes a hierarchy of data-parallel threads
- SPMD model: a single kernel is executed by all scalar threads.
- Kernel / thread-block / warp / thread: multiple warps compose a thread-block; multiple threads (32) compose a warp.
- A warp is scheduled as a batch of threads.
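A small sketch of how the warp level falls out of the hierarchy (kernel name and launch sizes are illustrative): warps are never declared in the source, but each group of 32 consecutive threads in a block is scheduled by the hardware as one batch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warps are implicit: fixed-size groups of 32 consecutive threads in a block.
__global__ void whoAmI() {
    int linearTid = threadIdx.x;          // 1-D block for simplicity
    int warpId = linearTid / warpSize;    // which warp within the block
    int laneId = linearTid % warpSize;    // position within the warp (0..31)
    if (laneId == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, linearTid);
}

int main() {
    whoAmI<<<2, 128>>>();   // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();
    return 0;
}
```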

SIMT for balanced programmability and HW efficiency
- SIMT: Single-Instruction, Multiple-Thread.
- The programmer writes code to be executed by scalar threads; the underlying hardware uses vector SIMD pipelines (lanes), and HW/SW groups scalar threads to execute in the vector lanes.
- Enhanced programmability using SIMT:
  - Hardware/software supports conditional branches: each thread can follow its own control flow.
  - Per-thread load/store instructions are also supported: each thread can reference arbitrary address regions.
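To make the two SIMT properties concrete, the sketch below (array names and the computation are invented) relies on both: threads in the same warp may take different sides of a branch, and each thread loads from its own data-dependent address. The hardware handles this correctly, although divergent paths within a warp are serialized, which costs performance.

```cuda
#include <cuda_runtime.h>

// Each scalar thread follows its own control flow and issues its own
// loads/stores; the hardware maps groups of 32 threads onto SIMD lanes.
__global__ void gatherOrClamp(const float *src, const int *idx,
                              float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (idx[i] >= 0) {
        // Per-thread, data-dependent address (a gather).
        dst[i] = src[idx[i]];
    } else {
        // Divergent branch: other threads in the warp may take the path
        // above; the two paths execute one after the other.
        dst[i] = 0.0f;
    }
}
```

It would be launched like the earlier example, e.g. `gatherOrClamp<<<blocks, 256>>>(src, idx, dst, n)`.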

An older GPU (Fermi), but the high-level design is the same
- [Fermi chip diagram: host interface, GigaThread scheduler, DRAM interfaces, shared L2.]
- Expand the performance sweet spot of the GPU: caching, concurrent kernels, FP64, 512 cores, GDDR5 memory.
- Bring more users and more applications to the GPU: C++, Visual Studio integration, ECC.

Streaming Multiprocessor (SM)
- Objective: optimize for GPU computing.
  - New ISA.
  - Revamped issue / control flow.
  - New CUDA core architecture.
- 16 SMs per Fermi chip; 32 cores per SM (512 total).
- 64 KB of configurable L1 cache / shared memory.
- [SM diagram: instruction cache, scheduler and dispatch units, register file, CUDA cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache / shared memory, uniform cache.]
- Throughput per SM per clock:

  Unit        FP32  FP64  INT  SFU  LD/ST
  Ops / clk     32    16   32    4     16
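On Fermi-class parts the 48/16 KB split between shared memory and L1 can be requested per kernel through the CUDA runtime. A minimal sketch follows; the kernel, tile size, and data are placeholders.

```cuda
#include <cuda_runtime.h>

// A kernel that stages data in software-managed shared memory
// (the tile size and the computation are placeholders).
__global__ void tiledKernel(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;
}

int main() {
    // Request the 48 KB shared / 16 KB L1 split for this kernel;
    // cudaFuncCachePreferL1 would request the opposite split.
    cudaFuncSetCacheConfig(tiledKernel, cudaFuncCachePreferShared);

    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    tiledKernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```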

SM Microarchitecture
- [SM diagram: instruction cache, scheduler and dispatch units, register file, CUDA cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache / shared memory, uniform cache.]
- New IEEE 754-2008 arithmetic standard: fused multiply-add (FMA) for single and double precision.
- New integer ALU optimized for 64-bit and extended-precision operations.
- [CUDA core diagram: dispatch port, operand collector, FP unit and INT unit, result queue.]
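For reference, device code can request the fused operation explicitly through the standard CUDA math function fmaf; the kernel below is only an illustrative sketch (names and the axpy-style computation are invented).

```cuda
#include <cuda_runtime.h>

// a*x + y evaluated with a single rounding step (fused multiply-add),
// as specified by IEEE 754-2008.
__global__ void axpyFused(const float *x, const float *y,
                          float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // fmaf() maps to the hardware FMA; the compiler will usually
        // contract a * x[i] + y[i] into an FMA on its own as well.
        out[i] = fmaf(a, x[i], y[i]);
    }
}
```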

Memory Hierarchy
- True cache hierarchy plus on-chip shared RAM.
- On-chip shared memory: a good fit for regular memory access (dense linear algebra, image processing, ...).
- Caches: a good fit for irregular or unpredictable memory access (ray tracing, sparse matrix multiply, physics, ...).
- Separate L1 cache for each SM (16/48 KB): improves bandwidth and reduces latency.
- Unified L2 cache for all SMs (768 KB): fast, coherent data sharing across all cores in the GPU.
- [Fermi chip diagram: host interface, GigaThread, DRAM interfaces, L2.]

Speaker notes: Each Fermi SM has 64 KB of high-speed, on-chip RAM configurable as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. When configured with 48 KB of shared memory, programs that make extensive use of shared memory (such as electrodynamic simulations) can perform up to three times faster. For programs whose memory accesses are not known beforehand, the 48 KB L1 configuration offers greatly improved performance over direct access to DRAM. Fermi also features a 768 KB unified L2 cache that services all load, store, and texture requests. The L2 cache is coherent across all SMs and provides efficient, high-speed data sharing across the GPU. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, especially benefit from the cache hierarchy.
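The sizes quoted above can be read back on any device through the CUDA runtime; a small sketch querying the fields relevant to this slide:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    // Per-block shared memory and device-wide L2 size, as discussed above.
    printf("%s\n", prop.name);
    printf("shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("L2 cache size:           %d KB\n", prop.l2CacheSize / 1024);
    printf("SM count:                %d\n", prop.multiProcessorCount);
    return 0;
}
```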

GigaThread™ Hardware Thread Scheduler (HTS)
- Hierarchically manages tens of thousands of simultaneously active threads.
- 10x faster context switching on Fermi.
- Overlapping kernel execution.

Speaker notes: One of the most important technologies of the Fermi architecture is its two-level, distributed thread scheduler. At the chip level, a global work distribution engine schedules thread blocks to the SMs, while at the SM level each warp scheduler distributes warps of 32 threads to its execution units. The first-generation GigaThread engine introduced in G80 managed up to 12,288 threads in real time. The Fermi architecture improves on this foundation by providing not only greater thread throughput (24,576 threads) but also dramatically faster context switching, concurrent kernel execution, and improved thread block scheduling.
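From the programming-model side, concurrent kernel execution is driven through CUDA streams: independent kernels launched into different streams may run at the same time on hardware that supports it (Fermi and later). A minimal sketch, with placeholder kernels and sizes:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;   // placeholder work
}
__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= 2.0f;   // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Independent kernels in different streams are eligible to execute concurrently.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    kernelA<<<(n + 255) / 256, 256, 0, s0>>>(x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s1>>>(y, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(x); cudaFree(y);
    return 0;
}
```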

GigaThread Streaming Data Transfer (SDT) Engine
- Dual DMA engines.
- Simultaneous CPU→GPU and GPU→CPU data transfer.
- Fully overlapped with CPU/GPU processing.
- [Timeline diagram: kernels 0-3 pipelined so that, for each kernel, the CPU work, the CPU→GPU transfer (SDT0), the GPU execution, and the GPU→CPU transfer (SDT1) overlap with those of the neighboring kernels.]
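In CUDA this overlap is expressed with pinned host memory, cudaMemcpyAsync, and streams; a sketch of the pattern (buffer names, chunking, and the kernel are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, chunks = 4, chunk = n / chunks;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned host memory, required for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&s[c]);

    // Each chunk's H2D copy, kernel, and D2H copy go into its own stream;
    // the dual DMA engines let transfers in opposite directions and kernels
    // from other streams overlap.
    for (int c = 0; c < chunks; ++c) {
        float *hp = h + c * chunk, *dp = d + c * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, s[c]);
        process<<<(chunk + 255) / 256, 256, 0, s[c]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```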

Thread Life Cycle in HW
- A kernel is launched on the SPA (streaming processor array); kernels are known as grids of thread blocks.
- Thread blocks are serially distributed to all the SMs: potentially more than one thread block per SM; at least 96 threads per block.
- Each SM launches warps of threads: two levels of parallelism.
- The SM schedules and executes warps that are ready to run.
- As warps and thread blocks complete, resources are freed and the SPA can distribute more thread blocks.
- [Diagram: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid is a 2-D array of blocks, and each block (e.g., Block (1,1)) is an array of threads.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
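How many thread blocks actually become resident on one SM depends on the kernel's register, shared-memory, and thread usage. The CUDA runtime can report this; the sketch below uses an API that postdates the Fermi-era slides, and the kernel is a placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;   // placeholder work
}

int main() {
    int blockSize = 256;
    int blocksPerSM = 0;

    // How many resident blocks of `work` fit on one SM at this block size,
    // given its register and shared-memory usage (0 bytes dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, work, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%d blocks/SM x %d SMs = %d concurrently resident blocks\n",
           blocksPerSM, prop.multiProcessorCount,
           blocksPerSM * prop.multiProcessorCount);
    return 0;
}
```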