Mattan Erez, The University of Texas at Austin
EE382N (20): Computer Architecture – Parallelism and Locality, Fall 2011
Lecture 16 – GPUs (I)

A GPU Renders 3D Scenes
A Graphics Processing Unit (GPU) accelerates the rendering of 3D scenes.
Input: a description of the scene – geometry (triangles), colors, lights, effects, textures.
Output: colored pixels to be displayed on a screen.

Adding Programmability to the Graphics Pipeline
A 3D application or game issues 3D API commands (OpenGL or Direct3D). The API marks the CPU – GPU boundary: the GPU front end consumes the GPU command and data stream.
Fixed pipeline stages: GPU front end → primitive assembly (assembled polygons, lines, and points) → rasterization & interpolation (pixel location stream) → raster operations → framebuffer (pixel updates).
Programmable stages: a programmable vertex processor turns pre-transformed vertices (fed by the vertex index stream) into transformed vertices, and a programmable fragment processor turns rasterized pre-transformed fragments into transformed fragments.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Vertex and Fragment Processing Share Unified Processing Elements
With separate vertex-shader and pixel-shader hardware, load balancing in HW is a problem: a heavy geometry workload saturates the vertex shader and leaves pixel-shader hardware idle (Perf = 4), while a heavy pixel workload saturates the pixel shader and leaves vertex-shader hardware idle (Perf = 8).
© NVIDIA Corp., 2007

Vertex and Fragment Processing Share Unified Processing Elements
With a unified shader, load balancing in SW is easier: the same hardware runs the vertex and pixel portions of either workload, so both the heavy geometry workload and the heavy pixel workload achieve Perf = 11.
© NVIDIA Corp., 2007

Make the Compute Core the Focus of the Architecture
The future of GPUs is programmable processing, so build the architecture around the processor. Processors execute computing threads in an alternative operating mode specifically for computing.
Block diagram: the host feeds an input assembler and a thread execution manager; thread issue units (vertex, geometry, pixel; setup/raster/zcull) dispatch work to arrays of streaming processors (SP) with L1 caches and texture fetch (TF) units backed by a parallel data cache; load/store paths connect through L2 to global memory and the framebuffer (FB). The host generates thread grids based on kernel calls.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

The NVIDIA GeForce Graphics Pipeline
Host → vertex control (with vertex cache) → VS/T&L → triangle setup → raster → shader (with texture cache) → ROP → FBI → frame buffer memory.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Another View of the 3D Graphics Pipeline
Host → vertex control → VS: transform → VS: geometry → VS: lighting → VS: setup → raster → FS 0 → FS 1.
The vertex stages pass along a stream of vertices (NV ~100K); the rasterizer converts it into a stream of fragments (NF ~10M) that flows through the fragment stages.

Stream Execution Model
Data-parallel streams of data flow through processing kernels (Kernel 1 → Kernel 2).
The unit of execution is the processing of one stream element by one kernel – defined as a thread.

Stream Execution Model
Because streams are very long and their elements are independent, we can partition the streams into chunks, called strips or blocks.
The unit of execution is then the processing of one block of data by one kernel – defined as a thread block.
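
As a concrete rendering of this model in CUDA terms – a minimal sketch, where the kernel name, stream length, and block size are assumptions rather than lecture material – each thread handles one stream element and each thread block handles one strip:

    // Minimal CUDA sketch of the stream model above. The "work" is a
    // stand-in for a real processing kernel.
    #include <cuda_runtime.h>

    __global__ void kernel1(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one stream element per thread
        if (i < n)
            out[i] = in[i] * 2.0f;                      // placeholder for kernel work
    }

    int main(void)
    {
        const int n = 1 << 20;                 // stream length (assumed)
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));

        const int block = 256;                 // one "strip"/block of the stream
        const int grid  = (n + block - 1) / block;
        kernel1<<<grid, block>>>(in, out, n);  // grid of thread blocks covers the stream

        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }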

From Shader Code to a Teraflop: How Shader Cores Work
Kayvon Fatahalian, CMU (the following slides are © Kayvon Fatahalian, 2008)

What's in a GPU?
A heterogeneous chip multi-processor, highly tuned for graphics: shader cores, texture units (Tex), input assembly, rasterizer, output blend, video decode, and a work distributor.
HW or SW? One of the major debates you'll see in graphics in the coming years is whether the scheduling and work-distribution logic should be provided as highly optimized hardware, or be implemented as a software program on the programmable cores.

A diffuse reflectance shader

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv)
    {
        float3 kd;
        kd = myTex.Sample(mySamp, uv);
        kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
        return float4(kd, 1.0);
    }

The shader takes as input the data for one fragment and outputs the shaded color of that fragment. One of the interesting things about graphics is how independent the fragments are – data parallelism.
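
For readers who think in CUDA rather than HLSL, a hedged sketch of the same per-fragment computation follows: one thread shades one fragment. The texture sample is replaced by a plain array read, and all names and data layouts here are illustrative assumptions, not the lecture's code:

    #include <cuda_runtime.h>

    __device__ float clamp01(float v) { return fminf(fmaxf(v, 0.0f), 1.0f); }

    __global__ void diffuseShade(const float3 *norm, const float3 *albedo,
                                 float3 lightDir, float4 *color, int nFrag)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one fragment per thread
        if (i >= nFrag) return;

        float3 kd = albedo[i];                          // stands in for myTex.Sample
        float  d  = clamp01(lightDir.x * norm[i].x +    // dot(lightDir, norm), clamped
                            lightDir.y * norm[i].y +
                            lightDir.z * norm[i].z);
        color[i] = make_float4(kd.x * d, kd.y * d, kd.z * d, 1.0f);
    }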

Compile shader
One unshaded fragment input record goes in; one shaded fragment output record comes out.

Source:

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv)
    {
        float3 kd;
        kd = myTex.Sample(mySamp, uv);
        kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
        return float4(kd, 1.0);
    }

Compiled:

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)
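
An analogous source-to-ISA step exists in the CUDA toolchain (an aside, not the lecture's workflow; the file name and GPU architecture below are assumptions): nvcc can emit PTX, the virtual ISA, and cuobjdump can show the machine code, much like the compiled shader above.

    nvcc -ptx diffuse.cu -o diffuse.ptx          # virtual ISA (PTX)
    nvcc -c -arch=sm_70 diffuse.cu -o diffuse.o
    cuobjdump -sass diffuse.o                    # actual machine code (SASS)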

Execute shader
A simple processing core has a module for fetching and decoding instructions (fetch/decode), a module for executing the instructions (I'll call it an ALU), and some state that defines the environment in which instructions are executed – the execution context.
The core steps through the compiled shader one instruction at a time:

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul    r3, v0, cb0[0]
    madd   r3, v1, cb0[1], r3
    madd   r3, v2, cb0[2], r3
    clmp   r3, r3, l(0.0), l(1.0)
    mul    o0, r0, r3
    mul    o1, r1, r3
    mul    o2, r2, r3
    mov    o3, l(1.0)

CPU-“style” cores
Around the fetch/decode, ALU, and execution context, a CPU-style core spends area on out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a data cache (a big one).

Slimming down
Idea #1: Remove the components that help a single instruction stream run fast, leaving just the fetch/decode, the ALU (execute), and the execution context.

Two cores (two fragments in parallel)
Duplicate the slimmed-down core: each core has its own fetch/decode, ALU, and execution context, and each runs the compiled diffuseShader on its own fragment (fragment 1 and fragment 2).

Four cores (four fragments in parallel)
Four copies of the slimmed-down core (fetch/decode, ALU, execution context).

Sixteen cores (sixteen fragments in parallel)
16 cores = 16 simultaneous instruction streams.

Instruction stream coherence
But... many fragments should be able to share an instruction stream! Every fragment runs the same compiled diffuseShader code.

Recall: simple processing core
Fetch/decode, ALU (execute), and execution context.

SIMD processing: add ALUs
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs.
Pack the core full of ALUs (ALU 1–8), with a small context (Ctx) per fragment plus shared context data. We are not going to increase our core's ability to decode instructions: we will decode 1 instruction and execute it on all 8 ALUs.

Modifying the shader
The original compiled shader (above) processes one fragment using scalar ops on scalar registers. How can we make use of all these ALUs?

Modifying the shader
The new compiled shader processes 8 fragments using vector ops on vector registers. Just have the shader program work on 8 fragments at a time, replacing each scalar operation with an 8-wide vector one:

    <VEC8_diffuseShader>:
    VEC8_sample vec_r0, vec_v4, t0, vec_s0
    VEC8_mul    vec_r3, vec_v0, cb0[0]
    VEC8_madd   vec_r3, vec_v1, cb0[1], vec_r3
    VEC8_madd   vec_r3, vec_v2, cb0[2], vec_r3
    VEC8_clmp   vec_r3, vec_r3, l(0.0), l(1.0)
    VEC8_mul    vec_o0, vec_r0, vec_r3
    VEC8_mul    vec_o1, vec_r1, vec_r3
    VEC8_mul    vec_o2, vec_r2, vec_r3
    VEC8_mov    vec_o3, l(1.0)

Modifying the shader
The program now processes 8 fragments at a time, with all the work for each fragment carried out by 1 of the 8 ALUs. Notice that part of the context is also replicated to store execution state for the 8 fragments – for example, the registers.

128 fragments in parallel
16 cores = 128 ALUs = 16 simultaneous instruction streams.

128 [ vertices / fragments / primitives / CUDA threads / OpenCL work items / compute shader threads ] in parallel

But what about branches?
Time (clocks) runs down the slide while ALU 1 ... ALU 8 execute one shared instruction stream:

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

The condition comes out differently per fragment (e.g., T T F T F F F F across the eight ALUs), so the core executes both sides of the branch, masking off the ALUs whose fragments took the other path. Not all ALUs do useful work! Worst case: 1/8 performance.
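
In CUDA terms the same hazard looks like the hedged sketch below (the kernel is ours; Ks, Ka, and expo are adapted from the slide's names): threads of a warp that disagree on the condition execute both paths serially, with inactive lanes masked off.

    // Divergence sketch: lanes of a warp that take different sides of the
    // branch are serialized by the hardware.
    __global__ void reflectance(const float *x, float *refl, int n,
                                float Ks, float Ka, float expo)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float r;
        if (x[i] > 0.0f) {            // per-thread condition: lanes may diverge
            float y = powf(x[i], expo);
            y *= Ks;
            r = y + Ka;               // runs with the "false" lanes masked off
        } else {
            r = Ka;                   // then this side, with "true" lanes masked
        }
        refl[i] = r;                  // lanes reconverge after the branch
    }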

Clarification
SIMD processing does not imply SIMD instructions.
Option 1: Explicit vector instructions – Intel/AMD x86 SSE, Intel Larrabee.
Option 2: Scalar instructions with implicit HW vectorization – the hardware determines instruction stream sharing across ALUs (the amount of sharing is hidden from software). NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures.
In practice: 16 to 64 fragments share an instruction stream.
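
Both options can be illustrated in CUDA (kept as the single example language here; both kernels are illustrative sketches, not from the slides). CUDA itself is Option 2: you write scalar per-thread code and the hardware implicitly groups 32 threads into a SIMT warp. Explicit short-vector code, in the spirit of Option 1, makes the width visible in the instructions the program issues:

    // Option 2 style: scalar code per thread; the 32-wide grouping into
    // warps is done by hardware and hidden from the program.
    __global__ void saxpy_scalar(float a, const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Option 1 flavor: the vector width (4) is explicit in the code; each
    // thread issues operations on a float4, visible in the instruction stream.
    __global__ void saxpy_vec4(float a, const float4 *x, float4 *y, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 xv = x[i], yv = y[i];
            yv.x = a * xv.x + yv.x;
            yv.y = a * xv.y + yv.y;
            yv.z = a * xv.z + yv.z;
            yv.w = a * xv.w + yv.w;
            y[i] = yv;
        }
    }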

Stalls!
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
Texture access latency = 100s to 1000s of cycles.
We've removed the fancy caches and logic that help avoid stalls.

But we have LOTS of independent fragments.
Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.

Hiding shader stalls
Time (clocks) runs down the slide. The core's context storage is partitioned among four groups of fragments: Frag 1 ... 8, Frag 9 ... 16, Frag 17 ... 24, Frag 25 ... 32 (groups 1–4).
When the running group stalls (e.g., on a texture fetch), the core switches to a runnable group. We continue this process, moving to a new group each time we encounter a stall. If we have enough groups there will always be some work to do, and the processing core's ALUs never go idle.

Throughput!
Each of the four groups starts, stalls, becomes runnable again, and eventually finishes (Done!). Interleaving increases the run time of one group in exchange for the maximum throughput of many groups.
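
A back-of-the-envelope way to size this interleaving, in notation of our own rather than the slides': if a group computes for C cycles before issuing a memory operation of latency L cycles, the stall is fully hidden when the other groups can cover it with useful work:

    (N - 1) \cdot C \ge L \quad \Longrightarrow \quad N \ge 1 + \frac{L}{C}

For example, C = 40 cycles of ALU work against L = 800 cycles of texture latency gives N ≥ 21 groups, in line with the ~20–30 interleaved groups per core quoted on the “Twenty small contexts” slide below.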

Storing contexts
So far we have just described adding contexts. In reality there's a fixed pool of on-chip storage (e.g., 32 KB) that is partitioned to hold contexts. Instead of using this on-chip storage as a traditional data cache, GPUs choose to use it to hold contexts.

Twenty small contexts (maximal latency-hiding ability)
Shading performance relies on large-scale interleaving: the number of interleaved groups per core is ~20–30. These could be separate hardware-managed contexts, or software-managed.

Twelve medium contexts
Fewer contexts fit on chip, so the chip can hide less latency: a higher likelihood of stalls.

Four large contexts (low latency-hiding ability)
You lose performance when shaders use a lot of registers: only a few contexts fit, so little latency can be hidden.
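
In CUDA terms this context-size trade-off is called occupancy: the more registers (per-thread context) a kernel uses, the fewer blocks fit on a multiprocessor, and the less latency can be hidden. A hedged sketch using CUDA's occupancy query; the kernel is a trivial stand-in:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void shade(float *out) { out[threadIdx.x] = 0.0f; }

    int main(void)
    {
        const int blockSize = 256;          // threads per block (assumed)
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, shade, blockSize, /*dynamicSmemBytes=*/0);
        // Fewer resident blocks -> fewer warps to switch among -> less
        // latency hiding: exactly the "four large contexts" case above.
        printf("Resident blocks per SM at %d threads/block: %d\n",
               blockSize, blocksPerSM);
        return 0;
    }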

Summary: three key ideas for GPU architecture
1. Use many “slimmed down” cores to run in parallel.
2. Pack cores full of ALUs by sharing the instruction stream across groups of fragments – driving the ALUs either with explicit SIMD vector instructions (Option 1) or with implicit, HW-determined sharing (Option 2).
3. Avoid latency stalls by interleaving execution of many groups of fragments: when one group stalls, work on another group.
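
As a closing aside, the three ideas map directly onto what a CUDA programmer writes; the kernel below is a hypothetical recap, not lecture code:

    __global__ void kernel(const float *in, float *out, int n)
    {
        // Idea #1: each thread runs as scalar code on a slimmed-down core,
        //          with no out-of-order logic or big cache behind it.
        // Idea #2: 32 consecutive threads share one instruction stream (a
        //          warp), packing the core's ALUs from a single decode.
        // Idea #3: launching many blocks creates many resident groups per
        //          core, so the scheduler can switch past memory stalls.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] + 1.0f;
    }
    // launch: kernel<<<(n + 255) / 256, 256>>>(in, out, n);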

GPU block diagram key
= single “physical” instruction stream fetch/decode (functional unit control)
= SIMD programmable functional unit (FU), control shared with other functional units; this functional unit may contain multiple 32-bit “ALUs”
= 32-bit mul-add unit
= 32-bit multiply unit
= execution context storage
= fixed-function unit

NVIDIA GeForce 8800/9800
NVIDIA-speak: 128 stream processors; “SIMT execution” (automatic HW-managed sharing of the instruction stream).
Generic speak: 16 processing cores; 8 SIMD functional units per core; 1 mul-add (2 flops) + 1 mul (3 flops/clock) per functional unit; best case 128 mul-adds + 128 muls per clock; at a 1.2 GHz clock, 16 * 8 * (2 + 1) * 1.2 ≈ 460 GFLOPS.
Mapping data parallelism to the chip: the instruction stream is shared across 32 fragments (16 for vertices); 8 fragments run on the 8 SIMD functional units in one clock; each instruction is repeated for 4 clocks (2 clocks for vertices).
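
The peak-number arithmetic on this and the next few slides follows one pattern, written out here in our own notation:

    \text{peak GFLOPS} = N_{\text{cores}} \times N_{\text{SIMD units/core}} \times \frac{\text{flops}}{\text{unit} \cdot \text{clock}} \times f\,[\text{GHz}]

For this chip, 16 × 8 × 3 × 1.2 ≈ 460. (For the GTX 280 on the next slide, 30 × 8 × 3 × 1.3 ≈ 936; NVIDIA's quoted 933 uses the exact 1.296 GHz shader clock.)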

NVIDIA GeForce 8800/9800 (block diagram)
Zcull/clip/rast and a work distributor feed the array of cores, grouped around shared texture units (Tex); results flow to output blend.

NVIDIA GeForce GTX 280
NVIDIA-speak: 240 stream processors; “SIMT execution” (automatic HW-managed sharing of the instruction stream).
Generic speak: 30 processing cores; 8 SIMD functional units per core; 1 mul-add (2 flops) + 1 mul (3 flops/clock) per functional unit; best case 240 mul-adds + 240 muls per clock; at a 1.3 GHz clock, 30 * 8 * (2 + 1) * 1.3 ≈ 933 GFLOPS.
Mapping data parallelism to the chip: the instruction stream is shared across 32 “threads”; 8 threads run on the 8 SIMD functional units in one clock; each instruction is repeated for 4 clocks.

NVIDIA GeForce GTX 280 (block diagram)
Zcull/clip/rast and a work distributor feed the 30 cores, arranged in clusters around shared texture units (Tex); results flow to output blend.

NVIDIA Fermi
NVIDIA-speak: 512 CUDA processors (formerly stream processors); “SIMT execution” (automatic HW-managed sharing of the instruction stream).
Generic speak: 16 processing cores; 32 SIMD functional units per core; 1 FPU (FMA, 2 flops) + 1 INT per functional unit; best case 512 FP mul-adds per clock; at a 1.5 GHz clock, 16 * 32 * 2 * 1.5 ≈ 1.5 TFLOPS (a more realistic peak than the earlier chips').
Mapping data parallelism to the chip: the instruction stream is shared across 32 threads; 32 threads run on the 16 SIMD functional units over 2 clocks (each instruction repeated for 2 clocks?).

ATI Radeon 4870
AMD/ATI-speak: 800 stream processors; automatic HW-managed sharing of the scalar instruction stream (like “SIMT”).
Generic speak: 10 processing cores; 16 SIMD functional units per core; 5 mul-adds per functional unit (5 * 2 = 10 flops/clock); best case 800 mul-adds per clock; at a 750 MHz clock, 10 * 16 * 5 * 2 * 0.75 = 1.2 TFLOPS.
Mapping data parallelism to the chip: the instruction stream is shared across 64 fragments; 16 fragments run on the 16 SIMD functional units in one clock; each instruction is repeated for 4 consecutive clocks.

ATI Radeon 4870 (block diagram)
Zcull/clip/rast and a work distributor feed the cores, grouped around shared texture units (Tex); results flow to output blend.

Additional information on “supplemental slides” and at http://graphics.stanford.edu/~kayvonf/gblog

Make the Compute Core the Focus of the Architecture
Same block diagram as before: the host feeds an input assembler and a thread execution manager; thread issue units dispatch work to SP arrays with L1/TF units and a parallel data cache, with load/store through L2 to global memory and the framebuffer.
The thread execution manager manages thread blocks; it used to be that only one kernel could run at a time.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Next-Gen GPU Architecture: Fermi
3 billion transistors
Over 2x the cores (512 total)
~2x the memory bandwidth
L1 and L2 caches
8x the peak DP performance
ECC
C++
Announced Sept. 2009

Fermi Focus Areas
Expand the performance sweet spot of the GPU: caching, concurrent kernels, FP64, 512 cores, GDDR5 memory.
Bring more users and more applications to the GPU: C++, Visual Studio integration, ECC.
(Die diagram labels: DRAM I/F, HOST I/F, GigaThread, L2.)