1 Mattan Erez The University of Texas at Austin
EE382N (20): Computer Architecture - Parallelism and Locality, Fall 2011. Lecture 16 – GPUs (I). Mattan Erez, The University of Texas at Austin.

2 A GPU Renders 3D Scenes
A Graphics Processing Unit (GPU) accelerates the rendering of 3D scenes. Input: a description of the scene, i.e. geometry (triangles), colors, lights, effects, and textures. Output: colored pixels to be displayed on a screen.

3 Adding Programmability to the Graphics Pipeline
A 3D application or game issues 3D API commands (the 3D API is OpenGL or Direct3D). Across the CPU-GPU boundary, a GPU command and data stream reaches the GPU front end. From there: a vertex index stream feeds the Programmable Vertex Processor (pre-transformed vertices in, transformed vertices out); Primitive Assembly emits assembled polygons, lines, and points; Rasterization & Interpolation produces a pixel location stream of rasterized pre-transformed fragments for the Programmable Fragment Processor (transformed fragments out); Raster Operations apply the pixel updates to the framebuffer.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

4 Vertex and Fragment Processing Share Unified Processing Elements
With separate vertex-shader and pixel-shader hardware, load balancing the HW is a problem: under a heavy geometry workload the pixel-shader units sit idle (Perf = 4), while under a heavy pixel workload the vertex-shader units sit idle (Perf = 8).
© NVIDIA Corp., 2007

5 Vertex and Fragment Processing Share Unified Processing Elements
With a unified shader, load balancing in SW is easier: the same pool of units runs both the vertex workload and the pixel workload, so a heavy geometry workload and a heavy pixel workload both achieve Perf = 11.
© NVIDIA Corp., 2007
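A minimal sketch of the arithmetic behind these Perf numbers, as host-side code in a CUDA file. The unit counts (a fixed 4-vertex/8-pixel split versus a unified pool of 12) are assumptions chosen only to reproduce the slide's figures of 4, 8, and 11:

// unified_vs_fixed.cu -- illustrative arithmetic only. The unit counts
// (4 vertex + 8 pixel units vs. a unified pool of 12) are assumptions
// chosen to reproduce the slide's Perf = 4, 8, and 11.
#include <cstdio>

// Fixed split: the heavy stage is capped by its own unit count.
int perfFixed(int vertexUnits, int pixelUnits, bool heavyGeometry) {
    return heavyGeometry ? vertexUnits : pixelUnits;
}

// Unified pool: the light stage occupies one unit, the heavy stage the rest.
int perfUnified(int totalUnits) {
    return totalUnits - 1;
}

int main() {
    printf("fixed split, heavy geometry: Perf = %d\n", perfFixed(4, 8, true));   // 4
    printf("fixed split, heavy pixel:    Perf = %d\n", perfFixed(4, 8, false));  // 8
    printf("unified,     either case:    Perf = %d\n", perfUnified(12));         // 11
    return 0;
}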

6 Make the Compute Core The Focus of the Architecture
The future of GPUs is programmable processing, so build the architecture around the processor. Processors execute computing threads in an alternative operating mode specifically for computing; the host generates grids of threads based on kernel calls.
[G80 block diagram: Host and Input Assembler feed Vtx/Geom/Pixel thread issue, Setup/Rstr/ZCull, and a Thread Execution Manager; arrays of thread processors (SP) with texture fetch (TF), L1 caches, and a Parallel Data Cache; L2 caches and framebuffer (FB) partitions; load/store access to global memory.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

7 The NVIDIA GeForce Graphics Pipeline
[Block diagram:] Host → Vertex Control (with Vertex Cache) → VS/T&L → Triangle Setup → Raster → Shader (with Texture Cache) → ROP → FBI → Frame Buffer Memory.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

8 Another View of the 3D Graphics Pipeline
Host → Vertex Control → stream of vertices (NV ~100K) → VS: Transform → stream of vertices (NV) → VS: Geometry → stream of vertices (NV) → VS: Lighting → stream of vertices (NV) → VS: Setup → Raster → stream of fragments (NF ~10M) → FS 0 → stream of fragments (NF) → FS 1 → stream of fragments (NF).

9 Stream Execution Model
Data-parallel streams of data flow between processing kernels. The unit of execution is the processing of one stream element by one kernel, defined as a thread. [Diagram: Kernel 1 → Kernel 2.]

10 Stream Execution Model
Because streams are very long and their elements are independent, we can partition the streams into chunks, called strips or blocks. The unit of execution is then the processing of one block of data by one kernel, defined as a thread block. [Diagram: Kernel 1 → Kernel 2.]
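As a concrete sketch of this stream model in CUDA terms (the kernel and its names are illustrative, not from the lecture): each thread processes one stream element, and each thread block processes one chunk, or strip, of the stream:

// stream_kernel.cu -- a minimal CUDA sketch of the stream model above:
// one thread processes one stream element; one thread block processes
// one chunk ("strip"/"block") of the stream.
#include <cuda_runtime.h>

__global__ void kernel1(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        out[i] = 2.0f * in[i];  // independent per-element work
}

int main() {
    const int n = 1 << 20;  // a long stream of elements
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    const int block = 256;                      // elements per strip
    const int grid  = (n + block - 1) / block;  // number of strips
    kernel1<<<grid, block>>>(in, out, n);       // "Kernel 1" of the diagram
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}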

11 From Shader Code to a Teraflop: How Shader Cores Work
Kayvon Fatahalian, CMU

12 What's in a GPU?
A heterogeneous chip multi-processor, highly tuned for graphics: shader cores, texture units (Tex), Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor. HW or SW? One of the major debates you'll see in graphics in the coming years is whether the scheduling and work-distribution logic should be provided as highly optimized hardware or be implemented as a software program on the programmable cores.

13 A diffuse reflectance shader
sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

The shader takes as input the data for one fragment and outputs the shaded color of that fragment. One of the interesting things about graphics: fragments are independent, which gives data parallelism.
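For comparison, a CUDA sketch of the same computation with one thread per fragment. This is illustrative only: the texture sample is simplified to a plain array read, lightDir is passed by value rather than living in a constant buffer, and all names and the flat-array layout are assumptions made to keep the sketch self-contained:

// diffuse.cu -- illustrative CUDA analogue of diffuseShader: one thread
// shades one fragment. Texture sampling is simplified to an array read.
#include <cuda_runtime.h>

__device__ float dot3(float3 a, float3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

__global__ void diffuseShade(const float3* texel,  // pre-sampled kd per fragment
                             const float3* norm,   // per-fragment normal
                             float3 lightDir,      // uniform across fragments
                             float4* outColor, int nFrag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread = one fragment
    if (i >= nFrag) return;
    float3 kd = texel[i];
    float ndotl = fminf(fmaxf(dot3(lightDir, norm[i]), 0.0f), 1.0f);  // clamp
    outColor[i] = make_float4(kd.x * ndotl, kd.y * ndotl, kd.z * ndotl, 1.0f);
}

int main() {
    const int n = 1024;
    float3 *tex, *nrm; float4 *out;
    cudaMalloc(&tex, n * sizeof(float3));
    cudaMalloc(&nrm, n * sizeof(float3));
    cudaMalloc(&out, n * sizeof(float4));
    diffuseShade<<<(n + 255) / 256, 256>>>(tex, nrm, make_float3(0.f, 1.f, 0.f), out, n);
    cudaDeviceSynchronize();
    cudaFree(tex); cudaFree(nrm); cudaFree(out);
    return 0;
}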

14 Compile shader
The HLSL source above compiles to a short instruction sequence. Input: 1 unshaded fragment input record. Output: 1 shaded fragment output record.

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)

15 Execute shader
Executing the compiled shader takes three things: a module for fetching and decoding instructions (Fetch/Decode), a module for executing the instructions (I'll call it an ALU), and some state that defines an environment in which instructions are executed (the Execution Context).

16-20 Execute shader (continued)
[Animation frames of the same slide: the core steps through the compiled <diffuseShader> instructions one at a time, updating the execution context as it goes.]

21 CPU-“style” cores
Alongside the Fetch/Decode unit, the ALU (Execute), and the execution context, a CPU-style core spends its transistors on out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a data cache (a big one).

22 Slimming down
Idea #1: Remove the components that help a single instruction stream run fast, leaving just Fetch/Decode, the ALU (Execute), and the execution context.

23 Two cores (two fragments in parallel)
Each core fetches, decodes, and executes its own copy of <diffuseShader> on its own fragment (fragment 1 on one core, fragment 2 on the other), each with its own Fetch/Decode unit, ALU, and execution context.

24 Four cores (four fragments in parallel)
[Four copies of the slimmed-down core, each with Fetch/Decode, ALU (Execute), and execution context.]

25 Sixteen cores (sixteen fragments in parallel)
16 cores = 16 simultaneous instruction streams.

26 Instruction stream coherence
But… many fragments should be able to share an instruction stream! Every fragment runs the same <diffuseShader> code.

27 Recall: simple processing core
[The simple core again: Fetch/Decode, ALU (Execute), Execution Context.]

28 SIMD processing
Add ALUs. Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs. Pack the core full of ALUs (ALU 1-8) backed by shared context data. We are not going to increase our core's ability to decode instructions: we will decode 1 instruction and execute it on all 8 ALUs.

29 Modifying the shader
How can we make use of all these ALUs? The original compiled shader processes one fragment using scalar ops on scalar registers.

30 Modifying the shader
Just have the shader program work on 8 fragments at a time: replace the scalar operations with 8-wide vector ones. The new compiled shader processes 8 fragments using vector ops on vector registers:

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  vec_o3, l(1.0)

31 Modifying the shader
The program now processes 8 fragments at a time, and the work for each fragment is carried out by 1 of the 8 ALUs. Notice that part of the context is also replicated to store execution state for the 8 fragments; for example, the registers are replicated (contexts 1-8 over the shared context data).
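A host-side sketch of this VEC8 transformation (all names here are invented for illustration): every register becomes 8 lanes wide, one slot per fragment, and each "instruction" loops over the lanes, the way one decoded instruction drives all 8 ALUs of the core:

// vec8_sketch.cu -- illustrative model of the VEC8 transformation: every
// register is 8 lanes wide, every scalar op becomes an 8-lane operation.
#include <cstdio>

const int W = 8;                  // SIMD width: 8 ALUs per core

struct Vec8 { float lane[W]; };   // an 8-wide "vector register" (vec_r3, ...)

Vec8 vec8_mul(Vec8 a, float s) {  // VEC8_mul: one instruction, 8 lanes
    Vec8 r;
    for (int i = 0; i < W; ++i) r.lane[i] = a.lane[i] * s;
    return r;
}

Vec8 vec8_clamp01(Vec8 a) {       // VEC8_clmp vec_r, vec_a, l(0.0), l(1.0)
    Vec8 r;
    for (int i = 0; i < W; ++i) {
        float v = a.lane[i];
        r.lane[i] = v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
    }
    return r;
}

int main() {
    Vec8 r3 = {{0.2f, -0.1f, 0.7f, 1.3f, 0.5f, 0.9f, -0.4f, 0.6f}};
    r3 = vec8_mul(r3, 2.0f);      // all 8 fragments advance together
    r3 = vec8_clamp01(r3);
    for (int i = 0; i < W; ++i) printf("%.2f ", r3.lane[i]);
    printf("\n");
    return 0;
}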

32 128 fragments in parallel
16 cores × 8 ALUs/core = 128 ALUs = 16 simultaneous instruction streams.

33 128 [vertices | primitives | fragments | CUDA threads | OpenCL work items | compute shader threads] in parallel
The same hardware runs all of these: the unit of parallel work just goes by a different name in each API.

34 But what about branches?
Consider conditional code in the shader, executed across ALUs 1-8 over time (clocks):

<unconditional shader code>
if (x > 0) {
  y = pow(x, exp);
  y *= Ks;
  refl = y + Ka;
} else {
  x = 0;
  refl = Ka;
}
<resume unconditional shader code>

35-37 But what about branches? (continued)
The condition evaluates differently across the 8 lanes (e.g. T T F T F F F F). Because all 8 ALUs share one instruction stream, the core executes both sides of the branch in turn, masking off the ALUs whose lanes did not take the current path. Not all ALUs do useful work! Worst case: 1/8 performance.
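The same effect in CUDA terms, as a hedged sketch (the kernel and parameter names are invented): threads in one warp share an instruction stream, so a data-dependent branch makes the hardware execute both paths with inactive lanes masked:

// divergence.cu -- illustrative CUDA kernel with a divergent branch.
// Lanes of one warp share an instruction stream, so when the condition
// differs across lanes, BOTH paths execute in turn with the other
// lanes masked off.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void reflectance(const float* x, float* refl, int n,
                            float expo, float Ks, float Ka) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f) {               // lanes where this is T ...
        float y = powf(x[i], expo);  // ... run this path while F lanes idle
        y *= Ks;
        refl[i] = y + Ka;
    } else {
        refl[i] = Ka;                // T lanes idle while F lanes run this
    }
}

int main() {
    const int n = 256;
    float *x, *refl;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&refl, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = (i % 2) ? 1.0f : -1.0f;  // force divergence
    reflectance<<<1, n>>>(x, refl, n, 2.0f, 0.5f, 0.1f);
    cudaDeviceSynchronize();
    printf("refl[0]=%f refl[1]=%f\n", refl[0], refl[1]);
    cudaFree(x); cudaFree(refl);
    return 0;
}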

38 Clarification
SIMD processing does not imply SIMD instructions.
Option 1: Explicit vector instructions (Intel/AMD x86 SSE, Intel Larrabee).
Option 2: Scalar instructions with implicit HW vectorization: the hardware determines the instruction-stream sharing across ALUs, and the amount of sharing is hidden from software (NVIDIA GeForce “SIMT” warps, ATI Radeon architectures).
In practice: 16 to 64 fragments share an instruction stream.

39 Stalls!
Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency is hundreds to thousands of cycles. And we've removed the fancy caches and logic that help avoid stalls.

40 But we have LOTS of independent fragments.
Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.

41-44 Hiding shader stalls
[Animation frames:] The core holds contexts for four groups of fragments (Frag 1-8, 9-16, 17-24, 25-32) in its shared context storage. It runs group 1 until that group stalls, then switches to the next runnable group.

45 Hiding shader stalls (continued)
We continue this process, moving to a new group each time we encounter a stall. If we have enough groups there will always be some work to do, and the processing core's ALUs never go idle.

46 Throughput!
We increase the run time of one group in order to maximize the throughput of many groups: each group takes longer from Start to Done!, because the core works on other groups in between, but the ALUs stay busy.
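A back-of-envelope sketch of how many groups this takes. The cycle counts are assumptions (20 cycles of ALU work between stalls, a ~400-cycle texture stall, in the range quoted on slide 39); the answer lands near the "~20-30 groups per core" figure on slide 48:

// latency_hiding.cu -- illustrative arithmetic for Idea #3, with
// assumed run and stall lengths.
#include <cstdio>

// While one group waits out its stall, the other groups must supply work.
int groupsNeeded(int runCycles, int stallCycles) {
    return 1 + (stallCycles + runCycles - 1) / runCycles;  // ceiling divide
}

int main() {
    printf("groups needed to stay busy: %d\n", groupsNeeded(20, 400));  // 21
    return 0;
}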

47 Storing contexts
The core's contexts live in a pool of context storage (e.g. 32KB). The previous slides described adding contexts; in reality there is a fixed pool of on-chip storage that is partitioned to hold contexts. Instead of using this on-chip storage as a traditional data cache, GPUs choose to use it to hold contexts.

48 Twenty small contexts (maximal latency-hiding ability)
With small contexts, twenty groups fit in the pool. Shading performance relies on large-scale interleaving; the number of interleaved groups per core is typically ~20-30. These could be separate hardware-managed contexts or software-managed.

49 Twelve medium contexts
With medium contexts, only twelve groups fit on chip. Fewer contexts fit on chip, so the chip can hide less latency, and there is a higher likelihood of stalls.

50 Four large contexts (low latency-hiding ability)
With large contexts, only four groups fit, so we lose performance when shaders use a lot of registers.
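A small sketch of the slide 47-50 trade-off. The 32KB pool comes from slide 47; the per-context byte sizes are assumptions chosen to yield the slides' 20, 12, and 4 resident groups:

// contexts.cu -- the context-count trade-off in one line of arithmetic.
#include <cstdio>

int residentGroups(int poolBytes, int bytesPerContext) {
    return poolBytes / bytesPerContext;
}

int main() {
    const int pool = 32 * 1024;  // per-core context storage (slide 47)
    printf("small  (~1.6KB): %2d contexts\n", residentGroups(pool, 1600));  // 20
    printf("medium (~2.7KB): %2d contexts\n", residentGroups(pool, 2730));  // 12
    printf("large  (  8KB ): %2d contexts\n", residentGroups(pool, 8192));  // 4
    return 0;
}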

51 Summary: three key ideas for GPU architecture
Use many “slimmed down” cores to run in parallel.
Pack the cores full of ALUs by sharing an instruction stream across groups of fragments; drive the ALUs either with explicit SIMD vector instructions (option 1) or implicitly, with the sharing managed by hardware (option 2).
Avoid latency stalls by interleaving the execution of many groups of fragments: when one group stalls, work on another group.

52 GPU block diagram key
= single “physical” instruction stream fetch/decode (functional unit control)
= SIMD programmable functional unit (FU); control shared with other functional units. This functional unit may contain multiple 32-bit “ALUs”
= 32-bit mul-add unit
= 32-bit multiply unit
= execution context storage
= fixed-function unit

53 NVIDIA GeForce 8800/9800
NVIDIA-speak: 128 stream processors; “SIMT execution” (automatic HW-managed sharing of the instruction stream).
Generic speak: 16 processing cores; 8 SIMD functional units per core; 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock). Best case: 128 mul-adds + 128 muls per clock. At a 1.2 GHz clock: 16 * 8 * (2 + 1) * 1.2 = 460 GFLOPS.
Mapping data-parallelism to the chip: the instruction stream is shared across 32 fragments (16 for vertices); 8 fragments run on the 8 SIMD functional units in one clock; each instruction is repeated for 4 clocks (2 clocks for vertices).
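A small sketch that reproduces the peak-throughput arithmetic on this and the next few slides. The parameters are the slides' own numbers; small differences from the quoted totals (e.g. 936 vs. the slide's 933 for the GTX 280) are rounding:

// peak_flops.cu -- peak = cores * (SIMD units per core)
//                       * (flops per unit per clock) * clock in GHz.
#include <cstdio>

float peakGFLOPS(int cores, int unitsPerCore, int flopsPerUnit, float ghz) {
    return cores * unitsPerCore * flopsPerUnit * ghz;
}

int main() {
    printf("GeForce 8800:   %4.0f GFLOPS\n", peakGFLOPS(16,  8,  3, 1.2f));   // ~461
    printf("GeForce GTX280: %4.0f GFLOPS\n", peakGFLOPS(30,  8,  3, 1.3f));   // ~936
    printf("Fermi:          %4.0f GFLOPS\n", peakGFLOPS(16, 32,  2, 1.5f));   // ~1536
    printf("Radeon 4870:    %4.0f GFLOPS\n", peakGFLOPS(10, 16, 10, 0.75f));  // 1200
    return 0;
}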

54 NVIDIA GeForce 8800/9800
[Chip diagram: the 16 processing cores with four texture units (Tex), Zcull/Clip/Rast, Output Blend, and the Work Distributor.]

55 NVIDIA GeForce GTX 280
NVIDIA-speak: 240 stream processors; “SIMT execution” (automatic HW-managed sharing of the instruction stream).
Generic speak: 30 processing cores; 8 SIMD functional units per core; 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock). Best case: 240 mul-adds + 240 muls per clock. At a 1.3 GHz clock: 30 * 8 * (2 + 1) * 1.3 = 933 GFLOPS.
Mapping data-parallelism to the chip: the instruction stream is shared across 32 “threads”; 8 threads run on the 8 SIMD functional units in one clock; each instruction is repeated for 4 clocks.

56 NVIDIA GeForce GTX 280
[Chip diagram: the 30 processing cores with ten texture units (Tex), Zcull/Clip/Rast, Output Blend, and the Work Distributor.]

57 NVIDIA Fermi
NVIDIA-speak: 512 CUDA processors (formerly stream processors); “SIMT execution” (automatic HW-managed sharing of the instruction stream).
Generic speak: 16 processing cores; 32 SIMD functional units per core; 1 FPU (FMA) + 1 INT unit per functional unit (2 flops/clock). Best case: 512 FP mul-adds per clock. At a 1.5 GHz clock: 16 * 32 * 2 * 1.5 = 1.5 TFLOPS (more real than the earlier figures).
Mapping data-parallelism to the chip: the instruction stream is shared across 32 threads; 16 threads run on the 16 SIMD functional units in one clock; each instruction is repeated for 2 clocks (?).

58 ATI Radeon 4870
AMD/ATI-speak: 800 stream processors; automatic HW-managed sharing of the scalar instruction stream (like “SIMT”).
Generic speak: 10 processing cores; 16 SIMD functional units per core; 5 mul-adds per functional unit (5 * 2 = 10 flops/clock). Best case: 800 mul-adds per clock. At a 750 MHz clock: 10 * 16 * 5 * 2 * 0.75 = 1.2 TFLOPS.
Mapping data-parallelism to the chip: the instruction stream is shared across 64 fragments; 16 fragments run on the 16 SIMD functional units in one clock; each instruction is repeated for 4 consecutive clocks.

59 ATI Radeon 4870
[Chip diagram: the 10 processing cores with texture units (Tex), Zcull/Clip/Rast, Output Blend, and the Work Distributor.]

60 Additional information on “supplemental slides” and at

61 Make the Compute Core The Focus of the Architecture
(Same diagram as slide 6.) The future of GPUs is programmable processing, so build the architecture around the processor. The Thread Execution Manager manages thread blocks; it used to be that only one kernel could run at a time.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

62 Next-Gen GPU Architecture: Fermi
3 billion transistors; over 2x the cores (512 total); ~2x the memory bandwidth; L1 and L2 caches; 8x the peak double-precision performance; ECC; C++ support. Announced Sept. 2009.

63 Fermi Focus Areas
[Die diagram: DRAM I/F, HOST I/F, GigaThread scheduler, L2.] Expand the performance sweet spot of the GPU: caching, concurrent kernels, FP64, 512 cores, GDDR5 memory. Bring more users and more applications to the GPU: C++, Visual Studio integration, ECC.

