1
Powering the next generation of graphics: AMD GCN Architecture
Layla Mah – Developer Relations Engineer, AMD @MissQuickstep
2
Agenda
Part 1: AMD Graphics Core Next Architecture (GCN)
Part 2: Partially Resident Textures (PRT)
[Speaker notes] This is our next era – we simply call it Graphics Core Next. From a graphics standpoint, it delivers cutting-edge features and performance while remaining flexible and scalable enough for all of our Southern Islands parts to leverage the core. GCN also delivers an amazing step up in heterogeneous computing – both a new, simpler and more powerful programming model, and sheer efficiency and performance.
3
AMD Graphics Core Next
4
GPU Evolution
1st Era: Fixed Function (prior to 2002) → 2nd Era: Simple Shaders (2002–2006) → 3rd Era: Graphics Parallel Core (2007 to present)
[Diagram: fixed-function 3D geometry transformation & lighting pipeline alongside VLIW5 and VLIW4 stream processing units, each with branch units, FMAD + special-function ALUs, and general-purpose registers]
Prior to 2002 (Fixed Function):
- Graphics-specific hardware: texture mapping/filtering, geometry processing, rasterization
- Dedicated texture and pixel caches
- Dot product and scalar multiply-add were sufficient for basic graphics tasks
- No general-purpose compute capability
2002–2006 (Simple Shaders):
- Graphics-focused programmability: DirectX 8/9
- Floating-point processing (IEEE compliance not required)
- Specialized ALUs for vertex & pixel processing; limited shaders
- More dedicated caches (vertex, texture, color, depth)
2007 to Present (Graphics Parallel Core):
- Unified shader architectures: VLIW5, flexible and optimized for graphics workloads; VLIW4, simplified and optimized for more general workloads
- More advanced caching: instruction, constant, multi-level texture/data, local/global data shares
- Basic general-purpose compute: CAL, Brook, ATI Stream
- IEEE-compliant floating-point math
- Graphics performance still the primary objective
6
GPU Evolution: 1st Era – Fixed Function (Prior to 2002)
- Graphics-specific hardware: texture mapping/filtering, multi-texturing
- "T&L Engines": geometry processing (Transform) and Lighting
- Rasterization
- Dedicated texture and pixel caches
- Dot product and scalar multiply-add were sufficient for basic graphics tasks
- No general-purpose compute capability
8
GPU Evolution: 2nd Era – Simple Shaders (2002–2006): The Rise of Shaders
- Graphics-focused programmability: DirectX 8/9, OpenGL 2.0
- Floating-point processing (IEEE not required), with different precision per IHV: ATI 24-bit full-speed; NV 16-bit full-speed, 32-bit half-speed
- Specialized ALUs for vertex & pixel processing
- Added dedicated caches
Early shader models were limited:
- VS and PS are distinct; minimal instruction sets; limited instruction slots and shader lengths
- No dynamic flow control, no looping constructs, no vertex texture fetch, no bitwise operators, no native integer ALU, etc.
10
GPU Evolution: 3rd Era – The Rise of the Unified Shader (VLIW5)
- 5-element Very Long Instruction Word (XYZWT)
- Began with Xenos and utilized from the R600 until "Cayman"
- Flexible and optimized for graphics workloads: ideal for 4-element vector and 4x4 matrix operations; vector/vector math in a single instruction, plus one transcendental-unit function per instruction
- More advanced caching: instruction, constant, multi-level texture/data, & later LDS/GDS
- Single-precision 32-bit IEEE-compliant floating-point ALUs
- More flexible: unified ALU, branch unit, dynamic flow control, vertex texture, geometry shader, tessellation engines, etc.
12
GPU Evolution: 3rd Era – Optimized for Die Area Efficiency (VLIW4)
- 4-element Very Long Instruction Word (XYZW)
- Profiling showed average VLIW utilization was < 3.4/5, so the dedicated T-unit was removed to optimize die area usage
- Each ALU has a smaller LUT; results are combined using 3-term Lagrange polynomial interpolation (one transcendental per clock per VLIW4)
- Better optimized for a combination of graphics & compute: graphics is still the primary focus, but compute is gaining attention
- Still ideal for 4-element vector and 4x4 matrix operations; fewer ALU bubbles in transcendental-light code, better utilization
- Simplified programming and optimization relative to VLIW5
- Multiple dispatch processors & separate command queues; improved support for DirectCompute™ and OpenCL™
13
Graphics Core Next Architecture
A new GPU design for a new era of computing
- Cutting-edge graphics performance and features
- High compute density with multi-tasking
- Built for power efficiency
- Optimized for heterogeneous computing, enabling the Heterogeneous System Architecture (HSA)
- Amazing scalability and flexibility
14
Graphics Core Next Architecture
A new GPU design for a new era of computing
- Unlimited resources & samplers (including unlimited UAVs/SRVs at any shader stage)
- All UAV formats can be read/write (vs. just a single uint32 in the D3D11 API spec)
- Simpler assembly language; simpler shader code (no more clauses)
- Ability to support C/C++(-like) languages
- Architectural support for traps, exceptions & debugging
- Ability to share a virtual x86-64 address space with CPU cores
15
Graphics Core Next Architecture
A new GPU design for a new era of next-generation computing…
16
GCN Compute Unit Basic GPU building block
New instruction set architecture
- Non-VLIW: vector units + scalar co-processor
- Distributed programmable scheduler: each compute unit can execute instructions from multiple kernels at once
- Increased instructions per clock per mm²
- Designed for high utilization, high throughput, and multi-tasking
[Diagram: GCN Compute Unit – Scheduler; Branch & Message Unit; Scalar Unit with 8KB Scalar Registers; 4x SIMD-16 Vector Units with 4x 64KB Vector Registers; 4 Texture Filter Units; 16 Texture Fetch Load/Store Units; 64KB Local Data Share; 16KB L1 Cache]
17
GCN Compute Unit – Specifics
1 fully programmable scalar ALU – shared by all threads of a wavefront
- Used for flow control, pointer arithmetic, etc.; has its own GPRs, scalar data cache, etc.
1 branch & message unit
- Executes branch instructions (as dispatched by the scalar unit)
4 [16-lane] vector ALUs (SIMDs)
- CU total throughput: 64 SP ops/clock
- A wavefront completes 1 SP (single-precision) op every 4 clocks, 1 DP (double-precision) ADD in 8 clocks, and 1 DP MUL/FMA/transcendental every 16 clocks
18
GCN Compute Unit – Specifics
64KB Local Data Share (LDS)
- 2x larger than the D3D11 TGSM limit (32KB per thread group)
- 32 banks, with conflict resolution; bandwidth amplification
Separate instruction decode
16KB read/write L1 vector data cache
Texture units (utilize the L1): 4 filter, 16 load/store
Scheduler (2560 threads)
- Separate decode/issue for VALU, SALU/SMEM, VMEM, LDS, GDS/Export
- Plus special instructions (NOPs, barriers, etc.) and branch instructions
19
GCN Compute Unit – SIMD Specifics
Each SIMD unit is assigned its own 40-bit program counter and an instruction buffer for 10 wavefronts
- The whole CU can have 40 wavefronts in flight, each potentially from a different work-group or kernel
Each SIMD is a 16-lane ALU
- IEEE-754 SP and DP: full-speed denormals + all rounding modes
- 32-bit FMA and 24-bit INT at full speed; DP and 32-bit INT at reduced rates (1/2 to 1/16)
- 64KB vector register file
- Issues 1 SP instruction per lane per clock; retires 64 lanes (1 wavefront) of SP ALU in 4 clocks
A GCN GPU with 32 CUs, such as the AMD Radeon™ HD 7970, can be working on up to 81,920 work items at a time!
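As a sanity check on that figure, a minimal arithmetic sketch in C++ (using only numbers quoted on this slide):

// Maximum in-flight work items on a GCN GPU (sketch; numbers from this deck).
#include <cstdio>

int main() {
    const int compute_units    = 32;  // AMD Radeon HD 7970 ("Tahiti")
    const int simds_per_cu     = 4;
    const int waves_per_simd   = 10;  // instruction buffer depth per SIMD
    const int threads_per_wave = 64;  // wavefront width

    int in_flight = compute_units * simds_per_cu * waves_per_simd * threads_per_wave;
    std::printf("Max in-flight work items: %d\n", in_flight);  // 81920
    return 0;
}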
20
GCN Compute Unit – Scheduler Specifics
On GCN, each CU has its own dedicated scheduler unit, supporting up to 2560 threads per CU
- Schedules this work between the 4 SIMDs in groups called "wavefronts"
- Each wavefront is a grouping of 64 "threads" which live together on a single SIMD
- One wavefront is executed on each SIMD every four cycles
- Total CU throughput: 4 wavefronts / 4 cycles – that's 256 threads executed every 4 cycles!
- Separate protected virtual address spaces
- Programmed in a purely scalar way
Scheduler limits:
- 40 wavefronts (theoretical max) per CU; 10 wavefronts per SIMD
- These ideal limits may not be attained in practice: limited by the number of available GPRs and by the size of available LDS
21
GCN Compute Unit – Scheduler Specifics Cont.
Work should be grouped to support collaborative tasks
- All threads within a workgroup are guaranteed to be scheduled at the same time
- A set of synchronization primitives and shared memory (LDS) allows data to be passed between threads in a workgroup
- 16 workgroup barriers supported per CU
- Global and shared memory atomics
Don't forget about the L1 cache
- "Group discount" on memory reads – as long as all threads are local to a CU!
Optimized for throughput – latency is hidden by overlapping execution of wavefronts
Workgroup size should be carefully chosen to balance the collaborative gain against hardware limitations such as GPR count and LDS size
22
GCN Scheduler Arbitration and Decode
A CU is guaranteed to issue instructions for a wavefront sequentially
Predication & control flow enable any single work-item to take a unique execution path
For a given CU, every clock cycle, waves on one SIMD are considered for instruction issue (round-robin scheduling algorithm)
- At most one instruction from each category may be issued
- At most one instruction per wave may be issued
- Up to a maximum of 5 instructions can issue per cycle, not including "internal" instructions:
  1 Vector Arithmetic Logic Unit (ALU)
  1 Scalar ALU or scalar memory read
  1 Vector memory access (read/write/atomic)
  1 Branch/message – s_branch and s_cbranch_<cond>
  1 Local Data Share (LDS)
  1 Export or Global Data Share (GDS)
  1 Special/internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio) – [no functional unit]
23
GCN Branch and Message Unit
Independent scalar assist unit to handle special classes of instructions concurrently
Branch:
- Unconditional branch (s_branch)
- Conditional branch (s_cbranch_<cond>); conditions: SCC==0, SCC==1, EXEC==0, EXEC!=0, VCC==0, VCC!=0
- A 16-bit signed immediate dword offset from the PC is provided
Messages (s_sendmsg):
- CPU interrupt with optional halt (with shader-supplied code and source)
- Debug messages (perf trace data, halt, etc.)
- Special graphics synchronization messages
24
GCN Vector Units: VLIW4 SIMD vs. GCN Quad SIMD
VLIW4 SIMD (64 single-precision multiply-adds):
- 1 VLIW instruction × 4 ALU ops – dependency limited
- Compiler manages register port conflicts
- Specialized, complex compiler scheduling
- Difficult assembly creation, analysis, and debug; complicated tool chain support
- Careful optimization required for peak performance
GCN Quad SIMD (64 single-precision multiply-adds):
- 4 SIMDs × 1 ALU op – occupancy limited
- No register port conflicts
- Standard compiler scheduling & optimizations
- Simplified assembly creation, analysis, & debug; simplified tool chain development and support
- Stable and predictable performance
[Speaker notes] Our VLIW4 and VLIW5 architectures are powerful and continue in our products, but they are certainly not the easiest to program for general-purpose work. The new design offers the same amount of ALU, but the scalar-style programming removes all the register and instruction dependencies we had. Chained multiplies, for example, work at peak efficiency, vs. 1/4 rate on the HD 6900. The port simplification that comes from removing the VLIW makes each instruction simple and easy to compile for. The tool chain catering to this architecture is massively simplified and can be made much more robust; performance tuning is easier as well. Finally, this core supports advanced debug features, such as breakpoints and single stepping, that allow for much deeper debug capabilities.
25
GCN Vector Units: Occupancy Limited vs. Dependency Limited
GCN Quad SIMD – occupancy limited:
- Data-level parallelism: need to be able to run the same single instruction on 64 items of data
- Thread-level parallelism: 4x as many wavefronts needed to occupy all SIMDs
VLIW4 SIMD – dependency limited:
- Instruction-level parallelism: need to fill the VLIW with four (or five) independent ops that can be run in parallel from the same program, each cycle!
26
GCN Vector ALU Characteristics
FMA: fused multiply-add, IEEE-precise with all round modes, proper handling of NaN/Inf/zero, and full denormal support in hardware for SP and DP
MULADD: single-cycle issue instruction without truncation, enabling a MUL_ieee followed by an ADD_ieee to be combined, with round and normalization after both the multiplication and the subsequent addition
VCMP: a full set of operations designed to fully implement all the IEEE comparison predicates
IEEE rounding modes (round to nearest even, round toward +infinity, round toward -infinity, round toward zero) supported under program control anywhere in the shader; double- and single-precision modes are controlled separately
Denormals: programmable mode control for SP and DP independently; separate control for input flush-to-zero and underflow flush-to-zero
27
GCN Vector ALU Characteristics (cont.)
Divide assist ops: IEEE 0.5 ULP division accomplished with a macro (~15/41 instruction slots for SP/DP respectively)
FP conversion ops between 16-bit, 32-bit, and 64-bit floats with full IEEE 754 precision and rounding
Exceptions: hardware support for floating-point exceptions with a software recording and reporting mechanism – inexact, underflow, overflow, division by zero, denormal, invalid operation, and integer division by zero
64-bit transcendental approximation: hardware-based double-precision approximation for reciprocal, reciprocal square root, and square root
24-bit INT at full SP rates: heavy use for integer thread group address calculation
32-bit integer at the DP FP MUL/FMA rate
28
GCN Scalar Unit
Fully programmable scalar unit replaces fixed-function branch logic
- Operations such as JMP [GPR] are now supported – opens the door to e.g. virtual function calls
- Has its own GPR pool and can execute normal ALU code
- 64-bit bitwise ops to mask thread execution
- 32-bit bitwise and integer arithmetic operations at full speed
- Potential to offload scalar code (Vector ALU → Scalar ALU)
- A GCN CU can dispatch 1 scalar op/clock (4 ops / 4 clocks)
29
GCN Scalar Unit – Details
- Natively a 64-bit integer ALU with independent arbitration and instruction decode
- One ALU, memory, or control flow op per cycle
- 512 scalar GPRs per SIMD, shared between waves; an {SGPRn+1, SGPRn} pair provides a 64-bit register
- Read-only scalar data cache shared per 4 CUs: 16KB, 64B lines, 4-way associative, LRU replacement policy
- Peak bandwidth per CU is 16 bytes/cycle
[Diagram: Scalar Unit with 8KB registers and integer ALU alongside SIMDs 0–3; the 4-CU-shared 16KB read-only scalar L1 is backed by the R/W L2]
30
GCN Compute Unit – Hardware View
A GCN Compute Unit can retire 256 SP vector ALU ops in 4 clocks
- Each lane can dispatch 1 SP ALU operation per clock
- Each SP ALU operation takes 4 clocks to complete
- The scheduler dispatches from a different wavefront each cycle
31
GCN Compute Unit – Programmer View
[Diagram: wavefronts 0–9 pipelined across the scalar unit and vector lanes, a new wavefront issuing every 4 clocks from clock 0 through clock 20]
A GCN Compute Unit can perform 64 SP vector ALU ops per clock
- Each lane can dispatch 1 SP ALU operation per clock
- Each SP ALU operation still takes 4 clocks to complete
- But you can PRETEND your code runs 1 op on 64 threads at once
32
GCN Shader Code Example
float fn0(float a, float b)
{
    if (a > b) return (a - b) * a;
    else       return (b - a) * b;
}

// Registers: r0 contains "a", r1 contains "b"; the result is returned in r2
v_cmp_gt_f32    r0, r1            // a > b, establish VCC
s_mov_b64       s0, exec          // Save current exec mask
s_and_b64       exec, vcc, exec   // Do "if"
s_cbranch_vccz  label0            // Branch if all lanes fail (optional: use based on the number of instructions in the conditional section; executed in the branch unit)
v_sub_f32       r2, r0, r1        // result = a - b
v_mul_f32       r2, r2, r0        // result = result * a
label0:
s_andn2_b64     exec, s0, exec    // Do "else" (s0 & !exec)
s_cbranch_execz label1            // Branch if all lanes fail
v_sub_f32       r2, r1, r0        // result = b - a
v_mul_f32       r2, r2, r1        // result = result * b
label1:
s_mov_b64       exec, s0          // Restore exec mask

Purple: vector instructions; blue: scalar instructions.
EXEC = execution mask register: selects which threads of the 64-thread wavefront do the work; already set at shader input (e.g. so that only rasterized pixels within a primitive are processed).
VCC = Vector Condition Code register: per-thread condition output of a vector instruction.
SCC = Scalar Condition Code register: condition output of a scalar instruction.
Shader code is visible in GPU ShaderAnalyzer to allow such optimizations.
33
GCN Shader Authoring Tips
GCN has greatly improved branch performance, and it continues to improve
- Don't be afraid to use it! But remember: use it wisely – improved != free
- It's at its best for highly coherent workloads (where most threads take the same path)
But the new architecture is more susceptible to register pressure
- Using too many registers within a shader can reduce the maximum waves per SIMD! (see the occupancy sketch below)
- Note: a wavefront can allocate 104 user scalar registers, as several scalar registers are reserved for architectural state
Take caution with respect to the following:
- Excessive nested branching/looping
- Loop unrolling
- Variable declarations (especially arrays)
- Excessive function calls requiring storing of results

GCN VGPR count:  <=24  28  32  36  40  48  64  84  <=128  >128
Max waves/SIMD:    10   9   8   7   6   5   4   3      2     1
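The table maps directly to a small lookup. A minimal C++ sketch of that relationship (the helper name is ours; the thresholds are the slide's):

// Sketch of the VGPR -> occupancy relationship from the table above.
#include <cstdio>

int MaxWavesPerSimd(int vgprCount) {
    // 256 VGPRs per SIMD are divided among the resident waves, so occupancy
    // drops in steps as a shader's register footprint grows.
    const int thresholds[] = { 24, 28, 32, 36, 40, 48, 64, 84, 128 };
    const int waves[]      = { 10,  9,  8,  7,  6,  5,  4,  3,   2 };
    for (int i = 0; i < 9; ++i)
        if (vgprCount <= thresholds[i]) return waves[i];
    return 1;  // > 128 VGPRs: a single wave per SIMD
}

int main() {
    std::printf("41 VGPRs -> %d waves/SIMD\n", MaxWavesPerSimd(41));  // 5
    std::printf("90 VGPRs -> %d waves/SIMD\n", MaxWavesPerSimd(90));  // 2
}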
34
Cache Hierarchy
- 32KB instruction cache (I$) + 16KB scalar data cache (K$) shared per 4 CUs, with L2 backing
- Each CU has its own registers and local data share
- L1 read/write caches: 64 bytes per clock of L1 bandwidth per CU
- Global data share (GDS, 64KB) facilitates synchronization between CUs
- L2 read/write cache partitions: 64 bytes per clock of L2 bandwidth per partition
- 64-bit dual-channel memory controllers
[Speaker notes] The new cache hierarchy was shown at AFDS; this core implements the first version of it. It's a full two-level R/W cache, with 16KB of L1 per CU and 64KB per L2 partition. Each CU has 64 bytes per cycle of L1 bandwidth, shared with the global data share (a local buffer for sharing data between wavefronts). Each L2 partition delivers 64 bytes per cycle as well. That's nearly 2 TB/s of L1 bandwidth and 700 GB/s of L2 bandwidth. Each group of four cores shares a 32KB instruction cache and a 16KB scalar data cache. Coherency is handled at the L2 level, with applications able to keep the physical L2s updated directly with their L1s. Never settle for enough cache bandwidth!
35
GCN Vector Memory Instructions
Vector memory instructions support variable granularity for addresses and data, ranging from 32-bit data to 128-bit pixel quads
- MUBUF: read from, or perform a write/atomic to, an untyped memory buffer/address; data type/size is specified by the instruction operation
- MTBUF: read from or write to a typed memory buffer/address; data type is specified in the resource constant
- MIMG: read/write/atomic operations on elements of an image surface; image objects (1–4 dimensional addresses and 1–4 dwords of homogeneous data) use resource and sampler constants for access and filtering
36
GCN Export Memory Instruction
Exports move data from 1–4 VGPRs to the graphics pipeline: color (MRT0–7), depth, position, and parameters
Global shared memory ops utilize the GDS
37
GCN LOW-LEVEL TIPS – How GCN Exports Work
The export unit writes results from the programmable stages of the graphics pipeline to the fixed function ones, such as tessellation, rasterization and the render back-ends, via the GDS The GDS is identical to the local data shares, except that it is shared by all compute units, so it acts as an explicit global synchronization point between all wavefronts. The atomic units in the GDS additionally support ordered count operations
38
GCN Local Data Share (LDS)
64KB, 32-bank (or 16-bank) shared memory, fully decoupled from ALU instructions
Direct mode: vector instruction operand
- 32/16/8-bit broadcast value; graphics rate, no bank conflicts
Index mode: load/store/atomic operations
- Bandwidth amplification: up to 32 32-bit lanes serviced per clock peak
- Direct, decoupled return to VGPRs
- Hardware conflict detection with auto scheduling
Software consistency/coherency for thread groups via hardware barrier
Fast & low-power vector load return from the R/W L1
39
GCN Local Data Share (LDS)
An LDS bank is 512 entries, each 32 bits wide
Each bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that includes 32 atomic integer units
- This means several threads can read the same LDS location at the same time for FREE
- Writing to the same address from multiple threads also occurs at rate; the last thread to write wins
Typically, the LDS will coalesce 32 lanes from one SIMD each cycle, so one wavefront is serviced completely every 2 cycles
Conflicts are automatically detected across 32 lanes from a wavefront and resolved in hardware (see the sketch below)
- An instruction which accesses different elements in the same bank takes additional cycles
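To make the bank rule concrete, here is a small C++ sketch – our own illustration, assuming the 32-banks-of-32-bit-words layout described above – that estimates how many cycles a set of 32 lane addresses would take:

// Hypothetical LDS bank-conflict estimate: bank = (byte_address / 4) % 32.
// Same word from many lanes broadcasts for free; distinct words in one bank serialize.
#include <algorithm>
#include <cstdio>
#include <map>
#include <set>

int LdsAccessCycles(const unsigned (&byteAddr)[32]) {
    std::map<unsigned, std::set<unsigned>> wordsPerBank;
    for (unsigned a : byteAddr)
        wordsPerBank[(a / 4) % 32].insert(a / 4);
    std::size_t worst = 1;  // worst-case distinct words requested in one bank
    for (const auto& kv : wordsPerBank)
        worst = std::max(worst, kv.second.size());
    return int(worst);      // 1 = conflict-free
}

int main() {
    unsigned linear[32], strided[32];
    for (unsigned i = 0; i < 32; ++i) {
        linear[i]  = i * 4;    // consecutive words: one per bank
        strided[i] = i * 128;  // stride of 32 words: every lane hits bank 0
    }
    std::printf("linear : %d cycle(s)\n", LdsAccessCycles(linear));   // 1
    std::printf("strided: %d cycle(s)\n", LdsAccessCycles(strided));  // 32
}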
41
GCN R/W Cache
Reads and writes cached – bandwidth amplification
- Improved behavior on more memory access patterns
- Improved write-to-read reuse performance
Relaxed-consistency memory model
- Consistency controls available to control locality of load/store
- GPU-coherent acquire/release semantics control data visibility across the machine (GLC bit on load/store)
- L2 coherent = all CUs can have the same view of data
Global atomics performed in the L2 cache
42
GCN L1 R/W Cache Architecture
Each CU has its own L1 cache
- 16KB, 64B lines, 4 sets x 64 ways
- ~64B/clock bandwidth per compute unit
- Write-through; allocate on write (no read) with a dirty byte mask; written through at the end of a wavefront
- Decompression on cache read-out
The instruction's GLC bit defines cache behavior:
- GLC = 0: local caching (full lines left valid); shader write-back/invalidate instructions
- GLC = 1: globally coherent (hits within wavefront boundaries)
43
GCN L2 R/W Cache Architecture
64–128KB L2 per memory controller channel
- 64B lines, 16-way set associative
- ~64B/clock per channel of L2-to-L1 bandwidth
- Write-back; allocate on write (no read) with a dirty byte mask
Acquire/release semantics control data visibility across CUs
- L2 coherent = all CUs can have the same view of data
Remote atomic operations
- Common integer set & float Min/Max/CmpSwap
44
GCN Latency & Bandwidth
Each CU has 64 bytes per cycle of L1 bandwidth (shared with the GDS)
Each L2 partition delivers 64 bytes of data per cycle as well
Peak scalar data cache bandwidth per CU is 16 bytes/cycle
Peak I-cache bandwidth per CU is 32 bytes/cycle (optimally 8 instructions)
LDS peak bandwidth is 128 bytes of data per cycle via bandwidth amplification
That's nearly 4 TB/s of LDS bandwidth, 2 TB/s of L1 bandwidth, and 700 GB/s of L2 bandwidth! (see the arithmetic sketch below)
384-bit GDDR5 main memory has over 264 GB/sec of bandwidth
PCI Express 3.0 x16 bus interface to the system (32 GB/s)
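A back-of-the-envelope check of those aggregate numbers, as a C++ sketch; the 925 MHz engine clock and the twelve 64KB L2 partitions are our assumptions for the HD 7970, not slide data:

// Aggregate bandwidth check. Assumes the HD 7970 reference engine clock
// (925 MHz), 32 CUs, and twelve 64KB L2 partitions (768KB total) -- our
// assumptions, not figures stated on this slide.
#include <cstdio>

int main() {
    const double clockGHz   = 0.925;  // bytes/clock * GHz = GB/s
    const int    cus        = 32;
    const int    partitions = 12;

    std::printf("LDS: %.0f GB/s\n", cus * 128.0 * clockGHz);       // ~3789 (~4 TB/s)
    std::printf("L1 : %.0f GB/s\n", cus * 64.0 * clockGHz);        // ~1894 (~2 TB/s)
    std::printf("L2 : %.0f GB/s\n", partitions * 64.0 * clockGHz); // ~710  (~700 GB/s)
}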
45
GCN L1 Texture Cache – the memory hierarchy is re-used for graphics
Some dedicated graphics hardware is added:
- The address-gen unit receives 4 texture addresses/clock and calculates 16 sample addresses (nearest neighbors)
- Samples are read from the L1 data cache and decompressed in the Texture Mapping Unit (TMU)
- The TMU filters adjacent samples and produces up to 4 interpolated texels/clock
- TMU output undergoes format conversion and is written into the vector register file
- The format conversion hardware is also used for writing certain formats to memory from graphics shaders
46
GCN Virtual Memory and x86
The GCN cache hierarchy was designed to integrate with x86 microprocessors
- The GCN virtual memory system can support 4KB pages – the natural mapping granularity for the x86 address space – paving the way for a shared address space in the future
- The IOMMU used for DMA transfers can already translate requests into the x86 address space
- GCN caches use 64B lines, the same size x86 processors use
The stage is set for heterogeneous systems to transparently share data between the GPU and CPU through the traditional caching system, without explicit programmer control!
47
AMD Radeon™ HD 7900 Series – Codename “Tahiti”
Graphics Core Next Architecture Let’s look at the specifics of the 7900.
48
AMD Radeon™ HD 7900 Series – Codename “Tahiti”
Graphics Core Next Architecture
- Up to 32 Compute Units
[Speaker notes] So we have a GCN architecture composed of up to 32 compute units, depending on the SKU. That's 3.8 teraflops of compute for the top end – nearly 4 teraflops!
49
AMD Radeon™ HD 7900 Series – Codename “Tahiti”
Adds: Dual Geometry Engines
[Speaker notes] Next we have our new geometry blocks. All geometry and tessellation shading is done in the core, but we have dedicated blocks for processing vertex indices and tessellation. We've refined these and introduced a new tessellation engine. We peak at nearly 2 Gverts/Gprims per second, but with much higher efficiency.
50
AMD Radeon™ HD 7900 Series – Codename “Tahiti”
Adds: 8 Render Back-ends – 32 color ROPs per clock, 128 Z/stencil ROPs per clock
[Speaker notes] 8 back-end units, allowing up to 32 color ROPs and 128 Z/stencil ops per clock. While matching our previous generation on paper, it's really up to 50% faster in real-world benchmarks, due to the ample bandwidth it now gets.
51
AMD Radeon™ HD 7900 Series – Codename “Tahiti”
Adds: Up to 768KB read/write L2 cache
[Speaker notes] The R/W L2 cache is now 50% larger than our previous read-only L2, with 50% more bandwidth as well. That's 768KB of L2, with 512KB of L1! Our texel rate is now 128 texels per clock, or 118 GTexels/sec.
52
AMD Radeon™ HD 7900 Series – Codename “Tahiti”
Adds: Fast 384-bit GDDR5 memory interface (up to 264 GB/sec) and a PCI Express 3.0 x16 bus interface
[Speaker notes] We've increased memory bandwidth by 50%, allowing us to really push the texture and ROP rates. For example, on Cayman we measured typical pixel rates of only 23 pixels/clock, but on the 7900 series we are getting nearly 32 pixels per clock. We can sink up to 264 GB/s of bandwidth across ROPs, Z, and texture – important for maintaining 118 GTexels/s and 32 GPixels/s. Finally, this is the first PCIe Gen3 discrete GPU on the market.
53
AMD Radeon™ HD 7900 Series – Codename “Tahiti”
Adds: 4.3 billion 28nm transistors; 3.79 peak single-precision TFLOPS
[Speaker notes] All of this in a neat 4.3-billion-transistor package. Never settle – it's never enough.
54
AMD Radeon™ HD 7900 Series – Compute Architecture
Dual Asynchronous Compute Engines (ACEs)
- Operate in parallel with the graphics command processor
- Independent scheduling and work-item dispatch for efficient multi-tasking
- 3 devices with 3 command queues!
- Fast context switching
- Exposed in OpenCL™
Dual DMA engines
- Can saturate PCIe 3.0 x16 bus bandwidth (16 GB/sec bidirectional)
[Speaker notes] We introduced the dual DMA engines in the 6900 series. We continue this in our new generation, and it is essential for getting maximum performance out of PCIe Gen3. On top of that, the 7900 series carries two independent compute engines plus an independent graphics/compute engine. This gives us up to 3 different, independent views of the chip – 3 devices, each with its own command queue, fully independent. The plan is to expose this in OpenCL for now. This is independent of virtual memory differences, which allow many apps to share one command queue.
55
AMD Radeon™ HD 7900 Series – Compute Architecture
High-performance double-precision floating point
- Up to 947 DP GFLOPS; higher utilization = more usable FLOPS; IEEE compliant
More efficient flow control & branching
Full ECC protection for DRAM & SRAM
First GPU to fully support OpenCL™ 1.2, Direct3D 11.1 compute, and C++ AMP
New compute instructions; new FSA-IL-compatible instruction set
[Speaker notes] Focus on the DP rates; first GPU with OpenCL 1.2 and DirectX 11.1. Mention that OpenCL 1.2 allows multiple compute devices to be exposed on a single processor.
56
GCN Architecture – ACE Intimate Details
ACEs are responsible for compute shader scheduling & resource allocation
- Each ACE fetches commands from cache or memory and forms task queues
- Tasks have a priority level for scheduling, from background to realtime
- ACEs dispatch tasks to the shader arrays as resources permit
- Tasks complete out of order; the ACE tracks completion for correctness
- Every cycle, an ACE can create a workgroup and dispatch one wavefront from the workgroup to the CUs
57
GCN Architecture – ACE Intimate Details
ACEs are independent
- But they can synchronize and communicate via cache/memory/GDS
- Can form task graphs: individual tasks can have dependencies on one another, on another ACE, or on part of the graphics pipeline
- Can control task switching: stop and start tasks and dispatch work to the shader engines
58
GCN Architecture – Enabling Compute Workloads
Future trends: more compute units; ALU outpacing bandwidth; CPU + GPU flat memory; APU + discrete GPU; less fixed-function graphics
- Can you write a compute-based graphics pipeline? Start thinking about it…
The focus in GPU hardware is shifting away from graphics-specific units, toward general-purpose compute units
- 7900 series GCN-based ASICs already have a "3:1" ratio of ACE : graphics CP – the graphics CP can dispatch compute, but an ACE cannot dispatch graphics
If you aren't writing compute shaders, you're probably not getting the absolute most out of modern GPUs
- Control of LDS, barriers, thread layout, etc.
59
Utilization and Efficiency
Higher utilization = higher performance per mm²
[Chart: per-benchmark speedup vs. the prior generation, split into the 1.4x GFLOPS increase and the utilization improvement]
[Speaker notes] Let's talk about what this all means from a compute standpoint – essentially at the level of a shader program. First off, performance is radically higher: nearly 5x our previous generation in the best cases, and always at least 50% better. If we account for the roughly 1.4x growth in GFLOPS, we see that in every case this core is simply more efficient – we always do better, and sometimes many times better. The gains come from a mixture of factors: non-VLIW execution, better scheduling, and so on. SmallptGPU & LuxMark are open-source OpenCL ray tracers; SHA256 is a secure hash function; AES256 is a symmetric encryption algorithm.
60
GCN Geometry Engine
- GS in conjunction with tessellation is faster than before… however, memory is still the bottleneck!
- Minimize the number of inputs and outputs for best performance
- Small expansions can be done in LDS!
- Each rasterizer can read in a single triangle per cycle and write out 16 pixels
- Caveat: tiny triangles can mean we don't reach this potential and become raster-bound!
[Image: tessellation on/off comparison from Battlefield 3, courtesy EA DICE]
61
GCN Tessellation – Latest Iteration of Hardware Tessellation Units
- Increased vertex re-use
- Off-chip buffering improvements
- Larger parameter caches
- Improves performance at all tessellation factors: up to 4x the throughput of the AMD Radeon™ HD 6900 series (Gen 8)
[Image: tessellation on/off comparison from Battlefield 3, courtesy EA DICE]
[Speaker notes] This latest generation improves significantly on both tessellation and geometry buffer performance; the biggest changes are listed here. The result is up to 4x the performance of our previous HD 6900 series architecture.
62
GCN Tessellation – Performance
[Chart: tessellation throughput, AMD Radeon™ HD 7970 vs. HD 6970, across tessellation factors]
[Speaker notes] This slide shows our tessellation rate versus the 6900 for different tessellation factors. As you can see, we are anywhere from 1.7x to 4x. This translates to significant surges in performance in some tessellation-heavy cases, as shown on the right. I'm still not a proponent of excessive tessellation for no reason – it's still not right to increase tessellation factors with no visual benefit.
System configuration: Core i7 X980 (3.33GHz), Gigabyte EX58UD5, 6GB DDR, Windows 7 RTM 64-bit, AMD Radeon™ HD 7970, AMD Radeon™ HD 6970
63
GCN Tessellation – Best Practices
While performance is much improved, tessellation is still a potential bottleneck!
- It produces a great deal of ring-bus traffic, starving other parts of the pipeline
- Best performance is achieved with tessellation factors less than 15!
Continue to optimize (a sketch of distance-adaptive factor selection follows below):
- Pre-triangulate / pre-tessellate as needed in order to avoid higher tessellation factors
- Frustum culling, backface culling
- Distance-adaptive, screen-space-adaptive, and orientation-adaptive tessellation
- Etc.
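As one illustration of the distance-adaptive bullet, a hedged C++ sketch (ours, with illustrative constants) that picks a tessellation factor that falls off with distance and stays under the factor-15 guidance:

// Hypothetical distance-adaptive tessellation factor, honoring the
// "keep factors under 15" guidance above. Constants are illustrative.
#include <algorithm>
#include <cstdio>

float TessFactor(float distToCamera) {
    const float maxFactor = 14.0f;   // stay below the ~15 sweet spot
    const float nearDist  = 5.0f;    // full detail at or closer than this
    const float farDist   = 100.0f;  // minimum detail beyond this
    float t = (distToCamera - nearDist) / (farDist - nearDist);
    t = std::min(std::max(t, 0.0f), 1.0f);         // clamp to [0,1]
    return 1.0f + (1.0f - t) * (maxFactor - 1.0f); // lerp 14 -> 1
}

int main() {
    for (float d : { 2.0f, 25.0f, 150.0f })
        std::printf("dist %.0f -> factor %.1f\n", d, TessFactor(d));
}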
64
Render Back-End (RBE) on GCN ASICs
Once the pixel fragments in a tile have been shaded, they flow to the Render Back-Ends (RBEs)
- 16KB color cache: up to 8 color samples (i.e. 8x MSAA)
- 4KB depth cache: up to 16 coverage samples (i.e. 16x EQAA)
- Write-out through the memory controllers
Logic operations as an alternative to blending
- Exposed in DX11.1; already available in OpenGL
Dual-source color blending with MRTs
- Only available in OpenGL
65
Depth Improvements on GCN ASICs
Faster Z tile rate: allows fast accept of fully visible triangles spanning one or more tiles
- If a triangle fully covers a tile, the cost is only 1 clock/tile (Northern Islands used to be 3 clocks/tile)
Depth bounds testing extension
- Exposed in OpenGL: GL_EXT_depth_bounds_test
- Also exposed in Direct3D via an extension – ask us if you'd like to try it
24-bit depth formats are internally represented as 32 bits
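For reference, enabling the depth bounds test through the OpenGL extension looks roughly like this; a sketch assuming the standard EXT_depth_bounds_test entry points and an extension loader:

// Sketch: skip fragments whose stored depth falls outside [zmin, zmax],
// useful for e.g. culling deferred light volumes. Assumes a context with
// EXT_depth_bounds_test; glDepthBoundsEXT must be fetched via your loader
// on most platforms.
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

void SetDepthBounds(bool enable, float zmin = 0.0f, float zmax = 1.0f) {
    if (enable) {
        glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
        glDepthBoundsEXT(zmin, zmax);  // both in [0,1], zmin <= zmax
    } else {
        glDisable(GL_DEPTH_BOUNDS_TEST_EXT);
    }
}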
66
Stencil Improvements on GCN ASICs
GCN supports new extended stencil ops compared to prior ASICs
- Only available in OpenGL: GL_AMD_stencil_operation_extended
- Additional stencil ops: AND, XOR, NOR, REPLACE_VALUE_AMD, etc.
- Also exposes an additional stencil op source value, which can be used as an alternative to the stencil ref value
Stencil ref and op source value can now be exported from the pixel shader
- Only available in OpenGL: GL_AMD_shader_stencil_value_export
67
GCN LOW-LEVEL TIPS – GPR Utilization
GPRs and GPR pressure
- General Purpose Registers (GPRs) are a limited resource, with separate banks of GPRs for vector and scalar use (per SIMD)
- Maximum of 256 VGPRs and 512 SGPRs shared across all waves (up to 10) owned by a SIMD
- A VGPR is organized as 64 words of 32 bits; two adjacent GPRs can be combined for 64-bit values (4 for 128-bit)
- The number of GPRs required by a shader affects SIMD scheduling and execution efficiency
- Shader tools can be used to determine how many GPRs are used
GPR pressure is affected by:
- Loop unrolling
- Long lifetimes of temporary variables
- Nested dynamic flow control instructions
- Fetch dependencies (e.g. indexed constants)
68
GCN LOW-LEVEL TIPS – Texture Filtering
GCN bilinear filtering costs:
- Quarter-rate: RGBA32, RGBA32F
- Half-rate: RG32, RG32F, RGBA16, RGBA16F, BC6
- Full-rate: everything else!
All shader stages can fetch textures; point sampling is full-rate on all formats
Trilinear costs up to 2x the bilinear filtering cost; anisotropic (N taps) costs <= N x bilinear
Avoid cache thrashing:
- Use MIP mapping
- Use Gather() where applicable
- Exploit implicit neighbouring pixel-shader thread→CU locality: sampling from neighbouring texels has a lower cost for a shader running within the same hardware tile, because it is more likely to hit in the Compute Unit's local texture cache
- Exploit this explicitly by using compute shaders
69
GCN LOW-LEVEL TIPS – Color Output
Costs of outputting and blending various formats:
- Quarter-rate: RGBA16 with blending; RGBA32F with blending
- Half-rate: R16/RG16 with blending; RG32F with blending; RGBA32; RGBA32F
- Full-rate: everything else!
PS output: each additional color output increases export cost
- Export cost can be higher than the cost of PS execution itself: each (fast) export is equivalent to ~64 ALU ops on the 7970
- If a shader is export-bound, use the "free" ALU for packing instead (see the sketch below)
- Watch out for such cases, e.g. G-buffer parameter writes
MINIMIZE SHADER INPUTS AND OUTPUTS! Pack, pack, pack, pack!
Discard/clip allows the shader hardware to skip the rest of the work!
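To illustrate the packing advice, a small C++ sketch – our own, with a simplified float-to-half conversion (truncating, no NaN/denormal handling) – that packs two 32-bit floats into one 32-bit output channel, halving export traffic:

// Simplified f32 -> f16 conversion and 2-into-1 packing, as one might do in
// a shader to shrink G-buffer exports. Not IEEE-complete: illustration only.
#include <cstdint>
#include <cstdio>
#include <cstring>

uint16_t FloatToHalf(float f) {
    uint32_t bits; std::memcpy(&bits, &f, sizeof bits);
    uint32_t sign = (bits >> 16) & 0x8000u;
    int32_t  exp  = int32_t((bits >> 23) & 0xFF) - 127 + 15;  // rebias exponent
    uint32_t man  = (bits >> 13) & 0x3FFu;                    // truncate mantissa
    if (exp <= 0)  return uint16_t(sign);                     // flush to zero
    if (exp >= 31) return uint16_t(sign | 0x7C00u);           // clamp to infinity
    return uint16_t(sign | (uint32_t(exp) << 10) | man);
}

uint32_t PackHalf2(float x, float y) {
    return uint32_t(FloatToHalf(x)) | (uint32_t(FloatToHalf(y)) << 16);
}

int main() {
    std::printf("packed = 0x%08X\n", PackHalf2(1.0f, -2.5f));  // 0xC1003C00
}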
70
GCN Media Processing Instructions
SAD = Sum of Absolute Differences (a scalar reference sketch follows below)
- Critical to many video & image processing algorithms: motion detection, gesture recognition, video & image search, stereo depth extraction, computer vision
SAD (4x1) and QSAD (4x 4x1) instructions
- The new QSAD combines SAD with alignment ops for higher performance and reduced power draw
- Evaluate up to 256 pixels per CU per clock cycle!
Maskable MQSAD instruction
- Allows background pixels to be ignored: accelerated isolation of moving objects
[Figure: a 4-pixel template slid across candidate rows; the candidate with the lowest SAD is the closest match]
The AMD Radeon™ HD 7970 can evaluate 7.6 Terapixels/sec*
*Peak theoretical performance for 8-bit integer pixels
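For reference, a scalar C++ sketch of the 4x1 SAD primitive these instructions accelerate (our own illustration; the hardware evaluates many such comparisons in parallel):

// Scalar reference for a 4x1 sum-of-absolute-differences, the primitive
// behind the SAD/QSAD/MQSAD instructions described above.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

uint32_t Sad4x1(const uint8_t a[4], const uint8_t b[4]) {
    uint32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += uint32_t(std::abs(int(a[i]) - int(b[i])));
    return sum;
}

int main() {
    const uint8_t templ[4]     = { 9, 3, 1, 4 };
    const uint8_t candidate[4] = { 7, 1, 2, 9 };
    std::printf("SAD = %u\n", Sad4x1(templ, candidate));  // |9-7|+|3-1|+|1-2|+|4-9| = 10
}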
71
GCN Video Codec Engine (VCE)
- Hardware H.264 compression and decompression
- Ultra-low-power, fully fixed-function mode capable of 60 frames/second
- Programmable for ultra-high quality and/or speed
- Entropy encoding block fully accessible to software via the AMD Accelerated Parallel Processing SDK (OpenCL™)
- Create hybrid faster-than-real-time encoders: custom motion estimation, inverse DCT and motion compensation, combined with hardware entropy encoding!
The AMD Radeon™ HD 7970 can compress 1080p H.264 at faster than real time
72
IMPORTANT GCN ARCHITECTURE IMPROVEMENTS
Increased flexibility and efficiency, with reduced complexity!
- Non-VLIW architecture improves efficiency while reducing programmer burden
- Constants/resources are just address + offset now in the hardware
- UAV/SRV/SUV read/write any format – like CPU C++ reinterpret_cast & static_cast
- GPU virtual memory, looking forward toward x86 CPU + GPU flat memory
Strong forward-looking focus on compute
- Scalar ALU for complex dynamic control flow + branch & message unit
- 64KB LDS/CU, 64KB GDS, atomics at every stage, coherent cache hierarchy
- Multiple Asynchronous Compute Engines (ACEs) for multitasking compute
73
MAIN GCN ARCHITECTURE TAKEAWAYS
GCN generally simplifies your life as a programmer
- Don't: fret too much about instruction grouping or vectorization
- Do: think about GPR utilization & LDS usage (they impact the max number of wavefronts)
- Do: think about thread/cache locality when you structure your algorithm
- Do: pack shader inputs and outputs – aim to be as I/O- and bandwidth-thin as possible!
Unlimited number of addressable constants/resources
- N constants aren't free anymore – each consumes resources, so use them sparingly!
Compute is the future – exploit its power for GPGPU work & graphics!
Thank you! If we have time remaining, we can cover Partially Resident Textures.
Layla Mah @MissQuickstep
Partially Resident Textures (PRT)
MegaTexture in id Tech 5
Partially Resident Textures (PRT) – Introduction
Enables the application to manage more texture data than can physically fit in a fixed memory footprint
A.k.a. “virtual texturing” and “sparse texturing”
The principle behind PRT is that not all texture contents are likely to be needed at any given time:
The current render view may only require selected portions of a texture to be resident in memory
Or only selected MIP map levels…
PRT textures only have a portion of their data mapped into GPU-accessible memory at a time
Texture data can be streamed in on demand
Texture sizes up to 32 TB (16k x 16k x 8k x 128-bit)
OpenGL extension: GL_AMD_sparse_texture
Note: some people have been referring to PRT as a form of texture compression, but the term is misleading because texture data is stored losslessly.
Partially Resident Textures (PRT) – TEXTURE TILES
The PRT texture is chunked into 64 KB tiles
Fixed memory size, not dependent on texture type or format
For example, for a 32-bit texture a tile holds 128x128 pixels; for DXT3/BC2 it holds 256x256 pixels (see the sketch below)
Highlighted areas represent texture data that needs the highest resolution
[Figure: chunked texture, with the tiles that need to be resident in GPU memory highlighted]
Smiley texture courtesy of “Sparse Virtual Texturing”, GDC 2008
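A quick sketch of the tile-size arithmetic behind those examples (assuming square tiles, which matches the figures above; the helper name is ours):

#include <math.h>
#include <stdio.h>

/* Given bytes per texel, compute the square tile edge that fits a
 * fixed 64 KB PRT tile: RGBA8 at 4 B/texel -> 128x128, and BC2 at an
 * effective 1 B/texel -> 256x256, matching the slide. */
int tile_edge_texels(double bytes_per_texel)
{
    return (int)sqrt(65536.0 / bytes_per_texel);
}

int main(void)
{
    printf("RGBA8: %d texels/edge\n", tile_edge_texels(4.0));
    printf("BC2:   %d texels/edge\n", tile_edge_texels(1.0));
    return 0;
}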
PRT – Translation Table
The GPU virtual memory page table translates 64 KB tiles into a resident texture tile pool
[Figure: texture map → page table → texture tile pool (video memory, linear storage); 64 KB tiles; unmapped vs. mapped page entries; not all arrows shown for clarity]
Smiley texture courtesy of “Sparse Virtual Texturing”, GDC 2008
PRT – Translation Table – Mip Maps
Not all tiles from the texture map are actually resident in video memory
The PRT hardware page table stores virtual → physical mappings, with entries covering every MIP level
[Figure: texture map and its MIP levels → page table → texture tile pool (video memory); unmapped vs. mapped page entries; not all arrows shown for clarity]
Smiley texture courtesy of “Sparse Virtual Texturing”, GDC 2008
PRT – TILE MANAGEMENT
The application is responsible for uploading/releasing PRT tiles – this is not an “on-demand” paging system!
A common scenario is to keep the lower MIP maps in the texture tile pool
This allows a full representation of the PRT contents to be resident in memory (albeit at lower resolution)
E.g. MIP LOD 6 and above for a 16k x 16k 32-bit texture (256x256 resolution and below) is only about 650 KB (see the sketch below)
Texture tiles corresponding to higher-resolution areas are uploaded by the application as needed
E.g. as the camera gets closer to a PRT-textured polygon, the required texel : screen-pixel ratio increases, so tiles from more detailed MIP levels need uploading
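A hedged sketch of that MIP-tail arithmetic (raw texel bytes only; actual pool residency rounds each level up to 64 KB tiles, so the slide’s ~650 KB figure sits above this raw sum):

#include <stdio.h>

/* Approximate raw size of a square texture's MIP tail, from a given
 * LOD down to 1x1. For 16k x 16k at 4 bytes/texel, LOD 6 is 256x256
 * and the tail is a few hundred KB - cheap to keep always resident. */
unsigned long long mip_tail_bytes(unsigned base_dim,
                                  unsigned bytes_per_texel,
                                  unsigned from_lod)
{
    unsigned long long total = 0;
    for (unsigned dim = base_dim >> from_lod; dim >= 1; dim >>= 1)
        total += (unsigned long long)dim * dim * bytes_per_texel;
    return total;
}

int main(void)
{
    printf("16k x 16k, LOD 6+: %llu KB raw\n",
           mip_tail_bytes(16384, 4, 6) / 1024);
    return 0;
}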
PRT – “FAILED” FETCH
How does the application know which texture tiles to upload?
Answer: PRT-specific texture fetch instructions in the pixel shader
These return a “failed” texel fetch condition when sampling a PRT texel whose tile is currently not in the pool
OpenGL example (there is a “sparse” version of virtually all texture fetch instructions):
int glSparseTexture( gsampler2D sampler, vec2 P, inout gvec4 texel );
The condition is then stored in a render target or UAV: texel fetch failed for a given (x,y) tile location
...and then copied to the CPU so that the application can upload the required tiles (GPU→CPU copies incur a few frames of delay)
The app chooses what to render until the missing data gets uploaded
PRT – “LOD WARNING” TEXEL FETCH CONDITION
The PRT fetch condition code can also indicate an “LOD warning”
The minimum LOD warning is specified by the application on a per-texture basis
OpenGL example: glTexParameteri( <target>, MIN_WARNING_LOD_AMD, <LOD warning value> );
If a fetched texel’s LOD is < the specified LOD warning value, the condition code is returned
This functionality is typically used to predict when higher-resolution MIP levels will be needed
E.g. the camera getting closer to PRT-mapped geometry
PRT – Example Usage
1. App allocates a PRT (e.g. 16k x 16k DXT1/BC1) using the PRT API – no video memory is allocated at this stage
2. App uploads the low MIP levels using API calls
3. Shader fetches PRT data at the specified texcoords – two possibilities:
3.a. Texel data belongs to a resident (64 KB) tile: valid color returned, no error code
3.b. Texel data points to a non-resident tile, or to an LOD below the specified warning level: an error / LOD warning code is returned, and the shader writes the tile location and condition code to an RT or UAV
4. App reads back the RT or UAV and uploads/releases tiles as needed – reading on the CPU is subject to latency, typically a couple of frames when copying GPU memory to CPU-accessible memory (see the host-side sketch below)
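For orientation, a heavily hedged host-side sketch of step 4 in C against OpenGL: the TileRequest layout, loadTileFromDisk(), and the assumption that glTexSubImage2D on a GL_AMD_sparse_texture texture commits the pages it touches are ours, not the presentation’s.

#include <GL/gl.h>

/* Hypothetical feedback record a shader writes for each failed
 * fetch: which tile (and MIP level) was missing. */
typedef struct { GLint tileX, tileY, mip; } TileRequest;

/* App-side loader, assumed to return tileW x tileH texels. */
extern const void *loadTileFromDisk(GLint x, GLint y, GLint mip);

/* Per-frame PRT update: read-back requests arrive a couple of frames
 * late (see above), so keep rendering from lower MIPs until tiles land. */
void prt_upload_tiles(GLuint tex, const TileRequest *reqs, int nReqs,
                      int tileW, int tileH)
{
    glBindTexture(GL_TEXTURE_2D, tex);
    for (int i = 0; i < nReqs; ++i) {
        const TileRequest *r = &reqs[i];
        /* Uploading into a sparse texture makes the touched 64 KB
         * pages resident and fills them with texel data. */
        glTexSubImage2D(GL_TEXTURE_2D, r->mip,
                        r->tileX * tileW, r->tileY * tileH,
                        tileW, tileH, GL_RGBA, GL_UNSIGNED_BYTE,
                        loadTileFromDisk(r->tileX, r->tileY, r->mip));
    }
}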
PRT Types, Formats and Dimensions
All texture types and formats supported:
1D, 2D, cube maps, arrays and 3D volume textures
All common texture formats, including compressed formats
Maximum dimensions: 16k x 16k x 8k x 128-bit
A 16k x 16k x 8k x 128-bit texture is 32 terabytes of data, not including MIP maps.
Hardware PRT > Software Implementation
Advantages of hardware PRT over a software virtual texturing implementation:
Ease of implementation – the complexity is hidden behind the HW & API
Full filtering support, including anisotropic filtering, at full speed
A SW solution requires “manual” filtering, and software anisotropic is very costly
Reduced overhead and no need for texture borders
But don’t go overboard with PRT allocation!
Each page table entry is 4 DWORDs (16 bytes) and has to be resident in video memory
One entry per 64 KB tile means the maximum 16k x 16k x 8k x 32-bit texture (8 TB) needs 2 GB of page table entries (see the check below)
PRT is a HW solution that eliminates the complexity and limitations of software solutions (e.g. Carmack’s MegaTexture)
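The arithmetic behind that 2 GB warning, as a quick self-contained check (numbers taken from the slide):

#include <stdio.h>

/* One 16-byte page table entry per 64 KB tile, entries resident. */
int main(void)
{
    unsigned long long texBytes   = 16384ULL * 16384 * 8192 * 4; /* 8 TB */
    unsigned long long numTiles   = texBytes / (64 * 1024);
    unsigned long long tableBytes = numTiles * 16;
    printf("tiles: %llu M, page table: %llu GB\n",
           numTiles >> 20, tableBytes >> 30);   /* 128 M tiles, 2 GB */
    return 0;
}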
问题? Questions? 質問がありますか? ^_^
Layla Mah @MissQuickstep
Trademark Attribution AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2012 Advanced Micro Devices, Inc. All rights reserved.
Backup Slides
SHADER CODE EXAMPLE 2
Purple: vector instructions
Blue: scalar instructions
Exec = Execution mask register: defines which threads of the wavefront (64 threads) will do the work. Already set at shader input (e.g. set so that only rasterized pixels within a primitive are processed).
VCC = Vector Condition Code register: per-thread condition bits output by vector compare instructions.
SCC = Scalar Condition Code register: a single condition bit output by scalar instructions, used for scalar branches.