GCN PERFORMANCE „FTW“
STEPHAN HODES, DEVELOPER TECHNOLOGY ENGINEER, AMD
AMD AND MICROSOFT DEVELOPER DAY, JUNE 2014, STOCKHOLM


GCN PERFORMANCE „FTW“ | AMD AND MICROSOFT GAME DEVELOPER DAY - JUNE , STOCKHOLM

AGENDA
- GCN architecture explained
- Top 10: GCN Performance Advice
- Questions

AMD GRAPHICS CORE NEXT
- What is GCN?
  - Non-VLIW architecture
  - Less dependent on manual vectorization of shaders
  - Susceptible to register pressure
  - Architecture used in:
    - AMD discrete GPUs since 2012 (HD7700 and better)
    - Kabini and Kaveri APUs
    - Future AMD hardware
    - New consoles
- GCN hardware is required for Mantle
  - DirectX 12 API support

PRODUCT SPECIFICATIONS: AMD RADEON™ R9 290 SERIES

                            R9 290X                  R9 290
  Compute Units             44                       40
  Engine Clock              Up to 1 GHz              Up to 950 MHz
  Compute Performance       5.6 TFLOPS               4.9 TFLOPS
  Memory Configuration      4GB GDDR5 / 512-bit      4GB GDDR5 / 512-bit
  Memory Speed              5.0 Gbps                 5.0 Gbps
  AMD TrueAudio Technology  Yes                      Yes
  API Support               DirectX® 11.2,           DirectX® 11.2,
                            OpenGL 4.3, Mantle       OpenGL 4.3, Mantle

GCN COMPUTE UNIT – SPECIFICS
- Non-VLIW instruction set architecture
- 4 [16-lane] Vector ALUs (SIMD)
  - One wavefront is 64 threads
  - 1 SP (Single-Precision) op: 4 clocks
  - 1 DP (Double-Precision) ADD: 8 clocks
  - 1 DP MUL/FMA & Transcendental: 16 clocks
  - 64KB Vector GPRs
- 1 fully programmable scalar ALU
  - Shared by all threads of a wavefront
  - Used for flow control, pointer arithmetic, etc.
  - 8KB Scalar GPRs, scalar data cache, etc.

[Diagram: compute unit blocks: Scheduler, Branch & Message Unit, Scalar Unit, Scalar Registers (SGPRs, 8KB), Vector Units (4x SIMD-16), Vector Registers (VGPRs, 4x 64KB), Local Data Share (LDS, 64KB), L1 Cache (16KB), Texture Fetch Load / Store Units (16), Texture Filter Units (4)]

GCN COMPUTE UNIT – SPECIFICS
- Distributed programmable scheduler (up to 2560 threads)
  - Each compute unit can execute instructions from multiple kernels
  - Separate decode/issue for:
    - 1 Vector Arithmetic Logic Unit (ALU)
    - 1 Scalar ALU or Scalar Memory Read or 1 Branch/Message
    - 1 Vector memory access (Read/Write/Atomic)
    - 1 Local Data Share operation (LDS)
    - 1 Export or Global Data Share operation (GDS)
    - Plus 1 Special/Internal, with no functional unit (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio)

GCN COMPUTE UNIT – SPECIFICS
- 64KB Local Data Share (LDS)
  - 32 banks, with conflict resolution
  - Bandwidth amplification
- 16KB read/write L1 vector data cache
- Texture Units (utilize L1)
  - 16 Load/Store units
  - 4 Filter units
- 1 Branch & Message Unit
  - Executes branch instructions (as dispatched by Scalar Unit)

GCN COMPUTE UNIT – LATENCY HIDING
- Up to 10 wavefronts per SIMD
  - Used to hide latency
  - Round-robin scheduling
  - Independent kernels
  - Often limited by GPR or LDS usage

[Diagram: timeline in clocks showing Batch 1 to Batch 4 alternating between Runnable and Stall until done]

GCN COMPUTE UNIT – REGISTER PRESSURE
- Vector GPRs
  - 64KB / 64 threads / 4 bytes / 10 wavefronts = 25.6 VGPRs/thread => max 24 VGPRs per thread (to keep 10 wavefronts per SIMD resident)
- Scalar GPRs
  - 8KB / 4 SIMDs / 4 bytes / 10 wavefronts = 51.2 SGPRs/wavefront => max 48 SGPRs per wavefront
- LDS
  - 32KB/threadgroup and threadgroup size 64 => 2 wavefronts/CU max.
  - 32KB/threadgroup and threadgroup size 256 => 8 wavefronts/CU max.
  - 16KB/threadgroup and threadgroup size 256 => 16 wavefronts/CU max.
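The register and LDS budgets above can be replayed as a small occupancy estimate. The Python sketch below is not from the talk; it only repeats the slide's arithmetic. All function names are hypothetical, and the allocation granularity of 4 VGPRs is an assumption, chosen because it reproduces the slide's rounding from 25.6 down to 24.

```python
# Occupancy arithmetic from the slide, for one GCN compute unit (CU).
# All names are illustrative; VGPR_GRANULE is an assumption, not slide content.
VGPR_BYTES_PER_SIMD = 64 * 1024   # "Vector Registers (VGPRs, 4x 64KB)"
SGPR_BYTES_PER_CU = 8 * 1024      # "Scalar Registers (SGPRs, 8KB)"
LDS_BYTES_PER_CU = 64 * 1024      # "Local Data Share (LDS, 64KB)"
SIMDS_PER_CU = 4
WAVE_SIZE = 64                    # threads per wavefront
MAX_WAVES_PER_SIMD = 10
REGISTER_BYTES = 4
VGPR_GRANULE = 4                  # assumed allocation block size


def vgpr_budget_per_thread(waves=MAX_WAVES_PER_SIMD):
    """Raw VGPRs per thread if `waves` wavefronts share one SIMD."""
    return VGPR_BYTES_PER_SIMD / WAVE_SIZE / REGISTER_BYTES / waves


def sgpr_budget_per_wave(waves=MAX_WAVES_PER_SIMD):
    """Raw SGPRs per wavefront if `waves` wavefronts share one SIMD."""
    return SGPR_BYTES_PER_CU / SIMDS_PER_CU / REGISTER_BYTES / waves


def waves_per_simd(vgprs_per_thread):
    """Wavefronts per SIMD that fit in the vector register file."""
    granules = -(-vgprs_per_thread // VGPR_GRANULE)           # round up
    allocated = granules * VGPR_GRANULE
    registers_per_lane = VGPR_BYTES_PER_SIMD // (WAVE_SIZE * REGISTER_BYTES)
    return min(MAX_WAVES_PER_SIMD, registers_per_lane // allocated)


def waves_per_cu_from_lds(lds_bytes_per_group, group_size):
    """Wavefronts per CU when LDS is the only limit."""
    groups = LDS_BYTES_PER_CU // lds_bytes_per_group
    waves_per_group = -(-group_size // WAVE_SIZE)             # round up
    return min(MAX_WAVES_PER_SIMD * SIMDS_PER_CU, groups * waves_per_group)
```

Under this model a shader that needs 25 VGPRs drops from 10 to 9 wavefronts per SIMD, which is the kind of threshold worth checking before spending effort on register reduction. The product 4 SIMDs x 10 wavefronts x 64 threads also gives the 2560 threads quoted for the scheduler.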

GCN SHADER OPTIMIZATION STRATEGIES
- Try reducing GPR count if you are slightly over a waves-per-SIMD threshold
  - Deep nesting
  - Local array declarations
  - Long-lived temporary variables
- Reducing GPRs is not always optimal
  - The shader compiler might use GPRs to reduce latency
  - A high number of threads per CU can thrash your caches

Listing using more GPRs (results in v6 to v9), with all four loads issued before the first wait:

    image_load v6, v[35:38], s[4:11]
    v_mov_b32 v3, v35
    image_load v7, v[3:6], s[4:11]
    v_mov_b32 v38, v36
    image_load v8, v[37:40], s[4:11]
    v_mov_b32 v3, v37
    image_load v9, v[3:6], s[4:11]
    s_waitcnt vmcnt(2)
    v_min_f32 v6, v6, v7
    s_waitcnt vmcnt(1)
    v_min_f32 v6, v6, v8
    s_waitcnt vmcnt(0)
    v_min_f32 v40, v6, v9

Listing using fewer GPRs (results reuse v7), with a full wait after each load:

    image_load v6, v[35:38], s[4:11]
    v_mov_b32 v3, v35
    image_load v7, v[3:6], s[4:11]
    v_mov_b32 v38, v36
    v_mov_b32 v3, v37
    s_waitcnt vmcnt(0)
    v_min_f32 v6, v6, v7
    image_load v7, v[37:40], s[4:11]
    s_waitcnt vmcnt(0)
    v_min_f32 v6, v6, v7
    image_load v7, v[3:6], s[4:11]
    s_waitcnt vmcnt(0)
    v_min_f32 v6, v6, v7

Always profile your changes!

TOP 10 PERFORMANCE ADVICE

TOP 10 PERFORMANCE ADVICE
1. Use the power of DirectCompute
  - Thread group size should be a multiple of 64
    - 256 is often a good choice
  - Don't underestimate the benefits of LDS
  - Use asynchronous compute
  - Don't switch between Compute and Rasterization too frequently
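The advice on thread group size follows from the 64-thread wavefront: a group that is not a multiple of 64 still occupies whole wavefronts, so some lanes sit idle. The Python sketch below illustrates that arithmetic only; the function names are hypothetical and the model ignores every other scheduling effect.

```python
WAVE_SIZE = 64  # threads per wavefront, as stated on the compute unit slides


def wavefronts_for_group(x, y=1, z=1):
    """Wavefronts needed for a thread group declared as numthreads(x, y, z)."""
    threads = x * y * z
    return -(-threads // WAVE_SIZE)  # round up to whole wavefronts


def idle_lanes(x, y=1, z=1):
    """Lanes that are allocated but never do useful work."""
    return wavefronts_for_group(x, y, z) * WAVE_SIZE - x * y * z
```

For example, a 16x16 group fills exactly 4 wavefronts with no idle lanes, while a 10x10 group needs 2 wavefronts and leaves 28 of 128 lanes unused.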

TOP 10 PERFORMANCE ADVICE
2. Don't over-tessellate
  - Small triangles result in poor quad occupancy
  - Use [maxtessfactor(X)] in the Hull Shader declaration
    - Recommended value is 15 or less
  - Implement culling in the Hull Shader
  - Use Adaptive Tessellation
    - Distance Adaptive
    - Screen Space Adaptive
    - Orientation Adaptive

Especially when rendering shadow maps!
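As an illustration of distance-adaptive tessellation with culling, the Python sketch below computes a per-patch factor the way a hull shader constant function might. It is a sketch under stated assumptions, not the method from the talk: the linear falloff, the `near` and `far` distances and the function name are all hypothetical. Only the cap of 15 comes from the slide, and returning a factor of 0 relies on Direct3D 11 culling patches whose tessellation factor is zero or less.

```python
def distance_adaptive_factor(distance, visible=True,
                             near=5.0, far=100.0, max_factor=15.0):
    """Tessellation factor that falls linearly from max_factor to 1.

    distance   -- eye-to-patch distance (same units as near/far)
    visible    -- False for patches rejected by frustum or back-face tests
    near, far  -- assumed distances where the falloff starts and ends
    """
    if not visible:
        return 0.0  # a factor of 0 culls the patch
    t = (distance - near) / (far - near)
    t = min(1.0, max(0.0, t))           # clamp to [0, 1]
    return max_factor + t * (1.0 - max_factor)
```

Patches at or closer than `near` get the full factor of 15, patches at or beyond `far` get 1, and culled patches cost nothing downstream.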

TOP 10 PERFORMANCE ADVICE
3. Keep your pipeline short
  - Avoid large expansion in the Geometry Shader
  - Often a Vertex Shader-only solution can replace Geometry Shader usage
    - Bokeh expansion
    - Point sprites
  - Disable the tessellation pipeline if unused
4. Pack shader stage output
  - Limit Vertex and Domain Shader output size to 4 float4/int4 attributes for best performance

    struct PS_INPUT
    {
        float3 vPosition;
        float3 vNormal;
        float2 vTexcoord1;
        float2 vTexcoord2;
        float2 vTexcoord3;
    }; // Suboptimal

    struct PS_INPUT
    {
        float4 vPositionTexcoord1U;
        float4 vNormalTexcoord1V;
        float4 vTexcoords23;
    }; // Good
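The packing in the second struct can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: it counts scalar components, compares one attribute per member against the minimum number of float4 attributes, and shows a pack/unpack round trip that follows the member names on the slide. Treating one member as one attribute is an upper bound assumed here; a real compiler may pack some members on its own.

```python
import math

# Component counts taken from the two structs on the slide.
UNPACKED = {"vPosition": 3, "vNormal": 3,
            "vTexcoord1": 2, "vTexcoord2": 2, "vTexcoord3": 2}
PACKED = {"vPositionTexcoord1U": 4, "vNormalTexcoord1V": 4, "vTexcoords23": 4}


def min_float4_slots(members):
    """Fewest float4 attributes that can hold all scalar components."""
    return math.ceil(sum(members.values()) / 4)


def pack(position, normal, uv1, uv2, uv3):
    """Pack the five original members into three 4-component tuples."""
    return (
        (*position, uv1[0]),   # vPositionTexcoord1U
        (*normal, uv1[1]),     # vNormalTexcoord1V
        (*uv2, *uv3),          # vTexcoords23
    )


def unpack(packed):
    """Recover the original members on the pixel shader side."""
    pos_u, nrm_v, uv23 = packed
    return (pos_u[:3], nrm_v[:3], (pos_u[3], nrm_v[3]), uv23[:2], uv23[2:])
```

The five members hold 12 scalars, so three float4 attributes are enough, which keeps the output within the limit of 4 attributes.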

TOP 10 PERFORMANCE ADVICE
5. Update your data using Map/Unmap
  - Avoid MAP_WRITE_DISCARD
  - Prefer MAP_WRITE_NO_OVERWRITE
  - Avoid UpdateSubresource
  - Prefer Map and/or CopyResource instead
  - UpdateSubresource is OK for small (<= 4KB) updates
  - CopyResource introduces GPU stalls
  - Don't use the updated resource immediately
  - Using data without copying it to local memory first can sometimes improve performance

Up to +20% performance boost
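One common way to favour no-overwrite mapping is to treat a dynamic buffer as a ring: each update is appended after the previous one, and a discard happens only when the buffer wraps. The Python sketch below models just that bookkeeping. It is an assumption about how the advice might be applied, not code from the talk; the class name and the mode strings are hypothetical stand-ins for the D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD map types.

```python
class DynamicBufferRing:
    """Chooses an offset and map mode for each update to one dynamic buffer."""

    def __init__(self, capacity):
        self.capacity = capacity  # buffer size in bytes
        self.cursor = 0           # first byte not yet written this cycle

    def allocate(self, size):
        """Return (offset, mode) for an update of `size` bytes."""
        if size > self.capacity:
            raise ValueError("update larger than the buffer")
        if self.cursor + size <= self.capacity:
            # Free space remains: append without touching in-flight data.
            offset, mode = self.cursor, "NO_OVERWRITE"
        else:
            # Buffer exhausted: start over with fresh storage.
            offset, mode = 0, "DISCARD"
        self.cursor = offset + size
        return offset, mode
```

With a 1024-byte buffer and 400-byte updates, two out of every three maps use the no-overwrite path; a larger buffer lowers the discard rate further.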

TOP 10 PERFORMANCE ADVICE
6. Use flow control with care
  - Flow control has little overhead
  - Skipping data fetches is usually good
  - Avoid non-coherent code paths within a wavefront
  - Watch out for GPR pressure caused by loops and deeply nested branches

    // Branching code example
    float fn0(float a, float b)
    {
        if (a > b)
            return ((a - b) * a);
        else
            return ((b - a) * b);
    }

        v_cmp_gt_f32    r0, r1            // a > b, establish VCC
        s_mov_b64       s0, exec          // Save current exec mask
        s_and_b64       exec, vcc, exec   // Do "if"
        s_cbranch_vccz  label0            // Branch if all lanes fail
        v_sub_f32       r2, r0, r1        // result = a - b
        v_mul_f32       r2, r2, r0        // result = result * a
    label0:
        s_andn2_b64     exec, s0, exec    // Do "else" (s0 & !exec)
        s_cbranch_execz label1            // Branch if all lanes fail
        v_sub_f32       r2, r1, r0        // result = b - a
        v_mul_f32       r2, r2, r1        // result = result * b
    label1:
        s_mov_b64       exec, s0          // Restore exec mask
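The cost of non-coherent code paths can be seen by replaying the listing above on a simulated wavefront. The Python sketch below is a simplified model, not an emulator: it keeps one boolean per lane as the exec mask, runs each side of the branch only if at least one lane is active, and counts how many sides were executed. The function names are hypothetical, and the skip test uses the exec mask for both sides, which matches the listing only when all lanes are active on entry.

```python
def fn0_reference(a, b):
    """Scalar version of the HLSL function on the slide."""
    return (a - b) * a if a > b else (b - a) * b


def run_wavefront(a, b):
    """Evaluate fn0 for all lanes; return (results, branch sides executed)."""
    lanes = len(a)
    exec_mask = [True] * lanes
    result = [0.0] * lanes
    sides_executed = 0

    vcc = [a[i] > b[i] for i in range(lanes)]                 # v_cmp_gt_f32
    saved = list(exec_mask)                                   # s_mov_b64 s0, exec
    exec_mask = [v and e for v, e in zip(vcc, exec_mask)]     # s_and_b64
    if any(exec_mask):                                        # skip if no lane passes
        sides_executed += 1
        for i in range(lanes):
            if exec_mask[i]:
                result[i] = (a[i] - b[i]) * a[i]

    exec_mask = [s and not e for s, e in zip(saved, exec_mask)]  # s_andn2_b64
    if any(exec_mask):                                        # skip if no lane remains
        sides_executed += 1
        for i in range(lanes):
            if exec_mask[i]:
                result[i] = (b[i] - a[i]) * b[i]

    exec_mask = saved                                         # restore exec mask
    return result, sides_executed
```

When every lane agrees on the condition, one side is skipped entirely; as soon as a single lane disagrees, the whole wavefront pays for both sides.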

TOP 10 PERFORMANCE ADVICE
7. Pack your G-Buffer using RGBA16_UINT
  - Fetches from RGBA16 are full rate (without filtering)
  - Bilinear fetches to RGBA16 are half rate
  - Exports to RGBA16_INT are full rate (without blending)
  - Caution: blended exports to RGBA16_INT are ¼ speed
8. Depth buffer: don't render after read
  - Binding a depth buffer as a texture will decompress it; this will make subsequent Z ops more expensive
  - Critical for shadow map atlas rendering!
  - Consider exporting depth to the G-Buffer
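Packing a G-Buffer into 16-bit unsigned integer channels means doing the quantization and bit packing manually in the shader. The Python sketch below shows the bit manipulation for one hypothetical layout, eight 8-bit values stored in the four channels of an RGBA16_UINT texel. The layout and function names are assumptions for illustration; the talk does not prescribe a specific encoding.

```python
def to_unorm8(value):
    """Quantize a float in [0, 1] to an 8-bit unsigned integer."""
    clamped = min(1.0, max(0.0, value))
    return int(clamped * 255.0 + 0.5)


def pack2x8(high, low):
    """Store two 8-bit values in one 16-bit channel."""
    return ((high & 0xFF) << 8) | (low & 0xFF)


def unpack2x8(channel):
    """Recover the two 8-bit values from a 16-bit channel."""
    return (channel >> 8) & 0xFF, channel & 0xFF


def pack_texel(values):
    """Pack eight [0, 1] floats into four 16-bit channels (R, G, B, A)."""
    if len(values) != 8:
        raise ValueError("expected exactly eight values")
    quantized = [to_unorm8(v) for v in values]
    return tuple(pack2x8(quantized[i], quantized[i + 1])
                 for i in range(0, 8, 2))
```

Because the channels hold raw integers, such a target cannot be filtered or blended meaningfully, which fits the slide's note that the full-rate paths are the unfiltered and unblended ones.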

TOP 10 PERFORMANCE ADVICE
9. Batch, Batch, Batch!
  - Add support for geometry instancing
  - Pool and batch your updates
  - Less important with Mantle/DirectX 12
  - Reduces draw call overhead
  - Allows better scheduling
10. (DX11) Prefer engine threading over Deferred Contexts
  - Deferred contexts are a software feature
  - ... or move to Mantle/DirectX 12
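Geometry instancing starts on the CPU side with grouping objects that can share one draw call. The Python sketch below shows a minimal version of that grouping, keyed by mesh and material. The tuple layout and function name are hypothetical, and a real engine would also sort by other pipeline state.

```python
from collections import defaultdict


def build_instanced_batches(objects):
    """Group (mesh, material, transform) tuples into per-draw instance lists.

    Returns a dict mapping (mesh, material) to the list of transforms that
    can be submitted with a single instanced draw call.
    """
    batches = defaultdict(list)
    for mesh, material, transform in objects:
        batches[(mesh, material)].append(transform)
    return dict(batches)
```

Five objects that use two mesh/material combinations collapse into two draw calls, and the saving grows with the number of repeated objects in the scene.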

TOP 10 PERFORMANCE ADVICE
Bonus Advice
- Avoid LDS bank conflicts
  - Accessing LDS from different threads with addresses that are 32 DWORDs apart will cause bank conflicts
  - Unless it's the same address
- Don't use gather with offsets
  - This will result in 4 image_gather4 instructions

Gather without offsets, and the single instruction it compiles to:

    float4 PsExample( PsInput Input ) : SV_Target
    {
        return tex.GatherCmpRed( g_SamplePointCmp, Input.vTex, Input.depth );
    }

    image_gather4_c_lz v0, v[2:5], s[4:11], s[12:15]
    s_waitcnt vmcnt(0)

Gather with offsets, and the four instructions it compiles to:

    float4 PsExample( PsInput Input ) : SV_Target
    {
        return tex.GatherCmpRed( g_SamplePointCmp, Input.vTex, Input.depth,
                                 int2(0,0), int2(1,0), int2(0,1), int2(1,1) );
    }

    image_gather4_c_lz v4, v[12:15], s[4:11], s[12:15]
    v_mov_b32 v11, 1
    image_gather4_c_lz_o v5, v[11:14], s[4:11], s[12:15]
    v_mov_b32 v11, 0x
    image_gather4_c_lz_o v7, v[11:14], s[4:11], s[12:15]
    v_mov_b32 v11, 0x
    image_gather4_c_lz_o v0, v[11:14], s[4:11], s[12:15]
    s_waitcnt vmcnt(0)
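The bank conflict rule can be checked with a small model of the LDS. The Python sketch below assumes 32 banks that are one DWORD wide, which is what the slides imply, and treats several threads reading the same DWORD as conflict-free. The function names are hypothetical and the model ignores how the hardware splits a wavefront across cycles.

```python
from collections import defaultdict

LDS_BANKS = 32
BANK_WIDTH_BYTES = 4  # one DWORD per bank


def bank_of(byte_address):
    """Bank that serves a given LDS byte address."""
    return (byte_address // BANK_WIDTH_BYTES) % LDS_BANKS


def conflict_degree(byte_addresses):
    """Largest number of distinct DWORDs requested from any single bank.

    1 means conflict-free; N means the access is serialized N ways.
    """
    dwords_per_bank = defaultdict(set)
    for address in byte_addresses:
        dwords_per_bank[bank_of(address)].add(address // BANK_WIDTH_BYTES)
    return max(len(dwords) for dwords in dwords_per_bank.values())


def strided(stride_dwords, threads=32):
    """Byte addresses for `threads` threads reading with a DWORD stride."""
    return [t * stride_dwords * BANK_WIDTH_BYTES for t in range(threads)]
```

A stride of 32 DWORDs sends every thread to bank 0 and serializes the access 32 ways, while padding the stride to 33 DWORDs spreads the same access over all banks.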

Questions?

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.