Jason Stewart (AMD) | Rolando Caloca O. (Epic Games) | 21 March 2018

Taking the Red Pill – Using Radeon GPU Profiler to look inside your game (With Bonus UE4 Vulkan Update from Epic)

Agenda
Radeon GPU Profiler walkthrough with examples from UE4 Vulkan, Jason Stewart (AMD)
UE4 Vulkan update, Rolando Caloca O. (Epic Games)

Radeon GPU Profiler walkthrough With examples from UE4 Vulkan Jason Stewart (AMD)

Quick Start Guide: What is Radeon GPU Profiler?
Analyze. Adjust. Accelerate.

Quick Start Guide: What is Radeon GPU Profiler?
Detailed workload information
DX12 and Vulkan support
Hardware-level profiling features

Quick Start Guide RGP support is built directly into the production driver

Quick Start Guide RGP is available for free on GitHub as part of gpuopen

Quick Start Guide: More information on gpuopen.com
https://gpuopen.com/tag/rgp/
https://gpuopen.com/gaming-product/radeon-gpu-profiler-rgp/
Or just Google "Radeon GPU Profiler"

Quick Start Guide: How to capture a profile (Connect)
Connect Radeon Developer Panel to Radeon Developer Service
Set up the target application

Quick Start Guide: How to capture a profile (Capture)
Launch the target application (RGP support is built directly into the production driver)
Capture a trace
Double click to open in RGP

Quick Start Guide: Help
Refer to the built-in help for more information

RGP/RenderDoc Interop
Covered in one slide only, because a full talk on this topic takes place a few hours earlier the same day
UE4 works out of the box with RGP and RenderDoc, e.g. frame markers get enabled automatically
Needs the 18.10 driver or newer
VK_EXT_debug_marker: vkCmdDebugMarkerBeginEXT, vkCmdDebugMarkerEndEXT
EnableIdealGPUCaptureOptions

Viewing a capture in RGP: Where to start
SunTemple at 1080p with vsync enabled is present bound
The 18.10 driver includes present packets in the system activity display
This slide points out the new present packet display and emphasizes that you want to get out of this situation (vsync bound) when profiling
Open questions: for profiling, should VK_PRESENT_MODE_IMMEDIATE_KHR be preferred (i.e. vsync disabled)? If so, can UE4 code do this automatically, the way it enables frame markers?

Viewing a capture in RGP: Where to start
Placeholder: show a capture with vsync disabled that is still not right (a broken query system, for example)

Viewing a capture in RGP: Where to start
You want to see something more like this: vsync disabled, no CPU-GPU sync points, no GPU idle time, GPU bound
Make sure you have this part right, the CPU-side feeding of the GPU, before continuing

Viewing a capture in RGP: Where to start
Barrier reason codes
Fast clear eliminate and init mask RAM are to be expected
Try to avoid the other cases: depth/stencil decompress, DCC decompress, etc.

Viewing a capture in RGP: Where to start
Other things available in the overview tab: most expensive events, context rolls, device configuration
This part might be brief, depending on whether there are any particularly interesting UE4 Vulkan examples to talk about

Event Timing
Show before and after for the depth bounds test
Frame it in the context of making sure any DX-only optimizations are brought over to your Vulkan path

Wavefront Occupancy Occupancy Graph GPU Events

Wavefront Occupancy: What’s a wavefront?

GCN Architecture: X Shader Engines per Chip with Y Compute Units per Shader Engine
This particular diagram is for Polaris, specifically Polaris 10 "Ellesmere", i.e. RX 480/580

GCN Architecture: X Shader Engines per Chip
RX 580 has this configuration:
Compute Units: 36
ROPs: 32
Stream Processors: 2304
Texture Units: 144
Once pixels have been shaded, they are typically sent to the render back-ends (RBs) for depth and stencil tests and any blending before the final output to the display.
36 CUs with 4 SIMDs per CU and a 16-lane VALU per SIMD: 36 × 4 × 16 = 2304 Stream Processors
Each lane can do 2 floating point ops per clock: 2304 × 2 = 4608 FLOPs per clock
At the 1340 MHz boost clock: 4608 × 1,340,000,000 = 6,174,720,000,000 FLOPs per second, i.e. about 6.2 TFLOPs
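The peak-throughput arithmetic on this slide can be reproduced in a few lines. This is only a sketch using the RX 580 figures quoted above; the function and constant names are made up for illustration and are not part of any AMD tool.

```python
# RX 580 figures as quoted on the slide.
COMPUTE_UNITS = 36
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16            # 16-lane VALU per SIMD
FLOPS_PER_LANE_PER_CLOCK = 2   # 2 floating point ops per clock
BOOST_CLOCK_HZ = 1_340_000_000 # 1340 MHz boost clock


def stream_processors(cus, simds_per_cu, lanes_per_simd):
    """Total number of vector ALU lanes on the chip."""
    return cus * simds_per_cu * lanes_per_simd


def peak_flops_per_second(lanes, flops_per_lane_per_clock, clock_hz):
    """Theoretical peak floating point operations per second."""
    return lanes * flops_per_lane_per_clock * clock_hz


sp = stream_processors(COMPUTE_UNITS, SIMDS_PER_CU, LANES_PER_SIMD)
peak = peak_flops_per_second(sp, FLOPS_PER_LANE_PER_CLOCK, BOOST_CLOCK_HZ)

print(sp)                            # 2304
print(peak)                          # 6174720000000
print(f"{peak / 1e12:.1f} TFLOPs")   # 6.2 TFLOPs
```

This is a theoretical ceiling at boost clock; a real workload sits below it.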

GCN Architecture: Y Compute Units per Shader Engine
RX 580 has this configuration:
Compute Units: 36
Stream Processors: 2304
Texture Units: 144

GCN Compute Unit: 4 SIMD Units per CU
[Diagram of a compute unit. Labeled blocks: Scheduler, Branch & Message Unit, four SIMDs each with a VALU and VGPRs, SALU with SGPRs, LDS, Tex L1]

GCN Compute Unit: What’s a wavefront?

GCN Compute Unit: What’s a wavefront?
SIMD = Single instruction, multiple data
What’s the multiple? The GPU works on 64-thread groups called wavefronts (or waves)
A wavefront is the smallest unit of GPU work: a collection of 64 work items grouped for efficient processing on the compute unit
Each wavefront shares a single program counter
Each SIMD executes an independent wavefront
Each SIMD includes a 16-lane vector ALU and simultaneously executes a single operation across 16 work items (threads)
A wavefront is issued to a SIMD in a single cycle, but takes 4 cycles to execute operations for all 64 work items
Many wavefronts can be processed in parallel
[Diagram labels: SIMD, 16-lane vector ALU, 64KB VGPR Memory]
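As a rough illustration of the "64 work items over four clock cycles" point, the sketch below splits one wavefront into groups of 16 lanes. The slide gives the sizes (64 and 16). The assumption that consecutive work-item ids run together in the same cycle is mine, made for illustration only.

```python
WAVEFRONT_SIZE = 64  # work items per wavefront (from the slide)
SIMD_LANES = 16      # vector ALU lanes per SIMD (from the slide)


def wavefront_passes(wavefront_size=WAVEFRONT_SIZE, lanes=SIMD_LANES):
    """Split the work-item ids of one wavefront into per-cycle groups.

    Assumption: work items are processed in consecutive id order.
    """
    items = list(range(wavefront_size))
    return [items[i:i + lanes] for i in range(0, wavefront_size, lanes)]


passes = wavefront_passes()
for cycle, group in enumerate(passes):
    print(f"cycle {cycle}: work items {group[0]}..{group[-1]}")
```

With the slide's numbers this yields four groups of 16, matching the 4 cycles needed to execute one operation for all 64 work items.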

GCN Compute Unit: Multiple waves in flight per SIMD
A GCN Compute Unit supports up to 8 or 10 wavefronts in flight per SIMD “at once”
This helps hide memory latency: the Compute Unit scheduler can switch to another wavefront on a SIMD when a memory fetch stalls a shader program
[Diagram: a SIMD with a 16-lane vector ALU, Waves 0 to 7 in flight, and 64KB VGPR Memory]
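One way to see why VGPR usage matters for the number of waves in flight is a back-of-the-envelope estimate. The sketch below takes the 64KB VGPR figure and the cap of 10 waves from the slide. It assumes the 64KB applies to a single SIMD and that each register is 32 bits wide per work item. It ignores allocation granularity and every other limit (SGPRs, LDS, and so on). Treat it as an approximation, not as how RGP or the driver computes occupancy.

```python
VGPR_FILE_BYTES = 64 * 1024  # "64KB VGPR Memory" (assumed to be per SIMD)
WAVEFRONT_SIZE = 64          # work items per wavefront
BYTES_PER_REGISTER = 4       # assumption: 32-bit registers
MAX_WAVES_PER_SIMD = 10      # slide: "up to 8 or 10" wavefronts per SIMD


def vgpr_budget_per_work_item():
    """Registers available to each work item if a single wave owned the file."""
    return VGPR_FILE_BYTES // (WAVEFRONT_SIZE * BYTES_PER_REGISTER)


def max_waves_by_vgprs(vgprs_per_work_item, wave_cap=MAX_WAVES_PER_SIMD):
    """Estimate waves in flight per SIMD when only VGPRs limit occupancy."""
    if vgprs_per_work_item <= 0:
        raise ValueError("vgprs_per_work_item must be positive")
    return min(wave_cap, vgpr_budget_per_work_item() // vgprs_per_work_item)


for vgprs in (24, 32, 64, 128):
    print(f"{vgprs} VGPRs -> {max_waves_by_vgprs(vgprs)} waves per SIMD")
```

Under these assumptions a shader using 64 VGPRs leaves room for only 4 waves per SIMD, which gives the scheduler fewer waves to switch to when a memory fetch stalls.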

Wavefront Occupancy Live Demo Several traces from UE4 will be shown

Wavefront Occupancy
One slide on why, if lower VGPR usage is desirable, the AMD driver shader compiler doesn’t just force lower VGPR usage

Radeon GPU Profiler: Help
Refer to the built-in help for more information

Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2018 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

UE4 Vulkan update Rolando Caloca O. (Epic Games)

Backup slides

Radeon RX 580 Specs
Compute Units: 36
Base Frequency: Up to 1257 MHz
Boost Frequency: Up to 1340 MHz
Peak Pixel Fill-Rate: Up to 42.88 GP/s
Peak Texture Fill-Rate: Up to 192.96 GT/s
Max Performance: Up to 6.2 TFLOPs
ROPs: 32
Stream Processors: 2304
Texture Units: 144
Transistor Count: 5.7 B
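The two fill-rate figures in this spec list are consistent with a simple units-times-clock calculation. The sketch below assumes one pixel per ROP per clock and one texel per texture unit per clock at the boost frequency. That assumption is mine, chosen because it reproduces the listed numbers. The helper name is hypothetical.

```python
BOOST_CLOCK_MHZ = 1340
ROPS = 32
TEXTURE_UNITS = 144


def peak_rate_giga(units, clock_mhz, per_unit_per_clock=1):
    """Peak rate in giga-operations per second (assumes 1 op per unit per clock)."""
    return units * per_unit_per_clock * clock_mhz / 1000


pixel_fill = peak_rate_giga(ROPS, BOOST_CLOCK_MHZ)
texture_fill = peak_rate_giga(TEXTURE_UNITS, BOOST_CLOCK_MHZ)

print(f"{pixel_fill:.2f} GP/s")    # 42.88 GP/s
print(f"{texture_fill:.2f} GT/s")  # 192.96 GT/s
```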