RegMutex: Inter-Warp GPU Register Time-Sharing


RegMutex: Inter-Warp GPU Register Time-Sharing Farzad Khorasani* Hodjat Asghari Esfeden Amin Farmahini-Farahani Nuwan Jayasena Vivek Sarkar *farkhor@gatech.edu The 45th International Symposium on Computer Architecture - ISCA 2018

GPU Register File
The fastest and most expensive type of memory. To allow warp context switching with minimal overhead, registers are:
- Statically assigned
- Exclusively dedicated
The largest on-chip resource, supporting up to 2048 resident threads; it accounts for a large fraction of chip area and power consumption. Picture from the Nvidia Volta architecture whitepaper.

GPU Register File
Static assignment: a fixed number of registers per thread is reserved throughout kernel execution.
Exclusive allocation: allocated physical registers cannot be used by other warps.
The illustration on the right-hand side shows the total fraction of registers used. Example from the nvdisasm CUDA binary utility documentation (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#nvdisasm)

Static and Exclusive Register Allocation
Results in GPU register file underutilization: a large portion of physical registers is allocated but rarely used.
[Figure: the utilization of a sample thread's allocated register set during kernel execution. X axis: the number of instructions executed by the thread; Y axis: the percentage of live registers relative to allocated registers.]

Static and Exclusive Register Allocation
Problematic when occupancy is limited by register usage.
[Figure: register allocation over time. Warp A executes with some of its allocated registers unused; when warp A ends and warp B starts, warp B's allocation likewise contains unused registers.]

Related Work
- Register File Virtualization (MICRO 2015): implements a register renaming table for warp registers
- Resource Sharing (HPDC 2016): warps race for one-time acquisition of shared resources
- Zorua (MICRO 2016): virtualizes on-chip resources and distributes them among warps on demand
- RegLess (MICRO 2017): removes the register file and orchestrates prefetching of thread-private variables

RegMutex: Main Idea
Register Mutual Exclusion: a compiler-architecture co-design that shares registers between SM warps by time-multiplexing.
- The compiler marks regions of the kernel binary with high register consumption; the registers needed in those regions form the extended register set
- A shared register pool is carved out of the register file upon kernel initiation
- When the program needs to work with its extended set, the set must be acquired from the shared pool
Think of it as communal register allocation.

RF Utilization Visualization in RegMutex
[Figure: Default: 6 warps, each allocated the full register set Rt. RegMutex: 8 warps, each allocated a base set Bs, plus a Shared Register Pool (SRP) for the extended sets. Rt = Bs + Es, and the register resources are equal: |Rt| * 6 = |Bs| * 8 + |SRP|.]
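The resource equality |Rt| * 6 = |Bs| * 8 + |SRP| can be sanity-checked with a few lines of arithmetic. The register counts below are illustrative assumptions, not values from the paper:

```python
# Hypothetical per-warp register counts to illustrate the RegMutex
# resource balance: |Rt| * 6 = |Bs| * 8 + |SRP|, with Rt = Bs + Es.
Rt = 64          # full register set per warp under default allocation
Bs = 40          # base set per warp under RegMutex (assumed split)
Es = Rt - Bs     # extended set, acquired on demand -> 24 registers

default_warps = 6
regmutex_warps = 8

# The shared register pool absorbs whatever the base sets leave over.
SRP = Rt * default_warps - Bs * regmutex_warps   # 384 - 320 = 64

# The pool must hold at least one extended set so a warp can make progress.
assert SRP >= Es
print(f"SRP = {SRP} registers, fits {SRP // Es} extended set(s) of {Es}")
```

With this split, two warps can hold their extended sets concurrently while the other six run on base sets alone.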

RegMutex
[Figure: register allocation over time. Warp A starts with its base register set, acquires its extended register set from the shared pool, later releases it, and ends execution. Warp B starts with its base register set, tries to acquire its extended register set but stalls while the pool is held by warp A, acquires it and resumes once warp A releases, then releases it and ends execution.]

RegMutex
Main contribution: higher kernel occupancy on a fixed resource budget, while sustaining approximately the same performance with a smaller RF.
Main advantages compared to existing approaches:
- Low hardware overhead, by offloading resource allocation to the compiler
- Does not affect kernels that do not require RegMutex capabilities
- Can be disabled by the compiler without side effects

RegMutex: Compiler Side 1st step: register liveness analysis of the GPU assembly code to extract the register usage information

RegMutex: Compiler Side
2nd step: extended register set size determination
- Determine |Bs| and |Es|, the sizes of the base and extended sets
3rd step: acquire/release primitive injection into the assembly code
4th step: architected register index compaction before and throughout the release state
- Allows keeping the existing hardware for architected-to-physical register mapping
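The liveness analysis in the 1st step can be sketched for straight-line code. The (dest, sources) tuple encoding below is a simplification we assume in place of a real pass over GPU assembly:

```python
# Backward liveness analysis over straight-line code: a register is live
# at a point if it is read later before being overwritten. Instructions
# are (dest, sources) tuples; a 'None' dest means no register is written.
def liveness(instrs):
    live = set()
    live_per_point = []
    for dest, srcs in reversed(instrs):
        if dest is not None:
            live.discard(dest)     # a definition kills the register
        live |= set(srcs)          # uses make registers live
        live_per_point.append(frozenset(live))
    return list(reversed(live_per_point))   # live-in set per instruction

# r0 = load; r1 = load; r2 = r0 + r1; store r2
code = [("r0", []), ("r1", []), ("r2", ["r0", "r1"]), (None, ["r2"])]
print([sorted(s) for s in liveness(code)])
# -> [[], ['r0'], ['r0', 'r1'], ['r2']]
```

The peak number of simultaneously live registers in each marked region is what guides the |Bs| versus |Es| split in the 2nd step.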

RegMutex: Architecture Side
[Figure: SM pipeline (Fetch, I-Cache, Decode, Scoreboard, I-Buffer, Issue, Operand Collector, ALU, MEM) augmented with an Nw-bit Warp Status Bitmask, an SRP Bitmask, and a LUT with ceil(log2(Nw))-bit entries.]

RegMutex: Architecture Side
To acquire Es:
1. Find an empty location in the SRP bitmask: loc = FFZ(SRP). Wait if there is no empty section.
2. Write the location index into the warp's row in the look-up table: LUT[Widx] = loc.
3. Set the warp status bitmask: Set(Widx).

RegMutex: Architecture Side
To release Es:
1. Unset the warp status bitmask: Unset(Widx).
2. Find the warp-assigned SRP location from the LUT: srpidx = LUT[Widx].
3. Unset the found location in the SRP bitmask: Unset(srpidx).
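The acquire and release sequences can be modeled in software with ordinary bitmask arrays. The class and field names below are our own sketch of the bookkeeping, not RTL from the paper:

```python
# Software model of the RegMutex tracking hardware: an SRP bitmask (one
# bit per shared-pool section), a per-warp status bitmask, and a LUT
# mapping each warp to the SRP section it currently holds.
class RegMutexUnit:
    def __init__(self, num_warps, srp_sections):
        self.srp = [False] * srp_sections   # True = section in use
        self.status = [False] * num_warps   # True = warp holds its Es
        self.lut = [0] * num_warps          # warp -> SRP section index

    def acquire(self, widx):
        # Step 1: find-first-zero over the SRP bitmask.
        try:
            loc = self.srp.index(False)
        except ValueError:
            return False                    # no free section: warp must wait
        self.srp[loc] = True
        self.lut[widx] = loc                # step 2: record the mapping
        self.status[widx] = True            # step 3: mark warp as holder
        return True

    def release(self, widx):
        self.status[widx] = False           # step 1: clear warp status
        loc = self.lut[widx]                # step 2: look up held section
        self.srp[loc] = False               # step 3: free it in the bitmask

u = RegMutexUnit(num_warps=2, srp_sections=1)
assert u.acquire(0)          # warp 0 grabs the only section
assert not u.acquire(1)      # warp 1 stalls: pool exhausted
u.release(0)
assert u.acquire(1)          # warp 1 resumes once warp 0 releases
```

The three-line usage at the bottom mirrors the timeline slide: warp B's acquire fails while warp A holds the pool and succeeds after warp A's release.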

Experimental Results: Kernel Occupancy Boost
- GPGPU-Sim v3.2.2, GTX480 specification with 15 SMs and 128 KB registers/SM
- 8 apps from Rodinia, Parboil, and the Nvidia CUDA SDK; occupancy limited by excessive register usage
- Compiled with NVCC 4.0 and GCC 4.6 with -O3, targeting CC 1.3
- Average execution cycle reduction of 13%

Experimental Results: RF Size Reduction
- GPGPU-Sim v3.2.2, GTX480 specification with 15 SMs and 64 KB registers/SM
- Another 8 apps from Rodinia, Parboil, and the Nvidia CUDA SDK; occupancy limited by excessive register usage on the new architecture specification
- Compiled with NVCC 4.0 and GCC 4.6 with -O3, targeting CC 1.3
- Without RegMutex: 23% execution cycle increase; with RegMutex: 9% execution cycle increase

Experimental Results: Performance Comparison
- GPGPU-Sim v3.2.2, GTX480 specification with 15 SMs and 128 KB registers/SM
- 8 apps from Rodinia, Parboil, and the Nvidia CUDA SDK; compiled with NVCC 4.0 and GCC 4.6 with -O3, targeting CC 1.3
- OWF: Owner Warp First [HPDC'16]; RFV: Register File Virtualization [MICRO'15]
- On average: OWF 1.9%, RFV 16.2%, RegMutex 12.8%
- RFV demands 31 kilobits of storage while RegMutex needs 384 bits (~81x less)

Summary
- Static and exclusive register allocation on GPUs wastes resources
- RegMutex is a synergistic compiler-architecture technique that improves GPU register utilization: it marks regions of high register consumption and takes a communal approach to allocating physical registers for those regions
- Improves occupancy when occupancy is limited by register usage
- Sustains application performance with a smaller RF size
- Lower hardware implementation complexity than its counterparts
- Reduces execution cycles by up to 23%


Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2018 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.