Interference from GPU System Service Requests

Interference from GPU System Service Requests
Arkaprava Basu, Joseph L. Greathouse, Guru Venkataramani, Ján Veselý
AMD Research, Advanced Micro Devices, Inc.

Modern Systems are Powered by Heterogeneity
The 6th Gen. AMD A-Series Processor ("Carrizo") integrates CPUs and a GPU on one chip, and accelerators from industry and academia target machine learning, databases, computer vision, regular expressions, physics, graph analytics, finite state machines, genome sequencing, reconfigurable logic (e.g., FPGAs), video decode, and video encode.

Traditional Heterogeneous Computing Driven by CPUs
Work is launched by the CPU to accelerators, data is marshalled by the CPU, and communication and computation are coarse-grained.

Modern Accelerators are Equal Partners
Accelerators and CPUs can launch work to one another (and to themselves), data transfer is implicit based on usage, and communication and computation are fine-grained.

Accelerators Can Now Request OS Services
GPU-to-CPU callbacks (Owens et al., UCHPC 2010)
Page faults / demand paging (Veselý et al., ISPASS 2016)
GPU-initiated network requests (GPUnet, USENIX 2014)
GPU-initiated file system requests (GPUfs, ASPLOS 2013)
"Generic" system calls (Genesys, ISCA 2018): ioctl() for other devices, memory management (e.g., sbrk(), mmap()), and signals
(A hypothetical request-packet layout is sketched below.)
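To make the discussion concrete, the following is a minimal sketch of the kind of request packet a GPU kernel might stage in shared memory before raising a system service request (SSR). The struct layout, field names, and request types are illustrative assumptions for this sketch; they do not correspond to the actual format used by amdkfd or any of the systems cited above.

    /* Hypothetical GPU system service request (SSR) packet.  All names and
     * fields are illustrative assumptions, not an actual driver format. */
    #include <stdint.h>

    enum ssr_type {
        SSR_PAGE_FAULT,    /* demand paging                        */
        SSR_FILE_READ,     /* GPUfs-style file system request      */
        SSR_NET_SEND,      /* GPUnet-style network request         */
        SSR_GENERIC_CALL   /* "generic" system call (e.g., mmap()) */
    };

    struct ssr_packet {
        uint32_t type;           /* one of enum ssr_type                       */
        uint32_t wavefront_id;   /* which GPU wavefront is waiting             */
        uint64_t args[4];        /* request arguments staged by the GPU        */
        volatile uint64_t done;  /* the CPU writes the result/completion here; */
                                 /* the stalled GPU wavefront polls this word  */
    };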

How a GPU System Service Request Is Handled
Step 1: The GPU sets up the request arguments in memory.
Step 2: The GPU sends a request interrupt to a CPU.
Step 3: The interrupted CPU (a) schedules the bottom-half handler and (b) ACKs the request to the GPU.
Step 4: The CPU (a) sets up the kernel work-queue arguments and (b) queues a worker thread.
Step 5: The worker thread performs the SSR.
Step 6: The CPU sends the response to the GPU.
(A minimal driver-style sketch of this flow follows below.)
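A minimal, driver-style sketch of this default path is shown below. It assumes the hypothetical ssr_packet layout from the earlier sketch; gpu_read_ssr_packet(), gpu_ack_interrupt(), perform_ssr(), and gpu_write_response() are made-up stand-ins rather than amdkfd functions, while the interrupt and workqueue calls are ordinary Linux kernel APIs.

    /* Sketch of the default SSR handling path (steps 2-6), Linux-driver style. */
    #include <linux/interrupt.h>
    #include <linux/workqueue.h>

    /* Hypothetical helpers standing in for driver functionality: */
    void gpu_read_ssr_packet(struct ssr_packet *p);
    void gpu_ack_interrupt(void *dev);
    u64  perform_ssr(struct ssr_packet *p);
    void gpu_write_response(struct ssr_packet *p, u64 result);

    static struct ssr_packet pending;        /* request staged by the GPU (step 1) */
    static struct work_struct ssr_work;

    /* Top half (step 3): runs in interrupt context on whichever CPU took the IRQ. */
    static irqreturn_t gpu_ssr_isr(int irq, void *dev)
    {
        gpu_read_ssr_packet(&pending);       /* fetch the request arguments        */
        schedule_work(&ssr_work);            /* 3a: schedule the bottom-half work  */
        gpu_ack_interrupt(dev);              /* 3b: ACK the request to the GPU     */
        return IRQ_HANDLED;
    }

    /* Bottom half (steps 4-6): runs later in a kernel worker thread. */
    static void gpu_ssr_worker(struct work_struct *work)
    {
        u64 result = perform_ssr(&pending);    /* step 5: e.g., resolve a page fault */
        gpu_write_response(&pending, result);  /* step 6: unblock the stalled GPU    */
    }

    /* At init: INIT_WORK(&ssr_work, gpu_ssr_worker) and request_irq(...) wire
     * gpu_ssr_isr to the GPU's SSR interrupt line. */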

System Service Requests Can Lead to Difficulties
GPUs and accelerators can request OS (system) services.
These SSRs can interfere with unrelated CPU-based work.
Unrelated CPU-based work can slow down GPU SSR handling.
CPUs lose the opportunity to sleep because of GPU SSRs.

Modes of Interference from GPU SSRs
(Timeline figure: while the GPU stalls, the SSR handling steps 1-6 occupy CPU0-CPU2; the legend distinguishes CPU user work, GPU user work, mode/task switches, kernel task operation, and low-performance user-mode operation.)
The resulting interference falls into three classes: direct CPU overheads, indirect CPU overheads, and GPU overheads.

Test Platform Used to Demonstrate This
AMD A10-7850K APU: 4x 3.7 GHz CPU cores (AMD Family 15h Model 30h); 720 MHz AMD GCN 1.1 ("Sea Islands", gfx700) GPU with 8 CUs; 32 GB dual-channel DDR3-1866.
Ubuntu 14.04.3 LTS (AMD64), Linux® kernel 4.0.0 with HSA drivers (amdkfd 1.6.1).
PARSEC benchmarks for "CPU work".
OpenCL™ benchmark applications modified to create SSRs (page faults): BPT [Veselý et al., ISPASS 2016], XSBench [Veselý et al., ISPASS 2016], SHOC BFS, SHOC SpMV, Pannotia SSSP, µBenchmark.

CPU Performance with SSR-Using GPU Applications

Indirect CPU Overheads from GPU SSRs

GPU Performance with Concurrent CPU Apps
CPU cores avoid going to sleep, which can increase responsiveness to GPU requests.

Mitigation Strategies
Take inspiration from other domains (e.g., high-performance networking): interrupt coalescing, interrupt steering, merged SSR handlers, and a driver for enforcing CPU QoS.

Mitigation: Interrupt Coalescing
Wait until multiple SSR requests arrive before interrupting a CPU core.
The CPU takes fewer interrupts and sees less indirect user-level overhead, but the GPU is more likely to stall. (An illustrative sketch of this policy follows below.)
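The sketch below shows one way such a coalescing policy could look. The batch size, age limit, and the raise_cpu_interrupt() helper are assumptions made for the example, not values or functions from the actual driver.

    /* Illustrative interrupt-coalescing policy: raise one CPU interrupt for a
     * batch of SSRs instead of one per request.  Thresholds are assumptions. */
    #include <stdint.h>

    #define SSR_BATCH    8    /* interrupt after this many pending requests     */
    #define SSR_MAX_AGE  50   /* ...or once the oldest request is this old (µs) */

    void raise_cpu_interrupt(void);   /* hypothetical: step 2, once per batch */

    static unsigned int pending_ssrs;
    static uint64_t     oldest_us;

    void ssr_arrived(uint64_t now_us)
    {
        if (pending_ssrs++ == 0)
            oldest_us = now_us;

        /* Fewer interrupts mean less indirect overhead on unrelated CPU work,
         * but the wavefronts behind the earliest requests stall longer. */
        if (pending_ssrs >= SSR_BATCH || now_us - oldest_us >= SSR_MAX_AGE) {
            raise_cpu_interrupt();
            pending_ssrs = 0;
        }
    }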

Mitigation: Interrupt Steering
Ensure that all SSR interrupts go to the same core, rather than round-robin across all cores.
This limits how many cores are affected and reduces global indirect overheads, but it harms fairness. (A sketch follows below.)
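Below is a sketch of how a driver could express this steering with the standard Linux IRQ-affinity hint; gpu_ssr_irq and the choice of victim core are assumptions for the example. A similar effect can also be achieved from user space by writing a CPU mask to /proc/irq/<N>/smp_affinity.

    /* Sketch: pin the GPU's SSR interrupt to one "victim" core instead of
     * letting it rotate across all cores.  gpu_ssr_irq is assumed to have
     * been obtained when the interrupt was requested. */
    #include <linux/interrupt.h>
    #include <linux/cpumask.h>

    static void steer_ssr_interrupts(unsigned int gpu_ssr_irq, int victim_cpu)
    {
        /* Other cores keep their caches, predictors, and sleep states intact,
         * but victim_cpu now absorbs all of the SSR overhead (the fairness
         * concern noted above). */
        irq_set_affinity_hint(gpu_ssr_irq, cpumask_of(victim_cpu));
    }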

Mitigation: Merged SSR Handler
Merge the SSR pre-processing (step 4) with the interrupt handler (step 3).
This yields fewer interrupts, less indirect overhead, and much lower latency for GPU requests, but it runs more CPU code in a high-priority context. (A sketch follows below.)
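A sketch of the merged variant is below, reusing the hypothetical helpers and ssr_packet type from the default-path sketch. Note that work which can sleep (such as resolving a page fault) would in practice need a threaded interrupt handler rather than running in hard-IRQ context.

    /* Sketch of the merged SSR handler: the request is serviced directly in
     * the interrupt handler instead of being deferred to a worker thread. */
    static irqreturn_t gpu_ssr_isr_merged(int irq, void *dev)
    {
        struct ssr_packet req;

        gpu_read_ssr_packet(&req);
        gpu_ack_interrupt(dev);

        /* No workqueue hop: lower latency for the stalled GPU and less
         * indirect overhead, but potentially long-running CPU work now
         * executes in a high-priority context. */
        gpu_write_response(&req, perform_ssr(&req));
        return IRQ_HANDLED;
    }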

Pareto Curve of Mitigation Strategy Tradeoffs
(Chart comparing Default, Merged Handler, Interrupt Steering + Interrupt Coalescing, and Interrupt Steering + Merged Handler.)

Driver Modifications to Ensure CPU QoS
A governor in the driver tracks SSR-handling overheads (step 5) and delays step 6, stalling the GPU, when CPU work needs protection.

Governor to Control CPU QoS
If the CPU cycles spent handling SSRs exceed a threshold, the delay is set to 10 µs when it was previously zero and is doubled otherwise; if the cycles are under the threshold, the delay resets to 0. The driver then sleeps for 'Delay' µs before servicing the SSR and returning the GPU's result. (A minimal sketch of this logic follows below.)
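A minimal sketch of the decision logic from the flow chart above is shown below. The way the measured cycle count and threshold are supplied is an assumption for the example; the 10 µs starting delay and the doubling come directly from the flow chart.

    /* Sketch of the CPU-QoS governor: when the CPU time spent handling SSRs
     * exceeds a budget, delay step 6 (the response to the GPU) with
     * exponential backoff, stalling the GPU to protect CPU work. */
    #include <stdint.h>

    static uint64_t ssr_delay_us;   /* current backoff; 0 means no throttling */

    uint64_t governor_update(uint64_t ssr_cycles, uint64_t threshold_cycles)
    {
        if (ssr_cycles > threshold_cycles)
            ssr_delay_us = ssr_delay_us ? ssr_delay_us * 2 : 10; /* 10 µs, then double */
        else
            ssr_delay_us = 0;                                    /* under budget */

        /* Caller sleeps for ssr_delay_us microseconds, then services the SSR
         * and returns the GPU's result (steps 5 and 6). */
        return ssr_delay_us;
    }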

CPU Performance at Different QoS Levels

GPU Performance Suffers for CPU QoS

Summary
Heterogeneous systems include increasingly many accelerators.
GPUs and accelerators now request system services.
These requests can cause interference between accelerators and unrelated CPU work, and the problem may worsen in the future.
Existing mitigation strategies help, but they are not a complete solution.

Questions? Disclaimer The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Linux is a registered trademark of Linus Torvalds. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other names used herein are for identification purposes only and may be trademarks of their respective companies.

Backup Slides

Pareto Curve of Mitigation Strategy Tradeoffs
(Full chart: Default, Interrupt Coalescing, Interrupt Steering, Merged Handler, and their combinations, including Interrupt Steering + Interrupt Coalescing, Merged Handler + Interrupt Coalescing, Merged Handler + Interrupt Steering, and Merged Handler + Interrupt Steering + Coalescing.)

SSRs Limit Low-Power Sleep States

Mitigation Effect on Sleep States