Machine Learning for Performance and Power Modeling of Heterogeneous Systems Joseph L. Greathouse, Gabriel H. Loh Advanced Micro Devices, Inc.

Large Design Space for Heterogeneous Systems
[Diagram: high-level system design points on the 6th Gen. AMD A-Series Processor “Carrizo”, showing the CPUs, GPU, video decode, and video encode blocks]
Some CPU design questions: How large? How many CPUs? How fast should the CPUs run? How much power should they use?
Some GPU design questions: How much parallelism? How fast should it run? What accelerators are needed? How much bandwidth is needed?

Example of a Design Space Exploration: Various Designs Using AMD GPUs

Name                            CUs   Max Freq. (MHz)   Max DRAM BW (GB/s)
AMD E1-6010 APU                   2               350                   11
AMD A10-7850K APU                 8               720                   35
Microsoft Xbox One™ Processor    12               853                   68
Sony PlayStation® 4 Processor    18               800                  176
AMD Radeon™ R9-280X              32              1000                  286
AMD Radeon™ R9-290X              44                 —                  352
AMD Radeon™ R9 Fury X            64                 —                  512

Design Space Explorations Require Long Runtimes
[Chart: power and thermals measured on a real heterogeneous processor over a run covering ~2.5 trillion CPU instructions and ~60 trillion GPU operations]
Real applications of interest are large and complex: not microbenchmarks, they can run for minutes or hours.
Performance and power can depend on what happened in the past, with complex interactions between the various heterogeneous devices.

Design-Space Exploration Is Hard on Existing Platforms
Microarchitecture simulators (e.g., gem5, Multi2Sim, SESC, GPGPU-Sim): excellent for low-level details, but too slow for broad design space explorations of full applications. For example: 6 minutes * 60 s/min * 4 CPU cores * ~1.75 GHz (AMD A8-4555M) + 6 minutes * 60 s/min * 6 CUs * 64 FPUs/CU * 425 MHz (AMD Radeon™ HD 7600G) = ~60 trillion operations ≈ 1 year of simulation time @ 2 MIPS.
Emulators (e.g., Cadence Palladium, Synopsys ZeBu, Mentor Veloce, IBM AWAN): much faster than SW simulation and just as detailed, but require a nearly-complete design.
Functional simulators (e.g., AMD SimNow, Simics, QEMU): faster than microarchitectural simulators and good for software bring-up, but bear no relation to hardware performance.
Spreadsheet analyses: great for first-order analyses, and much faster and easier than lower-level simulators, but difficult to use for analyzing application differences.
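That simulation-time estimate is simple arithmetic; here is a minimal Python sketch of it, using only the core counts, clock rates, and ~2 MIPS simulator throughput quoted above.

```python
# Rough estimate of how long a detailed simulation of a 6-minute application run
# would take, using the figures from the slide (AMD A8-4555M CPU plus its
# Radeon HD 7600G GPU, simulated at ~2 MIPS).

runtime_s = 6 * 60                         # 6-minute application run

cpu_ops = runtime_s * 4 * 1.75e9           # 4 CPU cores at ~1.75 GHz
gpu_ops = runtime_s * 6 * 64 * 425e6       # 6 CUs * 64 FPUs/CU at 425 MHz

total_ops = cpu_ops + gpu_ops              # ~60 trillion operations
sim_rate_ops_per_s = 2e6                   # ~2 MIPS for a cycle-level simulator

sim_years = total_ops / sim_rate_ops_per_s / (3600 * 24 * 365)
print(f"{total_ops / 1e12:.1f} trillion operations -> ~{sim_years:.2f} years of simulation")
```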

Drive Design Space Exploration from Real Hardware
Gather the following from running the application of interest on real hardware:
- Measured performance and power
- Information about how the application used the hardware (e.g., performance counters)
Then estimate performance and power for different hardware design points.
[Diagram: the GPU hardware produces kernel execution time/power and performance counters; these feed performance and power models which, given a target hardware configuration, output the target execution time/power]
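The flow above can be summarized as a small interface; the class and field names below are illustrative placeholders rather than the actual tool behind this talk, and the clustering and curve-lookup sketches later in the deck show one way the predict step could be realized.

```python
# Sketch of the hardware-driven estimation flow: measure a kernel once on real
# hardware, record its performance counters, and ask a trained model what the
# kernel would do at a different design point. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class KernelMeasurement:
    name: str
    exec_time_s: float        # measured on the base GPU
    power_w: float            # measured on the base GPU
    counters: list[float]     # performance-counter values from the same run

@dataclass
class DesignPoint:
    compute_units: int
    core_freq_mhz: int
    mem_bw_gb_s: float

class PerfPowerModel:
    """Trained offline on kernels measured across many hardware configurations."""

    def predict(self, measured: KernelMeasurement,
                target: DesignPoint) -> tuple[float, float]:
        # Returns (estimated execution time, estimated dynamic power) at `target`.
        raise NotImplementedError("filled in by the clustering/curve approach below")

# Usage: one real-hardware measurement drives estimates for many design points.
#   time_est, power_est = model.predict(measured_kernel, DesignPoint(32, 1000, 288))
```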

The “High-Level” Design Space Explored in This Talk
Changes to the following GPU parameters (together they form the design grid enumerated in the sketch below):
- Number of parallel compute units (CUs)
- Core frequency
- Memory bandwidth
Resulting changes estimated for each GPU kernel:
- Performance
- Dynamic power
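A minimal sketch of enumerating that three-parameter grid; the particular values are arbitrary examples, not the set of design points used in the study.

```python
# Enumerate candidate GPU design points from the three parameters above.
# The specific values are arbitrary examples chosen for illustration.
from itertools import product

compute_units = [4, 8, 16, 32, 44, 64]     # number of parallel CUs
core_freq_mhz = [400, 600, 800, 1000]      # GPU core frequency (MHz)
mem_bw_gb_s   = [68, 176, 288, 512]        # memory bandwidth (GB/s)

design_points = [
    {"cus": cu, "freq_mhz": f, "bw_gb_s": bw}
    for cu, f, bw in product(compute_units, core_freq_mhz, mem_bw_gb_s)
]
print(len(design_points), "design points to estimate")   # 6 * 4 * 4 = 96
```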

GPU Kernel Performance Scaling with HW Designs (1)

GPU Kernel Performance Scaling with HW Designs (2)

GPU Kernel Performance Scaling with HW Designs (3)

GPU Kernel Performance Scaling with HW Designs (4)

GPU Kernel Performance Scaling with HW Designs (5)

GPU Kernel Performance Scaling with HW Designs (6)

Automatically Clustering Scaling Curves
[Diagram: a training set of kernels (Kernel 1 through Kernel 6) is grouped by the similarity of their scaling curves into Cluster 1, Cluster 2, and Cluster 3]

Automatically Clustering Scaling Curves
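The grouping can be done with an off-the-shelf clustering algorithm. The sketch below uses k-means from scikit-learn on made-up, base-normalized scaling curves; the number of clusters and the normalization are assumptions for illustration, not necessarily the exact procedure behind these slides.

```python
# Cluster training kernels by the shape of their performance-scaling curves.
import numpy as np
from sklearn.cluster import KMeans

# scaling_curves[i, j]: relative performance of kernel i at design point j,
# normalized so the base design point (first column) equals 1.0. These rows
# are invented values standing in for real sweep measurements.
scaling_curves = np.array([
    [1.0, 1.9, 3.6, 6.8],   # scales almost linearly with added compute
    [1.0, 1.8, 3.4, 6.1],
    [1.0, 1.4, 1.7, 1.8],   # flattens out, bandwidth-limited behavior
    [1.0, 1.3, 1.6, 1.7],
    [1.0, 1.0, 1.1, 1.1],   # barely scales at all
    [1.0, 1.1, 1.1, 1.2],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaling_curves)
for kernel_id, cluster_id in enumerate(kmeans.labels_, start=1):
    print(f"Kernel {kernel_id} -> cluster {cluster_id}")
```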

Finding the Scaling Cluster of an Unknown Kernel
Performance counters indicate how the kernel uses the hardware. Counters may thus help indicate which cluster a kernel belongs to from only one measurement.
[Diagram: a classifier maps a kernel's performance counters (Perf. Counter 0 … Perf. Counter N) to one of the scaling clusters (Cluster 0 … Cluster M)]
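One way to build that counters-to-cluster mapping is a supervised classifier trained on kernels whose clusters are already known. The sketch below uses a scikit-learn MLP on synthetic data; the classifier choice, counter count, and data are illustrative assumptions, not the exact model from this work.

```python
# Train a classifier that maps a kernel's performance-counter vector, captured
# from a single run at the base configuration, to a scaling cluster.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_training_kernels, n_counters = 200, 16

# Counter vectors for kernels whose scaling clusters are already known
# (labels would come from clustering their measured scaling curves).
X_train = rng.random((n_training_kernels, n_counters))
y_train = rng.integers(0, 3, size=n_training_kernels)    # cluster 0..2

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

# An unknown kernel: one measurement of N counters -> predicted scaling cluster.
unknown_kernel_counters = rng.random((1, n_counters))
print("Predicted cluster:", clf.predict(unknown_kernel_counters)[0])
```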

Estimate Kernel Changes Using the Cluster’s Curve: Half the CUs but the same bandwidth? ~60% performance.

Estimate Kernel Changes Using the Cluster’s Curve: Half the CUs but the same bandwidth? ~100% performance.

Estimate Kernel Changes Using the Cluster’s Curve: Half the CUs but the same bandwidth? ~115% performance.
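Once a kernel has been assigned to a cluster, the cluster's normalized scaling curve turns one measured point into an estimate at another design point, which is why the same "half the CUs" question gets three different answers above. A minimal sketch, with made-up curve values:

```python
# Use a cluster's normalized scaling curve to estimate a kernel's runtime at a
# different design point from one real measurement. The curves, cluster names,
# and design points below are invented values for illustration only.

# Relative performance, normalized to the measured base configuration.
# Keys are (compute_units, relative_memory_bandwidth).
cluster_curves = {
    "compute_bound":   {(32, 1.0): 1.00, (16, 1.0): 0.60},   # halving CUs hurts a lot
    "bandwidth_bound": {(32, 1.0): 1.00, (16, 1.0): 1.00},   # halving CUs changes little
    "cache_thrashing": {(32, 1.0): 1.00, (16, 1.0): 1.15},   # fewer CUs can even help
}

def estimate_runtime(measured_runtime_s, cluster, base_cfg, target_cfg):
    """Scale one measured runtime by the ratio of the cluster's curve values."""
    curve = cluster_curves[cluster]
    relative_perf = curve[target_cfg] / curve[base_cfg]
    return measured_runtime_s / relative_perf     # higher performance -> shorter runtime

# "Half the CUs but the same bandwidth?" for a kernel in the compute-bound cluster:
print(f"{estimate_runtime(2.0, 'compute_bound', (32, 1.0), (16, 1.0)):.2f} s")   # ~3.33 s
```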

Accuracy from One HW Point to All Others

Preparing Good Data Is Challenging
This model needs a lot of data: multiple applications, many kernels, and numerous design points per application.
Which kernels are representative? How do we get clean performance and power data?

Choosing Representative Kernels
[Charts, shown across three slides: kernel runtime plotted per kernel invocation]
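One straightforward way to pick representative kernels from such a runtime-per-invocation profile is to keep the kernels that dominate total GPU time. The cumulative-coverage heuristic below is an assumed selection strategy for illustration, not necessarily the one used to build this talk's data set.

```python
# Pick "representative" kernels as those covering most of the total GPU time.
from collections import defaultdict

# (kernel name, runtime in seconds) for each kernel invocation in a profile;
# these entries are invented examples.
invocations = [
    ("spmv", 0.90), ("spmv", 0.88), ("spmv", 0.91),
    ("reduce", 0.05), ("reduce", 0.06),
    ("init", 0.01),
]

totals = defaultdict(float)
for name, runtime in invocations:
    totals[name] += runtime

grand_total = sum(totals.values())
selected, covered = [], 0.0
for name, t in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    selected.append(name)
    covered += t
    if covered / grand_total >= 0.95:    # stop once 95% of runtime is covered
        break

print("Representative kernels:", selected)
```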

Power Measurements Must Account for Static Power
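A common way to separate dynamic from static power is to subtract a measured idle baseline from the total power samples. The sketch below assumes such an idle measurement is available at the same operating point; the numbers are illustrative, and this is not a description of the exact methodology behind the slide.

```python
# Estimate dynamic power by subtracting a measured idle (static) baseline from
# total power samples taken while the kernel runs. All values are illustrative.

idle_power_w = 18.0                            # measured with the GPU idle at the same DVFS state
samples_total_w = [52.3, 55.1, 54.7, 53.9]     # total power samples during the kernel

avg_total_w = sum(samples_total_w) / len(samples_total_w)
dynamic_power_w = avg_total_w - idle_power_w

print(f"average total: {avg_total_w:.1f} W, estimated dynamic: {dynamic_power_w:.1f} W")
```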

Summary
- Modern systems have a large heterogeneous hardware design space.
- We need tools to do early design space exploration to help guide designs.
- ML techniques help perform hardware-driven scaling studies: reasonable accuracy and very fast!
- But building clean, meaningful data sets presents a challenge.

Papers About This Topic
J. L. Greathouse, A. Lyashevsky, M. Meswani, N. Jayasena, M. Ignatowski, “Simulation of Exascale Nodes through Runtime Hardware Monitoring,” ModSim 2013.
B. Su, J. L. Greathouse, J. Gu, M. Boyer, L. Shen, Z. Wang, “Implementing a Leading Loads Performance Predictor on Commodity Processors,” USENIX ATC 2014.
D. P. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, M. Ignatowski, “TOP-PIM: Throughput-Oriented Programmable Processing in Memory,” HPDC 2014.
G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, D. Chiou, “GPGPU Performance and Power Estimation Using Machine Learning,” HPCA 2015.
A. Majumdar, G. Wu, K. Dev, J. L. Greathouse, I. Paul, W. Huang, A. K. Venugopal, L. Piga, C. Freitag, S. Puthoor, “A Taxonomy of GPGPU Performance Scaling,” IISWC 2015.
T. Vijayaraghavan, Y. Eckert, G. H. Loh, M. J. Schulte, M. Ignatowski, B. M. Beckmann, W. C. Brantley, J. L. Greathouse, W. Huang, A. Karunanithi, O. Kayiran, M. Meswani, I. Paul, M. Poremba, S. Raasch, S. K. Reinhardt, G. Sadowski, V. Sridharan, “Design and Analysis of an APU for Exascale Computing,” HPCA 2017.

Questions?

Disclaimer
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Xbox One is a trademark of the Microsoft Corporation. PlayStation is a trademark or registered trademark of Sony Computer Entertainment, Inc. Other names used herein are for identification purposes only and may be trademarks of their respective companies.

Backup Slides

Use Hardware Measurements to Guide Exploration
Step 1: Measure the application on real hardware.
Step 2: Estimate how the application will work on future hardware.
[Diagram: in Phase 1, the user application and software tasks run on a test system (APU, CPU, GPU) with a user-level software stack and test operating system, producing traces of HW and SW events related to performance and power. In Phase 2, those traces and a future machine description go through a trace post-processor and performance, power, and thermal models to produce performance and power estimates.]

An Application’s Use of All Hardware Matters
[Timeline diagram: segments of work on CPU0, CPU1, and the GPU over time]
Example: when everything except CPU1 gets 2x faster, the overall gain is less than 2x, because CPU1 is now on the new critical path.
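A toy version of that example, assuming for simplicity that the three devices run fully in parallel and using made-up segment lengths, shows why the overall gain falls short of 2x.

```python
# Toy model of the slide's example: CPU0 and the GPU get 2x faster, CPU1 does
# not, and CPU1 ends up limiting the overall speedup. Segment lengths (seconds)
# are invented, and the devices are assumed to run fully in parallel.

segments = {"CPU0": 4.0, "CPU1": 3.0, "GPU": 5.0}

baseline = max(segments.values())                    # 5.0 s, GPU-limited

sped_up = {dev: t / 2.0 if dev != "CPU1" else t for dev, t in segments.items()}
new_time = max(sped_up.values())                     # 3.0 s, now CPU1-limited

print(f"overall speedup: {baseline / new_time:.2f}x (less than 2x)")   # 1.67x
```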

Reconstructing Application Critical Paths
In Phase 1, gather the SW-level relationships between segments.
Use these relationships to build a legal execution order on the simulated system.
Gather ordering from library calls like pthread_create(), clWaitForEvents(), etc.
Segments could also be split on user API calls or program phases.
[Diagram: numbered segments ①–⑦ spread across CPU0 (①, ②, ⑦), CPU1 (③, ④, ⑥), and the GPU (⑤), connected by their ordering dependencies]
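A minimal sketch of the replay idea: each recorded segment carries the dependencies observed through calls such as pthread_create() or clWaitForEvents(), and the segments are rescheduled with durations scaled for the hypothetical machine. The segment layout loosely mirrors the diagram above, but the dependencies, durations, and scaling factors are invented, and the sketch honors only explicit dependencies (it ignores per-device serialization).

```python
# Replay recorded segments in a legal order on a hypothetical faster machine.
# segment id -> (device, recorded duration in seconds, ids it must wait for).
segments = {
    1: ("CPU0", 1.0, []),
    2: ("CPU0", 2.0, [1]),
    3: ("CPU1", 1.5, [1]),     # e.g., thread started by segment 1 via pthread_create()
    4: ("CPU1", 0.5, [3]),
    5: ("GPU",  4.0, [2]),     # GPU kernel launched after segment 2
    6: ("CPU1", 1.0, [4, 5]),  # e.g., clWaitForEvents() on the GPU kernel
    7: ("CPU0", 0.5, [6]),
}

# What-if scaling: CPU0 and the GPU get 2x faster, CPU1 stays the same.
speedup = {"CPU0": 2.0, "CPU1": 1.0, "GPU": 2.0}

finish = {}
for seg_id in sorted(segments):                      # ids are already in dependency order
    device, duration, deps = segments[seg_id]
    start = max((finish[d] for d in deps), default=0.0)
    finish[seg_id] = start + duration / speedup[device]

print(f"estimated end-to-end time: {max(finish.values()):.2f} s")   # 4.75 s
```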