SIMULATION OF EXASCALE NODES THROUGH RUNTIME HARDWARE MONITORING JOSEPH L. GREATHOUSE, ALEXANDER LYASHEVSKY, MITESH MESWANI, NUWAN JAYASENA, MICHAEL IGNATOWSKI.

Presentation transcript:

SIMULATION OF EXASCALE NODES THROUGH RUNTIME HARDWARE MONITORING
Joseph L. Greathouse, Alexander Lyashevsky, Mitesh Meswani, Nuwan Jayasena, Michael Ignatowski
9/18/2013

SIMULATION OF EXASCALE NODES THROUGH RUNTIME HARDWARE MONITORING | FEBRUARY 26, 2016 | PUBLIC

CHALLENGES OF EXASCALE NODE RESEARCH: MANY DESIGN DECISIONS
- Heterogeneous cores: composition? Size? Speed?
- Stacked memories: useful? Compute/BW ratio? Latency? Capacity? Non-volatile?
- Thermal constraints: power sharing? Heat dissipation? Sprinting?
- Software co-design: new algorithms? Data placement? Programming models?
Exascale: a huge design space to explore

CHALLENGES OF EXASCALE NODE RESEARCH: INTERESTING QUESTIONS REQUIRE LONG RUNTIMES
- Power and thermals on a real heterogeneous processor: ~2.5 trillion CPU instructions, ~60 trillion GPU operations
- Exascale proxy applications are large: long initialization phases and many long iterations; these are not microbenchmarks, and their inputs and computation are already reduced from real HPC applications
Exascale: huge execution times

EXISTING SIMULATORS
- Microarchitecture simulators (e.g., gem5, Multi2Sim, MARSSx86, SESC, GPGPU-Sim): excellent for low-level details, and we need these, but too slow for design space exploration: ~60 trillion operations = ~1 year of simulation time
- Functional simulators (e.g., SimNow, Simics, QEMU): faster than microarchitectural simulators and good for things like access patterns, but have no relation to hardware performance
- High-level simulators (e.g., Sniper, Graphite, CPR): break operations down into timing models (core interactions, pipeline stalls, etc.); faster and easier to parallelize, but runtimes and complexity are still constrained by the desire for accuracy

TRADE OFF INDIVIDUAL TEST ACCURACY
Doctors do not start with:

Fast Simulation Using Hardware Monitoring

HIGH-LEVEL SIMULATION METHODOLOGY: MULTI-STAGE PERFORMANCE ESTIMATION PROCESS
Phase 1: The user application runs on the simulator software stack atop the test system (APU, CPU, GPU) and its OS, producing traces of HW and SW events related to performance and power.
Phase 2: A trace post-processor combines those traces with a machine description and performance, power, and thermal models to produce performance and power estimates.

BENEFITS OF THE MULTI-STEP PROCESS: MANY DESIGN SPACE CHANGES ONLY NEED THE SECOND STEP
Phase 1 is run once to collect the HW & SW traces; Phase 2 can then be re-run against multiple machine descriptions, producing performance and power estimates for each design point.
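The two-phase flow above can be sketched in code. This is an illustrative toy, not the actual infrastructure: the event kinds, durations, and the per-event scaling rule are assumptions made up for the example.

```python
# Hypothetical sketch of the two-phase flow: Phase 1 collects a trace of
# events once (expensive); Phase 2 re-evaluates that same trace against many
# machine descriptions (cheap). All names and values are illustrative.
from dataclasses import dataclass

@dataclass
class TraceEvent:
    kind: str          # "compute" or "memory"
    duration_s: float  # measured duration on the test system

@dataclass
class MachineDescription:
    name: str
    compute_speedup: float  # relative to the test system
    memory_speedup: float

def phase1_collect_trace():
    """Stand-in for running the instrumented application once."""
    return [TraceEvent("compute", 2.0), TraceEvent("memory", 1.0),
            TraceEvent("compute", 0.5)]

def phase2_estimate(trace, machine):
    """Replay the trace, scaling each event by the machine's parameters."""
    total = 0.0
    for ev in trace:
        scale = (machine.compute_speedup if ev.kind == "compute"
                 else machine.memory_speedup)
        total += ev.duration_s / scale
    return total

trace = phase1_collect_trace()  # run once
for m in [MachineDescription("baseline", 1.0, 1.0),
          MachineDescription("2x-compute", 2.0, 1.0),
          MachineDescription("stacked-mem", 1.0, 4.0)]:
    print(m.name, phase2_estimate(trace, m))  # run per design point
```

The key property the slide highlights is visible here: adding a new design point only adds one cheap `phase2_estimate` call, never another application run.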

PERFORMANCE SCALING A SINGLE-THREADED APP
On the current processor, between the start and end times of the run, gather statistics and performance counters: instructions committed, stall cycles, memory operations, cache misses, etc., plus power usage. An analytical performance scaling model then produces the new runtime on the simulated processor.

ANALYTIC PERFORMANCE SCALING
- CPU performance scaling: find stall events using HW performance counters and scale them based on the new machine's parameters. There is a large body of literature on DVFS performance scaling and interval models. Some events can be scaled by fiat: "If IPC when not stalled on memory doubled, what would happen to performance?"
- GPU performance scaling: watch HW performance counters that indicate work to do, memory usage, and GPU efficiency, then scale those values based on estimates of the parameters under test.

OTHER ANALYTIC MODELS
- Cache/memory access model: observe memory access traces using binary instrumentation or hardware, send the traces through a high-level cache simulation system, find the knees in the curve, and feed this back into the CPU/GPU performance scaling model.
- Power and thermal models: power can be correlated to hardware events or directly measured, then scaled to future technology points. Any number of thermal models will work at this point.
- Thermal and power models can feed into control algorithms that change system performance. This is another HW/SW co-design point, where fast operation is essential.
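The cache-model step above ("send traces through a high-level cache simulation, find knees in the curve") can be illustrated with a deliberately simple stand-in: a fully associative LRU model swept across capacities. The real system would use a more detailed simulator; this sketch only shows the sweep-and-find-the-knee idea, and the trace is synthetic.

```python
from collections import OrderedDict

def miss_rate(trace, cache_lines):
    """Fully associative LRU cache model over a trace of cache-line IDs."""
    cache = OrderedDict()
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)      # refresh LRU position
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
    return misses / len(trace)

# Synthetic trace: cyclic working set of 64 lines. The miss-rate curve has a
# sharp knee at 64 lines, which would feed the performance scaling model.
trace = [a % 64 for a in range(10000)]
for size in (16, 32, 64, 128):
    print(size, miss_rate(trace, size))
```

Note the classic LRU pathology the sweep exposes: below the working-set size a cyclic trace misses every access, then the miss rate collapses once the working set fits.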

SCALING PARALLEL SEGMENTS REQUIRES MORE INFO
[Timeline of concurrent CPU0, CPU1, and GPU segments, showing the performance difference in an example where everything except CPU1 gets faster]

MUST RECONSTRUCT CRITICAL PATHS
- Gather the program-level relationships between individually scaled segments
- Use these happens-before relationships to build a legal execution order
- Gather ordering from library calls like pthread_create() and clWaitForEvents()
- Segments can also be split based on program phases
[Numbered timeline of ordered CPU0, CPU1, and GPU segments]
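The happens-before reconstruction above amounts to a longest-path computation over a DAG of scaled segments. A minimal sketch, with made-up segment names, durations, and dependency edges; the real tool would derive the edges from calls like pthread_create() and clWaitForEvents():

```python
def critical_path_length(durations, edges):
    """durations: {segment: scaled time}; edges: (pred, succ) happens-before pairs.
    Returns the total runtime implied by a legal execution order (longest path)."""
    succs = {s: [] for s in durations}
    indeg = {s: 0 for s in durations}
    for a, b in edges:
        succs[a].append(b)
        indeg[b] += 1
    finish = {}
    ready = [s for s in durations if indeg[s] == 0]
    while ready:  # topological order: finish[s] = duration + latest predecessor finish
        s = ready.pop()
        start = max((finish[p] for p, q in edges if q == s), default=0.0)
        finish[s] = start + durations[s]
        for t in succs[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                ready.append(t)
    return max(finish.values())

# Hypothetical scaled segments: CPU0 launches a GPU kernel and a CPU1 task.
segs = {"cpu0_a": 2.0, "gpu_k": 3.0, "cpu1_b": 1.5}
deps = [("cpu0_a", "gpu_k"), ("cpu0_a", "cpu1_b")]
print(critical_path_length(segs, deps))
```

Re-running this with per-segment durations scaled differently (e.g., everything but CPU1 faster) shows why per-segment scaling alone is insufficient: the critical path can shift to a different device.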

CONCLUSION
- The exascale node design space is huge
- Trade off some accuracy for faster simulation
- Use analytic models based on information from existing hardware

Research Related Questions

MODSIM RELATED QUESTIONS
- Major contributions: a fast, high-level performance, power, and thermal analysis infrastructure that enables large design space exploration and HW/SW co-design with good feedback.
- Limitations: trace-based simulation has known limitations with respect to multiple paths of execution, wrong-path operations, etc. It can also be difficult and slow to model something if your hardware cannot measure values correlated with it.
- Bigger picture: a node-level performance model for datacenter/cluster performance modeling; a first-pass model for APU power-sharing algorithms; exascale application co-design; complementary to broad projects like SST.

MODSIM RELATED QUESTIONS
- What is the one thing that would make it easier to leverage the results of other projects to further your own research? Theoretical bounds and analytic reasoning behind performance numbers. Even "good enough" guesses may help, versus only giving the output of a simulator.
- What are the important things to address in future work? Better analytic scaling models: there are many in the literature, but they often rely on simulating proposed new hardware that would gather the right statistics. It would also be great if open-source performance monitoring software were better funded and staffed.

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

Backup

AN EXASCALE NODE EXAMPLE QUESTION: IS 3D-STACKED MEMORY BENEFICIAL FOR APPLICATION X?
- Inputs to the simulation run(s): baseline performance, bandwidth difference, latency difference, a thermal model, and core changes due to heat
- Output: performance results based on the design point(s)
- Redesign loop: modify the software to better utilize the new hardware, then repeat
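A first-pass answer to the bandwidth half of this question can come from an Amdahl-style estimate: only the bandwidth-bound fraction of the baseline runtime speeds up with the stacked memory's bandwidth ratio. This is a hedged sketch, not the deck's actual model, and the function name and example numbers are made up:

```python
def stacked_memory_speedup(mem_bound_frac, bw_ratio):
    """Amdahl-style first-pass estimate for a bandwidth change.

    mem_bound_frac: fraction of baseline runtime that is bandwidth-bound
                    (e.g., derived from measured counters/traces).
    bw_ratio: stacked-memory bandwidth relative to baseline.
    Ignores latency and thermal effects, which the fuller flow would add.
    """
    new_time = (1.0 - mem_bound_frac) + mem_bound_frac / bw_ratio
    return 1.0 / new_time

# Example: an app spending 60% of its time bandwidth-bound, with 4x stacked BW.
print(round(stacked_memory_speedup(0.6, 4.0), 3))
```

The redesign loop on the slide then corresponds to raising mem_bound_frac's usefulness: restructuring the software so more of its time can actually exploit the new bandwidth.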