IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS BO SU † JOSEPH L. GREATHOUSE ‡ JUNLI GU ‡ MICHAEL BOYER ‡ LI SHEN † ZHIYING.

Slides:

Advertisements

Similar presentations

ATI Stream Computing OpenCL™ Histogram Optimization Illustration Marc Romankewicz April 5, 2010.

Advertisements

ATI Stream Computing ACML-GPU – SGEMM Optimization Illustration Micah Villmow May 30, 2008.

Lecture 12 Reduce Miss Penalty and Hit Time

ATI Stream ™ Physics Neal Robison Director of ISV Relations, AMD Graphics Products Group Game Developers Conference March 26, 2009.

OpenCL™ - Parallel computing for CPUs and GPUs Benedict R. Gaster AMD Products Group Lee Howes Office of the CTO.

4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.

Cooperative Boosting: Needy versus Greedy Power Management INDRANI PAUL 1,2, SRILATHA MANNE 1, MANISH ARORA 1,3, W. LLOYD BIRCHER 1, SUDHAKAR YALAMANCHILI.

1 Adaptive History-Based Memory Schedulers Ibrahim Hur and Calvin Lin IBM Austin The University of Texas at Austin.

Accurately Approximating Superscalar Processor Performance from Traces Kiyeon Lee, Shayne Evans, and Sangyeun Cho Dept. of Computer Science University.

Computer Organization and Architecture

Green Governors: A Framework for Continuously Adaptive DVFS Vasileios Spiliopoulos, Stefanos Kaxiras Uppsala University, Sweden.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

CHAPTER 8: CPU and Memory Design, Enhancement, and Implementation

Coordinated Energy Management in Heterogeneous Processors INDRANI PAUL 1,2, VIGNESH RAVI 1, SRILATHA MANNE 1, MANISH ARORA 1,3, SUDHAKAR YALAMANCHILI 2.

Panel Discussion: The Future of I/O From a CPU Architecture Perspective #OFADevWorkshop Brad Benton AMD, Inc.

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS JASON POWER*, ARKAPRAVA BASU*, JUNLI GU †, SOORAJ PUTHOOR †, BRADFORD M BECKMANN †, MARK.

Filtering Approaches for Real-Time Anti-Aliasing /

AMD platform security processor

OpenCL Introduction A TECHNICAL REVIEW LU OCT

CHAPTER 8: CPU and Memory Design, Enhancement, and Implementation

The contents of these materials are provided "as is" without any warranties of any kind and the Independent Electricity Market Operator (IMO) makes no.

OpenCL Introduction AN EXAMPLE FOR OPENCL LU OCT

1| AMD FirePro™ / Creo 2.0 Launch Event | April 2012 | Confidential – NDA Required AMD FIREPRO ™ / CREO 2.0 Sales Deck April 2012.

Low Latency Clock Domain Transfer for Simultaneously Mesochronous, Plesiochronous and Heterochronous Interfaces Wade Williams Philip Madrid, Scott C. Johnson.

Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K.

The Functions of Operating Systems Interrupts. Learning Objectives Explain how interrupts are used to obtain processor time. Explain how processing of.

ATI Stream Computing ATI Radeon™ HD 2900 Series GPU Hardware Overview Micah Villmow May 30, 2008.

PPEP: online Performance, power, and energy prediction framework

Joseph L. GreathousE, Mayank Daga AMD Research 11/20/2014

C O N F I D E N T I A LC O N F I D E N T I A L ATI FireGL ™ Workstation Graphics from AMD April 2008 AMD Graphics Product Group.

STRUCTURAL AGNOSTIC SPMV: ADAPTING CSR-ADAPTIVE FOR IRREGULAR MATRICES MAYANK DAGA AND JOSEPH L. GREATHOUSE AMD RESEARCH ADVANCED MICRO DEVICES, INC.

FAULTSIM: A FAST, CONFIGURABLE MEMORY-RESILIENCE SIMULATOR DAVID A. ROBERTS, AMD RESEARCH PRASHANT J. NAIR, GEORGIA INSTITUTE OF TECHNOLOGY

Instruction-Based Sampling and AMD CodeAnalyst ISPASS 2010 poster session Paul J. Drongowski | March 29, 2010.

Sunpyo Hong, Hyesoon Kim

SIMULATION OF EXASCALE NODES THROUGH RUNTIME HARDWARE MONITORING JOSEPH L. GREATHOUSE, ALEXANDER LYASHEVSKY, MITESH MESWANI, NUWAN JAYASENA, MICHAEL IGNATOWSKI.

E-MOS: Efficient Energy Management Policies in Operating Systems

SYNCHRONIZATION USING REMOTE-SCOPE PROMOTION MARC S. ORR †§, SHUAI CHE §, AYSE YILMAZER §, BRADFORD M. BECKMANN §, MARK D. HILL †§, DAVID A. WOOD †§ †

Memory Hierarchy— Five Ways to Reduce Miss Penalty.

The CRISP Performance Model for Dynamic Voltage and Frequency Scaling in a GPGPU Rajib Nath, Dean Tullsen 1 Micro 2015.

Wi-Fi BT/BLE Combo Module WINC3400 hands-on

PPEP: ONLINE PERFORMANCE, POWER, AND ENERGY PREDICTION FRAMEWORK BO SU † JUNLI GU ‡ LI SHEN † WEI HUANG ‡ JOSEPH L. GREATHOUSE ‡ ZHIYING WANG † † NUDT.

Pentium 4 Deeply pipelined processor supporting multiple issue with speculation and multi-threading 2004 version: 31 clock cycles from fetch to retire,

PPEP: online Performance, power, and energy prediction framework

µC-States: Fine-grained GPU Datapath Power Management

Joseph L. GreathousE, Mayank Daga AMD Research 11/20/2014

Address – 32 bits WRITE Write Cache Write Main Byte Offset Tag Index Valid Tag Data 16K entries 16.

PPEP: online Performance, power, and energy prediction framework

ATI Stream Computing ACML-GPU – SGEMM Optimization Illustration

Measuring and Modeling On-Chip Interconnect Power on Real Hardware

BLIS optimized for EPYCTM Processors

Parallelspace PowerPoint Template for ArchiMate® 2.1 version 1.1

Parallelspace PowerPoint Template for ArchiMate® 2.1 version 2.0

The Small batch (and Other) solutions in Mantle API

Heterogeneous System coherence for Integrated CPU-GPU Systems

Blake A. Hechtman†§, Shuai Che†, Derek R. Hower†, Yingying Tian†Ϯ,

SOC Runtime Gregory Stoner.

libflame optimizations with BLIS

TLC: A Tag-less Cache for reducing dynamic first level Cache Energy

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Interference from GPU System Service Requests

Processor Fundamentals

Simulation of exascale nodes through runtime hardware monitoring

Interference from GPU System Service Requests

Machine Learning for Performance and Power Modeling of Heterogeneous Systems Joseph L. Greathouse, Gabriel H. Loh Advanced Micro Devices, Inc.

RegMutex: Inter-Warp GPU Register Time-Sharing

Machine Learning for Performance and Power Modeling of Heterogeneous Systems Joseph L. Greathouse, Gabriel H. Loh Advanced Micro Devices, Inc.

Advanced Micro Devices, Inc.

Jason Stewart (AMD) | Rolando Caloca O. (Epic Games) | 21 March 2018

Presentation transcript:

IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS BO SU † JOSEPH L. GREATHOUSE ‡ JUNLI GU ‡ MICHAEL BOYER ‡ LI SHEN † ZHIYING WANG † † NUDT ‡ AMD RESEARCH JUNE 19, 2014

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, How fast is your application at different CPU frequencies?

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, WHAT HAPPENS WHEN YOU CHANGE FREQUENCY? Estimate #1 Nothing. Who cares about frequency? Estimate #2 Performance difference is equal to frequency change. Estimate #3 Something in between. Rare bzip2 milc

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, WHY DON’T NAÏVE ESTIMATES ALWAYS WORK? bzip2 CPU DRAM Time Working in the CPU core. Scales with frequency. milc CPU DRAM Stalled on Memory. Core Frequency doesn’t matter. Core and memory time both matter

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, HOW DO YOU ESTIMATE “MEMORY TIME”? MODERN CORES MAKE THIS DIFFICULT CPU Work DRAM  Count the amount of time with an outstanding load?  Count last-level cache misses? Multiple Parallel Accesses Accesses Overlap Computation Variable Latencies

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, “LEADING LOADS” MEMORY TIME ESTIMATION Described by 3 separate groups in CF 2010, IEEE TOC, and IGCC 2011 Simulation: ~0.2% estimation error across 2x change in frequency CPU Work DRAM Leading Load Not Leading Leading Load Memory time approximately time that a leading loads is active Memory Time

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, LEADING LOADS ON AMD PROCESSORS L2 cache misses held in Miss Address Buffer (MAB) ‒MAB entries have a static priority (e.g. MAB0 is highest priority) ‒Highest priority empty MAB holds the miss until it returns from memory Performance event 0x69 allows SW to count # of cycles with filled MABs CPU Core & L1 Cache L2 Cache L3 Cache & DRAM Core Clock Domain NB and Memory Clock Domains Miss Address Buffers MAB0 MAB1 MAB2 … MABn MAB0 MAB1 MAB2

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, PERFORMANCE ESTIMATION MODEL: LL-MAB Measure occupancy time of the highest-priority MAB ‒HW event 1: CPU Clocks not Halted (for Execution Time) ‒Performance Event 0x76 ‒HW event 2: MAB Wait Cycles (for Memory Time) ‒Performance Event 0x69 ‒Family 15h Processors: Unit Mask 0

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, EVALUATING PERFORMANCE PREDICTORS  Run benchmarks at frequency 1, estimate runtime at frequency 2  Run benchmark at frequency 2. ‒Difference between observed and estimated is estimation error.  Estimation mechanisms: ‒Linear: Performance scales exactly with frequency (like bzip2) ‒Green Governor: ‒Count L3 cache misses ‒Assign delay to each cache miss ‒# Cache Misses * delay = “memory time” ‒LL-MAB: Count MAB0 cycles at “memory time”

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, EXPERIMENTAL SETUP AMD Opteron™ 4386 Processor ‒2 nd Generation Family 15h “Piledriver” CPU ‒Minimum Frequency: 1.4 GHz, Maximum non-boost frequency: 3.1 GHz Fedora® 19 Desktop (kernel version ) ‒Locked benchmarks to single core with numactl ‒Used msr-tools to read performance counters around benchmark runs. ‒CPUFreq userspace governor to manually control DVFS state. Boosting disabled. 66 Single-threaded benchmarks from: ‒SPEC® CPU 2006 ‒NAS Parallel Benchmarks ‒PARSEC ‒Rodinia OTHER PROCESSORS TESTED IN THE PAPER

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, MEASURE AT 3.1 GHZ, ESTIMATE 1.4 GHZ RUNTIME (LOWER IS BETTER)

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, STANDARD DEVIATION IS IMPORTANT FOR PREDICTIONS (LOWER IS STILL BETTER)

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, WHY IS FREQUENCY IMPORTANT? Courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, HOW DO YOU ESTIMATE “MEMORY TIME”? MODERN CORES MAKE THIS DIFFICULT CPU Work DRAM  Count last-level cache misses?  Count the amount of time with an outstanding load?  Count time that the pipeline is stalled? Multiple Parallel Accesses Accesses Overlap Computation Can Return Out of Order Pipeline flushes in core clock domain

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, PREDICTION ACCURACY PER BENCHMARK Memory Boundedness = Ratio of execution cycles at two frequencies ‒1.0 = no change in cycles (completely compute bound, e.g. bzip2)

| IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, CONCLUSION  First leading loads implementation on real processors  Higher accuracy than existing predictors  Lower accuracy than simulation due to HW complexity  Lightweight estimation mechanism (only requires 2 counters) ‒Path to better performance and power prediction