© 2012 IBM Corporation Barcelona Supercomputing Center MICRO 2012 Tuesday, December 4, 2012 Systematic Energy Characterization of CMP/SMT Processor Systems.

Presentation transcript:

Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks
R. Bertran*+, A. Buyuktosunoglu*, M. Gupta*, M. Gonzalez+, P. Bose*
*IBM T.J. Watson Research Center
+Barcelona Supercomputing Center

Why do we need micro-benchmarks?
What is the maximum power consumption? Any performance bug? Any reliability issues? …
Micro-benchmarks!
• Time consuming and tedious – Error-prone task
• Trial and error process – Several micro-benchmarks are required
• Deep expertise limited to few designers – Detailed knowledge of the underlying architecture is required
AUTOMATED SOLUTION NEEDED!

MicroProbe: a micro-benchmark generation framework

MicroProbe Workflow
• Inputs (from the user): micro-benchmark generation policy, architecture definition files
• Example policies: endless loop with 50% INT and 50% FP; endless loop for each instruction of the ISA; max-power stressmark
• MicroProbe Framework
• Outputs: micro-benchmarks
• External tools: real platforms, simulators, models
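To make the "endless loop, 50% INT, 50% FP" policy concrete, here is a minimal Python sketch of how such a policy could be turned into a loop body. It does not use the MicroProbe API, which these slides do not show. The function names, the instruction pools and the generic assembly syntax are assumptions made for illustration, and operands are omitted.

```python
from itertools import cycle
from math import floor

# Hypothetical instruction pools. The mnemonics are placeholders for
# illustration, not taken from MicroProbe's architecture definition files.
INT_OPS = ["add", "subf", "mullw"]
FP_OPS = ["fadd", "fmul", "fmadd"]


def build_loop_body(size, int_fraction=0.5):
    """Return `size` mnemonics with integer ops spread evenly through the body."""
    int_stream, fp_stream = cycle(INT_OPS), cycle(FP_OPS)
    body, int_emitted = [], 0
    for slot in range(size):
        # Emit an integer op whenever the integer share falls behind the target.
        if int_emitted < floor((slot + 1) * int_fraction):
            body.append(next(int_stream))
            int_emitted += 1
        else:
            body.append(next(fp_stream))
    return body


def render_endless_loop(body, label="loop"):
    """Render the body as an assembly-like endless loop (operands omitted)."""
    lines = [f"{label}:"] + [f"    {op}" for op in body] + [f"    b {label}"]
    return "\n".join(lines)


body = build_loop_body(8)
print(render_endless_loop(body))
```

Interleaving the two instruction classes, rather than emitting all integer operations first, is a design choice of this sketch; a real policy would also assign registers and operands.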

MicroProbe: Distinguishing Features
Feature (support in previous works)
• ISA queries (manual): instruction type; operand length, binary codification, etc.
• Micro-architecture queries (manual): functional unit, latency, throughput, energy per instruction, average instruction power, etc.
• Micro-architecture models (no): set-associative cache model
• Code generation (no): skeleton and instruction definition passes, memory modeling pass, branch modeling pass, ILP definition pass; configurable passes
• Design space exploration:
– Integrated (no)
– GA-based search, exhaustive search (manual)
– Customizable search (manual)

MicroProbe Usage and Design Overview
Research idea
Micro-benchmark generation policies (user-defined scripts), for example:
• Loop stressing the floating point unit
• Sequence of loads hitting 50% L1 and 50% L2
• Generate a stressmark for each functional unit of the architecture
• Search for the sequence of 2 loads and 2 integer operations with maximum IPC
MicroProbe Framework (Python API):
• Architecture module: ISA definitions, micro-architecture definitions, micro-architecture analytical models
• Code generation module: micro-benchmark synthesizer, passes
• Design space exploration module: search drivers
Other diagram elements: properties, micro-benchmark, automatic bootstrap process, external tools
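The policy "sequence of loads hitting 50% L1 and 50% L2" relies on a set-associative cache model. The sketch below is one way to illustrate the idea in plain Python, not MicroProbe's implementation: it alternates a load that stays resident in L1 with loads that cycle through ways + 1 lines of a single L1 set, so that under LRU they always miss L1 but still fit in L2. The 128-byte line size, 8-way associativity, LRU replacement and all names are assumptions; only the 32KB and 256KB capacities come from the slides.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tiny LRU set-associative cache model (illustrative only)."""

    def __init__(self, size_bytes, line_bytes, ways):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, address):
        """Return True on a hit and update the LRU state either way."""
        line = address // self.line_bytes
        lines = self.sets[line % self.num_sets]
        hit = line in lines
        if hit:
            lines.move_to_end(line)
        else:
            if len(lines) >= self.ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[line] = True
        return hit


def half_l1_half_l2_addresses(l1, count):
    """Alternate one L1-resident address with a stream that thrashes one L1 set."""
    set_stride = l1.num_sets * l1.line_bytes  # same set index, different tag
    thrash = [(i + 1) * set_stride for i in range(l1.ways + 1)]  # all in set 0
    hot = l1.line_bytes  # maps to set 1, so the thrash stream never evicts it
    out = []
    for i in range(count // 2):
        out.append(hot)
        out.append(thrash[i % len(thrash)])
    return out


# Assumed geometry: 128-byte lines and 8 ways for both levels.
l1 = SetAssociativeCache(32 * 1024, 128, 8)
l2 = SetAssociativeCache(256 * 1024, 128, 8)

l1_hits = l2_hits = 0
for n, addr in enumerate(half_l1_half_l2_addresses(l1, 2000)):
    if l1.access(addr):
        level = "L1"
    elif l2.access(addr):
        level = "L2"
    else:
        level = "MEM"
    if n >= 100:  # ignore cold misses during warm-up
        l1_hits += level == "L1"
        l2_hits += level == "L2"

print(l1_hits, l2_hits)
```

The model is deliberately simple: it ignores prefetching, inclusion policy and address translation, all of which matter on real hardware.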

Max-power Stressmark Generation
Use MicroProbe to generate a max-power stressmark:
• Characterize energy per instruction (EPI) and IPC (Architecture Module)
• Select N instructions with max (IPC * EPI)
• Form a basic endless loop (e.g. 4K) using selected instructions (Code Generation Module)
• Generate micro-benchmarks with different orders of the selected N instructions
• Evaluate using Design Space Exploration Module
• Pick the highest power micro-benchmark
Example loops:
Loop: … mulldo lxvw4x xvnmsubmdp … mulldo xvnmsubmdp lxvw4x
Loop: … mulldo lxvw4x mulldo xvnmsubmdp lxvw4x xvnmsubmdp …
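The selection and search steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the MicroProbe code: the characterization numbers are invented placeholders, the function names are hypothetical, and `toy_power` merely stands in for running each candidate on the real platform and reading the power sensors. Ranking by IPC * EPI makes sense because energy per instruction times instructions per cycle is energy per cycle, which is proportional to power at a fixed frequency.

```python
from itertools import permutations

# Hypothetical characterization table: mnemonic -> (IPC, EPI in nanojoules).
# These numbers are made up for illustration; they are not measurements.
CHARACTERIZATION = {
    "mulldo": (1.9, 1.40),
    "lxvw4x": (1.8, 1.30),
    "xvnmsubmdp": (2.0, 1.50),
    "add": (3.5, 0.40),
    "nop": (4.0, 0.10),
}


def select_instructions(table, n):
    """Pick the n instructions with the highest IPC * EPI product."""
    ranked = sorted(table, key=lambda op: table[op][0] * table[op][1], reverse=True)
    return ranked[:n]


def candidate_orders(selected, copies=2):
    """Enumerate the distinct orderings of `copies` instances of each instruction."""
    return sorted(set(permutations(selected * copies)))


def toy_power(order):
    """Stand-in for a hardware measurement: count adjacent pairs that differ,
    treating the sequence as a loop (the last instruction wraps to the first)."""
    return sum(a != b for a, b in zip(order, order[1:] + order[:1]))


selected = select_instructions(CHARACTERIZATION, 3)
candidates = candidate_orders(selected)
best = max(candidates, key=toy_power)
print(selected, len(candidates), best)
```

With three instructions repeated twice there are 90 distinct orderings, which is small enough for exhaustive search; larger spaces are where a guided search such as the GA-based driver mentioned earlier would come in.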

CASE STUDIES

Experimental Methodology
• Platform:
– Processor: 3GHz 8-core 4-way SMT; 32KB L1, 256KB L2 and 4MB L3 per core
– Memory: 32 GB DDR3 800MHz
– OS: RHEL Linux
– EnergyScale architecture: power measurements in milliwatts, sampling rate up to 1ms
• In-house software collects power and performance counter traces [C. Lefurgy et al, IBM]

Case Study 1: EPI Characterization
• High differences in EPI across instructions stressing different micro-architecture components
• High differences in EPI across instructions stressing the same micro-architecture components and at the same rate (IPC)
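As a worked illustration of the EPI metric: energy per instruction is average power divided by instructions executed per second, that is EPI = P / (IPC × f). The slides do not say whether a baseline (idle) power is subtracted first, so the `baseline_watts` parameter below is an assumption, and the sample power and IPC values are placeholders rather than measurements.

```python
def energy_per_instruction(avg_power_watts, ipc, frequency_hz, baseline_watts=0.0):
    """Return energy per instruction in joules.

    baseline_watts is an assumed optional correction for power that is not
    attributable to the instruction stream (for example idle power).
    """
    instructions_per_second = ipc * frequency_hz
    return (avg_power_watts - baseline_watts) / instructions_per_second


# Placeholder inputs: 60 W average power, IPC of 2.0, 3 GHz clock.
epi = energy_per_instruction(avg_power_watts=60.0, ipc=2.0, frequency_hz=3e9)
print(f"{epi * 1e9:.2f} nJ per instruction")
```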

Case Study 2: Max-power Stressmark Generation
Approaches compared:
• Use a computationally intensive kernel
• Use complex instructions accessing different functional units with high IPC
• Generate all possible combinations of complex instructions stressing different units
• Use MicroProbe
DAXPY
Selected instructions: mullw, xvmaddadp, lxvd2x
Loop: … mullw xvmaddadp lxvd2x …
Loop: … mullw lxvd2x mullw xvmaddadp lxvd2x …
Loop: … mullw lxvd2x mullw xvmaddadp lxvd2x xvmaddadp …
MicroProbe Heuristic: Max(EPI * IPC)
Selected instructions: mulldo, xvnmsubmdp, lxvw4x
Diagram labels: MicroProbe Loops, Expert DSE, Expert manual, MicroProbe

Max-power Stressmark Generation

Case Study 3: Counter-based Processor Power Model
Bottom-up power modeling method
Training sets:
• Func. unit micro-benchmarks, CMP1–SMT1
• Random micro-benchmarks, CMP1–SMT1
• Random micro-benchmarks, CMP1–SMT2/4
• Random micro-benchmarks, CMP1/8–SMT2/4
Derived terms: Dynamic Power f(PMCs), Intercept SMT1, Intercept SMT2-4, SMT effect, Linear Regression f(CMP), CMP effect, Uncore power
Model: Dynamic Power f(PMCs), SMT effect (SMT enabled), CMP effect (# cores), Uncore power
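The slide names the terms of the bottom-up model but not its equations, so the following Python sketch shows only one plausible additive reading: power = uncore + cores × (CMP effect + SMT effect × SMT enabled) + weight × activity. That form, the single aggregated activity counter, the function names and the synthetic "ground truth" used to generate training data are all assumptions made for illustration. The three fitting steps mirror the progression from CMP1–SMT1 runs, to SMT-enabled runs, to multi-core runs.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic ground truth used only to generate example data (not measurements).
TRUE = dict(weight=4.0, uncore=20.0, cmp_effect=2.0, smt_effect=1.5)


def synthetic_power(activity, cores, smt_enabled):
    """Assumed additive power model that produces the training data."""
    per_core = TRUE["cmp_effect"] + TRUE["smt_effect"] * smt_enabled
    return TRUE["uncore"] + cores * per_core + TRUE["weight"] * activity


activities = [0.5, 1.0, 1.5, 2.0]  # aggregated counter rate, arbitrary units

# Step 1: one core, SMT off -> counter weight and the SMT1 intercept.
weight, intercept_smt1 = fit_line(
    activities, [synthetic_power(a, 1, 0) for a in activities]
)

# Step 2: one core, SMT on, weight held fixed -> SMT effect as intercept shift.
smt_runs = [synthetic_power(a, 1, 1) for a in activities]
intercept_smt = sum(p - weight * a for a, p in zip(activities, smt_runs)) / len(smt_runs)
smt_effect = intercept_smt - intercept_smt1

# Step 3: vary the core count -> regress what is left on the number of cores.
core_counts = [1, 2, 4, 8]
leftover = [
    synthetic_power(1.0, c, 1) - weight * 1.0 - c * smt_effect for c in core_counts
]
cmp_effect, uncore = fit_line(core_counts, leftover)


def predict(activity, cores, smt_enabled):
    """Evaluate the fitted model."""
    return uncore + cores * (cmp_effect + smt_effect * smt_enabled) + weight * activity


print(weight, smt_effect, cmp_effect, uncore)
```

Fitting the terms one at a time, each on a training set that isolates it, is the point of a bottom-up method: every coefficient keeps a physical interpretation instead of being absorbed into a single global regression.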

Counter-based Processor Power Model Validation
• Within acceptable error margins: < 4% on average

Counter-based Processor Power Model Validation on Corner Cases
• Models trained using non-micro-architecture aware training sets show high errors and variability
• Models trained using the micro-architecture aware training set show acceptable error margins: < 5% on average

Conclusions
• MicroProbe is a productive micro-benchmark generation framework
– Adaptive and flexible
– Includes micro-architecture semantics
– Integrates design space exploration
• Presented three case studies:
– Instruction-based EPI characterization
– Automated max-power stressmark generation
– CMP/SMT-aware bottom-up counter-based processor power model

QUESTIONS?