A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors Rakesh Kumar Keith Farkas (HP Labs) Norman Jouppi (HP Labs) Partha.

Presentation transcript:

A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors Rakesh Kumar Keith Farkas (HP Labs) Norman Jouppi (HP Labs) Partha Ranganathan (HP Labs) Dean Tullsen (UCSD)

Motivation
- Power is an important issue for processors
- Going up every successive generation (with complexity)
  - Up to 150W for Alpha 21464!

Past Techniques for Power Reduction
- Voltage/frequency scaling
  Limitation: Limited by technology. Also, not possible below a certain feature size.
- Architectural adaptation
  - shut off portions of core when not needed
  - dynamic speculation control
  - reconfigurable caches
  Limitations:
  - Very few choices to make
  - Only dynamic power being saved
  - Has associated overhead

Single-ISA Heterogeneous Multi-Core Architectures

Our Proposal
- Have multiple heterogeneous cores on the same die
- Match workload (or workload phase) to the core that achieves the best efficiency according to some objective function
  (ensure that the new core has acceptable performance)
- Power down the unused cores
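The matching step in the proposal can be sketched as follows. This is a minimal illustration, not the authors' implementation: the core names, the IPS and power numbers, and the choice to measure "acceptable performance" against the fastest available core are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CoreSample:
    ips: float    # instructions per second observed on this core
    watts: float  # average power drawn by this core


def pick_core(samples: Dict[str, CoreSample],
              objective: Callable[[CoreSample], float],
              max_perf_loss: float) -> str:
    """Return the core that maximizes `objective` among the cores whose
    IPS stays within `max_perf_loss` of the fastest core (assumption:
    the slides do not say what the performance baseline is)."""
    fastest = max(s.ips for s in samples.values())
    floor = fastest * (1.0 - max_perf_loss)
    eligible = {name: s for name, s in samples.items() if s.ips >= floor}
    return max(eligible, key=lambda name: objective(eligible[name]))


def ips_per_watt(sample: CoreSample) -> float:
    return sample.ips / sample.watts


# Hypothetical measurements for one workload phase (made-up numbers).
samples = {
    "small": CoreSample(ips=1.2e9, watts=5.0),
    "medium": CoreSample(ips=1.6e9, watts=15.0),
    "large": CoreSample(ips=2.0e9, watts=60.0),
}

loose = pick_core(samples, ips_per_watt, max_perf_loss=0.5)
tight = pick_core(samples, ips_per_watt, max_perf_loss=0.25)
strict = pick_core(samples, ips_per_watt, max_perf_loss=0.0)
print(loose, tight, strict)
```

Tightening the performance constraint pushes the choice toward the larger cores; the unused cores would then be powered down.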

Motivation
- Hypotheses
  - Performance difference between cores varies based on workload or workload phases
  - Different cores have varying relative energy efficiencies for the same workload
- Implication: possibility of dynamically changing “best” core

Goals of the Paper
- Validate the hypotheses
- Get an idea of the design space
- Get an idea of the potential benefits

Outline of Talk
- Motivation
- Past Work
- Our Work
  - Assumptions
  - Decisions
  - Methodology
- Results and Conclusions
- Summary and Future Work

Choice of Cores on the Die
Five cores on the die:
- In-order: QED R4700, EV4 (Alpha 21064), EV5 (Alpha 21164)
- Out-of-order: EV6 (Alpha 21264), “EV8-”
All cores assumed to be without L2 cache.
“EV8-”:
- Issue width is same as EV8 (Alpha 21464)
- Resources reduced to account for a single thread
- Core power dissipation: 100W

Properties of the Cores
Processor: R4700 | EV4 | EV5 | EV6 | EV8-
Issue-width: 1 | 2 | 4 | 6 (OOO) | 8 (OOO)
I-Cache: 2-way 16KB | DM, 8KB | 2-way 64KB | 4-way 64KB
D-Cache: 2-way 16KB | DM, 8KB | 2-way 64KB | 4-way 64KB
Branch Pred.: No | 2KB/1-bit | 2K-gshare | Hybrid 2-level
MSHR:
Notice the gradation!

Properties of Cores (contd.)
- Assume all cores implemented in 0.1um
  - Scaled area and power accordingly
- Clock speed?
  - All Alpha cores assumed to run at 2.1GHz (EV6 frequency at 0.10 micron)
  - R4700 assumed to run at 1GHz

Core Power and Area
- Peak power of core estimated from data sheets
  - minus that used by L2 caches and pins
  - then scaled for 0.1um process
- Area of core estimated from die photos
  - minus that of I/O pads, wires, L2 cache & control
  - then scaled for 0.1um process
- L2 cache area and power
  - estimated using CACTI

Core Power and Area (contd.)
Processor | Core-power (in W) | Core-area (in mm^2)
R4700 | |
EV4 | |
EV5 | |
EV6 | |
EV8- | |
EV8- consumes 200 times more power than R4700! It is more than 85 times bigger too!

Core Power and Area (contd.)

Methodology
- Simulator used: SMTSIM
- ROB size, active-list size and load-store queue always kept big enough to ensure no conflicts
- Benchmarks used: 14 chosen randomly out of the SPEC2000 suite
- Fast-forwarded for 2 billion instructions, simulated for 1 billion instructions
- Data collected after every 1 million instructions

Validating Hypotheses
- Performance difference between cores varies based on workload or workload phases (IPS)
- Different cores have varying relative energy efficiencies for the same workload (IPS/W)
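Because the cores run at different clock speeds (2.1GHz for the Alpha cores, 1GHz for the R4700), per-cycle throughput alone does not rank them; that is presumably why the slides report IPS and IPS/W. A minimal sketch of the two metrics, where the IPC and power values are hypothetical and only the clock frequencies come from the slides:

```python
ALPHA_CLOCK_HZ = 2.1e9  # from the slides: all Alpha cores
R4700_CLOCK_HZ = 1.0e9  # from the slides: R4700


def ips(ipc: float, clock_hz: float) -> float:
    """Instructions per second = instructions per cycle * cycles per second."""
    return ipc * clock_hz


def ips_per_watt(ipc: float, clock_hz: float, watts: float) -> float:
    """Energy efficiency; numerically equal to instructions per joule."""
    return ips(ipc, clock_hz) / watts


# Hypothetical phase: the slower-clocked core has the higher IPC...
r4700_ips = ips(0.9, R4700_CLOCK_HZ)
alpha_ips = ips(0.5, ALPHA_CLOCK_HZ)

# ...and made-up power numbers, used only to illustrate the efficiency metric.
r4700_eff = ips_per_watt(0.9, R4700_CLOCK_HZ, watts=0.5)
alpha_eff = ips_per_watt(0.5, ALPHA_CLOCK_HZ, watts=10.0)

print(r4700_ips, alpha_ips, r4700_eff, alpha_eff)
```

In this made-up phase the Alpha-clocked core still wins on IPS despite its lower IPC, while the low-power core wins on IPS/W by a wide margin.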

Performance Variation with Time
Ah! Those clear, distinct phases!

Variation of Energy Efficiency with Time
Power dominates IPS/W numbers!

How does a composite objective function fare?

Energy-delay Product Profile
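For readers unfamiliar with the metric, the energy-delay product over a fixed number of instructions can be written out as below. This is a sketch under the usual definitions (delay = instructions / IPS, energy = power × delay); the function names and the sample numbers are assumptions for illustration, not values from the talk.

```python
def energy_delay_product(ips: float, watts: float, instructions: float) -> float:
    """Energy (J) times delay (s) for executing `instructions` instructions."""
    delay_s = instructions / ips
    energy_j = watts * delay_s
    return energy_j * delay_s


def ips_squared_per_watt(ips: float, watts: float) -> float:
    """For a fixed instruction count N, ED = N^2 / (IPS^2 / W), so
    minimizing ED is the same as maximizing this quantity."""
    return ips * ips / watts


N = 1_000_000  # one sampling interval, as in the Methodology slide

# Hypothetical cores: (IPS, watts)
small = (1.2e9, 5.0)
large = (2.0e9, 60.0)

ed_small = energy_delay_product(*small, N)
ed_large = energy_delay_product(*large, N)
print(ed_small, ed_large)
```

Unlike IPS/W, this composite metric weights performance quadratically, so power no longer dominates the ranking as strongly.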

So why not run on the “best” core at all points of time??

Choosing Dynamically the Core with Best Energy-Delay Product (perf. loss<50%)
Notice the regions where best-path is not along the best energy-delay product!

Choosing Dynamically the Core with Best Energy-Delay Product (perf. loss<50%) [Summary of Results]
Columns: Energy-Delay Savings (%) | Performance Degradation (%)
Rows: Maximum | Minimum 0.1 | Mean
Number of switchings: Maximum = 387 (art), Minimum = 0, Median = 1
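The per-interval selection and the switch count reported above can be sketched like this. It is an illustrative oracle only: the trace, the two-core setup, the use of the largest core as the performance baseline, and the way savings are aggregated (total energy × total delay) are assumptions, and switching overhead is ignored.

```python
INSTRUCTIONS_PER_INTERVAL = 1_000_000  # sampling granularity from the Methodology slide


def energy_and_delay(ips, watts, n=INSTRUCTIONS_PER_INTERVAL):
    delay_s = n / ips
    return watts * delay_s, delay_s


def oracle_schedule(intervals, baseline, max_perf_loss=0.5):
    """For each interval, pick the core with the lowest energy-delay product
    among cores losing less than `max_perf_loss` IPS versus `baseline`."""
    choices = []
    for sample in intervals:
        floor = sample[baseline][0] * (1.0 - max_perf_loss)
        eligible = [core for core, (ips, _) in sample.items() if ips > floor]

        def ed(core):
            energy_j, delay_s = energy_and_delay(*sample[core])
            return energy_j * delay_s

        choices.append(min(eligible, key=ed))
    return choices


def count_switches(choices):
    return sum(1 for prev, cur in zip(choices, choices[1:]) if prev != cur)


def total_energy_delay(intervals, choices):
    energy_j = delay_s = 0.0
    for sample, core in zip(intervals, choices):
        e, d = energy_and_delay(*sample[core])
        energy_j += e
        delay_s += d
    return energy_j * delay_s


# Hypothetical trace: core -> (IPS, watts) per interval; the middle interval
# is a phase in which the small core is too slow to satisfy the constraint.
intervals = [
    {"big": (2.0e9, 60.0), "small": (1.5e9, 10.0)},
    {"big": (2.0e9, 60.0), "small": (0.5e9, 10.0)},
    {"big": (2.0e9, 60.0), "small": (1.5e9, 10.0)},
]

choices = oracle_schedule(intervals, baseline="big")
switches = count_switches(choices)
ed_dynamic = total_energy_delay(intervals, choices)
ed_static = total_energy_delay(intervals, ["big"] * len(intervals))
print(choices, switches, 1.0 - ed_dynamic / ed_static)
```

The middle interval shows the effect noted on the previous slide: the chosen path leaves the core with the best energy-delay product whenever that core would violate the performance constraint.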

Dissecting the Results
- More improvements possible:
  - locally-best decisions not necessarily globally-best
  - there was a performance constraint
  - choice of cores not the best for this objective function
  - cache configurations not necessarily the best
- Even for present improvements, beats voltage scaling handsomely (44.2% ED^2 improvement)
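A likely reason for quoting ED^2 when comparing against voltage scaling is that, under an idealized model, voltage scaling leaves ED^2 unchanged, so any ED^2 improvement is a gain that scaling alone could not deliver. The sketch below assumes that idealized model (frequency proportional to voltage, dynamic energy proportional to voltage squared, no leakage); the slides do not state how the 44.2% figure was computed.

```python
def voltage_scaled(energy_j: float, delay_s: float, v_ratio: float):
    """Idealized voltage/frequency scaling (assumption): energy scales with
    V^2 and delay scales with 1/V, where v_ratio = V_new / V_old."""
    return energy_j * v_ratio ** 2, delay_s / v_ratio


def ed(energy_j: float, delay_s: float) -> float:
    return energy_j * delay_s


def ed2(energy_j: float, delay_s: float) -> float:
    return energy_j * delay_s * delay_s


base = (1.0, 1.0)                             # normalized energy and delay
scaled = voltage_scaled(*base, v_ratio=0.8)   # hypothetical 20% voltage reduction

print(ed(*base), ed(*scaled))    # ED improves under scaling
print(ed2(*base), ed2(*scaled))  # ED^2 stays (numerically) the same
```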

Conclusion
- Enormous potential for power savings
- No leakage-power solution
- Does considerable IP reuse
- Complexity-appropriate: every application is matched to a core of “appropriate” complexity

Tip of the iceberg? Current/Future Work
- Cores can be non-ordered
- Some cores can be multithreaded
- Throughput impact of the architecture

Questions?