1 Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy.

Presentation transcript:

1 Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen Presenter: Borys Bradel

2 Introduction: Different programs have different requirements (e.g. ILP), and this extends to the phases of a single program. Heterogeneous cores: use the core that matches the requirements. Reuse existing cores: use multiple generations of the same family of processors.

3 Outline: Methodology (hardware, assumptions, power). Experiments (optimal: energy / energy-delay product; heuristic-based: static / dynamic). Related work. Conclusion.

4 Single-ISA Multi-Core Benefits: Small area overhead, because core sizes grow between generations. Clock frequencies of older cores would scale with technology (P3 1 GHz = P4 1.4 GHz). Pipeline depth was increased precisely because the clock could not scale.

5 Hardware – Alpha Family: 2 in-order cores (EV4 = 21064, EV5 = 21164). 2 out-of-order cores (EV6 = 21264, EV8- = 21464 with multithreading support removed).

6 Hardware Size: 15% more area than using the 21464 alone.

7 Assumptions: Cores can be switched dynamically. Private L1 caches and a common L2 cache. All cores use 0.10 micron technology. A single process executes on a single core at any one time. 2.1 GHz clock (= micron 600 MHz). Input voltage 1.2 V. Cores shut down when idle. 1000-cycle restart cost (staged; phase-locked loop left alone). 150 ns memory access. Stall cycles through CACTI.

8 Core Configurations

9 Power Model: Use Wattch to account for activity-based dissipation. Use scaling and offset factors to account for other factors. This hybrid model is closer to manufacturers' data points. Peak power: data sheets less L2 cache and output pins. Typical power: scaled based on Intel chips.
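The slide names scaling and offset factors but does not give their form. The Python sketch below assumes the simplest possible reading: a linear correction applied to an activity-based estimate, with the two factors fitted so that the simulated peak and typical points land on reference peak and typical values. The function names, the linear form, and every number are illustrative assumptions, not values from the paper.

```python
def hybrid_power(activity_power_w, scale, offset_w):
    """Apply a linear correction to an activity-based power estimate (watts)."""
    return scale * activity_power_w + offset_w


def fit_scale_offset(sim_peak, sim_typical, ref_peak, ref_typical):
    """Fit scale and offset so the two simulated points match two reference points.

    Assumes sim_peak != sim_typical; all arguments are in watts.
    """
    scale = (ref_peak - ref_typical) / (sim_peak - sim_typical)
    offset = ref_peak - scale * sim_peak
    return scale, offset


# Hypothetical numbers, for illustration only.
scale, offset = fit_scale_offset(sim_peak=80.0, sim_typical=40.0,
                                 ref_peak=50.0, ref_typical=30.0)
calibrated_typical = hybrid_power(40.0, scale, offset)
calibrated_peak = hybrid_power(80.0, scale, offset)
```

With two reference points the fit is exact; with more data points a least-squares fit would be the natural extension.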

10 Power and Area Statistics

11 Performance Modeling: Use SMTSIM, a cycle-accurate simulator. SimPoint is used to identify representative instructions of programs and how many instructions need to be fast-forwarded.

12 Varying Performance Ratio

13 Varying Energy Efficiency Ratio

14 Oracle Switching for Energy: Performance always within 10% of EV8-.

15 Oracle Switching for Energy

16 Oracle Switching for Energy Delay Product: Performance always within 50% of EV8-.

17 Oracle Switching for Energy Delay Product
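The oracle experiments on slides 14 to 17 can be illustrated with a small Python sketch. It assumes that per-interval energy and delay are known in advance for every core (which is what makes it an oracle), and it treats the performance bound relative to EV8- as an optional filter applied per interval. The data layout, the function name, and the sample numbers are hypothetical; the slides do not say whether the bound is a constraint or an observed result.

```python
from typing import Dict, List, Optional, Tuple

# (energy, delay) measured for one interval on one core; units are arbitrary here.
Sample = Tuple[float, float]


def oracle_schedule(intervals: List[Dict[str, Sample]],
                    metric: str = "energy",
                    reference: str = "EV8-",
                    max_slowdown: Optional[float] = None) -> List[str]:
    """Pick, for every interval, the core with the lowest cost.

    metric: "energy" minimizes energy, "edp" minimizes energy * delay.
    max_slowdown: if set (e.g. 0.10), only cores whose delay is within that
    fraction of the reference core's delay are considered.
    """
    if metric not in ("energy", "edp"):
        raise ValueError("metric must be 'energy' or 'edp'")

    def cost(sample: Sample) -> float:
        energy, delay = sample
        return energy if metric == "energy" else energy * delay

    chosen = []
    for per_core in intervals:
        candidates = per_core
        if max_slowdown is not None:
            limit = per_core[reference][1] * (1.0 + max_slowdown)
            # The reference core always satisfies the limit, so this is never empty.
            candidates = {c: s for c, s in per_core.items() if s[1] <= limit}
        chosen.append(min(candidates, key=lambda c: cost(candidates[c])))
    return chosen


# Hypothetical two-interval trace with two cores.
trace = [
    {"EV5": (1.0, 2.0), "EV8-": (5.0, 1.0)},
    {"EV5": (1.0, 1.05), "EV8-": (5.0, 1.0)},
]
energy_choice = oracle_schedule(trace, metric="energy", max_slowdown=0.10)
edp_choice = oracle_schedule(trace, metric="edp")
```

In the first interval the small core is twice as slow, so the 10% bound forces the large core; in the second it is only 5% slower and wins on energy.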

18 Others: Voltage/frequency scaling: not as good. Static core selection: only EV6 and EV8- are used. Dynamic heuristic: running average performance within 10%; every 100 time intervals (100 million instructions), cores are sampled for 5 intervals; the best core is selected based on the sampling.
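A rough Python sketch of the sampling loop behind the dynamic heuristic follows. It assumes a hypothetical `measure(core, interval)` callback that returns `(energy, delay)` for one interval, and it scores each sampled core by energy times delay. The running-average performance bound mentioned on the slide is left out for brevity, and the choice of which cores to sample and how to score them is an assumption rather than the authors' exact policy.

```python
def run_sampling_heuristic(cores, measure, total_intervals,
                           run_length=100, sample_length=5):
    """Return the core used in each interval.

    Runs on the current core for run_length intervals, then samples every core
    for sample_length intervals and switches to the one with the lowest
    (total energy * total delay) over its sample window.
    """
    trace = []
    current = cores[0]
    t = 0
    while t < total_intervals:
        # Steady phase: stay on the currently selected core.
        for _ in range(run_length):
            if t >= total_intervals:
                break
            measure(current, t)
            trace.append(current)
            t += 1
        # Sampling phase: try each core for a few intervals.
        scores = {}
        for core in cores:
            energy_sum, delay_sum, n = 0.0, 0.0, 0
            for _ in range(sample_length):
                if t >= total_intervals:
                    break
                energy, delay = measure(core, t)
                energy_sum += energy
                delay_sum += delay
                n += 1
                trace.append(core)
                t += 1
            if n == sample_length:  # ignore truncated windows at the end
                scores[core] = energy_sum * delay_sum
        if scores:
            current = min(scores, key=scores.get)
    return trace


# Hypothetical per-interval behaviour: EV6 is fast but costly, EV5 slow but frugal.
def fake_measure(core, interval):
    return (4.0, 1.0) if core == "EV6" else (1.0, 2.0)


schedule = run_sampling_heuristic(["EV6", "EV5"], fake_measure,
                                  total_intervals=12,
                                  run_length=3, sample_length=2)
```

The small `run_length` and `sample_length` in the usage example only keep the trace short; the slide's values are the defaults.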

19 Results for Heuristics

20 Results for Heuristics/Static Core

21 Related Work: Gating-based power optimization: cannot gate at a fine enough granularity; may still have leakage; this approach could be thought of as gating to reduce the capabilities of different units. Voltage and frequency scaling: chip-wide, one size does not fit all; fine-grained, granularity problems.

22 Conclusions: Heterogeneous multi-core architectures reduce the energy-delay product. More fine-grained than other approaches. Using several cores from the same family is good: it reduces development/testing costs. Is it scalable? Just use EV6?