Core Architecture Optimization for Heterogeneous CMPs R. Kumar, D. M. Tullsen, and N.P. Jouppi İlker YILDIRIM

Outline: Introduction; Related Work; Using Workloads for Multi-core Design; Monotonicity vs. Non-monotonicity; Methodology; Analysis and Results; Conclusion.

Introduction: Multiple-core processors are becoming more popular. They allow more flexibility in design, including heterogeneity across cores. But how should CMPs with such heterogeneity be designed?

Introduction: The design depends on workloads, power and area constraints, the level of threading, etc. Is the best design a combination of good general-purpose cores or a combination of specialized cores?

Related Work: Prior work on heterogeneous CMP design addresses power efficiency and improved processor performance. None of it addresses how to arrive at such a design; it all assumes a given design as a starting point.

Using Workloads for Design: Best design for what? For a set of applications, specifically a representative set of workloads. Searching for one optimal CPU is already expensive, and the search space explodes for multiprocessors. Simplifying assumptions: the performance of the CMP is the sum of the performance of its cores; caches are private; only major blocks are configurable; the number of cores is fixed at 4.

Monotonicity vs. Non-monotonicity: The cores of a CMP possess monotonicity if they can be fully ordered, for example in terms of performance or in terms of voltage/frequency. Non-monotonicity arises when there is no full ordering, only a partial ordering: one core is good for memory-intensive jobs, while another is good for jobs that execute many instructions.
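
As a rough illustration of the full-ordering idea, the sketch below treats each core as a tuple of per-workload performance numbers and calls a set of cores monotonic when every pair is comparable. The core names and numbers are hypothetical, and comparing on performance alone is an assumption made for brevity.

```python
from itertools import combinations

def dominates(a, b):
    """True if core a is at least as fast as core b on every workload."""
    return all(x >= y for x, y in zip(a, b))

def is_monotonic(cores):
    """Cores are monotonic if every pair is comparable (a full ordering)."""
    return all(dominates(a, b) or dominates(b, a)
               for a, b in combinations(cores.values(), 2))

# Hypothetical per-workload performance: (memory-bound job, compute-bound job)
ordered = {"small": (1.0, 1.0), "big": (1.8, 1.3)}
tuned = {"mem_core": (1.0, 1.6), "cpu_core": (1.7, 1.1)}

print(is_monotonic(ordered))  # True: "big" is better everywhere
print(is_monotonic(tuned))    # False: each core wins on one workload
```

In the second set neither core is better on both workloads, so the cores can only be partially ordered.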

Methodology: Modeling of CPU cores; Modeling Power and Area; Modeling Performance.

CPU cores: 4-core multiprocessors in 0.10 micron, 1.2 V technology, with private L2 caches. In-order cores (Alpha EV5) and out-of-order cores (MIPS R10000). 480 cores are evaluated (96 in-order, 384 out-of-order). The number of possible distinct 4-core combinations is over 2.2 billion.
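
The 2.2 billion figure can be checked with a short calculation. The sketch assumes that the order of cores on the chip does not matter and that a core type may be repeated (so homogeneous designs are included); under that assumption the count is the number of multisets of size 4 drawn from 480 types. The function name is illustrative.

```python
from math import comb

def distinct_cmps(num_core_types, cores_per_chip):
    """Number of multisets of size k from n types: C(n + k - 1, k)."""
    return comb(num_core_types + cores_per_chip - 1, cores_per_chip)

print(distinct_cmps(480, 4))  # 2239593720, i.e. a little over 2.2 billion
```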

Power and Area: Area budget = sum of the areas of the 4 cores. Only peak activity power is considered. Each core ranges between W of power and mm² of area. Aggregate: 13.2 to 88 mm² of area and 16.4 to 65.2 W of power.
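
A minimal sketch of the budget check implied above: a candidate CMP is feasible when the summed core areas and the summed peak powers both fit the budget. The core records, field names, and budget values are hypothetical.

```python
def within_budget(cmp_cores, area_budget_mm2, power_budget_w):
    """Check a candidate CMP against area and peak-power budgets."""
    total_area = sum(core["area_mm2"] for core in cmp_cores)
    total_power = sum(core["peak_w"] for core in cmp_cores)
    return total_area <= area_budget_mm2 and total_power <= power_budget_w

# Hypothetical cores, not values from the study
small = {"area_mm2": 4.0, "peak_w": 5.0}
big = {"area_mm2": 20.0, "peak_w": 15.0}

print(within_budget([big, small, small, small], 40.0, 35.0))  # True: 32 mm2, 30 W
print(within_budget([big, big, big, big], 40.0, 35.0))        # False: 80 mm2, 60 W
```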

Performance - Workloads: SPEC (the Standard Performance Evaluation Corporation) is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. Workloads are combinations of processor-bound and bandwidth-bound benchmarks, ranging from all different (a,b,c,d) to all same (a,a,a,a), giving a wide range of workloads.

Performance - Evaluation: 2.2 billion distinct CMPs using 480 distinct cores. Performance of a 4-core CMP = sum of the performance of each core, each with its own private L2 cache. Cost of evaluating each core: 480 distinct cores x #benchmarks x #cycles (millions of cycles). Metric: weighted speedup, the arithmetic sum of each running thread's IPC divided by its IPC on the simplest core considered.
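
The weighted speedup metric can be written out directly. In the sketch below the IPC values are made up for illustration; the function simply sums, over the running threads, each thread's IPC on its assigned core divided by its IPC on the simplest core.

```python
def weighted_speedup(ipc_on_assigned_core, ipc_on_simplest_core):
    """Sum of per-thread IPC ratios relative to the simplest core."""
    return sum(assigned / baseline
               for assigned, baseline in zip(ipc_on_assigned_core,
                                             ipc_on_simplest_core))

# Hypothetical IPCs for four threads
assigned = [1.2, 0.9, 0.5, 2.0]
baseline = [0.6, 0.9, 0.25, 1.0]

print(weighted_speedup(assigned, baseline))  # 2 + 1 + 2 + 2 = 7.0
```

Four threads running on the simplest core would score 4.0 under this metric, so values above 4.0 indicate a gain over that baseline.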

Analysis and Results: Analyzing multi-core processors for a given workload; Analyzing multi-core processors for a given budget; Quantifying inefficiency due to monotonicity; Varying Thread-level Parallelism; Efficient Search Techniques.

Analyzing for a Given Workload: All-different workload: eon, mesa, deltablue, mcf. Non-monotonicity is observed.

[Figure: results for mesa, mcf, deltablue, eon]

Analyzing for a Given Budget: The analysis is extended in two ways: any combination of workloads, and different area and power budgets.

All-same case: Heterogeneity captures diversity among different homogeneous workloads. Performance depends on the power budget. Heterogeneity yields specialized cores, whereas homogeneity yields envelope cores.

Heterogeneity gives a significant benefit as long as the power and area budgets are constrained. The diversity required is related to the available budgets: the tighter the constraints, the greater the diversity. There is a large difference between the best heterogeneous and the best homogeneous CMP designs. The best design is not a composition of copies of the single best-performing core; rather, it is a combination of tuned cores.

Quantifying inefficiency due to monotonicity: The best non-monotonic design of 2 cores is better than the best monotonic design of 2 cores by 7.5%, and better than the best homogeneous design of 2 cores by 15.4%. The more constrained the budget, the higher the cost of monotonicity.

Varying Thread-Level Parallelism: Again, heterogeneity has benefits. When there is less competition, performance is better.

Again, observe the benefits of heterogeneity. Interestingly, for TLP=1, a design of heterogeneous, tuned cores is better than a huge monolithic core with tiny complementary cores.

Efficient Search Techniques: A huge search space: 2.2 billion distinct core combinations and thousands of 4-thread workloads. Exhaustive search is not scalable: what if there are more than 10 different applications, or more than 4 cores? A smarter solution: hill climbing, which is likely to get stuck at a local maximum. The results are still better than those of homogeneous designs.
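
The following is a generic steepest-ascent hill-climbing sketch over core combinations, not necessarily the exact procedure used in the paper. Core types are integer indices; the `score` and `feasible` callbacks, the three toy core types, and the area budget of 8 are all hypothetical.

```python
def hill_climb(start, num_core_types, score, feasible):
    """Swap one core at a time, keeping the best feasible improvement."""
    current = tuple(sorted(start))
    current_score = score(current)
    while True:
        best, best_score = current, current_score
        for slot in range(len(current)):
            for core in range(num_core_types):
                candidate = list(current)
                candidate[slot] = core
                candidate = tuple(sorted(candidate))
                if feasible(candidate) and score(candidate) > best_score:
                    best, best_score = candidate, score(candidate)
        if best == current:  # no neighbor improves: a local maximum
            return current, current_score
        current, current_score = best, best_score

# Toy model: three core types with made-up area and performance
AREA = [1, 2, 4]
PERF = [1.0, 1.8, 3.0]

def toy_score(cmp_design):
    return sum(PERF[c] for c in cmp_design)

def toy_feasible(cmp_design):
    return sum(AREA[c] for c in cmp_design) <= 8

result, result_score = hill_climb((0, 0, 0, 0), 3, toy_score, toy_feasible)
print(result, result_score)
```

On this toy model the search stops at (0, 0, 1, 2) with a score of about 6.8, although four copies of core type 1 would fit the same budget and score about 7.2. This illustrates how hill climbing can settle at a local maximum.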

Conclusions: How to design a good heterogeneous CMP for a given power and area budget and a set of workloads. The best approach is not to combine good general-purpose cores but to combine tuned heterogeneous cores. Such tuning results in non-monotonicity, in the sense that the cores can only be partially ordered. Heterogeneous designs also perform better for homogeneous workloads.