Dynamically Heterogeneous Cores Through 3D Resource Pooling
Houman Homayoun, Vasileios Kontorinis, Amirali Shayan, Ta-Wei Lin, Dean M. Tullsen
Speaker: Houman Homayoun

Presentation transcript:

Dynamically Heterogeneous Cores Through 3D Resource Pooling
Houman Homayoun, Vasileios Kontorinis, Amirali Shayan, Ta-Wei Lin, Dean M. Tullsen
Speaker: Houman Homayoun, National Science Foundation CI Fellow, University of California San Diego

Why Heterogeneity?
Existing general-purpose CMP designs use only homogeneous cores.
A general-purpose, one-size-fits-all core is not necessarily the most efficient.
One processor optimized for each application!
[Figure: Core 1 and Core 2.]

Static vs. Dynamic Heterogeneity
Prior proposals (e.g., Kumar 2003) use static heterogeneity. This increases the chance of finding an appropriate core but does not guarantee a perfect match.
Others have proposed solutions for dynamic heterogeneity (Core Fusion, TFlex). Because sharing resources at a fine granularity is difficult, they enable only coarse-grain sharing: big (combined) cores or small cores.

Outline: Resource Pooling, Why 3D?, Design Solutions, Adaptive Policies, Results, Conclusion

Application Resource Utilization
[Figure: per-application utilization of the ROB, LDSQ, RF, and IQ.]
[Figure: a dual-core machine running Application 1 and Application 2, showing the ROB, LDSQ, RF, and IQ of each core with underutilized resources marked.]

Dynamic Heterogeneity Through Resource Pooling
[Figure: the register file and ROB of Core 1 and of Core 2.]

Outline: Need for Heterogeneity, Why 3D?, Design Solutions, Adaptive Policies, Results, Conclusion

Why Not Share in 2D?
Wire delay is long in 2D, so sharing in 2D is not efficient.
[Figure labels: Demanding, 500 psec, 5 nsec.]

Our Solution: 3D
Fast interconnection network: latency as low as a few picoseconds, three orders of magnitude smaller than in 2D.
Minimizes the communication latency.
[Figure labels: 5 psec, 5000 psec.]
A principal advantage: no change to the fundamental pipeline design of 2D architectures, yet the design still exploits 3D to provide greater energy proportionality and core customization.


Stackable Structures for Resource Pooling
Target the resources that are performance bottlenecks and power hungry:
  Reorder buffer and register file (SRAM).
  Instruction queue and load and store queue (CAM+SRAM).
Our goal: share units across multiple cores with minimal impact on the design spec (latency, number of ports, and power).
Use a previously proposed modular design:
  Each partition is a self-standing, independently usable unit.
  Partitioning is effective in reducing power and access delay.
[Figure: a register file split into four independent partitions, Part 1 to Part 4.]

Example of Resource Sharing
Additional logic decides whether a partition is empty.
Additional logic routes the signal to the right partition.
[Figure: decoder, MUX, and TSV connecting the register file in Core 0 and the register file in Core 1, with a free partition marked.]
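The routing idea on this slide can be sketched in software. The following is only an illustrative model, not the hardware design from the talk: the class `PooledRegisterFile`, the partition size of 32 entries, and the rule that a core gives up its most recently added partition are all assumptions made for the example. It shows a free partition moving from one core to another and an access being routed to a partition that lives in the other layer (the case that would cross a TSV).

```python
PARTITION_SIZE = 32  # entries per partition (assumed value, not from the slides)


class PooledRegisterFile:
    """Toy model of register file partitions pooled across stacked cores."""

    def __init__(self, cores=2, partitions_per_core=4):
        # Each partition is identified by (home_core, index_within_home_core).
        self.owned = {
            c: [(c, i) for i in range(partitions_per_core)] for c in range(cores)
        }
        self.free = []  # partitions that are empty and available to any core

    def release(self, core):
        """Give up the core's most recently added partition (assumed empty)."""
        part = self.owned[core].pop()
        self.free.append(part)
        return part

    def acquire(self, core):
        """Take one free partition, if any, and append it to the core's set."""
        if not self.free:
            return None
        part = self.free.pop(0)
        self.owned[core].append(part)
        return part

    def route(self, core, entry):
        """Map a core-local entry number to its physical location.

        Returns (home_core, partition_index, offset, crosses_tsv).
        """
        if entry < 0:
            raise IndexError("negative entry")
        slot, offset = divmod(entry, PARTITION_SIZE)
        home_core, index = self.owned[core][slot]  # IndexError if not owned
        return home_core, index, offset, home_core != core


rf = PooledRegisterFile()
rf.release(1)                      # core 1 frees one of its partitions
rf.acquire(0)                      # core 0 grows by borrowing it
print(rf.route(0, 4 * PARTITION_SIZE + 7))
```

New partitions are appended to the end of a core's list so that entries already in use keep their mapping when the core grows.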


Adaptive Policies for Resource Pooling
Several issues need to be considered:
  Ownership
  Fast releasing
  Fast reallocation
  Cycle-by-cycle adaptation
  Starvation prevention
A simple adaptive policy specification (the MinMax policy):
  Limit how far each resource can grow (MAX) and how far it can shrink (MIN).
  Use a free list.
  Use central arbitration.

MinMax Policy Example
[Figure, four animation steps: an arbitration unit, a free list, and the register files of Core 1 to Core 4 running Application 1 to Application 4, with the MIN limit marked.]
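As a rough illustration of the MinMax policy, the sketch below models a central arbiter that tracks only how many partitions each core holds. The class name `MinMaxArbiter`, its method names, and the numbers used are hypothetical; the slides give no implementation, and the real mechanism operates cycle by cycle on hardware structures rather than on counters.

```python
class MinMaxArbiter:
    """Toy central arbiter: per-core partition counts plus a shared free list."""

    def __init__(self, cores, partitions_per_core, min_parts, max_parts):
        if not 0 < min_parts <= partitions_per_core <= max_parts:
            raise ValueError("need 0 < MIN <= initial size <= MAX")
        self.min_parts = min_parts
        self.max_parts = max_parts
        self.held = {c: partitions_per_core for c in range(cores)}
        self.free = 0  # size of the free list

    def release(self, core):
        """Return one unused partition to the free list.

        Refused when the core is already at MIN, so no core can shrink
        below its guaranteed share (this is what prevents starvation).
        """
        if self.held[core] <= self.min_parts:
            return False
        self.held[core] -= 1
        self.free += 1
        return True

    def request(self, core):
        """Grant one free partition unless none is free or the core is at MAX."""
        if self.free == 0 or self.held[core] >= self.max_parts:
            return False
        self.held[core] += 1
        self.free -= 1
        return True


arb = MinMaxArbiter(cores=4, partitions_per_core=4, min_parts=2, max_parts=8)
arb.release(1)   # an application with low demand gives up a partition
arb.request(0)   # a demanding application picks it up
print(arb.held, arb.free)
```

In this model the total number of partitions is conserved: every grant is matched by an earlier release.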


Baseline Architecture
Processor model:
  High-end architecture: four OoO cores with an issue width of 4.
  Medium-end architecture: four OoO cores with an issue width of 2.
3D floorplans (different performance, flexibility, and temperature tradeoffs):
  (1) Conventional (thermal-optimized design).
  (2) Proposed (performance-optimized design).

Evaluation
Workloads: 1 thread, 2 threads, and 4 threads.
Metrics: power, performance, temperature, and energy-delay.
[Figure: Core 1 to Core 4, with active cores, idle cores, and links marked.]

Single-Thread Performance
Standard SPEC2K and SPEC2006 benchmarks.
Single benchmark (3 out of 4 cores are idle).
[Chart axis: speedup.]

Multi-Thread Performance
2Thr: 2 idle cores plus underutilized resources in the active cores.
4Thr: no idle cores, only underutilized resources.
Gains are dramatic when some cores are idle.
[Chart axis: normalized weighted speedup (%).]
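The slides do not define the weighted speedup metric. The sketch below assumes the commonly used definition, in which each thread's IPC in the multi-threaded run is divided by its IPC when running alone and the ratios are summed. The function names and the sample IPC values are made up for illustration, and the normalization against a baseline is likewise an assumption about how the percentage on the chart was obtained.

```python
def weighted_speedup(ipc_together, ipc_alone):
    """Sum of per-thread IPC ratios (assumed definition, not from the slides)."""
    if len(ipc_together) != len(ipc_alone):
        raise ValueError("need one alone-IPC per thread")
    return sum(t / a for t, a in zip(ipc_together, ipc_alone))


def normalized_gain_percent(ws_new, ws_baseline):
    """Percentage improvement of one weighted speedup over a baseline."""
    return 100.0 * (ws_new / ws_baseline - 1.0)


# Hypothetical IPC numbers for a two-thread workload.
alone = [2.0, 1.0]
baseline = weighted_speedup([1.0, 0.5], alone)   # without pooling
pooled = weighted_speedup([1.2, 0.6], alone)     # with pooling
print(normalized_gain_percent(pooled, baseline))
```

Dividing by the stand-alone IPC keeps high-IPC threads from dominating the average, which matters when the gains are concentrated in low-IPC threads.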

Medium-end vs. High-end
Resource pooling makes the medium core significantly more competitive with the high-end core.
[Chart: normalized weighted speedup (%) for 0, 2, and 3 idle cores, with resource sharing increasing; annotations: 28%, 14%, "Only 3%!".]

Power
Pooling pays a small price in power because of the enhanced throughput.
It yields large speedups on low-IPC threads and a high average speedup, but a smaller increase in total instruction throughput and thus a smaller increase in power.
[Chart axis: power (Watt); annotations: 3X, 4X.]

Temperature
Interestingly, the temperature of the medium resource-pooling core is comparable to that of the high-end core.
[Chart axis: temperature (Celsius).]

Efficiency
Even so, at equal temperature, the more modest cores have a significant advantage in energy efficiency, measured in MIPS^2/W (MIPS^2/W is the inverse of the energy-delay product).
[Chart: normalized MIPS^2/W; annotation: 2X.]
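The relation between MIPS^2/W and the energy-delay product can be checked numerically. With I instructions executed in T seconds using E joules, MIPS = I / (T * 10^6) and W = E / T, so MIPS^2/W = I^2 / (10^12 * E * T). For a fixed instruction count this is inversely proportional to E * T. The function names and the sample numbers below are illustrative assumptions, not measurements from the talk.

```python
import math


def mips2_per_watt(instructions, seconds, joules):
    """MIPS squared divided by average power in watts."""
    mips = instructions / seconds / 1e6
    watts = joules / seconds
    return mips ** 2 / watts


def energy_delay_product(seconds, joules):
    """Energy (joules) times delay (seconds)."""
    return joules * seconds


# Two hypothetical runs of the same 2-billion-instruction program.
I = 2e9
run_a = (1.0, 10.0)   # (seconds, joules)
run_b = (2.0, 10.0)

for seconds, joules in (run_a, run_b):
    product = mips2_per_watt(I, seconds, joules) * energy_delay_product(seconds, joules)
    # The product depends only on the instruction count: I^2 / 1e12.
    assert math.isclose(product, I ** 2 / 1e12)
```

Because the product is constant for a fixed program, doubling MIPS^2/W is the same as halving the energy-delay product.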

Conclusions
Homogeneous cores are inherently inefficient for a diverse workload; as a result, cores are typically overprovisioned.
3D stacking of cores enables fine-grain sharing (pooling) of resources that is not possible in 2D designs.
Our dynamically heterogeneous 3D architecture allows the processor to construct the right core for each application dynamically, maximizing energy efficiency.
Our 3D pooling architecture:
  Leverages our experience in 2D pipeline design, yet still gains significant benefit from 3D.
  Adapts to the specific demands of an application within a few cycles.
  Reduces reliance on overprovisioned cores, instead grabbing larger resources only when needed.

End of presentation