Adaptive Single-Chip Multiprocessing


Adaptive Single-Chip Multiprocessing
Dan Gibson (degibson@wisc.edu)
University of Wisconsin-Madison, Department of Electrical and Computer Engineering

Introduction
- Moore's Law continues to provide more transistors
- Devices are getting smaller and faster, which leads to increases in clock frequency
- Memories are getting bigger; large memories often require more time to access
  - RC circuits continue to charge exponentially
- Long-wire signal propagation time is not improving as rapidly as switching speed
- On-chip communication time is slower relative to processor clock speeds

ECE Qualifying Exam

The Memory Wall
- Processors grow faster, memory grows slower
- Off-chip cache misses can halt even aggressive out-of-order processors
- On-chip cache accesses are becoming long-latency events
- Latency can sometimes be tolerated:
  - Caching
  - Prefetching
  - Speculation
  - Out-of-order execution
  - Multithreading

The "Power" Wall
- More devices and faster clocks mean more power
- Power supply accounts for many of the pins in chip packaging (3,057 of 5,370 pins on the POWER5)
- Heat dissipation increases total cost of ownership (~34 W of cooling power is required to remove 100 W of heat)
- Dynamic power in CMOS: devices get smaller, faster, and more numerous
  - More capacitance, higher frequency
- Architects can constrain α, C_L, and f
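The constraint on the slide comes from the standard CMOS dynamic-power relation, P_dyn = α · C_L · V² · f (activity factor, switched capacitance, supply voltage, clock frequency). A minimal sketch with illustrative, hypothetical values shows why lowering f (or α or C_L) pays off linearly:

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """CMOS dynamic switching power in watts: P = a * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency

# Hypothetical operating point: a = 0.2, C = 1 nF, V = 1.2 V, f = 3 GHz.
base = dynamic_power(0.2, 1e-9, 1.2, 3e9)
# Halving frequency alone halves dynamic power (voltage scaling would
# help quadratically, but is not modeled here).
half = dynamic_power(0.2, 1e-9, 1.2, 1.5e9)
```

The values are placeholders for illustration only, not figures from the talk.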

Enter Chip Multiprocessors (CMPs)
- One chip, many processors
- Multiple cores per chip
- Often multiple threads per core
- [Figure: dual-core AMD Opteron die photo, from Microprocessor Report: Best Servers of 2004]

CMPs
- CMPs can have good performance
  - Explicit thread-level parallelism
  - Related threads experience constructive prefetching
- CMPs can tolerate long-latency events well
  - With many concurrent threads, long-latency memory accesses can be overlapped
- CMPs can be power-efficient
  - Enables use of simpler cores
  - Distributes "hot spots"
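The latency-overlap claim can be made concrete with a toy occupancy model: if each thread computes for C cycles and then stalls for M cycles on memory, a single thread keeps the pipeline busy only C/(C+M) of the time, while N interleaved threads can hide each other's stalls. This is a back-of-the-envelope sketch, not a model from the talk:

```python
def core_utilization(n_threads, compute_cycles, miss_cycles):
    """Fraction of cycles spent on useful work, assuming perfect
    interleaving: each thread computes, then stalls on a memory miss."""
    return min(1.0, n_threads * compute_cycles /
               (compute_cycles + miss_cycles))

# One thread doing 10 cycles of work per 90-cycle miss is 10% busy;
# ten such threads fully overlap their stalls and saturate the core.
```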

CMPs
- CMPs are very specialized
  - Assumes a (highly) threaded workload
- Parallel machines are difficult to use
  - Parallel programming is not (yet) commonplace
- Many problems are similar to traditional multiprocessors
  - Cache coherence
  - Memory consistency
- Many new opportunities
  - Cache sharing
  - More integration

Adaptive CMPs
- To combat specialization, adapt a CMP dynamically to its current workload and system:
  - Adapt caching policy (Beckmann et al., Chang et al., and more)
  - Adapt cache structure (Alameldeen et al., and more)
  - Adapt thread scheduling (Kihm et al., in the SMT space)
- Current idea: adaptive thread scheduling from the space of un-stalled and stalled threads
  - A union of single-core multithreading and runahead execution in the context of CMPs

Single-Core Multithreading
- Allow multiple (HW) threads within the same execution pipeline
- Shares processor resources: FUs, decode, ROB, etc.
- Shares local memory resources: L1 caches, LSQ, etc.
- Can increase processor and memory utilization
- [Figure: Sun's Niagara pipeline block diagram (Kongetira et al.)]
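The core of a Niagara-style fine-grained multithreaded pipeline is its issue choice: each cycle, pick the next ready hardware thread round-robin, skipping threads stalled on long-latency events. A minimal sketch of that selection step (the dict layout is illustrative, not from the talk):

```python
def next_thread(threads, last):
    """Round-robin thread select: starting after index `last`, return
    the index of the first thread not stalled on a long-latency event,
    or None if every thread is stalled (pipeline idles this cycle)."""
    n = len(threads)
    for i in range(1, n + 1):
        cand = (last + i) % n
        if not threads[cand]["stalled"]:
            return cand
    return None
```

Real least-recently-issued policies add fairness state; this sketch only captures the skip-stalled round-robin idea.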

Runahead Execution
- Continue execution in the face of a cache miss:
  - "Checkpoint" architectural state
  - Continue execution speculatively
  - Convert memory accesses to prefetches
- "Runahead" prefetches can be highly accurate and can greatly improve cache performance (Mutlu et al.)
- It is possible to issue useless prefetches
  - Can be power-inefficient (Mutlu et al.)
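The mechanism on this slide can be sketched as a toy trace model: when a load misses, execution continues speculatively for a bounded window, subsequent misses are recorded as prefetches, and architectural state is then rolled back to the checkpoint (here, implicitly, since nothing is committed). This is a simplification for intuition, not Mutlu et al.'s actual hardware:

```python
def runahead_prefetches(trace, cache, window):
    """Toy runahead model over a trace of load addresses. On a miss,
    speculatively 'execute' the next `window` loads and record their
    misses as prefetches; no speculative state is ever committed."""
    prefetched = set()
    for i, addr in enumerate(trace):
        if addr not in cache:                      # miss triggers runahead
            for future in trace[i + 1 : i + 1 + window]:
                if future not in cache:
                    prefetched.add(future)         # speculative load -> prefetch
    return prefetched
```

Note the model also shows the slide's caveat: a prefetched address the program never actually re-touches would be a useless (power-wasting) prefetch.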

Runahead/Multithreaded Core Interaction
- Similar hardware requirements:
  - Additional register files
  - Additional LSQ entries
- Competition for similar resources:
  - Execution time (processor pipeline, functional units, etc.)
  - Memory bandwidth
  - TLB entries, cache space, etc.

Runahead/Multithreaded Core Interaction
- A multithreaded core in a CMP, with runahead, must make difficult scheduling decisions
- Thread scheduling considerations:
  - Which thread should run?
  - Should the thread use runahead?
  - How long should the thread run/runahead?
- Scheduling implications:
  - Is an idle thread making forward progress at the expense of a useful thread?
  - Is a thread spinning on a lock held by another thread?
  - Is runahead effective for a given thread?
  - Is a given thread causing performance problems elsewhere in the CMP?

Proposed Mechanism
- Track per-thread state:
  - Runahead prefetching accuracy
    - High accuracy favors allowing the thread to runahead
  - HW-assigned thread priority
    - Highly "useful" threads are preferred
- Selection criteria:
  - Heuristic-guided: select the best priority/accuracy pair
  - Probabilistically-guided: select a thread with likelihood proportional to its priority/accuracy
  - Useful-first: select non-runahead threads first, then select runahead threads
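The three selection criteria above can be sketched as one selector over per-thread (priority, accuracy) state. The field names, the score being the priority-accuracy product, and the tie-breaking details are all assumptions for illustration; the talk does not pin down these specifics:

```python
import random

def select_thread(threads, policy, rng=random):
    """Pick the thread to run next. Each thread is a dict with
    'priority' (HW-assigned), 'accuracy' (runahead prefetch accuracy),
    and 'runahead' (currently in runahead mode)."""
    score = lambda t: t["priority"] * t["accuracy"]
    if policy == "heuristic":
        # Heuristic-guided: best priority/accuracy pair wins.
        return max(threads, key=score)
    if policy == "probabilistic":
        # Probabilistically-guided: likelihood proportional to score.
        total = sum(score(t) for t in threads)
        r = rng.uniform(0, total)
        for t in threads:
            r -= score(t)
            if r <= 0:
                return t
        return threads[-1]
    if policy == "useful-first":
        # Useful-first: prefer non-runahead threads; fall back otherwise.
        normal = [t for t in threads if not t["runahead"]]
        pool = normal if normal else threads
        return max(pool, key=score)
    raise ValueError(f"unknown policy: {policy}")
```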

Future Directions
- Dynamically adaptable CMPs offer several future areas of research:
  - Adapt for power savings / heat dissipation
    - Computation relocation, load balancing, automatic low-power modes, etc.
  - Adapt to error conditions
    - Dynamically allocate backup threads
  - Automatically relocate threads to improve resource sharing
  - Combined HW/SW/VM approach

Summary
- Latency now dominates off-chip communication; on-chip communication isn't far behind
- Many techniques exist to tolerate latency, including multithreading
- CMPs provide new challenges and opportunities to computer architects
  - Latency tolerance
  - Potential for power savings
- A CMP's behavior can be adapted to its workload
  - Dynamic management of shared resources