18-742 Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

Slides:



Advertisements
Similar presentations
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Advertisements

CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.
Lecture 6: Multicore Systems
Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.
Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.
Multithreading Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.
Single-Chip Multiprocessor Nirmal Andrews. Case for single chip multiprocessors Advances in the field of integrated chip processing. - Gate density (More.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Dec 5, 2005 Topic: Intro to Multiprocessors and Thread-Level Parallelism.
1 Lecture 19: Core Design Today: issue queue, ILP, clock speed, ILP innovations.
Multithreading and Dataflow Architectures CPSC 321 Andreas Klappenecker.
1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )
Chapter 17 Parallel Processing.
CS Lecture 20 The Case for a Single-Chip Multiprocessor K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, K-Y. Chang Proceedings of ASPLOS-VII October.
1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)
CS 7810 Lecture 24 The Cell Processor H. Peter Hofstee Proceedings of HPCA-11 February 2005.
Single-Chip Multi-Processors (CMP) PRADEEP DANDAMUDI 1 ELEC , Fall 08.
1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.
Hyper-Threading, Chip multiprocessors and both Zoran Jovanovic.
Computer System Architectures Computer System Software
MorphCore: An Energy-Efficient Architecture for High-Performance ILP and High-Throughput TLP Khubaib * M. Aater Suleman *+ Milad Hashemi * Chris Wilkerson.
18-447: Computer Architecture Lecture 30B: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013.
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar.
Multi-core architectures. Single-core computer Single-core CPU chip.
Multi-Core Architectures
1 Multi-core processors 12/1/09. 2 Multiprocessors inside a single chip It is now possible to implement multiple processors (cores) inside a single chip.
Lynn Choi School of Electrical Engineering Microprocessor Microarchitecture The Past, Present, and Future of CPU Architecture.
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
10/27: Lecture Topics Survey results Current Architectural Trends Operating Systems Intro –What is an OS? –Issues in operating systems.
CASH: REVISITING HARDWARE SHARING IN SINGLE-CHIP PARALLEL PROCESSOR
Outline  Over view  Design  Performance  Advantages and disadvantages  Examples  Conclusion  Bibliography.
A few issues on the design of future multicores André Seznec IRISA/INRIA.
SIMULTANEOUS MULTITHREADING Ting Liu Liu Ren Hua Zhong.
Multi-core processors. 2 Processor development till 2004 Out-of-order Instruction scheduling Out-of-order Instruction scheduling.
© Wen-mei Hwu and S. J. Patel, 2005 ECE 511, University of Illinois Lecture 4: Microarchitecture: Overview and General Trends.
On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2.
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University.
HyperThreading ● Improves processor performance under certain workloads by providing useful work for execution units that would otherwise be idle ● Duplicates.
Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.
EKT303/4 Superscalar vs Super-pipelined.
E6200, Fall 07, Oct 24Ambale: CMP1 Bharath Ambale Venkatesh 10/24/2007.
Application Domains for Fixed-Length Block Structured Architectures ACSAC-2001 Gold Coast, January 30, 2001 ACSAC-2001 Gold Coast, January 30, 2001.
On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.
ECE/CS 552: Multithreading and Multicore © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John.
15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University.
Spring 2011 Parallel Computer Architecture Lecture 11: Core Fusion and Multithreading Prof. Onur Mutlu Carnegie Mellon University.
CS Lecture 20 The Case for a Single-Chip Multiprocessor
18-447: Computer Architecture Lecture 30B: Multiprocessors
Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Parallel Processing Basics
Prof. Onur Mutlu Carnegie Mellon University
Samira Khan University of Virginia Sep 6, 2017
Lynn Choi School of Electrical Engineering
Simultaneous Multithreading
Multi-core processors
15-740/ Computer Architecture Lecture 7: Pipelining
Prof. Gennady Pekhimenko University of Toronto Fall 2017
Architecture & Organization 1
Hyperthreading Technology
Multi-Processing in High Performance Computer Architecture:
Computer Architecture: Multithreading (I)
Architecture & Organization 1
Computer Architecture Lecture 4 17th May, 2006
Lecture 17: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.
Samira Khan University of Virginia Sep 12, 2018
Adaptive Single-Chip Multiprocessing
Lecture 19: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
Computer Evolution and Performance
Samira Khan University of Virginia Jan 30, 2019
Presentation transcript:

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012

Reminder: Reviews Due Sunday Sunday, September 16, 11:59pm. Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS Suleman et al., “Data Marshaling for Multi-core Architectures,” ISCA Joao et al., “Bottleneck Identification and Scheduling in Multithreaded Applications,” ASPLOS

Multi-Core Processors 3

Moore’s Law 4 Moore, “Cramming more components onto integrated circuits,” Electronics, 1965.

5

Multi-Core Idea: Put multiple processors on the same die. Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area What else could you do with the die area you dedicate to multiple processors?  Have a bigger, more powerful core  Have larger caches in the memory hierarchy  Simultaneous multithreading  Integrate platform components on chip (e.g., network interface, memory controllers) 6

Why Multi-Core? Alternative: Bigger, more powerful single core  Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc + Improves single-thread performance transparently to programmer, compiler - Very difficult to design (Scalable algorithms for improving single-thread performance elusive) - Power hungry – many out-of-order execution structures consume significant power/area when scaled. Why? - Diminishing returns on performance - Does not significantly help memory-bound application performance (Scalable algorithms for this elusive) 7

Large Superscalar vs. Multi-Core Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS

Multi-Core vs. Large Superscalar Multi-core advantages + Simpler cores  more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures) + Higher system throughput on multiprogrammed workloads  reduced context switches + Higher system throughput in parallel applications Multi-core disadvantages - Requires parallel tasks/threads to improve performance (parallel programming) - Resource sharing can reduce single-thread performance - Shared hardware resources need to be managed - Number of pins limits data supply for increased demand 9

Large Superscalar vs. Multi-Core Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS Technology push  Instruction issue queue size limits the cycle time of the superscalar, OoO processor  diminishing performance Quadratic increase in complexity with issue width  Large, multi-ported register files to support large instruction windows and issue widths  reduced frequency or longer RF access, diminishing performance Application pull  Integer applications: little parallelism?  FP applications: abundant loop-level parallelism  Others (transaction proc., multiprogramming): CMP better fit 10

Comparison Points… 11

Why Multi-Core? Alternative: Bigger caches + Improves single-thread performance transparently to programmer, compiler + Simple to design - Diminishing single-thread performance returns from cache size. Why? - Multiple levels complicate memory hierarchy 12

Cache vs. Core 13

Why Multi-Core? Alternative: (Simultaneous) Multithreading + Exploits thread-level parallelism (just like multi-core) + Good single-thread performance with SMT + No need to have an entire core for another thread + Parallel performance aided by tight sharing of caches - Scalability is limited: need bigger register files, larger issue width (and associated costs) to have many threads  complex with many threads - Parallel performance limited by shared fetch bandwidth - Extensive resource sharing at the pipeline and memory system reduces both single-thread and parallel application performance 14

Why Multi-Core? Alternative: Integrate platform components on chip instead + Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller) - Not all applications benefit (e.g., CPU intensive code sections) 15

Why Multi-Core? Alternative: More scalable superscalar, out-of-order engines  Clustered superscalar processors (with multithreading) + Simpler to design than superscalar, more scalable than simultaneous multithreading (less resource sharing) + Can improve both single-thread and parallel application performance - Diminishing performance returns on single thread: Clustering reduces IPC performance compared to monolithic superscalar. Why? - Parallel performance limited by shared fetch bandwidth - Difficult to design 16

Clustered Superscalar+OoO Processors Clustering (e.g., Alpha integer units)  Divide the scheduling window (and register file) into multiple clusters  Instructions steered into clusters (e.g. based on dependence)  Clusters schedule instructions out-of-order, within cluster scheduling can be in-order  Inter-cluster communication happens via register files (no full bypass) + Smaller scheduling windows, simpler wakeup algorithms + Smaller ports into register files + Faster within-cluster bypass -- Extra delay when instructions require across-cluster communication 17

Clustering (I) Scheduling within each cluster can be out of order 18

Clustering (II) 19

Palacharla et al., “Complexity Effective Superscalar Processors,” ISCA Clustering (III) 20 Each scheduler is a FIFO + Simpler + Can have N FIFOs (OoO w.r.t. each other) + Reduces scheduling complexity -- More dispatch stalls Inter-cluster bypass: Results produced by an FU in Cluster 0 is not individually forwarded to each FU in another cluster.

Why Multi-Core? Alternative: Traditional symmetric multiprocessors + Smaller die size (for the same processing core) + More memory bandwidth (no pin bottleneck) + Fewer shared resources  less contention between threads - Long latencies between cores (need to go off chip)  shared data accesses limit performance  parallel application scalability is limited - Worse resource efficiency due to less sharing  worse power/energy efficiency 21

Why Multi-Core? Other alternatives?  Dataflow?  Vector processors (SIMD)?  Integrating DRAM on chip?  Reconfigurable logic? (general purpose?) 22

Review: Multi-Core Alternatives Bigger, more powerful single core Bigger caches (Simultaneous) multithreading Integrate platform components on chip instead More scalable superscalar, out-of-order engines Traditional symmetric multiprocessors Dataflow? Vector processors (SIMD)? Integrating DRAM on chip? Reconfigurable logic? (general purpose?) Other alternatives? 23