A few issues on the design of future multicores
André Seznec, IRISA/INRIA

Single-chip uniprocessor: the end of the road
- (Very) wide-issue superscalar processors are not cost-effective:
  - More than quadratic complexity in many key components: register file, bypass network, issue logic
  - Limited performance return
- The failure of the EV8 marked the end of very wide-issue superscalar processors

Hardware thread parallelism
- High-end single-chip components:
  - Chip multiprocessors: IBM Power5, dual-core Intel Pentium 4, dual-core Athlon 64
  - Many CMP SoCs for embedded markets
  - Cell
- (Simultaneous) multithreading: Pentium 4, Power5, ...

Thread parallelism
- Expressed by the application developer (see the sketch below):
  - Depends on the application itself
  - Depends on the programming language or paradigm
  - Depends on the programmer
- Discovered by the compiler:
  - Automatic (static) parallelization
- Exploited by the runtime:
  - Task scheduling
- Dynamically discovered/exploited by hardware or software:
  - Speculative hardware/software threading
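As a concrete illustration of the first bullet, here is a minimal C++ sketch of thread parallelism expressed explicitly by the application developer. The chunked partitioning, the function names, and the reduction pattern are illustrative assumptions, not taken from the slides:

```cpp
// A minimal sketch of developer-expressed thread parallelism:
// the programmer explicitly partitions a reduction across threads.
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Each worker sums its own contiguous chunk of the input.
void partial_sum_chunk(const std::vector<double>& in, double* out,
                       std::size_t begin, std::size_t end) {
    *out = std::accumulate(in.begin() + begin, in.begin() + end, 0.0);
}

double parallel_sum(const std::vector<double>& in, unsigned nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    std::size_t chunk = in.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == nthreads) ? in.size() : begin + chunk;
        workers.emplace_back(partial_sum_chunk, std::cref(in),
                             &partial[t], begin, end);
    }
    for (auto& w : workers) w.join();          // wait for all chunks
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

The point of the slide stands: whether such parallelism exists at all depends on the application, and whether it gets expressed depends on the language and the programmer.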

Direction of (single-chip) architecture: betting on the success of parallelism
- If (future) applications are intrinsically parallel: as many simple cores as possible (SSC: Sea of Simple Cores)
- If (future) applications are only moderately parallel: a few complex, state-of-the-art superscalar cores (FCC: Few Complex Cores)

SSC: Sea of Simple Cores

FCC: Few Complex Cores
[figure: a few 4-way out-of-order superscalar cores around a shared L3 cache]

Common architectural design issues

Instruction set architecture
- A single ISA?
  - An extension of "conventional" multiprocessors: shared or distributed memory?
- Heterogeneous ISAs:
  - A la Cell? (master processor + slave processors) x N
  - A la SoC? Specialized coprocessors
- A radically new architecture? Which one?

Hardware accelerators?
- SIMD extensions: seem to be accepted; they shift the burden to application developers and compilers (see the sketch below)
- Reconfigurable datapaths: popular when you have a well-defined, intrinsically parallel application
- Vector extensions: might be the right move when targeting essentially scientific computing
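To illustrate the "burden" point for SIMD extensions, here is a hedged C++ sketch using the x86 SSE intrinsics; the kernel is an invented example, and compilers may or may not auto-vectorize the equivalent scalar loop:

```cpp
// Illustrative only: hand-vectorized array addition with x86 SSE
// intrinsics. When the compiler cannot do this automatically, the
// developer must, which is the burden the slide refers to.
#include <immintrin.h>
#include <cstddef>

void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {             // 4 floats per 128-bit register
        __m128 va = _mm_loadu_ps(a + i);     // unaligned load
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)                       // scalar tail
        c[i] = a[i] + b[i];
}
```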

On-chip memory, processors, and memory bandwidth
- The uniprocessor credo was: "use the remaining silicon for caches"
- The new issue: an extra processor, or more cache? (see the toy model below)
  - Extra processing power = increased memory bandwidth demand, increased power consumption, more temperature hot spots
  - Extra cache = decreased (external) memory demand
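The tradeoff can be sketched with a toy back-of-the-envelope model. All the numbers and the assumed 20% miss-rate reduction below are invented for illustration; the slide itself gives no figures:

```cpp
// Toy model of the "extra core vs. extra cache" tradeoff.
// All parameters are invented; a real study would use simulation,
// not this crude arithmetic.
#include <cstdio>

int main() {
    double cores = 4.0;             // baseline core count
    double miss_rate = 0.05;        // baseline cache miss rate
    double traffic_per_core = 1.0;  // normalized memory traffic per core

    // Option A: spend the silicon on one more core.
    double traffic_extra_core = (cores + 1) * traffic_per_core * miss_rate;

    // Option B: spend it on cache; assume (optimistically) that the
    // larger cache cuts the miss rate by 20%.
    double traffic_extra_cache = cores * traffic_per_core * miss_rate * 0.8;

    std::printf("off-chip traffic, extra core : %.3f\n", traffic_extra_core);
    std::printf("off-chip traffic, extra cache: %.3f\n", traffic_extra_cache);
}
```

Even this crude model shows the asymmetry the slide points at: an extra core raises off-chip bandwidth demand, while extra cache lowers it.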

Memory hierarchy organization?

Flat: sharing a big L2/L3 cache?
[figure: many cores, each with a small private cache, all sharing one large L3 cache]

Flat: communication issues? Through the big cache?
[figure: the same flat organization; cores communicate through the shared L3 cache]

Flat: communication issues? Grid-like?
[figure: the same flat organization with a grid-like interconnect between the cores]

Hierarchical organization?
[figure: clusters of cores with private caches share an L2 cache; the L2 clusters share an L3 cache]

Hierarchical organization?
- Arbitration at all levels
- Coherency at all levels
- Interleaving at all levels
- Bandwidth dimensioning

NoC structure
- Very dependent on the memory hierarchy organization!
- Must also accommodate:
  - shared coprocessors/hardware accelerators
  - I/O buses (I/O processors?)
  - the memory interface
  - the network interface

Example
[figure: a hierarchical organization with L2 and L3 caches, plus the memory interface and I/O attached to the NoC]

Multithreading?
- An extra level of thread parallelism!
- Might be an interesting alternative to prefetching for massively parallel applications

Power and thermal issues
- Voltage/frequency scaling to adapt to the workload? (see the sketch below)
- Adapting the workload to the available power?
- Adapting/dimensioning the architecture to the power budget
- Activity migration to manage temperature?
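As one concrete instance of adapting frequency to the workload, here is a minimal C++ sketch of software-driven frequency scaling through Linux's cpufreq sysfs interface. It is Linux-specific, requires root privileges, the files exposed depend on the cpufreq driver in use, and the 1 GHz cap is an arbitrary illustration:

```cpp
// Minimal sketch: read cpu0's current frequency and cap its maximum
// frequency via the Linux cpufreq sysfs interface (values are in kHz).
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::string base = "/sys/devices/system/cpu/cpu0/cpufreq/";

    std::ifstream cur(base + "scaling_cur_freq");
    std::string khz;
    if (cur >> khz)
        std::cout << "cpu0 current frequency: " << khz << " kHz\n";

    // Shed power by lowering the allowed maximum to 1 GHz (1000000 kHz).
    std::ofstream cap(base + "scaling_max_freq");
    if (cap)
        cap << 1000000;
}
```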

General issues for software/compiler
- Parallelism detection and partitioning: finding the correct granularity (see the sketch below)
- Managing memory bandwidth
- Coping with non-uniform memory latency
- Optimizing sequential code portions
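On the granularity point, here is a small C++/OpenMP sketch (compile with -fopenmp) showing how chunk size exposes granularity to the runtime; the chunk value of 1024 is an arbitrary illustration:

```cpp
// Illustrative only: task granularity as exposed to an OpenMP runtime.
#include <omp.h>

void scale(float* a, int n, float k) {
    // schedule(dynamic, 1024): hand out 1024-iteration chunks on demand.
    // Too small a chunk and scheduling overhead dominates;
    // too large and load imbalance hurts on irregular work.
    #pragma omp parallel for schedule(dynamic, 1024)
    for (int i = 0; i < n; ++i)
        a[i] *= k;
}
```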

SSC design specificities

Basic core granularity
- RISC cores
- VLIW cores
- In-order superscalar cores

Homogeneous vs. heterogeneous ISAs
- Core specialization:
  - RISC + VLIW or DSP slaves?
  - A master core + a set of special-purpose cores?

The sharing issue
- Simple cores mean lots of duplication, and lots of resources sit unused at any given time
- Adjacent cores can share:
  - caches
  - functional units: FP, mult/div, multimedia, ...
  - hardware accelerators

An example of sharing
[figure: two clusters of four cores; within each cluster the cores share an FP unit, an instruction-fetch front-end, an IL1 cache and a DL1 cache; the two clusters share a hardware accelerator and the L2 cache]

Multithreading/prefetching
- Multithreading: is the extra complexity worthwhile for simple cores?
- Prefetching:
  - Is it worthwhile? (see the sketch below)
  - Should prefetch engines be shared?
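On the "is it worthwhile?" question, here is a minimal C++ sketch of software prefetching using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 64 elements is an arbitrary, machine-dependent choice:

```cpp
// Illustrative only: software prefetching ahead of a streaming read.
// Whether this pays off on a simple in-order core is exactly the
// question the slide raises.
#include <cstddef>

float sum_with_prefetch(const float* a, std::size_t n) {
    float s = 0.0f;
    const std::size_t dist = 64;   // prefetch distance, tuned per machine
    for (std::size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            __builtin_prefetch(a + i + dist, /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```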

A vision of an SSC (my own vision)

SSC: the basic brick
[figure: four clusters of four cores; each cluster shares an FP unit, an I-cache and a D-cache; the four clusters share an L2 cache]

[figure: a full SSC chip: four basic bricks, each made of four 4-core clusters sharing an L2 cache, around an L3 cache, plus the memory interface, the network interface and the system interface]

FCC design specificities

Only limited thread parallelism available?
- Focus on uniprocessor architecture:
  - Find the correct tradeoff between complexity and performance
  - Power and temperature issues
- Vector extensions?
  - Contiguous vectors (a la SSE)?
  - Strided vectors in the L2 cache (Tarantula-like)?

Performance enablers
- SMT for parallel workloads?
- Helper threads?
- Run-ahead threads
- Hardware support for speculative multithreading

An intermediate design?
- SSCs:
  - Shine on massively parallel applications
  - Poor/limited performance on sequential sections
- FCCs:
  - Moderate performance on parallel applications
  - Good performance on sequential sections

Amdahl's law: mix FCC and SSC (see the formulas below)
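The argument for mixing core types can be made explicit with Amdahl's law. The first formula is the classic law; the asymmetric variant follows Hill and Marty's well-known multicore model, where one large core of sequential performance perf(r) is built from r simple-core equivalents out of a budget of n (perf(r) is a modeling assumption, not something from the slides):

```latex
% Classic Amdahl's law: parallel fraction f on n cores.
\[
  \mathrm{Speedup}(f, n) \;=\; \frac{1}{(1 - f) + \dfrac{f}{n}}
\]
% Asymmetric multicore (Hill & Marty): sequential code runs on the
% big core at perf(r); parallel code uses the big core plus the
% n - r remaining simple cores.
\[
  \mathrm{Speedup}(f, n, r) \;=\;
  \frac{1}{\dfrac{1 - f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}}
\]
```

With f close to 1, the SSC wins because a large n dominates; with a modest f, the FCC wins because perf(r) dominates; a mix hedges between the two.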

The basic brick
[figure: two 4-core SSC-style clusters, each with a shared FP unit, I-cache and D-cache, plus one "ultimate" out-of-order superscalar core, all sharing an L2 cache]

[figure: the full intermediate chip: four such bricks, each combining simple-core clusters and an "ultimate" out-of-order core around an L2 cache, sharing an L3 cache, the memory interface, the network interface and the system interface]

Conclusion
- The era of the uniprocessor has come to an end
- There is no single clear trend to follow
- It might be time for more architectural diversity