Multicore Architectures
Michael Gerndt

Development of Microprocessors
Transistor capacity doubles every 18 months. (© Intel)

Development of Microprocessors
Moore's Law is estimated to hold for at least the next 10 years. But: transistor count ≠ compute power. How to use the transistor resources?
- Better execution core
  - enhanced pipelining, superscalarity, ...
  - better vector processing (SIMD, like MMX/SSE)
  - problem: gap to memory speed
- Larger caches
  - improve memory access speed
- More execution cores
  - problem: gap to memory speed
- ...
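
As an illustration of SIMD vector processing (not from the original slides), here is a minimal SSE sketch in C; it assumes an x86 compiler with <xmmintrin.h>, 16-byte-aligned arrays, and a length divisible by 4:

```c
#include <xmmintrin.h>  /* SSE intrinsics (x86) */

/* a[i] += b[i], four floats per instruction. */
void vec_add(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);          /* load 4 floats */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&a[i], _mm_add_ps(va, vb)); /* 4 adds at once */
    }
}
```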

Development of Microprocessors
Objective for manufacturers: as much profit as possible, i.e. sell processors. Customers only buy when their applications run faster, so CPU power must increase.
How to increase CPU power:
- Higher clock rate
- More parallelism
  - Instruction Level Parallelism (ILP)
  - Thread Level Parallelism (TLP)

Development of Microprocessors
Higher clock rates
- increase power consumption
  - proportional to f and U²
  - higher frequency needs higher voltage
  - small structures: energy loss through leakage
- increase heat output and cooling requirements
- limit chip size (speed of light)
- at a fixed technology (e.g. 60 nm)
  - fewer transistor levels per pipeline stage are possible
  - more, simplified pipeline stages (P4: >30 stages)
  - higher penalty for pipeline stalls (on conflicts, e.g. branch misprediction)
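
The relation behind the first bullet, written out (a standard CMOS approximation, not from the slides):

```latex
P_{\mathrm{dyn}} \approx \alpha \, C \, U^2 \, f
```

where α is the switching activity and C the switched capacitance. Since sustaining a higher f requires a roughly proportional U, dynamic power grows roughly as f³, which is why pure frequency scaling ran into the power wall.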

Development of Microprocessors
More parallelism
- Increased bit width (now: 64-bit architectures)
  - SIMD
- Instruction Level Parallelism (ILP)
  - exploits parallelism found in an instruction stream
  - limited by data/control dependencies
  - can be increased by speculation
  - average ILP in typical programs: 6-7
  - modern superscalar processors cannot get much beyond this
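
A hypothetical C fragment (not from the slides) showing how data dependencies cap ILP:

```c
/* Dependent chain: each multiply needs the previous result,
   so the four multiplies must issue one after another. */
double chain(double x) {
    double a = x * x;
    double b = a * a;
    double c = b * b;
    return c * c;
}

/* Independent operations: a superscalar core can issue the
   four multiplies in parallel. */
double wide(double x, double y, double z, double w) {
    return x * x + y * y + z * z + w * w;
}
```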

Development of Microprocessors
More parallelism: Thread Level Parallelism (TLP)
- Hardware multithreading (e.g. SMT: Hyper-Threading)
  - better exploitation of the superscalar execution units
- Multiple cores
  - legacy software must be parallelized
  - a challenge for the whole software industry
  - Intel moved into the tools business
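
A minimal sketch of thread-level parallelism in C with OpenMP (an illustration, not part of the original slides; compile with -fopenmp or equivalent):

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    /* Iterations are split across the cores; the reduction
       clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 10000000; i++)
        sum += 1.0 / i;
    printf("sum = %f, threads available: %d\n", sum, omp_get_max_threads());
    return 0;
}
```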

Multicore Architectures
SMPs on a single chip: Chip Multi-Processors (CMP)
Advantages
- efficient exploitation of the available transistor budget
- improves throughput and the speed of parallelized applications
- allows tight coupling of cores
  - better communication between cores than in an SMP
  - shared caches
- low power consumption
  - low clock rates
  - idle cores can be suspended
Disadvantages
- only improves the speed of parallelized applications
- increased gap to memory speed

Multicore Architectures
Design decisions
- homogeneous vs. heterogeneous
  - specialized accelerator cores: SIMD, GPU operations, cryptography, DSP functions (e.g. FFT), FPGA (programmable circuits)
- access to memory
  - own memory area (distributed memory)
  - via cache hierarchy (shared memory)
- connection of cores
  - internal bus / crossbar connection
- cache architecture

Multicore Architectures: Examples
[Figure: homogeneous architecture with private L1/L2, shared L3 cache, and a crossbar to memory modules and I/O]
[Figure: heterogeneous architecture with caches, per-core local stores, and a ring bus to memory and I/O]

Shared Cache Design
[Figure: traditional design with multiple single-core processors and the shared cache off-chip, vs. a multicore architecture with shared caches on-chip]

Shared Cache Design
[Figure: multicore architecture with the shared L2 cache on-chip beneath the per-core L1 caches]

Shared Caches: Advantages
- no coherence protocol needed at the shared cache level
- lower communication latency
- processors with overlapping working sets
  - one processor may prefetch data for the other
  - smaller cache size needed
  - better usage of loaded cache lines before eviction (spatial locality)
  - less congestion on the limited memory connection
- dynamic sharing: if one processor needs less space, the other can use more
- avoidance of false sharing
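
False sharing, which a shared cache avoids by construction, arises when threads on different cores write distinct variables that happen to sit in the same cache line. A hypothetical C illustration (the 64-byte line size and the names are assumptions):

```c
/* Packed: c0 and c1 share one cache line. With private caches,
   two cores updating them concurrently make the line ping-pong
   between the caches, although no data is actually shared. */
struct counters_packed {
    long c0;
    long c1;
};

/* Padded: each counter gets its own 64-byte line, so updates
   from different cores no longer interfere. */
struct counters_padded {
    long c0;
    char pad[64 - sizeof(long)];
    long c1;
};
```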

Shared Caches: Disadvantages
- multiple CPUs place higher requirements on the cache: higher bandwidth needed
- the cache should be larger (and a larger cache has a higher latency)
- hit latency is higher due to the switch logic above the cache
- more complex design
- one CPU can evict the data of the other CPU

Multicore Processors
SUN UltraSPARC IV / IV+
- dual core
- 2x multithreaded per core
UltraSPARC T1 (Niagara)
- 8 cores
- 4x multithreaded per core
- one FPU for all cores
- low power
UltraSPARC T2 (Niagara 2)

Intel Itanium 2 Dual Core (Montecito)
- two Itanium 2 cores
- multithreading (2 threads)
  - simultaneous multithreading for memory hierarchy resources
  - temporal multithreading for core resources
  - besides the end of a time slice, an event, typically an L3 cache miss, can trigger a thread switch
- caches (private to each core)
  - L1D 16 KB, L1I 16 KB
  - L2D 256 KB, L2I 1 MB
  - L3 9 MB
- 1.7 billion transistors

Itanium 2 Dual Core

Intel Core Duo
- 2 mobile-optimized execution cores
- no multithreading
- cache hierarchy
  - private 32 KB L1I and L1D
  - shared 2 MB L2 cache: provides efficient data sharing between both cores
- power reduction
  - some sleep states are entered individually by each core
  - Deeper Sleep and Enhanced Deeper Sleep states apply only to the whole die
  - Dynamic Cache Sizing feature: flushes the entire cache, which enables Enhanced Deeper Sleep at a lower voltage that does not guarantee cache integrity
- 151 million transistors

IBM Cell
- IBM, Sony, Toshiba
- PlayStation 3 (Q1 2006)
- 256 GFlops
- at 3 GHz only ~30 W; the whole PS3 only $

Cell: Architecture
- 9 parallel processors, specialized for different tasks
- 1 large PPE
- 8 SPEs (Synergistic Processing Elements)

Cell: SPE (Synergistic Processing Element)
- 128 registers, 128-bit SIMD
- single-threaded
- 256 KB local memory, not a cache
  - DMA engines execute the memory transfers
- simple ISA: less functionality to save chip space
  - the limitations can become a problem if memory access is too slow
- 25.6 GFlops single precision for multiply-add operations
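
A hedged sketch of how an SPE program pulls data into its local store by explicit DMA, assuming the Cell SDK's spu_mfcio.h interface (buffer name, size, and tag are illustrative):

```c
#include <spu_mfcio.h>  /* MFC DMA intrinsics from the Cell SDK */

#define TAG 1
/* 16 KB buffer in the 256 KB local store, aligned for DMA. */
static volatile float buf[4096] __attribute__((aligned(128)));

/* Fetch 16 KB from main memory at effective address ea.
   Unlike a cache, nothing happens implicitly: the transfer
   is programmed and its completion is awaited explicitly. */
void fetch(unsigned long long ea) {
    mfc_get(buf, ea, sizeof(buf), TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);  /* select our DMA tag */
    mfc_read_tag_status_all();     /* block until it completes */
}
```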

Intel Westmere-EX
Processor of the fat node at LRZ
- 2.4 GHz
- 9.6 Gflop/s per core, 96 Gflop/s per socket
- 10 hyperthreaded cores, i.e. two logical cores each
- caches
  - 32 KB L1, private
  - 256 KB L2, private
  - 30 MB L3, shared
- 2.9 billion transistors
- Xeon E (2.4 GHz, 10 cores, 30 MB L3)
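
The peak numbers are mutually consistent if each core retires 4 double-precision flops per cycle (e.g. a 2-wide SSE add plus a 2-wide SSE multiply per cycle; this per-cycle figure is an assumption, not stated on the slide):

```latex
2.4\,\mathrm{GHz} \times 4\,\tfrac{\mathrm{flops}}{\mathrm{cycle}} = 9.6\ \mathrm{Gflop/s\ per\ core},
\qquad 10 \times 9.6\ \mathrm{Gflop/s} = 96\ \mathrm{Gflop/s\ per\ socket}.
```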

NUMA
On-chip NUMA
- L3 cache organized in 10 slices
- interconnected via a bidirectional ring bus
- 10-way physical address hashing to avoid hot spots; can handle five parallel cache requests per clock cycle
- the mapping algorithm is not known; no migration support
Off-chip NUMA
- glueless combination of up to 8 sockets into an SMP
- 4 QuickPath Interconnect (QPI) interfaces
- 2 on-chip memory controllers
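
On such a multi-socket NUMA machine, where memory pages end up matters. A common first-touch initialization pattern in C with OpenMP (an illustration, not from the slides; it assumes the OS places each page on the socket of the thread that first touches it, as Linux does by default):

```c
#include <omp.h>
#include <stdlib.h>

int main(void) {
    long n = 1L << 27;                     /* 1 GiB of doubles */
    double *a = malloc(n * sizeof(double));
    if (!a) return 1;
    /* Touch the array in parallel: each thread's pages are
       allocated on its own socket, spreading the data over
       all memory controllers instead of just one. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;
    /* Later loops with the same schedule(static) then access
       mostly socket-local memory. */
    free(a);
    return 0;
}
```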

Cache Coherency
Cbox
- connects a core to the ring bus and to one cache bank
- responsible for processor reads/writes/writebacks and external snoops, and for returning cached data to the core and to QuickPath agents
- the distribution of physical addresses over the banks is determined by a hash function
Sbox
- caching agent
- each Sbox is associated with 5 Cboxes

Cache Coherency
Bbox
- home agent
- responsible for the cache coherency of the cache lines in its memory
- keeps track of the Cbox replies to coherence messages
Directory Assisted Snoopy (DAS)
- keeps state per cache line (I: idle, no remote sharers; R: may be present on a remote socket; E/D: owned by the I/O hub)
- if a line is in the I state, it can be forwarded without waiting for snoop replies
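
A toy rendering of the DAS shortcut in C (a schematic model for intuition only, not the hardware's actual logic):

```c
/* Per-cache-line directory state kept by the home agent (Bbox). */
enum das_state {
    DAS_I,   /* idle: no remote sharers */
    DAS_R,   /* may be present on a remote socket */
    DAS_ED   /* E/D: owned by the I/O hub */
};

/* In state I no remote socket can hold the line, so the home
   agent may return the data immediately; in the other states it
   must first collect snoop replies. */
int forward_without_snoops(enum das_state s) {
    return s == DAS_I;
}
```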

Summary
- high frequency -> high power consumption
- trend towards multiple cores on a chip
- broad spectrum of designs: homogeneous, heterogeneous, specialized, general purpose, number of cores, cache architectures, local memories, simultaneous multithreading, ...
- problem: memory latency and bandwidth