Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)

Slides:



Advertisements
Similar presentations
Computer Organization and Architecture
Advertisements

CS136, Advanced Architecture Limits to ILP Simultaneous Multithreading.
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Multithreading processors Adapted from Bhuyan, Patterson, Eggers, probably others.
Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.
/ Computer Architecture and Design Instructor: Dr. Michael Geiger Summer 2014 Lecture 6: Speculation.
Multithreading Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Dec 5, 2005 Topic: Intro to Multiprocessors and Thread-Level Parallelism.
Multithreading and Dataflow Architectures CPSC 321 Andreas Klappenecker.
1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)
7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.
CPE 731 Advanced Computer Architecture Thread Level Parallelism Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of California,
CSE 502 Graduate Computer Architecture Lec 10 –Simultaneous Multithreading Larry Wittie Computer Science, StonyBrook University
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar.
1 Multi-core processors 12/1/09. 2 Multiprocessors inside a single chip It is now possible to implement multiple processors (cores) inside a single chip.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? Multithreaded and multicore processors Marco D. Santambrogio:
COMP Multithreading. Coarse Grain Multithreading Minimal pipeline changes – Need to abort instructions in “shadow” of miss – Resume instruction.
Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.
Hardware Multithreading. Increasing CPU Performance By increasing clock frequency By increasing Instructions per Clock Minimizing memory access impact.
Larry Wittie Computer Science, StonyBrook University and ~lw
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? –a thread is defined as a separate process with its own instructions.
Chapter 3.4: Loop-Level Parallelism and Thread-Level Parallelism
Advanced Computer Architecture pg 1 Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8) Henk Corporaal
Advanced Computer Architecture 5MD00 / 5Z033 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
Ch3. Limits on Instruction-Level Parallelism 1. ILP Limits 2. SMT (Simultaneous Multithreading) ECE562/468 Advanced Computer Architecture Prof. Honggang.
Simultaneous Multithreading CMPE 511 BOĞAZİÇİ UNIVERSITY.
Pentium 4 Deeply pipelined processor supporting multiple issue with speculation and multi-threading 2004 version: 31 clock cycles from fetch to retire,
COMP 740: Computer Architecture and Implementation
Electrical and Computer Engineering
Limits to ILP How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to invent new HW/SW mechanisms to keep on processor.
William Stallings Computer Organization and Architecture 8th Edition
/ Computer Architecture and Design
CPE 731 Advanced Computer Architecture Thread Level Parallelism
Prof. Onur Mutlu Carnegie Mellon University
Simultaneous Multithreading
Multi-core processors
Limits on ILP and Multithreading
5.2 Eleven Advanced Optimizations of Cache Performance
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Chapter 3: ILP and Its Exploitation
/ Computer Architecture and Design
Lec 10 – Example Architectures - Simultaneous Multithreading
David Patterson Electrical Engineering and Computer Sciences
Advanced Computer Architecture 5MD00 / 5Z033 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
John Kubiatowicz Electrical Engineering and Computer Sciences
CSCE 430/830 Computer Architecture Advanced HW Approaches: Speculation
Department of Electrical and Computer Engineering
Hyperthreading Technology
Electrical and Computer Engineering
Advanced Computer Architecture 5MD00 / 5Z033 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
Lecture: SMT, Cache Hierarchies
Computer Architecture: Multithreading (I)
Levels of Parallelism within a Single Processor
Computer Architecture Lecture 4 17th May, 2006
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Limits to ILP Conflicting studies of amount
Lec 9 – Limits to ILP and Simultaneous Multithreading
Hardware Multithreading
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
/ Computer Architecture and Design
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
CSC3050 – Computer Architecture
Levels of Parallelism within a Single Processor
Hardware Multithreading
8 – Simultaneous Multithreading
Lecture 22: Multithreading
Advanced Computer Architecture 5MD00 / 5Z032 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
Chapter 3: Limits of Instruction-Level Parallelism
Presentation transcript:

Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8) Henk Corporaal www.ics.ele.tue.nl/~heco/courses/ECA h.corporaal@tue.nl TUEindhoven 2016-2017

Welcome back Chip Multi-Processors 8.3: Multi-threading: Material: Coarse grain multi-threading Fine-grain multi-threading SMT (simultaneous multi-threading) called hyperthreading by Intel Material: Book chapter 8 8.3: SMT

New Approach: Multi-Threaded Multithreading = multiple threads share the functional units of 1 processor duplicate independent state of each thread e.g., a separate copy of register file, a separate PC HW for fast thread switch; much faster than full process switch  100s to 1000s of clocks When to switch? Next instruction next thread (fine grain), or When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Multithreaded Categories Simultaneous Multithreading Superscalar Fine-Grained Coarse-Grained Multiprocessing Time (processor cycle) Thread 1 Thread 3 Thread 5 Thread 2 Thread 4 Idle slot

Fine-Grained Multithreading Switches between threads on each instruction, causing the execution of multiples threads to be interleaved Usually done in a round-robin fashion, skipping any stalled threads CPU must be able to switch threads every clock Advantage: it can hide both short and long stalls, since instructions from other threads executed when one thread stalls Disadvantage: may slow down execution of individual threads Used in e.g. Sun’s Niagara

Course-Grained Multithreading Switches threads only on costly stalls, such as L2 cache misses Advantages Relieves need to have very fast thread-switching Doesn’t slow down thread, since instructions from other threads issued only when the thread encounters a costly stall Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen New thread must fill pipeline before instructions can complete Because of this start-up overhead, coarse-grained multithreading is better for reducing penalty of high cost stalls, where pipeline refill << stall time Used in e.g. IBM AS/400

Simultaneous Multi-threading ... One thread, 8 units Two threads, 8 units Cycle M M FX FX FP FP BR CC Cycle M M FX FX FP FP BR CC 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Simultaneous Multithreading (SMT) SMT: dynamically scheduled processors already has many HW mechanisms to support multithreading: Large set of virtual registers that can be used to hold the register sets of independent threads Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in datapath without confusing sources and destinations across threads Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW Just adding a per thread renaming table and keeping separate PCs

IBM Power4 Single threaded 8 FUs 4-issue out-of-order

IBM Power5: supports 2 threads 2 commits (architected register sets) 2 fetch (PC), 2 initial decodes

Changes in Power 5 to support SMT Increased associativity of L1 instruction cache and the instruction address translation buffers Added per thread load and store queues Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches Added separate instruction prefetch and buffering per thread Increased the number of virtual registers from 152 to 240 Increased the size of several issue queues The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Power 5 thread performance ... Relative priority of each thread controllable in hardware. For balanced operation, both threads run slower than if they “owned” the machine.

Do you like Multi-Threading? Do we need it? when? Can you compare pros and cons of MT and Multi-Processing ?