Hardware Multithreading

Increasing CPU Performance
– By increasing clock frequency
– By increasing Instructions per Clock:
  – Minimizing memory access impact: data cache
  – Maximising instruction issue rate: branch prediction
  – Maximising instruction issue rate: superscalar execution
  – Maximising pipeline utilisation: avoid instruction dependencies, out-of-order execution
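These levers can be related through the classic performance equation: execution time = instructions × CPI / clock frequency. A small illustrative calculation (all the numbers below are invented, purely to show the trade-offs):

```python
def exec_time(instructions, cpi, clock_hz):
    """Execution time in seconds: instructions * CPI / clock frequency."""
    return instructions * cpi / clock_hz

base = exec_time(1e9, 2.0, 2e9)          # 1G instructions, CPI 2, 2 GHz -> 1.0 s
faster_clock = exec_time(1e9, 2.0, 3e9)  # option 1: raise the clock frequency
better_ipc = exec_time(1e9, 1.25, 2e9)   # option 2: lower CPI (raise IPC) via
                                         # caches, branch prediction, superscalar
print(base, faster_clock, better_ipc)
```

Hardware multithreading, discussed next, attacks the CPI term: it keeps the pipeline busy when one thread alone cannot.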

Increasing Parallelism
The amount of parallelism that we can exploit is limited by the programs:
– Some areas exhibit great parallelism
– Others are essentially sequential
In the latter case, where can we find additional independent instructions? In a different program!

Hardware Multithreading
– Allow multiple threads to share a single processor
– Requires replicating the independent state of each thread
– Virtual memory can be used to share memory among threads
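The per-thread state that must be replicated can be sketched as a simple structure (the field names and sizes below are illustrative, not taken from any particular design):

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    # Architectural state replicated per hardware thread.
    pc: int = 0                                           # program counter
    regs: list = field(default_factory=lambda: [0] * 32)  # register file
    asid: int = 0       # address-space ID selecting this thread's VA mapping
    ready: bool = True  # False while the thread waits on a cache miss

# Caches, execution units and pipeline logic stay shared;
# only the thread contexts are duplicated.
contexts = [ThreadContext(asid=0), ThreadContext(asid=1)]
```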

CPU Support for Multithreading
[Pipeline diagram: the instruction cache, data cache, address translation and the fetch/decode/execute/memory/write-back logic are shared, while the program counters (PC A, PC B), register files (Reg A, Reg B) and virtual-address mappings (VA Mapping A, VA Mapping B) are replicated per thread.]

Hardware Multithreading
Different ways to exploit this new source of parallelism:
– Coarse-grain multithreading
– Fine-grain multithreading
– Simultaneous multithreading

Coarse-Grain Multithreading

– Issue instructions from a single thread
– Operate like a simple pipeline
– Switch threads on an "expensive" operation:
  – E.g. an I-cache miss
  – E.g. a D-cache miss

Switch Threads on I-cache miss

Inst a   IF   ID   EX   MEM  WB
Inst b        IF   ID   EX   MEM  WB
Inst c             IF(MISS)
Inst X                  IF   ID   EX   MEM  WB
Inst Y                       IF   ID   EX   …
Inst Z                            IF   ID   …

Remove Inst c and switch to the other thread. The next thread will continue its execution until there is another I-cache or D-cache miss.

Switch Threads on D-cache miss

Inst a   IF   ID   EX   MEM(MISS)
Inst b        IF   ID   EX             (abort)
Inst c             IF   ID             (abort)
Inst d                  IF             (abort)
Inst X                       IF   ID   EX   MEM  WB
Inst Y                            IF   ID   EX   …

Remove Inst a and switch to the other thread:
– Remove the rest of the instructions from the 'blue' thread
– Roll back the 'blue' PC to point to Inst a

Coarse-Grain Multithreading
– Good to compensate for infrequent but expensive pipeline disruptions
– Minimal pipeline changes:
  – Need to abort all the instructions in the "shadow" of a D-cache miss (overhead)
  – Resume the instruction stream to recover
– Short stalls (data/control hazards) are not solved
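The switch-on-miss policy can be sketched as a toy scheduler (the hit/miss patterns and the 3-cycle switch penalty below are invented; the miss itself is assumed to be serviced in the background while the other thread runs):

```python
import itertools

def coarse_grain(threads, switch_penalty, total_cycles):
    """Toy coarse-grain multithreading: run one thread until it misses in the
    cache, then pay a pipeline-drain penalty and switch to the next thread.
    threads: iterators yielding True (hit) or False (miss) per instruction.
    Returns the number of instructions completed per thread."""
    done = [0] * len(threads)
    current, cycle = 0, 0
    while cycle < total_cycles:
        if next(threads[current]):      # hit: retire one instruction per cycle
            done[current] += 1
            cycle += 1
        else:                           # miss: abort shadow, switch thread
            cycle += switch_penalty
            current = (current + 1) % len(threads)
    return done

# Thread 0 misses on every 4th instruction; thread 1 never misses.
t0 = itertools.cycle([True, True, True, False])
t1 = itertools.cycle([True])
print(coarse_grain([t0, t1], switch_penalty=3, total_cycles=100))  # -> [3, 94]
```

Note how the never-missing thread monopolises the pipeline once scheduled: coarse-grain switching only happens on expensive events, never on short stalls.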

Fine-Grain Multithreading

– Overlap in time the execution of several threads
– Usually using round robin among all the threads in a 'ready' state
– Requires instantaneous thread switching

Fine-Grain Multithreading
Multithreading helps alleviate fine-grain dependencies (e.g. forwarding):

Inst a   IF   ID   EX   MEM  WB
Inst M        IF   ID   EX   MEM  WB
Inst b             IF   ID   EX   MEM  WB
Inst N                  IF   ID   EX   MEM
Inst c                       IF   ID   EX
Inst P                            IF   ID

Instructions a, b, c and M, N, P come from two different threads, interleaved cycle by cycle.

I-cache misses in Fine-Grain Multithreading
An I-cache miss is overcome transparently:

Inst a   IF   ID   EX   MEM  WB
Inst M        IF   ID   EX   MEM  WB
Inst b             IF(MISS)
Inst N                  IF   ID   EX   MEM
Inst P                       IF   ID   EX
Inst Q                            IF   ID

Inst b is removed and its thread is marked as not 'ready'. The 'blue' thread is not ready, so 'orange' instructions are executed instead.

D-cache misses in Fine-Grain Multithreading
Mark the thread as not 'ready' and issue only from the other thread:

Inst a   IF   ID   EX   MEM(MISS)…  WB
Inst M        IF   ID   EX   MEM  WB
Inst b             IF   ID             (removed)
Inst N                  IF   ID   EX   MEM
Inst P                       IF   ID   EX
Inst Q                            IF   ID

The 'blue' thread is marked as not 'ready': Inst b is removed and the PC is updated. While 'blue' is not ready, 'orange' is executed.

D-cache misses in Fine-Grain Multithreading (out of order)
In an out-of-order processor we may continue issuing instructions from both threads:

Inst a   IF   RO   EX   MEM(MISS)……  WB
Inst M        IF   RO   EX   MEM  WB
Inst b             IF   RO   EX   MEM  WB
Inst N                  IF   RO   EX   MEM
Inst c                       IF   RO   EX
Inst P                            IF   RO

Inst a's write-back waits until its miss is serviced, while younger instructions from both threads continue to issue and complete.

Fine-Grain Multithreading
– Improves the utilisation of pipeline resources
– The impact of short stalls is alleviated by executing instructions from other threads
– Single-thread execution is slowed
– Requires an instantaneous thread-switching mechanism
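The round-robin issue among 'ready' threads can be sketched as a toy model (the 20-cycle miss penalty and the hit/miss patterns are invented for illustration):

```python
from itertools import cycle

def fine_grain(threads, total_cycles):
    """Toy fine-grain multithreading: each cycle, issue one instruction from
    the next thread, in round-robin order, that is in the 'ready' state.
    threads: iterators yielding True (hit) or False (D-cache miss)."""
    n = len(threads)
    ready_at = [0] * n          # cycle at which each thread is ready again
    done = [0] * n
    rr = 0
    for now in range(total_cycles):
        for k in range(n):      # scan for the next ready thread
            t = (rr + k) % n
            if ready_at[t] <= now:
                if next(threads[t]):
                    done[t] += 1              # hit: instruction retires
                else:
                    ready_at[t] = now + 20    # miss: not ready for 20 cycles
                rr = t + 1
                break
    return done

# Thread 0 misses on every 4th instruction; thread 1 never misses.
print(fine_grain([cycle([True, True, True, False]), cycle([True])], 12))  # -> [3, 8]
```

While the missing thread is marked not 'ready', every cycle goes to the other thread, so short stalls cost nothing as long as some thread has work.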

Simultaneous Multi-Threading

The main idea is to exploit instruction-level parallelism and thread-level parallelism at the same time:
– In a superscalar processor, issue instructions from different threads in the same cycle
– Instructions from different threads can be using the same stage of the pipeline

Simultaneous MultiThreading Let’s look simply at instruction issue: Inst aIFIDEXMEMWB Inst bIFIDEXMEMWB Inst MIFIDEXMEMWB Inst NIFIDEXMEMWB Inst cIFIDEXMEMWB Inst PIFIDEXMEMWB Inst QIFIDEXMEMWB Inst dIFIDEXMEMWB Inst eIFIDEXMEMWB Inst RIFIDEXMEMWB

SMT issues
– Asymmetric pipeline stall: one part of the pipeline stalls, but we want the other part to continue
– Overtaking: we want the unstalled thread to make progress
– Existing implementations are on out-of-order, register-renamed architectures (similar to Tomasulo's algorithm)

SMT: Glimpse Into The Future?
– Scout threads? A thread to prefetch memory and reduce cache-miss overhead
– Speculative threads? Allow a thread to execute speculatively way past a branch/jump/call/miss/etc.
  – Needs revised out-of-order logic
  – Needs extra memory support
  – See Transactional Memory

Simultaneous Multi-Threading
– Extracts the most parallelism from instructions and threads
– Implemented only in out-of-order processors, because they are the only ones able to exploit that much parallelism
– Has a significant hardware overhead
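The shared issue stage can be sketched as a toy model (the issue width, queue contents and round-robin slot-filling policy below are illustrative, not any real machine's policy):

```python
from collections import deque

def smt_issue(queues, width, cycles):
    """Toy SMT issue: each cycle, fill up to `width` issue slots by drawing
    from all threads' instruction queues in round-robin order.
    Returns one list of (thread, instruction) pairs per cycle."""
    trace = []
    rr = 0
    n = len(queues)
    for _ in range(cycles):
        slots, empty_seen = [], 0
        while len(slots) < width and empty_seen < n:
            t = rr % n
            if queues[t]:                       # thread has a ready instruction
                slots.append((t, queues[t].popleft()))
                empty_seen = 0
            else:                               # stalled/empty thread: skip it
                empty_seen += 1
            rr += 1
        trace.append(slots)
    return trace

# Two threads, issue width 2: slots in a single cycle are shared between threads.
qa = deque(["a1", "a2", "a3"])
qb = deque(["b1", "b2", "b3"])
print(smt_issue([qa, qb], width=2, cycles=3))
```

If one queue is empty (the thread is stalled), the other thread takes all the slots in that cycle, which is exactly the "overtaking" behaviour described above.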

Hardware Multithreading

Benefits of Hardware Multithreading
– All multithreading techniques improve the utilisation of processor resources and, hence, performance
– If the different threads are accessing the same input data, they may be using the same regions of memory
  – Cache efficiency improves in these cases

Disadvantages of Hardware Multithreading
– The perceived performance may be degraded when compared with a single-threaded CPU
  – Multiple threads interfere with each other
– The cache has to be shared among several threads, so effectively each thread sees a smaller cache
– Thread scheduling at the hardware level adds high complexity to processor design
  – Thread state, managing priorities, OS-level information, …

Comparison of Multithreading Techniques

Multithreading Summary
– A cost-effective way of finding additional parallelism for the CPU pipeline
– Available in x86, Itanium, POWER and SPARC (most architectures)
– Each hardware thread is presented to the operating system as an additional CPU
– Operating systems beware!!! (why?)