Hardware Multithreading


Hardware Multithreading COMP25212

…from Wednesday What are the differences between software multithreading and hardware multithreading? Software: OS support for several concurrent threads Large number of threads (effectively unlimited) ‘Heavy’ context switching Hardware: CPU support for several instruction flows Limited number of threads (typically 2 or 4) ‘Light’/’Immediate’ context switching

…from Wednesday Describe thrashing in the context of multithreading Two threads are accessing independent regions of memory which occupy the same cache lines and keep evicting each other’s data Why is it a problem? Both threads will have a high cache miss rate, which will slow their execution down a lot Describe coarse-grain multithreading Threads are switched upon ‘expensive’ operations Describe fine-grain multithreading Threads are switched every single cycle among the ‘ready’ threads
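The thrashing scenario above can be illustrated with a toy simulation. This is a minimal sketch, not from the slides: a direct-mapped cache model (the sizes and names are assumptions) in which two interleaved threads access different memory regions that happen to map to the same cache line, so every access evicts the other thread's data.

```python
# Toy direct-mapped cache: two threads' working sets collide on the
# same cache line, so a fine-grain interleaved access stream thrashes.

NUM_LINES = 8      # hypothetical cache with 8 lines
LINE_SIZE = 64     # 64-byte cache lines

def line_index(addr):
    return (addr // LINE_SIZE) % NUM_LINES

def simulate(accesses):
    """Count misses for an interleaved stream of byte addresses."""
    cache = {}          # line index -> block address currently resident
    misses = 0
    for addr in accesses:
        idx, tag = line_index(addr), addr // LINE_SIZE
        if cache.get(idx) != tag:
            misses += 1
            cache[idx] = tag   # evict whatever was there
    return misses

# Thread A reads block 0, thread B reads block 8: independent memory
# regions, but 8 % NUM_LINES == 0, so they share a cache line.
thrashing = simulate([0, 8 * LINE_SIZE] * 10)   # A, B, A, B, ...
friendly  = simulate([0, 1 * LINE_SIZE] * 10)   # different lines
print(thrashing, friendly)
```

With the colliding stream every single access misses (20 misses); with non-colliding blocks only the two cold misses remain.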

Simultaneous Multi-Threading

Simultaneous Multi-Threading The main idea is to exploit instruction-level parallelism and thread-level parallelism at the same time In a superscalar processor, issue instructions from different threads in the same cycle Schedule as many ‘ready’ instructions as possible Operand reading and result saving becomes much more complex Note that coarse-grain and fine-grain MT can also be implemented in superscalar processors
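The issue policy described above can be sketched as a toy model. This is an assumption-laden simplification (every instruction is always ‘ready’, threads are served round-robin), not real SMT hardware: each cycle, up to the issue width of instructions is drawn from all threads rather than from just one.

```python
# Toy SMT issue stage: fill up to `width` issue slots per cycle,
# taking instructions round-robin from every thread with work left.

from collections import deque

def smt_issue(threads, width):
    """threads: list of deques of instruction names.
    Returns the issue schedule as a list of per-cycle slot lists."""
    schedule = []
    while any(threads):
        slot, progress = [], True
        while len(slot) < width and progress:
            progress = False
            for q in threads:               # one instruction per thread per pass
                if q and len(slot) < width:
                    slot.append(q.popleft())
                    progress = True
        schedule.append(slot)
    return schedule

t0 = deque(["a", "b", "c", "d", "e"])       # thread 0's instructions
t1 = deque(["M", "N", "P", "Q", "R"])       # thread 1's instructions
schedule = smt_issue([t0, t1], width=4)
print(schedule)
```

With a 4-wide issue stage the ten instructions complete in three cycles instead of the five a single thread would need, which is the whole point of mixing threads in one issue window.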

Simultaneous MultiThreading Let’s look simply at instruction issue: [pipeline diagram: over cycles 1–10, instructions a–e from one thread and M, N, P, Q, R from another proceed through the IF, ID, EX, MEM and WB stages, with instructions from both threads issued in the same cycle]

Simultaneous Multithreading We want to run these two threads Issue as many ready instructions as possible

SMT issues with in-order processors Asymmetric pipeline stall One part of the pipeline stalls – we want the other part to continue Overtaking – non-stalled threads should progress What happens if a ready thread misses in the cache? Abort the instruction (and the instructions in its shadow, if it is a D-cache miss) Most existing implementations are for out-of-order, register-renamed architectures (akin to Tomasulo) e.g. PowerPC, Intel Hyperthreading

Simultaneous Multi Threading Extracts the most parallelism from instructions and threads Implemented mostly in out-of-order processors because they are the only ones able to exploit that much parallelism Has a significant hardware overhead Replicate (and MUX) thread state (registers, TLBs, etc.) Operand reading and result saving increases datapath complexity Per-thread instruction handling/scheduling engine in out-of-order implementations

Hardware Multithreading Summary

Benefits of HW MT Multithreading techniques improve the utilisation of processor resources and, hence, the overall performance If the different threads are accessing the same input data they may be using the same regions of memory, so cache efficiency improves in these cases

Disadvantages of HW MT Single-thread performance may be degraded when compared to a single-thread CPU Multiple threads interfere with each other Shared caches mean that, effectively, each thread uses only a fraction of the whole cache Thrashing may exacerbate this issue Thread scheduling at the hardware level adds high complexity to processor design Thread state, managing priorities, OS-level information, …

Multithreading Summary A cost-effective way of finding additional parallelism for the CPU pipeline Available in x86, Itanium, Power and SPARC Intel Hyperthreading (SMT) PowerPC uses SMT UltraSPARC T1/T2 used fine-grain, later models used SMT SPARC64 VI used coarse-grain, later models moved to SMT Each additional hardware thread is presented to the Operating System as an additional virtual CPU A multiprocessor OS is required

Multithreading in 4-way superscalar

Some Advanced Uses of Multithreading

Speculative Execution When reaching a conditional branch we could spawn 2 threads One runs the true path, the other runs the false path Once we know which one is correct, kill the other thread The effects of control hazards are alleviated Supported by current OoO CPUs But not as a fully-fledged thread Can reach several levels of nested conditions Requires memory support (e.g. reorder buffers)
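A software analogy of the idea above can be sketched as follows. This is only an illustration under loose assumptions (the path functions and names are invented; real hardware speculation discards in-flight instructions, not Python futures): both sides of a branch are started eagerly, and once the condition resolves one result is kept and the other discarded.

```python
# Eager execution analogy: run both branch paths, keep the winner.

from concurrent.futures import ThreadPoolExecutor

def true_path(x):
    return x * 2            # hypothetical work on the taken path

def false_path(x):
    return x + 100          # hypothetical work on the not-taken path

def eager_branch(cond, x):
    with ThreadPoolExecutor(max_workers=2) as pool:
        taken = pool.submit(true_path, x)
        not_taken = pool.submit(false_path, x)
        # Once the condition is known, keep one result and simply
        # ignore ("kill") the speculative other path's result.
        winner = taken if cond else not_taken
        return winner.result()

print(eager_branch(True, 21))
```

Note the caveat from the slide: the discarded path still consumed resources, which is why hardware needs buffering (reorder buffers) to undo its effects cheaply.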

Memory Prefetching Compile applications into two threads One runs the whole application The other thread (the scout thread) contains only the memory accesses The scout thread runs ahead and fetches memory in advance Ensures data will be in the cache when the original thread needs it, so the cache hit rate increases Synchronization is needed The scout has to run far enough ahead that the memory delay is hidden … but not too far ahead, so that it does not evict useful data from the cache Beware thrashing!!!
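The scout-thread behaviour can be modelled with a small sketch. This is an idealised assumption-based model (single access stream, unbounded cache set, so the eviction/thrashing caveat above is deliberately ignored): the scout simply touches the address the main thread will need a fixed distance ahead.

```python
# Toy scout-thread prefetcher: the scout runs `distance` accesses
# ahead of the main thread and pulls those addresses into the cache.

def run(accesses, distance):
    """Return (hits, misses) observed by the main thread."""
    cache = set()
    hits = misses = 0
    for i, addr in enumerate(accesses):
        # Scout thread: prefetch the access `distance` steps ahead.
        if i + distance < len(accesses):
            cache.add(accesses[i + distance])
        # Main thread: normal load.
        if addr in cache:
            hits += 1
        else:
            misses += 1
            cache.add(addr)
    return hits, misses

print(run(list(range(10)), distance=2))   # only the first 2 accesses miss
```

With a lookahead of 2, only the accesses the scout never had time to cover miss; everything else hits, which is exactly the synchronization trade-off the slide describes.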

Slipstreaming Compile sequential applications into two threads One runs the application itself The slipstream thread contains only the critical path of the application The slipstream thread runs ahead and passes results back The delay of slow operations (e.g. floating-point division) is improved Synchronization and communication between the threads is needed Requires extra hardware to deal with this ‘special’ behaviour Could be used in multicore as well

Questions

Multithreading Example We want to execute 2 programs with 100 instructions each. The first program suffers an i-cache miss at instruction #31, and the second program another at instruction #71. Assume that: + There is enough parallelism to execute all instructions independently (no hazards, apart from the two cache misses highlighted) + Switching threads can be done instantaneously + A cache miss requires 20 cycles to get the instruction to the cache + The two programs do not interfere with each other’s cache lines Calculate the execution time observed by each of the programs (cycles elapsed between the execution of the first and the last instruction of that application) and the total time to execute the workload a) Sequentially (no multithreading) b) With coarse-grain multithreading c) With fine-grain multithreading d) With 2-way simultaneous multithreading
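As a quick sanity check for part (a) only (the sequential case, which follows directly from the stated assumptions of 1 instruction per cycle plus a 20-cycle stall per miss), the arithmetic can be written out; parts (b)–(d) are left as the exercise intends.

```python
# Part (a): sequential execution, no multithreading.
# Assumptions from the exercise: 1 instruction per cycle, and each
# i-cache miss stalls the program for 20 extra cycles.

MISS_PENALTY = 20

def sequential_time(num_instructions, num_misses):
    return num_instructions + num_misses * MISS_PENALTY

prog1 = sequential_time(100, 1)   # observed time of program 1
prog2 = sequential_time(100, 1)   # observed time of program 2
total = prog1 + prog2             # programs run back to back
print(prog1, prog2, total)
```

Each program observes 120 cycles, and the sequential workload takes 240 cycles in total; the multithreaded variants should overlap one program's miss latency with the other program's useful work.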

Superscalar Example (from pipeline-2) Consider the following program, which implements R = A^2 + B^2 + C^2 + D^2

  LD  r1, A
  MUL r2, r1, r1    -- A^2
  LD  r3, B
  MUL r4, r3, r3    -- B^2
  ADD r11, r2, r4   -- A^2 + B^2
  LD  r5, C
  MUL r6, r5, r5    -- C^2
  LD  r7, D
  MUL r8, r7, r7    -- D^2
  ADD r12, r6, r8   -- C^2 + D^2
  ADD r21, r11, r12 -- A^2 + B^2 + C^2 + D^2
  ST  r21, R

The code as written is not really suitable for a superscalar pipeline because of its low instruction-level parallelism Draw the dependency graph of the application Based on the graph, discuss the suitability of the code to be run on a 2-way superscalar Simulate the execution of the original and the reordered code in a 5-stage 2-way superscalar pipeline
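The dependency graph asked for above can be derived mechanically. This sketch (the encoding of the program as tuples is my own) computes the RAW (read-after-write) dependences: each instruction depends on the latest earlier instruction that wrote a register it reads.

```python
# RAW dependency graph of the example program.
# Each entry: (text, destination register or None, source registers).

program = [
    ("LD r1, A",          "r1",  []),
    ("MUL r2, r1, r1",    "r2",  ["r1"]),
    ("LD r3, B",          "r3",  []),
    ("MUL r4, r3, r3",    "r4",  ["r3"]),
    ("ADD r11, r2, r4",   "r11", ["r2", "r4"]),
    ("LD r5, C",          "r5",  []),
    ("MUL r6, r5, r5",    "r6",  ["r5"]),
    ("LD r7, D",          "r7",  []),
    ("MUL r8, r7, r7",    "r8",  ["r7"]),
    ("ADD r12, r6, r8",   "r12", ["r6", "r8"]),
    ("ADD r21, r11, r12", "r21", ["r11", "r12"]),
    ("ST r21, R",         None,  ["r21"]),
]

def raw_deps(program):
    writer, deps = {}, {}        # register -> last writer's index
    for i, (_, dst, srcs) in enumerate(program):
        deps[i] = sorted({writer[s] for s in srcs if s in writer})
        if dst:
            writer[dst] = i
    return deps

deps = raw_deps(program)
for i, (text, _, _) in enumerate(program):
    print(f"{text:20} depends on {deps[i]}")
```

The output shows every LD/MUL pair chained back-to-back and the final ADD/ST strictly serialised, which is why reordering (e.g. interleaving independent LD/MUL pairs) is needed before a 2-way superscalar can keep both issue slots busy.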

Questions