ECE 252 / CPS 220 Advanced Computer Architecture I, Lecture 16: Multi-threading. Benjamin Lee, Electrical and Computer Engineering, Duke University. www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

ECE252 Administrivia
- 15 November: Homework #4 due; project proposals due
ECE 299, Energy-Efficient Computer Systems (www.duke.edu/~bcl15/class/class_ece299fall10.html)
- Technology, architectures, systems, applications
- Seminar for Spring 2012; the class is paper reading, discussion, and a research project
- In Fall 2010, students read more than 35 research papers
- The 2012 offering will again read research papers and is also considering the textbook “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”

Last Time
Out-of-order Superscalar
- Hardware complexity increases super-linearly with issue width
Very Long Instruction Word (VLIW)
- Compiler explicitly schedules parallel instructions
- Simple hardware, complex compiler
- Later VLIWs added more dynamic interlocks
Compiler Analysis
- Use loop unrolling and software pipelining for loops, trace scheduling for more irregular code
- Static compiler scheduling is difficult in the presence of unpredictable branches and variable memory latency

Multi-threading
Instruction-level Parallelism
- Objective: extract instruction-level parallelism (ILP)
- Difficulty: limited ILP from a sequential thread of control
Multi-threaded Processor
- Processor fetches instructions from multiple threads of control
- If instructions from one thread stall, issue instructions from other threads
Thread-level Parallelism
- Thread-level parallelism (TLP) provides independent threads
- Multi-programming: run multiple, independent, sequential jobs
- Multi-threading: run a single job faster using multiple, parallel threads

Increasing Pipeline Utilization
- In an out-of-order superscalar processor, many applications cannot fully use the execution units
- Consider the percentage of issue cycles in which the processor is busy
- Tullsen, Eggers, and Levy, “Simultaneous Multithreading,” ISCA 1995

Superscalar (In)Efficiency
[Figure: issue slots (issue width 4) over time. Completely idle cycles are vertical waste; partially filled cycles (IPC < 4) are horizontal waste.]

Pipeline Hazards
[Figure: five-stage pipeline (F D X M W) diagram over cycles t0-t14 for the sequence LW r1, 0(r2); LW r5, 12(r1); ADDI r5, r5, #12; SW 12(r1), r5.]
Data Dependencies
- Dependencies exist between instructions
- Example: LW-LW (via r1), LW-ADDI (via r5), ADDI-SW (via r5)
Solutions
- Interlocks stall the pipeline (slow)
- Forwarding paths (require hardware, limited applicability)
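As a rough illustration, here is a minimal sketch (hypothetical encoding, not the lecture's simulator) of the check an interlock performs before issue: stall, or engage a forwarding path, when a source register of the incoming instruction is the destination of an instruction still in flight.

    # Detect a read-after-write (RAW) dependence between an incoming instruction
    # and the instructions still in flight.
    def raw_hazard(instr, in_flight):
        """instr and in_flight entries are (dest_reg, [src_regs]) tuples."""
        _, srcs = instr
        return any(dest is not None and dest in srcs for dest, _ in in_flight)

    # Example from the slide: LW r5, 12(r1) reads r1, which LW r1, 0(r2) writes.
    lw1 = ("r1", ["r2"])
    lw2 = ("r5", ["r1"])
    print(raw_hazard(lw2, [lw1]))   # True -> stall or forward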

Vertical Multithreading
[Figure: issue slots over time with a second thread interleaved cycle-by-cycle; partially filled cycles (IPC < 4) remain as horizontal waste.]
- With cycle-by-cycle interleaving, vertical waste is removed

Multithreading
[Figure: five-stage pipeline (F D X M W) interleaving instructions from threads T1-T4 over cycles through t9: T1: LW r1, 0(r2); T2: ADD r7, r1, r4; T3: XORI r5, r4, #12; T4: SW 0(r7), r5; T1: LW r5, 12(r1).]
- Interleave instructions from multiple threads in the pipeline
- Example: interleave threads (T1, T2, T3, T4) in a 5-stage pipeline
- For any given thread, an earlier instruction writes back (W) before a later instruction reads the register file (D)
- Example: [T1: LW r1, 0(r2)] writes back before [T1: LW r5, 12(r1)] decodes
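A minimal sketch of this timing (an assumed model, not the lecture's): with four threads interleaved round-robin through the five stages, a thread's instruction reaches writeback before that thread's next instruction reaches decode, so same-thread RAW hazards disappear without forwarding or interlocks.

    # Print which thread occupies each stage of a 5-stage pipeline under
    # cycle-by-cycle round-robin interleaving of 4 threads.
    STAGES = ["F", "D", "X", "M", "W"]
    N_THREADS = 4

    def occupancy(n_cycles):
        for cycle in range(n_cycles):
            # Stage s holds the instruction fetched at cycle (cycle - s),
            # which belongs to thread ((cycle - s) mod N_THREADS) + 1.
            row = {STAGES[s]: "T%d" % ((cycle - s) % N_THREADS + 1)
                   for s in range(min(cycle + 1, len(STAGES)))}
            print("cycle", cycle, row)

    occupancy(8)   # T1's first instruction is in W at cycle 4, its next is in D at cycle 5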

CDC 6600 Peripheral Processors
- Seymour Cray (1964) built the first multithreaded hardware architecture
- Pipeline with 100ns clock period
- 10 “virtual” I/O processors provide thread-level parallelism
- Each virtual processor executes one instruction every 1000ns (= 1 microsecond)
- Minimize processor state with an accumulator-based instruction set

Simple Multithreaded Pipeline
[Figure: pipeline datapath with per-thread PCs and register files, a thread-select mux before the I$, and the thread ID carried alongside the instruction through execution to the D$.]
- Additional state: one copy of architected state per thread (e.g., PC, GPRs)
- Thread select: round-robin logic; propagate the thread ID down the pipeline to access the correct state (e.g., GPR1 versus GPR2)
- Software (e.g., the OS) perceives multiple, slower CPUs
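A minimal sketch (hypothetical names) of the extra state this pipeline adds: one architected PC and register file per thread, a round-robin thread select in fetch, and a thread ID that travels with each instruction.

    class ThreadState:
        """Architected state replicated per thread: PC and general-purpose registers."""
        def __init__(self):
            self.pc = 0
            self.gpr = [0] * 32          # GPR1 / GPR2 in the slide's datapath

    threads = [ThreadState(), ThreadState()]

    def fetch(cycle):
        tid = cycle % len(threads)       # round-robin thread select
        pc = threads[tid].pc
        threads[tid].pc += 4             # next sequential instruction for that thread
        return tid, pc                   # thread ID propagates down the pipeline

    print(fetch(0), fetch(1), fetch(2))  # fetch alternates: thread 0, thread 1, thread 0, ...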

Multithreading Costs
User State (per thread)
- Program counters
- Physical register files
System State (per thread)
- Page table base register
- Exception handling registers
Overheads and Contention
- Threads contend for shared caches and TLB
- Alternatively, threads require additional cache and TLB capacity
- A scheduler (OS or hardware) manages threads

Fine-Grain Multithreading
Switch threads at instruction granularity
Fixed Interleave (CDC 6600 PPUs, 1964)
- PPU: peripheral processing unit
- Given N threads, each thread executes one instruction every N cycles
- Insert a pipeline bubble (e.g., NOP) if a thread is not ready for its slot
Software-controlled Interleave (TI ASC PPUs, 1971)
- OS explicitly controls thread interleaving
- Example: one thread scheduled twice as often as the others (the blue versus orange and purple threads in the slide figure)
Why was thread scheduling introduced for peripheral processing units first?
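A sketch of fixed interleave under an assumed ready() predicate (hypothetical, not CDC's actual logic): thread i owns every N-th slot, and an unready thread's slot becomes a bubble rather than being given to another thread.

    def fixed_interleave(ready, n_cycles, n_threads):
        """ready(tid, cycle) -> bool. Returns the per-cycle issue slots."""
        slots = []
        for cycle in range(n_cycles):
            tid = cycle % n_threads                    # slot ownership is fixed
            slots.append("T%d" % tid if ready(tid, cycle) else "NOP")
        return slots

    # Example: thread 2 is not ready on cycles 2 and 6, so its slots become bubbles.
    print(fixed_interleave(lambda t, c: not (t == 2 and c in (2, 6)), 8, 4))
    # ['T0', 'T1', 'NOP', 'T3', 'T0', 'T1', 'NOP', 'T3']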

Denelcor HEP (1982)
- First commercial hardware multithreading for a main CPU
- Architected by Burton Smith
- Multithreading previously used to hide long I/O and memory latencies in PPUs
- Up to 8 processors, 10 MHz clock
- 120 threads per processor
- Precursor to the Tera MTA

Tera MTA (1990)
- Up to 256 processors
- Up to 128 threads per processor
- Processors and memories communicate via a 3D torus interconnect; nodes linked to their nearest 6 neighbors
- Main memory is flat and shared
- Flat memory means no data cache
- Memory sustains 1 memory access per cycle per processor
- Why does this make sense for a multithreaded machine?

MIT Alewife (1990)
Anant Agarwal at MIT
SPARC
- RISC instruction set architecture from Sun Microsystems
- Alewife modifies the SPARC processor
Multithreading
- Up to 4 threads per processor
- Threads switch on a cache miss

Coarse-Grain Multithreading
Switch threads on a long-latency operation
Tera MTA designed for supercomputing applications with large data sets and little locality
- Little locality means no data cache
- Many parallel threads needed to hide long memory latency
- If one thread accesses memory, schedule another thread in its place
Other applications may be more cache friendly
- Good locality means data cache hits
- Provide a small number of threads to hide the cache miss penalty
- If one thread misses in the cache, schedule another thread in its place
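A minimal sketch (assumed model) of the switch-on-miss policy: run one thread until it misses in the cache, then hand the pipeline to the next thread.

    def coarse_grain(outcomes, n_steps):
        """outcomes[tid] is an iterator of 'hit'/'miss' per executed instruction.
        Returns which thread issued at each step under switch-on-miss."""
        current, trace = 0, []
        for _ in range(n_steps):
            trace.append(current)
            if next(outcomes[current]) == "miss":
                current = (current + 1) % len(outcomes)   # switch on the miss
        return trace

    # Example: thread 0 misses on its 3rd instruction, thread 1 keeps hitting.
    t0 = iter(["hit", "hit", "miss"] + ["hit"] * 10)
    t1 = iter(["hit"] * 10)
    print(coarse_grain([t0, t1], 6))   # [0, 0, 0, 1, 1, 1]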

Multithreading & “Simple” Cores
IBM PowerPC RS64-IV (2000)
- RISC instruction set architecture from IBM
- Implements an in-order, quad-issue, 5-stage pipeline
- Up to 2 threads per processor
Oracle/Sun Niagara Processors (2004-2009)
- Target datacenter web and database servers
- SPARC instruction set architecture from Sun
- Implement simple, in-order cores
- Niagara-1 (2004): 8 cores, 4 threads/core
- Niagara-2 (2007): 8 cores, 8 threads/core
- Niagara-3 (2009): 16 cores, 8 threads/core

Why Simple Cores?
Switch threads on a cache miss
- If a thread accesses memory, flush the pipeline and switch to another thread
Simple Core Advantages
- Minimize the flush penalty with a short pipeline (4 cycles in a 5-stage pipeline)
- Reduce energy cost per operation (out-of-order execution consumes energy)
Simple Core Trade-off
- Lower single-thread performance
- Higher multi-thread performance and energy efficiency

Oracle/Sun Niagara-3 (2009)

Superscalar (In)Efficiency
[Figure: issue slots (issue width 4) over time. Completely idle cycles are vertical waste; partially filled cycles (IPC < 4) are horizontal waste.]

Vertical Multithreading
[Figure: issue slots over time with a second thread interleaved cycle-by-cycle; partially filled cycles (IPC < 4) remain.]
- Vertical multithreading reduces vertical waste with cycle-by-cycle interleaving
- However, horizontal waste remains

Chip Multiprocessing (CMP)
[Figure: issue slots of simpler, narrower cores over time; completely idle cycles (vertical waste) and partially filled cycles (IPC < 4, horizontal waste) are both marked.]
- Chip multiprocessing reduces horizontal waste with simple (narrower) cores
- However, vertical waste remains, and per-core ILP is bounded

Simultaneous Multithreading (SMT)
[Figure: issue slots over time with instructions from multiple threads filling slots in the same cycle.]
- Interleave instructions from multiple threads with no restrictions
- Tullsen, Eggers, and Levy, University of Washington, 1995
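A minimal sketch (assumed model) of what “no restrictions” means: a single issue group can mix ready instructions from any thread until the issue width is full, attacking horizontal and vertical waste at the same time.

    ISSUE_WIDTH = 4

    def smt_issue(ready):
        """ready: {tid: list of ready instructions}. Builds one cycle's issue group."""
        group = []
        for tid, queue in ready.items():
            while queue and len(group) < ISSUE_WIDTH:
                group.append((tid, queue.pop(0)))
        return group

    # Example: thread 0 has one ready instruction and thread 1 has three,
    # so the 4-wide issue slot is completely filled (no horizontal waste).
    print(smt_issue({0: ["lw"], 1: ["add", "mul", "sw"]}))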

IBM Power4 -> IBM Power5
Power4
- Single-threaded processor
- Out-of-order execution, superscalar (8 execution units)

IBM Power4 -> IBM Power5
- Power4: single-threaded processor, 8 execution units
- Power5: dual-threaded processor, 8 execution units

IBM Power5 Data Flow
- Program counter: duplicated; alternate fetch between the two threads
- Return stack: duplicated, since different threads call different subroutines
- Instruction buffer: queue instructions separately per thread
- Decode: dequeue instructions depending on thread priority
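A rough sketch of the front-end flow described above (a hypothetical structure, not IBM's implementation): alternate fetch between the two threads' PCs into separate instruction buffers, then let decode drain the buffers according to thread priority.

    from collections import deque

    buffers = {0: deque(), 1: deque()}       # separate instruction buffer per thread
    priority = {0: 2, 1: 1}                  # larger value = decoded first (assumed scheme)

    def fetch(cycle, fetch_group):
        tid = cycle % 2                      # alternate fetch between the two PCs
        buffers[tid].extend(fetch_group(tid))

    def decode(width):
        group = []
        for tid in sorted(buffers, key=lambda t: -priority[t]):
            while buffers[tid] and len(group) < width:
                group.append((tid, buffers[tid].popleft()))
        return group

    fetch(0, lambda tid: ["i%d_a" % tid, "i%d_b" % tid])
    fetch(1, lambda tid: ["i%d_a" % tid, "i%d_b" % tid])
    print(decode(3))    # the higher-priority thread's buffer drains first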

IBM Power4 -> IBM Power5
Support multiple instruction streams
- Increase L1 instruction cache associativity
- Increase instruction TLB associativity
- Mitigate cache contention between instruction streams
- Separate instruction prefetch and buffering per thread
- Allow thread prioritization
Support more instructions in flight
- Increase the number of physical registers (e.g., 152 to 240)
- Mitigate the register renaming bottleneck
Support larger cache footprints
- Increase L2 cache size (e.g., 1.44MB to 1.92MB)
- Increase L3 cache size
- Mitigate cache contention between data footprints
Power5 core is 24% larger than the Power4 core

Instruction Scheduling
ICOUNT: schedule the thread with the fewest instructions in flight
(1) Prevents one thread from filling the issue queue
(2) Prioritizes threads that move instructions through the datapath efficiently
(3) Provides an even mix of threads, maximizing parallelism
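A minimal sketch of the ICOUNT heuristic as stated on this slide (the counts below are hypothetical):

    def icount_pick(in_flight):
        """in_flight: {tid: number of instructions between fetch and issue}.
        ICOUNT fetches from the thread with the fewest instructions in flight,
        which keeps any single thread from clogging the issue queue and keeps
        the thread mix even."""
        return min(in_flight, key=in_flight.get)

    print(icount_pick({0: 12, 1: 7, 2: 3}))   # thread 2 fetches next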

SMT Performance
Intel Pentium 4 Extreme SMT
- Single program: 1.01x speedup for SPECint and 1.07x speedup for SPECfp
- Multiprogram (pairs of SPEC workloads): 0.90-1.60x for various pairs
IBM Power5
- 1.23x speedup for SPECint and 1.16x speedup for SPECfp
- 0.89-1.41x for various pairs
Intuition
- SPECint has complex control flow that idles the processor, so it benefits from SMT
- SPECfp has large data sets and cache conflicts, so it benefits less from SMT

SMT Flexibility
- For software regions with high thread-level parallelism (TLP), share the machine width across all threads
- For software regions with low thread-level parallelism (TLP), reserve the machine width for single-thread instruction-level parallelism (ILP)
[Figure: two issue-slot diagrams (issue width versus time), one for each regime.]

Types of Multithreading
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading, with slots colored by thread (threads 1-5) and idle slots left blank.]

Summary
Out-of-order Superscalar
- Processor pipeline is under-utilized due to data dependencies
Thread-level Parallelism
- Independent threads use processor resources more fully
Multithreading
- Reduce vertical waste by scheduling threads to hide long-latency operations (e.g., cache misses)
- Reduce horizontal waste by scheduling threads to more fully use the superscalar issue width