Microprocessor Microarchitecture: Multithreading
Lynn Choi
School of Electrical Engineering

Limitations of Superscalar Processors
- Hardware complexity of wide-issue processors (see the rough scaling sketch below)
  - Limited instruction fetch bandwidth
    - Taken branches and branch-prediction throughput
  - Quadratic (or worse) increase in the hardware complexity of
    - Renaming logic
    - Wakeup and selection logic
    - Bypass logic
    - Register file access time
- On-chip wire delays prevent centralized shared resources
  - End-to-end on-chip wire delay grows rapidly, from 2-3 clock cycles at 0.25 µm to about 20 clock cycles in sub-0.1 µm technology
  - This rules out large centralized shared resources
- Limitations of available ILP
  - Even with aggressive wide-issue implementations, the amount of exploitable ILP is less than 5-6 instructions per cycle
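
The quadratic-complexity claim can be made concrete with a rough scaling model. The sketch below is illustrative only: the counting rules, the 64-entry window, the two-source-operand assumption, and the single forwarding stage are my assumptions, not figures from the slides. It simply shows how bypass paths and wakeup comparators grow as issue width increases.

    # Rough, illustrative scaling model of wide-issue hardware complexity.
    # Assumptions (not from the slides): 2 source operands per instruction,
    # a 64-entry instruction window, 1 result-forwarding stage.

    def bypass_paths(issue_width, result_stages=1):
        # Each of issue_width producers can forward to 2 source operands of
        # each of issue_width consumers, per forwarding stage.
        return result_stages * issue_width * issue_width * 2

    def wakeup_comparators(window_size, issue_width):
        # Each window entry compares its 2 source tags against every result
        # tag broadcast per cycle (one per issue slot).
        return window_size * 2 * issue_width

    for w in (2, 4, 8):
        print(f"width {w}: {bypass_paths(w)} bypass paths, "
              f"{wakeup_comparators(64, w)} wakeup comparators")

Doubling the issue width quadruples the bypass paths in this model, which is the kind of growth that makes centralized wide-issue structures hard to scale.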

Today’s Microprocessor
- The CPU of 2013, looking back to 2001, according to Moore’s law (see the arithmetic sketch below)
  - 256X increase in the number of transistors
  - 256X performance improvement? However:
    - A wider issue rate increases the clock cycle time
    - Applications contain only a limited amount of ILP
    - Diminishing returns in performance and resource utilization
- Intel i7 processor
  - Technology
    - 32 nm process, 130 W, 239 mm² die, 1.17B transistors
    - 3.46 GHz, 64-bit, 6-core, 12-thread processor
    - 159 SPECint, 103 SPECfp on SPEC CPU 2006 (296 MHz UltraSPARC II as the reference machine)
    - 14-stage, 4-issue out-of-order (OOO) pipeline optimized for multicore operation and low power consumption
    - 64-bit Intel architecture (x86-64)
    - 256 KB L2 cache per core, 12 MB L3 cache
- Goals
  - Scalable performance and more efficient resource utilization
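
As a quick sanity check on the 256X figure, the arithmetic below assumes an 18-month transistor-count doubling period; the doubling period is my assumption, the slide only states the ratio.

    # Back-of-the-envelope check of "256X in 12 years" (2001 -> 2013),
    # assuming transistor count doubles every 18 months.
    years = 2013 - 2001
    doublings = years / 1.5
    print(2 ** doublings)   # 256.0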

Approaches
- MP (multiprocessor) approach
  - Decentralize all resources
  - Multiprocessing on a single chip
    - Communicate through shared memory: Stanford Hydra
    - Communicate through messages: MIT RAW
- MT (multithreaded) approach
  - More tightly coupled than MP
  - Dependent threads vs. independent threads
    - Dependent threads require HW for inter-thread synchronization and communication
      - Examples: Multiscalar (U. of Wisconsin), Superthreading (U. of Minnesota), DMT, Trace Processor
    - Independent threads: fine-grain multithreading, SMT
  - Centralized vs. decentralized architectures
    - Decentralized multithreaded architectures: each thread has a separate pipeline
      - Multiscalar, Superthreading
    - Centralized multithreaded architectures: share pipelines among multiple threads
      - TERA, SMT (throughput-oriented), Trace Processor, DMT (performance-oriented)

MT Approach
- Multithreading of independent threads
  - No inter-thread dependency checking and no inter-thread communication
  - Threads can be generated from
    - A single program (parallelizing compiler)
    - Multiple programs (multiprogramming workloads)
  - Fine-grain multithreading (see the thread-switch sketch below)
    - Only a single thread is active at a time
    - Switch threads on a long-latency operation (cache miss, stall): MIT APRIL, Elementary Multithreading (Japan)
    - Switch threads every cycle: TERA, HEP
  - Simultaneous multithreading (SMT)
    - Multiple threads are active at a time
    - Issue from multiple threads each cycle
- Multithreading of dependent threads
  - Not adopted by commercial processors due to complexity and only marginal performance gain
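
The two fine-grain switch policies can be contrasted with a tiny scheduling sketch. This is my own illustration: the Thread class, the random miss rate, and the round-robin order are assumptions, not a model of any of the machines named above.

    # Toy illustration of two fine-grain multithreading switch policies:
    # switch only when the current thread stalls vs. switch every cycle.
    import random

    class Thread:
        def __init__(self, miss_rate=0.2):
            self.miss_rate = miss_rate
        def stalled(self):
            # Pretend a long-latency event (e.g., a cache miss) occurs randomly.
            return random.random() < self.miss_rate

    def switch_on_event(threads, cycles):
        """Stay on one thread until it hits a long-latency event."""
        schedule, current = [], 0
        for _ in range(cycles):
            if threads[current].stalled():
                current = (current + 1) % len(threads)
            schedule.append(current)
        return schedule

    def switch_every_cycle(num_threads, cycles):
        """Round-robin a different thread into the pipeline each cycle (TERA/HEP style)."""
        return [c % num_threads for c in range(cycles)]

    threads = [Thread() for _ in range(4)]
    print(switch_on_event(threads, 12))
    print(switch_every_cycle(4, 12))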

SMT (Simultaneous Multithreading)
- Motivation
  - Existing multiple-issue superscalar architectures do not utilize resources efficiently
    - Intel Pentium III, DEC Alpha 21264, PowerPC, MIPS R10000
  - They exhibit both horizontal and vertical pipeline waste

SMT Motivation
- Fine-grain multithreading
  - HEP, Tera, MASA, MIT Alewife
  - Fast context switching among multiple independent threads
    - Switch threads on cache-miss stalls: Alewife
    - Switch threads every cycle: Tera, HEP
  - Targets vertical waste only
    - In any given cycle, instructions issue from only a single thread
- Single-chip MP
  - Coarse-grain parallelism among independent threads, each running on a different processor
  - Still exhibits both vertical and horizontal waste in each individual processor pipeline

SMT Idea
- Idea
  - Interleave multiple independent threads into the pipeline every cycle
  - Eliminate both horizontal and vertical pipeline bubbles
  - Increase processor utilization
- Requires added hardware resources (see the resource sketch below)
  - Each thread needs its own PC, register file, and instruction retirement and exception mechanism
  - What about branch predictors? (RSB, BTB, BPT)
  - Multithreaded scheduling of instruction fetch and issue
  - More complex and larger shared cache structures (I/D caches)
  - Functional units and instruction windows are shared
  - What about the instruction pipeline?
- Can be applied to MP architectures
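
A compact way to see the hardware cost is to separate per-context (replicated) state from shared structures. The sketch below follows the bullet list above; all class and field names, sizes, and the 4-context example are my own placeholders, not a description of a particular design.

    # Sketch of replicated vs. shared structures in an SMT core.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class HardwareContext:                 # replicated once per thread
        pc: int = 0
        arch_regs: List[int] = field(default_factory=lambda: [0] * 32)
        return_stack: List[int] = field(default_factory=list)   # per-thread RSB
        retire_state: dict = field(default_factory=dict)        # retirement/exception bookkeeping

    @dataclass
    class SmtCore:                         # shared among all contexts
        contexts: List[HardwareContext]
        issue_width: int = 8
        instruction_window: List[tuple] = field(default_factory=list)
        # Functional units, I/D caches, BTB and branch-prediction tables are
        # also shared (and therefore need to be larger / thread-aware).

    core = SmtCore(contexts=[HardwareContext() for _ in range(4)])
    print(f"{len(core.contexts)} hardware contexts sharing one {core.issue_width}-wide pipeline")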

Multithreading of Independent Threads
- Figure: comparison of pipeline issue slots in three different architectures: superscalar, fine-grained multithreading, and simultaneous multithreading (see the toy waste-accounting model below)
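
The issue-slot figure itself is not reproduced in this transcript, but its point can be captured with a toy waste-accounting model. Everything below (the 8-slot width, the made-up per-cycle issue demands, and the simplistic way fine-grain MT and SMT combine the two threads) is my own construction, meant only to show where horizontal and vertical waste appear.

    # Toy accounting of busy slots, horizontal waste and vertical waste on an
    # 8-wide machine, for superscalar, fine-grain MT, and SMT.
    WIDTH = 8

    def classify(used_per_cycle):
        busy = sum(used_per_cycle)
        vertical = sum(WIDTH for u in used_per_cycle if u == 0)              # fully empty cycles
        horizontal = sum(WIDTH - u for u in used_per_cycle if 0 < u < WIDTH)  # partly filled cycles
        return busy, horizontal, vertical

    # Made-up per-cycle issue demand of two threads (0 = stalled).
    t0 = [3, 0, 2, 8, 0, 1]
    t1 = [2, 4, 1, 3, 5, 2]

    superscalar = classify(t0)                                           # runs t0 alone
    fine_grain  = classify([a if a > 0 else b for a, b in zip(t0, t1)])  # one thread per cycle
    smt         = classify([min(WIDTH, a + b) for a, b in zip(t0, t1)])  # fill slots from both

    for name, (busy, horiz, vert) in [("superscalar", superscalar),
                                      ("fine-grain MT", fine_grain),
                                      ("SMT", smt)]:
        print(f"{name:14s} busy={busy:2d}  horizontal waste={horiz:2d}  vertical waste={vert:2d}")

Fine-grain multithreading removes the vertical waste (empty cycles) but leaves the horizontal waste; SMT attacks both.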

Experimentation
- Simulation
  - Based on the Alpha, with the following differences:
    - Augmented for wider-issue superscalar and SMT
    - Larger on-chip L1 and L2 caches
    - Multiple hardware contexts for SMT
    - 2K-entry bimodal predictor, 12-entry RSB
  - SPEC92 benchmarks, compiled with the Multiflow trace-scheduling compiler
  - No extra pipeline stage is assumed for SMT; an extra stage would have less than 5% impact, due to the increased (1 extra cycle) misprediction penalty
- SMT scheduling (see the priority-scheduling sketch below)
  - Context 0 can schedule onto any unit; context 1 can schedule onto any unit not utilized by context 0, and so on
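
The priority rule in the last bullet amounts to a greedy, priority-ordered slot allocation. The sketch below is a deliberate simplification: treating the machine as a single pool of 8 uniform issue slots is my assumption; the simulated machine distinguishes functional-unit types.

    # Greedy, priority-ordered issue-slot allocation: context 0 takes what it
    # needs first, context 1 gets only the leftovers, and so on.
    def schedule(ready_counts, num_slots=8):
        """ready_counts[i] = ready instructions in context i (priority order)."""
        free, granted = num_slots, []
        for want in ready_counts:
            got = min(want, free)
            granted.append(got)
            free -= got
        return granted

    print(schedule([5, 4, 3]))   # -> [5, 3, 0]: lower-priority contexts get leftovers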

Where Do the Wastes Come From?
- Execution-time distribution of an 8-issue superscalar processor
  - Only 19% of issue slots are busy (~1.5 IPC; see the arithmetic below)
  - The main sources of waste, ranked: (1) short FP dependences (37%), (2) D-cache misses, (3) long FP dependences, (4) load delays, (5) short integer dependences, (6) DTLB misses, (7) branch mispredictions; together these occupy about 60%
  - 61% of the wasted cycles are vertical waste; 39% are horizontal waste
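
The ~1.5 IPC figure follows directly from the 19% utilization number, since an 8-issue machine offers 8 issue slots per cycle:

    busy_fraction, issue_width = 0.19, 8
    print(busy_fraction * issue_width)   # ~1.52 instructions per cycle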

Machine Models
- Fine-grain multithreading: one thread issues each cycle
- SMT: multiple threads issue each cycle
  - Full simultaneous issue: each thread can issue up to 8 instructions per cycle
  - Four issue: each thread can issue up to 4 instructions per cycle
  - Dual issue: each thread can issue up to 2 instructions per cycle
  - Single issue: each thread can issue 1 instruction per cycle
  - Limited connection: partition the functional units among threads (see the mapping sketch below)
    - With 8 threads and 4 integer units, each integer unit can receive instructions from 2 threads
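
The limited-connection model fixes, at design time, which threads each unit can accept instructions from. A minimal sketch of one possible static mapping follows; the particular pairing is my own example, since the transcript does not specify which threads are wired to which units.

    # One possible static thread-to-unit wiring for the "limited connection"
    # model: 8 threads, 4 integer units, each unit wired to 2 threads.
    THREADS_PER_UNIT = 2
    INT_UNITS = 4

    unit_to_threads = {
        unit: [unit * THREADS_PER_UNIT + i for i in range(THREADS_PER_UNIT)]
        for unit in range(INT_UNITS)
    }
    print(unit_to_threads)   # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}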

Performance
- Throughput saturates at about 3 IPC when bounded by vertical waste alone
- Sharing degrades performance: a 35% slowdown of the first-priority thread due to competition for resources
- Each thread need not utilize all resources; dual issue is almost as effective as full simultaneous issue

SMT vs. MP
- MP’s advantages: simpler scheduling and faster private-cache access
  - Neither is modeled in this comparison

Exercises and Discussion
- Compare SMT versus MP on a single chip in terms of cost/performance and machine scalability.
- Discuss the bottleneck in each stage of an OOO superscalar pipeline.
- What additional hardware/complexity is required for an SMT implementation?