CS184b: Computer Architecture (Abstractions and Optimizations)
Day 11: April 30, 2003
Multithreading
Caltech CS184 Spring 2003 -- DeHon

Today
Multitasking/multithreading model
Fine-grained multithreading
SMT (Simultaneous Multithreading)

Problem
Long latency of operations:
– IO or page fault
– Non-local memory fetch (main memory, L3, remote node in distributed memory)
– Long-latency operations (mpy, fp)
Wastes processor cycles while stalled
If processor stalls on the return:
– Latency problem turns into a throughput (utilization) problem
– CPU sits idle

Idea
Run something else useful while stalled
In particular, another process/thread:
– Another PC
Use parallelism to "tolerate" latency

Old Idea
Share an expensive machine among multiple users (jobs)
When one user task must wait on IO:
– Run another one
Time-multiplex the machine among users

Mandatory Concurrency
Some tasks must be run "concurrently" (interleaved) with user tasks:
– DRAM refresh
– IO (keyboard, network, …)
– Window system (xclock…)
– Autosave
– Clippy

Other Useful Concurrency
Print spooler
Web browser:
– Download images in parallel
Instant Messenger/Zephyr (Gale)
biff/xbiff/xfaces…

Multitasking
A single machine runs multiple tasks
Machine provides the same ISA/sequential semantics to each task:
– Each task believes it owns the machine
– Same as if the other tasks were running on different machines
Tasks are isolated from one another:
– Cannot affect each other functionally
– (may impact each other's performance)

Each task/process
Process: a virtualization of the CPU
Has its own unique set of state:
– PC
– Registers
– VM page table (hence memory image)

Sharing the CPU
Save/restore:
– PC / registers / page table
Virtual memory isolation
Privileged system software:
– User/system mode execution
Functionally, the task does not notice that it gave up the CPU for a period of time
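A minimal sketch of the state the last two slides describe: each process carries its own PC, registers, and page-table pointer, and a context switch is just a save of one set and a restore of another. The struct layout and function names here are illustrative assumptions, not any real kernel's.

```c
#include <stdint.h>

#define NUM_REGS 32

/* Per-process CPU state: the slides' PC, registers, and VM page table. */
typedef struct {
    uint64_t pc;                /* program counter                     */
    uint64_t regs[NUM_REGS];    /* architectural register file         */
    uint64_t page_table_base;   /* root of this process's page table   */
} cpu_context;

typedef struct {
    cpu_context ctx;            /* saved while the process is not running */
    /* ... other OS bookkeeping (id, priority, open files, ...) ...       */
} process;

/* Save the outgoing process's state, restore the incoming one's.
 * Afterwards 'next' resumes exactly where it left off, never noticing
 * that it gave up the CPU for a while. */
void context_switch(cpu_context *cpu, process *prev, process *next)
{
    prev->ctx = *cpu;   /* save PC, registers, page-table pointer          */
    *cpu = next->ctx;   /* restore; new page table means new address space */
}
```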

Threads
Thread: separate PC, but shares an address space
Has its own processor state:
– PC
– Registers
Shares:
– Memory
– VM page table
A process may have multiple threads
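A small sketch of the sharing described above, assuming POSIX threads: the two threads each get their own PC, registers, and stack, but they run in one address space, so both see the same global counter.

```c
/* Two threads in one process: each has its own PC, registers, and stack,
 * but both see the same address space, so updates to 'shared_counter' by
 * one thread are visible to the other. */
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;                 /* shared: one copy per process */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int local = 0;                             /* private: one copy per thread */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        shared_counter++;                      /* same memory for both threads */
        pthread_mutex_unlock(&lock);
        local++;
    }
    (void)arg;
    (void)local;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);   /* prints 200000 */
    return 0;
}
```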

Multitasking/Multithreading
Gives us an initial model for parallelism
So far, parallelism of unrelated tasks
Eventually, cooperating tasks:
– Have to address the concurrent memory model
– (next time)

Fine-Grained Multithreading

HEP / Unity / Tera
Provide a number of contexts:
– Separate PCs, register files, …
Number of contexts ≥ operation latency:
– Pipeline depth
– Roundtrip time to main memory
Run each context in round-robin fashion
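A rough sketch of the round-robin interleaving idea, with made-up context counts and latencies: when the number of contexts is at least the operation latency, a context's previous operation has always completed by the time that context is selected again, so the pipeline needs no interlocks or bypassing.

```c
/* Sketch of fine-grained (round-robin) multithreading: each cycle issues one
 * instruction from the next context.  With NUM_CONTEXTS >= OP_LATENCY, a
 * context's previous instruction has always finished when its turn comes
 * around again.  Counts and latency are illustrative. */
#include <stdio.h>

#define NUM_CONTEXTS 8      /* hardware thread contexts (illustrative)   */
#define OP_LATENCY   8      /* worst-case operation latency in cycles    */

typedef struct {
    unsigned pc;            /* per-context program counter               */
    int busy_until;         /* cycle when this context's last op is done */
} context;

int main(void)
{
    context ctx[NUM_CONTEXTS] = {0};

    for (int cycle = 0; cycle < 32; cycle++) {
        int t = cycle % NUM_CONTEXTS;              /* strict round-robin */
        if (cycle >= ctx[t].busy_until) {
            printf("cycle %2d: issue from context %d, pc=%u\n",
                   cycle, t, ctx[t].pc);
            ctx[t].pc++;                           /* next instruction   */
            ctx[t].busy_until = cycle + OP_LATENCY;
        } else {
            printf("cycle %2d: context %d still waiting\n", cycle, t);
        }
    }
    return 0;
}
```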

HEP Pipeline
[Figure: Arvind + Iannucci, DFVLR '87]

Strict Interleaved Threading
Uses parallelism to get throughput:
– Avoid interlocking, bypass…
– Cover memory latency
– Essentially a C-slow transformation of the processor
Potentially poor single-threaded performance:
– Increases end-to-end latency of a thread

SMT

Superscalar and Multithreading?
Do both?
Issue from multiple threads into the pipeline
No worse than (super)scalar on a single thread
More throughput with multiple threads:
– Fill in what would have been empty issue slots with instructions from different threads
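A toy sketch of the slot-filling idea, with made-up issue width and per-thread ready counts: slots a single thread would leave empty are filled with ready instructions from the other threads.

```c
/* Toy model of SMT issue: each cycle, fill up to ISSUE_WIDTH slots with
 * ready instructions drawn from any thread, instead of leaving slots empty
 * when one thread lacks enough ILP.  Numbers are illustrative. */
#include <stdio.h>

#define ISSUE_WIDTH 8
#define NUM_THREADS 4

int main(void)
{
    int ready[NUM_THREADS]  = {3, 1, 4, 2};   /* ready instrs per thread */
    int issued[NUM_THREADS] = {0};
    int slots = ISSUE_WIDTH;

    /* One thread alone: limited ILP leaves issue slots empty. */
    int single = ready[0] < ISSUE_WIDTH ? ready[0] : ISSUE_WIDTH;
    printf("single thread: %d of %d slots used\n", single, ISSUE_WIDTH);

    /* SMT: greedily fill the remaining slots from the other threads too. */
    for (int t = 0; t < NUM_THREADS && slots > 0; t++) {
        int take = ready[t] < slots ? ready[t] : slots;
        issued[t] = take;
        slots -= take;
    }
    for (int t = 0; t < NUM_THREADS; t++)
        printf("SMT: thread %d issues %d\n", t, issued[t]);
    printf("SMT: %d of %d slots used\n", ISSUE_WIDTH - slots, ISSUE_WIDTH);
    return 0;
}
```

With these counts, one thread fills 3 of the 8 slots; four threads together fill all 8.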

Superscalar Inefficiency
[Figure: issue-slot diagram; legend: "Unused Slot"]
Recall: limited scalar IPC

SMT Promise
Fill in empty slots with other threads

SMT Estimates (Ideal)
[Figure: Tullsen et al., ISCA '95]

SMT Estimates (Ideal)
[Figure: Tullsen et al., ISCA '95]

SMT uArch
Observation: exploit register renaming
– Requires only small modifications to an existing superscalar architecture

SMT uArch
[Figure: SMT microarchitecture, Tullsen et al., ISCA '96]
N.B.: the remarkable thing is how similar the superscalar core remains

Alpha: Basic Out-of-Order Pipeline
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures: PC, Icache, Register Map, Dcache, Regs; label: "Thread-blind"]
[Src: Tryggve Fossum, Compaq, 2000]

Alpha SMT Pipeline
[Figure: same pipeline stages (Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire); PC, Icache, Register Map, Dcache, Regs]
[Src: Tryggve Fossum, Compaq]

SMT uArch
Changes:
– Multiple PCs
– Control to decide which thread to fetch from
– Separate return stacks per thread
– Per-thread reorder/commit/flush/trap
– Thread ID with the BTB
– Larger register file (more things outstanding)
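An illustrative sketch of two of the per-thread additions listed above: BTB entries tagged with a thread ID, and a separate return-address stack per thread. Sizes and field names are assumptions, not the actual Alpha structures.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_THREADS 4
#define BTB_ENTRIES 512
#define RAS_DEPTH   16

/* BTB entry tagged with a thread ID so one thread's branch predictions
 * are never returned for another thread's branches. */
typedef struct {
    bool     valid;
    uint8_t  thread_id;
    uint64_t tag;        /* branch PC        */
    uint64_t target;     /* predicted target */
} btb_entry;

/* One return-address stack per hardware thread. */
typedef struct {
    uint64_t stack[RAS_DEPTH];
    unsigned top;
} return_stack;

static btb_entry    btb[BTB_ENTRIES];
static return_stack ras[NUM_THREADS];

bool btb_lookup(uint8_t tid, uint64_t pc, uint64_t *target)
{
    btb_entry *e = &btb[pc % BTB_ENTRIES];
    if (e->valid && e->thread_id == tid && e->tag == pc) {
        *target = e->target;
        return true;                 /* hit only on matching thread + PC */
    }
    return false;
}

void ras_push(uint8_t tid, uint64_t return_pc)
{
    return_stack *s = &ras[tid];
    s->stack[s->top % RAS_DEPTH] = return_pc;    /* circular push */
    s->top++;
}

uint64_t ras_pop(uint8_t tid)
{
    return_stack *s = &ras[tid];
    if (s->top > 0)
        s->top--;
    return s->stack[s->top % RAS_DEPTH];
}
```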

Performance
[Figure: Tullsen et al., ISCA '96]

Relative Performance (Alpha)
[Figure: relative multithreaded performance for Int95, FP95, Int95/FP95, and SQL programs with 1, 2, 3, and 4 threads; Src: Tryggve Fossum, Compaq]

Alpha SMT
Cost-effective multiprocessing -- increased throughput
4x architectural registers
2x performance gain with little additional cost and complexity

Optimizing: Fetch Freedom
RR = round robin
RR.X.Y:
– X: number of threads that fetch in a cycle
– Y: instructions fetched per thread
[Figure: Tullsen et al., ISCA '96]

Optimizing: Fetch Algorithm
– ICOUNT: priority to the thread with the fewest pending instructions
– BRCOUNT: priority to the thread with the fewest unresolved branches
– MISSCOUNT: priority to the thread with the fewest outstanding D-cache misses
– IQPOSN: penalize threads with old instructions (at the front of the queues)
[Figure: Tullsen et al., ISCA '96]
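A minimal sketch of the ICOUNT heuristic, with made-up occupancy numbers: fetch priority goes to the thread with the fewest instructions still sitting in the front end and issue queues, since that thread is making progress and not clogging the queues.

```c
/* Sketch of the ICOUNT fetch policy: each cycle, fetch from the thread with
 * the fewest instructions in the decode/rename/issue-queue stages.
 * Occupancy numbers are illustrative. */
#include <stdio.h>

#define NUM_THREADS 4

/* pending[t] = instructions from thread t currently in the front end / IQ */
static int icount_select(const int pending[NUM_THREADS])
{
    int best = 0;
    for (int t = 1; t < NUM_THREADS; t++)
        if (pending[t] < pending[best])
            best = t;
    return best;
}

int main(void)
{
    int pending[NUM_THREADS] = {12, 3, 7, 9};   /* made-up occupancy */
    printf("fetch this cycle from thread %d\n", icount_select(pending));
    return 0;
}
```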

Throughput Improvement
8-issue superscalar:
– Achieves a little over 2 instructions per cycle
Optimized SMT:
– Achieves 5.4 instructions per cycle on 8 threads
≈2.5x throughput increase

Costs
[Figure: Burns + Gaudiot, HPCA 2000]

Costs
[Figure: single and double cache-line fetches; Burns + Gaudiot, HPCA 2000]

Check on B5

Machines
Pull machines off the grid?
– Matrix1?
– Matrix37-39
Off for someone else: leave off and use if they finish in a timely fashion?
– Not necessary for a single-processor machine
Department grid cluster

Big Ideas
0, 1, Infinity → virtualize resources:
– Processes virtualize the CPU
Latency hiding:
– Processes, threads
– Find something else useful to do while waiting…