CALTECH cs184c Spring2001 -- DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 7: April 24, 2001 Threaded Abstract Machine (TAM) Simultaneous.

Presentation transcript:

CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 7: April 24, 2001 Threaded Abstract Machine (TAM) Simultaneous Multi-Threading (SMT)

Reading
  Shared Memory
    – Focus: H&P Ch 8 (at least read this…)
    – Retrospectives (valuable and short)
    – ISCA papers (good primary sources)

Today
  TAM
  SMT

Threaded Abstract Machine

TAM
  Parallel Assembly Language
  Fine-Grained Threading
  Hybrid Dataflow
  Scheduling Hierarchy

TL0 Model
  Activation Frame (like stack frame)
    – Variables
    – Synchronization
    – Thread stack (continuation vectors)
  Heap Storage
    – I-structures

TL0 Ops
  RISC-like ALU Ops
  FORK
  SWITCH
  STOP
  POST
  FALLOC
  FFREE
  SWAP

Scheduling Hierarchy
  Intra-frame
    – Related threads in same frame
    – Frame runs on single processor
    – Schedule together, exploit locality (cache, maybe regs)
  Inter-frame
    – Only swap when work in current frame is exhausted

Intra-Frame Scheduling
  Simple (local) stack of pending threads
  FORK places new PC on stack
  STOP pops next PC off stack
  Stack initialized with code to exit activation frame
    – Including schedule next frame
    – Save live registers
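
The stack discipline on this slide can be sketched in a few lines of Python. This is only an illustration, not TL0 code: the names `Frame`, `fork`, `run`, `exit_frame`, `t_root`, and `t_leaf` are hypothetical, and a thread's "PC" is modeled as a Python function.

```python
class Frame:
    """Toy activation frame with a local stack of pending thread entry points."""

    def __init__(self, exit_thread):
        # The stack starts with the frame-exit code at the bottom, so the
        # frame is left only after every other pending thread has run.
        self.pending = [exit_thread]
        self.trace = []

    def fork(self, thread):
        # FORK: place the new thread's entry point ("PC") on the stack.
        self.pending.append(thread)

    def run(self, first_thread):
        thread = first_thread
        while thread is not None:
            self.trace.append(thread.__name__)
            thread(self)  # the thread body runs until its STOP
            # STOP: pop the next PC off the stack; an empty stack ends the frame.
            thread = self.pending.pop() if self.pending else None


def exit_frame(frame):
    # Stand-in for "save live registers, schedule next frame".
    pass


def t_leaf(frame):
    pass


def t_root(frame):
    frame.fork(t_leaf)
    frame.fork(t_leaf)


frame = Frame(exit_frame)
frame.run(t_root)
print(frame.trace)
```

Because the exit code sits at the bottom of the stack, it runs last, after both forked threads have finished.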

TL0/CM5 Intra-frame
  Fork on thread
    – Fall through: 0 inst
    – Unsynch branch: 3 inst
    – Successful synch: 4 inst
    – Unsuccessful synch: 8 inst
  Push thread onto LCV (local continuation vector): 3-6 inst
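
The difference between a successful and an unsuccessful synchronizing fork can be sketched as follows. This assumes the usual TAM convention that a synchronizing thread has an entry count and becomes runnable only when the count reaches zero. The class `SyncFrame` and its fields are hypothetical names, and the sketch says nothing about the instruction counts listed above.

```python
class SyncFrame:
    """Toy frame with per-thread entry counters and a continuation vector."""

    def __init__(self, entry_counts):
        # thread name -> number of forks still needed before it may run
        self.counters = dict(entry_counts)
        self.lcv = []  # enabled threads waiting to run

    def fork(self, thread):
        """Return True if the thread became runnable (successful synch)."""
        # Threads without a counter are unsynchronized: one fork enables them.
        remaining = self.counters.get(thread, 1) - 1
        if remaining > 0:
            self.counters[thread] = remaining  # unsuccessful synch: keep waiting
            return False
        self.counters.pop(thread, None)
        self.lcv.append(thread)  # enabled: push onto the continuation vector
        return True


f = SyncFrame({"join": 2})
first = f.fork("join")    # first arrival only decrements the counter
second = f.fork("join")   # second arrival enables the thread
plain = f.fork("plain")   # unsynchronized thread is enabled immediately
print(first, second, plain, f.lcv)
```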

Fib Example
  [look at how this turns into TL0 code]

Multiprocessor Parallelism
  Comes from frame allocations
  Runtime policy decides where to allocate frames
    – Maybe use work stealing?

Frame Scheduling
  Inlets to non-active frames initiate pending thread stack (RCV)
  First inlet may place frame on processor’s runnable frame queue
  SWAP instruction picks next frame and branches to its enter thread
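
A minimal sketch of this inter-frame mechanism, under simplifying assumptions: the first inlet always enqueues the frame (the slide says only that it may), the runnable queue is FIFO, and posts arrive only for frames that are not running. `Frame`, `Processor`, `post`, and `swap` are hypothetical names, not part of TL0.

```python
from collections import deque


class Frame:
    def __init__(self, name):
        self.name = name
        self.rcv = []  # threads posted while the frame was not running


class Processor:
    def __init__(self):
        self.runnable = deque()  # frames that have pending threads
        self.current = None      # frame currently running, if any

    def post(self, frame, thread):
        """Inlet for a non-active frame: record the enabled thread."""
        first = not frame.rcv
        frame.rcv.append(thread)
        if first:
            # Assumption: the first inlet always makes the frame runnable.
            self.runnable.append(frame)

    def swap(self):
        """SWAP: pick the next runnable frame and hand back its pending threads."""
        if not self.runnable:
            self.current = None
            return None, []
        self.current = self.runnable.popleft()
        pending, self.current.rcv = self.current.rcv, []
        return self.current, pending


p = Processor()
a, b = Frame("a"), Frame("b")
p.post(a, "t1")
p.post(a, "t2")
p.post(b, "t3")
frame, threads = p.swap()
print(frame.name, threads)
```

Only the first post to a frame touches the processor's queue; later posts just accumulate in the frame, which is what lets a frame run many threads per swap.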

CM5 Frame Scheduling Costs
  Inlet posts on non-running thread
    – 10-15 instructions
  Swap to next frame
    – 14 instructions
  Average thread cost 7 cycles
    – Constitutes 15-30% of TL0 instr

Instruction Mix
  [Culler et al., JPDC, July 1993]

Cycle Breakdown
  [Culler et al., JPDC, July 1993]

Speedup Example
  [Culler et al., JPDC, July 1993]

Thread Stats
  Thread lengths 3–17
  Threads run per “quantum” 7–530
  [Culler et al., JPDC, July 1993]

Great Project
  Develop optimized μArch for TAM
    – Hardware support/architecture for single-cycle thread-switch/post

Multithreaded Architectures

Problem
  Long latency of operations
    – Non-local memory fetch
    – Long latency operations (mpy, fp)
  Wastes processor cycles while stalled
  If processor stalls on return
    – Latency problem turns into a throughput (utilization) problem
    – CPU sits idle

Idea
  Run something else useful while stalled
  In particular, another thread
    – Another PC
  Again, use parallelism to “tolerate” latency

HEP/μUnity/Tera
  Provide a number of contexts
    – Copies of register file…
  Number of contexts matched to operation latency
    – Pipeline depth
    – Roundtrip time to main memory
  Run each round-robin
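
To see why the context count is tied to latency, here is a small simulation of strict round-robin issue. It is a sketch under a deliberately pessimistic assumption: every instruction in a thread depends on the previous one, so a context cannot issue again until `latency` cycles after its last issue. The function name `utilization` and its parameters are hypothetical.

```python
def utilization(num_contexts, latency, cycles=1200):
    """Fraction of issue slots filled under strict round-robin interleaving."""
    ready_at = [0] * num_contexts  # cycle at which each context may issue again
    issued = 0
    for cycle in range(cycles):
        ctx = cycle % num_contexts  # each cycle's slot belongs to one context
        if ready_at[ctx] <= cycle:
            issued += 1
            ready_at[ctx] = cycle + latency  # next instruction waits for this one
        # Otherwise the slot goes empty: strict interleaving does not skip ahead.
    return issued / cycles


print(utilization(1, 8), utilization(4, 8), utilization(8, 8))
```

With one context and a latency of 8 cycles, only one slot in eight is used. Four contexts fill half the slots, and eight contexts fill all of them, at the price of each thread issuing only once every eight cycles.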

HEP Pipeline
  [figure: Arvind+Iannucci, DFVLR’87]

Strict Interleaved Threading
  Uses parallelism to get throughput
  Potentially poor single-threaded performance
    – Increases end-to-end latency of thread

SMT

Can we do both?
  Issue from multiple threads into pipeline
  No worse than (super)scalar on single thread
  More throughput with multiple threads
    – Fill in what would have been empty issue slots with instructions from different threads

SuperScalar Inefficiency
  [figure label: Unused Slot]
  Recall: limited Scalar IPC

SMT Promise
  Fill in empty slots with other threads
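
The slot-filling idea can be illustrated with a one-cycle toy model. It assumes each thread reports how many independent instructions it has ready this cycle, and that threads are served in index order. That priority order, and the name `issue_cycle`, are assumptions for illustration only.

```python
def issue_cycle(ready_per_thread, width):
    """Fill up to `width` issue slots, taking ready instructions thread by thread.

    Returns a list with one thread id per filled slot.
    """
    slots = []
    for tid, ready in enumerate(ready_per_thread):
        take = min(ready, width - len(slots))
        slots.extend([tid] * take)
    return slots


superscalar = issue_cycle([2], width=8)       # a single thread offers 2 instructions
smt = issue_cycle([2, 1, 3, 0], width=8)      # other threads fill the leftover slots
print(len(superscalar), len(smt))
```

In this example the single-threaded machine leaves six of its eight slots empty in the cycle, while the multithreaded one leaves two.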

SMT Estimates (ideal)
  [Tullsen et al., ISCA ’95]

SMT Estimates (ideal)
  [Tullsen et al., ISCA ’95]

SMT uArch
  Observation: exploit register renaming
    – Requires only small modifications to existing superscalar architecture

Stopped Here 4/24/01

SMT uArch
  N.B. remarkable thing is how similar superscalar core is
  [Tullsen et al., ISCA ’96]

SMT uArch
  Changes:
    – Multiple PCs
    – Control to decide which thread to fetch from
    – Separate return stacks per thread
    – Per-thread reorder/commit/flush/trap
    – Thread id w/ BTB
    – Larger register file (more things outstanding)

Performance
  [Tullsen et al., ISCA ’96]

Optimizing: fetch freedom
  RR = Round Robin
  RR.X.Y
    – X: number of threads that fetch in a cycle
    – Y: instructions fetched per thread
  [Tullsen et al., ISCA ’96]
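
A small sketch of how an RR.X.Y scheme might choose fetchers each cycle. The way the starting thread rotates (advancing by X every cycle) is an assumption made for this example rather than a detail taken from the paper, and `rr_fetch` is a hypothetical name. It assumes X is no larger than the number of threads.

```python
def rr_fetch(cycle, num_threads, x, y):
    """Return (thread id, max instructions) pairs allowed to fetch this cycle."""
    start = (cycle * x) % num_threads  # rotate the starting thread each cycle
    return [((start + i) % num_threads, y) for i in range(x)]


# RR.2.4 on 8 threads: two threads fetch per cycle, up to four instructions each.
print(rr_fetch(0, 8, 2, 4))
print(rr_fetch(1, 8, 2, 4))
# RR.1.8: one thread fetches up to eight instructions.
print(rr_fetch(3, 8, 1, 8))
```

Both configurations shown offer the same total fetch bandwidth of eight instructions per cycle. They differ in how that bandwidth is divided among threads.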

Optimizing: Fetch Alg.
  ICOUNT – priority to thread w/ fewest pending instrs
  BRCOUNT
  MISSCOUNT
  IQPOSN – penalize threads w/ old instrs (at front of queues)
  [Tullsen et al., ISCA ’96]
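
The ICOUNT policy as described on this slide can be sketched as a simple selection function. It assumes the hardware keeps one pending-instruction count per thread. Breaking ties by lower thread id is an assumption for this example, and `icount_pick` is a hypothetical name.

```python
def icount_pick(pending, x):
    """Pick the x thread ids with the fewest pending (not yet issued) instructions."""
    order = sorted(range(len(pending)), key=lambda tid: (pending[tid], tid))
    return order[:x]


pending = [12, 3, 7, 3]  # pending-instruction count per thread
chosen = icount_pick(pending, 2)
print(chosen)
```

Threads 1 and 3 are chosen here because they have the fewest instructions waiting. Thread 0, which has accumulated the most pending instructions, is passed over.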

Throughput Improvement
  8-issue superscalar
    – Achieves little over 2 instructions per cycle
  Optimized SMT
    – Achieves 5.4 instructions per cycle on 8 threads
  2.5x throughput increase

Costs
  [Burns+Gaudiot HPCA’99]

Costs
  [Burns+Gaudiot HPCA’99]

Not Done, yet…
  Conventional SMT formulation is for coarse-grained threads
  Combine SMT w/ TAM?
    – Fill pipeline from multiple runnable threads in activation frame
    – ?multiple activation frames?
    – Eliminate thread switch overhead?

Thought?
  Does SMT reduce the need for split-phase operations?

Big Ideas
  Primitives
    – Parallel Assembly Language
    – Threads for control
    – Synchronization (post, full-empty)
  Latency Hiding
    – Threads, split-phase operation
  Exploit Locality
    – Create locality
    – Scheduling quanta